When I woke up last Friday, I wasn’t expecting to see 116 up and down alerts from our notification system. Not the most encouraging start to the day. Turns out, some planned but poorly communicated network core work was being done at one of our remote sites, and VMware HA kicked in and restarted everything. The on-site and on-call guys had leapt into action to disable VMware HA for the duration of the planned work, but a spirited discussion over who was to be blamed for the outage had broken out at layer eight. Wagons had been circled, accusations had been thrown, and I hadn't even had my coffee yet.
The ESX hosts in question had a single pair of NICs carrying VMkernel, Service Console, and VM Network traffic. Not ideal, to be sure, but it was the best we could do with the hardware we had. This site has no server distribution layer, so each NIC ran directly into one of two core 6509s. Teaming was configured with both adapters active, load balancing based on originating virtual port ID, failover detection set to Link Status only, and Failback enabled.
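For reference, here's a sketch of that teaming policy expressed on the command line. This uses esxcli syntax from later ESXi releases (at the time it was set through the vSphere Client), and the vSwitch and vmnic names are assumptions:

```shell
# Hypothetical sketch: the teaming policy described above, expressed with
# esxcli as found on later ESXi releases. vSwitch0, vmnic0, and vmnic1 are
# placeholder names.
# Both adapters active, route based on originating virtual port ID,
# Link Status only failure detection, Failback enabled.
esxcli network vswitch standard policy failover set \
    --vswitch-name=vSwitch0 \
    --active-uplinks=vmnic0,vmnic1 \
    --load-balancing=portid \
    --failure-detection=link \
    --failback=true
```

Note that last line: Failback enabled is a perfectly reasonable default, right up until it isn't.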
Only one core switch was taken offline, so why did an isolation response kick in? Googling and poking around VMware Communities turned up this KB article: http://kb.vmware.com/kb/1003804. My Network Guy counterpart confirmed that the ports in question did not, in fact, have Portfast enabled. This was an oversight in that site's configuration; Portfast was enabled everywhere else.
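The switch-side fix is a one-liner per port. A hedged sketch in Cisco IOS, with a placeholder interface name; since these uplinks carry Service Console, VMkernel, and VM traffic, they're presumably trunk ports, which need the `trunk` keyword:

```
interface GigabitEthernet1/1
 description ESX host uplink (placeholder name)
 ! Skip the STP listening/learning delay when the link comes up.
 ! On a trunk port, plain "spanning-tree portfast" is ignored;
 ! the trunk keyword is required.
 spanning-tree portfast trunk
```

Portfast should only go on edge ports that will never form a loop; an ESX uplink qualifies, a switch-to-switch link does not.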
This seemed to tie a neat little bow around the issue, but we were still scratching our heads over why Spanning Tree would cause a problem when a core switch went down. Spanning tree should only come into play when a port comes up, not when it goes down. A shared lightbulb went off above our heads, and we went back to check the timestamps on the alerts. The down alerts weren't issued when the switch went down; they were issued when the switch came back up.
We had failback enabled. This meant that as soon as ESX detected that the original link was available, it would redirect traffic over it. Without Portfast enabled, the link would register as up, but the port would stay blocked for 30-50 seconds while spanning tree converged. Since 30-50 seconds is longer than the default 15-second isolation response window, our VMs were powered down. The online documentation confirms this, stating that Link Status only detection "Relies solely on the link status that the network adapter provides. This option detects failures, such as cable pulls and physical switch power failures, but not configuration errors, such as a physical switch port being blocked by spanning tree or misconfigured to the wrong VLAN or cable pulls on the wrong side of a physical switch."
Going forward, we're obviously going to enable Portfast on all ESX-facing ports. We're also going to disable failback for the Service Console portgroup. And we're going to give the Network team a good rap across the knuckles with a ruler, reminding them that in a post-virtualization world, we really need to coordinate these things better. We prefer to treat HA as a safety net, disabling it when we know maintenance is going on.
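For the failback change, here's a sketch of the portgroup-level override. Again, this is esxcli syntax from later ESXi releases (at the time, it was a checkbox on the portgroup's NIC Teaming tab), and the portgroup name is simply the one from this post:

```shell
# Hypothetical sketch (later-era esxcli syntax; portgroup name from the post).
# Override the vSwitch teaming policy for just the Service Console portgroup
# so that traffic stays on the surviving NIC after a failover, rather than
# automatically failing back onto a port that may still be blocked by STP.
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name="Service Console" \
    --failback=false
```

With failback off, traffic only moves back to the original NIC when an admin says so, which neatly sidesteps the whole "link up but blocking" trap.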