Wednesday, November 28, 2012

XenServer Management and Jumbo Frames

In a word, don't do it.

Perhaps some additional background would help. :)

We maintain many XenServer pools, most of which consist of four "compute" servers attached to a shared storage array.  Each server has two Ethernet interfaces bonded for the management network, as well as two more bonded for VM traffic.  The VM traffic is VLAN-tagged; the management traffic is not.

We had recently upgraded all of our pools to XenServer 6.1, a little faster than we typically would have, so that we could gain access to some of the cool new features (e.g., inter-pool VM migration).  Life was good and everything worked fine, until it came time to apply a couple of patches.  After applying a patch I would reboot the server, at which point it would momentarily re-contact the pool and then disappear.  The Xapi services on the host would not respond, and the pool master would not acknowledge the node's presence.  SSH connectivity to the node still worked, however.

This issue proved to be pre-existing; that is, the patches were not what caused the problem.  I tried rebooting a node that had vanilla XS 6.1 and it exhibited the same symptoms.  It was just a coincidence that the servers had not been rebooted until it came time to apply patches.

After some trial and error, I was able to reliably get the node back online by performing an "emergency network reset" and rebooting.  However, the node would stay in the pool only until the next reboot, whereupon it became a case of lather, rinse, repeat.

Further trial and error showed that if I removed the management bond entirely and ran all management traffic through a single interface, reboots worked as expected (i.e., the system would seamlessly rejoin the pool).  As soon as I recreated the bond, the problem came back.


After a period of tearing out my hair over this, I noticed the MTU setting.  We typically configure our VM traffic bonds with an MTU of 9000 so that customers can use so-called "jumbo frames" within their VMs.  Without putting too much thought into it, we had been configuring our management bonds with MTU=9000 as well.  On a hunch, I re-created the management bond, but this time with the default MTU of 1500.  I rebooted the node and... SUCCESS!  It correctly re-joined the pool after the reboot.

So, the moral of the story seems to be that if you have XenServer 6.1 installed on a system with a bonded management interface, ensure that the bond has the default MTU of 1500.  Jumbo frames seem to make it unhappy for reasons unknown to me.  We've had these bonds enabled for quite some time -- this behavior seems to be new with version 6.1.  I haven't yet contacted Citrix to see whether they are aware of the issue, but I thought I would at least document it here in case someone else out there runs into similar problems.  I know that my many, many Google searches on the matter ended up being fruitless.
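
For anyone who wants a quick way to spot interfaces that have been left at a jumbo MTU, here is a minimal sketch.  It is not something from our actual tooling.  It assumes a Linux host that exposes each interface's MTU under /sys/class/net/<name>/mtu, and it assumes Python 3, which may not be what the XenServer control domain ships, so it may need adapting.  The function name find_jumbo_interfaces and its parameters are made up for illustration.

```python
from pathlib import Path

DEFAULT_MTU = 1500  # standard Ethernet MTU


def find_jumbo_interfaces(sysfs_root="/sys/class/net", limit=DEFAULT_MTU,
                          ignore=("lo",)):
    """Return {interface_name: mtu} for interfaces whose MTU exceeds limit.

    Hypothetical helper: sysfs_root is assumed to follow the usual Linux
    layout, with one directory (or symlink) per interface containing an
    'mtu' file.
    """
    oversized = {}
    for iface in sorted(Path(sysfs_root).iterdir()):
        if iface.name in ignore:
            continue  # loopback normally has a very large MTU; skip it
        mtu_file = iface / "mtu"
        if not mtu_file.is_file():
            continue  # e.g. bonding_masters is a plain file, not an interface
        try:
            mtu = int(mtu_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed entry; nothing to report
        if mtu > limit:
            oversized[iface.name] = mtu
    return oversized


if __name__ == "__main__":
    for name, mtu in find_jumbo_interfaces().items():
        print(f"{name}: MTU {mtu}")
```

This only reports what the kernel currently has configured; it does not tell you which interface carries management traffic, so you still have to match the names it prints against your own bond layout.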

The silver lining in this particular cloud is that throughout all this mess, all of our virtual machines stayed online and had no issues whatsoever, so our customers were never even aware there was a problem!  That has to count for something...