In addition, some systems were responding well enough, but had very high context switch rates. The lowest rate I saw was about 500,000 context switches a second, but the highest I saw was over 2.8 million switches a second!
Almost all of the poorly performing systems were virtual servers, and the systems that performed acceptably but showed high context switch rates were all physical servers. I hypothesized that both groups had the same underlying problem, and that the physicals were coping only because they had more CPU power available to them -- most of our VMs don't have more than 2-3 vCPUs, whereas the physicals have upwards of 16 in some cases, if you count hyperthreads.
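For anyone wanting to check their own machines, here is a minimal sketch of one way to estimate the context switch rate. It assumes a Linux system, where /proc/stat carries a cumulative "ctxt" counter; the function names are my own for illustration, and this is not necessarily how the numbers above were gathered (the "cs" column of vmstat reports the same kind of figure).

```shell
#!/bin/sh
# Sketch: estimate context switches per second from the kernel's cumulative counter.
# Assumption: Linux, where /proc/stat has a line of the form "ctxt <count>".
# All function names here are hypothetical, chosen for illustration.

# Print the cumulative context-switch count from a stat file (default: /proc/stat).
ctxt_count() {
    awk '$1 == "ctxt" { print $2 }' "${1:-/proc/stat}"
}

# Print the average switches per second between two samples.
# Arguments: first count, second count, whole seconds elapsed between them.
ctxt_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Sample the counter over an interval (default: 1 second) and print the rate.
sample_ctxt_rate() {
    interval="${1:-1}"
    before=$(ctxt_count)
    sleep "$interval"
    after=$(ctxt_count)
    ctxt_rate "$before" "$after" "$interval"
}
```

Calling `sample_ctxt_rate 5` after loading these functions prints the average number of switches per second over a five-second window. The interval has to be a whole number of seconds, since the division uses integer shell arithmetic.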
As it turns out, the problem was caused by some weird (sorry, I'll try to keep from adding any more of this technical jargon...) kernel interaction when it processed the leap second that occurred today. For more details, I'll point you to the blog entry that helped me narrow the problem down and provided me with a simple fix:
For the record, should this link ever stop working, he said:
The fix is quite simple – simply set the date. Alternatively, you can restart the machine, which also works. Restarting MySQL (or Java, or whatever) does NOT fix the problem. We put the following into puppet to run on all our machines:
$ cat files/bin/leap-second.sh
# this is a quick-fix to the 6/30/12 leap second bug
if [ ! -f /tmp/leapsecond_2012_06_30 ]
then
    /etc/init.d/ntpd stop; date -s "`date`" && /bin/touch /tmp/leapsecond_2012_06_30
fi
His solution was a lot more elegant than mine, which was to simply reboot the system. :) It was also a lot easier to apply prophylactically to our entire fleet.
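If you are pushing something like this to a whole fleet, it can help to see what it would do before it does it. Below is a rough sketch of a dry-run variant of the quoted script. It is my own illustration, not the original author's code: the `MARKER` and `DRY_RUN` variables and the `run` and `leap_fix` function names are assumptions, and it defaults to printing commands rather than running them.

```shell
#!/bin/sh
# Hypothetical dry-run variant of the quoted quick-fix (not the original script).
# MARKER, DRY_RUN, run and leap_fix are illustrative names, not from the source.
MARKER="${MARKER:-/tmp/leapsecond_2012_06_30}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = actually run them

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Apply the fix once per machine, using the marker file to skip repeat runs.
leap_fix() {
    if [ -f "$MARKER" ]; then
        echo "already applied"
        return 0
    fi
    # As in the quoted script: stop ntpd, then set the clock to its current value.
    run /etc/init.d/ntpd stop
    run date -s "$(date)" && run touch "$MARKER"
}
```

Calling `leap_fix` with the default settings only prints the commands; setting `DRY_RUN=0` as root performs them. Like the quoted script, this leaves ntpd stopped, so whether and when to start it again is left to you.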