“Wait… is that billions?”

Ironically, after yesterday’s post about Oracle’s cavalier attitude toward dealing with communities built around open software, a piece of dyed-in-the-wool proprietary software stepped out of Oracle’s (ever-growing) stable of software, neighed a couple of times, stamped its feet and kicked me in the head.  Specifically, one of our 11g database servers shot up to one-minute load averages a little above 2,000.  Yes, two zero zero zero.

Despite the comically high load, our M5000 server running Solaris 10 withstood the onslaught and the system was still quite usable.  A testament to the engineering teams at Sun if I ever saw one.  A quick jump to the old performance standbys (mpstat, iostat, prstat and vmstat) revealed many involuntary context switches and a run queue in the thousands.  My DBA colleague and I went about the business of diagnosing the problem and he promptly ran an AWR report.  Below is what we found.

Oracle 11g AWR Report


“Wait, is that billions?” I asked naively.

“Yeah… something’s wrong,” my DBA colleague replied in his usual calm, understated way.

Oracle Support confirmed that, yes, four billion mutex waits in the span of an hour appeared to be the cause of our pain.  Luckily for us, this is not undiscovered country.  A quick Google search revealed that 11g is notorious for this particular type of pain.  The fix, of course, was a patch.  Specifically, 10411618: Add different wait schemes for mutex waits.
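For a sense of scale, here is a back-of-the-envelope sketch of what that number works out to per second. It assumes the round figures quoted above (about four billion waits over about one hour); the variable names are mine, not anything taken from the AWR report.

```python
# Rough figures from the post: ~4 billion mutex waits in ~1 hour.
# Both values are approximations, not exact readings from the AWR report.
mutex_waits = 4_000_000_000
interval_seconds = 60 * 60  # one hour

# Average rate of mutex waits over the interval.
waits_per_second = mutex_waits / interval_seconds

print(f"{waits_per_second:,.0f} mutex waits per second")
```

That comes to a little over a million mutex waits every second, sustained for the whole hour.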

Annoyingly, had Oracle been gracious enough to add DTrace probes to their enterprise products, we could’ve saved a lot of heartache with a one-liner.

Also, here’s a good intro to Oracle DB mutexes and latches.
