Too Much Data!
One of the challenges of working with Windows systems is that there are LOTS of objects and counters, and they are generally poorly documented. The trade press articles written about Windows performance analysis are usually sketchy and aimed at the novice.
Over the years we have taken a two-pronged approach: 1) collect more data than you need, and 2) cross-correlate that data to confirm that the counters are meaningful. This approach has worked well for us across many architectures.
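As a minimal sketch of the cross-correlation step, the Python below computes a Pearson correlation between two counter series sampled at the same intervals. The counter names and sample values are hypothetical, chosen only for illustration, and are not taken from the client data. The idea is that a counter which really measures paging should move together with a related counter such as write activity on the paging disk.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat counter carries no signal to correlate
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same intervals (not real measurements).
pages_out_per_sec = [310, 450, 900, 620, 300]
paging_disk_writes_per_sec = [35, 52, 101, 70, 33]

r = pearson(pages_out_per_sec, paging_disk_writes_per_sec)
print(f"correlation = {r:.2f}")
```

A value near 1 suggests the two counters are tracking the same underlying activity. A value near 0 is a hint that one of them may not mean what its name implies.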
Balanced? Not!
This recent Windows project has shown us how things can get out of hand. Here are some of our discoveries when looking at these systems.
Identical Boxes, Identical Software, Vastly Different Load
The first thing that jumped out of the data was a memory issue. Nine application servers at each of two sites should have been performing identically. However, the memory statistics showed that three of the application servers in the north had a severe memory problem. The stats said that these servers had committed memory over 150% of real memory. That in itself is not necessarily a problem. However, these same servers were experiencing page-out rates from 300 to 900 per second. The page-outs were happening continuously, with highest rates during the peak transaction times. That is a problem!
The other servers in the group had committed memory of less than 100% of real memory, and their page-out rates peaked far below those of our three problem servers.
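The check we applied by eye can be sketched in a few lines of Python. Everything here is illustrative: the server names, the sample figures, and the two thresholds are assumptions, not values from the client systems. The rule mirrors the reasoning above: overcommitted memory alone is not flagged, but overcommitment combined with sustained page-outs is.

```python
from dataclasses import dataclass


@dataclass
class ServerSample:
    name: str                 # hypothetical host name
    committed_mb: float       # committed memory
    physical_mb: float        # real (physical) memory
    pages_out_per_sec: float  # sustained page-out rate


def commit_ratio(s: ServerSample) -> float:
    """Committed memory as a fraction of real memory."""
    return s.committed_mb / s.physical_mb


def is_thrashing_suspect(s: ServerSample,
                         ratio_limit: float = 1.0,
                         page_out_limit: float = 100.0) -> bool:
    """Flag only when BOTH conditions hold; both limits are assumed values."""
    return commit_ratio(s) > ratio_limit and s.pages_out_per_sec > page_out_limit


# Hypothetical fleet: same hardware, very different memory behavior.
fleet = [
    ServerSample("north-app1", 6400, 4096, 900),  # overcommitted and paging
    ServerSample("north-app2", 3500, 4096, 12),   # healthy
    ServerSample("south-app1", 6300, 4096, 5),    # overcommitted, not paging
]

suspects = [s.name for s in fleet if is_thrashing_suspect(s)]
print(suspects)
```

In practice the page-out limit would be tuned against what the healthy servers in the same group show at peak.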
A little further investigation showed that a disk utility running on all of the systems had, for some reason, grown out of whack. On all of the other systems it was using 4 MB of main memory. On the three servers in question it was using 2 GB!
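Since identical boxes running identical software should look alike, one way to catch a runaway process like this is to compare its memory use on each server against the group's typical value. The sketch below uses the median as "typical" and an assumed factor of 10 as the alarm threshold. The host names are hypothetical, and the approach only works while fewer than half of the servers are affected.

```python
from statistics import median


def find_memory_outliers(usage_mb, factor=10.0):
    """Return hosts whose usage exceeds `factor` times the group median.

    usage_mb maps a host name to the memory (in MB) used by the same
    process on that host. `factor` is an assumed threshold.
    """
    typical = median(usage_mb.values())
    return {host: mb for host, mb in usage_mb.items() if mb > factor * typical}


# Hypothetical fleet: the utility uses 4 MB everywhere except three hosts.
usage = {
    "north-app1": 2048, "north-app2": 2048, "north-app3": 2048,
    "north-app4": 4, "north-app5": 4, "north-app6": 4,
    "north-app7": 4, "north-app8": 4, "north-app9": 4,
}

outliers = find_memory_outliers(usage)
for host, mb in sorted(outliers.items()):
    print(f"{host}: {mb} MB")
```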
Thrashing
Those levels of page-outs have a severe impact on the response of the system. When a transaction came into the complex, it was the luck of the draw whether it hit a good server or a bad one. I can guarantee you that if it hit one of the challenged servers, it would take quite a bit longer to process. The application would be competing with the disk utility for the memory that holds its programs and data, and every page forced out to disk and brought back in takes time. Since this is an interactive application with real people waiting for the web response, the consequences are real.
Solution?
The client chose to reboot these servers, and the broken service cleared itself up. Since we work with this client on an "arm's-length" basis, i.e., with no access to the systems, we never found out the exact cause of the problem, whether it was a bad control file or an operator screwup. The task for everyone concerned with these systems is to make sure it doesn't happen again.