While traveling last week and presenting at a couple of regional conferences, I was asked about how to find a millisecond-level problem. The questioner worked for an exchange, and was frustrated that most of his transactions were properly processed in the 3-5 ms. range, but occasionally some took longer. He asked how he could use his tools to try to find what's causing the longer transactions.
We chatted a bit further and I admitted that I didn't have a solution for him.
Then I got to thinking about the issue. He's using a fast modern processor, but an older operating system, designed for commercial transaction processing. The OS is only recently updated to allow symmetric multiprocessing, and documentation isn't yet available on where the locks are or how deeply they're placed and their granularity.
The challenge is that every OS will occasionally turn out the lights on the application and proceed to do its own work. That work can be substantial. For example, periodically an OS will:
- Scan its memory page tables, looking for expired or modified pages. It will destroy the expired pages, and flush out and perhaps destroy the modified pages. With gigabyte memories, there are a lot of entries to check.
- Scan its disk cache pages, doing the same.
- Scan other internal OS tables, doing cleanup.
And that's only if the application is stable, not starting processes, opening/closing files, paging, or doing disk I/O. If the application is actively doing any of the above, the OS will be busier and will have greater latency.
What do I mean by OS latency? It can be defined as the time between the instant that an application's requested data is available for processing and the instant that it actually is assigned a processor and allowed to run.
Let's say that I'm waiting on a disk I/O. That I/O completes, causing an interrupt indicating that the disk block is in memory. The event flags will be set indicating that the request is complete. How long between that moment and the moment that the process can start processing the data is the latency.
A lot can happen between those events as mentioned above. And at different levels. The OS may continue processing the requests in the cache or queue for that disk. If the disk is backed up, there may be significant processing to properly order and schedule the requests in the queue. There may be other interrupts from other disks or comm devices which need processing. Then, if the interrupt processing is complete, process scheduling has to sort out which processes are ready to run, and allow them to proceed. In the OS under discussion, disk and comm handlers are individual processes, typically running at a high priority, above the application. They have to do their work before the application gets time.
All this will add up. I wouldn't be surprised that the maximum latency delivered by this OS, even on a lightly-loaded system, could be on the order of 10 ms. or more. I don't think that OS designers publish those figures for commercial OS's.
Let's also keep in mind that transaction queuing is a factor. Unless you have a single source for the transaction stream, presenting them one-by-one, you have the probability that multiple transactions will arrive at the same instant. Don't forget that you're at the mercy of the arrival rate of those transactions. A 100 TPS average rate over a minute can mean that you had 6,000 transactions during that minute, arriving at even intervals, or that you had 6,000 transactions arriving in one second and nothing for the other 59 seconds.
The truth is somewhere in between, of course. But if your application design and capacity planning are for evenly-spread transactions, expect that some transactions will occasionally be slow.
My recommendation is that you be generous in time-stamping the transaction at each stage as it goes through its path. When the transaction has finished, record its information to disk and look at the data periodically to see what slowdowns are happening. Some OS's will report on time-in-queue for messages between processes, and this should be recorded and analyzed.
Another possible way of discovering OS latency is by using performance tools, if their data-gathering granularity goes down to the resolution that you're looking for. Let's say that you set simple data collection for CPU on a ten-second interval. What you will probably see is that successive samples will vary from the clock by milliseconds, or more. This can be taken as evidence of latency. Of course, such sampling is affected by process priority and everything else happening on the box.
I was chatting with friend about this issue. He had experience with another exchange regarding a similar problem. His conversation with me was quite proactive, saying that if they could reproduce the problem (there's the rub!) there was a probability that the vendor engineers could find the cause. In his particular experience, the exchange was able to reproduce the problem, and he was able to trace the cause to a piece of the IP stack that probably dated back to the days of 300 baud modems. The buffering at this level of the stack was totally insufficient, causing queuing and dropped packets. His team fixed the stack and the client was gratified.
On the other hand, when you're working with transactions that require such stringent response requirements, you probably need to find a very simple, low-overhead OS. I'm reminded that Data General years ago (remember, I'm a greybeard) had an RTOS (Real Time OS) with a guaranteed maximum latency. In today's world there are OS's designed for embedding into devices for real-time (sub millisecond) response.
Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.
Recent Comments