During one of our monthly web conferences with a client, the subject of service levels came up. They were getting poor service from one of their back-ends, and we found the source of the problem and quantified it.
Before I talk about how we found it, let’s talk about computer service level management in general.
What does “service level” mean? It seems there is no single definition.
For some, service level means availability: for example, 99% uptime. But a further distinction is required: are we talking about box (operating system or hardware) availability, or about application availability? You can have 100% availability for the box but only about 97.6% application availability if you take a 4-hour “planned outage” each week for database maintenance (4 hours out of 168).
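As a quick check on that arithmetic, here is a minimal Python sketch. The function name and the one-week period are assumptions made for illustration, not part of any particular monitoring product.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def availability(downtime_hours, period_hours=HOURS_PER_WEEK):
    """Return the fraction of the period during which the service was up."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must lie between 0 and period_hours")
    return (period_hours - downtime_hours) / period_hours


box = availability(0)          # the box never went down
application = availability(4)  # 4-hour weekly maintenance window
print(f"box: {box:.1%}, application: {application:.1%}")
# prints "box: 100.0%, application: 97.6%"
```

Whether a planned outage counts against the service level is a contract decision; the sketch simply shows how large the gap is when it does.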
Payments systems and customer-facing systems require 100% application availability.
If you’re writing a specification regarding service level management based on availability, you may also want to include limits on downtime incidents per period and mean time to recover or repair. Nothing is more frustrating than something that breaks frequently even though it may have a quick recovery.
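To make those two extra limits concrete, the following Python sketch computes the incident count and the mean time to recover from a list of outages. The incident records are hypothetical; in practice they would come from whatever incident log you keep.

```python
from datetime import datetime, timedelta


def incident_stats(incidents):
    """Summarize outages given as (start, end) datetime pairs.

    Returns (incident_count, total_downtime, mean_time_to_recover).
    """
    durations = [end - start for start, end in incidents]
    count = len(durations)
    total = sum(durations, timedelta())
    mttr = total / count if count else timedelta()
    return count, total, mttr


# Hypothetical incidents for one reporting period.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 5)),
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 14, 10)),
    (datetime(2024, 1, 21, 2, 30), datetime(2024, 1, 21, 2, 45)),
]
count, total, mttr = incident_stats(incidents)
print(count, total, mttr)
```

A specification could then cap both numbers, for instance a maximum incident count per month and a maximum mean time to recover, so that frequent short outages do not hide behind a good uptime percentage.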
Payments systems and customer-facing systems require two additional measurements of service level: transaction success and response time. Both are extremely important.
Response time is a traditional measure of service level: for example, 98% of the transactions must finish within 2 seconds. Note that this is a distribution, not an average. An average could be stated as “the average response time must be no more than 2 seconds.” Averages are terrible specifications of response. For example, one could have three transactions taking 3 seconds each and three taking 1 second each. The average response in this case is 2 seconds, which satisfies the average-based target. But this scenario wouldn’t meet the 98% criterion, since only 50% of the transactions (the three at 1 second) finished within 2 seconds.
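The six-transaction example can be checked with a few lines of Python. This is only a sketch; the function names are made up for illustration.

```python
def average(times):
    """Arithmetic mean of the response times."""
    return sum(times) / len(times)


def fraction_within(times, limit_seconds):
    """Fraction of transactions that finished within the limit."""
    return sum(1 for t in times if t <= limit_seconds) / len(times)


# Three transactions at 3 seconds, three at 1 second.
times = [3.0, 3.0, 3.0, 1.0, 1.0, 1.0]

print(average(times))               # 2.0
print(fraction_within(times, 2.0))  # 0.5
print(fraction_within(times, 2.0) >= 0.98)  # False
```

The same data passes an average-based target and fails the distribution-based one, which is why the specification should be written as a percentage within a limit.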
Transaction success rates should also be tracked. In the real world of multi-stage transaction paths, it’s possible to interrupt a transaction somewhere in the path, deny it, and return it within the required response time. A denial or reversal reason code is always useful.
Happily, in the payments industry the standard packages* capture the data we need in their logs. From these logs we can determine which institution or site sent the transaction, what response and success we gave them, to whom we sent the transaction, and what kind of response and success we’re getting from them.
*We have log analyzers for BASE24, eFunds Connix (both on Tandem – HP NonStop), ON/2 (Stratus VOS), and Open/2 (Windows, Unix). Other providers are welcome to talk to us.
We report and chart counts, averages, and distributions for the transaction stream as a whole, and for each source and destination (authorizer) of a transaction. When we do response distributions, we categorize each transaction into one of several “buckets”: ideal response, good-enough, heading for trouble, and definitely broken response. We watch for the outliers in the distribution, because they provide the early warning that something’s going wrong. If they’re there, we want to know why, and we will drill down to see when they occurred and who or what caused them.
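The bucketing idea can be sketched in Python as follows. The threshold values and the record layout (destination, response seconds) are assumptions chosen for illustration; real limits depend on the service agreement, and real records would be parsed from the transaction logs.

```python
from collections import Counter

# Hypothetical upper bounds, in seconds, for each bucket.
BUCKETS = [
    (1.0, "ideal"),
    (2.0, "good-enough"),
    (5.0, "heading for trouble"),
]
BROKEN = "definitely broken"


def bucket_for(seconds):
    """Return the label of the first bucket whose bound covers the time."""
    for upper_bound, label in BUCKETS:
        if seconds <= upper_bound:
            return label
    return BROKEN


def distribution(records):
    """Count transactions per bucket for each destination.

    records: iterable of (destination, response_seconds) pairs.
    """
    counts = {}
    for destination, seconds in records:
        counts.setdefault(destination, Counter())[bucket_for(seconds)] += 1
    return counts


# Hypothetical sample of parsed log records.
records = [("A", 0.4), ("A", 1.5), ("B", 3.0), ("B", 9.0), ("B", 0.9)]
for destination, counter in distribution(records).items():
    print(destination, dict(counter))
```

Growth in the last two buckets for a single destination is the kind of outlier signal worth drilling into.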
Take a look at this page of example charts to see the progression of a problem diagnosis.
The bottom line: We saw the outliers increasing in our client’s transaction response. We were able to identify that it was an external problem, caused by increasingly slow or inconsistent response from one of their back-ends. We were able to identify the back-end, and provide a history of the problem going back several months. And we were able to identify the times of day that the back-end was having the problem.
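Finding the times of day when a back-end misbehaves amounts to grouping its slow transactions by hour. A minimal Python sketch, assuming hypothetical records of (timestamp, destination, response seconds) and an illustrative 5-second limit:

```python
from collections import Counter
from datetime import datetime


def slow_by_hour(records, destination, limit_seconds):
    """Count slow transactions per hour of day for one destination.

    Returns a list of (hour, count) pairs sorted by hour.
    """
    hours = Counter()
    for timestamp, dest, seconds in records:
        if dest == destination and seconds > limit_seconds:
            hours[timestamp.hour] += 1
    return sorted(hours.items())


# Hypothetical parsed log records.
sample = [
    (datetime(2024, 1, 3, 9, 15), "B", 7.2),
    (datetime(2024, 1, 3, 9, 40), "B", 6.1),
    (datetime(2024, 1, 3, 13, 5), "B", 0.8),
    (datetime(2024, 1, 4, 17, 30), "B", 8.4),
    (datetime(2024, 1, 4, 9, 20), "A", 9.9),
]
print(slow_by_hour(sample, "B", 5.0))  # [(9, 2), (17, 1)]
```

Running the same grouping over several months of logs gives the history of the problem as well as its daily pattern.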
If only I could get that back-end switch to hire us to help them!