Have you heard of the Heinrich Ratio, also known as the Iceberg Model? It was originally postulated for industrial accident prevention. It says that for every 600 incidents, there will be 30 serious incidents and 1 catastrophic incident. (The numbers vary by commentator.) To put it another way, for every disaster, there were 30 reported incidents which may have led to the disaster, and 600 unreported incidents.
I firmly believe that the Heinrich Ratio applies to computer technology as well. Most system outages and slowdowns need not happen. They can be predicted. Predicting them doesn't require rocket science, just due diligence.
In the world in which I work, that of customer-facing or money-handling systems ("mission-critical," an overused term), an outage of even a minute or two can be extremely expensive. If you're running a point-of-sale system, an ATM network, or a financial switch, 100 transactions per second at an average value of $10 means $1,000 per second of business through the system. A one-minute system disruption means lost business of $60,000.
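The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model: it assumes a constant transaction rate and a fixed average value, and the function name `outage_cost` is invented for illustration.

```python
def outage_cost(tps: float, avg_value: float, outage_seconds: float) -> float:
    """Business value that would have passed through the system during an outage.

    Assumes a constant transaction rate and a fixed average transaction value.
    """
    return tps * avg_value * outage_seconds


# Figures from the example above: 100 transactions per second at $10 each.
per_second = outage_cost(100, 10.0, 1)
per_minute = outage_cost(100, 10.0, 60)
print(f"${per_second:,.0f} per second, ${per_minute:,.0f} per minute")
# prints: $1,000 per second, $60,000 per minute
```

Plug in your own peak rate and average ticket size to see what a minute of downtime is worth on your system.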
It may also mean a lost customer. If I try to use a credit card and it gets rejected and I know that my credit is good, I may think twice about using that card again. Too many rejects and I'll drop the card.
How many times have you tried to order from a website and been frustrated because it was too slow? I recently walked away from a site and chose a competing product because I wasn't able to get an order placed properly.
Given this, it makes a lot of sense to go attack the "Iceberg."
Yes, I know, all of your production software is perfect, no bugs or issues. However, I suspect that it has some "unique features" that occasionally raise their heads. And of course your operations staff never makes a mistake. That's as it should be.
When we work with a client we search for incidents that could become the base of the iceberg. We look for the intermittent problem which indicates a "unique feature," or for the ill-advised operator action that could lead to trouble in the future. Things like backups being run at the wrong time; a process that goes into a loop for a minute or two, but behaves normally most of the time; batch jobs competing with the on-line system during peak times; communications links that drop and resume unexpectedly.
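One of these checks, batch jobs or backups running during peak times, is simple enough to sketch. The following is a minimal illustration, not a description of any particular tool: the peak window, the job names, and the schedule are all hypothetical, and it assumes each job starts and ends on the same day.

```python
from datetime import time

# Hypothetical peak window for the on-line system (an assumption for this sketch).
PEAK_START = time(9, 0)
PEAK_END = time(17, 0)


def overlaps_peak(start: time, end: time,
                  peak_start: time = PEAK_START,
                  peak_end: time = PEAK_END) -> bool:
    """True if a same-day job window [start, end) overlaps the peak window.

    Jobs that run past midnight are not handled in this sketch.
    """
    return start < peak_end and end > peak_start


# Hypothetical schedule: job name -> (start, end).
jobs = {
    "nightly_backup": (time(1, 0), time(3, 0)),
    "inventory_batch": (time(11, 30), time(12, 15)),
}

# Collect the jobs that compete with the on-line system during peak hours.
flagged = [name for name, (start, end) in jobs.items() if overlaps_peak(start, end)]
print(flagged)
# prints: ['inventory_batch']
```

In practice the schedule would come from whatever job scheduler or operator logs the site actually uses; the point is only that this kind of incident can be found by looking.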
Each month we talk to the client about what we find, and recommend measures for avoidance.
If we can shrink the base of the iceberg, reducing the pool of potential mistakes, we will reduce the risk that a catastrophic outage will occur. We'll save money, and keep our customers happy.
Jon -
Great stuff! This piece really summarizes what the Ban Bottlenecks approach is all about. Fantastic!
-- Andy
Posted by: Andy | 10/23/2009 at 11:11 AM
Jon, Couldn't agree with you more:
"I firmly believe that the Heinrich Ratio applies to computer technology as well. Most system outages and slowdowns need not happen. They can be predicted. Predicting them doesn't require rocket science, just due diligence."
You only need to look to find a potential problem...but there are not many people looking, until it is too late.
Look forward to more.
MartinC
Posted by: MartinC | 10/23/2009 at 02:27 AM