Monitoring vs. Root Cause Analysis – Is There a Unicorn System?

Cory Boisoneau - Manager, Sales & Marketing

March 3, 2016

I often see new IT monitoring products enter the marketplace (yes, I am a nerd and have a Google alert for “root cause analysis”, and yes, I read almost all of the results). Most of these products claim to do a wonderful job monitoring your systems, collecting data, and alerting you to unusual activity and failures. I have no doubt that they deliver on those claims. However, most of them also claim to perform root cause analysis, relieving you of the burden of analyzing your organization’s problems. These magical systems seemingly provide the Problem Manager with his/her “problem unicorn” – a system that collects your data, analyzes your problems, and identifies solutions, all without human participation! Unfortunately, much like the mythical unicorn of lore, this system does not exist in reality. We all want to comply with ITIL, but there is no such thing as automated root cause analysis.

Don’t get me wrong – I think monitoring is a valuable and necessary piece of the IT systems puzzle. It provides data that drives our root cause analyses. What a monitoring system cannot do is ask “why?” or “what caused this?”. Let’s look at an example that may be classified as a “root cause” by an IT monitoring system. The root cause identified is “server over max capacity”. Based on this intel, what would you do to eliminate this cause? How confident would you be that you have eliminated the problem, and that it will not recur? The likely solution of adding more server capacity won’t necessarily prevent this problem from recurring – we must understand why the server was over capacity, why we weren’t alerted before capacity was reached, whether old files were being purged properly, and so on.

To find effective solutions that prevent problem recurrence, we need to look beyond the error. We need to look at least two levels deeper, and software simply cannot do that meaningfully on its own yet. Use your monitoring data to inform your RCAs. It will help you put together your cause and effect chart. You, the analyst, must ask the probing questions in order to discover the true root causes of your problems. Relying on your monitoring system to perform your RCAs is an excellent way to ensure that you experience recurring problems. A few tips for digging deeper, beyond the basic monitoring data:

  1. Error Plus Two Rule – find at least two additional levels of causation beyond the error to ensure that you thoroughly understand the issue and why it occurred. If you don’t understand it, how can you solve it?
  2. Two Questions for Chart Building – we’re not talking about linear 5 Whys analysis. Use the two questions for chart building you learned in Sologic RCA training to create “and” relationships via branched causes. If you haven’t yet been to training, what are you waiting for? Start here:

While auto-filling of root causes doesn’t exist just yet, developing the cause and effect chart won’t take long once you ask a few probing questions and gather the data you already have. You will be glad that you did!
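
For readers who think in code, the Error Plus Two rule and the branched “and” causes described above can be pictured as a small tree. What follows is only a rough sketch in Python, not a Sologic tool or method: the names `Cause`, `levels_beyond` and `meets_error_plus_two` are made up for illustration, and the deeper cause "purge job stopped running" is a hypothetical finding added to the server example from this post. The sketch also assumes that one branch reaching two levels below the error is enough to satisfy the rule.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    """One box on a cause and effect chart (hypothetical structure)."""
    description: str
    # Causes that combine ("and") to produce this effect.
    causes: list["Cause"] = field(default_factory=list)


def levels_beyond(node: Cause) -> int:
    """Count the causal levels below a node along its deepest branch."""
    if not node.causes:
        return 0
    return 1 + max(levels_beyond(c) for c in node.causes)


def meets_error_plus_two(error: Cause) -> bool:
    """Assumed reading of the rule: at least two levels beyond the error."""
    return levels_beyond(error) >= 2


# What a monitoring system reports: the error alone, with nothing behind it.
monitoring_only = Cause("server over max capacity")

# What an analyst might build after asking "why?" (the last cause is hypothetical).
analyzed = Cause("server over max capacity", [
    Cause("old files not purged properly", [
        Cause("purge job stopped running"),  # hypothetical finding
    ]),
    Cause("no alert before capacity was reached"),
])

print(meets_error_plus_two(monitoring_only))  # False
print(meets_error_plus_two(analyzed))         # True
```

The monitoring alert on its own fails the check because it has no causes behind it. The analyzed chart passes only because a person supplied the deeper levels, which is the point of the post.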