Monitoring vs. Root Cause Analysis – Is There a Unicorn System?

Cory Boisoneau - Manager, Sales & Marketing

March 3, 2016

I often see new IT monitoring products enter the marketplace (yes, I am a nerd and have a Google alert for “root cause analysis”, and yes, I read almost all of the results). Most of these products claim to do a wonderful job monitoring your systems, collecting data, and alerting you to unusual activity and failures. I have no doubt that they deliver on those claims. However, most of them also claim to perform root cause analysis, relieving you of the burden of analyzing your organization’s problems. These magical systems seemingly provide the Problem Manager with his or her “problem unicorn” – a system that collects your data, analyzes your problems, and identifies solutions, all without human participation! Unfortunately, much like the unicorn of lore, this system does not exist in reality. We all want to comply with ITIL, but there is no such thing as automated root cause analysis.

Don’t get me wrong – I think monitoring is a valuable and necessary piece of the IT systems puzzle. It provides the data that drives our root cause analyses. What a monitoring system cannot do is ask “why?” or “what caused this?”. Let’s look at an example of what an IT monitoring system might classify as a “root cause”: “server over max capacity”. Based on this intel, what would you do to eliminate this cause? How confident would you be that you have eliminated the problem, and that it will not recur? The likely solution of adding more server capacity won’t necessarily prevent this problem from recurring. We must understand why the server was over capacity, why we weren’t alerted before capacity was reached, whether old files were being purged properly, and so on.

To find effective solutions that prevent problem recurrence, we need to look beyond the error. We need to look at least two levels deeper, and software simply doesn’t yet have the capability to do that meaningfully on its own. Use your monitoring data to inform your RCAs. It will help you put together your cause and effect chart. You, the analyst, must ask the probing questions in order to discover the true root causes of your problems. Relying on your monitoring system to perform your RCAs is an excellent way to ensure that you experience recurring problems.

A few tips for digging deeper, beyond the basic monitoring data:

  1. Error Plus Two Rule – find at least two additional levels of causation beyond the error to ensure that you thoroughly understand the issue and why it occurred. If you don’t understand it, how can you solve it?
  2. Two Questions for Chart Building – we’re not talking about a linear 5 Whys analysis. Use the two questions for chart building you learned in Sologic RCA training to create “and” relationships via branched causes. If you haven’t yet been to training, what are you waiting for? Start here: http://www.sologic.com/root-cause-analysis-training
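
For readers who like to see an idea as code, here is a rough Python sketch of how an analyst might record a cause and effect chart and check it against the Error Plus Two Rule. Everything in it is an assumption made for illustration: the `Cause` class, the helper functions, and the example chart (built on the “server over max capacity” alert discussed above) are hypothetical and are not part of any Sologic tool or monitoring product. The sketch also assumes the rule is met when the deepest branch reaches two levels below the error. Notice that the code only stores and counts what a person has entered. The analyst still has to ask “why?”.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cause:
    """One box on a cause and effect chart (hypothetical structure)."""
    description: str
    # Causes that acted together ("and" relationship) to produce this effect.
    causes: List["Cause"] = field(default_factory=list)


def levels_below(node: Cause) -> int:
    """Count the causal levels recorded beneath this node (deepest branch)."""
    if not node.causes:
        return 0
    return 1 + max(levels_below(c) for c in node.causes)


def meets_error_plus_two(error: Cause) -> bool:
    """Assumed reading of the rule: the deepest branch goes two levels down."""
    return levels_below(error) >= 2


def unexplored(node: Cause) -> List[str]:
    """List the leaf causes where nobody has asked "why?" yet."""
    if not node.causes:
        return [node.description]
    found: List[str] = []
    for c in node.causes:
        found.extend(unexplored(c))
    return found


# What a monitoring system hands you: the error and nothing beneath it.
monitoring_only = Cause("Server over max capacity")

# A hypothetical chart after the analyst has asked a few probing questions.
chart = Cause("Server over max capacity", [
    Cause("Disk usage grew faster than expected", [  # illustrative cause
        Cause("Old files were not being purged"),
    ]),
    Cause("No alert fired before capacity was reached"),
])

print(meets_error_plus_two(monitoring_only))  # False
print(meets_error_plus_two(chart))            # True
print(unexplored(chart))
```

The last line prints the causes that still have nothing beneath them, which is a handy list of where to aim the next round of questions.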

While auto-filling of root causes doesn’t exist yet, developing the cause and effect chart won’t take long once you ask a few probing questions and gather the data you already have. You will be glad that you did!