READ THIS FIRST:
We need to disclose that this EXAMPLE RCA is based upon publicly available information published in a single report by Amazon and not from any independent investigation conducted by Sologic. Sologic has not investigated this incident in any official capacity, and we do not want to imply that we were in any way associated with this event. The only purpose of this root cause analysis report is for it to be used as an example for our students and other interested parties.
A root cause analysis has two primary goals: 1) Organize a wide array of information from disparate sources in a way that makes it easier to understand, and 2) Identify a set of evidence-based solutions to present to decision makers. IT outage reports are often vague and peppered with tech-heavy terms. This style makes them a bit opaque to those outside the industry. But it does not have to be that way – a cause and effect chart provides a nice visual reference to go along with the report. The chart puts the causal interactions into context with respect to time, allowing the reader to see how the event unfolded.
Here are a few thoughts about IT and root cause analysis in general, not necessarily associated with this particular event. It has been our experience that IT professionals are often extremely intelligent. But at the macro level, IT is relatively new to the world of standardized problem solving. Many ITSM efforts focus first on Incident Management, with the intent of standing up Problem Management at some later point. When a large problem like the one detailed in this example occurs, IT professionals are under extreme pressure to complete the investigation as quickly as possible. Often, new problems have cropped up that need their attention, and their customers are demanding answers. Couple this environment with the fact that these systems are complex and the investigation team is often inexperienced with root cause analysis, and you get the right conditions for a sub-optimal investigation. This is not always the case – just an observation based on our experience.
The problem with this is the continued exposure to risk, even when steps are taken to formally solve the problem. An investment in a formal root cause investigation is supposed to finance a reduction in risk. The risk of problem recurrence is directly related to the quality of solutions implemented by the team, and solution quality depends on a logical, thorough, and evidence-based root cause analysis. When the consequences of failure are high, an investment in RCA capability pays off in a big way. This investment includes training, software, and consulting (all the things Sologic provides). But equally important is the investment leadership makes in change management. Building capability requires the structure of an RCA program, and this requires recognition by leadership that their success depends upon the collective problem-solving capability of the organization. This is particularly true in IT.
If possible, consider printing the following summary report and following along with the cause and effect chart as you read. Notice the solutions Amazon has put in place, along with which causes they control. What do you think?
Link to: Original Amazon Report
On 28-Feb-2017, Amazon Web Services (AWS) experienced a service disruption impacting the US EAST-1 Region. The disruption began at 9:37AM and lasted until service was restored at 1:54PM. The primary system impacted was the Amazon Simple Storage Service (S3). Other services, including the Amazon Elastic Compute Cloud (EC2), the Amazon Elastic Block Store (EBS), and AWS Lambda – all of which rely on S3 – were impacted for several hours longer.
Note that there may well have been additional impacts, but they were not reported publicly. For the purposes of an example, that's okay. In a real RCA, we would want to document the impacts thoroughly.
S3 Service Unavailable:
The S3 Service went down while a technician was troubleshooting a problem with the S3 billing system. The technician intended to remove a small number of servers, which would not have impacted service availability, but instead removed a much larger set of servers. This destabilized the S3 Service, bringing the system down. The technician was following an approved set of procedures (what the Amazon report refers to as an “established playbook”); however, the technician entered a command incorrectly. It is unclear whether this was simply an error on the technician’s part or whether there was an issue with the playbook itself. The system apparently had no secondary protections to prevent such a command from being executed, though specific system design parameters were not reported. An actual RCA would travel further down this pathway to identify how systems are designed, risks are identified, and preventive actions are implemented.
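To make the idea of a "secondary protection" concrete, here is a minimal sketch of the kind of capacity guard the report implies was absent. Everything here is hypothetical: the function name `remove_servers`, the 90% floor, and the fleet sizes are illustrative assumptions, not Amazon's actual tooling or parameters.

```python
# Hypothetical capacity guard: refuse any removal command that would drop
# the fleet below a minimum safe fraction of its current size. The names
# and the 90% threshold are illustrative, not Amazon's actual system.
MIN_CAPACITY_FRACTION = 0.9


def remove_servers(fleet_size: int, to_remove: int) -> int:
    """Return the fleet size after removal, rejecting unsafe removals."""
    remaining = fleet_size - to_remove
    if remaining < fleet_size * MIN_CAPACITY_FRACTION:
        # Secondary protection: a mistyped command that removes too many
        # servers is rejected instead of executed.
        raise ValueError(
            f"Refusing to remove {to_remove} of {fleet_size} servers: "
            f"{remaining} remaining would be below the safety floor."
        )
    return remaining
```

A check like this would not prevent the typo itself, but it would convert a service-wide outage into a rejected command and an error message.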
4:17 Required for S3 to Recover:
A full restart of the Index Subsystem was needed to bring S3 back online (time required = 3:41). Afterwards, the Placement Subsystem required time to recover (time required = 0:36). The S3 system has experienced massive growth in recent years, which has increased its complexity. Recovery history for this system is limited because it is generally reliable and therefore has not experienced a full restart in many years.
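The 4:17 figure in the heading is simply the sum of the two reported recovery durations. A short sketch using Python's standard `timedelta` confirms the arithmetic:

```python
from datetime import timedelta

# Recovery durations as reported in the summary above.
index_restart = timedelta(hours=3, minutes=41)   # Index Subsystem restart
placement_recovery = timedelta(minutes=36)       # Placement Subsystem recovery

total = index_restart + placement_recovery
print(total)  # 4:17:00
```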