A week or two ago I took part in the DevOpsJRS meetup at Cisco Jerusalem. Our guest speaker was Avishai Ish-Shalom. I always enjoy Avishai's talks: he is a great speaker and a down-to-earth guy, and I have had the opportunity and pleasure to work with him several times in the past.
One of the slides that he posted included the following:
I am currently involved in a Scrum product team, where we (try to) do retrospectives after each sprint.
For those of you who are not familiar with Agile methodologies, here is a short overview and my view on the process.
Making long-term plans is quite difficult, and sometimes even impossible, in our ever-changing world, where things move so fast. Scrum teams work in sprints. A sprint is a short burst of work whose length is defined by the team, but usually we are talking about bursts of 1-2 weeks.
The team plans the work for each sprint and concentrates only on the tasks at hand for that specific sprint. They work in small increments but continuously produce something that adds value.
After the sprint there is a retrospective. The team looks at what went well, what went badly, and how to improve. A huge amount of trust is needed within the team for this to be productive, and one of the most important things is that retrospectives are conducted in a blameless manner.
The point of such an exercise is to learn and to improve and not to point fingers.
Back to the root cause. In my previous IT positions whenever there was an outage, we did a root cause analysis to see what caused the problem. We always wanted to pinpoint that one thing that caused the problem.
I completely agree with what Avishai said. There is no such thing as a root cause; there are only contributing factors. But this seems to go completely against what you might know and have become accustomed to.
Let me try to demonstrate with an example.
A critical application stopped responding.
The outage caused downtime for 1 hour in your organization.
In a regular post-mortem and root cause analysis, you would have gone through the motions until you thought you had found the reason the app went down for an hour.
Why did it go down for an hour?
Because the host it was running on was disconnected from the network.
Why was it disconnected?
Because John disconnected the wrong cable when working in the datacenter.
There we found the root cause. It was John's fault.
If we are looking only for a root cause, that would be it.
But remember, there is no root cause, only contributing factors.
Digging down a little deeper will uncover a lot more.
Why did John disconnect the wrong cable?
Because he had already been at work for more than 24 hours, fighting fires and running from crisis to crisis.
He was tired (contributing factor).
And the cables were not marked correctly (another contributing factor).
So it was not John's fault. There were contributing factors.
The idea of this exercise is to improve and to understand the possible things that we can learn from this event so that it does not occur again.
Possible answers could be:
Make sure that all cables are marked clearly. It would have helped here.
John was tired and overworked. Why? Because he had too much on his plate; he was overloaded.
Perhaps increase automated processes that will free up more time for John and the team.
Invest in more staff, better equipment, additional training so that John would have a better balance and have time to invest in improvement.
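If you keep your post-mortem notes in a structured form, the example above could be written down roughly like this. This is only a minimal sketch in Python, not a real tool: the names `Incident`, `ContributingFactor` and `action_items` are hypothetical and chosen purely for illustration. The point it tries to show is that the record has no "root cause" field at all, only a list of contributing factors, each with its own follow-up actions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContributingFactor:
    """One factor that contributed to the incident (hypothetical structure)."""
    description: str
    actions: List[str] = field(default_factory=list)


@dataclass
class Incident:
    """A blameless post-mortem record: no single root cause, only factors."""
    summary: str
    downtime_minutes: int
    factors: List[ContributingFactor] = field(default_factory=list)

    def action_items(self) -> List[str]:
        # Collect the follow-up actions of every factor, in order.
        return [action for factor in self.factors for action in factor.actions]


# The outage from the example above, described by its contributing factors.
incident = Incident(
    summary="Critical application stopped responding",
    downtime_minutes=60,
    factors=[
        ContributingFactor(
            "Cables were not marked correctly",
            ["Mark all cables clearly"],
        ),
        ContributingFactor(
            "Engineer had been at work for more than 24 hours",
            [
                "Automate more processes to free up time",
                "Invest in more staff, better equipment and training",
            ],
        ),
    ],
)

for item in incident.action_items():
    print(item)
```

Notice that a person's name does not appear anywhere in the record. The factors describe the conditions, and the actions describe what the team can change.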
We must embrace outages, because they are the best learning opportunities and the best way to improve.
I would highly recommend using this method in your next retrospective or post-mortem. I am confident that it will improve your team, yourself and the way you work.