Fix the problem, not the blame
The sometimes counterintuitive art of looking hard at your mistakes
Anything that can go wrong will go wrong. - Murphy’s Law
If anyone, then, knows the good they ought to do and doesn't do it, it is sin for them - (James 4:17)
Shit happens - unknown
Learning from failure
Failure comes in many forms: human error, design flaws, unexpected external events. The test of an engineering culture is less whether mistakes are made than how it responds to them. This article discusses a common mechanism to extract benefit from failure at the individual, team and org levels.
Post-mortem discussions are a common part of many engineering cultures. They involve a cross-functional group gathering to understand what went wrong and how to do better in the future. This idea goes by other names including RCA (Root Cause Analysis - thanks Shawn!!) and COE (Correction of Error). The latter has been made semi-famous by its use at Amazon. I’m going to use a lot of Amazon specifics as examples, and since “COE” is short I’m going to use that term throughout the doc. Apologies if that’s triggering for any readers who ever worked at Amazon (no, I’m not totally joking).
Amazon has published a nice summary of the what and how of COEs that you can click through and read. It’s well worth your time if you haven’t seen examples of this process before. Since they’ve already done some of the heavy lifting, I’ll treat this article as supplemental material to theirs.
First off - the COE process cannot be about assigning blame or serve as a punishment. Well, of course it can play those roles. But I guarantee that if it does you’ll get crappy learnings and will be wasting your time. The reasoning is captured in this (possibly apocryphal) story about an IBM VP from way back in the day. After a dreadful incident the VP walks into the president's office expecting to be fired. When the president realizes this is the expectation, their remark is said to be “Why would I fire you? I just spent $4 million teaching you how not to screw this up.” If you don’t act in accordance with “no blame” and “no punishment” you’re just not going to get the truthful data you need. Also - you’ll be seen as a total jerk. And you don’t want that - right?
Most COE discussions are centered around a written document. This provides structure, persistence and communication scale to the results. AKA - writing it down lets the learning spread beyond the people who are in the room for a live discussion.
I’ll start by laying out a common template for such a document. It’s always good to tune for your situation and culture. I briefly explain each section here and then go deeper in the following sections.
Title of the incident: Unfortunately, eventually you’ll write two or more of these, so it’s good to have a title as a high level reference to ease future conversations. Some places go so deep into the blameless part that they title the documents with random words like “Effervescent Fleas” (I assume some intern always writes the name generator). But I suggest calling it something more functional like “All warehouse orders bollocksed up due to lack of immutable updates” or just “Unable to ship due to warehouse DB error.” This means you don’t need to map random names to real events. Also I don’t have to look up how to spell Effervescent to find the document every time.
High level summary of what happened. I strongly encourage the team to get as crisp and clear as possible here.
Impact to the business. This can include customers impacted and how, dollars lost, and everything up to and including (but hopefully not) lives lost. This is an incredibly common spot to have unhelpful tension in the doc creation process, for reasons explained below.
Timeline: This helps provide a common understanding of what happened when. This may seem nit-picky, but picture two COEs. One where issues were popping up for weeks before someone noticed, and another where a wailing siren kicked everyone into action within 5 minutes of the start of the problem. They’re rather different - right?
Metrics: Another area that can be really misused. All I’ll say is don’t go crazy: include what helps people understand and derive next steps. Leave out the rest.
Deep dive into what happened / Analysis of root cause
Where did we get lucky? (optional but fun and surprisingly useful)
Corrective steps to be taken based on our analysis. To state something obvious that doesn’t always get real thought - don’t write things here you don’t really intend to do.
Common pitfalls
Some gotchas that can keep you from holding an effective, blameless postmortem include:
Focusing too narrowly on what happened vs. the larger pattern.
Looking for hyper specific root cause relationships in complex systems.
Believing it’s impossible to find a root cause due to complexity. Which can cause missing out on basic changes that would help a lot in the future.
Giving the impression the postmortem is punishment.
Intending the COE as punishment.
Answering lots of questions that aren’t that helpful. This is often a case of not realizing that “questions are cheap and answers are expensive” (thanks to Oliver Jones for teaching me that one). It’s easy in every section of the doc for someone to pipe up with “what about X?” or “can you share more details on Y?” As in almost all things, before taking the time to research an answer you should ask if the group will make a different decision going forward if they have more information. If the answer is no - then politely decline to spend more time on the answer. If this is the only thing you take away then I think this article gave you your money’s worth.
Listing everything you could possibly improve and then doing it all regardless of how much each item reduces actual risk.
Performing a pro forma dance to fill in a document and hold a meeting that lacks value or just takes 5x as much time as it should. Possibly because you’re answering a bunch of questions that don’t have leverage for future decisions.
Making “corrections” that rely on best intentions vs. mechanisms. A mechanism is something that doesn’t require people to pay attention at the right moment. For example - we hope people will check their brake pads on a regular basis, but hoping they check manually is a best intention. The mechanism isn’t reliant on the owner being super diligent: an overly worn brake pad makes a loud screeching noise, which increases the chance the issue will be noticed early enough. That’s your mechanism. (A small software sketch of the same idea follows this list.)
Not sharing the document broadly for others to learn and improve your thinking at the same time.
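To make the brake pad analogy concrete in software terms, here is a minimal sketch of a “wear indicator” style mechanism: a scheduled job that fails loudly when the newest backup of a database is too old, instead of hoping someone remembers to check. The backup-listing call and database name are hypothetical placeholders, not a real API.

```python
# A software analog of the brake pad wear indicator (a minimal sketch; the
# backup listing call is a hypothetical placeholder for your storage API).
# Instead of hoping someone remembers to check backups, a scheduled job
# "screeches" whenever the newest backup is too old.
import sys
from datetime import datetime, timedelta, timezone

MAX_BACKUP_AGE = timedelta(hours=24)

def newest_backup_time(db_name: str) -> datetime:
    """Hypothetical: query your backup storage for the most recent snapshot."""
    return datetime(2024, 1, 1, tzinfo=timezone.utc)

def check_backup_freshness(db_name: str) -> None:
    age = datetime.now(timezone.utc) - newest_backup_time(db_name)
    if age > MAX_BACKUP_AGE:
        # Exiting non-zero makes a cron/CI job fail loudly instead of staying silent.
        sys.exit(f"{db_name}: newest backup is {age} old, limit is {MAX_BACKUP_AGE}")
    print(f"{db_name}: backups are fresh ({age} old)")

if __name__ == "__main__":
    check_backup_freshness("warehouse_db")
```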
Writing as leverage
Amplify the value of the learnings via sharing. Don’t just have team members discuss some problems, fix them and then move on. Include external voices and ensure there is involvement from differing work perspectives. Product, design, individual contributors and management staff should all participate. Write a document of some sort and share it broadly so others benefit.
If this is new-ish to your org consider having a COE bar raiser role - someone who attends COE reviews even though they’re not in the org that had the issue. I’m not suggesting having them block progress, as this can quickly come off as more oversight and thus a punishment. But a post-mortem coach who has done this before, is good at it and can drive constructive discussion can really help, especially in the formative part of a cultural transition.
Some notes on the specific document sections themselves
Summary of the incident
The first part of any COE should be a succinct, accurate summary of what happened. Clarity here is critical. A long, technical exposition may obscure the root issue. I’ve seen long (allegedly) explanatory passages about how things started and the engineer’s state of mind leading up to the issue - literally including the length of a pre-mistake lunch break and where they were going on vacation the next day*. I’m not kidding. Instead, focus on stating clearly what changed, and what broke as a result. For example, “We changed the meaning of a field in the database. A flag indicating drop shipments was lost. This caused packages not to be shipped.” A well-written summary should be understandable even to someone outside the team.
Questions are cheap, answers are expensive: AKA the “impact section”
Many templates ask for an impact summary: how many customers were affected, how much money was lost, how many hours were wasted? While finance teams may crave precision, often the value lies more in identifying the type of defect and its severity category than in quantifying exact numbers. This is also usually much faster. Avoid using relative terms (“a small percentage of shipments were delayed,” “warehouse teams experienced a stressful day”) - these don’t always mean the same thing to everyone (don’t ASSuME). Avoid over-investing time in questions with limited downstream value. Remember: answers are expensive, so only seek the ones that matter.
I was recently asked for advice on what to do when you’re stuck in a situation where you suspect you’re being asked for unhelpful but time consuming work in this section. I suggested the following sort of script:
“OK, let me ask you a few questions to confirm I understand what you’re asking for?”
Listen closely then repeat it back to check you’ve got it correct.
Pause and think.
If it’s a no brainer to you then stop reading and do the work. Otherwise…
Respond with “OK, that does sound interesting. I’m happy to do that, but it will take ____ time. Do you have a sense of what type of answer would cause you to make a different decision than we’re planning on now? It will help me figure out how precise to get and thus how much time to take away from other things like A, B, C.”
Listen to the response and see if it now makes sense. If not, ask more questions and ultimately decide if it’s worth doing. This is one of those cases where constructive conflict is better than compromise. Compromise in decisions in my view is almost always a painful mistake. It usually results in a “solution” that less effectively addresses the core problem and tends to leave folks more dissatisfied than other approaches. If that sounds interesting please subscribe as I’m planning an article on that topic. And not just because I have a great line from The Beastie Boys to use in it at the top of the article.
I know the above script sounds pretty obvious when you read it. But I’ve used it enough and gotten enough feedback to feel it’s worth repeating in all its complexity. Practical phrases that encourage questioning of underlying assumptions feel like Jedi mind tricks to me. Sorry, I thought I’d gotten all of the Star Wars stuff out of my system. ;-)
Caveat: I find the “too much detail” about impact often comes into play around exact dollars lost or numbers of customers. It’s often an unnecessary level of resolution. But that’s not to say that you shouldn’t really think about the impact beyond “yeah - that really sucked and we don’t want it to happen again." There may be 2nd or 3rd order impacts which will be unearthed by thinking about impact deeply. For example, let’s say something went wrong that caused a lot of planes to be delayed and flights to be missed. Whether it’s 10,000 or 11,221 passengers impacted likely doesn’t matter. It’s a problem and you’re going to invest in it seriously with either number. But it does matter if you also have an extremely crappy rebooking process that takes 3 hours to complete, leaving people in limbo. In that case asking “what happens in terms of impact when a single passenger has a missed connection we caused?” would be a fantastic thing to learn more about.
Timeline of Events
The timeline includes details on what happened, in what order, and which teams/systems were involved. This reconstructs the context and reveals whether delays in detection or response made things worse. It’s important to differentiate between what was known at the time and what was understood in hindsight. Again - not to blame, but to figure out what would have led to a less painful outcome.
Root Cause Analysis: The Five Whys
The Amazon article I shared above mentions the “Five Whys” technique for getting to the “root cause.” Amazonians love the Five Whys! Doing it takes some practice, but it’s not complex. Like a curious five-year-old asking “why?” repeatedly, you dig until you hit a systemic insight. If someone deleted a production database, ask why. If the answer is they thought it was the development environment, ask why that confusion was possible. Keep going: was it due to poor environment labeling? Lack of confirmation prompts? Inadequate safeguards?
It doesn’t always take five steps, and sometimes you’ll find multiple chains of contributing causes. In complex systems, there’s often no single root cause, but understanding the network of contributing factors is essential. Importantly, you must also explore “why didn’t we catch this sooner?” and “why didn’t the mitigation work?” The goal is not to reach the end of the whys - it’s to reach a point where action can be taken. I won’t claim it’s perfect. In particular, like any form of human endeavor it’s susceptible to overfitting. For a fascinating read about the over-reliance on “root cause” thinking in complex systems I’d recommend Drift into Failure. But only after you’ve subscribed to this newsletter first. ;-)
Reflection on Luck
Many teams skip this, but it’s a gold mine: what went well? Were you lucky? Did something break, but you caught it just in time? Understanding where luck, not planning, saved the day often reveals hidden risks. For example, if your only alert was someone noticing something strange on a dashboard, you weren’t prepared - you were lucky. Build from that insight. It also helps combat the cognitive bias that because something didn’t go wrong, it isn’t a problem.
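As a hedged illustration of converting that luck into preparedness, here is a minimal sketch of an automated check that pages when a shipping failure rate crosses a threshold, rather than relying on someone happening to glance at a dashboard. The metric source, threshold and paging hook are all hypothetical placeholders for whatever your stack actually uses.

```python
# Minimal sketch of turning "someone happened to glance at the dashboard"
# into a mechanism. fetch_order_stats() and page_on_call() are hypothetical
# placeholders, not real APIs.
from dataclasses import dataclass

@dataclass
class OrderStats:
    attempted: int
    shipped: int

ERROR_RATE_THRESHOLD = 0.02  # page if more than 2% of orders fail to ship

def fetch_order_stats() -> OrderStats:
    """Hypothetical: pull the last hour of order counts from your metrics store."""
    return OrderStats(attempted=10_000, shipped=9_750)

def page_on_call(message: str) -> None:
    """Hypothetical: hand off to your paging system. Here we just print."""
    print(f"PAGE: {message}")

def check_shipping_health() -> None:
    stats = fetch_order_stats()
    failure_rate = 1 - (stats.shipped / stats.attempted)
    if failure_rate > ERROR_RATE_THRESHOLD:
        page_on_call(f"Shipping failure rate at {failure_rate:.1%}, "
                     f"threshold is {ERROR_RATE_THRESHOLD:.0%}")

if __name__ == "__main__":
    check_shipping_health()  # run on a schedule (cron, a Lambda, etc.)
```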
Corrective Actions
After understanding the root causes, the next step is to define specific corrective actions. These should directly tie to identified causes. Avoid vague or soft commitments like “we’ll try harder” or “we’ll be more careful.” Those are best intentions, not mechanisms.
Instead, define true mechanisms: systems, checks, or processes that function independently of an individual’s vigilance. For instance, in response to a faulty code deployment due to mislabeling, a team might implement a rule: all deployments must be reviewed and verified by two developers. That’s a lightweight but effective mechanism. A good COE finds at least one mechanism that would have mitigated or entirely prevented the issue.
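As a sketch of what the two-reviewer rule can look like as a real mechanism rather than a team norm people try to remember, here is a small pre-deploy gate that refuses to proceed without two distinct approvals. The get_approvals() lookup and change ID are hypothetical stand-ins for whatever your code review tool or CI system actually exposes.

```python
# A sketch of the "two reviewers before deploy" rule as a mechanism rather
# than a best intention: the deploy script refuses to proceed unless it can
# see two distinct approvals. get_approvals() is a hypothetical stand-in.
import sys

REQUIRED_APPROVALS = 2

def get_approvals(change_id: str) -> set[str]:
    """Hypothetical: return the set of reviewers who approved this change."""
    return {"reviewer_a", "reviewer_b"}

def gate_deploy(change_id: str) -> None:
    approvers = get_approvals(change_id)
    if len(approvers) < REQUIRED_APPROVALS:
        # Failing the deploy is the mechanism; nobody has to remember the rule.
        sys.exit(f"Deploy blocked: {change_id} has {len(approvers)} approval(s), "
                 f"needs {REQUIRED_APPROVALS}.")
    print(f"Deploy allowed for {change_id}, approved by {sorted(approvers)}")

if __name__ == "__main__":
    gate_deploy("change-1234")
```

Most code hosting platforms can enforce something equivalent natively; the point is that the check runs every time without anyone having to remember it.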
Write down a target date for when you’ll finish each of the actions you called out. Make sure there’s shared agreement on the future actions, their benefits, and the timeline/investment. If they’re going to be done at some far-flung future date then it’s fair to argue they’re not viewed as important. If you’re iterating on operational excellence you should (a) focus corrective actions on things likely to have high leverage in making your system safer, and (b) actually plan to do them. I have a specific team in mind when I write this - they seemed genuinely surprised that after they presented a corrective plan to the CTO I came back a month later and asked for an update on progress. There was a previously unstated belief that their work to improve other business outcomes was more important and they’d get back to the COE learnings someday. That missed the concern that the outage had cost us a good part of a year’s “benefits of new work,” so it was worth investing in preventing a recurrence. It turned out I was really glad I was curious enough to ask how things were going.
If it’s important enough to write down as something to do, then by definition not doing it would be risky for the business. Get in the habit of asking whether the things on the list are really worth doing. If not, call out that assumption, encourage disagreement and then make a clear decision. Putting the kitchen sink of all nits onto a corrective action list is not helpful. Taking things off early is way better than leaving lots of things on and not doing them. Taking something off is an actual decision; leaving everything on is not. So if you pride yourself on being decisive - prune that list!
An Example Issue: Database Deletion at IMDb
I unfortunately have many, many examples of working the COE process (though that’s fortunate for the reader). So I chose one completely at random - largely because of its connection to getting the blameless part wrong, and secondarily as an example of mechanisms thinking. Over time I may share some more detailed examples. If you’d be interested in me trying to recall a bunch of different COE outcomes in a post - please drop a comment and I’ll try to include that in a future article. If you’d like an external eye on your COE process or to translate/coach in this area with your teams please reach out.
While I was working at IMDb, a developer deleted an entire production database for a specific business unit. There was no backup. Recovery was possible, but barely. There was a lot of debate about why this had been done - which wasn’t really as helpful as asking “why was this possible?” During the COE, much discussion centered on the lack of backups, the restoration plan, and ways to prevent this from happening again. But one critical question was missed initially: why was a single developer allowed to take an irreversible action without any safeguard? Thankfully one of the principal engineers quietly commented, “that could have been me too if I was unlucky,” which poked things in a better direction.
A good COE traces the issue beyond the immediate error. Why was there no backup? Why wasn’t deletion access gated behind a stronger approval mechanism, such as a “two-key” system akin to what we see in movies for nuclear launches? If that wasn’t available as a feature at the time, what would be the next best alternative? When your system allows a single human to create catastrophic failure, the root problem is not just that the human made a mistake. It’s that the system was too fragile.
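For illustration only, here is a minimal sketch of what a “two-key” guard around an irreversible action could look like: the destructive call refuses to run unless a second person has issued a confirmation token out-of-band. The token service, secret handling, and the drop itself are hypothetical stand-ins for whatever your infrastructure actually provides.

```python
# A sketch of a "two-key" guard for irreversible actions: dropping a
# production database requires a second person's confirmation token.
# Everything here is a hypothetical stand-in, not a real service.
import hmac, hashlib

SHARED_SECRET = b"rotate-me"  # held by the approval service, not the operator

def issue_second_key(action: str, approver: str) -> str:
    """Hypothetical approval service: a second human generates this token."""
    return hmac.new(SHARED_SECRET, f"{action}:{approver}".encode(), hashlib.sha256).hexdigest()

def drop_production_database(db_name: str, approver: str, second_key: str) -> None:
    expected = issue_second_key(f"drop:{db_name}", approver)
    if not hmac.compare_digest(expected, second_key):
        raise PermissionError(f"Refusing to drop {db_name}: no valid second key.")
    print(f"Dropping {db_name} (approved by {approver})")  # the real destructive call goes here

if __name__ == "__main__":
    key = issue_second_key("drop:imdb_prod", approver="second_engineer")
    drop_production_database("imdb_prod", approver="second_engineer", second_key=key)
```

The specific scheme matters much less than the property it enforces: no single person’s keyboard can, by itself, destroy the database.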
Wrapping this up: The role of culture
COEs only work in high-trust cultures. If people fear being blamed or punished, they will hide mistakes, spin stories, or downplay their roles. It’s a very real and common risk that can snowball things to crazy levels. I somewhat recently spent time adjacent to a traumatic story that gives the early events in The Phoenix Project a run for their money. I’m considering that as an encore, tour de force discussion of systems failing. But not today.
Circling back to my earlier article on the Galactic Empire: after the Death Star explodes, it’s easy to imagine Darth Vader’s postmortem as a series of strangulations. No learning occurs in that environment. The same flaw - single points of catastrophic failure - reappears in every subsequent design. Don’t be like the Galactic Empire.
* This mention of engineers going on vacation reminds me of what I refer to as the John S rule of vacation-related risk: the risk of an engineer deploying their change right before vacation is proportional to how awesome everyone thinks the engineer is. I.e., the better the engineer, the more likely their pre-vacation work will explode after they’re gone.
The rule is named for a truly stellar engineer I worked with named John S., who at least twice deployed a change right before going on vacation, only to have something go wrong in production related to the change right after he left. Now of course the team discussed the risk of John finishing his long-running work and shipping it just before leaving, but decided it was an acceptable risk. If this had been anyone else, people likely would have asked him to ship it two weeks later when he was back, or reviewed it so aggressively the mistake might have been caught.
In reality, and in fairness to John, this may be a sort of technical “big fish” story that becomes larger and larger in my mind in the retelling. The shorter version is “Don’t have people ship a bucketload of changes right before they leave town.” And just because they’re truly amazing don’t skip this rule. :-)
"(that could have been me to if I was unlucky)" => should this have been in quotes versus parens?
We are going through some foundational migration, and exactly handling with the emotion that folks need to decouple themselves from taking the problem vs. taking the blame.
It is a culture needs to be built.
P.S: "shit happens" is evidently coming from Forrest Gump. :) Proof here: https://www.youtube.com/watch?v=0xgKcMBqc2w