This is not the culture you’re looking for...
Engineering lessons (on failure) from Star Wars
“All of life's riddles are answered in the movies” - Steve Martin (as Davis) in Grand Canyon
Warning: Spoilers ahead if you aren’t familiar with the basic premise of Star Wars (or Clerks)
Prologue: On the surface this post may seem to be about geeking out over Star Wars. But it’s seriously about operational excellence. Which option appeals to you more is like one of those quizzes in Cosmo labeled “which sort of nerd are you?”
Ever since seeing The Force Awakens I’ve wanted to make a short film set during a Galactic Empire post-mortem on the Death Star’s destruction. Postmortems are a common part of engineering culture - where groups such as Engineering, Product and Design gather to understand why something failed (or broke down). Sadly my nonexistent filmmaking skills weren’t up to the challenge. Also, I’ve since learned it’s not a singularly original idea. But I took up a variant of that challenge here to hopefully kick off a longer chat about operating with resilience. If you’re the sort of person that hates Star Wars, long reads, or slightly snarky writing I suggest skipping ahead to the section “Practices that will help you avoid your own Death Star situation.” Value is in the eye of the beholder - I tried to include something for everyone.
Background from a universe far far away
If the Empire had focused on operational excellence and a trust-based culture instead of “The Force” they would have decisively crushed the “rebel scum.” At first this sounds as provocative as claiming “Han didn’t shoot first,” or however one explains midichlorians. But my statement is simply logical. In a just universe the rebels should have won of course, but the Empire could have made things a lot harder on their opposition. All they had to do was focus on engineering best practices instead of leaning on their evil spirituality and alpha male energy. This article will focus on ideas that can help your own rebel alliance or EvilCorp organization avoid their mistakes.
I saw Star Wars in a theater when it came out. As time has gone on, it’s dawned on me that not everyone was so affected by Star Wars. I realize now that not all adults have such a strong opinion on the reality that “Han shot first” and deeply fear their children watching the prequels only to ask “I don’t see the problem with Jar Jar?” Even so, I’ve grown up to be a mostly well-adjusted person who thinks about Star Wars only slightly more than I think about the Roman Empire*. But there was one fateful evening about ten years ago where my childhood fandom and professional life aligned along one of my favorite topics … how to prevent failure by anticipating and actively learning from it.
On a whim I attended a late, opening night screening of The Force Awakens in 2015. I enjoyed the fan service, tolerated that tall thin kid I disliked in the two episodes I watched of Girls, but ultimately was struck with a blinding moment of clarity. Yes - of course it was true that Kevin Smith had a point: blowing up Death Stars had an ethical murkiness to it that I’d missed as a child. But more importantly - the story had nothing to do with a battle between good and evil. Nope, the truth was that
Star Wars was one of the most brilliant (and successful) treatments of the perils of how NOT to run an engineering organization!
Or stated another way - if Darth Vader and The Empire had focused on solid engineering process and culture instead of that spiritual mumbo-jumbo about the Force, then the Rebellion would have been toast**.
I’m assuming most readers are thinking “Yeah - sure, that’s super obvious - you took a long ass time getting to that point Captain Obvious. No wait … what?!?”
Let’s ask some basic questions to start:
How was the Rebellion able to blow up the Death Star not once, but three times - even with underwhelming technical and staffing resources?
How did the Empire manage to build the same single point of catastrophic failure into three systems?
If this single point of failure was technically unavoidable, why didn’t The Empire’s system architects defend it more fully against attack?
Maybe it wasn’t that bad? Let’s look at the Rebels vs. the Dark Side’s planet-killing-machines scoreboard:
Death Star 1: Destroyed by a missile fired into a small but unshielded exhaust port, causing a chain reaction that destroyed the Death Star.
Death Star 2: Unfinished (unshielded) reactor struck by the rebels, causing a chain reaction that destroyed the Death Star. Arguably this was not a single point of failure as it required an off-site shield to be disabled - but that mission was accomplished by animatronic teddy bears. So it sort of still counts.
Death Star 3: Single point of failure moved offsite to a thinly guarded facility that was infiltrated by a small number of people, who then set off pretty much a combination of the next steps from Death Star 1 and Death Star 2.
Why the actual heck did the Empire build the same single point of failure into three consecutive Death Stars and then defend them relatively lightly or not at all? To me the most likely answer is poor engineering culture. From what we know from the films and basic pattern matching - here’s my hot take on the (true) problems facing the Empire:
Limited acceptance of views that diverged from leadership
A blame based culture that stunted learning
Not questioning closely held assumptions - specifically a belief in mystical principles (The Force) over physics based reality
Single points of failure present, repeated and not hardened
Lack of effective postmortems
A stronger culture of challenging assumptions would have greatly increased the chance of finding the problem early. If the single point of failure was inevitable, then they would have focused on mitigation - for example by defending it in depth. This would have made it exceedingly unlikely they’d continually lose their investment in world-destroying tech so many times in a row. Though admittedly that would have made for a less satisfying film.
Learning from Failure as a discipline
Whether you’re running the construction division of a galactic empire or something more prosaic, a critical part of engineering is building things. An equally important but less seen part of engineering is making sure that when you build things, they run continuously. I used to tell people that any kid in high school can write some pretty impressive software, but it’s a different discipline to make that system run 24/7 without a problem. It’s an old school argument that many folks are re-making these days with the rise of vibe coding.
There’s little evidence that the Empire fostered a culture of learning from failure. There’s at least one memorable scene suggesting the Empire was not building a supportive, trust (and truth) based workplace. Over time I’ve developed mental models that support building solid systems and keeping things running. Here I’ll be using an expansive definition of operational excellence: both the traditional sense (monitoring, alarming, postmortems) and creating a culture that converges on the best ideas regardless of their source. At home, in the workplace or in government, one of the most dreadful things you can create is an environment where it’s considered bad form to contradict “a leader.” At the limits, and even in the short term, nothing good comes of it.
Practices that will help you avoid your own Death Star situation (or at least 3 of them in a row)
What are some key practices to build your culture around? I’ll start with bullet points for those who are (understandably) exhausted after wading through the Star Wars material.
Make disagreement easy and rewarded: The most toxic culture is one where disagreeing with the boss (or anyone else) is counter to your professional success. Ensure this isn’t the case for your team. Encourage constructive but active disagreement, and remind yourself you’d rather feel dumb now than very very dumb later.
Aggressively find and question assumptions: When that disagreement pops up, carefully mine it for true understanding and strongly discourage compromise (as opposed to finding a truly great solution). To paraphrase a quote from the brilliant Eliyahu Goldratt, “when two people measure a building and one says it’s 100 feet and another says 110 feet, no one decides that the building is 105 feet.” Somewhere there’s an error or different underlying assumptions. Make those visible and lots of good things will happen, often very quickly.
Use the four principles that follow to center operational excellence when working with production systems. Always keep getting better at these.
Don’t be broken: Design systems that minimize failure risk and where appropriate degrade gracefully. But check that the “gracefully” truly works - seriously! - ask me how I know…
Don’t be broken without knowing: Ensure you have monitoring and alarming that provably guarantees you’ll be aware of a serious problem. You’ll have to balance false alarms against the risk of not knowing - but the bigger the “failure case” the more you’ll likely want to tolerate a false positive. I strongly recommend business process monitoring vs. endless checking of minor factors.
Don’t be broken long: Think about correcting failure and outages in the fastest safe way. Typically the time to get this right is before you launch something - for example by having a “how would we know it’s broken?” and a “rollback to known safe state” plan. Whenever possible you don’t want to be stuck with screaming customers and bosses and be making “just one more update that I’m sure will work.” Often teams will complain about the presence of a pre-launch change management process step, but answering a few basic questions before a change makes fixing a mistake much faster.
Don’t be broken the same way twice: After things are working again comes the time to learn deeply. After things are working again comes the time to learn deeply. After things are working again comes the time to learn deeply. Not during the outage, never during. Afterward comes the postmortem - which I will touch on briefly below. But I’m going to follow up in roughly a week with a discussion of the Amazon postmortem process (aka the COE). It’s incredibly powerful, and as with most kinds of power it comes with great responsibility. I hope this future article will be especially useful to folks new to Amazon or working at companies with ex-Amazon types. I feel it is one of the most powerful but also misunderstood “one weird trick to improve culture over time” ideas and deserves its own post.
TBH - you can probably stop reading now. You’ve gotten your fill of Star Wars comments and the core framework that supports a culture of high velocity and technical fearlessness. Other than the importance of talking openly and honestly about tech debt, which I won’t get into today - but I’ll get to it soon.
In case you’re a glutton for punishment (or just want more context), I’ll continue to go deeper on each of the principles…
Welcome disagreement - especially upward
I’ve become fond of telling my daughter “there are two kinds of people in the world - those that want to feel right, and those that want to be right” (with apologies to whoever I stole that from). The inherent defensiveness created by the former reduces the chance to be the latter. I don’t think we can really remove the desire to feel right, but it’s possible to reduce it. For me part of that is reminding myself (and others) that I get value from being told I’m wrong. Specifically - “I’d rather feel somewhat dumb that I’m wrong now, than very dumb later when everything goes bad.” Doing this for yourself is hard; doing this for an organization is way harder. Doing this for an organization whose leaders are always right, and where there’s some dark power in the galaxy that supercharges their correctness and power, is hard on steroids. Even if those who got something wrong didn’t end up choked to death by an invisible hand wielded by a bad guy whose outfit is a bit too on the nose. Compared to that, solving for your culture is almost certainly easier.
That leaders often don’t want to hear bad news is pretty common - all the way back to killing the messenger and so on. I think it’s a cliche though because it’s true. I’ve certainly seen it in companies that pride themselves on being direct and “data based.” But even more often in companies that put “nice and kind” in their leadership principles. Bottom line - If what people want to be true is rewarded over truthful discussion of reality, then very very bad things will eventually happen.
Keeping this issue at bay requires demonstrating no-spin truth-telling and openness to disagreeing views. It’s often hard: when someone asks a question along the lines of “we’ve seen ____ and ____, why do we think we’re still on track?” it needs to be truly, truly welcomed. Even if it’s expressed in a ruder way than you’d like. It always feels rougher/ruder than it is when someone asks you “Are you sure you know where you’re going?” It takes great amounts of courage to ask a tough question of leadership, and everyone is watching how you react. You want your senior leaders to behave the same way? - then reward the asking of the question. You want people to speak up before the company runs into a wall vs. griping in a bar after work? - then reward asking the question. You want people to ask questions without being jerks? - well then read the next paragraph.
One more thing about this disagreeing idea…
Noticing and noting out-loud that the emperor is naked is super important. It’s always better than ignoring the situation. But it doesn’t mean people need to be a jerk about things. Taken too far people are just being mean, trying to score points for being smart, and so on. That results in the other very very important human part of the problem - getting defensive and not really considering the input. Also lots of bad feelings which can become toxic. So encouraging people to disagree has to be carefully balanced with not being super disagreeable. Again - probably grist for another article.
Aggressively find and question assumptions
To quote Benny Hill - “When you assume you make an ASS out of U and ME.” While I’m not going to vouch for any of his other life advice, this is a really valuable point. When people disagree, or when different people reach different conclusions, or when something seems unsolvable - the prime problem is often unexamined assumptions. On my to-do list is an article on how making assumptions visible, and hacking them, makes things so much better. It also helps you understand why “that other team is filled with morons that don’t understand what we’re trying to do.” Spoiler: they’re probably not - you just have different underlying assumptions. Asking them what their concerns are and what would have to be true for your “crazy idea to be OK” are fundamental tools.
Cultivate a need for people to deeply understand the system they’re trying to improve (aka the problem they’re trying to solve) at the cause-effect relationship level. This includes laying out (labeling) explicit and implicit assumptions being made - and checking whether they are still true. Often assumptions made at one point in time are translated forward without really thinking when circumstances change. Similarly things are often assumed because we’ve implicitly decided some problem “cannot” be solved and therefore it hides the underlying conflict that needs resolution. Surface all that baggage. Good things will follow.
Finding your assumptions about why the system will work well, how users will behave, or even how things will fall apart is the compass for finding gaps in your design and proactive solutions to them.
The cliffnotes*** guide to the rest of Operational Excellence
Avoid being broken
Engineering teams must always be thinking about failure modes and how the system detects and reacts to them. The expectations for reliability differ based on context—sending a rover to Mars requires an entirely different failure mitigation strategy than running an e-commerce website. By understanding the environment and the potential failure points, you can design for resilience.
Avoiding breakage also requires proactively looking for weaknesses rather than waiting for issues to arise. Actively question assumptions and engage in testing as early as possible. “Pre-mortems” can be a valuable tool to formalize such thinking. It’s a process step where the cross-discipline team imagines the future several months past the launch - asking themselves “I just realized everything has gone super wrong. What happened?” The answers can differ widely - ranging from “the website crashed” to “the website is rock solid but no one used it because no one knows it’s there.” The key is to verbalize predictable failure modes and ask yourselves “what would have been true if that bad thing happened, and how can we prevent that (or detect it early)?”
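To make “degrade gracefully” a bit more concrete, here’s a minimal sketch (in Python) of the kind of fallback path I mean - a hypothetical product page that prefers a personalized recommendation service but still renders if that service is down. The recs_client and its personalized() call are invented stand-ins, not any real API.

```python
import logging

# Safe, boring defaults used when the fancy path fails.
DEFAULT_RECOMMENDATIONS = ["bestseller-1", "bestseller-2", "bestseller-3"]

def get_recommendations(user_id, recs_client, timeout_seconds=0.2):
    """Prefer personalized recommendations; degrade gracefully to defaults."""
    try:
        return recs_client.personalized(user_id, timeout=timeout_seconds)
    except Exception:
        # Degrade gracefully: the page still renders, just less personalized.
        # Log loudly so the fallback path is visible (and test it - seriously),
        # rather than failing silently forever.
        logging.warning("recommendation service unavailable; serving defaults")
        return DEFAULT_RECOMMENDATIONS
```

The design point isn’t the try/except itself - it’s deciding ahead of time what “still works, just worse” looks like, and actually exercising that path so you know the “gracefully” part is real.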
Know if you’re broken
The time between a failure occurring and detecting that failure can be the difference between a minor inconvenience and a full-blown crisis. Detection mechanisms include automated monitoring, alerting systems, and anomaly detection algorithms. A good detection system should not only notify the team when something goes wrong but also provide enough diagnostic information to aid rapid debugging.
There are a lot of topics one could dive into here. The top two that I find tend to simplify a lot of monitoring discussions are:
It’s often best to focus monitoring on a relatively small number of business critical things that allow inferring if things are working roughly as expected. For example it’s never good if orders drop to zero. This approach is often a lot simpler and less prone to false positives than hundreds of monitors on minor aspects of the system that may be hard to set an alarm on.
Often there’s a lot of debate focused on the complexity of monitoring for small changes over time - for example a 5% weekly drift in some measure. I’ve found it valuable to not sweat that for a while, and instead first button down “would we know it if all our prices on the website were 80% cheaper than yesterday?” (or $0). In my experience this tends to be the more common failure mode, and it’s pretty easy to detect relative to subtle shifts over time. (A small sketch of this kind of coarse check follows below.)
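To ground this, here’s a minimal sketch of the kind of coarse, business-level check I have in mind. The fetch_* and page_oncall functions are hypothetical stand-ins for whatever your metrics store and paging tool actually provide, and the thresholds are placeholders.

```python
# Hypothetical business-process monitor: a couple of coarse checks on
# business-critical signals, rather than hundreds of alarms on minor metrics.

def check_business_health(fetch_orders_last_hour, fetch_median_price,
                          page_oncall, expected_median_price, min_orders=1):
    """Page a human if the business-level signals look badly wrong."""
    orders = fetch_orders_last_hour()
    if orders < min_orders:
        # It's never good if orders drop to zero.
        page_oncall(f"Only {orders} orders in the last hour - is checkout broken?")

    median_price = fetch_median_price()
    # "Would we know if all our prices were 80% cheaper than yesterday?"
    if median_price < 0.5 * expected_median_price:
        page_oncall(f"Median price {median_price:.2f} is far below the expected "
                    f"{expected_median_price:.2f} - possible pricing bug.")
```

Two checks like these won’t catch a subtle 5% drift, but they will catch the big, embarrassing failures - and they’re cheap to keep honest.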
Don’t be broken long
When things break and you’ve been alerted you’ll want to minimize the damage to customers and the business. This seems blindingly obvious - but it’s relatively easy in the fog of an outage to make decisions that run counter to this goal.
Key ideas:
Prepare before you have a failure. This includes preparation of runbooks (i.e., how to debug), clear ownership of responsibilities in case of an outage (including how to handle long-running ones when someone may need to rest), communication strategies (you don’t want to be re-explaining everything constantly to someone who joined a call), etc.
An often overlooked best practice is to write out ahead of each deployment how we’d detect a problem and what’s the fastest path back to the prior state (i.e., how to roll back). Deployments are a common driver of failure (i.e., you’re changing something) so a little thinking ahead goes a long way. (A minimal sketch of this idea appears after this list.)
In most cases it’s better to rollback to a prior system state than to heroically attempt to fix the new problem while customers are being impacted.
Once things are settled down, discuss what went wrong to build long term resilience.
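As an illustration of writing the rollback thinking down before you change anything, here’s a sketch of a tiny deploy wrapper that refuses to run without a rollback plan and rolls back automatically if a post-deploy health check fails. The deploy, rollback, and healthy functions are hypothetical placeholders for your own tooling, not any particular deployment system’s API.

```python
# Hypothetical sketch: insist on a rollback plan up front, and prefer a fast
# return to the prior known-good state over live debugging under pressure.

def safe_deploy(new_version, prior_version, plan, deploy, rollback, healthy):
    required = {"how_we_detect_breakage", "rollback_steps", "owner"}
    missing = required - plan.keys()
    if missing:
        raise ValueError(f"Refusing to deploy: plan is missing {sorted(missing)}")

    deploy(new_version)
    if not healthy():
        # Roll back to the known-safe state first; debug the new version later.
        rollback(prior_version)
        raise RuntimeError(f"Health check failed; rolled back to {prior_version}")
```

The few minutes spent answering “how would we know it’s broken?” and “how do we get back?” before the change is what makes the rollback branch boring instead of heroic.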
Don’t be broken the same way twice
The intent of this principle is to learn effectively from failure. This can be tricky as there are gotchas that keep you from holding an effective, blameless postmortem. Some of these common mistakes include:
The team believes the postmortem is intended as punishment.
The postmortem being intended as punishment.
Focusing too narrowly on what happened vs. the larger pattern.
Being overly focused on finding (and correcting) specific cause and effect relationships which may not be identifiable in truly complex**** systems.
Being too focused on the belief that complex systems make looking for the root cause impossible. Missing out on basic changes that would help a lot in the future.
Listing every possible thing you could improve and then doing them all regardless of how much each reduces real risk.
Doing a pro forma dance to fill in a document and hold a meeting that either doesn’t have much value or has value but takes up 5x as much time as it should.
Making changes for the future that rely on best intentions vs. mechanisms. A mechanism is something that doesn’t require people to pay attention at the right moment. For example - hoping people will check their brake pads on a regular basis is best intentions; the wear indicator that makes a loud screeching noise when a pad gets too thin is a mechanism. (A small software example of a mechanism follows below.)
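To translate the brake-pad idea into software terms, here’s a hypothetical sketch of a mechanism: instead of hoping someone remembers to renew a TLS certificate, a scheduled job checks the expiry date and pages well ahead of time. The page_oncall function is an invented stand-in for your paging tool.

```python
from datetime import datetime, timezone

def certificate_mechanism(cert_expiry: datetime, page_oncall, warn_days=30):
    """Page the on-call if a certificate is within warn_days of expiring.

    cert_expiry is assumed to be timezone-aware (UTC).
    """
    days_left = (cert_expiry - datetime.now(timezone.utc)).days
    if days_left <= warn_days:
        # The software equivalent of the screeching brake pad: loud, automatic,
        # and not dependent on anyone remembering to look at the right moment.
        page_oncall(f"Certificate expires in {days_left} days - renew now.")
```

Run on a schedule, this kind of check turns “someone should remember to…” into something that happens whether or not anyone remembers.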
There’s a lot to this, so I’m going to do an entire article on the postmortem subject. Having spent 13 years at Amazon and later worked at three other companies, I’ve noticed that Amazon’s postmortem (aka COE) process has spread into the wild - often befuddling those who didn’t learn to swim in the original culture. While imperfect, the Amazon COE overall is extremely valuable and an underestimated driver of their success. One of the things I most miss from my Amazon days is being subscribed to the coe-watchers@ email alias where postmortem summaries of major issues were shared across the entire company to be read by anyone with an interest in learning from failure.
Bringing it all back together
A high-trust, blameless culture fosters true operational excellence. When everyone can openly discuss failures without fear, they develop stronger, more resilient systems. Maybe if The Empire had read this article there would be a chance they could have avoided the embarrassment of so many exploding space stations.
Please subscribe and share this post. After I started writing this article I went down a rabbit hole to produce a short film script covering how I imagine the Empire’s Death Star post-mortem going. I’ve been told by a friend that the content is both funny and realistic to the point of being very triggering for them. And it includes the phrase “physics isn’t a terrorist conspiracy” which is by itself pretty appealing. I’ve even delved into the world of AI script-to-animation tools in an attempt to produce a video to share.
After that I go a step deeper and re-imagine an alternative universe postmortem where Empire leadership has adopted the ideas from this article. They’re still evil, but a lot more effective.
Both of these will be available to subscribers in the next week. Followed by my Amazon COE tips and tricks type article.
Thanks for reading, and stay tuned… If you like the content please consider sharing it with co-workers and friends.
---
Postscript
In support of my position that the Empire wasn’t big on psychological safety
Footnotes
* For the record, I almost never think about the Roman empire. Except when wondering why they made Gladiator 2.
** You could argue that’s explained in Rogue One when they try to paper over this plot gap by explaining the failure was actually intentional (and brilliant) design sabotage. Perhaps that explains the first one - but hardly the second or third. Also - stop messing up my flow with your details! ;-)
*** For the kids out there - before the Internet, and before ChatGPT, there were “Cliff Notes.” Apparently they’re still out there (I just looked), though I’m a little surprised. Marketed as “study guides,” they were mainly purchased for the quick TLDR-like summaries of books you were supposed to slog through in school but maybe didn’t.
**** When I say complex systems here I’m referring to situations where the whole is more than the sum of its parts. Just by understanding how the pieces function you’re unlikely to be able to accurately predict the system’s real-world behavior, which often can be surprisingly dependent on small changes in ongoing or initial conditions.
I've also consulted and worked with a few different orgs over the past 5 years, and have been pleasantly surprised to see that COE culture has spread. There are orgs that don't know what "P90" means when we talk about latency, and think "multi-AZ" means they need to run a second datacenter in Arizona. But they recognize the value of a good RCA (root cause analysis) process, and have document templates with a very familiar structure (the "five whys," "lessons learned," etc.).
(RCA, root cause analysis, seems to be the consensus acronym -- I think most folks outside of the Amazon universe interpret COE as "center of excellence".)