Meditations on Tech Debt
Frame the problem's impact on Throughput and/or Business Impact to get buy in on change
Every system is perfectly designed to get the results it gets - W. Edwards Deming
All happy families are alike; each unhappy family is unhappy in its own way - Tolstoy

Preface:
For engineers who already grok tech debt, KTLO (keeping the lights on), etc., maybe this article will help translate things so everyone in the org can reason with you about technical debt tradeoffs the way you truly want them to. For folks who don’t identify as engineers but work with them - I think this post might be even more valuable to you.
I set out to bring the topic of tech debt away from platonic ideals and toward a structured approach to discussing it in terms of outcomes: throughput and business impact. Nailing this means that maybe, someday, you can tell a story about how the product teams at your company are the biggest evangelists for the replatforming changes engineering has been wanting for years. It can happen, I’ve lived it more than once. :-)
What do you get with this article - besides way too many words?
A tech debt joke that seemed funny to me at the time.
What is “tech debt” and should anyone care?
Why you should (usually) not apologize for tech debt.
A two part approach to labeling and reasoning about tech debt in terms of impact
Part 1: Tactical suggestions for labeling tech debt problems to get people excited about true improvements. Even if you think no one cares about that stuff. Actually, especially if you think no one cares.
Part 2: Flipping the script: Tech debt matters when it’s a drag on velocity or profits. Reset goals into the throughput or financial loss world and watch people line up to help you get time to fix things.
Exploring the “just assign 35% of engineering time to tech investments” anti-pattern
Feel free to jump to the section that interests you the most. I tried to have the sections build on each other, but I think you can get something out of them independently. Just remember to subscribe to have this and future long form stuff sitting in your inbox for when you’re bored. :-)
A brief humor break
With apologies to Kevin Smith and the SATs - I’d like to start off with a multiple choice question. The answer seems pretty obvious, but given discussions I’ve had as an engineering leader joining a number of new teams, I’m not sure we’re all on the same page here. As illustrated above, imagine a four-way intersection. In the center is a crisp, new hundred dollar bill. At the ends of the streets are four entities:
(a) A developer from a super successful startup that truly has NO technical debt.
(b) Santa Claus
(c) The Easter Bunny
(d) A developer from a company where business is booming but tech debt is a pain.
Assuming everyone starts at the same time and races to grab the prize, who is going to get the $100 bill? I’m guessing most of you chose (d). And I certainly hope so, because the other three are FIGMENTS OF YOUR IMAGINATION!
Building things always involves tradeoffs. In the case of software we generally believe (with some validity) that we can come back and fix things later. Therefore, if there’s a faster path towards achieving value a customer will pay for, then there’s a logical argument toward taking it, even if you create some “technical debt” to do so. That debt might (a) make it take more time than you’d like to maintain the new systems on a daily basis, (b) make future building harder than ideal for any one dev (reducing technical fearlessness), or (c) make it harder for more devs to code together (this tends to become more noticeable as teams grow quickly from a few to many).
My hypothesis is that one doesn’t find too many startups (or post-startup companies) with the ivory tower standard of a “clean” codebase because the companies that focused primarily on that dimension of quality ceased to exist. Therefore it’s much more common to talk with engineers who at any given moment in time can point to numerous “undesirable” aspects of their codebase.
I’ve certainly had people argue this point with me - but really just one person. I feel there’s some saying about there being an exception that proves the rule - so even if they’re actually right, I stand by my thesis.
Sure, I know folks say “our PMs just want to get things out and they don’t care about quality!” or “The CEO doesn’t care how often we get paged if revenue goes up!” or “they always say they’ll let us go back and fix things but they never do!” All those things may or may not be true, but if you focus on things you control and talk about them in terms of business context then there often (maybe always) is a good path out.
What is this technical debt you speak of?
There are lots of definitions of “tech debt”, often with analogies to other real world situations - such as deferring maintenance on your home. These are fine - but we can be more specific. A functional definition is that technical debt is whatever causes time to be spent that’s not helping build new value for customers, or making it easier/faster to build more value for customers in the future. You can probably also include the work you didn’t do that makes things brittle and causes systems to lose you money - for example by being unable to take orders or by mispricing your bids in auctions.
Tech debt is the tax your team pays on current and future throughput (often called velocity), plus the increased likelihood of defects that cause negative customer impact or financial loss.
Tech debt happens, no need to be apologetic.
I suspect that most engineering teams think everyone else has a “better” codebase than they do. Much like when you look at other people you might think they have less messed up lives than yours. But once you get close to enough people you realize everyone has issues. Pretty much the same difference with Tech Debt.
After working on numerous teams across Amazon, a later stage startup, and a small public company, I got used to people being apologetic about the state of their technical debt. Come to think of it, when I was part of the primary build team I later also apologized to others about it. Sometimes the “problems” were really only locally perceived. But sometimes top management would agree, complaining that easy sounding things were hard, and sometimes hard things felt impossible. There had been complaints about velocity, and it was common for folks to bemoan the current state as “wow, it never used to be so slow to build…”
I think they expected new folks, especially new managers, to really question the decisions they had made. At the same time there was an unstated (although often stated) hope that as a new leader I’d find a magical fix. Folks seemed vaguely surprised when I was sympathetic to the lack of velocity, but lacked the “OMG, why is it so bad!?” focus they expected. In case it’s not obvious - dunking on your new team’s tech debt is a great way to do the opposite of building trust. If you’re ex-Amazon it’s also a great way to play into the “those Amazon assholes” stereotype. Doesn’t help in any way.
The best way to approach such a “wow, our system is a mess” disclosure is to remind oneself and others that the decisions that led to today’s problems very likely served an important purpose when they were made. Then ask what was important when the tech choices that are now chafing were made.
A lot of issues are things that maybe suck now but probably made sense at the time - both tech debt and bottlenecks. They evolve from choices that were made under earlier constraints (say the need to ship to land a paying customer), but as time has gone on those assumptions might not be valid anymore. For example, now that there are actual customers and the current system is keeping the business from onboarding 10x the current volume, it may be worth going slower now to be able to go faster later.
That’s the “logical reasoning” approach to why I’m not surprised. There are lots of empirical inputs too. These include:
One of my Amazon colleagues had remarked that much of the platform “runs but by the grace of G’d.” Because Amazon is so incredibly reliable as an overall platform from outside it’s easy to miss the systems that are held together by heroic efforts, late night pages, and systems pushed beyond their original intent by 10+ years. I haven’t been at Amazon for a number of years now, so I suppose it might all be working perfectly now with AI fixing everything or something. But I’d be willing to bet a coffee or a beer that’s not universally the case. Do reach out and let me know - I’d love to get up to date on this.
I’d worked at IMDb for several years which was a monster business, but had some of the most entertaining tech debt I’d seen. Including not having a database mastering the core data at the INTERNET MOVIE DATABASE (spoiler - lots of flat files), and a search system where one of the excitedly discussed innovations was to go from rebooting the SOLR instance weekly to daily. Which I want to be clear was not really a bad idea under the circumstances.
This isn’t a small sample bias - I’d spoken to engineers from literally hundreds of companies during interviews. It’s rare to not hear an epic tech debt story if one probes even a little.
In summary:
Everyone accumulates some things that aren’t perfect.
As people explain what they’re not happy with in terms of their technical systems it’s your job to understand “why?” and “how?” without assuming ill intent or incompetence. Especially while joining an organization.
There are often good reasons for accumulating tech debt and it’s valuable to understand what those assumptions were. That will help you adjust those prior assumptions that led to it - if needed. Listening with a truly open mind may leave you agreeing with the existence of the tech debt.
Step 1: Label problems in other folks’ world
The nature and impact of the tech debt must be made visible. But how you make it visible often matters more than you might think.
Ideally, the problem would be expressed in units of measure that mean something to those outside the team dealing with it. Some examples of visibility statements are below. But really what you want is to (a) determine how to best talk about the impact of the debt and (b) get that impact updated somewhere continuously like a weekly business review where there are continual ways for people to ask about it.
Functional clues that problematic tech debt exists:
Things break a lot. Or keeping things from breaking is taking up more and more time.
A lack of technical fearlessness. This is a snazzy way to say that engineers are somewhat hesitant or downright afraid to make changes.
Engineers cannot change something without lots of cross team discussion. Sometimes the debt is in the broader architecture even though it feels like an org structure problem.
As team size has grown, working on the same codebase seems to be harder. Sometimes “it wasn’t this hard before” is because architecting a codebase that lots of people can independently contribute to is often a different structural problem than building something 3-5 devs can tear away at.
“How problematic” is something to gauge along with the counterfactual - “if these things weren’t true, what would you be able to do more of?” How valuable to the business (now and in the future) is this alternative opportunity?
If you believe that there should be change, you need to align everyone on the same goals around what needs to change. Or things generally will not change. If a few brilliant engineers could lead their own rebel alliance and sort things out they likely would have by now. (Crap sorry - I don’t know where these Star Wars references come from sometimes). On how to drive effective goals around anything, including Tech Debt, I’d recommend my prior article on goal setting.
My big recommendation here is that the goal for reducing tech debt should be framed in terms of business impact, not happier devs. I’m 1000% in favor of happier devs - that’s an important input to your business whether you accept it or not. Burnout is real, turnover is super expensive - and that’s not even factoring in how much it hurts future growth if word on the street is that your shop is a crappy place to build. That said … a technical debt restructuring (or migration, or whatever) really needs a LOT of cross functional buy-in to be successful. So the closer your goal gets to something others care a LOT about, the better.
Some high level examples of what people in different functions may care a lot about - and thus what are interesting ways to express goals:
CEO / CTO / CFO
Team throughput. I pay $XX / year - what value is being created on that investment?
Effective number of engineers building (example: reduce keeping the lights on activity from 10 engineers / month down to 2 by ____). Similar to above - but often easier to reason about.
Reduction in breakages that cost the business money (pricing errors, unable to take orders, etc). Don’t underestimate these - they can be very big sometimes before they get noticed.
Operational Partner teams (say accounts payable)
Reduction in errors that cause them to have to do manual work (example: Reduce PO’s with an error requiring a manual interaction of any kind from 20% to < 1%)
Customer facing defects (which tend to drive up their variable costs)
PM partners
Ability to build faster (example: go from deploying one new major algorithm every 3 months to one every < 2 weeks)
Increased available engineering capacity for building.
Reduced defects that take up their time answering internal/external questions
Just some starting suggestions. It’s actually relatively easy to figure this out.
Go to each possible partner for your dev org.
Ask them what sucks for them
Ask them to roughly swag and quantify it.
Explain how your tech debt work addresses these problems (assuming it’s true).
The more curious listening you do the more likely you’ll be able to figure out if your tech debt creates pain in “their world.” The higher level in the org’s priorities your tech debt’s downside actually impacts the more likely you’ll have a clear path for everyone wanting to reduce it.
Of course if you don’t find much value to others in the org with this process then you should step back and ask “is this tech debt really important to resolve, or is it just interesting (i.e., a mild pain in the ass)?”
What’s surprising about talking to so many teams this way is that (a) you’ll get an incredibly sharp understanding about what each group/person sees/assumes as the most important thing to work on for the company, and (b) sometimes you’ll find that what you think is a minor tech debt issue is really a giant pain point constraining corporate growth. Or you might just discover something high impact you can address even if it has nothing to do with tech debt - which is also pretty nice.
Step 2: Flip the script and focus on throughput improvements instead
My core thesis is that we care about tech debt to the extent that it either (a) reduces team throughput (the amount of valuable stuff they can get done in a period of time), and/or (b) causes issues that cost the company money now or in the future (lost customer orders, mis-priced offers, people canceling their subscription, etc).
To throw out some not especially precise definitions:
Throughput = (speed you can do stuff now) * (improvements you make in the future that make work easier) * eng capacity.
Eng Capacity = Number of Devs you have on staff - KTLO (Time spent keeping things running) - Time spent fixing broken stuff - Time spent coordinating with other teams
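To make those two definitions a bit more tangible, here is a minimal Python sketch. The field names, the dev-equivalent units, and every number are my own illustrative assumptions, not measurements from a real team.

```python
from dataclasses import dataclass


@dataclass
class TeamTime:
    """Dev-equivalents per period. All field names are illustrative."""
    devs_on_staff: float
    ktlo: float            # time spent keeping things running
    fixing_broken: float   # time spent fixing broken stuff
    coordinating: float    # time spent coordinating with other teams


def eng_capacity(t: TeamTime) -> float:
    # Eng Capacity = devs on staff - KTLO - fixing broken stuff - coordination
    return t.devs_on_staff - t.ktlo - t.fixing_broken - t.coordinating


def throughput(speed_now: float, future_multiplier: float, capacity: float) -> float:
    # Throughput = (speed now) * (future improvements) * eng capacity
    return speed_now * future_multiplier * capacity


# Made-up example: 10 devs on staff, 6 dev-equivalents lost to drag.
team = TeamTime(devs_on_staff=10, ktlo=3, fixing_broken=1.5, coordinating=1.5)
capacity = eng_capacity(team)
print(capacity)                        # 4.0
print(throughput(1.0, 1.0, capacity))  # 4.0
```

The point of the sketch is only that capacity is what remains after the drags are subtracted, so shrinking any one drag term raises throughput without hiring anyone.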
Starting from the very top down - I have six factors that, if present, increase the chances for high throughput engineering in the long term. These may not 100% have to do with tech debt - but I think they make everything better so I’m putting them here for now. Let me know if they sound interesting to stand on their own - I’m leaning towards separating them out into a short post.
Global inputs that drive impact: Clarity and shared understanding about what’s important globally (bottleneck, success). This is the key step of creating shared context.
A simple and clear prioritization approach: Transparency on how work is prioritized - what has to be true in the future. This isn’t a giant spreadsheet, it is a simple control rule. For example; We will prioritize things that keep our pricing model from being broken, then improving performance in the 3 of 15 pricing segments we’ve agreed are most impactful to our current business, and if there is any time left things that let us go faster (or reduce time spent) on the first two.
Shared Trust that people can voice when there are problems and have the outcome be constructive
Shared Visibility (and ownership) as to where work is - focusing on the most important things. The org should continually work to make all work visible over time. A counter example is I worked somewhere not that long ago where “updating dependencies” took something like 20-30% of every team’s time. Lots of people complained. But there was no shared visibility across all teams - only when we started doing some simple reporting was there immediate vision that this was a top problem for throughput.
Invest in inputs that drive technical fearlessness, letting people build faster in the future. These could be architecture, process, or a combination. If you don’t have current best practices like continuous deployment, loosely coupled systems, etc. then it’s almost always a good idea to move in that direction asap. Basically it’s usually important to do things that will clearly let you build faster tomorrow.
Local ownership/autonomy in how work is done. Once you have super strong alignment and have robustly shared context about what’s globally important, then you can really reap the benefits of higher velocity as teams can safely make their own decisions in keeping with the global/greater good.
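The “simple control rule” from the prioritization factor above can be sketched in a few lines of Python. This is just one way to encode the pricing example. The WorkItem fields and the backlog titles are hypothetical, and a real team would agree on its own tiers.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    # All fields are hypothetical labels someone would set during planning.
    title: str
    protects_pricing_model: bool = False
    improves_top_segment: bool = False
    speeds_up_priority_work: bool = False


def tier(item: WorkItem) -> int:
    """Lower tier = higher priority, following the example rule in the text."""
    if item.protects_pricing_model:
        return 0  # keep the pricing model from being broken
    if item.improves_top_segment:
        return 1  # improve performance in the agreed top segments
    if item.speeds_up_priority_work:
        return 2  # go faster (or spend less time) on the first two
    return 3      # everything else waits


def prioritize(backlog: list[WorkItem]) -> list[WorkItem]:
    # sorted() is stable, so items in the same tier keep their original order.
    return sorted(backlog, key=tier)


backlog = [
    WorkItem("Redesign settings page"),
    WorkItem("Faster test suite for pricing service", speeds_up_priority_work=True),
    WorkItem("Tune model for segment A", improves_top_segment=True),
    WorkItem("Add guardrail for negative prices", protects_pricing_model=True),
]

for item in prioritize(backlog):
    print(tier(item), item.title)
```

The rule fits in one small function instead of a giant spreadsheet, which is the property the text is asking for: anyone in the org can read it and predict how a new piece of work will be ranked.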
What tends to slow things down is either lack of clarity/vision (impacting motivation or focus), or unaddressed constraints as to where time is going. I’m of the view that for most teams you can build a histogram of where your time goes using the following buckets:
Broken stuff: Things you were interrupted by because they broke - or people thought they might be broken (think questions about the prices from your pricing system).
KTLO (Keeping the lights on activity) - Time you had to take just to keep things running (package/dependency updates, compliance updates, answering questions about your poorly documented API etc) - also could include those pesky pricing questions depending on how you bucket things.
Dealing with coupling: Time spent on dependency discussions (how you organize people, systems). An example: if you cannot change your database because you’ll break other people’s stuff downstream, then you spend an inordinate amount of time discussing boring but critical changes to your tightly coupled schema. Don’t laugh - this still happens all the time.
Time spent on building new things that are slow because of your current systems/architecture. This is the roughest one to estimate - it’s more of a swag. If, say, roughly 6 of your 10 devs’ time goes to building, but you feel you could go 30% faster with some changes, then capture that as a bucket for improvement. It’s not perfect given the next bucket - but I’m just looking for a way to highlight typical problems and opportunities.
Time invested so we can go faster or be safer in the future (better monitoring/alarming, extensibility of current architecture, toolset, CD, testing, etc).
Time building new product features
The first four bullets are drags on throughput (building new things that drive customer and business value - now and in the future), the last two are positive throughput, or increase throughput. Once you’ve refined the histogram segments with group discussion, then I recommend measuring just enough to scope the size of each bucket.
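As a rough illustration of that histogram, here is a small Python sketch. The bucket names mirror the six buckets above, and the dev-week numbers are made up purely for the example.

```python
# Dev-weeks per bucket over some period. The numbers are invented for illustration.
time_by_bucket = {
    "broken stuff": 6,
    "KTLO": 10,
    "dealing with coupling": 4,
    "slow building (estimated drag)": 5,
    "investing to go faster/safer": 3,
    "building new features": 12,
}

# The first four buckets are drags on throughput; the last two are positive.
DRAG_BUCKETS = [
    "broken stuff",
    "KTLO",
    "dealing with coupling",
    "slow building (estimated drag)",
]


def drag_share(buckets: dict[str, int]) -> float:
    """Fraction of all tracked time that lands in the drag buckets."""
    total = sum(buckets.values())
    drag = sum(buckets[name] for name in DRAG_BUCKETS)
    return drag / total


def biggest_drag(buckets: dict[str, int]) -> str:
    """The drag bucket eating the most time, i.e. the first place to look."""
    return max(DRAG_BUCKETS, key=lambda name: buckets[name])


# Text histogram, largest bucket first.
for name, weeks in sorted(time_by_bucket.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<32} {'#' * weeks} {weeks}")

print(f"drag share: {drag_share(time_by_bucket):.1%}")  # drag share: 62.5%
print(f"biggest drag: {biggest_drag(time_by_bucket)}")  # biggest drag: KTLO
```

A spreadsheet works just as well. What matters is that the buckets are agreed on as a group and that the sizes are measured just enough to compare them.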
At this stage you’ve found how much time is spent outside the two last categories (going faster in the future, building stuff we want now). Evaluate if there’s enough drag to care about changing right now. There may not be - or it may make more sense for any number of reasons to work on them later. If you’ve made that call at this stage you likely built enough shared context and trust for everyone to understand the decision, and feel heard.
If you do decide that there’s benefit in reducing the drag on throughput - then congrats - you’re about to pay down some technical debt! Focus on finding the biggest constraint on your throughput and what things that you control can be changed to make it better. Most systems have key constraints on throughput that, once identified, allow you to get disproportionate benefits. This is often referred to as the Pareto principle, which suggests that roughly 20% of the work can drive 80% of the improvement.
If you take only one thing away from this:
High confidence bets that free up people or let us build our roadmap hugely faster should generally not be ignored. Every person you take off perpetual oncall makes people happier and is equivalent to hiring another teammate.
There’s no magic formula. Figuring out what to invest in, and how, requires getting into the weeds. If you do something a lot and it sucks then fixing it will have a lot of yield. If things break all the time and trust/customers are hemorrhaging then you probably don’t need to overthink whether you fix stuff. Tech debt costs time later (and integrates), not making customers thrilled now can make a recovery harder or impossible, and fearless/joyful building lets us win in the longer term. Thinking in a truly 360 degree manner, like most things, requires accepting complexity and getting people to share the same context.
Or maybe you can just vibe code your way out of this stuff now. Time will tell. :-)
Postscript: The “We allocate X% of engineering time to tech debt” Antipattern
I’ve always loved the idea of an Antipattern - something that seems like a good idea but usually isn’t. Today’s example is “let’s agree that the tech team gets some percent of their time budget to allocate as they see fit for technical systems improvements. This way they don’t have to negotiate anything with Product.” For reasons I don’t really understand I’ve observed that number often gets set at 35%. Followed quickly by a debate about whether that 35% should include or not include KTLO activity (I’m not actually kidding). Usually this idea is arrived at because of a breakdown of trust that the engineering team “will be allowed” to write quality code or fix tech debt.
Most of the time this is a mistake. Though like anything else I have one exception that proves the rule. Before I jump in I just want to point out that I’m going to satirically dunk on “the PM” as the bad guy in the following examples. First off I’m using “the PM” to be the stand-in for whatever party not on the local engineering team is being blamed for the need to wall off 35% of their time. It’s often some other business leader, their senior manager, or even the CTO. The point is I’m specifically suggesting we don’t assume poor intent/judgement on their part - but I need to call this bogeyman something.
Reasons why this is an Antipattern, ie; is not a good idea
It assumes 35% is the right number. If you used the approach in the main article maybe you would have agreed on 80% or 10%.
It assumes an inherent misalignment between what the “product owner” cares about and what the team cares about.
It assumes the engineering and the product teams cannot resolve their different perspectives. In practice this means the engineering team has a challenge improving how they translate “their” goals into something aligned with business goals.
It suggests that someone else tells the engineers how to write code (i.e., can veto needed changes). This is a different discussion - but engineering teams should not accept the thesis that “the PM” is in charge of such a decision. For the same reasons it’s a mistake to assume the engineers shouldn’t have input into product thinking.
Basically these are actually all the same. This separation of “goals” between Tech and Product is wrong. For maximum results they need to share the same goals.
Producing more of something that’s paid for now and in the future
Making customers happier now and in the future
Making the speed at which the team can build to support those two aims better now and in the future
Making the team’s experience (and that of the community they serve) better now and in the future.
Usually it’s better to start something and finish it to maximize throughput. So spreading out fixing things slowly is usually worse than waiting to fix them, and then going all in without multitasking.
Forcing people outside the local eng team to participate often unearths valuable information and better ideas. I have in mind an example where after truly understanding the problem the PM came up with a better solution than the tech team. They asked for contractor help to cover KTLO so the team could go all-in on fixing the root cause which was wasting bandwidth. The engineering team themselves didn’t even consider this possible.
At the risk of repeating some of the examples from the core article, here are some discussions of converging to a shared problem.
Problem: The engineers want to invest in reducing the super annoying thing they always have to do - today it takes 4 out of 7 engineers to keep the lights on. It’s estimated that given 3 weeks of work we can get that down to 3 engineers, and with another 4 weeks down to 2 engineers. Which means that after a 7 week investment we will have 5 engineers continually working on new features vs 3 today.
Are we sure the PM would object to this? Do we think it’s better to stick this within the 35% and not worry the PM’s pretty little head about it?
Maybe they would object - but in their objection comes learning. Inviting them to say “no” to an ask that’s framed in something they usually care a lot about (engineering capacity) gives you a chance to see what made them say no. Maybe they don’t trust the engineering estimates - good, let’s discuss. Maybe they know the company will go bankrupt if they don’t get this over the finish line in two weeks. Not good - but I bet you’d want to know that too.
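If it helps that conversation, the capacity math in the KTLO example can be turned into a quick payback estimate. The sketch below rests on a deliberately pessimistic assumption of mine that is not in the example itself: no feature work at all ships during the 7 week investment. The function name and the 52 week comparison window are also just illustrative.

```python
from typing import Optional


def break_even_week(building_before: int, building_after: int,
                    investment_weeks: int, horizon: int = 520) -> Optional[int]:
    """First week where cumulative feature work with the investment catches up
    to the do-nothing baseline, or None if it never does within the horizon.

    Assumption (mine): zero feature output during the investment weeks.
    """
    for week in range(1, horizon + 1):
        baseline = building_before * week
        invested = building_after * max(0, week - investment_weeks)
        if invested >= baseline:
            return week
    return None


# Numbers from the example: 3 engineers building today, 5 after a 7 week investment.
week = break_even_week(building_before=3, building_after=5, investment_weeks=7)
print(week)  # 18

# Engineer-weeks of feature work over one year, with and without the investment.
baseline_year = 3 * 52        # 156
invested_year = 5 * (52 - 7)  # 225
print(invested_year - baseline_year)  # 69
```

Under that assumption the investment pays for itself around week 18 and nets roughly 69 extra engineer-weeks of feature work within the first year. If the PM doesn’t trust the 3 and 4 week estimates, changing the inputs shows how sensitive the answer is, which is exactly the discussion you want to have.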
Oh - I almost forgot - I did say there’s one time where this 35% thing is OK. If everything is really as messed up as can be and you’ve tried all the things in this article then maybe this 35% thing is worth buying some wiggle room to prove how much better engineering can make things. But feel free to give me a call first. I’m happy to try to talk it through with you and see if there’s a better way.