TLDR;
“I’d rather be vaguely right than precisely wrong” - John Maynard Keynes
Amazon’s success owes a lot to its evolution from a straight retail business to a trusted everything-store marketplace. The “trusted” part is especially critical. It’s easy to squander that trust with buyers, with sellers, or with both. From roughly late 2004 onward, two core tenets powered over 6 years of evolution: (1) “the buying experience should be as good from third-party sellers as it is from Amazon retail”, and (2) “both buyers and sellers are our customers; but if we have to choose we will land on the side of buyer experience.”
Years of successful execution were built on these tenets, which were operationalized by:
Customer experience continually improving via clear/measurable expectations and shared context. This differs from adversarial Fraud protection, where secrecy is a core principle.
A single customer experience measurement (order defect rate) that unified differing interests under a single standard (even if not quite 100% statistically sound). An order was considered “defective” if it had negative feedback, an A-Z claim, or a credit card service chargeback. This provided a way to measure the state of the marketplace (slightly delayed in time) and segment it by location, product line, etc.
While I didn’t know it at the time, this clarity maximized alignment between teams while creating high autonomy of action1. This is a magic combination that brought smart and fast improvements. Marketplace experience improved by over 75% in a few-year period and continued improving beyond that. Eventually Amazon introduced structural changes to ship 3P seller products from Amazon fulfillment centers (FBA), further accelerating the market and making trust easier to manage. Order Defect Rate continued to track how FBA and other bets impacted customer experience, moving closer and closer to the goal of getting buyers to “just click the buy box and everything will go well.”
I learned a lot professionally in this journey to ODR. I’d understand if everyone isn’t interested in the full set of war stories. I’ve tried to summarize key points below.
The specific ODR measure was designed both to give us a good read on high-level customer experience and to proactively address common objections that had made enforcing standards difficult with other methods. A directionally correct and easy-to-understand measure is often a better tool than one that is mathematically ideal.
The importance of the tracking metric overcoming all objections to its use is inversely proportional to how messed up/bad the current state is. Don’t sweat the 5% problems if a huge number of orders (or other events) have an undesirable outcome. I suppose technically this is a corollary of the above.
Achieving early, top-down alignment on the importance of the goal is easy to overlook. Not doing it at the start of things will almost certainly slow down results by a 10x+ multiple. People often avoid this work, though, like going to the dentist. Probably it’s human nature to take the easier way on things, and talking with your group/peers instead of the head honchos often feels simpler. But without that top-down alignment, getting people on board scales poorly. This step is hard; do it as soon as you’ve gotten confidence in your own defect-rate-type measure.
This also insulates you from doing a ton of work in a situation where there are unresolvable top-level disagreements between senior leaders. I’ve seen it; it’s not pretty - and it’s very, very disheartening to your team. In the worst case you can make huge progress only to be asked to reverse it. I’ve got the psychic scars from a job at a company that rhymes with Beta to prove it. Again - do the hard but high-leverage stuff early.
Make the scoreboard visible in all directions. It took us a while to figure that out. A key input to building trust for buyers was making it really easy for sellers to see where they stood against our expectations. Duh - I know!
Doing something that feels very simple and obvious is almost always the output of a huge amount of banging your head into a wall. It only looks simple at the end. It’s normal to doubt yourself when your long journey hits on something that “seems obvious.” It’s actually a good sign.
Sometimes if you nail something it sticks around for a long time. While I’m sure the systems have changed a ton since we built them originally (at least I hope so) ODR lives on almost twenty years later. Below is a snapshot of Google summarizing ODR.
The longer story
Background
Today’s topic is a story that, to my knowledge, hasn't been widely shared: how a scrappy bunch of operational teams and engineers transformed Amazon’s third-party seller marketplace by introducing stricter performance management. Our mission was simple to explain, yet challenging: make buying from a third-party seller as trustworthy as buying directly from Amazon. The lessons learned along the way should be valuable in broader areas - I’ve seen them apply to lots of other systems problems, including ad marketplaces, creator monetization platforms, freight brokering and supply chain systems.
I joined Amazon in 2004 on a team known as Seller Policing, a somewhat mysterious and not particularly well-understood group dedicated to making Amazon's marketplace safe and trustworthy. At the time, Amazon had recently transitioned from competing with eBay through auctions to adopting a model they internally called the Single Detail Page. On this page, multiple sellers offered the same product, with one seller, chosen by algorithms, prominently placed in the "Buy Box."
I don’t have much personal insight into how well that auctions business was going before that point - but my understanding is it was underwhelming at best. Before I joined Amazon, they had hit upon the idea of consolidating listings: basically the same product sold by multiple parties on what's called a single detail page. If you look up a composition notebook, you’re likely to see many different sellers’ offers against it. The Amazon Marketplace model has evolved a lot over the years, and as “brand manufacturers” have joined you’ll see a lot fewer of these “many sellers, same item” offers. But that’s not how things were back at the start.
As the user interface evolved it hid most offers behind some buttons, with the “one chosen” option available as the “Buy Box”. At the time we’d list a few other options below the Buy Box, and then a later page (which almost no one went to) beyond that if there were more than 3-4 sellers. While I’m skipping over a lot of complexity, that made up the innovation known today as the Single Detail Page (SDP).
This innovation had caught fire - I’m guessing at least a year or so before I joined. The results showed it was clearly a better solution for these types of listings than auctions. The sales were going up and to the right, tons of people were using it to buy, and sellers were joining the platform in order to sell. It looked like a big success.
There was just one controversial problem - the customer experience wasn't consistently on par with the exceptional experience of buying from Amazon directly. This was a huge risk - and to their credit it was well understood as such by Amazon’s top leadership. When customers bought something on Amazon, especially back then, and it didn't work out with seller Joe Blow, this random seller was not likely to get the blame. Most folks just blamed Amazon directly - a reasonable conclusion.
It was becoming clear that Amazon urgently needed a more sophisticated seller performance system to protect customer trust and marketplace integrity.
I tried not to take much dramatic license in the following sections. But please forgive me if I get the scale a bit off or simplify beyond what you feel is reasonable. I definitely did not do all of this work myself. I partnered with many others - it was a dream team of sorts. Sean O’Neil in particular and I banged away at this Order Defect Rate concept for some time. Much of the hard work and key insights came from operational teams that were deep into enforcement and policy (Jana Lipscomb in particular), and the many, many engineers who made these systems work against all odds (hat tip to the first two Seller Performance technical magicians - Thomas Park and Amit Jain). I could list names for days of the incredible folks I got to work with in that period. We also couldn’t have survived the inevitable screwups if folks like Joseph Sirosh hadn’t had our backs through it all.
I’d love to make adjustments to get things more correct as needed - so do let me know where I screwed up!
Direct Fraud Challenges: The Plasma TV Scam
In the early marketplace era, Amazon faced serious fraud challenges on the seller side. There had been a standout buyer-fraud team for many years that protected the company against things such as credit card theft. In online transactions, where the customer wasn’t physically present with a card, the e-commerce company was responsible for fraudulent orders that slipped through. If you really want to point to who ensured that Amazon became today’s behemoth it’s that fraud team - engineers, ML scientists, and operational investigators. But that’s a yarn to be spun by others, as I arrived late for that party. They do have some crazy stories though…
A notorious, and truly annoying, seller-side fraud example was the plasma TV scam. At the time, plasma TVs often retailed for around $8,000 (also, plasma TVs were a thing). Fraudulent sellers would list these TVs at dramatically lower prices, perhaps $800, to lure in unsuspecting customers. Buyers trying to purchase these offers would soon receive direct communications from the supposed seller explaining a dubious reason why the transaction couldn't occur through Amazon’s normal channels. Sellers would urge buyers to use alternative payment methods, such as wiring money via Western Union. Unfortunately, customers who complied found themselves without their TV, without their money, and without recourse.
Amazon’s marketplace had inadvertently made this easier at first by allowing sellers to display email addresses publicly on listings, enabling these scams to flourish even without a buyer attempting to check out. Although this loophole was quickly removed and improved buyer warnings added, the damage had been done. Customers lost money, trust eroded, and the Amazon brand risked significant harm.
Early in the marketplace these sorts of direct fraud attempts stood out and attracted a ton of internal attention. However, not every problematic seller had criminal intent. Some were merely careless, disorganized, or had significantly lower service standards than Amazon expected. But in the data, distinguishing outright fraud from poor seller performance was challenging. Over time our response evolved to create a (mostly) independent focus on fraud, alongside what we eventually began to refer to as “performance” problems.
Early Marketplace Financial Risks and Amazon’s Response
That buying from third-party sellers carried a different risk wasn’t a surprise to the designers of the marketplace. To mitigate customer concerns, Amazon had implemented the “A-to-Z Guarantee,” promising customers reimbursement if their transaction went wrong. While a comforting safety net for buyers, it proved costly, as problematic sellers escaped financial accountability in early system designs. Sellers were paid very quickly after a sale, making financial recovery difficult when issues arose weeks later. Sellers could vanish, leaving Amazon to cover mounting claims.
Amazon's Seller Policing teams were focused on reducing financial risks, not directly improving the customer experience. Teams addressed the basic lack of accountability - writing code to deduct owed claims from outgoing disbursements. This was an early example of folks making the scorecard visible as an effective motivation strategy. The senior engineer driving this had a very large, and very fun, thermometer pasted on their door, onto which they colored in the temperature level as more funds were recovered. Imagined prizes to be won for each successive level of dollars recovered were sketched onto different points on the gauge. Given Amazon’s frugality I’m pretty sure there were no actual prizes for employees - but only those involved know if Jason Kilar contributed the “hug” that I believe was tied to the meter’s top award.
To amplify the effectiveness of this new approach of grabbing funds before they went out the door, one of the next things our team built involved slowing payment disbursements for higher-risk sellers, a controversial decision internally. It had already been done by operational teams as a mechanism - but it wasn’t automated, nor was it a broad policy for classes of sellers. Many sellers resented the move to scale this up (I believe we introduced it for newer sellers and those with fast-moving changes in sales), feeling unfairly targeted and financially strained.
Conflicts grew between Amazon's retail and third-party marketplace teams. This undercurrent created significant difficulty in aligning business areas around seller performance goals. I was slow to recognize this, and too slow to address it. Only much later, when Jeff Wilke set 3P account managers’ top-down goals to include our eventual performance measurement, did I realize how much disagreement there had been - basically because conversations between groups suddenly got so much easier. It was a really big lesson:
When faced with a big problem, always look proactively for differences of perspective and ask (a) what evidence we have that everyone is focused on the same goal, and if not, (b) what it would take to ensure they were.
Buyer experience as a first order goal
I’m going to skip over a lot more detailed history involving how Toys R’ Us divorced from Amazon, leaving 3P sellers to pick up toy-selling overnight right before Christmas. As well as a Friday night S-team meeting that became a discussion on grossly raising the bar on the customer experience aspect of things. I was called into my boss's office and told “Rich, good news. You're in charge of Seller Performance for Christmas. We want to make sure buyers have a great experience. Don’t mess it up!” In response I asked “what does that mean?” Best as I remember it the response was “We're not too sure. Let us know by Monday.” That was a very Amazon thing at the time, and it’s a great opportunity if you can get a job where people trust you with stuff they probably shouldn't be trusting you with. That moment in my boss Joseph's office was probably one of the most important bits of luck in my career and led to some of the most satisfying professional experiences and struggles I've had.
It was a true moment of top-down clarity: “we want one team to drive making sure buyers have the best possible experience - go figure it out.” There was a clear message that what was expected was a more continuous top-grading (aka “firing”) of underperforming sellers. An ask to draw a hard line and remove people who were underperforming. It was fairly difficult to get yourself fired as a seller previously - which made it easy to speculate that our low bar was a core reason for the anecdotal bad buyer experiences.
What didn’t exist was any consensus as to who was good vs. bad. The marketplace was awash with different perspectives. The most common way people thought about what made a good or a bad seller at the time was their seller feedback ratio. Anytime an item was sold, we would ask the buyer for feedback on the seller. Due to concerns about respondents skewing negative, there were understandable arguments against using the percentage of negative feedback for firing sellers.
From a longer-term trust and operating perspective, it was important to gain consensus as to whether someone had hit some bad luck, whether the system was stacked against them, or whether they were really a crappy seller relative to their peers.
Since we didn’t have anything better, we started by drawing a bright line: when a seller had a LOT of feedback and a LOT of it was negative, that would result in termination right away. It was not a very fun meeting when I tried to explain that plan to category managers for larger sellers. Seller account managers (AMs) were the unfortunate go-betweens on our quickly evolving performance standards; they were the people who worked day to day with sellers helping them grow.
Another of the many important lessons I learned through this period was the unique value AMs brought to performance management. I initially saw these account managers as pain-in-the-ass bottlenecks. But over time I realized they were in positions with really valid concerns about the relationships. Even in the subset of cases where I felt they were not seeing the negative impact on buyers well, their incentives and relationships made this almost inevitable. The work of aligning them had to be explained in their own “world of pain and challenges.” No one is going to suddenly want to be the heavy and be asked overnight to tell their clients they’re being shown the door. Getting them to understand the full context in a way that made sense in their mental models/world was key.
In that first sprint to topgrade some sellers based on the metrics we had (negative feedback and claim rates), we got a ton of pushback about how the metrics were super, duper unfair. One thing we did do in these group meetings was acknowledge the unfairness of some of the measures, while also trying to set context for how, in some cases, they were “fair enough” to show there was a large problem. For example, if you'd sold 1,000 items and you'd gotten 250 pieces of feedback and they were all negative, then you probably weren't doing a great job. The first set of removal actions was based on this insight: getting the worst of the worst off the platform while we bought some time to act with more quantitative nuance - and not incite a mass rebellion from internal teams with conflicting short-term goals.
Thankfully, this bought us enough time to figure out something better.
Defining a Seller Performance Metric:
Order Defect Rate (ODR)
The breakthrough came from discussions and whiteboard sessions with colleagues, most notably my peer Sean O’Neill. We agreed on the need for a clear, empathetic metric that both sellers and internal stakeholders could accept as fair. We asked ourselves: what specifically makes a customer's experience negative enough that all stakeholders would agree it's a problem? After considerable debate, we defined an "order defect" as an order that resulted in at least one of three negative outcomes:
An A-to-Z Guarantee claim filed.
The buyer filed a claim of dissatisfaction with Amazon
A credit card service chargeback.
The buyer filed a claim of service-related dissatisfaction with their credit card company
Negative seller feedback.
The buyer responded to the seller feedback survey from Amazon with a 1 or 2 star rating (out of 5).
If none of these issues arose, we assumed the transaction was satisfactory. Order Defect Rate (ODR) - the percentage of orders with at least one defect over a given time window - was born.
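To make the calculation concrete, here’s a minimal sketch in Python of how an ODR computation could look. The record fields and function names are my own illustrative assumptions, not Amazon’s actual data model:

```python
from dataclasses import dataclass

@dataclass
class Order:
    # Hypothetical order record; field names are illustrative only.
    seller_id: str
    az_claim: bool = False           # A-to-Z Guarantee claim filed
    chargeback: bool = False         # service-related credit card chargeback
    negative_feedback: bool = False  # 1- or 2-star seller feedback

def is_defective(order: Order) -> bool:
    # An order counts as "defective" if it hit at least one of the three signals.
    return order.az_claim or order.chargeback or order.negative_feedback

def order_defect_rate(orders: list[Order]) -> float:
    # ODR = defective orders / total orders, for whatever slice is passed in
    # (one seller, one category, one region, one time window).
    if not orders:
        return 0.0
    return sum(is_defective(o) for o in orders) / len(orders)
```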
ODR seemed sort of obvious once we said it out loud, but it was a big shift in that most people using basic reasoning could agree on its value. Since sellers received credit whenever customers did not actively complain, it invalidated the largest objection to performance enforcement based on negative feedback (i.e., participation bias). In hindsight the neutral language of calling issues with orders “defects” probably helped too. We’re not blaming - we’re just realistically labeling the state of a specific order.
Another positive was that we could compute the ODR for any subset of the marketplace (e.g., just electronics orders), individual sellers, or whole geographic regions (e.g., Japan vs. the US). Because claims and feedback take time to come in, we couldn’t compute ODR for orders placed only a few days or a week earlier. We could compute ODR for any time range, as long as it had occurred 3-4 weeks prior. “Assuming a positive experience” in the absence of negative feedback or a claim neutralized a lot of the complaints about prior approaches. The metric was very seller friendly in that sense. For internal debate, we could then ask “given that you agree the metric gives the seller the benefit of the doubt, let's discuss why they might still be worse off than this similar cohort?…”
From a performance management perspective it was straightforward to segment sellers who were (for example) two standard deviations away from normal, so you could make distinctions between super crappy sellers and others. Since ”you get credit for things when no one said anything bad” it really reduced the argument that “yeah, my seller has a 10% defect rate, but that's some problem in bias, even though everyone else with similar sales volumes has a 0.1% defect rate.”
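As a rough illustration of that kind of cohort comparison, here is a hedged sketch that flags sellers whose ODR sits more than two standard deviations above their cohort’s average. It’s purely illustrative; the real cohorting (by sales volume, category, and so on) was far more nuanced:

```python
from statistics import mean, stdev

def flag_outlier_sellers(seller_odrs: dict[str, float], num_std: float = 2.0) -> list[str]:
    # Return sellers whose ODR exceeds the cohort mean by num_std standard deviations.
    if len(seller_odrs) < 2:
        return []
    rates = list(seller_odrs.values())
    threshold = mean(rates) + num_std * stdev(rates)
    return [seller for seller, odr in seller_odrs.items() if odr > threshold]

# A 10% ODR stands out starkly when peers sit around 0.1%.
cohort = {f"seller_{i}": 0.001 for i in range(9)}
cohort["seller_x"] = 0.10
print(flag_outlier_sellers(cohort))  # ['seller_x']
```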
We certainly could have come up with a measure that had more scientific rigor. For example, maybe the downstream impact of an A-Z claim is 3x worse than that of a negative feedback. But the directional “truthiness” of the measure is a big part of why it caught on. Simple to explain, simple to understand - it just feels fair. And since there were a LOT of defects to start with, waiting for something that was mathematically perfect would at best have just taken a lot longer and at worst would have caused people to reject it due to not really understanding it.
Introducing ODR to Marketplace Sellers
Initially, we approached marketplace risk management with the mindset of fraud prevention: catch fraudsters unaware and maintain secrecy about how it was done. However, unlike criminals, marketplace sellers are business partners. Treating them as adversaries risked undermining trust and harming relationships. As time went on it became increasingly clear that improving customer experience meant shifting the mindset from "seller fraud" to "seller performance." Unlike fraud prevention, performance management requires transparency, clear communication, and fairness to incentivize sellers positively.
As we talked to more and more sellers unofficially, and started to look at the sorts of complaints coming in about suspensions, the Buy Box, etc., we realized, “hey, the advantages we gained in talking with account managers, the internal constituents of the sellers, could be extended by talking directly with the sellers.”
This forced us to question our assumptions with respect to performance management policy. We realized we were still opaque about what the expectations were. That made us realize that when you set up performance systems and you don't state the goals, people don't know how to change their behavior. So they probably won’t. *
I think most people accept that firing someone out of the blue for underperforming is a bad idea. Not for committing fraud, threatening behavior, etc. - sure - those are things you would deal with decisively. But for more mundane underperformance, it's a bad idea. You’d worry: what if my employee doesn’t actually understand the bar, and could meet it if they knew the expectation? With respect to sellers the same thing applied - meaning we should be transparent about what bar we expected. And that was the next phase of our work.
In parallel, there was a lot of hardcore quant work going on, such as building machine learning systems to predict order defect rate for sellers. Probably, though, the most impactful fraction of our work for an extended period of time was updating the policies, the tools for sellers, and the communications around exposing order defect rate to the community. That included producing a scorecard on Seller Central, where sellers went to understand their account along many axes. The scorecard included their current ODR and explained why we cared.
People intuitively seek a sense of fairness, so just telling them that there was a number, which is what we did first, made them concerned that we might use that number against them. The first set of questions we received was essentially “am I meeting the standard?” And we said, “we know that answer, but we don't want to tell you; you might do something weird.” But not too long after, thinking about the principles we were developing, we realized, “crap, we’re wrong here.”
Sellers needed clearly understood standards, to see their performance transparently, and to have opportunities to improve before termination. Thus, we introduced operating "zones":
Green: ODR under 0.5%, performance acceptable.
Yellow: ODR 0.5%-1.0%, warnings, risk of losing Buy Box.
Red: ODR over 1.0%, loss of Buy Box, potential termination.
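A trivially small sketch of that zone logic, using the thresholds above (the function and zone labels are illustrative; this obviously isn’t the actual enforcement code):

```python
def odr_zone(odr: float) -> str:
    # Thresholds match the zones described above: 0.5% and 1.0%.
    if odr < 0.005:
        return "green"   # performance acceptable
    if odr <= 0.010:
        return "yellow"  # warnings, risk of losing the Buy Box
    return "red"         # loss of Buy Box, potential termination
```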
Making these expectations explicit transformed seller behavior. Sellers actively monitored their performance, proactively improved operations, and ceased accusing Amazon of unfairness. This didn’t happen overnight, and our really incredible operational/investigations team had to deal with a lot of complexity in the middle. But over time we started to see a continually improving trend across all marketplaces. That was above what was explainable by our topgrading/firing of the worst sellers based on ODR (which we also did).
Seller Reactions and Procedural Justice
Meeting directly with sellers, attending seller conferences, and transparently communicating standards was critical to shifting seller attitudes. Initially skeptical and angry, sellers ultimately appreciated knowing Amazon’s expectations. A fun moment of validation came when discussions on seller forums shifted. Previously filled with angry complaints about arbitrary suspensions, a notable thread after ODR transparency implementation instead resulted in sellers themselves defending Amazon’s fairness:
"Those Amazon guys threw me off! They claim my performance was terrible," one seller began.
Another quickly responded, "Wait, didn't you get a warning first?"
"Yes, I guess I did," admitted the first seller.
Another chimed in, "Wait, didn't you get a second warning when your ODR didn’t improve?"
"Yes, that’s right," admitted the first seller.
More chimed in, affirming that Amazon provided ample warnings, time to improve, and clear criteria for performance on the site. Community members defended Amazon's actions because they understood them - illustrating a profound shift in procedural justice and trust. It was a very satisfying and proud moment for our team.
I’m not claiming that sellers, then or now, didn’t have reasons to dislike the platform or distrust Amazon. Just that it was better, and that having internal tenets committing us to keep making it better was a high-leverage decision.
From great measurements comes great goals
Once we settled on this metric, we were able to reduce our Seller Performance program objectives to:
We will steadily reduce negative customer experiences, as measured by Order Defect Rate (ODR)
We will reduce the bad debt rate in the marketplace, as measured by A-Z claims and chargebacks for which we were not able to collect restitution from sellers
That became our team identity, subject to the overall objective of maximizing total long-term sales. I include the long-term sales part because otherwise people would point out that you can optimize defect rate by simply shutting down the marketplace. To which I got pretty good at saying: yes, I understand; we will not do stupid or pathological things to drive the number. I always tried to smile when I said this - though that wasn’t always easy. ;-)
Since we’re on the topic of goals with pathological solutions: my experience suggests that such stark and (hopefully) theoretical conflicts within a goal are often best managed based on trust. Yes, we could destroy the marketplace by focusing on defect rates to the exclusion of all else. But as long as we believe the goal of our business is health now and in the future we probably wouldn’t (<insert terrifying joke about AI maximizing paper clip production here>).
Therefore, don’t spend a ton of time on a super complex equation that ensures you don’t shut down the whole marketplace to reduce ODR. Instead, we should confirm that there is agreement on the principle that we want:
Growing sales
Customers to be happy (low defect rate)
Not losing money (keep bad debt low from unrecovered claims and chargebacks)
Maintaining sellers on our platform
The trust that we all want these things managed well can then come from (a) agreement that these are all important, (b) agreement on tripwires for how we’d know if we’re violating them - for example, booting off sellers representing > X% of the market would require additional review - and (c) capturing operating tenets in writing to ensure everyone truly knew how the teams felt about the four points above. Tenets are amazing - and I plan to write more about them in the future.
Beyond ODR
Assessment of and engagement with sellers over performance didn’t end with ODR. The state of the data continued to evolve over the years.
Astute readers may be wondering:
Were those three order defects enough to capture all negative buyer experiences?
Given their lagging nature, was anything else considered as a signal of likely discontent?
These are great questions, and ones we postponed in the early days when problems were more common. At that point even imperfect measures were enough, as it was a target-rich environment. Not that any given buyer was truly “likely” to have a negative experience, but just focusing on sellers and segments with a clear problem improved things a lot. So more subtle issues weren’t our main focus.
Over time, though, we did explicitly add two things we knew buyers didn’t like to seller-facing scorecards: cancel rate and late ship rate. I’ll describe each situation:
Cancel rate - We’d discovered an odd data quirk when tracking down sellers who looked good on paper but whom buyers complained about. When a seller canceled an order before shipping it, that “order” ceased to show up in the denominator of the ODR calculation. I’ve always doubted this was an intentional marketplace design; it was more of an idiosyncrasy of how the underlying marketplace data models were set up. Seller-initiated cancellation rates generally shouldn’t be high. Seller cancellations after a purchase (but before shipping) were typically due to somewhat sloppy inventory management practices - for example, listing more than they physically had, or selling on multiple platforms and not keeping the inventory level set on Amazon up to date automatically (or fast enough). When a buyer bought something and then had it canceled by the seller, it usually wasn’t a great feeling.
Late ship rate - this one is an example of us realizing there were behaviorally visible factors that increased the probability that a buyer would later complain. Having something you ordered ship later than expected was one such thing. Also, because sellers provided tracking info and indicated when they shipped, it was a factor we felt could be quantified early in the order process.
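Here’s a hedged sketch of how these two rates could be computed, keeping seller-canceled orders in the denominator so the ODR quirk described above can’t hide a problem. The fields are hypothetical placeholders for whatever the order system actually records:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class OrderStatus:
    # Hypothetical per-order fields, for illustration only.
    seller_canceled: bool = False
    ship_by: Optional[date] = None      # promised ship-by date
    shipped_on: Optional[date] = None   # date the seller marked it shipped

def cancel_rate(orders: list[OrderStatus]) -> float:
    # Seller-initiated cancellations over all orders placed; canceled orders
    # stay in the denominator, unlike the early ODR behavior noted above.
    if not orders:
        return 0.0
    return sum(o.seller_canceled for o in orders) / len(orders)

def late_ship_rate(orders: list[OrderStatus]) -> float:
    # Of the orders that actually shipped, how many shipped after the promise date?
    shipped = [o for o in orders if o.shipped_on is not None and o.ship_by is not None]
    if not shipped:
        return 0.0
    return sum(o.shipped_on > o.ship_by for o in shipped) / len(shipped)
```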
There were very active debates after defining these measures as to whether we should include them in the ODR measurement. Ultimately we decided not to include them, as they were inferences that the order had an issue, whereas feedback, claims, and chargebacks were all direct complaints. Rather than mix apples and oranges, we included cancel and late ship rate in seller-facing scorecards along with expected performance levels. From what I can tell, this logic has persisted.
Reading documentation on Seller Central, it appears that valid tracking rate (percent of orders with a tracking number provided) and on-time delivery rate (orders arriving by the promise date, per tracking info) were similarly added.
Our VP at the time, Joseph Sirosh, had a brilliant observation after Cancel Rate (CR) and Late Ship Rate (LSR) had been established for the seller community: as we got better and better at ODR, perhaps we should start looking at the inverse of defects - what percentage of orders were totally perfect for buyers. For example, a perfect order might include the absence of all the clearly problematic things (i.e., direct complaints) but also nothing that could even remotely signal any dissatisfaction. A “perfect buying experience” would be the buyer ordering, then the item shipping and arriving on time, not being canceled, not requiring a return, and not even requiring the buyer to contact the seller at all. Any deviation from this idealized state would make the order imperfect. It signaled an unachievable but highly desirable vision for the next level of buyer experience to strive for - an ideal for both 3P and Amazon retail orders.
This “Perfect Order Percentage” (POP) concept was also empowered by our relatively new ability to track all buyer/seller communication, thanks to other key infrastructure projects we’d done. This gave us insight into direct contacts, the quality of the communication, and the time it took to respond. It’s a great standalone story, and I look forward to exploring it more in a future article.
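As a thought experiment, here’s a minimal sketch of what a “perfect order” check might look like under that definition. Every flag below is my own assumption about what would count as a deviation, not the real POP formula:

```python
from dataclasses import dataclass

@dataclass
class OrderOutcome:
    # Hypothetical per-order signals; names are illustrative only.
    has_defect: bool = False             # negative feedback, A-Z claim, or chargeback
    canceled: bool = False
    shipped_late: bool = False
    arrived_late: bool = False
    returned: bool = False
    buyer_contacted_seller: bool = False

def is_perfect(order: OrderOutcome) -> bool:
    # "Perfect" means no deviation of any kind, however small.
    return not any((order.has_defect, order.canceled, order.shipped_late,
                    order.arrived_late, order.returned, order.buyer_contacted_seller))

def perfect_order_percentage(orders: list[OrderOutcome]) -> float:
    if not orders:
        return 0.0
    return 100.0 * sum(is_perfect(o) for o in orders) / len(orders)
```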
Joseph’s idea to one-up ODR always struck me as the right next state, and I was a little jealous he’d thought of it first. :-) From a supply chain quality measurement perspective, I think it’s an ideal mental model. My recollection of what happened next here is less clear. Searching through online discussions, it appears that a subset of the ‘perfection’ defects may be used in this manner to aggregate into a POP score. But what I found online seems to come more from seller-supporting companies than from Amazon. So I’ll leave learning more about the state of POP as an exercise for the reader.2
Summary: Long-Term Impact of ODR and Performance Transparency
Transparent, fair, and consistent performance management was transformative for Amazon's marketplace. Order Defect Rate became a robust metric that sellers (mostly) trusted, actively managed, and responded to positively. Negative customer experiences dropped dramatically, buyer trust surged, and marketplace health improved significantly.
Eventually, new innovations like Fulfillment by Amazon (FBA) further enhanced trust by taking structural control of areas where bad experiences could arise - such as fulfillment and customer support. ODR represented a turning point, a moment when we realized managing seller performance was not just about policing bad actors but about clearly aligning incentives, expectations, and transparency. To the best of my knowledge this approach remains foundational today, illustrating the power of clarity in expectations, directionally aligned performance metrics and a widely held sense of fairness (procedural justice) in marketplaces.
Also, we made mistakes, so, so, so many mistakes. 😀 You haven’t lived until you’ve tried to stop your performance script in real time from throwing off the wrong people. Or hacked together a mountain of technical debt using Amazon’s Data Warehouse for things it really was not intended for. But those are stories for another day.
* Postscript: Two tips that will pay for years’ worth of this newsletter’s subscription 😀
Don’t hide what you want in a healthy marketplace
This has become a pet peeve of mine: setting up market incentives and then going out of one’s way to hide them. For example, trying to make certain types of freight more attractive to carriers via pricing incentives, but leaving it up to carriers to discover what we’d like them to do.
While I do have faith in the wisdom of the crowd and all, I’ve never understood why we’re sometimes hell bent on avoiding a nudge. Or to paraphrase Kramer simulating Moviefone - “why don’t you just tell me what behaviors you want to see?”
My general view is that if you don’t feel comfortable explaining to all sides in the marketplace why you want them to act a certain way, then something probably isn’t totally kosher.
Sometimes just stating a new policy gets 80%+ compliance without having to wait to build/enforce it
Many times companies wait to roll out a new policy, new operational enforcement processes, and new production code all at once, even if they have to wait 3-6 months to do so after deciding what the new policy should be. It’s often worth considering:
“If I announce the new policy without being able to implement it, will I get some immediate behavioral benefits anyway?”
“If so, why not just announce the changes now while I’m still changing the code?”
This can help in a case where say you want all the video creators on your platform to stop doing X, but you haven’t quite figured out how to patrol for X at scale. Or you want everyone on your platform to remove their fake reviews at the penalty of losing their account - but aren’t sure how much manual review this is going to cost you. It’s possible just telling people you’re going to launch this with a vengeance on a future date will get 80%+ compliance. If in the end you miss your technical enforcement launch date by a few days or even months you’re probably still way ahead of the game.
While I don’t think The Art of Action is my absolute favorite book on goal setting and concepts like “commander’s intent,” I do love its phrasing on the false tradeoff between top-down alignment and local autonomy.
Yes - that’s right, I always wanted to write that phrase someday. ;-)