The 80/20 rule: Tokenmaxxing edition

Harm reduction mechanisms for recovering token addicts

Jun 18, 2026

The smartest guy in the room (spoiler: not me)

A few weeks ago Hitesh Jain wrote a popular blog post, “Claude seemed slow, so I X-rayed my logs.” What breaking down his (extensive) Claude Code logs showed was that a disproportionate amount of his spend came not from generating new tokens, but from the model re-reading the tokens he’d already paid to generate. Over and over again. It’s a good read, aligned with (his company) Coral Bricks’s technical goal of driving that re-read cost toward zero. Hitesh is onto something big - and his ability to work at the lowest level of transformer infra will take a sledgehammer (in a good way) to the problem.

Interesting, important in the big picture, but not something I was super worried about in my own usage. At least not with my little Claude Max plan - clearly subsidized by corporate enterprise suckers paying the full API rate and then some.

That confidence lasted maybe two weeks … until it finally dawned on me that I am in fact also one of those Enterprise suckers truly paying by the token, all the tokens.

Deja vu

In my current job I’m very lucky to serve as a sort of champion of champions for AI adoption. I don’t think of it as AI adoption (a solution) but as being a champion of higher throughput in development, more time to explore more paths, and more joy (a problem)1. Or as Kim and Yegge memorably put it - the pursuit of “the good FAAFO.”

One of the jobs I grabbed for myself in this role was generating a view of AI adoption over time. That lets us measure progress in adoption and utilization - but it’s also a great guide to meeting interesting people as they appear on the leaderboards.

We’re not tokenmaxxing2 - that has always struck me as somewhere between an extreme solution to accelerate change and extremely dumb. Instead, I’ve been pitching tenets about how we’re approaching these things - and I’m happy to share as long as my boss gives me the OK (please ask if there’s interest). For this, the relevant tenet focuses on optimizing for impact and learning speed before cost efficiency (the early risk is under-learning, not overspending). However, I’ve been watching some of our most prodigious vibe coders’ token bills rise pretty quickly and wondering how closely I should check in3. At the end of the day I thought “I’d feel pretty silly if someone asked ‘do you know if that $5000 so far could have been less’ and I never really checked.” So I started to prep for those conversations.

While I was thinking about it, the universe provided the solution. A serious AI user in finance pinged me to share the results of a recent query they’d issued to Claude. They’d asked “I just hit my monthly $500 limit4. Dude! Where did all those tokens go!?”5 The core point of the response - 63% of their spend was “re-reading the conversation context on every turn.” Boom - Hitesh flashback. And then my thinking - well, if I cannot wait for Coral Bricks to solve this at the hardware layer and eliminate cache read costs, what else can I do6?

So I asked the king of the hill, Mr. $6000 in tokens7 himself to run the same query. For both Claude and Codex - same deal. Maybe it’s worth stepping back and checking in on why?

The re-read tax

The banner summary, like much AI writing, is perhaps a bit of an exaggeration. They have at times ~~hallucinations~~ a gift for fiction. But in this case the observation is pretty true. Just try it yourself - with special bonus points if you address Claude as “Dude!” - I’m sure LLMs love that8.

I’m getting a bit tired out from spewing all my lame cultural references, which are totally not, as we’ve established through math, based in the 80’s. So I’m going to let my friend Claudio write the next part:

Why a long session costs more than a long answer

Both Claude Code and Codex work the same way: every turn re-sends the entire conversation so far. Prompt caching makes that re-send cheap per token, but it does not make it free, and the volume is enormous. In the account we reviewed, output was 10 million tokens for the month; cache reads were 2.57 billion. The model generated once and re-read its own context roughly 255 times for every token it produced.^* That ratio is volume, not cost: re-reads bill so cheaply per token that the same 255:1 lands as only about 5× the output cost - which is why it shows up as roughly half the bill rather than all of it.
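To make that arithmetic concrete, here is a rough sketch using the rounded figures quoted above. The price ratio is an assumption, not something taken from the account: it supposes one output token bills at about the price of 50 cache-read tokens, which is worth checking against your own plan's rates.

```python
# Rounded figures quoted in the text above.
OUTPUT_TOKENS = 10_000_000          # tokens generated in the month
CACHE_READ_TOKENS = 2_570_000_000   # tokens re-read from the prompt cache

# ASSUMPTION (not from the account): one output token costs about as
# much as 50 cache-read tokens.
OUTPUT_TO_CACHE_READ_PRICE = 50


def reread_ratios(output_tokens, cache_read_tokens, price_ratio):
    """Return (volume ratio, cost ratio) of cache reads versus output."""
    volume_ratio = cache_read_tokens / output_tokens
    cost_ratio = volume_ratio / price_ratio
    return volume_ratio, cost_ratio


volume, cost = reread_ratios(
    OUTPUT_TOKENS, CACHE_READ_TOKENS, OUTPUT_TO_CACHE_READ_PRICE
)
print(f"re-read volume per output token: {volume:.0f}x")
print(f"re-read cost relative to output: {cost:.1f}x")
```

With these rounded inputs the volume ratio comes out near 257, in the same ballpark as the roughly 255 quoted above, and the cost ratio lands near 5×.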

The trap is that this cost is invisible and compounds. Cost tracks context size × number of turns, and context keeps growing as files, tool output, and back-and-forth pile up. So a session that runs twice as long doesn’t cost twice as much — each later turn re-reads a bigger pile than the turns before it, and the curve bends upward. One 1,729-message session in that account ran to ~$673 by itself, about a quarter of the month, on 1.7M tokens of output.
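The upward bend is easy to see in a toy model. The sketch below assumes, purely for illustration, that context grows by a fixed number of tokens per turn and that every turn re-reads all of it; the starting size and growth rate are made-up numbers, not figures from either report.

```python
def total_reread_tokens(turns, start_context, growth_per_turn):
    """Sum the context re-read across every turn of one session.

    Hypothetical model: each turn re-reads the whole context, then
    adds a fixed number of new tokens to it.
    """
    total = 0
    context = start_context
    for _ in range(turns):
        total += context             # this turn re-reads the whole pile
        context += growth_per_turn   # and then makes the pile bigger
    return total


short_run = total_reread_tokens(turns=100, start_context=20_000, growth_per_turn=1_500)
long_run = total_reread_tokens(turns=200, start_context=20_000, growth_per_turn=1_500)
print(short_run, long_run, round(long_run / short_run, 2))
```

Under these made-up numbers, a session that runs twice as long re-reads about 3.6 times as many tokens, not twice as many.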

The Codex picture from the other report is the same, described with different vocabulary: a handful of long, useful threads that kept re-reading accumulated context. Neither tool burned the budget by talking too much. Both burned it by remembering too much, for too long, out loud.

What can we do about that?

I asked Claude for some guidance for what options we had - with a focus on reducing token spend without hurting value creation/impact. Now I’ll let Claudio take the controls again for a minute.

Three levers that hold for both tools

Cap the running context

Reset between unrelated tasks; summarize within one. This is the dominant lever because it attacks both terms at once — it shrinks the current pile and stops it being carried forward forever. The built-in auto-summarize on both tools fires near the context limit, far too late; by then you’ve already paid for hundreds of bloated re-reads. Manual and early wins.
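A toy simulation shows why early resets matter so much. Everything here is illustrative: the fixed per-turn growth, the reset interval, and the size of the carried-over summary are all hypothetical parameters, and a "reset" simply models a /clear or an early summarize.

```python
def reread_with_resets(turns, start_context, growth_per_turn,
                       reset_every=None, summary_tokens=0):
    """Total re-read tokens when context is collapsed every `reset_every` turns.

    After a reset the context drops back to start_context + summary_tokens.
    With reset_every=None the context just keeps growing.
    """
    total = 0
    context = start_context
    for turn in range(1, turns + 1):
        total += context
        context += growth_per_turn
        if reset_every and turn % reset_every == 0:
            context = start_context + summary_tokens
    return total


never = reread_with_resets(200, 20_000, 1_500)
every_50 = reread_with_resets(200, 20_000, 1_500,
                              reset_every=50, summary_tokens=5_000)
print(never, every_50, f"{1 - every_50 / never:.0%} fewer re-read tokens")
```

In this sketch, collapsing the context every 50 turns cuts the re-read volume by roughly two thirds, even though each reset carries a 5,000-token summary forward.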

Keep exploration out of the main thread

The threads that ballooned in both reports were research — repo mapping, vault scans, video and screenshot analysis — dumping huge volume into a context that then followed every later build turn. Subagents run in their own context and return only a summary, so the 200K tokens of scanning never enter the thread you keep working in. Same logic for reads: grep or glob to the section you need instead of slurping whole files. This shrinks the per-turn multiplier and improves answers by cutting noise.

Trim the always-on tax

The system prompt, the project instructions file (CLAUDE.md / AGENTS.md), and every connected MCP server’s tool definitions get re-read on every turn of every session. A bloated 40KB instructions file or three unused MCP servers is a fixed levy against all those billions of reads. This is the cheapest fix on the list: a one-time edit that compounds across every user and session. Audit it once.
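For a sense of scale, here is a back-of-the-envelope sketch of that fixed levy. The bytes-per-token figure, the MCP tool-definition size, and the turn count are all assumptions chosen for illustration; only the 40KB file size comes from the text above.

```python
BYTES_PER_TOKEN = 4  # ASSUMPTION: rough average for English text and code


def always_on_tokens(instruction_bytes, mcp_tool_tokens, turns):
    """Tokens re-read over `turns` turns purely from fixed per-turn overhead."""
    per_turn = instruction_bytes // BYTES_PER_TOKEN + mcp_tool_tokens
    return per_turn * turns


# A 40KB instructions file plus a hypothetical 6,000 tokens of unused
# MCP tool definitions, versus a trimmed 8KB file with no unused servers.
bloated = always_on_tokens(40 * 1024, 6_000, turns=1_000)
trimmed = always_on_tokens(8 * 1024, 0, turns=1_000)
print(bloated, trimmed, bloated - trimmed)
```

Under these assumptions the trim saves about 14 million re-read tokens per thousand turns, before anyone changes how they work.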

Honorable mention - route by difficulty.

Mechanical work doesn’t need the top model or maximum reasoning. In the data I reviewed, Fable 5 pulled ~$566 from only 11% of tokens because it prices at 2× Opus. This is a rate lever to use to your advantage.

BTW - and this is Rich again. One thing I’ve been meaning to try, haven’t gotten to, and am now thinking I should, is something like OhMyClaude to make the routing choices easier. Also … obviously, that name is great! LMK if any of you readers are using it or give it a whirl.

Mechanisms vs. best intentions

Everyone knows that when you join Amazon they insist you get that saying about best intentions not working and to always go with mechanisms tattooed on the body part of your choice9.

While alarming (although a mechanism technically) it’s good advice. I could send out a note to everyone doing high token burn coding at my company and ask them to keep an eye on the re-read tax. That is a classic best-intentions solution. Since I had 30 minutes before the next meeting I tried to come up with something better. Since I hadn’t hit my “win a free sticker from Rich” $500 limit yet, I decided to use Claude to brainstorm some solutions.

Deliver Claude did - I’m going to let them finish the story up below. From the insights/ideas it’s relatively straightforward to nudge Claude to write up these mechanisms as skills with hooks that trigger to make the re-read cost more visible and to alert on “you’re about to create a whole mess of re-read tokens.” Both, once pointed out, are likely to remind the user to follow some best practices. If my boss is feeling generous I’ll share the structure - but really, it only cost a few dollars (which I can now see more clearly as I code) to convert the below content into useful /skills.

Some musing on mechanisms to address re-read

Claude Code’s strength here is observability: it has a fully programmable status line and a mature hooks system, so you can make cost visible and put soft guardrails in place without anyone changing how they work.

Session cadence

/clear when switching to unrelated work (full reset). /compact to stay on one task but collapse the transcript into a summary so the prefix stops growing. /context shows current usage; /cost shows session spend. Route mechanical work with /model to Sonnet or Haiku and reserve Opus for hard reasoning.

Make the cost visible — the status line

The single highest-ROI change, because the whole problem is that the bill is silent. A status-line script receives session JSON on stdin every turn and prints one footer line. The fields that matter are context_window.current_usage.cache_read_input_tokens (the per-turn re-read — the burn signal), context_window.used_percentage, and cost.total_cost_usd. Put the re-read number on screen, colour it amber past 100K and red past 200K, and most of the behaviour corrects itself. Full script in the appendix; wire it up in ~/.claude/settings.json.
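As a minimal sketch of what such a status-line script could look like (not the author's actual script), the Python below formats the three fields named above. The helper names, the ANSI colour choice, and the sample payload are illustrative; the demo feeds in a made-up payload, where a real script would read from sys.stdin.

```python
import io
import json

AMBER_AT = 100_000  # thresholds from the text above
RED_AT = 200_000

# ANSI colour codes; assumes the terminal renders them.
AMBER, RED, GREEN, RESET = "\033[33m", "\033[31m", "\033[32m", "\033[0m"


def dig(data, *keys, default=0):
    """Walk nested dicts, returning `default` if any key is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data if data is not None else default


def render_status(session):
    """Build the one-line footer from the session JSON."""
    reread = dig(session, "context_window", "current_usage",
                 "cache_read_input_tokens")
    used_pct = dig(session, "context_window", "used_percentage")
    cost = dig(session, "cost", "total_cost_usd")
    colour = RED if reread > RED_AT else AMBER if reread > AMBER_AT else GREEN
    return (f"{colour}re-read {reread / 1000:.0f}K{RESET}"
            f" | ctx {used_pct:.0f}% | ${cost:.2f}")


def read_session(stream):
    """Parse session JSON from a text stream; fall back to {} on bad input."""
    try:
        return json.loads(stream.read() or "{}")
    except json.JSONDecodeError:
        return {}


# Demo with a made-up payload; a real script would pass sys.stdin instead.
sample = io.StringIO(json.dumps({
    "context_window": {"used_percentage": 61.0,
                       "current_usage": {"cache_read_input_tokens": 123_000}},
    "cost": {"total_cost_usd": 4.2},
}))
print(render_status(read_session(sample)))
```

Falling back to an empty payload keeps the footer from crashing on a malformed or missing field, which matters for something that runs on every turn.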

Soft guardrails — hooks

A UserPromptSubmit hook reads the transcript, and once a session runs long, injects a note telling Claude to wrap the current sub-task, write a short handoff, and recommend /clear or /compact. The model delivers the nudge in plain conversation — no popup, no forced restart. A PreToolUse hook on Read adds a speed bump on large whole-file reads, asking for confirmation (not blocking) and pointing at grep or offset+limit instead, since one careless read becomes a recurring per-turn tax.
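The decision logic behind both hooks can be sketched in a few lines of Python. This is a sketch under assumptions, not the hooks as actually deployed: the thresholds are made up, and the hook input and output field names (transcript_path, tool_input.file_path, hookSpecificOutput, permissionDecision) reflect my understanding of the hooks protocol and should be checked against the current Claude Code documentation.

```python
import os

MAX_MESSAGES = 150        # hypothetical "session is getting long" threshold
MAX_READ_BYTES = 100_000  # hypothetical "large whole-file read" threshold


def long_session_note(hook_input, max_messages=MAX_MESSAGES):
    """UserPromptSubmit: return a nudge for Claude, or "" if the session is short.

    Assumes the hook input carries `transcript_path`, a JSONL file with
    one line per message.
    """
    path = hook_input.get("transcript_path", "")
    if not path or not os.path.exists(path):
        return ""
    with open(path, encoding="utf-8") as fh:
        messages = sum(1 for line in fh if line.strip())
    if messages <= max_messages:
        return ""
    return (f"This session has {messages} messages. Wrap up the current "
            "sub-task, write a short handoff, and recommend /clear or /compact.")


def large_read_decision(hook_input, max_bytes=MAX_READ_BYTES):
    """PreToolUse on Read: ask for confirmation on big whole-file reads.

    Returns a dict to print as JSON, or None to let the read through.
    The output field names are assumptions about the hook protocol.
    """
    tool_input = hook_input.get("tool_input", {})
    path = tool_input.get("file_path", "")
    partial = "offset" in tool_input or "limit" in tool_input
    if partial or not os.path.isfile(path) or os.path.getsize(path) <= max_bytes:
        return None
    return {"hookSpecificOutput": {
        "hookEventName": "PreToolUse",
        "permissionDecision": "ask",  # ask for confirmation, do not block
        "permissionDecisionReason": (
            f"{path} is {os.path.getsize(path):,} bytes and will be re-read "
            "on every later turn. Consider grep or offset+limit instead."),
    }}
```

Each function would sit in its own small script that loads the hook input from stdin with json.load and prints the result. Keeping the decisions as pure functions makes the thresholds easy to test and tune.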

Conclusion

If I get any time between meetings this week or next, I’m hoping to find a somewhat more powerful, closed-loop mechanism. But once you see this issue, or in my case have Hitesh explain it to me, forget about it, then see it again twice in one day - it’s (now) very hard to unsee. It’s a decent start for a curiosity-based, constructive conversation with your dev teams about “did we really need all those tokens to build that?”

I’m super curious what other mechanisms people have been using/inventing/dreaming about to balance the tokens/value ratio on their teams. Please share - or invite me to coffee to discuss. Much like your Claude account - the first ~~taste~~ session is free.

1. Thanks Convoy for giving me the “love problems not solutions” vocabulary to express that. Still doesn’t quite make up for the financial choice I made there. But that and the many truck/fuck related pun items of swag maybe is a start. Yeah … not really…

2. No, this is not that thing that the NY Times and everyone else suddenly discovered about the kid who may have some issue who really, really cares about looking hot but found a new even creepier way than just going to the gym and dieting. This is the other NY Times article with “maxxing” in it. Wish those wack kids with their made up words would get off my lawn. That would be totally dope.

3. The torn part was that, in the case of the #1 spender for example, I’d spoken to him a week before and was completely blown away by the leverage he was getting.

4. Yes - we set a $500 monthly limit because, again - would you want to be the one who told their boss someone’s Ralph loop ate next quarter’s travel budget (or just budget)? What happens when you hit the limit? Well, you have to send a note to our personal Mr. Wolf who expedites raising your limit. Later I try to swing by and give them a commemorative sticker for leaning in (and destroying a rainforest maybe). That’s a joke - I’ve been meaning to order the stickers but haven’t gotten to designing the “Thanks for releasing the token kraken” stickers in honor of an old in-joke with Tito.

5. OK, that wasn’t exactly the prompt - it was (checks notes) “can you do an analysis of where all my tokens went? I just burned my whole monthly corporate allocation.” But I couldn’t pass up the opportunity to slip in a Dude, Where’s My Car? reference.

I love a great movie title - even if it doesn’t quite live up to others in the divine trinity of dumb/great names that includes Hot Tub Time Machine and Harold and Kumar Go to White Castle. It’s been a while, so I cannot tell you how it stacks up against the on-again, off-again contender for entry to that hallowed group - Cocaine Bear. Yes - I know most of these are actually not nearly as good as the name. Even if I felt that Harold and Kumar hit personally close to home based on my high school years of being the designated driver who often shuttled very enthusiastic friends in various altered states to White Castle (which was not in the gentlest of neighborhoods). Sorry - all grist for another post - or Substack.

6. Don’t bet against them. Seriously don’t.

7. Yes - I wrote $5000 before, but this was later in the day, and I’m trying to be accurate.

8. Just don’t speak to it like Pauly Shore. That is, I’ve learned, just a bridge too far. Just like 99.99999% of humanity (except for a very weird moment in time), LLMs are deeply allergic to Mr. Shore. It’s just science.

9. Yes, obviously that’s a joke. Amazon is barely paying for coffee in the break rooms, they’re definitely not covering your tattoo art.

A Random Walk Through Tech
