Three Types of Token Waste—And How to Eliminate Them


Architectural Decisions That Slashed Our AI Token Waste by 30%

The prompt was 4,200 tokens long. It didn’t need to be.

I was working with a mid-sized SaaS company last spring—a team building an internal knowledge assistant for their support staff—and their monthly API bill had quietly become a small crisis. Not catastrophic, but the kind of line item that starts attracting CFO attention. When I pulled up their actual prompt logs, the problem was obvious within about ten minutes: they were feeding the model the same boilerplate context on every single call, re-explaining the same business rules, re-attaching the same reference documents, over and over, thousands of times a day. They were paying to teach the model things it had already “learned” in the previous request.

That’s the core insight this article is built around: **most token waste isn’t random. It has a shape. And once you see the shape, you can architect around it.**

Here’s what a deliberate set of architectural changes—not model swaps, not prompt shortcuts, not vague “optimization”—actually looked like when we applied them systematically. The result was a 30% reduction in token consumption over six weeks, with no measurable drop in output quality. Let me show you how we got there.

The Waste Taxonomy

Before you can cut tokens, you have to categorize where they’re going. We broke consumption into three buckets, which I’ve started calling the **Three Rs of Token Waste**: Repetition, Redundancy, and Retrieval Sprawl.

*Repetition* is the easiest to spot. It’s the boilerplate—system prompt sections that never change, company policy text pasted in wholesale, formatting instructions that could live in a template. In this team’s case, a static 800-word policy document was being injected into every prompt. Every. Single. One.

*Redundancy* is subtler. It’s context that’s technically relevant but disproportionate to the task: for example, asking a model to summarize a ticket and handing it the full 3,000-word conversation thread when the last four exchanges contain everything it needs. The model can handle it, but you’re paying for the whole thread.

*Retrieval Sprawl* is the sneaky one. It shows up when teams implement RAG (Retrieval-Augmented Generation, which means pulling relevant documents from a database and injecting them into the prompt at runtime) without tuning how much they retrieve. The default “grab the top five chunks” setting sounds reasonable until you realize chunks three, four, and five are adding 1,800 tokens of marginally related text that isn’t moving the needle on answer quality.

Once you name the three buckets, you can measure them. And once you can measure them, you can attack them in order of impact.
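One way to make that measurement concrete is to tag each piece of a logged prompt with the bucket it belongs to and total the tokens per bucket. The sketch below is illustrative only: the `Segment` type, the bucket labels, and the whitespace-based `estimate_tokens` are assumptions made for this example, not the tooling the team used. A real tokenizer for your model would replace the estimate.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Segment:
    """One piece of a prompt, tagged with the waste bucket it falls into."""
    bucket: str  # e.g. "repetition", "redundancy", "retrieval", or "task"
    text: str


def estimate_tokens(text: str) -> int:
    # Crude stand-in: counts whitespace-separated words.
    # Assumption for illustration; use your model's tokenizer in practice.
    return len(text.split())


def tokens_by_bucket(calls: Iterable[List[Segment]]) -> Counter:
    """Sum estimated tokens per bucket across many logged calls."""
    totals: Counter = Counter()
    for segments in calls:
        for seg in segments:
            totals[seg.bucket] += estimate_tokens(seg.text)
    return totals


def rank_buckets(totals: Counter) -> List[Tuple[str, int, float]]:
    """Return (bucket, tokens, share of total), largest bucket first."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [(b, n, n / grand_total) for b, n in totals.most_common()]
```

Sorting by share gives the order of attack: the largest bucket is where a fix pays off first.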

The Fixes, Ranked

**Fix one: Static context externalization.** This one is almost embarrassingly simple, which is probably why teams skip it. Any context that doesn’t change between calls shouldn’t be pasted into the prompt verbatim. It should live either in the model’s fine-tuning or in a compressed system prompt that’s been aggressively edited for density. We took that 800-word policy document and distilled it into 140 words of structured, high-signal instructions. Same behavioral outcome. That’s 660 words gone, and since English text typically runs to somewhat more tokens than words, at least that many tokens saved per call, forever.

To be clear, fine-tuning isn’t always the right answer here—it introduces its own costs and maintenance overhead. But for stable, high-frequency context, it’s worth running the math. If a piece of context appears in 10,000 calls a month and costs 500 tokens each time, that’s 5 million tokens monthly just for one static block. The economics of a one-time fine-tuning run start looking very different.
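Running that math takes only a few lines. The sketch below reproduces the 10,000 calls at 500 tokens figure and adds a simple break-even calculation. The price per million tokens and the one-time cost are hypothetical placeholders, not real vendor pricing, and the calculation ignores the possibility that a fine-tuned model is billed at a different per-token rate.

```python
import math


def monthly_static_tokens(calls_per_month: int, tokens_per_call: int) -> int:
    """Tokens spent each month on one unchanging block of context."""
    return calls_per_month * tokens_per_call


def monthly_cost(tokens: int, price_per_million: float) -> float:
    """Cost of those tokens at a given input price (hypothetical rate)."""
    return tokens / 1_000_000 * price_per_million


def months_to_break_even(one_time_cost: float, monthly_savings: float) -> int:
    """Whole months until a one-time spend is paid back by the savings."""
    if monthly_savings <= 0:
        raise ValueError("monthly_savings must be positive")
    return math.ceil(one_time_cost / monthly_savings)


tokens = monthly_static_tokens(10_000, 500)  # 5,000,000 tokens per month
# Both figures below are made-up inputs for illustration only.
savings = monthly_cost(tokens, price_per_million=3.0)
payback = months_to_break_even(one_time_cost=300.0, monthly_savings=savings)
```

Plug in your own call volume and pricing; the point is that the comparison is a one-line division once the token count per block is known.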

**Fix two: Conversation window management.** Most production chat applications pass the full conversation history with every turn. That’s often necessary for coherence—but “full history” doesn’t have to mean “verbatim transcript.” We implemented a lightweight summarization layer that compressed older turns in the conversation into a rolling summary, keeping only the most recent three to four exchanges verbatim. The model retained context. The token count dropped. This single change accounted for roughly 40% of our total reduction.

The tricky part is that the summarization itself costs tokens too. You have to be intentional about when you trigger it (typically after a conversation exceeds a threshold length) and use a smaller, cheaper model for the summarization step. That asymmetry matters: a small model doing compression work so a large model can do reasoning work is often the right division of labor.
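A minimal sketch of that rolling-summary pattern follows. It assumes a `summarize` callable that wraps whatever cheap model you choose; that callable, the `RollingHistory` class, the thresholds, and the word-count token estimate are all hypothetical names and values for illustration, not the team’s implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    role: str
    text: str


def estimate_tokens(text: str) -> int:
    # Rough word count; an assumption standing in for a real tokenizer.
    return len(text.split())


@dataclass
class RollingHistory:
    # summarize(previous_summary, older_turns) -> new summary.
    # In practice this would call a small, cheap model (hypothetical here).
    summarize: Callable[[str, List[Turn]], str]
    keep_verbatim: int = 8      # 4 exchanges = 8 turns kept word for word
    trigger_tokens: int = 1500  # only compress once history passes this size
    summary: str = ""
    turns: List[Turn] = field(default_factory=list)

    def add(self, turn: Turn) -> None:
        self.turns.append(turn)
        verbatim = sum(estimate_tokens(t.text) for t in self.turns)
        if verbatim > self.trigger_tokens and len(self.turns) > self.keep_verbatim:
            older = self.turns[:-self.keep_verbatim]
            self.turns = self.turns[-self.keep_verbatim:]
            # Fold the older turns into the running summary.
            self.summary = self.summarize(self.summary, older)

    def as_prompt(self) -> str:
        """Summary of older turns first, then the recent turns verbatim."""
        lines = []
        if self.summary:
            lines.append(f"Summary of earlier conversation: {self.summary}")
        lines.extend(f"{t.role}: {t.text}" for t in self.turns)
        return "\n".join(lines)
```

The threshold check keeps short conversations from paying for a summarization call they do not need.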

**Fix three: RAG chunk tuning.** This required the most experimentation, but the payoff was real. We reduced the default retrieval from five chunks to two, then ran a two-week evaluation comparing answer quality scores (assessed by a human rater on a 200-question test set) between the old and new configurations. Quality held at the two-chunk setting for roughly 85% of query types. For the remaining 15%—complex, multi-part questions—we built a classifier that flags those queries and routes them to a three-chunk retrieval path.

That classifier is itself a small model call, which adds a token cost. But it’s a tiny fraction of what we were spending on unnecessary retrieval. The net is strongly positive.
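The routing logic itself is small. In the sketch below, `looks_complex` is a keyword heuristic standing in for the classifier, which in the setup described above is a small model call. The heuristic, the function names, and the assumption that chunks arrive already ranked by relevance are all illustrative.

```python
from typing import Callable, List, Sequence

DEFAULT_K = 2  # chunks for ordinary queries
COMPLEX_K = 3  # chunks for queries flagged as complex or multi-part


def looks_complex(query: str) -> bool:
    # Placeholder heuristic, NOT the real classifier: flags queries with
    # several questions, a conjunction, or unusual length.
    q = query.lower()
    return q.count("?") > 1 or " and " in q or len(q.split()) > 40


def choose_k(query: str, is_complex: Callable[[str], bool] = looks_complex) -> int:
    """Pick how many chunks to retrieve for this query."""
    return COMPLEX_K if is_complex(query) else DEFAULT_K


def retrieve(
    query: str,
    ranked_chunks: Sequence[str],
    is_complex: Callable[[str], bool] = looks_complex,
) -> List[str]:
    """Take the top-k chunks from a list already sorted by relevance."""
    return list(ranked_chunks[: choose_k(query, is_complex)])
```

Passing the classifier in as a parameter makes it easy to swap the heuristic for a model-backed check and to compare the two against the same test set.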

What We Didn’t Do

I want to be honest about the tradeoffs, because a lot of “optimization” content glosses over what you give up.

We did not aggressively compress the verbatim recent-turn context in conversations. Some teams go further than we did—summarizing everything, keeping almost no raw transcript. In testing, that started introducing subtle coherence errors, the kind where the model’s response is technically responsive but slightly off-tone, like it missed a nuance from two turns back. We decided the token savings weren’t worth the quality degradation.

We also didn’t move to a smaller base model, even though the cost difference is substantial. The team’s use case (nuanced support guidance, often involving ambiguous policy questions) genuinely required the reasoning capability of a frontier model. Swapping to a smaller model would have saved money rather than tokens (smaller models often have lower per-token pricing), but it would have required significant prompt engineering to compensate, and the output quality ceiling was lower. That’s a legitimate architectural decision for some teams. It wasn’t the right one here.

The honest version of “we cut token consumption 30%” is: we cut it 30% while preserving the output quality the team actually needed. Those two things have to be evaluated together.

The Bigger Frame

Token economics is still a young discipline. Most teams are in the “pay the bill and move on” phase, which is roughly where cloud computing was in the early 2010s, before FinOps became a real function with real practitioners. The parallel isn’t perfect, but it’s instructive. Compute costs didn’t get managed until someone decided to manage them deliberately, with tooling and measurement and accountability.

The teams that are going to win the next phase of AI cost efficiency aren’t the ones who find a clever prompt trick. They’re the ones who build measurement infrastructure first—logging token counts by call type, by user segment, by feature—so they can see the shape of their waste before they try to cut it. You can’t architect around a problem you haven’t mapped.
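As a starting point, that measurement infrastructure can be as plain as one JSON line per model call plus a function that totals tokens along any dimension. The field names below (`call_type`, `feature`, `user_segment`) are assumptions chosen to mirror the dimensions named above, not a standard schema.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def log_record(
    call_type: str,
    feature: str,
    user_segment: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> str:
    """Serialize one model call as a JSON line (hypothetical schema)."""
    return json.dumps(
        {
            "call_type": call_type,
            "feature": feature,
            "user_segment": user_segment,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
        }
    )


def totals_by(lines: Iterable[str], dimension: str) -> Dict[str, int]:
    """Total tokens (prompt + completion) grouped by one logged field."""
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        rec = json.loads(line)
        totals[rec[dimension]] += rec["prompt_tokens"] + rec["completion_tokens"]
    return dict(totals)
```

Because every record carries all three dimensions, the same log answers “which feature costs the most” and “which user segment costs the most” without re-instrumenting anything.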

Thirty percent sounds like a big number. In practice, it came from three relatively unglamorous decisions: edit the static context ruthlessly, summarize old conversation turns with a cheap model, and tune your retrieval to what you actually need. None of it required a new model, a new vendor, or a major engineering sprint.

It required looking at the logs.

That’s usually where the answer is.

About The Author

Cliff Worley

Cliff Worley is a keynote speaker and “Future Translator” who helps leaders and teams turn AI anxiety into action. Mentored early on by Daymond John and later Head of Portfolio Marketing at Kapor Capital, Cliff has spent his career making “the future” something people can actually use. He’s spoken for organizations like Amazon, Cisco, Uber, and Intel, and writes the AI Playtime newsletter — practical, jargon-free tools for leaders who’d rather build than wait and see.
