Cost: The Economics of Running Agents
This chapter builds the economics of running an agent on one premise: context is compute, so the input context an agent reprocesses each turn, not output generation, is the dominant cost driver. Four composable levers manage that spend (reduce input context, cache the stable prefix, route by model tier, and batch the non-urgent work), and they stack rather than compete. Cache economics are stated as ratios and the model ladder qualitatively, because the underlying pricing surface is volatile.
Chapter 24 surfaced the numbers; this chapter asks what they mean. Cost is the third operational surface, and it has a single organizing premise: the money is in the input. An agent does not spend most of its budget on what it writes; it spends it on the context it reprocesses every turn. Once you see that, four levers fall out, and they are not rivals competing for the same fix: they stack. This chapter states the premise, then each lever, then composes them into one cost discipline.
Context is compute
Start with where the money actually goes, because the intuition is usually wrong. It is tempting to picture an agent’s cost as the text it produces: the long answer, the generated code, the written report. But for an agent, generation is the small side of the ledger. Output tokens are individually pricier than input tokens, yet an agent reprocesses far more input than it ever writes, so the bill is dominated by the context the model has to read and re-read on every turn.
The premise has a first-party name. Context is “a critical but finite resource for AI agents” [Official] Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original, which makes it a budget, not a free input slot. The reason is mechanical: like people with limited working memory, LLMs have a finite attention budget “that they draw on when parsing large volumes of context.” [Official] Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original Every token in the window is a token the model spends attention parsing, and every token parsed is billed.
What makes this the cost story for agents — rather than for a one-shot prompt — is accumulation. An agent runs a loop: it calls a tool, the result lands in the context, it reasons over the now-larger context, calls another tool, and so on. The conversation history, the tool outputs, the system instructions — all of it is reprocessed each turn. So the input the agent pays for grows as the run goes on, and on a long trajectory the input side dwarfs anything the model writes back. That is the sense in which “context is compute”: the tokens you feed the model are the spend you engineer down.
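The accumulation can be made concrete with a small sketch. It assumes a simplified loop in which every turn re-reads the whole context and then appends a fixed number of tokens; the function name, its parameters, and the token counts in the usage note are illustrative assumptions, not measurements.

```python
def cumulative_input_tokens(prefix_tokens: int, tokens_added_per_turn: int, turns: int) -> int:
    """Total input tokens read across a run when each turn re-reads the full context.

    Simplifying assumption: the context grows by the same amount every turn.
    """
    total = 0
    context = prefix_tokens
    for _ in range(turns):
        total += context                   # the model reads the whole context this turn
        context += tokens_added_per_turn   # tool result and reply land in the context
    return total


if __name__ == "__main__":
    # Hypothetical run: 2,000-token prefix, 500 tokens added per turn, 10 turns.
    print(cumulative_input_tokens(2_000, 500, 10))
```

Under these assumed numbers the run reads 42,500 input tokens in total. If each turn wrote a hypothetical 200 output tokens, the whole run would write only 2,000, which is the sense in which the input side dwarfs the output side on a long trajectory.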
Prompt-cache economics: a read-vs-write asymmetry
If input context is the spend, a natural lever is the one that makes repeated input cheap. Prompt caching stores a prefix of the context server-side so it does not have to be reprocessed from scratch on the next turn, and its economics turn on a sharp asymmetry between writing the cache and reading it.
The two figures are first-party and worth stating exactly. Writing the 5-minute cache costs about 1.25× the base input price [Official] Prompt caching · AnthropicT1-official original — a one-time premium you pay to populate it. Reading from the cache (a cache hit) costs about 0.1× the base input price [Official] Prompt caching · AnthropicT1-official original — roughly a tenth. And the cache “has a 5-minute lifetime” that “is refreshed for no additional cost each time the cached content is used,” [Official] Prompt caching · AnthropicT1-official original so under sustained traffic the timer keeps resetting and the prefix stays warm for free.
The break-even falls straight out of those numbers. Because a hit costs roughly a tenth of the input price, “caching pays off after just one cache read for the 5-minute duration.” [Official] Pricing - Claude API Docs · AnthropicT1-official original You pay the 1.25× write once; the very first reuse already comes back at 0.1×, and every reuse after that is gravy. The design move this licenses is structural: stabilize a long shared prefix — system instructions, tool definitions, a fixed document set — so it is written to the cache once and then read many times across the run.
The ~10× gap between a cold read (full input price) and a warm read (0.1× of it) is the economic core of “context is compute.” It is what makes carrying a large, stable context affordable at all: the first turn pays to write it, and every turn after rides at a tenth of the price.
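The break-even can be checked with a few lines of arithmetic. The sketch below works in units of "one uncached read of the prefix at base input price" and uses the two quoted multipliers; the function names are hypothetical, and it assumes every reuse lands inside the cache lifetime so that no second write is needed.

```python
CACHE_WRITE_MULT = 1.25  # 5-minute cache write, relative to base input price
CACHE_READ_MULT = 0.10   # cache hit, relative to base input price


def prefix_cost_uncached(uses: int) -> float:
    """Relative cost of sending the same prefix `uses` times with no caching."""
    return float(max(uses, 0))


def prefix_cost_cached(uses: int) -> float:
    """Relative cost with caching: one write, then a cache read on every later use.

    Assumes every reuse arrives before the cache expires.
    """
    if uses <= 0:
        return 0.0
    return CACHE_WRITE_MULT + CACHE_READ_MULT * (uses - 1)


if __name__ == "__main__":
    for uses in (1, 2, 10):
        print(uses, prefix_cost_uncached(uses), round(prefix_cost_cached(uses), 2))
```

With a single use, caching costs more (1.25 against 1.0), because the write premium is never recovered. At two uses, one write plus one read comes to about 1.35 against 2.0 uncached, which is the "pays off after just one cache read" claim. By ten uses the cached prefix costs about 2.15 against 10.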
The multi-agent token multiplier — a modeling input, not a verdict
The second thing the cost surface has to handle is the one the orchestration chapters deferred: multi-agent systems are expensive, and the honest question is when that expense is worth paying. Anthropic’s first-party measurement on its own research system is concrete: “agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original A multi-agent topology burns through tokens fast, because each sub-agent runs its own context window — and on the cost surface, that ~15× is a number you have to plan around.
But the same measurement reframes the burn. On their benchmark, “token usage by itself explains 80% of the variance” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original in performance, which makes token spend the single strongest predictor of how well the system did. Read together, the two findings turn the multiplier from an indictment into a lever: if spending more tokens is the strongest predictor of doing better, then on a high-value, genuinely parallelizable task the 15× spend can be the rational choice, not waste. The cost-modeling question is not “is multi-agent wasteful?” but “does this task’s value clear the 15× multiplier?”
This is also why this chapter, not the orchestration ones, is the honest home for the ~15×. Whether to build a multi-agent topology is an orchestration question; what it costs and when that cost is justified is an economics question — and economics is the surface that can hold “expensive” and “worth it” in the same hand without collapsing into a slogan either way.
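As a rough illustration of the modeling question, the check below compares a task's value against the cost of a chat-sized run scaled by a token multiplier. Everything here is an assumption made for the sketch: the function name, the idea of expressing value and cost in the same unit, and the use of the ~15× and ~4× figures as defaults. Those figures are one workload's measurements, so a real model should substitute its own.

```python
def clears_multiplier(task_value: float, chat_cost: float, multiplier: float = 15.0) -> bool:
    """Return True if the task's value exceeds the chat-sized cost scaled by `multiplier`.

    `task_value` and `chat_cost` must be in the same unit (for example, dollars).
    The default multiplier is the ~15x figure quoted above, used here only as a
    placeholder for a number measured on your own workload.
    """
    return task_value > chat_cost * multiplier


if __name__ == "__main__":
    # Hypothetical numbers: a chat-sized run costs 0.50 units.
    print(clears_multiplier(task_value=10.0, chat_cost=0.50))                 # multi-agent
    print(clears_multiplier(task_value=5.0, chat_cost=0.50))                  # multi-agent
    print(clears_multiplier(task_value=5.0, chat_cost=0.50, multiplier=4.0))  # single agent
```

The point of keeping the multiplier a parameter is that the decision is sensitive to it: the same task can fail the multi-agent check and still clear the single-agent one.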
Cheapening the spend: model tiers and the Batch API
The levers so far, trimming the context and caching the stable prefix, reduce how much input you pay full price for. The last two reduce the price of what you do spend, without touching the architecture at all.
The first is model-tier routing. Anthropic’s guidance is to “Choose Haiku for simple tasks, Sonnet for most production workloads, and Opus for the most complex reasoning.” [Official] Pricing - Claude API Docs · AnthropicT1-official original The tiers form a cost-and-capability ladder — Haiku < Sonnet < Opus, cheapest and least capable up to most capable and most expensive — and the lever is to route the cheapest model that clears each subtask. A classification step, a quick extraction, a routine summarization does not need the top tier; reserve Opus for the reasoning that genuinely requires it. The point is per-subtask: one agent run can dispatch cheap work to a cheap model and keep the expensive model for the part that earns it.
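Per-subtask routing can be sketched as a lookup from subtask kind to the lowest tier assumed to clear it. The subtask names, the mapping, and the `route` helper are all hypothetical; the tiers are represented only as an ordering, with no model identifiers or prices, and which tier actually clears a given subtask is something to establish by evaluation rather than by a table like this one.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Cost-and-capability ladder, ordered cheapest to most expensive."""
    HAIKU = 1
    SONNET = 2
    OPUS = 3


# Hypothetical mapping: the lowest tier assumed to clear each subtask kind.
MIN_TIER_FOR_SUBTASK = {
    "classify": Tier.HAIKU,
    "extract": Tier.HAIKU,
    "summarize": Tier.HAIKU,
    "draft_reply": Tier.SONNET,
    "complex_reasoning": Tier.OPUS,
}


def route(subtask_kind: str, default: Tier = Tier.SONNET) -> Tier:
    """Pick the cheapest tier assumed sufficient; unknown kinds fall back to `default`."""
    return MIN_TIER_FOR_SUBTASK.get(subtask_kind, default)


if __name__ == "__main__":
    for kind in ("classify", "draft_reply", "complex_reasoning", "something_new"):
        print(kind, route(kind).name)
```

Falling back to the middle tier for unknown subtask kinds mirrors the quoted guidance that Sonnet suits most production workloads; a stricter design could fall back to the top tier instead and trade cost for safety.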
The second is the Batch API, for work that is not time-sensitive. It “allows asynchronous processing of large volumes of requests with a 50% discount on both input and output tokens” [Official] Pricing - Claude API Docs · AnthropicT1-official original — corroborated on the feature’s own documentation, which describes processing large volumes “while reducing costs by 50% and increasing throughput.” [Official] Batch processing — Claude Docs · AnthropicT1-official original The trade is latency for price: you give up real-time response and get half off. An overnight evaluation run, a bulk re-classification, a backfill — none of these needs to answer in seconds, and all of them halve in cost by going through the batch path.
The two levers are orthogonal to caching and to each other: tier-routing changes which model parses the input, batching changes when the work runs, caching changes how much of the input is reprocessed. That orthogonality is what lets them stack.
The four levers, composed
The chapter’s payoff is that these are not four competing fixes you choose between — they are four moves you apply together, each on a different part of the spend.
- Reduce the input context (the driver). Spend less to begin with: trim the prompt, keep tool outputs lean, do not carry context the turn does not need. This is the lever that attacks the premise directly.
- Cache the stable prefix. Make the input you do carry cheap to reprocess — write it once at ~1.25×, read it at ~0.1× thereafter, and watch the read-to-creation ratio to confirm it is amortizing.
- Model the multi-agent burn against value. Decide whether to spend the ~15× at all — pay it only when the task’s value clears the multiplier.
- Route and batch to cheapen what’s left. Send each subtask to the cheapest model that clears it (Haiku < Sonnet < Opus), and push non-urgent work through the 50% Batch API.
They compose because they act on different variables — tier-routing per subtask, caching on repeated context, batch on deferrable work. One boundary is worth stating: the Batch API is a 50%-off asynchronous path, so it applies to work you can defer (offline evals, bulk jobs), not to a live interactive turn. A cost-disciplined agent runs every applicable lever against one bill — caching and tier-routing on its live turns, batch on whatever can go async.
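The way caching and batching combine on the input side can be sketched as a relative price. The sketch assumes the two discounts multiply, treats the cache hit as certain (on the batch path it may not be), and encodes the boundary stated above by rejecting a batched live turn. The function name and its parameters are hypothetical, and tier routing is left out because the ladder is stated only qualitatively.

```python
CACHE_READ_MULT = 0.10  # cache hit, relative to base input price
BATCH_MULT = 0.50       # Batch API discount


def relative_input_price(*, cache_hit: bool, batched: bool, interactive: bool) -> float:
    """Relative price of reading input tokens, in units of the base input price.

    Assumes the cache and batch discounts stack multiplicatively.
    """
    if batched and interactive:
        # The batch path is asynchronous, so it cannot serve a live turn.
        raise ValueError("a live interactive turn cannot go through the batch path")
    price = CACHE_READ_MULT if cache_hit else 1.0
    if batched:
        price *= BATCH_MULT
    return price


if __name__ == "__main__":
    print(relative_input_price(cache_hit=True, batched=False, interactive=True))   # live turn, warm cache
    print(relative_input_price(cache_hit=False, batched=True, interactive=False))  # deferred job, cold
    print(relative_input_price(cache_hit=True, batched=True, interactive=False))   # deferred job, warm
```

Under these assumptions a warm live turn reads its prefix at about a tenth of base price, a cold batched job at half, and a warm batched job at about a twentieth, which illustrates why the levers are described as stacking rather than competing.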
Quick reference
- The premise: input context, not output, is the cost driver — context is a finite resource billed by the attention spent parsing it, and it accumulates across an agent’s loop. Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original
- Cache asymmetry: a write costs ~1.25× base input, a read ~0.1×, with a 5-minute free-refresh TTL — so “caching pays off after just one cache read.” Prompt caching · AnthropicT1-official original Pricing - Claude API Docs · AnthropicT1-official original
- Cache health = cost signal: a high read-to-creation ratio means caching is working; persistently high creation means you keep paying the write premium. How Claude Code uses prompt caching - Claude Code Docs · AnthropicT1-official original
- Multi-agent burn: Anthropic measured ~4× (single agent) and ~15× (multi-agent) the tokens of a chat, with “token usage by itself explains 80% of the variance” — a cost-modeling input on their workload, not a universal constant or a flat anti-pattern. How we built our multi-agent research system · Anthropic (2025)T1-official original
- Cheapen-the-spend levers: route the cheapest model that clears each subtask (Haiku < Sonnet < Opus, qualitative — no dollars, recheck after 2026-06-26) and push non-urgent work through the 50% Batch API. Pricing - Claude API Docs · AnthropicT1-official original Batch processing — Claude Docs · AnthropicT1-official original
- The playbook: reduce input → cache the stable prefix → model the burn against value → route and batch. These are moves that stack, all against one bill.
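The cache-health signal in the list above can be computed from per-request usage records. This sketch assumes each record is a mapping that exposes cache-read and cache-creation token counts under the keys `cache_read_input_tokens` and `cache_creation_input_tokens`; the helper name and the sample records are hypothetical.

```python
from typing import Iterable, Mapping


def read_to_creation_ratio(usages: Iterable[Mapping[str, int]]) -> float:
    """Ratio of tokens read from cache to tokens written to cache across a run.

    Higher means the write premium is being amortized over more reads.
    Returns infinity if there were reads but no writes in the sample,
    and 0.0 if the cache was never touched.
    """
    reads = 0
    writes = 0
    for usage in usages:
        reads += usage.get("cache_read_input_tokens", 0)
        writes += usage.get("cache_creation_input_tokens", 0)
    if writes == 0:
        return float("inf") if reads > 0 else 0.0
    return reads / writes


if __name__ == "__main__":
    # Hypothetical three-turn run: one write of a 2,000-token prefix, then two hits.
    run = [
        {"cache_creation_input_tokens": 2_000, "cache_read_input_tokens": 0},
        {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 2_000},
        {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 2_000},
    ]
    print(read_to_creation_ratio(run))
```

For the sample run the ratio is 2.0. A ratio that stays near or below 1 over many turns suggests the prefix keeps being rewritten, for example because it is not actually stable or because traffic is too sparse to keep the cache warm.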
Practice
Exercise solutions
The four levers are (1) reduce input context, (2) prompt caching, (3) model-tier routing, and (4) the Batch API. Each acts on a different variable: lever 1 reduces how much input is reprocessed each turn (it attacks the spend at the source); lever 2 reduces the price of reprocessing the input you do carry (write once at ~1.25×, read at ~0.1×); lever 3 changes which model parses the input, per subtask (route the cheapest tier that clears the task); lever 4 changes when the work runs (non-urgent work goes async for a 50% discount). Because they act on different variables (quantity of input, price per reprocess, model, and timing), they do not contend for the same fix. They multiply through independently, and some compose further: the cache multipliers apply to the base input price of whichever tier a subtask is routed to, and the batch discount then applies to whatever work runs async, where cache hits are best-effort rather than guaranteed. The lever that attacks the “context is compute” premise most directly is reduce input context: the premise says input is the dominant cost, and trimming the input lowers the very quantity the other three only make cheaper to carry, route, or schedule. It is the cheapest move precisely because it removes spend rather than discounting it.
The bare ~15× figure tells you only that a multi-agent system is expensive — it burns roughly fifteen times the tokens of a chat. On its own that reads as an indictment (“multi-agent is wasteful”). The “token usage by itself explains 80% of the variance” finding adds the missing half: on that benchmark, how much a system spent was the single strongest predictor of how well it performed. Put together, the two say the burn is large and that the burn buys performance — so the spend is not waste but a lever you pull when a task’s value justifies it; the modeling question becomes “does this task clear the 15×?” rather than “avoid multi-agent.” A reader could misuse the ~15× by treating it as a flat verdict that multi-agent should always be avoided; and could misuse the “80%” by concluding one should always spend more tokens. The honest qualifier the chapter attaches to both is that they are measurements on one workload — the ~15× is one first-party datapoint from Anthropic’s own multi-agent research system, and the “80%” is specific to the BrowseComp benchmark — so neither is a universal constant, and a different topology or task mix will have different numbers. The right move is to gather your own before betting on a figure.
A worked example. Take a customer-support triage agent that reads a ticket, searches a knowledge base over several tool calls, and drafts a reply. Where the bill goes: the drafted reply is a few hundred output tokens, but each turn re-feeds the system prompt, the tool definitions, the growing tool-result history, and the ticket, so the input it reprocesses dominates, and it grows as the search deepens. That is the bill as an input problem. Lever 1 (reduce input): stop carrying full knowledge-base articles in the window once they have been read; keep only the extracted snippet the draft needs. Lever 2 (cache): the system prompt and tool definitions are a stable prefix, so write them to the cache once and collect ~0.1× reads on every subsequent turn; confirm via a high read-to-creation ratio. The multi-agent check (model the burn): if this is a single agent (~4×), there is no multi-agent topology to justify, but if triage fanned out into parallel sub-agents per knowledge source, ask whether the support volume’s value clears the ~15× before keeping it. Levers 3 and 4 (route and batch): the initial “which category is this ticket?” classification can run on Haiku, reserving the top tier for the draft, and any overnight bulk re-tagging of old tickets goes through the 50% Batch API. The value of the exercise is seeing that the largest, cheapest win (trimming carried articles, caching the prefix) sits on the input side, exactly where the premise says the money is, long before any architecture change.
For the discipline. Hard-coded per-MTok dollar prices would go stale the moment the pricing page changes — and the chapter’s own sourcing flags that page as volatile (recheck after 2026-06-26). A printed dollar figure in a reference book becomes a quietly-wrong number that readers trust precisely because it looks precise; worse, a stale absolute price corrupts any cost model built on it. A precise input-to-output ratio has the same defect with an added one: no first-party source asserts such a ratio, so printing one would be inventing a constant and laundering it as fact — and any real ratio is a property of one workload, not a universal. Ratios and an ordering, by contrast, are the durable part: the ~10× cached-versus-uncached gap and the Haiku < Sonnet < Opus ladder survive a pricing change, because a repricing typically moves the absolute levels while preserving the asymmetry and the ordering. The other side. A reader genuinely loses the ability to compute an absolute budget from the chapter alone — “ratios” cannot tell you whether next month’s bill is $40 or $4,000, only how the levers move it. The responsible recovery is explicit in the chapter’s own discipline: when a real cost model needs absolute figures, fetch them live from the pricing surface at the moment you build the model, treat them as that-day’s volatile numbers, and re-verify on the cadence the volatility implies — rather than carrying a remembered or book-printed price into a decision. The book’s job is to teach the shape of the economics that survives repricing; the live pricing page’s job is to supply the day’s absolute numbers.