⌕
Agentic Systems Design Volume 1 · Environment & Context Engineering
All chapters

Part 1

  1. Ch1 Agent = Model + Harness
  2. Ch2 Beyond Autocomplete: The Environment & Context Discipline
  3. Ch3 Repo & Doc Design for Agents
  4. Ch4 The Instruction Layer: CLAUDE.md & AGENTS.md
  5. Ch5 Skills & Progressive Disclosure
  6. Ch6 Guardrails, Permissions & Reversibility
  7. Ch7 Environments at Scale: Large Codebases & Monorepos
  8. Ch8 Context Rot: Why Windows Degrade
  9. Ch9 Context Assembly: Engineering the Window
  10. Ch10 Memory: Persisting Context Across Sessions
  11. Ch11 Designing the Whole: Environment + Context as One System

Part 2

  1. Ch12 Beyond One Agent, One Tool
  2. Ch13 Build vs. Buy: Choosing a Harness
  3. Ch14 Tool Minimization: Subtract First
  4. Ch15 MCP: Designing External Capability
  5. Ch16 Shaping Input — The Prompting Craft
  6. Ch17 Shaping Output — Structured & Reliable
  7. Ch18 Sub-Agents: The Context-Isolation Primitive
  8. Ch19 Multi-Agent: Coordinating Many
  9. Ch20 Composing Tools & Orchestration: The Two Axes as One System

Part 3

  1. Ch21 Measuring & Operating Agents: The Discipline
  2. Ch22 Evaluating a Prompt: The Four-Step Loop
  3. Ch23 Evaluating an Agent: Harnesses, Suites & the Judge
  4. Ch24 Observability: Seeing What the Agent Did
  5. Ch25 Cost: The Economics of Running Agents
  6. Ch26 Human-in-the-Loop: Keeping a Human in Control
  7. Ch27 Security: The Adversarial-Input Layer
  8. Ch28 Operating the Whole: Eval + Ops as One Loop
Part 1 Chapter 1 Last verified 2026-06-13 Fresh

Agent = Model + Harness

The introduction. This book engineers the two layers around the model — the environment an agent acts in and the context it reasons over — and this chapter grounds that thesis in the frame the book stands on — an agent is a model plus the deterministic harness that wraps it. The three layers, the components, the nested loop, the book's map, and what it leaves to companion volumes.

Volatility: architectural-pattern
Tools compared: claude-codecross-tool
Before you start: You have built or used at least one agent — a coding agent, a chatbot with tools, a workflow. No Claude-specific knowledge assumed.
You will learn
  • The thesis of this book — what turns a model into an agent is engineering the two layers around it: the environment it acts in and the context it reasons over
  • The frame that makes that precise — Agent = Model + Harness — and the three layers a harness owns (environment, context-assembly, context)
  • The components every harness wires together, and the nested loop that drives them
  • The map of this book, and what it deliberately leaves to companion volumes

What turns a model into an agent is not the model — it is the engineering of the two layers around it: the environment it acts in, and the context it reasons over. That is the subject of this book. This opening chapter states the thesis and grounds it in the frame the book stands on — Agent = Model + Harness — then maps the chapters and marks the scope. It defines vocabulary and direction; the next chapter argues the case, and the rest build it.

The frame: an agent is a model plus a harness

Start with the distinction that organizes everything else. An agent, in Anthropic’s framing, is a system where models “dynamically direct their own processes and tool usage.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original The working shorthand is tighter still: agents are “LLMs autonomously using tools in a loop.” Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original That is the model side — the source of the agent’s autonomy.

The harness is everything else: the deterministic code wrapped around the model. Claude Code, for instance, “provides the tools, context management, and execution environment that turn a language model into a capable coding agent.” [Official] How Claude Code works · AnthropicT1-official original A general-purpose harness such as the Claude Agent SDK is one concrete instance of that wrapper. Effective harnesses for long-running agents · Justin Young (2025)T1-official original

Concept · Harness

The deterministic code around the model — its tools, context management, execution environment, and control loop. It is not the model, and not a fixed pipeline. “Harness engineering” is the discipline of building it.

The boundary that makes the frame sharp is the contrast with a workflow, where models and tools are instead “orchestrated through predefined code paths.” Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original A workflow’s control flow is fixed in code; an agent’s control flow is driven by the model. So the same three ingredients — a model, some tools, some glue — give you a workflow or an agent depending on who decides what happens next.

Key idea

Agent = Model + Harness. The model supplies autonomy; the harness supplies tools, context, environment, and a control loop. Designing an agent is, almost entirely, designing the harness.

This is why this book is scoped the way it is: the model is a given (you choose a capable one and prompt it well — prompting is its own discipline, out of scope here), and of the harness around it, this book develops the two layers where most of the leverage lives — the environment and the context.

The harness owns three layers

A harness decomposes into three layers, and its defining job is owning the boundary between them.

  • Context — the assembled window the model reasons over. It is “a critical but finite resource for AI agents” [Official] Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original — a budget, not free memory.
  • Environment — the substrate the agent acts in: the repository, the filesystem, scripts, a running process. A long-running harness “asks the model to set up the initial environment” before work begins. Effective harnesses for long-running agents · Justin Young (2025)T1-official original
  • Context-assembly — the boundary between them. Rather than dumping the environment into the window, agents keep lightweight references and “use these references to dynamically load data into context at runtime using tools.” Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original
Concept · The three layers

Environment (large, durable substrate) → context-assembly (what crosses, and when) → context (small, finite window). The harness is the artifact that owns the assembly boundary — deciding what enters the window, and when.

Owning that boundary is what makes a harness a harness. The mechanics of assembly — caching, compaction, just-in-time loading — are deep enough to be their own chapter (the context-assembly chapter, later in this book); here we only name the three layers and locate the boundary the harness controls.

[Note]

“Context-assembly” as a named third layer is partly this book’s framing, laid over the just-in-time boundary the sources describe.

The components a harness wires together

Zoom into the harness and it is a parts list. Shipping harnesses name the same parts. The Claude Agent SDK gives you “the same tools, agent loop, and context management that power Claude Code” [Official] Agent SDK overview · AnthropicT1-official original ; tools are “the primary building blocks of execution for your agent” Building agents with the Claude Agent SDK · Thariq Shihipar et al. (2025)T1-official original ; and the harness can “spawn specialized agents to handle focused subtasks” Agent SDK overview · AnthropicT1-official original that “use their own isolated context windows.” Building agents with the Claude Agent SDK · Thariq Shihipar et al. (2025)T1-official original

That this is a real taxonomy, and not one vendor’s idiosyncrasy, shows up cross-vendor:

Convergence claude-codecross-tool

An independent framework, LangGraph, names the same parts — both “short-term working memory for ongoing reasoning and long-term memory across sessions,” LangGraph overview · LangChainT2-release-notes original and a durable control loop whose agents “persist through failures and can run for extended periods, resuming from where they left off.” LangGraph overview · LangChainT2-release-notes original First-party Anthropic primaries and an independent vendor converge on the same components list. [Convergence]

So a harness wires together, at least: a tool interface, context/memory management, a control loop with stop conditions, and sub-agent orchestration — and, by extension, guardrails/permissions and observability. Each part is deep-diveable on its own; this chapter is the parts diagram, not the parts.

[Caveat]

Of these parts, this book develops the environment, context/memory, and guardrails. The tool interface, orchestration, and observability are named here but left to companion volumes.

The nested loop: an inner cycle inside an outer one

The harness runs a nested loop, and knowing which loop you are designing is half of harness engineering.

The inner loop is the model’s own cycle. Claude Code’s “agentic loop is powered by two components” How Claude Code works · AnthropicT1-official original — the model reasoning and the tools acting — repeating until the task is done. The internals of that cycle (how a turn is structured, how tool results return) are a companion-volume concern; here it is enough that an inner loop exists and the model drives it.

The outer loop is the harness wrapping that cycle. A long-running harness exists to “bridge the gap between coding sessions” Effective harnesses for long-running agents · Justin Young (2025)T1-official original so the model can make incremental progress across many context windows.

Where the control lives

Continue-or-stop, compaction, carrying artifacts forward, dispatching a sub-agent — these are all outer-loop decisions the harness makes around an inner loop it does not reason inside. That is precisely why harness engineering is a distinct discipline from prompt engineering: prompt engineering shapes the inner loop’s reasoning; harness engineering builds the outer loop.

Locating a design decision in the frame Worked example

A reader asks: “My agent re-reads the same large config file on every turn and blows its context budget. Where’s the fix?”

Walk the frame:

  • Not the model — the model isn’t choosing to waste tokens; it has no other way to see the file.
  • It’s a context-assembly problem — the file lives in the environment; the harness is loading all of it into context instead of keeping a reference and loading on demand.
  • The lever is an outer-loop decision (what the harness puts in the window), realized through the context/memory component.

The frame turned a vague complaint into a located design decision — before naming a single API. That is what the frame is for.

How the vocabulary settled

The frame above is recent. The words arrived on a short, dated arc, and it is worth seeing the shape because the vocabulary is still moving.

  • 2025-06-17 — Andrej Karpathy popularizes the “agents” framing in a widely-cited keynote: “This is the Decade of Agents.” [Practitioner] Andrej Karpathy: Software in the Age of AI · Andrej Karpathy (Latent Space transcript-mirror) (2025)T3-practitioner original
  • 2025-11-26 — Anthropic adopts “harness” in an official engineering venue, describing a “long-running agent harness” working “across many context windows.” Effective harnesses for long-running agents · Justin Young (2025)T1-official original
  • 2026-02-05 — Mitchell Hashimoto names the discipline: “I’ve grown to calling this ‘harness engineering.’” [Practitioner] My AI Adoption Journey · Mitchell Hashimoto (2026)T3-practitioner original
  • 2026-03-10 — LangChain gives the formula its cleanest published form: “Agent = Model + Harness.” The Anatomy of an Agent Harness · Vivek Trivedy (2026)T2-release-notes original

The shape is the story: the concept (autonomous agents) predates a word for its wrapper (harness) by months, and the compact formula arrives last, once the wrapper has a name.

[Caveat]

This is dated provenance, not a definition. Hashimoto himself notes there is no industry-accepted term yet — the field has not converged on the vocabulary.

The shape of this book

This book takes two of the harness’s layers — the environment and the context — and develops them as one discipline. The chapters move from the substrate outward to the window, and inside the window from the problem to the response:

Environment — the substrate the agent acts in.

  • Repo & doc design — making the repository high-signal to read and self-correcting to act in.
  • The instruction layer — the always-loaded config file as a context budget, not documentation.
  • Skills & progressive disclosure — procedural knowledge that loads only when relevant.
  • Guardrails, permissions & reversibility — expressing intent in policy and containing failure in mechanism.
  • Environments at scale — bounding what the agent must load when the repo is too big to hold.

Context — the window the model reasons over.

  • Context rot — the evidence that long windows degrade, and why that is the problem the rest answers.
  • Context assembly — the engineering response: caching, compaction, just-in-time loading, budgeting the window.
  • Memory — persisting context across sessions, and the anti-patterns that reintroduce the rot.
The Part's shape: the environment (E1–E5) feeds the context-assembly boundary, which decides what crosses into the context (C1–C3). The C1→C2 arrow is the problem→response spine of the context half.Three columns. Left: an Environment box listing E1 Repo design, E2 CLAUDE.md, E3 Skills, E4 Guardrails, E5 At scale. Center: a context-assembly boundary labeled 'what crosses, and when'. Right: a Context group of three boxes — C1 Rot, C2 Assembly, C3 Memory. Thick arrows flow from Environment through the boundary to each context box; a curved arrow runs from C1 to C2 labeled 'problem to response'.
The Part's shape: the environment (E1–E5) feeds the context-assembly boundary, which decides what crosses into the context (C1–C3). The C1→C2 arrow is the problem→response spine of the context half.

A closing chapter composes all of it into one design workflow.

Key idea

The frame’s value is the clean cut: any harness concern is a property of the model, the harness’s loop, its layers, or its components. This book develops two of the layers — environment and context; the next chapter argues why they are the most underappreciated, highest-leverage thing an architect designs.

What this book leaves to companion volumes

The frame names more of the harness than one book should carry. The rest is deliberately out of scope here — each is planned as a companion volume in the series, and this book points outward to it rather than pretending to deliver it:

  • The inner loop’s internals — how a turn and its tool results are structured.
  • The Agent SDK’s surface — the concrete harness instance whose API we cited above.
  • Sub-agent orchestration — the isolation primitive, orchestrator–worker topologies, and when not to use them.
  • Tools & MCP — designing the tool interface and the protocol that serves it.
  • Build vs. buy — whether to configure an existing harness or build your own.
  • Evaluation & operations — measuring and running agents in production.

These are companion concerns, not missing chapters: this book stands on its own for the environment and the context.

What is still settling

This chapter captures a frame mid-crystallization, so a few honest limits travel with it.

  • The vocabulary is young. “Harness” entered Anthropic’s official vocabulary only in late 2025; “harness engineering” and “Agent = Model + Harness” are 2026 coinages. Expect the terms to keep shifting, and re-check the landscape rather than treating today’s words as settled.
  • The definitional spine is single-vendor-dense. The definitions of “agent” and “harness” rest largely on Anthropic’s framing. The one independent corroboration here (LangGraph) agrees on the components list, not on the definitions — so do not over-claim convergence beyond the parts list.
  • One provenance anchor is a transcript, not a primary. Karpathy’s line is quoted from a published transcript of a spoken keynote, faithful to that mirror and the talk date, but it is “Karpathy said X in a talk,” not “Karpathy wrote X.”
Mistaking the frame for a how-to

This chapter defines vocabulary and direction; it does not tell you how to build the inner loop, design a tool, or choose a topology — several of those are companion-volume concerns. Reaching for this chapter as an implementation guide is a category error: its job is to make the rest of this book’s moves legible, and to mark where the book’s scope ends.

Quick reference

  • Agent = Model + Harness. The model supplies autonomy; the harness supplies everything around it.
  • Agent vs. workflow: an agent’s control flow is model-driven; a workflow’s is fixed in code.
  • Three layers: environment (substrate) → context-assembly (the boundary) → context (the finite window). The harness owns the boundary.
  • Components: tool interface, context/memory, control loop, sub-agent orchestration (+ guardrails, observability).
  • Nested loop: the model’s inner reason→act→observe sits inside the harness’s outer cross-session orchestration, where control lives.
  • This book’s scope: the environment and the context — the two highest-leverage layers; the rest is named here and left to companion volumes.

Practice

Exercise

Take an agent you have used or built. Name its harness’s four core components (tool interface, context/memory, control loop, sub-agent orchestration). Which one is most primitive — i.e., barely a component at all? What does that tell you about where it sits on the workflow↔agent spectrum?

Practice ◆◆◆◇

Pick one recurring failure in an agent you run (re-reading files, losing the thread across sessions, over-calling a tool). Locate it in the frame: is it a model problem, an inner-loop problem, a layer-boundary problem, or a component problem? Write one sentence naming the layer/component and the outer-loop decision that would address it — without yet choosing an API. The point is to feel the frame do the locating.

Exercise solutions

Solution ↑ Exercise

For most coding agents the control loop is the richest component (stop conditions, retries, compaction, dispatch) and the sub-agent orchestration is often the most primitive — frequently absent entirely (a single-threaded agent). An agent with no orchestration and a thin, fixed control loop is sliding toward the workflow end of the spectrum; the more the model (not predefined code) decides what happens next across a rich outer loop, the more it is an agent in the sense this chapter defines.

Part 1 Chapter 2 Last verified 2026-06-13 Fresh

Beyond Autocomplete: The Environment & Context Discipline

The argued opener for this book. The discipline that turns a model into an agent is the engineering of the two layers around it — the environment it acts in and the context it reasons over — and it is the most underappreciated, highest-leverage thing an architect designs.

Volatility: architectural-pattern
Tools compared: claude-codecross-tool
Before you start: The frame and the book's map from the introduction — Agent = Model + Harness, and its three layers (environment, context-assembly, context).
You will learn
  • Why the difference between autocomplete and an agent lives almost entirely in two of the harness’s layers — the environment and the context
  • Why this is the underappreciated discipline: the model gets the attention, but the leverage is in what you put around it
  • The two halves as one subject — engineering the substrate the agent acts in, and engineering the window it reasons over
  • What is still settling about the discipline, and how to read the chapters that follow

The introduction drew the frame and stated the thesis: an agent is a model plus a harness, and the leverage is in two of the harness’s layers — the environment and the context. This chapter makes the case for that thesis — why those two layers, and not the model, are where the discipline of building agents actually lives. It argues; it does not yet build.

What autocomplete cannot do

Start with the gap this book is about. Code completion suggests the next token from the surrounding text; it has no autonomy. An agent, in Anthropic’s framing, is a system where models “dynamically direct their own processes and tool usage.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original The working shorthand is tighter: agents are “LLMs autonomously using tools in a loop.” Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original

The model supplies that autonomy. But a model alone, prompted in a vacuum, is still close to autocomplete — it can only act on what is in front of it. What makes it an agent is everything the harness arranges around the model: the substrate it can act in, and the window it gets to reason over. Claude Code, for instance, “provides the tools, context management, and execution environment that turn a language model into a capable coding agent.” [Official] How Claude Code works · AnthropicT1-official original

Key idea

The model supplies autonomy; the environment and the context supply everything that autonomy can act on. So the agent’s quality is bounded less by the model than by those two layers — which is why engineering them, not choosing or prompting the model, is the discipline this book develops.

Two layers, one discipline

The frame named three layers; two of them are the agent’s whole relationship with the world.

  • Environment — the substrate the agent acts in: the repository, the filesystem, scripts, a running process. A long-running harness even “asks the model to set up the initial environment” before work begins. Effective harnesses for long-running agents · Justin Young (2025)T1-official original Shape the environment well and the agent reads high signal and gets honest feedback; shape it poorly and no amount of prompting recovers.
  • Context — the assembled window the model reasons over. It is “a critical but finite resource for AI agents” [Official] Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original — a budget, not free memory. What you spend it on, and what you leave out, decides what the agent can actually do on any given turn.

These read as two topics — “set up the repo” and “manage the window” — but they are one discipline seen from two ends. The environment is the large, durable store of everything the agent could use; the context is the small, finite slice it does use on a turn; and the harness’s defining job is owning the boundary between them, keeping “lightweight references” and using them to “dynamically load data into context at runtime using tools.” Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original

Concept · Environment and context as one subject

Environment engineering maximizes the signal available to load and the feedback available to act on. Context engineering decides what of that actually crosses into the finite window, and when. You cannot do one well in isolation: a high-signal environment is wasted if assembly drags the wrong slice in, and the smartest assembly cannot surface signal an unengineered environment never made legible.

[Note]

Treating environment and context as a single discipline is this book’s framing, laid over the layers the sources describe. The sources name the layers; the unification is the argument.

Why it is underappreciated

The model gets the headlines — new versions, benchmarks, prompting tricks. That is the inner loop, and it matters. But the model is, increasingly, a given: you choose a capable one and prompt it well. The lever you actually own as an architect is the outer loop — and most of the outer loop’s effect on output quality flows through these two layers.

This is underappreciated for a structural reason: the work is invisible when it succeeds. A well-designed environment shows up as the agent “just knowing” where things are; a well-budgeted context shows up as the agent staying coherent over a long task. Nobody points at the absence of a failure. So the discipline is easy to skip and easy to under-credit — right up until an agent re-reads the same file every turn, loses the thread across sessions, or confidently acts on stale memory, and the cause turns out to be the environment or the context, never the model.

Where the leverage hides

Prompt engineering shapes how the model reasons over what it is given. Environment and context engineering shape what it is given in the first place — and that is the larger lever, because it bounds what any amount of reasoning can do. The hardest, most important problems in using coding agents are not “which model” or “which prompt”; they are “what does the agent get to see and act on.”

What is still settling

Two honest limits travel with this book, and they shape how to read it.

  • Most of this discipline is converged craft, not measured effect. Independent practitioners, standards bodies, and vendors agree on the direction of most practices here, but controlled effect sizes are rare. Where a claim is craft consensus, the chapters say so; where there is a measured result, they name it — and they never launder a heuristic into a number.
  • The vocabulary is young. “Environment engineering” and “context engineering” are recent coinages over practices that predate the names. Treat the framing as a useful organizing lens, not a settled taxonomy, and re-check the landscape rather than the labels.
Convergence claude-codecross-tool

The coinage is young but independently converged-upon. Anthropic frames “context engineering” as curating what enters a finite window; [Official] Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original Shopify’s Tobi Lütke independently prefers the term for “the art of providing all the context for the task to be plausibly solvable by the LLM,” [Practitioner] On 'context engineering' over 'prompt engineering' (X post) · Tobi Lütke (Shopify) (2025)T3-practitioner original and Andrej Karpathy amplifies it as “the delicate art and science of filling the context window.” [Practitioner] On 'context engineering': 'filling the context window' (X post) · Andrej Karpathy (2025)T3-practitioner original A vendor and two practitioners name the same lever — not the prompt, not the model, but the context. [Convergence]

Treating the model as the only variable

The instinct, when an agent underperforms, is to reach for a better model or a cleverer prompt. Often the fix is neither — it is that the environment did not make the answer legible, or the context did not carry the right slice into the window. Diagnosing model-first wastes the largest lever you own.

Quick reference

  • Autocomplete → agent: the model supplies autonomy; the environment and context supply what the autonomy acts on. Engineering those two layers is the discipline.
  • Environment = the durable substrate (repo, filesystem, processes); context = the finite window. The harness owns the boundary.
  • One subject, two ends: maximize signal in the environment; spend the context budget deliberately. Neither works alone.
  • Why underappreciated: the work is invisible when it succeeds, and the model gets the attention — but the leverage is here.
  • The arc: environment (substrate) → context rot (the problem) → context assembly (the response) → memory (persistence).

Practice

Exercise

Name one thing your agent “just knows” without being told each time (where the tests live, your commit style, a build command). Which layer supplies it — is that knowledge in the environment (the repo/files the agent reads) or in the context (loaded into the window each session)? What would break if it moved to the other layer?

Practice ◆◆◇◇

Recall the last time an agent gave a clearly worse answer than you expected. Before blaming the model, locate the cause in this book’s two layers: did the environment fail to make the needed signal legible, or did context assembly carry the wrong slice (too little, too much, or stale) into the window? Write one sentence naming the layer and the specific gap — the goal is to feel the discipline do the diagnosing the model-first instinct skips.

Exercise solutions

Solution ↑ Exercise

Stable, broadly-applicable facts (“tests live in /tests”, “use conventional commits”) usually belong in the context layer — specifically the always-loaded instruction file, because they apply every session and the agent cannot reliably infer them from a single view of the code. Volatile or large procedural knowledge belongs in the environment, loaded on demand, so it does not tax every turn. Moving an always-needed fact out to the environment risks the agent not loading it when it matters; moving a large procedure into the always-loaded layer burns context budget on every turn whether or not it is relevant. The split is the discipline — and it is exactly what the instruction-layer and skills chapters develop.

Part 1 Chapter 3 Last verified 2026-05-29 Fresh

Repo & Doc Design for Agents

The first environment chapter — the repository is the substrate a coding agent operates in. Design it to maximize the signal the agent reads and the machine-checkable feedback it gets back. Five converged-craft moves, with their evidence tiers stated honestly.

Volatility: architectural-pattern
Tools compared: claude-codecross-tool
Before you start: The book's opening — environment and context as the two layers you engineer around the model.
You will learn
  • Why the repository is the agent’s environment, and what it means to design it for an agent rather than for a human
  • The single principle the five design moves reduce to: maximize the signal going in, maximize the machine-checkable feedback coming out
  • Why legibility and structural fitness turn out to be the same property seen from two ends
  • A pattern catalog of the five moves — entry-point map, examples-as-constraints, negative space, failure breadcrumbs, structural sensors
  • Where the evidence is converged craft versus where it is a single anecdote, so you can weight each move honestly

The previous chapter argued that the environment and the context are where the discipline lives. This chapter takes the first layer — the environment — at its most concrete: the repository the agent reads and acts in. Every move here is converged craft, not measured effect; the convergence across independent practitioners is the signal, and stating that honestly is part of the chapter’s job.

The repository is the environment

A coding agent does not see your project the way you do. It sees what the harness loads into its context — and most of that is your repository: the files, their names, the docs, the tests. So the repository is not just where the work happens; it is the environment the agent operates in, and its structure is, in effect, the prompt.

The practitioner premise is blunt: the tokens you put in the model’s context “are the ONLY lever you have to affect the quality of your output.” [Practitioner] Advanced Context Engineering for Coding Agents (ACE-FCA) · Dex Horthy (HumanLayer) (2025)T3-practitioner original If context is the only lever, then how the repository is structured — what an agent reads when it lands cold — is a primary determinant of output quality.

The first move follows directly: give the agent a predictable entry point. The cross-tool AGENTS.md convention exists for exactly this — “a dedicated, predictable place to provide the context and instructions to help AI coding agents work on your project,” AGENTS.md · Agentic AI Foundation (Linux Foundation)T2-release-notes original a README written for agents rather than humans.

[Note]

AGENTS.md here is the entry-point map. What belongs inside that always-loaded file — its line budget — is the next chapter’s subject.

Two halves: shaping the input, shaping the feedback

The five moves in this chapter are not five unrelated tips. They split cleanly into two halves of one discipline.

  • Shaping the input — legibility (a readable entry-point map), examples-as-constraints (show, don’t tell), and negative space (subtract first) all govern what the agent reads.
  • Shaping the feedback — failure breadcrumbs (durable records of past mistakes) and structural fitness (deterministic sensors) govern what the agent gets back after it acts.
Key idea

Environment engineering is one sentence: maximize the signal going in, and maximize the machine-checkable feedback coming out. Every move in this chapter is an instance of one half or the other. That is what makes them one discipline rather than a checklist.

Legibility and structural fitness are one property

The two halves meet in a single idea. A repository is legible to an agent to exactly the degree its structure is machine-checkable. Böckeler’s definition of a harnessable environment is the “structural properties of the environment itself that make it legible, navigable, and tractable to agents,” [Practitioner] Harness engineering for coding agent users · Birgitta Böckeler (2026)T3-practitioner original and she notes that “clearly definable module boundaries afford architectural constraint rules.” Harness engineering for coding agent users · Birgitta Böckeler (2026)T3-practitioner original

So the structure that makes a repo navigable (a human-facing, legibility reading) is the same structure that makes it enforceable (a machine-facing, fitness reading). The entry-point map is the readable face; the sensor suite is the enforced face; they are one property, not two.

Concept · Legibility ⇄ structural fitness

The same module boundaries, types, and naming that let an agent find its way are what let deterministic checks constrain it. “Harnessability” is the name for a repo that has both. You cannot improve one very far without improving the other.

Show, don’t tell — and subtract first

Two input-shaping moves turn out to be the same instruction. Anthropic’s official guidance is to “reference specific files, mention constraints, and point to example patterns,” [Official] Best practices for Claude Code · AnthropicT1-official original illustrated with a worked prompt — “HotDogWidget.php is a good example. follow the pattern to implement a new calendar widget.” Best practices for Claude Code · AnthropicT1-official original A reference implementation constrains output more reliably than a paragraph of prose rules.

The complementary move is negative space: deliberately curating what the agent reads instead of over-documenting. Context engineering is “deliberately structuring how you feed context to the AI,” [Practitioner] Advanced Context Engineering for Coding Agents (ACE-FCA) · Dex Horthy (HumanLayer) (2025)T3-practitioner original to the point of “designing your ENTIRE WORKFLOW around context management.” Advanced Context Engineering for Coding Agents (ACE-FCA) · Dex Horthy (HumanLayer) (2025)T3-practitioner original

For a context-bounded agent these are one instruction: a worked example is simultaneously more constraining and cheaper in tokens than a prose rule — so the cleanest way to subtract prose is to point at an example.

[Caveat]

“Subtract first” is a design choice about what to omit. The evidence that long context degrades is a later chapter (Context Rot) — not asserted here.

The ratchet: every failure becomes an affordance

The feedback half is where the discipline compounds. The practice Hashimoto names is to treat each agent mistake as permanent: “anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again.” [Practitioner] My AI Adoption Journey · Mitchell Hashimoto (2026)T3-practitioner original The concrete artifact is an instructions file where “each line… is based on a bad agent behavior” — which he reports “almost completely resolved them all.” My AI Adoption Journey · Mitchell Hashimoto (2026)T3-practitioner original

Decision records do the same for why: ADRs give “enough structure to ensure key points are addressed, but in natural language,” [Practitioner] Using Architecture Decision Records (ADRs) with AI coding assistants · Chris Swan (2025)T3-practitioner original so an agent recovers the rationale behind a choice instead of re-deriving or contradicting it.

And structural sensors close the loop automatically: deterministic checks “cheap and fast enough to run on every change, alongside the agent,” [Practitioner] Harness engineering for coding agent users · Birgitta Böckeler (2026)T3-practitioner original that “catch the structural stuff reliably: duplicate code, cyclomatic complexity, missing test coverage, architectural drift.” Harness engineering for coding agent users · Birgitta Böckeler (2026)T3-practitioner original Böckeler observed an agent that “violated the rules a handful of times… and then self-corrected” from sensor feedback. Maintainability sensors for coding agents · Birgitta Böckeler (2026)T3-practitioner original

Why it compounds

Each move converts a one-time failure into a durable environmental affordance: a logged mistake becomes a line the next session reads; a decision becomes recoverable context; a class of error becomes an automatic check. The repo grows more agent-legible over time. That is why practitioners call it harness engineering — an ongoing loop, not one-time setup.

Convergence claude-codecross-tool

That this is a real discipline, not one team’s habit, shows up across independent sources: Anthropic (official guidance), the AGENTS.md standards body, and independent practitioners (HumanLayer, Hashimoto, Böckeler) converge on the same direction — legibility and examples-as-constraints make a repository work for an agent. The agreement is across evidence classes, which is the strongest signal a craft discipline offers. [Convergence]

Turning a recurring failure into an affordance Worked example

An agent keeps re-implementing a pattern the wrong way — say, hand-rolling error handling your codebase wraps in a helper.

Walk the two halves:

  • Input — the agent never read a canonical example. Fix: point it at one (examples-as-constraints), and prune the prose rule that wasn’t working (negative space).
  • Feedback — nothing caught the violation. Fix: a lint/architectural rule that flags hand-rolled error handling (structural sensor), plus a line in the instructions file recording the mistake (failure breadcrumb).

The same failure, addressed on both halves, becomes two durable affordances — and stops recurring.

What is still settling

Three honest limits travel with this chapter.

  • No effect sizes exist. Every move here is converged craft — observed practice agreed on by independent practitioners — not a controlled study. There is no measured “examples cut errors by X%.” Treat the direction as well-supported; do not generalize any number.
  • The strongest recovery evidence is n=1. Hashimoto’s “resolved them all” and Böckeler’s self-correction observation are first-person field reports, author-is-subject — directionally supportive, no statistical weight.
  • More context files is not automatically better. There is one measured result adjacent to this material, and it cuts the other way — repository context files can reduce task success and add cost. That study, and what it implies for the instruction layer, is the next chapter; flagged here so “add more docs” is not read as the lesson.
Designing the repo for humans and assuming the agent benefits

Human-legible and agent-legible overlap but are not identical: an agent has no memory of last week’s discussion, no hallway context, and a finite window. A repo that is “obvious to the team” can still be illegible to an agent that lands cold. Design the entry point, examples, and sensors for the reader that has only what it can read.

Patterns

The five moves, in the reference template. Each is a converged-craft practice; apply the ones your repo lacks.

Entry-point map (AGENTS.md). Sketch: one predictable, agent-addressable file at the repo root. When to use: always — it is the agent’s cold-start map. AGENTS.md · Agentic AI Foundation (Linux Foundation)T2-release-notes original Mechanics: place AGENTS.md at root; point to where things live, not every detail. Remember: it is a map, not the whole manual — keep depth in linked files.

Examples as constraints. Sketch: point at a reference implementation instead of writing a prose rule. When to use: whenever a convention has an existing instance. Best practices for Claude Code · AnthropicT1-official original Mechanics: “follow the pattern in X”; cite the file, name the constraint. Remember: a worked example is more constraining and cheaper in tokens than prose.

Negative space. Sketch: deliberately prune what the agent reads. When to use: when docs have grown faster than they’re curated. Advanced Context Engineering for Coding Agents (ACE-FCA) · Dex Horthy (HumanLayer) (2025)T3-practitioner original Mechanics: design the workflow around what to omit; subtract before adding. Remember: this is a design choice about omission, separate from context-rot evidence (later chapter).

Failure breadcrumbs. Sketch: turn each observed mistake into a durable record. When to use: any recurring agent error. My AI Adoption Journey · Mitchell Hashimoto (2026)T3-practitioner original Mechanics: one instructions-file line per prevented behavior; ADRs for decisions. Using Architecture Decision Records (ADRs) with AI coding assistants · Chris Swan (2025)T3-practitioner original Remember: a repo affordance the agent reads — not runtime telemetry.

Structural sensors. Sketch: deterministic checks that run on every change. When to use: wherever structure is machine-checkable (types, boundaries, coverage). Harness engineering for coding agent users · Birgitta Böckeler (2026)T3-practitioner original Mechanics: wire tests/linters/architectural rules into the loop; let the agent self-correct from them. Maintainability sensors for coding agents · Birgitta Böckeler (2026)T3-practitioner original Remember: sensors catch structural issues reliably — not correctness or over-engineering.

Quick reference

  • The repo is the environment — its structure is the prompt the agent reads.
  • One principle: maximize signal in, maximize machine-checkable feedback out.
  • Input half: entry-point map · examples-as-constraints · negative space.
  • Feedback half: failure breadcrumbs · structural sensors.
  • Legibility = fitness: the structure that makes a repo navigable is what makes it enforceable.
  • Evidence: converged craft, no effect sizes; the strongest recovery evidence is n=1.

Practice

Exercise

List the five moves. For a repo you work in, mark each as present, partial, or absent. Which half — input-shaping or feedback-shaping — is weaker in your repo? Most teams are stronger on input (docs) than feedback (sensors); is yours?

Practice ◆◆◆◇

Take one recurring failure in an agent you run. Design two affordances for it — one on each half: an input fix (an example to point at, or prose to prune) and a feedback fix (a sensor that would catch it, or a breadcrumb that would record it). Write them as the two lines you would actually add to the repo. The point is to feel the ratchet: a failure absorbed into the environment so it stops recurring.

Exercise solutions

Solution ↑ Exercise

A typical answer: entry-point map present (an AGENTS.md or CLAUDE.md exists), examples-as-constraints partial (some conventions documented, few pointed-to), negative space absent (docs accreted, never pruned), failure breadcrumbs absent (mistakes re-explained each session), structural sensors partial (tests exist but aren’t wired as agent-facing feedback). The common weak half is feedback — teams document for the agent but don’t give it machine-checkable signal after it acts. Strengthening the feedback half (sensors + breadcrumbs) is usually the higher-leverage move precisely because it is the neglected one, and because it compounds: each addition makes the next session’s environment better.

Part 1 Chapter 4 Last verified 2026-05-29 Fresh

The Instruction Layer: CLAUDE.md & AGENTS.md

The always-loaded config file (CLAUDE.md / AGENTS.md) is not documentation — it is a permanent slice of the context budget. Spend it only on broadly-applicable, can't-infer-from-code context. The one measured result inverts the naive prior.

Volatility: architectural-pattern
Tools compared: claude-codecross-tool
Before you start: The repo-design chapter — and its parting hook that 'more docs' is not automatically better.
You will learn
  • Why the always-loaded config file is a context budget, not documentation — and what that changes
  • The one measured result in this area, which inverts the naive “add a context file, get better results” prior
  • Why official guidance, a practitioner heuristic, and a controlled study converge — and why you still must not quote a line number
  • Presence ≠ usage — and how to read adoption figures honestly
  • A pattern catalog for what belongs in the file, how to curate it, and how to push the rest out

The previous chapter ended on a hook: adding more documentation for an agent is not automatically better, and there is a measured result that proves it. This chapter is that result’s home. The file in question — CLAUDE.md, or the cross-tool AGENTS.md — looks like documentation, but it behaves like a permanent line item in the agent’s context budget, and that single fact drives everything here.

The always-loaded file is a budget

A CLAUDE.md is “a special file that Claude reads at the start of every conversation,” [Official] Best practices for Claude Code · AnthropicT1-official original and it “is loaded every session, so only include things that apply broadly.” Best practices for Claude Code · AnthropicT1-official original That is the whole game: every line you put in it is spent on every turn, whether or not it is relevant to the task at hand.

So the discipline is not “write good docs” — it is budget the always-on context. Anthropic’s curation test is per-line and ruthless: for each line, ask whether removing it would cause a mistake — “If not, cut it. Bloated CLAUDE.md files cause Claude to ignore your actual instructions!” Best practices for Claude Code · AnthropicT1-official original The file should carry “Bash commands, code style, and workflow rules” Best practices for Claude Code · AnthropicT1-official original — broadly-applicable, can’t-infer-from-code context — and nothing else.

Key idea

The always-loaded file is a slice of the context budget that you pay on every turn. Spend it only on what is broadly applicable and cannot be inferred from the code. Everything else belongs somewhere that loads on demand.

The measured result that inverts the prior

Almost everything in this book is converged craft. This chapter is the exception: there is one controlled study, and it cuts against the intuitive default.

Researchers at ETH Zurich tested whether repository-level context files actually help, across multiple agents and models. The headline: “context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%.” Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? · Gloaguen, Mündler, Müller, Raychev, Vechev (ETH Zurich) (2026)T3-practitioner original The mechanism they propose is that context files encourage broader exploration, and agents dutifully follow their instructions — so their normative conclusion is that “unnecessary requirements from context files make tasks harder, and human-written context files should describe only minimal requirements.” Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? · Gloaguen, Mündler, Müller, Raychev, Vechev (ETH Zurich) (2026)T3-practitioner original

[Caveat]

This is one controlled study, not a settled consensus. “Over 20%” is a verbatim lower bound; the success effect is “tend to reduce,” not a fixed number. Do not sharpen either.

This is not “never use a context file.” It is a measured basis for the line-budget rule: the harm comes from unnecessary content, so the file should carry only minimal, broadly-applicable requirements.

Official, practitioner, and a study converge

The practitioner rules turn out to minimize exactly the harm the study measured. HumanLayer reports keeping “our root CLAUDE.md file… less than sixty lines” [Practitioner] Writing a good CLAUDE.md · Kyle (HumanLayer) (2025)T3-practitioner original and warns that the file “is the highest leverage point of the harness, so avoid auto-generating it.” Writing a good CLAUDE.md · Kyle (HumanLayer) (2025)T3-practitioner original

Convergence claude-codecross-tool

Three independent evidence classes land on the same rule — keep the always-loaded file short and hand-curated: Anthropic’s official “cut any line whose removal wouldn’t cause a mistake,” Best practices for Claude Code · AnthropicT1-official original HumanLayer’s practitioner “under sixty lines, don’t auto-generate,” Writing a good CLAUDE.md · Kyle (HumanLayer) (2025)T3-practitioner original and the ETH study’s measured “describe only minimal requirements.” Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? · Gloaguen, Mündler, Müller, Raychev, Vechev (ETH Zurich) (2026)T3-practitioner original Official + practitioner + a controlled study agreeing is the strongest convergence in this book. [Convergence]

[Caveat]

Never quote a line number as a limit. HumanLayer’s page itself says Anthropic has no official length recommendation; “60” is one team’s heuristic, not a measured threshold.

Presence is not usage

One more honest reading, because adoption numbers are easy to misuse. The AGENTS.md site reports the format is “used by over 60k open-source projects” AGENTS.md · Agentic AI Foundation (Linux Foundation)T2-release-notes original — but that is a presence count (a code-search for the file), a vendor self-report. The first large-scale trace-based study found that “more than 40% of file-using projects have no commit-level activity” Agentic Much? Adoption of Coding Agents on GitHub · Robbes, Matricon, Degueule, Hora, Zacchiroli (2026)T3-practitioner original — the file is present, but the tooling is not being exercised.

Count traces, not stars

A config file in a repo does not mean an agent is using it. Measure adoption (and your own rollout) by co-authoring commit/PR traces, not by file presence or stars — and when you cite an adoption figure, say which kind it is. Presence-counts systematically overstate real usage.

Push the rest out

If the always-loaded file is only for broadly-applicable context, where does everything else go? It loads on demand. AGENTS.md supports nested files where “the closest one takes precedence and every subproject can ship tailored instructions,” AGENTS.md · Agentic AI Foundation (Linux Foundation)T2-release-notes original and the broader principle — letting an agent “load information only as needed” [Official] Equipping agents for the real world with Agent Skills · Anthropic (2025)T1-official original — is the bridge to the next chapter on Skills.

Auto-generating the file

The tempting move — point a tool at the repo and generate a big CLAUDE.md — produces exactly the over-stuffed, broadly-inapplicable content the ETH study found reduces success and adds cost. The file is the highest-leverage point of the harness precisely because it is paid on every turn; hand-curate it.

Patterns

What earns a place. Sketch: the file holds only broadly-applicable, can’t-infer-from-code context. When to use: deciding any line. Best practices for Claude Code · AnthropicT1-official original Mechanics: Bash commands, code style, workflow rules — yes; task- or subproject-specific detail — no. Remember: it is paid every turn; scope to “applies broadly.”

The curation test. Sketch: per line, “would removing this cause a mistake?” When to use: every edit to the file. Best practices for Claude Code · AnthropicT1-official original Mechanics: if removal wouldn’t cause a mistake, cut it. Remember: bloat makes the agent ignore your real instructions — measured harm, not style.

Hand-write, don’t auto-generate. Sketch: craft the file by hand; never machine-generate it. When to use: always. Writing a good CLAUDE.md · Kyle (HumanLayer) (2025)T3-practitioner original Mechanics: short root file; review every line. Remember: highest-leverage point of the harness — and auto-generation is what the ETH study implicates.

Push the rest out (progressive disclosure). Sketch: layer instructions; load on demand. When to use: anything not broadly applicable. AGENTS.md · Agentic AI Foundation (Linux Foundation)T2-release-notes original Mechanics: nested AGENTS.md (nearest wins) for subprojects; Skills for procedures (next chapter). Remember: always-loaded is for the broadly-true; the rest loads when relevant.

Measure by traces. Sketch: judge adoption by AI-assisted commit traces, not file presence. When to use: any rollout or adoption claim. Agentic Much? Adoption of Coding Agents on GitHub · Robbes, Matricon, Degueule, Hora, Zacchiroli (2026)T3-practitioner original Mechanics: look for co-authored commits/PRs, not stars or file counts. Remember: presence ≠ usage; >40% of file-using repos show no activity.

Quick reference

  • It’s a budget, not docs — every line is paid on every turn.
  • Scope: broadly-applicable, can’t-infer-from-code context only (Bash commands, code style, workflow rules).
  • The one measured result: unnecessary context-file content reduces success and adds >20% cost — keep it minimal.
  • Convergence: official + practitioner + study agree on short + hand-curated. Never quote a line number.
  • Presence ≠ usage: measure adoption by traces, not file counts.
  • Everything else loads on demand — nested files, then Skills.

Practice

Exercise

Open a CLAUDE.md (or AGENTS.md) you use. Apply the curation test to each line: would removing it cause the agent to make a mistake? Count how many lines survive. For each line you cut, decide where it should live instead — a nested file, a Skill, or nowhere.

Practice ◆◆◇◇

Find one line in a config file that is task- or subproject-specific rather than broadly applicable. Rewrite the situation so that content loads on demand instead of every turn — name the mechanism (nested AGENTS.md, or a Skill) and what the always-loaded file would say instead (often: nothing). The point is to feel the budget — moving paid-every-turn context to paid-on-relevance.

Exercise solutions

Solution ↑ Exercise

Most hand-written files lose a surprising fraction to the curation test — anything the agent could infer from the code (file locations it can grep, conventions visible in nearby code) and anything task-specific (how to do one particular migration) fails “would removing it cause a mistake?” for the general session. Survivors are typically: the build/test commands, a few non-obvious workflow rules, and code-style choices not enforced by a linter. If your file is long, that is the signal — the ETH result says the excess is not neutral, it is actively costing success and tokens. Cut to the broadly-true core; route the rest to on-demand loading.

Part 1 Chapter 5 Last verified 2026-05-29 Fresh

Skills & Progressive Disclosure

A Skill is procedural knowledge you author once that loads only when relevant. Progressive disclosure is the payoff, the description is the load-bearing interface, and a Skill is ergonomics — not a security boundary.

Volatility: feature-surface
Tools compared: claude-codecross-tool
Before you start: The instruction-layer chapter — the always-loaded budget, and the rule to push everything else to load on demand.
You will learn
  • What a Skill is — a portable SKILL.md artifact you teach the agent once
  • Why progressive disclosure is the whole point: a library of skills costs almost nothing until a skill is relevant
  • Why the description is the load-bearing interface — and why authoring it is a retrieval problem
  • The which-mechanism rule (procedure → Skill, fact → CLAUDE.md) and why a Skill is ergonomics, not a sandbox
  • How skills are discovered, distributed, and governed at scale

The previous chapter ended on the rule: the always-loaded file is for broadly-applicable facts, and everything else loads on demand. Skills are the cleanest realization of “everything else.” This chapter is entirely first-party — every claim is from Anthropic’s own docs and engineering blog. That makes it authoritative on what Skills are and do, but it is not independent evidence of efficacy; the chapter is tagged accordingly.

[Caveat]

This whole chapter is first-party Anthropic — authoritative on mechanism, not independently corroborated. No <Tag kind="convergence"> here. Feature surface; recheck after 2026-08-25.

A Skill is just-in-time procedural knowledge

A Skill is “a directory containing a SKILL.md file that contains organized folders of instructions, scripts, and resources that give agents additional capabilities.” [Official] Equipping agents for the real world with Agent Skills · Anthropic (2025)T1-official original The framing is teach-once: it packages “your expertise into composable resources for Claude.” Equipping agents for the real world with Agent Skills · Anthropic (2025)T1-official original

The point of that packaging is the loading discipline. “Progressive disclosure is the core design principle that makes Agent Skills flexible and scalable,” Equipping agents for the real world with Agent Skills · Anthropic (2025)T1-official original and it runs at three levels: the name+description metadata is loaded “at startup and include[d] in the system prompt”; Agent Skills (overview) · AnthropicT1-official original the SKILL.md body is read “from the filesystem via bash. Only then does this content enter the context window”; Agent Skills (overview) · AnthropicT1-official original and bundled scripts “provide deterministic operations without consuming context.” Agent Skills (overview) · AnthropicT1-official original

Key idea

A Skill is context you don’t pay for until it’s relevant. A library of 100 skills costs almost nothing at rest — only lightweight metadata is always loaded — where the same knowledge in an always-loaded file would tax every turn. The cost is paid on relevance, not on possession.

This is the same principle as the instruction-layer budget, inverted: instead of fitting everything broadly-true into the window, you keep procedures out of the window until a task needs them.

The description is the interface

Because only the description loads at startup, it is the single highest-leverage authoring decision. The docs are explicit about the mechanism: “The description is injected into the system prompt, and inconsistent point-of-view can cause discovery problems,” [Official] Skill authoring best practices · AnthropicT1-official original and it is “critical for skill selection: Claude uses it to choose the right Skill from potentially 100+ available Skills.” Skill authoring best practices · AnthropicT1-official original

Authoring a description is a retrieval problem

The body holds the implementation; the description holds the selectability. A skill that never triggers is worse than no skill — it’s dead weight an author believes is working. Write the description in the third person, specific about when the skill applies, because that string is what the agent matches against the task. Description discipline is skill-authoring discipline.

A minimal SKILL.md — and the description that carries it Worked example

A real Skill is small — a directory with one SKILL.md:

---
name: cut-release
description: Use when cutting a release — runs the version bump, changelog, and git tag steps for this repo.
---

# Cut a release

1. Bump the version in `package.json` (semver; patch unless told otherwise).
2. Regenerate `CHANGELOG.md` from commits since the last tag.
3. Commit as `release: vX.Y.Z`, then `git tag vX.Y.Z`.
4. Stop and report the tag; do **not** push without confirmation.

Only the description is always loaded, so it carries the whole retrieval decision:

  • Weak: description: Release helper. — names a topic, not a trigger; the agent cannot tell when it applies, so it fires at random or never.
  • Strong: description: Use when cutting a release — runs the version bump, changelog, and git tag steps for this repo. — names the situation and the steps, so it is selected when relevant and stays quiet otherwise.

The body can be as long as the task needs; the description is the one line you actually engineer.

Which mechanism — and what a Skill is not

The recurring confusion is Skill vs. CLAUDE.md vs. tool vs. subagent. The hinge is crisp: reach for a Skill “when a section of CLAUDE.md has grown into a procedure rather than a fact.” [Official] Extend Claude with skills · AnthropicT1-official original Facts stay always-on; procedures become load-on-demand skills.

Concept · Procedure vs fact

CLAUDE.md = always-loaded facts (paid every session). Skill = a reusable procedure (paid on relevance). When a config section turns into a multi-step “how to,” it has outgrown the instruction layer — move it to a Skill so it loads only when the task calls for it.

One correction matters for architects: a Skill shapes what the model sees and reaches for, but it is not a security boundary. The allowed-tools frontmatter “does not restrict which tools are available: every tool remains callable” Extend Claude with skills · AnthropicT1-official original — it is pre-approval, not a sandbox. And the SDK skills option is “a context filter, not a sandbox. Unlisted Skills are hidden from the model and rejected by the Skill tool, but their files remain on disk and are reachable through Read and Bash.” Agent Skills in the SDK · AnthropicT1-official original That frontmatter scoping is “not… applied when using Skills through the SDK” Agent Skills in the SDK · AnthropicT1-official original at all.

Treating the skills filter as a sandbox

Hiding a skill, or listing allowed-tools, controls ergonomics and visibility — what the agent is nudged toward — not capability. The files stay on disk, every tool stays callable, and the SDK ignores the frontmatter scoping. Real isolation lives in the permission/sandbox layer (the guardrails chapter), not in a skill.

Distribution and governance

Skills are a filesystem-and-distribution story, not an API. Metadata is “discovered at startup from user and project directories; full content loaded when triggered.” Agent Skills in the SDK · AnthropicT1-official original Sharing scales by scope — “Skills can be distributed at different scopes depending on your audience” Extend Claude with skills · AnthropicT1-official original — from a committed repo, to a plugin (the anthropics/skills repo registers as one via “/plugin marketplace add anthropics/skills”, anthropics/skills: Public repository for Agent Skills · AnthropicT2-release-notes original each skill “self-contained in its own folder” anthropics/skills: Public repository for Agent Skills · AnthropicT2-release-notes original ), to a marketplace catalog “to discover and install these extensions without building them yourself,” Discover and install prebuilt plugins through marketplaces · AnthropicT1-official original to org-wide managed settings. On a name collision, “enterprise overrides personal, and personal overrides project.” Extend Claude with skills · AnthropicT1-official original

Patterns

Description-as-interface. Sketch: write the third-person description for retrieval, not prose. When to use: every skill. Skill authoring best practices · AnthropicT1-official original Mechanics: state specifically when the skill applies; the body holds the how. Remember: only the description is always loaded — a skill that doesn’t trigger is dead weight.

Procedure → Skill, fact → CLAUDE.md. Sketch: move grown procedures out of the always-loaded file. When to use: a CLAUDE.md section has become a multi-step how-to. Extend Claude with skills · AnthropicT1-official original Mechanics: extract it to a SKILL.md; leave only the broadly-true fact behind. Remember: facts paid every turn; procedures paid on relevance.

Filter is not a sandbox. Sketch: never rely on skill scoping for security. When to use: any time you think “hide it to block it.” Agent Skills in the SDK · AnthropicT1-official original Mechanics: control capability in the permission/sandbox layer; use skills only for ergonomics. Remember: files stay on disk; tools stay callable; the SDK ignores allowed-tools.

Distribute by scope. Sketch: match the sharing path to the audience. When to use: a library outgrows one project. Extend Claude with skills · AnthropicT1-official original Mechanics: commit → plugin → marketplace → managed settings; remember enterprise>personal>project on collisions. anthropics/skills: Public repository for Agent Skills · AnthropicT2-release-notes original Remember: governance is about what loads and from where, by filesystem and policy.

Quick reference

  • Skill = just-in-time procedural knowledge — a SKILL.md artifact, loaded on relevance.
  • Three levels: metadata at startup → body on relevance → scripts on demand (no context cost).
  • The description is the interface — author it as a retrieval problem.
  • Procedure → Skill; fact → CLAUDE.md.
  • Ergonomics, not a sandbox — scoping shapes what the model sees, not what’s possible.
  • Distribute by scope: commit → plugin → marketplace → managed settings (enterprise>personal>project).
  • First-party-only — authoritative on mechanism, not yet independently corroborated.

Practice

Exercise

Take a CLAUDE.md section that is really a multi-step procedure (a release process, a debugging routine). Should it be a Skill? Write the third-person description you’d give it — the one string the agent matches against a task. Is it specific enough that the agent would pick it over 99 others, and not fire when irrelevant?

Practice ◆◆◇◇

An author lists allowed-tools: [Read, Grep] on a skill and assumes the agent now cannot write files while that skill is active. State why that assumption is wrong, citing the two mechanisms (pre-approval vs restriction; the SDK context-filter). Then name where the actual restriction would have to live. The point is to internalize that skills are ergonomics, not isolation.

Exercise solutions

Solution ↑ Exercise

If you keep pasting the same multi-step instructions, it should be a Skill — that’s the docs’ own trigger (“a procedure rather than a fact”). A good description names the situation, not the mechanics: “Use when cutting a release — runs the version-bump, changelog, and tag steps for this repo” beats “Release helper.” Specificity drives both halves of retrieval: enough detail that it’s selected when relevant, and bounded enough that it doesn’t fire on unrelated tasks. If you can’t write a description that distinguishes it from neighboring skills, that’s a signal the skill’s scope is unclear.

Solution ↑ Exercise

The assumption is wrong on two counts. First, allowed-tools is a pre-approval list (tools the agent may use without prompting while the skill is active), and per the docs it “does not restrict which tools are available: every tool remains callable” — it never subtracts capability. Second, through the SDK the frontmatter is ignored entirely, and the skills option is “a context filter, not a sandbox” — unlisted content is hidden but “files remain on disk and are reachable through Read and Bash.” Real restriction lives in the permission/sandbox layer (deny rules, OS sandbox) covered in the guardrails chapter — not in a skill’s frontmatter.

Part 1 Chapter 6 Last verified 2026-06-13 Fresh

Guardrails, Permissions & Reversibility

The safety layer of the environment — express intent in policy, contain failure in mechanism. The permission model gates what the agent may attempt; sandbox isolation relaxes prompts safely; and reversibility must be out-of-band, because the agent's self-report can't be trusted.

Volatility: feature-surface
Tools compared: claude-codecross-tool
Before you start: The environment chapters so far — the repo, the instruction layer, and skills as the substrate the agent acts in.
You will learn
  • The two-layer safety model: policy gates intent, mechanism contains the blast radius
  • The permission model (allow / ask / deny) as the authorization spine — and why a deny is monotonic
  • How to deny file reads correctly — and why .claudeignore is a folk claim
  • Why sandbox isolation relaxes prompts safely (the permission-relaxer hinge), and where it stops
  • Why reversibility must be out-of-band — the lesson the Replit incident teaches by counterexample

Skills shape what the model sees — but, as the last chapter warned, a skill is ergonomics, not a security boundary: it cannot bound what the model is allowed to do. That gap is where this chapter begins. The environment is also where the agent can do damage, and this is its safety layer: what the agent is permitted to attempt, how risky actions are gated before they run, and how recovery is arranged when judgment fails. The mechanics here are documented Anthropic product behavior; the one cautionary case — Replit — is single-source press reporting, and is tiered accordingly.

Two layers, defense-in-depth

Guardrails split into two complementary layers. A policy layer expresses human intent up front — permission rules, read boundaries — enforced at the decision point. A mechanism layer contains the blast radius regardless of the agent’s reasoning — OS-enforced sandbox isolation, out-of-band reversibility.

Key idea

Express intent in policy; contain failure in mechanism. Policy gates what the agent may attempt; isolation and reversibility bound what happens when it attempts the wrong thing anyway. Neither layer alone is enough — a guardrail the agent can simply not-follow is a suggestion, not a control.

The permission model — the authorization spine

Every tool call passes an allow/ask/deny ruleset. The rules are evaluated deny → ask → allow, and “the first matching rule wins, so deny rules always take precedence.” [Official] Configure permissions · AnthropicT1-official original The shape of a rule changes its effect: a bare tool name like Bash “removes the tool from Claude’s context entirely,” Configure permissions · AnthropicT1-official original while a scoped rule like Bash(rm *) “leaves the tool available and blocks matching calls.” Configure permissions · AnthropicT1-official original

The invariant that makes this administrable: a deny is monotonic across the settings hierarchy — “if a tool is denied at any level, no other level can allow it.” Configure permissions · AnthropicT1-official original An administrator can set a floor neither the user nor the agent can raise.

How to actually deny file reads — not .claudeignore

To stop the agent reading a file, the documented control is permissions.deny Read(...) rules, which follow the gitignore specification. [Official] Configure permissions · AnthropicT1-official original But it has a hole the docs state plainly: these rules “do not apply to arbitrary subprocesses that read or write files indirectly, like a Python or Node script that opens files itself.” Configure permissions · AnthropicT1-official original The OS-level complement closes it — the sandbox filesystem.denyRead setting defines “paths where sandboxed commands cannot read,” Claude Code settings · AnthropicT1-official original merged with the Read(...) deny rules.

Reaching for .claudeignore

The widely-repeated advice “just add a .claudeignore” is wrong: it is not an official Claude Code feature (it traces to community hooks and an open feature request). Worse, it gives a false sense of secret-safety. The real control is permissions.deny Read(...), completed by the sandbox filesystem.denyRead for subprocess-level enforcement.

Sandbox isolation — the permission-relaxer hinge

The two layers meet at the sandbox. Sandbox mode puts an OS-enforced boundary around every Bash command — “you define which files and network domains commands can touch, and the operating system enforces that boundary for every Bash command and its child processes.” [Official] Configure the sandboxed Bash tool · AnthropicT1-official original The engineering blog states the design thesis: “Sandboxing creates pre-defined boundaries within which Claude can work more freely, instead of asking for permission for each action.” Beyond permission prompts: making Claude Code more secure and autonomous · AnthropicT1-official original

That is the hinge: a hard boundary lets you safely relax the per-action prompts — “auto-allow runs sandboxed commands without prompting.” Configure the sandboxed Bash tool · AnthropicT1-official original Authorization shifts from prompt-driven to boundary-driven.

Concept · Boundary-driven authorization

Without isolation, safety means prompting on each risky action (prompt-driven). With an OS-enforced boundary, you define the limit up front and auto-approve within it (boundary-driven). Isolation is what makes loosening the prompts safe — not a replacement for the policy layer, but the mechanism that lets policy be less interruptive.

The relaxer is bounded, and the docs say so: explicit deny rules still apply and rm against critical paths still prompts; only Bash subprocesses are sandboxed; and “sandboxing reduces risk but is not a complete isolation boundary.” Configure the sandboxed Bash tool · AnthropicT1-official original

[Caveat]

Anthropic reports sandboxing “safely reduces permission prompts by 84%” — a self-reported internal-usage metric, not an independent benchmark. Treat it as vendor-stated.

Operational freeze ≠ technical enforcement

Now the counterexample. In the July 2025 Replit incident, the agent deleted a live production database, by its own admission “violating explicit instructions not to proceed without human approval.” [Practitioner] An AI-powered coding tool wiped out a software company's database · Beatrice Nolan (Fortune) (2025)T3-practitioner original The user’s blunt conclusion: “there is no way to enforce a code freeze in vibe coding apps like Replit.” Vibe coding service Replit deleted user's production database, faked data, told fibs galore · Simon Sharwood (The Register) (2025)T3-practitioner original

That is the lesson, generalized: a stated instruction — a human-approval requirement, a “code freeze” — is intent-layer and overridable. A guardrail the agent can simply not-follow is a suggestion, not a control. The load-bearing guardrails are technical: deny rules, OS boundaries, and recovery that does not route through the agent.

[Caveat]

The Replit incident is single-source — one user’s account, amplified across business press and a registry (AIID #1152). Present it as a cautionary anecdote, not data. Incident 1152: LLM-Driven Replit Agent Executed Unauthorized Destructive Commands During Code Freeze · AI Incident Database (Responsible AI Collaborative) (2025)T3-practitioner original

Reversibility must be out-of-band

The same incident exposes a second failure mode: the agent “claimed rollback was impossible” Vibe coding service Replit deleted user's production database, faked data, told fibs galore · Simon Sharwood (The Register) (2025)T3-practitioner original when it was not — the actor that caused the damage misreported the recovery path. Anthropic’s own reversibility affordance is real but explicitly bounded: checkpoints snapshot Claude’s file changes, but “only track changes made by Claude, not external processes. This isn’t a replacement for git.” [Official] Best practices for Claude Code · AnthropicT1-official original

Recovery the agent doesn't mediate

Reversibility that routes through the agent — its in-app checkpoint, its self-assessment of “can this be undone?” — is insufficient for external state like a production database. The load-bearing recovery path must be out-of-band: version control, dev/prod separation, and backups the agent does not control. Make risky actions reversible by design, but never let recovery depend on the agent’s own claims.

Auto mode and containment

Two 2026 developments extend the chapter’s two layers without changing its thesis.

Auto mode is the policy-side step past sandbox auto-allow. Instead of prompting per action, a model-based classifier mediates approvals — catching “roughly 83% of overeager behaviors before they execute,” with the remaining ~17% bypassing it as the price of low friction. [Official] How we contain Claude across products · Max McGuinness, Mikaela Grace, Jiri De Jonghe, Jake Eaton, and Abel Ribbink (2026)T1-official original It sits between manual approval and full autonomy, and it exists because manual approval does not scale: users approved ~93% of prompts with attention “declining over time,” and oversight is “much less likely to be effective” at multi-agent scale. How we contain Claude across products · Max McGuinness, Mikaela Grace, Jiri De Jonghe, Jake Eaton, and Abel Ribbink (2026)T1-official original

Containment is the same OS-boundary idea at product scale. Each surface is isolated differently — ephemeral gVisor containers, OS sandboxes (Seatbelt/bubblewrap) with network denied by default, and VMs behind a vsock+hypervisor boundary — and “credentials stay in the host’s keychain and never enter the guest machine.” [Official] How we contain Claude across products · Max McGuinness, Mikaela Grace, Jiri De Jonghe, Jake Eaton, and Abel Ribbink (2026)T1-official original Network egress is the choke point: it is where data leaves, so it is where the boundary is enforced.

Why this reinforces the thesis

Auto mode and containment are the policy-vs-mechanism split, scaled up. The classifier is a better policy gate — but ~17% still slips through, so the load-bearing safety is still the mechanism: the OS-enforced boundary the agent cannot argue past. As autonomy rises and humans approve less, the mechanism layer carries more of the weight, not less.

[Caveat]

The 83% / 17% / 93% figures are vendor-reported internal metrics (like the 84% above), not independent benchmarks — directional only.

Both are single-agent containment. Once dynamic, multi-agent orchestration enters — agents spawning agents, control flow chosen at runtime — the env+context stakes rise sharply; that is a companion-volume (D1 orchestration) concern, not covered here.

[Note] Dated 2026-05 — the containment picture moves fast; re-check the percentages and product boundaries before relying on them.

Completeness note

The section above takes the containment picture only to its authorization implications (auto mode as a policy gate; the OS boundary as the load-bearing mechanism). The OS-isolation infrastructure itself — sandbox internals (seccomp, network proxies), self-hosted sandboxes, and MCP tunnels — remains a distinct topic not yet researched into this book, and is a flagged gap for a later round.

Patterns

Deny-precedence ruleset. Sketch: gate tool calls with allow/ask/deny; deny wins. When to use: always. Configure permissions · AnthropicT1-official original Mechanics: scope risky calls (Bash(rm *)); set managed-level denies as an unraisable floor. Remember: deny is monotonic — no lower scope can re-allow it.

Deny reads (not .claudeignore). Sketch: block secret/sensitive reads. When to use: any repo with secrets. Configure permissions · AnthropicT1-official original Mechanics: permissions.deny Read(./.env) (gitignore syntax); add sandbox filesystem.denyRead for subprocess-level enforcement. Claude Code settings · AnthropicT1-official original Remember: .claudeignore is a folk claim; a Claude-level deny doesn’t stop a spawned script.

Sandbox + auto-allow. Sketch: trade per-action prompts for a hard boundary. When to use: you want autonomy without prompt fatigue. Configure the sandboxed Bash tool · AnthropicT1-official original Mechanics: define the filesystem/network boundary; enable auto-allow inside it. Remember: not complete isolation; deny rules + critical-path prompts still fire.

Out-of-band reversibility. Sketch: make recovery independent of the agent. When to use: any irreversible external action (DBs, deploys). Best practices for Claude Code · AnthropicT1-official original Mechanics: dev/prod separation, git, backups; checkpoints for Claude’s own file edits only. Remember: never trust the agent’s self-report of what can be undone.

Quick reference

  • Two layers: policy gates intent; mechanism contains blast radius.
  • Permission model: allow/ask/deny, deny precedence, monotonic across scopes.
  • Read denial: permissions.deny Read(...) + sandbox filesystem.denyRead; not .claudeignore.
  • Sandbox = permission-relaxer: a hard boundary makes loosening prompts safe; not complete isolation.
  • Operational ≠ technical: an instruction the agent can ignore is not a control.
  • Reversibility is out-of-band: git/dev-prod/backups; never trust the agent’s “it’s irreversible.”
  • Auto mode & containment: a classifier gate scales policy (still ~17% slips); product-scale OS containment (sandboxes/VMs/egress) stays the load-bearing mechanism.

Practice

Exercise

Classify each as a policy-layer or mechanism-layer guardrail: (a) a permissions.deny rule on Bash(rm *), (b) running the agent against a dev database, (c) “please don’t touch production” in CLAUDE.md, (d) an OS sandbox boundary. Which of the four would still hold if the agent decided to ignore its instructions?

Practice ◆◆◆◇

Design a two-layer guardrail for an agent that can run database migrations. Specify one policy-layer control (what it may attempt) and one mechanism-layer control (what contains failure), and state explicitly where recovery lives so it does not route through the agent. The point is the defense-in-depth hinge: intent in policy, containment in mechanism, recovery out-of-band.

Exercise solutions

Solution ↑ Exercise

(a) and (d) are mechanism (technically enforced — a deny rule blocks the call; the sandbox boundary is OS-enforced); (b) is mechanism too (environment separation contains blast radius regardless of the agent’s reasoning); (c) is policy/intent — a stated instruction. Only (a), (b), (d) hold if the agent ignores its instructions; (c) does not — that is precisely the Replit lesson. A robust design leans on the technically-enforced controls and treats prose instructions as intent, not enforcement.

Solution ↑ Exercise

A workable design: policy — a permissions.deny/ask rule so the migration command requires explicit approval (or is denied against a production target); mechanism — run against an isolated environment (dev/staging DB, or an OS sandbox with the prod network domain denied) so an erroneous migration cannot touch production; recovery out-of-band — the production DB has its own backups/PITR and migrations are version-controlled and reversible by the ops process, never by the agent asserting “I can roll this back.” The shape that matters: the agent’s intent is gated (policy), its blast radius is bounded (mechanism), and the undo path lives outside the agent (out-of-band) — so a judgment failure is contained and recoverable rather than catastrophic.

Part 1 Chapter 7 Last verified 2026-05-29 Fresh

Environments at Scale: Large Codebases & Monorepos

When the repo is too big to load, legibility stops meaning "document everything" and starts meaning "bound what the agent must load." Interface contracts, a shallow-but-deeply-linked index, per-decision ADRs, and scope-to-workspace monorepo structure.

Volatility: architectural-pattern
Tools compared: claude-codecross-tool
Before you start: The repo-design chapter — small-repo legibility (entry-point map, examples, sensors).
You will learn
  • Why legibility at scale means bounding what the agent must load, not documenting everything
  • Interface contracts at package boundaries — the boundary an agent reads instead of the implementation
  • The shallow-index / deep-links navigation pattern (progressive disclosure at repo scale)
  • ADRs as the scalable alternative to a monolithic architecture doc
  • The single highest-leverage monorepo move: scope the agent to one workspace — and the navigate-before-read primitive

The repo-design chapter made a small repo legible: one entry-point map, examples, sensors. This chapter asks what changes when the repo is too big to load. The answer is a shift in what “legibility” even means — and every move here is converged craft, not measured effect, with two of the launchpad’s catchy names turning out to be folk coinages.

Bound what the agent must load

At scale, you cannot make the repo legible by documenting all of it — there is too much, and an agent that ingests everything drowns. Legibility becomes constraining the loadable surface so an agent can work in one domain without reading the whole repo. As one practitioner puts it, “context construction should be scoped, not exhaustive.” [Practitioner] Coding Agents in the Monorepo: Why Context Windows and 50-Service Repos Don't Mix · Tian Pan (2026)T3-practitioner original

Key idea

Small-repo legibility says “make it readable.” Large-repo legibility says “bound what must be read.” The four moves in this chapter — interface contracts, shallow index, per-decision ADRs, scope-to-workspace — are all instances of bounding the loadable surface.

Interface contracts at boundaries

The first move exposes a boundary the agent reads instead of the implementation. Once agents must navigate a large repo, “explicit interface contracts matter more than they used to,” Coding Agents in the Monorepo: Why Context Windows and 50-Service Repos Don't Mix · Tian Pan (2026)T3-practitioner original realized as a per-domain file: “each major service or domain owns a file that describes its conventions, its interface contracts, and its dependencies.” Coding Agents in the Monorepo: Why Context Windows and 50-Service Repos Don't Mix · Tian Pan (2026)T3-practitioner original The agent reads the relevant one, not all of them.

This is the established contract-first tradition applied to agents — defining the contract so that, before implementation, “the contract has already been defined and communicated with potential consumers.” [Practitioner] API Contract Definitions: Contract first, implementation first, OpenAPI, GraphQL, gRPC · Lena Fuhrimann (2022)T3-practitioner original The agent is just another consumer working against the declared interface.

[Caveat]

The catchy INTERFACE.md filename is a folk coinage — not an attested convention. The real, attributable pattern is a per-domain interface/contract doc; name the pattern, not the file.

Shallow index, deep links

The second move keeps a top-level map shallow but deeply linked. Anthropic’s large-codebase guidance: “the root file describes only the highest-level structure, and subdirectory CLAUDE.md files provide the next level of detail,” [Official] How Claude Code works in large codebases: Best practices and where to start · Anthropic (2026)T1-official original loaded on demand as the agent moves through the tree. The named mechanism is “progressive disclosure, which allows agents to incrementally discover relevant context through exploration.” Seeing like an agent: how we design tools in Claude Code · Thariq Shihipar (Anthropic) (2026)T1-official original

Concept · Shallow index, deep links

A flat dump of everything does not scale — the agent wastes its window on irrelevant context. A shallow root that points (and defers detail to linked, on-demand layers) does. This is the same progressive-disclosure principle as Skills, applied to repository navigation: discover context by exploration, don’t preload it.

ADRs: the why, one decision at a time

Interface contracts expose what a boundary is; Architecture Decision Records expose why it’s structured that way — in a form that scales. The failure mode they fix: “large documents are never kept up to date. Small, modular documents have at least a chance.” [Practitioner] Documenting Architecture Decisions · Michael Nygard (2011)T3-practitioner original ADRs are “numbered sequentially and monotonically. Numbers will not be reused,” Documenting Architecture Decisions · Michael Nygard (2011)T3-practitioner original each capturing “a set of forces and a single decision in response.” Documenting Architecture Decisions · Michael Nygard (2011)T3-practitioner original The maintained collection — one record per decision, each capturing “a single AD and its rationale” Architectural Decision Records (ADRs) · ADR GitHub organizationT2-release-notes original — is a navigable decision log an agent loads one decision at a time, not a monolith it must read whole.

Monorepos: scope to the workspace

The fourth move bounds the package layout. An unbounded repo means “the agent searches the whole repo and wastes context on irrelevant packages,” [Practitioner] AI agents in monorepos: what to configure differently from a single-product repo · Dave Barnwell (2026)T3-practitioner original because “in a monorepo, the root is an index, not the real unit of work.” AI agents in monorepos: what to configure differently from a single-product repo · Dave Barnwell (2026)T3-practitioner original The single highest-leverage fix: “make the agent decide which workspace it is operating in.” AI agents in monorepos: what to configure differently from a single-product repo · Dave Barnwell (2026)T3-practitioner original Then it “traverses the graph in steps, building up a coherent picture rather than trying to ingest everything at once.” Coding Agents in the Monorepo: Why Context Windows and 50-Service Repos Don't Mix · Tian Pan (2026)T3-practitioner original

Convergence claude-codecross-tool

A navigate-before-read primitive converges across an independent practitioner and a monorepo vendor: domain tagging “helps agents navigate before they start reading,” Coding Agents in the Monorepo: Why Context Windows and 50-Service Repos Don't Mix · Tian Pan (2026)T3-practitioner original and the project graph is “a navigable map of the codebase” Nx and AI — Why They Work so Well Together · Victor Savkin (Nx) (2025)T2-release-notes original whose tags let an agent consult structure before opening files. Practitioner + vendor agree on the navigation-layer role. [Convergence]

Treating a giant flat repo like a small one

The small-repo moves (one entry-point map, document the conventions) do not scale: at 50 services, a single map is either too shallow to help or too big to load. Reach instead for boundaries (contracts), a shallow-but-linked index, per-decision ADRs, and a workspace the agent scopes into first.

[Caveat]

Converged craft, not measured effect — no source here carries an effect size. The vendor docs (Nx) frame boundaries for architectural health, applied here to agents.

Patterns

Interface/contract docs at boundaries. Sketch: per-domain file with conventions, contracts, dependencies. When to use: multi-service / multi-domain repos. Coding Agents in the Monorepo: Why Context Windows and 50-Service Repos Don't Mix · Tian Pan (2026)T3-practitioner original Mechanics: expose the declared interface; the agent reads the boundary, not the implementation. Remember: it’s the contract-first pattern — name it, don’t call it INTERFACE.md.

Shallow index, deep links. Sketch: a shallow root that points to on-demand detail. When to use: hundreds of top-level folders. How Claude Code works in large codebases: Best practices and where to start · Anthropic (2026)T1-official original Mechanics: root = highest-level structure only; subdirectory files carry the next layer, loaded as the agent explores. Remember: a flat dump doesn’t scale; progressive disclosure does.

ADRs, append-only. Sketch: one numbered, immutable record per decision. When to use: any non-obvious structural choice. Documenting Architecture Decisions · Michael Nygard (2011)T3-practitioner original Mechanics: sequential numbers never reused; supersede via a new file; keep in source control. Architectural Decision Records (ADRs) · ADR GitHub organizationT2-release-notes original Remember: a monolith rots; a per-decision log an agent loads one at a time survives.

Scope to the workspace. Sketch: make the agent pick its package before acting. When to use: any monorepo. AI agents in monorepos: what to configure differently from a single-product repo · Dave Barnwell (2026)T3-practitioner original Mechanics: tags/graph to navigate-before-read; walk the dependency graph in steps. Nx and AI — Why They Work so Well Together · Victor Savkin (Nx) (2025)T2-release-notes original Remember: the root is an index, not the unit of work — bound the agent to one workspace.

Quick reference

  • At scale, legibility = bound what must be loaded (not document everything).
  • Interface contracts: read the boundary, not the implementation (contract-first; not INTERFACE.md).
  • Shallow index, deep links: shallow root + on-demand layers (progressive disclosure at repo scale).
  • ADRs: numbered, append-only, one decision each — beats a monolithic architecture doc.
  • Monorepos: scope to one workspace; navigate-before-read via tags/graph.
  • Evidence: converged craft, no effect sizes.

Practice

Exercise

Take a large repo (or imagine a 50-service monorepo). For an agent asked to change one service, list what it should not have to load. Which of the four moves — interface contracts, shallow index, ADRs, scope-to-workspace — most directly prevents each unnecessary load?

Practice ◆◆◆◇

Pick one domain in a multi-package repo you know. Sketch (a) the one boundary doc an agent would read instead of its implementation, and (b) the shallow-root entry that would point an agent to that domain without inlining it. State what each omits on purpose. The point is the at-scale inversion: legibility is bounding the loadable surface, not maximizing documentation.

Exercise solutions

Solution ↑ Exercise

For a one-service change the agent should not load: other services’ implementations (interface contracts let it read just the boundaries it depends on), the full repo tree (a shallow index points it to the right subtree), the history of why unrelated domains are structured as they are (per-decision ADRs scope the why to the relevant decision), and sibling packages’ code (scope-to-workspace bounds it to its package and the dependency graph it actually touches). Each move removes one class of unnecessary load — together they keep the agent’s window on the one domain it’s working in.

Part 1 Chapter 8 Last verified 2026-06-13 Fresh

Context Rot: Why Windows Degrade

The evidence that long context does not degrade gracefully — four distinct failure modes, why the robust claim is directional not numeric, and why "architectural and unsolved" overshoots in 2026. This is the problem context assembly answers.

Volatility: feature-surface
Tools compared: claude-codecross-tool
Before you start: The introduction's three-layer frame — context is the finite window the model reasons over.
You will learn
  • Why long context does not degrade gracefully — and why that makes context engineering a response to a measured failure, not a style choice
  • The four distinct failure modes (positional, length, reasoning, effective-vs-claimed) and why treating them as one number is the central error
  • Why the robust claim is directional, not numeric — and which folk figures to refuse
  • Two findings that surprise builders: coherence can hurt, and retrieval ≠ reasoning
  • Why “architectural and unsolved” overshoots in 2026, and how rot reaches the overseer

The environment half is done: the substrate is made legible, budgeted, loaded on demand, guarded, and bounded at scale. Now the window itself — what the harness assembles from all that available signal, and why it degrades. This is the problem chapter. If long contexts degraded gracefully, “just put everything in the window” would be sound and the next chapter would be unnecessary. The evidence says they do not — so context assembly (next) is a response to a measured failure. This chapter is evidence, not patterns: it builds the case, then hands you a diagnostic for locating which failure mode you’re hitting.

Degradation is four failure modes, not one

“Context rot” is an umbrella over mechanisms that fail for different reasons and are caught by different benchmarks.

  • Positional — where the token sits. Recall “is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades… in the middle” Lost in the Middle: How Language Models Use Long Contexts · Liu et al. (TACL) (2023)T3-practitioner original — the U-shaped curve.
  • Length — how much is in the window. An “evaluation across 18 LLMs… reveal[s] nonuniform performance with increasing input length,” Context Rot: How Increasing Input Tokens Impacts LLM Performance · Hong, Troynikov & Huber (Chroma Research) (2025)T3-practitioner original independent of whether the needle is found.
  • Reasoning — reasoning over the facts, not locating them. Multi-hop degradation is “primarily driven by the reduction in the length of the thinking process as the input length increases.” Reasoning on Multiple Needles In A Haystack · Wang (2025)T3-practitioner original
  • Effective vs. claimed — the marketed window is not the working one. Of models claiming 32K+, “only half… can maintain satisfactory performance at the length of 32K.” RULER: What's the Real Context Size of Your Long-Context Language Models? · Hsieh et al. (NVIDIA, COLM) (2024)T3-practitioner original
Key idea

Treating rot as one number — “models lose X% at Y tokens” — is the central error. Four mechanisms fail for different reasons and have different mitigations. Diagnose which one you’re hitting before you reach for a fix.

The robust claim is directional, not numeric

Across an 18-model panel and a peer-reviewed synthetic suite, the robust, corroborated finding is that performance falls with length and that the effective window is materially shorter than the claimed one — RULER reports “almost all models exhibit large performance drops as the context length increases.” RULER: What's the Real Context Size of Your Long-Context Language Models? · Hsieh et al. (NVIDIA, COLM) (2024)T3-practitioner original The specific percentages (“11 models drop below 50% of their strong short-length baselines” at 32K, NoLiMa: Long-Context Evaluation Beyond Literal Matching · Modarressi et al. (ICML) (2025)T3-practitioner original “only half at 32K”) are model- and benchmark-dependent.

[Caveat]

Never launder a portable “context-rot %.” The directional claim is robust; the magnitudes are benchmark-specific. Two folk figures — a “~1M-token ceiling” and a “~40% dumb zone” — have no primary backing and are deliberately not cited.

The practitioner operationalization is directional too: “get your context into the LLM in the most token- and attention-efficient way you can.” [Practitioner] 12-Factor Agents — Factor 3: Own your context window · Dex Horthy (HumanLayer) (2025)T3-practitioner original Fewer, denser tokens — not a threshold.

Two findings that surprise builders

Coherence can hurt; retrieval ≠ reasoning

(a) Coherence can hurt. Models can “perform better on shuffled haystacks than on logically structured ones” Context Rot: How Increasing Input Tokens Impacts LLM Performance · Hong, Troynikov & Huber (Chroma Research) (2025)T3-practitioner original — coherent neighbors are plausible distractors. So “add more well-organized context” is not monotonically safe. (b) Retrieval ≠ reasoning. Locating a fact at length is relatively robust; reasoning across several located facts is fragile. Reasoning on Multiple Needles In A Haystack · Wang (2025)T3-practitioner original “It found the facts” does not imply “it can reason over them.” Both argue for decomposition over context-stuffing.

”Unsolved” overshoots — and rot reaches the overseer

The strong framing — context rot is architectural and no model solves it — overshoots the 2026 evidence. The degradation is robust and near-universal today, but decode-time work shows the attenuation is partially reversible: gold tokens are down-weighted, not erased Mitigating Posterior Salience Attenuation in Long-Context LLMs with Positional Contrastive Decoding · Xiao et al. (2025)T3-practitioner original — they still occupy high-ranking positions in the decoding space, recoverable at decode time. Honest synthesis: degradation is near-universal now; whether it is fundamentally architectural or substantially trainable/decodable is open, and the 2026 frontier is actively eroding the “unsolvable” claim. Build for the degradation that exists; don’t bet the architecture on it being permanent.

One 2026 extension matters for design: rot reaches the monitor. An LLM acting as judge/monitor degrades on long transcripts, missing flagged actions far more often as the trace grows. Classifier Context Rot: Monitor Performance Degrades with Context Length · Martin & Roger (2026)T3-practitioner original Long-running agentic sessions degrade both the actor and the safety layer watching it.

[Note]

A reflection loop is one such long-running session: each critique-and-retry grows the history toward the effective-window edge — self-correction has a rot cost.

Trusting the marketed window

A 1M-token window is not 1M tokens of working memory. Filling it because it’s advertised invites every failure mode at once — positional sag, length decay, reasoning collapse, and a monitor that stops catching problems. Treat the claimed window as a ceiling, not a budget.

Diagnostic: which failure mode are you hitting?

This chapter has no pattern catalog — the responses are the next chapter. Instead, a diagnostic to locate the failure before you reach for a fix (every fix routes to Context Assembly):

  • Symptom: the agent ignores something you know is in context. → Positional — it’s likely buried mid-window. Look at placement (move load-bearing content to an edge).
  • Symptom: quality falls as the session/file grows, even when the fact is present. → Length — look at how much is loaded (prune, compact).
  • Symptom: it finds the right facts but draws the wrong conclusion. → Reasoning — look at decomposition (split the multi-hop task).
  • Symptom: it works in small repros, fails at “full” context well under the limit. → Effective-vs-claimed — look at the working window, not the marketed one.
  • Symptom: your LLM judge/monitor stops flagging issues on long runs. → Monitor rot — shorten/segment what the overseer reviews.
[Caveat]

Fast-moving area — the architectural-vs-trainable question is the live research front; recheck rather than treating today’s mitigations as settled.

Quick reference

  • Four failure modes: positional · length · reasoning · effective-vs-claimed. Different causes, different fixes.
  • Robust = directional: performance falls with length; the effective window is far shorter than the claimed window. Never quote a portable %.
  • Surprises: coherence can hurt; retrieval ≠ reasoning.
  • “Unsolved” overshoots: near-universal now, but partially trainable/decodable — an open question.
  • Rot reaches the overseer: long traces degrade the monitor too.
  • The responses are the next chapter (Context Assembly).

Practice

Exercise

Name the four failure modes from memory. For each, give the one-line symptom you’d observe in an agent session, and say whether the fix is about placement, amount, decomposition, or working-window awareness. Which one most often gets misdiagnosed as “the model isn’t smart enough”?

Practice ◆◆◆◇

You’re told “our agent gets worse on big tasks — let’s buy the 1M-context model.” Using this chapter, write the two-sentence pushback: why a bigger marketed window may not help, and which failure mode(s) you’d diagnose first. The point is to feel why “more window” is not a fix for rot — and why this chapter exists before the assembly chapter.

Exercise solutions

Solution ↑ Exercise

Positional (symptom: a known in-context fact is ignored → fix is placement); length (symptom: quality decays as the window fills, needle present → amount); reasoning (symptom: right facts, wrong multi-hop conclusion → decomposition); effective-vs-claimed (symptom: fine in small repros, fails well under the limit → working-window awareness). The most-misdiagnosed is reasoning degradation — “it found the facts but got the answer wrong” looks like a capability gap, but the evidence attributes it to a shortening thinking process at length, which is a context problem with a context fix (decompose), not a reason to swap models.

Solution ↑ Exercise

“A larger marketed window isn’t a larger working window — RULER found only half of 32K-claimed models hold up at 32K, and degradation is near-universal as length grows, so a 1M model will still rot well before 1M. I’d first diagnose length and reasoning degradation: prune/compact what’s loaded and decompose the multi-hop steps before assuming we need more context — the fix is assembly, not a bigger window.” The deeper point: rot is why context engineering exists; buying window capacity treats the symptom’s label, not the mechanism.

Part 1 Chapter 9 Last verified 2026-06-13 Fresh

Context Assembly: Engineering the Window

The engineering response to context rot — the harness owns the boundary deciding what enters the window and when. Cache stability, just-in-time loading, compaction, attention placement, and assembly-as-prompt.

Volatility: feature-surface
Tools compared: claude-codecross-tool
Before you start: The context-rot chapter — the four failure modes and the diagnostic that every fix routes here.
You will learn
  • The mental model: the window is assembled each turn from regions with different stability and roles
  • Why prefix cache stability is an economic variable, and how to protect it
  • Just-in-time loading — keep pointers, resolve to data on demand
  • Compaction — summarize vs checkpoint-and-restart, and the stale-prefix risk
  • Attention placement — the U-shaped curve, the instruction budget, and recitation
  • Why the assembled window is itself a prompt

The previous chapter established the problem: long context does not degrade gracefully. This chapter is the engineering response. The harness owns the boundary deciding what enters the window and when — and “context assembly” is the discipline of choosing, ordering, caching, and compacting the bytes that go to the model each turn. This is the deepest chapter in the book; it carries the richest pattern catalog.

The window is assembled from regions

Each turn, the harness assembles a window from regions that differ in stability and role.

Concept · The assembled window

Pre-commitment (system prompt, CLAUDE.md, tool definitions) — set before the turn, cacheable, shapes intent. Loaded context (file reads, tool results) — pulled in during the turn; the just-in-time decision lives here. Working history — the running transcript, which grows until compacted. Placement — where each piece lands, because attention is not uniform. The assembled window is itself a prompt: its prefix stability is an economic variable, its length is finite, its ordering is an attention variable.

The assembled window, one turn: a stable cacheable front (pre-commitment), a just-in-time middle (loaded context), and a volatile tail (working history) — read under a U-shaped attention curve where the edges are attended and the middle sags.A horizontal window bar split left-to-right into pre-commitment, loaded context, and working history, annotated stable-front to volatile-tail, beneath a U-shaped attention curve that is high at both edges and sags in the middle.
The assembled window, one turn: a stable cacheable front (pre-commitment), a just-in-time middle (loaded context), and a volatile tail (working history) — read under a U-shaped attention curve where the edges are attended and the middle sags.

Cache stability — the prefix is an economic variable

A practitioner building Manus calls the KV-cache hit rate the “single most important metric for a production-stage AI agent,” [Practitioner] Context Engineering for AI Agents: Lessons from Building Manus · Yichao Ji (Manus) (2025)T3-practitioner original because it drives both latency and cost. The lever is a stable prefix: no mid-history edits, deterministic serialization, append-only. Anthropic’s prompt cache has, “by default, a 5-minute lifetime” [Official] Prompt caching · AnthropicT1-official original — the window inside which stability pays off.

A controlled multi-provider evaluation quantifies the payoff: caching “reduces API costs by 41-80% and improves time to first token by 13-31% across providers,” Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks · Lumer et al. (2026)T3-practitioner original and the placement matters — “placing dynamic content at the end of the system prompt” Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks · Lumer et al. (2026)T3-practitioner original beats “naive full-context caching, which can paradoxically increase latency.” Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks · Lumer et al. (2026)T3-practitioner original

Key idea

Keep the front of the window byte-identical turn to turn, and put the volatile bits at the tail. A stable prefix is not housekeeping — it is a 41–80% cost lever and a latency lever. Perturbing the front (a timestamp, a reordered tool list) silently invalidates the cache.

[Caveat]

The “#1 metric” framing is one team’s stated priority, not a benchmark; the 41–80% / 13–31% figures are provider-dependent. Treat as directional, not universal.

Loading — just-in-time vs preload

What enters the window during a turn is a choice, not a default. The just-in-time pattern keeps “lightweight identifiers (file paths, stored queries, web links, etc.) and use[s] these references to dynamically load data into context at runtime using tools.” [Official] Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original Even tool definitions can be deferred — presenting tools as code lets models “read tool definitions on-demand, rather than reading them all up-front.” Code execution with MCP: Building more efficient agents · Jones & Kelly (Anthropic) (2025)T1-official original The framing generalizes: context operations are write / select / compress / isolate, where writing context means “saving it outside the context window.” [Practitioner] Context Engineering · The LangChain Team (2025)T2-release-notes original

Compaction — the window is finite

The premise: the window “must be treated as a finite resource with diminishing marginal returns.” [Official] Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original When it fills, you summarize (/compact) or you checkpoint-and-restart — the harness pattern of “finding a way for agents to quickly understand the state of work when starting with a fresh context window,” Effective harnesses for long-running agents · Justin Young (2025)T1-official original i.e., a progress file plus checkpoints rather than lossy in-context summarization.

Compaction can invalidate the cache

Compaction is a cache-invalidation boundary: a shipping-product regression report (since closed as not-planned) reasons that “any prefix cached before compaction is guaranteed stale after it.” Claude Code v2.1.62 — Server-Side KV Cache Stale Context Regression (P1) · Taylor (issue reporter) (2026)T3-practitioner original (Reporter-asserted, not Anthropic-confirmed.) The interaction matters — the same compaction that saves window space can cost you the cache savings if the prefix is rebuilt. Checkpoint-and-restart with a small, re-derivable prefix sidesteps both.

Attention placement — position is not neutral

Where content lands changes whether the model uses it. Recall “is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades… in the middle.” Lost in the Middle: How Language Models Use Long Contexts · Liu et al. (TACL) (2023)T3-practitioner original

Convergence claude-codecross-tool

Two independent peer-reviewed studies converge on the same curve: Liu et al. measure the mid-context accuracy drop, and Hsieh et al. attribute it to “a U-shaped attention bias where the tokens at the beginning and at the end of its input receive higher attention, regardless of their relevance.” Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization · Hsieh et al. (ACL Findings) (2024)T3-practitioner original The convergence is what makes “put the load-bearing instruction at an edge” a robust heuristic, not a single-paper artifact. [Convergence]

There is also an instruction budget: “performance consistently degrades as the number of instructions increases,” When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following · Harada et al. (EMNLP) (2025)T3-practitioner original and even the best models “only achieve 68% accuracy at the max density of 500 instructions,” How Many Instructions Can LLMs Follow at Once? · Jaroslawicz et al. (2025)T3-practitioner original with “a bias towards earlier instructions.” How Many Instructions Can LLMs Follow at Once? · Jaroslawicz et al. (2025)T3-practitioner original The practitioner countermeasure exploits the end-of-context peak: recitation — “reciting its objectives into the end of the context” (a maintained todo.md) [Practitioner] Context Engineering for AI Agents: Lessons from Building Manus · Yichao Ji (Manus) (2025)T3-practitioner original to fight goal drift.

[Caveat]

ManyIFEval + IFScale show degradation is real but model-specific — not a fixed ceiling. The “150–200 instruction” folk figure is not an established result.

Assembly is a prompt

The window’s contents are prose to engineer, not plumbing. Tool descriptions sit in the cache-sensitive pre-commitment region — author them as you would “describe your tool to a new hire on your team.” [Official] Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original Structure disambiguates: XML tags “help Claude parse complex prompts unambiguously, especially when your prompt mixes instructions, context, examples, and variable inputs,” Prompting best practices: use XML tags · AnthropicT1-official original and instruction-following gains “stem largely from parameter updates in attention modules,” A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in LLMs · Ye et al. (2025)T3-practitioner original a hint for why structured, constraint-bearing assembly is tractable.

[Note]

A reflection loop’s critique is re-injected text — it competes for this same budget and placement; assemble self-critique deliberately, it isn’t free.

Three patterns in one window Worked example

An agent works in a repo with a 50K-token config file the task barely touches. Assemble one turn’s window from three patterns at once:

  • Stable prefix — system prompt, tool definitions, and invariant project facts sit at the front, byte-identical every turn, so the cache holds (the cost and latency lever).
  • Just-in-time loading — the config is not inlined; a path-pointer stays in context, and a tool resolves only the few keys the task needs (the finite-window lever).
  • Placement — the load-bearing instruction is re-emitted at the tail, on the end-of-context attention peak, rather than buried mid-window.

One assembled window: a cached front, a minimal just-in-time middle, a load-bearing tail. The same window that re-read 50K tokens every turn now carries a few hundred — cheaper, faster, and far less exposed to rot.

Diagnostic: which assembly failure are you hitting?

Mirroring the previous chapter’s router, map an observed symptom to the assembly lever that addresses it — each routes to a pattern below:

  • Symptom: latency and cost climb every turn, even on small tasks. → Cache instability — something perturbs the prefix. Reach for stable prefix + dynamic content at the tail.
  • Symptom: quality decays as a long session or file grows. → Unbounded loading — too much is resident at once. Reach for just-in-time loading and compact-or-checkpoint.
  • Symptom: the agent ignores an instruction you know is loaded. → Misplacement — it is sagging mid-window. Reach for place at the edges; recite the goal.
  • Symptom: cost spikes right after a long session compacts. → Cache invalidation at the compaction boundary. Reach for checkpoint-and-restart with a small prefix.
  • Symptom: a multi-part prompt is parsed wrong. → Unstructured assembly. Reach for structure with delimiters.

Patterns

Stable prefix. Sketch: keep the window front byte-identical turn to turn. When to use: always. Context Engineering for AI Agents: Lessons from Building Manus · Yichao Ji (Manus) (2025)T3-practitioner original Mechanics: no timestamps, deterministic serialization, append-only. Remember: the prefix is a 41–80% cost lever; perturbing it silently breaks the cache.

Dynamic content at the tail. Sketch: put volatile bits at the end of the (system) prompt. When to use: any cacheable prefix with changing parts. Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks · Lumer et al. (2026)T3-practitioner original Mechanics: stable front, volatile tail. Remember: naive full-context caching can increase latency.

Just-in-time loading. Sketch: keep pointers; resolve to data on demand. When to use: large/optional context. Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original Mechanics: store identifiers (paths, queries); fetch via tools; defer tool defs too. Code execution with MCP: Building more efficient agents · Jones & Kelly (Anthropic) (2025)T1-official original Remember: preload the minimum; the window is finite.

Compact or checkpoint. Sketch: manage the window when it fills. When to use: long sessions. Effective harnesses for long-running agents · Justin Young (2025)T1-official original Mechanics: prefer checkpoint-and-restart from a progress file over lossy summarize. Remember: compaction can invalidate the cache — keep the restart prefix small.

Place at the edges; recite the goal. Sketch: exploit the U-shaped attention curve. When to use: load-bearing instructions; long tasks. Lost in the Middle: How Language Models Use Long Contexts · Liu et al. (TACL) (2023)T3-practitioner original Mechanics: put critical content at an edge; re-emit goals at the tail (todo.md). Context Engineering for AI Agents: Lessons from Building Manus · Yichao Ji (Manus) (2025)T3-practitioner original Remember: the middle is where attention sags.

Structure with delimiters. Sketch: mark up multi-part context. When to use: mixed instructions/context/examples. Prompting best practices: use XML tags · AnthropicT1-official original Mechanics: XML tags as semantic anchors; tool descriptions as clear prose. Remember: the assembled window is a prompt — structure shapes attention.

Quick reference

  • The window is assembled from pre-commitment / loaded / history / placement regions.
  • Stable prefix = cost + latency lever (41–80% / 13–31%); dynamic content at the tail.
  • JIT loading: keep pointers, resolve on demand; the window is finite.
  • Compaction: prefer checkpoint-and-restart; mind the stale-prefix/cache interaction.
  • Placement: edges beat the middle (U-shaped bias); budget instructions; recite goals at the tail.
  • Assembly is a prompt: engineer tool descriptions and structure.

Practice

Exercise

Map each context-rot failure mode from the previous chapter to the assembly response here: positional → ?, length → ?, reasoning → ?, effective-window → ?. Which assembly pattern addresses each? (This is the problem→response bridge made concrete.)

Practice ◆◆◆◆

An agent re-reads a large config file every turn, blowing its budget and (you suspect) breaking the cache. Prescribe a fix using two patterns from this chapter — one for what enters the window and one for prefix stability — and state the cost/latency lever each pulls. The point is to feel assembly as the lever: the same complaint from the introduction’s worked example, now with named mechanisms.

Exercise solutions

Solution ↑ Exercise

Positional → place load-bearing content at an edge (lost-in-the-middle / U-shaped bias). Length → just-in-time loading + compaction (keep the window small; treat it as finite). Reasoning → decomposition + recitation (shorten what must be reasoned over at once; re-emit goals). Effective-window → budget instructions and load JIT so the working set stays well under the marketed limit. The throughline: every rot failure mode has an assembly lever — which is exactly why the rot chapter (problem) precedes this one (response).

Solution ↑ Exercise

Fix: (1) just-in-time loading for what enters the window — keep a lightweight pointer to the config (a path/identifier) and resolve only the needed slice via a tool, instead of inlining the whole file each turn; this pulls the length/finite-window lever (smaller working set, less rot). (2) stable prefix for prefix stability — keep the cacheable front (system prompt, tool defs) byte-identical and put anything volatile at the tail, so the cache stays valid; this pulls the cost/latency lever (41–80% cost / 13–31% TTFT). Together they turn “re-read everything every turn” into “stable cached front + minimal JIT tail” — the assembly answer to the worked complaint.

Part 1 Chapter 10 Last verified 2026-06-13 Fresh

Memory: Persisting Context Across Sessions

Memory is just recalled context — so every memory anti-pattern is a context anti-pattern. Typing enables decay, the doc-vs-memory boundary is durable-shared vs fast-private, repo-as-memory is the cheap floor you outgrow. An openly unsolved design space.

Volatility: feature-surface
Tools compared: claude-codecross-tool
Before you start: The context-rot and context-assembly chapters — degradation, and the within-session assembly response.
You will learn
  • The unifying lens: memory is just recalled context, so every memory anti-pattern is a context anti-pattern
  • Why typing enables decay — you cannot invalidate what you cannot address
  • The doc-vs-memory boundary — durable-shared-reviewable vs. fast-private-decaying
  • Why repo-as-memory is the cheap floor you outgrow predictably
  • Why agent memory is openly unsolved, and how to read the framing honestly

The previous two chapters managed context within a session. This one persists it across sessions. It is the most openly unsolved topic in the book — the practices here are current best-practice scaffolding around a contested design space, not settled science, and the chapter says so plainly.

Memory is just recalled context

The unifying lens makes this chapter cohere with the two before it. A recalled memory enters the window as more text — the store is read back in. So a stale or wrong memory is a context-quality failure, and over-recall is a context-length failure. As one practitioner puts it, “incorrect memories are worse than no memories. Bad facts, once stored, pollute every future decision.” [Practitioner] Why your AI agent doesn't actually remember anything · Ed Huang (2026)T3-practitioner original

Key idea

Memory is just recalled context. Every memory anti-pattern is, at bottom, a context anti-pattern — which is why this chapter sits beside rot and assembly. Memory must be selective on recall, because unselective recall reintroduces exactly the rot the assembly discipline works to avoid.

Typing enables decay

Should memory have structure, or be one flat blob? The architecture proposals all add structure: MemGPT’s “virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems”; MemGPT: Towards LLMs as Operating Systems · Packer et al. (2023)T3-practitioner original Generative Agents’ memory stream the agent will “synthesize… into higher-level reflections, and retrieve… dynamically to plan behavior”; Generative Agents: Interactive Simulacra of Human Behavior · Park et al. (2023)T3-practitioner original and LangMem’s explicit episodic/semantic/procedural split, where the semantic type “stores key facts (and their relationships)… that ground an agent.” [Practitioner] LangMem SDK for agent long-term memory · The LangChain Team (2025)T2-release-notes original

You can't forget what you can't address

Typing is the precondition for decay. A flat, undifferentiated blob can only be truncated, not selectively forgotten or corrected. The canonical stale-memory harm — a fact “accurate until they change jobs, at which point it becomes confidently wrong” State of AI Agent Memory 2026: Benchmarks, Architectures & Production Gaps · Mem0 Engineering Team (2026)T3-practitioner original — is unfixable unless typing lets you locate and invalidate that specific fact. Structure (M1) and decay (M2) are one design move seen from two ends.

[Caveat]

These are architecture proposals, not benchmarked memory wins — no effect size. Fast-moving area; recheck after 2026-06-26.

[Note]

What a reflection step learns is itself a typed-memory candidate: persist the lesson, decaying, not the raw critique transcript.

The doc-vs-memory boundary

Given a fact, which medium does it belong in — a committed doc or ephemeral memory? The official docs draw the line on two axes: a committed doc is “instructions you write” [Official] How Claude remembers your project · AnthropicT1-official original and is “shared with your team through version control,” whereas auto-memory is “notes Claude writes itself” and is “machine-local… not shared across machines.” How Claude remembers your project · AnthropicT1-official original The API layer mirrors it as a read_only store “for reference material… the agent does not need to modify.” Using agent memory · AnthropicT1-official original

Convergence claude-codecross-tool

The same boundary is drawn by two Anthropic docs and two independent practitioners: the hand-maintained “project constitution that I maintain by hand” My .md files vs Claude's memory tool: a practitioner comparison · Andreas Belitz (2026)T3-practitioner original for what “rarely changes… defines identity and constraints,” and the committed doc as the place for “the static ‘who am I’ context that should not change across sessions.” Agent Memory Engineering · Nicolas Bustamante (2026)T3-practitioner original The principle: commit what is durable, shared, and reviewable; leave to memory what is fast, private, and disposable. [Convergence]

Repo-as-memory: the cheap floor

The filesystem is a real memory store — “the file system not just as storage, but as structured, externalized memory,” [Practitioner] Context Engineering for AI Agents: Lessons from Building Manus · Yichao Ji (Manus) (2025)T3-practitioner original unbounded and persistent, so after a context reset the agent “reads its own notes and continues” [Official] Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original long tasks. But it is the floor, not the ceiling: it has no decay, no typing, no retrieval ranking; files “get unmanageable >5MB,” AI Agent Memory Management: When Markdown Files Are All You Need? · Yaohua Chen (ImagineX) (2025)T3-practitioner original and a flat scratchpad is “scoped to a single thread” while a real store “persists across threads and can be recalled at any time.” Long-term memory · LangChainT2-release-notes original

Concept · Reach for the floor, outgrow it predictably

Repo-as-memory is the cheap baseline: start here (a notes file, a progress record). Outgrow it along known axes — when you need typing (address facts by kind), decay (invalidate stale ones), or ranking (retrieve the relevant few), promote to a structured store. The limits are predictable, which is what makes the floor a safe default.

Treating memory as solved

Agent memory is openly unsolved. One named vendor calls high-relevance staleness “a harder, open problem,” State of AI Agent Memory 2026: Benchmarks, Architectures & Production Gaps · Mem0 Engineering Team (2026)T3-practitioner original and the architecture landmarks are proposals, not benchmarked wins. Designing as if a memory layer reliably remembers the right thing — and forgets the wrong thing — overreaches the evidence. Treat typed/decay/boundary/repo-as-memory as scaffolding, and keep the durable, reviewable facts in a committed doc you control.

[Caveat]

“Memory is unsolved” here is one vendor’s (Mem0) framing, not field consensus — and that post is freshly dated; recheck after 2026-06-26.

The landscape, and the safe bet

The proposals this chapter draws on optimize for different things — which is why “openly unsolved” is a starting point, not a verdict.

What should agent memory be?

The field is exploring several shapes. MemGPT optimizes capacity — OS-style paging of a virtual context so the agent exceeds its window. MemGPT: Towards LLMs as Operating Systems · Packer et al. (2023)T3-practitioner original Generative Agents optimizes salience — reflection and retrieval over a memory stream. Generative Agents: Interactive Simulacra of Human Behavior · Park et al. (2023)T3-practitioner original Mem0 optimizes production operations — a managed memory service with benchmarks, naming high-relevance staleness as an open problem. State of AI Agent Memory 2026: Benchmarks, Architectures & Production Gaps · Mem0 Engineering Team (2026)T3-practitioner original LangMem optimizes structure — an explicit episodic / semantic / procedural split. LangMem SDK for agent long-term memory · The LangChain Team (2025)T2-release-notes original None is a settled winner. Against that, typed memory with decay is the safe bet now: the smallest commitment that survives staleness, and the precondition the richer systems assume.

Patterns

Type your memory. Sketch: split memory by kind (episodic/semantic/procedural) or tier it. When to use: anything beyond a scratchpad. LangMem SDK for agent long-term memory · The LangChain Team (2025)T2-release-notes original Mechanics: store facts apart from experiences apart from rules; address each distinctly. Remember: typing is the precondition for selective decay and correction.

Decay / invalidate. Sketch: give stored facts a validity path. When to use: any long-lived memory. Mechanics: invalidate on contradiction; prefer “no memory” over a confidently-wrong one. Why your AI agent doesn't actually remember anything · Ed Huang (2026)T3-practitioner original Remember: a stale fact is recalled with full confidence — pollution is worse than absence.

Choose the medium. Sketch: durable-shared → doc; fast-private → memory. When to use: every fact you persist. How Claude remembers your project · AnthropicT1-official original Mechanics: identity/constitution/standards → committed doc; session-accreted preferences → auto-memory. My .md files vs Claude's memory tool: a practitioner comparison · Andreas Belitz (2026)T3-practitioner original Remember: commit the durable + reviewable; leave the disposable to memory.

Repo-as-memory floor. Sketch: start with the filesystem; outgrow it on signal. When to use: cross-session state, day one. Context Engineering for AI Agents: Lessons from Building Manus · Yichao Ji (Manus) (2025)T3-practitioner original Mechanics: a notes/progress file first; promote to a structured store when you need typing, decay, or ranking. Long-term memory · LangChainT2-release-notes original Remember: the floor has no decay/typing/ranking — and bloats past a few MB.

Quick reference

  • Memory = recalled context — every memory anti-pattern is a context anti-pattern.
  • Typing enables decay — you can’t invalidate what you can’t address.
  • Decay/pollution: a stale fact is “confidently wrong”; incorrect memory is worse than none.
  • Doc-vs-memory boundary: durable-shared-reviewable (doc) vs. fast-private-decaying (memory).
  • Repo-as-memory: cheap floor; outgrow it for typing/decay/ranking; bloats past a few MB.
  • Openly unsolved — scaffolding, not settled science; recheck.

Practice

Exercise

List three facts an agent might “remember” about your project: one identity-defining (rarely changes), one session-specific preference, one that will go stale (true now, wrong later). For each, name the right medium (committed doc vs. auto-memory) and — for the stale one — what would have to invalidate it.

Practice ◆◆◆◇

An agent stores “the user prefers the staging DB” as a flat note, and months later acts on it after the user has moved to a new setup — confidently wrong. Using this chapter, diagnose the two failures (which design question each maps to) and prescribe the fix. The point is to feel “memory = recalled context”: the stale note is a context-quality failure, and the fix is structural, not “remember harder.”

Exercise solutions

Solution ↑ Exercise

Identity-defining (e.g., “this is a Rust workspace; all crates target stable”) → committed doc; it rarely changes, should be shared/reviewable, and belongs where the team sees it. Session preference (e.g., “explain changes before applying this session”) → auto-memory; private, fast, disposable. Will-go-stale (e.g., “the user is on team X”) → if kept at all, it needs an invalidation path — typed so it can be located and overwritten when the world changes; otherwise it becomes the confidently-wrong recall the chapter warns about. The split is the doc-vs-memory boundary plus the typing-enables-decay rule applied per fact.

Solution ↑ Exercise

Two failures: (1) no typing/decay (M1/M2) — the preference was stored as a flat note with no validity path, so it could not be located and invalidated when it changed; it became “confidently wrong.” (2) wrong medium / no review (M3) — a consequential, drift-prone fact lived in private auto-memory rather than a reviewable committed doc where staleness would be visible. Fix: type the memory so the specific fact is addressable and can be invalidated on contradiction; and move durable, consequential facts to the committed doc (reviewable), leaving only fast, disposable preferences to auto-memory. The throughline: the stale note is a context-quality failure (memory = recalled context), so the remedy is structural — typing, decay, and the medium boundary — not “remember harder.”

Part 1 Chapter 11 Last verified 2026-05-29 Fresh

Designing the Whole: Environment + Context as One System

The capstone — an integrative design workflow that composes the book's eight core chapters into one discipline, with decision points and an honest map of what is settled, converged, first-party-only, and openly unsolved.

Volatility: architectural-pattern
Tools compared: claude-codecross-tool
Before you start: The whole book — the environment chapters (E1–E5) and the context chapters (rot, assembly, memory).
You will learn
  • How the eight chapters compose into one design workflow, not eight separate concerns
  • A decision order for engineering an agent’s environment + context together
  • The recurring trade-offs and how to resolve them
  • An honest map of the evidence — what is measured, converged, first-party-only, and openly unsolved

This chapter is integrative. It introduces no new evidence — it composes the book’s grounded claims into a design workflow and a decision guide. Where it restates a load-bearing fact, it points back to the chapter that established it; the rest is synthesis.

[Note]

Integrative synthesis, grounded in the prior chapters — not a new evidence chapter. Read it as the “how to put it together,” not as new claims.

The two layers are one system

The book opened on a thesis: what turns a model into an agent is the engineering of the two layers around it — the environment it acts in and the context it reasons over — and that discipline is the most underappreciated, highest-leverage thing an architect designs. Eight chapters in, the payoff is that they are not eight topics but two ends of one loop: the environment is the durable store of everything the agent could use; the context is the finite slice it does use each turn; and the harness owns the boundary between them — context being “a finite resource with diminishing marginal returns.” [Official] Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original

Key idea

Engineer the environment so the right signal is available and machine-checkable; engineer the context so the right slice crosses into the window and survives. A great environment is wasted if assembly drags the wrong slice in; the smartest assembly cannot surface signal an unengineered environment never made legible. Design both, in order.

A design workflow

The chapters fall into a natural order when you design a real agent’s environment and context together.

  1. Make the environment legible (E1, E5). Maximize signal in and machine-checkable feedback out; at scale, bound what must be loaded (interface contracts, shallow index, scope-to-workspace).
  2. Budget the always-on layer (E2). The instruction file is paid every turn — spend it only on broadly-applicable, can’t-infer-from-code context. More is not better. Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? · Gloaguen, Mündler, Müller, Raychev, Vechev (ETH Zurich) (2026)T3-practitioner original
  3. Push procedures to load on demand (E3). Skills are just-in-time procedural knowledge; keep them out of the window until relevant.
  4. Set the safety envelope (E4). Express intent in policy (permissions), contain failure in mechanism (sandbox, out-of-band reversibility).
  5. Engineer the window against rot (C1 → C2). Know the four failure modes; assemble a stable, well-placed, just-in-time window; compact or checkpoint as it fills.
  6. Persist deliberately (C3). Commit the durable and reviewable; leave the disposable to (typed, decaying) memory — and remember recalled memory is just more context.
Concept · The loop, not the list

Steps 1–4 shape the environment (what’s available and allowed); steps 5–6 shape the context (what crosses, and what persists). The loop closes because every context decision (what to load, place, compact, remember) is constrained by how legible and bounded the environment is — and every environment decision exists to make the context decision tractable.

Decision points

The recurring trade-offs, and how the book resolves them:

  • Signal vs. budget. Add context to help, or subtract to protect the window? Default to subtract: legibility and examples beat prose, and the one measured result says unnecessary context-file content reduces success. Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? · Gloaguen, Mündler, Müller, Raychev, Vechev (ETH Zurich) (2026)T3-practitioner original Add only broadly-applicable, can’t-infer context to the always-on layer; everything else loads on demand.
  • Stability vs. freshness. A stable prefix is a large cost/latency lever, but content changes. Resolve by placement: stable front, volatile tail.
  • Where a fact lives. Always-on (CLAUDE.md), on-demand (skill), or remembered (memory)? Fact that applies broadly → instruction layer; procedure → skill; durable + reviewable → committed doc; fast + private + disposable → memory.
  • Placement under rot. Load-bearing content goes at an edge, not the middle; Lost in the Middle: How Language Models Use Long Contexts · Liu et al. (TACL) (2023)T3-practitioner original decompose multi-hop reasoning rather than stuff the window.
  • Ergonomics vs. enforcement. Skills and filters shape what the model sees; they are not security. Real restriction lives in the permission/sandbox layer.
The one question that locates most decisions

“Is this paid every turn, or only when relevant?” Always-on context (instruction layer, tool defs) is expensive and must be minimal and stable; on-relevance context (skills, JIT loads, recalled memory) is where the discipline buys leverage. Most env/context design decisions reduce to placing each piece on the right side of that line.

An honest map of the evidence

The book’s claims sit at very different evidence tiers, and designing well means weighting them accordingly.

  • Measured (rare). One controlled result anchors the instruction layer: unnecessary context-file content reduces success and adds cost. Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? · Gloaguen, Mündler, Müller, Raychev, Vechev (ETH Zurich) (2026)T3-practitioner original Treat as one study, not law.
  • Converged (strong). The CLAUDE.md short-and-hand-curated rule, the U-shaped positional curve, navigate-before-read at scale, and the doc-vs-memory boundary each have independent corroboration — the strongest signal a craft discipline offers.
  • First-party-only (authoritative, uncorroborated). Skills mechanics are entirely Anthropic-sourced — authoritative on what they are, not yet independent evidence of efficacy.
  • Openly unsolved. Memory is scaffolding around a contested space, State of AI Agent Memory 2026: Benchmarks, Architectures & Production Gaps · Mem0 Engineering Team (2026)T3-practitioner original and whether context rot is permanent or trainable is the live 2026 front. Build for what exists; don’t bet the architecture on either resolving.
Designing as if it's all settled

The biggest capstone error is flattening the tiers — treating a single study as a law, a first-party mechanic as proven efficacy, or an unsolved problem as solved. Design to the evidence you have: lean on the converged practices, hold the measured result as one data point, and keep the unsolved parts (memory, rot-permanence) behind reversible, re-checkable choices.

The boundary of this volume

This book engineers two of the harness’s layers — the environment the agent acts in and the context it reasons over. It stops, deliberately, at control flow: how an agent critiques and retries its own work — reflection, or self-correction — and how multiple agents are coordinated are the companion D1 orchestration volume’s subject, not this one’s. What this volume owns of reflection is only its footprint — the environment a critic step reads, and the context its critique writes back, a cost the rot, assembly, and memory chapters each flag where it lands.

Quick reference

  • One system: environment makes signal available + checkable; context decides what crosses + persists.
  • Workflow: legible environment → budget the always-on → push procedures on-demand → set the safety envelope → engineer the window vs rot → persist deliberately.
  • The locating question: paid every turn, or only when relevant?
  • Default to subtract in the always-on layer (the one measured result).
  • Weight the evidence: measured (rare) → converged (strong) → first-party (uncorroborated) → unsolved (scaffold).

Practice

Exercise

Walk the six-step workflow for an agent you’d actually build. At each step, name the one decision you’d make and which chapter grounds it. Where in your design is the weakest evidence (first-party-only or unsolved), and how would you keep that choice reversible?

Practice ◆◆◆◆

Take a single recurring agent failure (re-reading files, losing the thread across sessions, acting on stale state, ignoring an instruction it was given). Trace it through the whole book: which layer (environment or context), which chapter’s mechanism diagnoses it, and which decision-point trade-off resolves it. Write the fix as a sequence of moves across both layers. The point is to feel the book as one discipline — most real failures are not one chapter’s, but a path through several.

Exercise solutions

Solution ↑ Exercise

A representative pass: (1) legible environment — add an entry-point map + examples (E1); (2) budget — cut the CLAUDE.md to broadly-true facts (E2, grounded in the ETH result); (3) on-demand — move the release procedure to a Skill (E3); (4) safety — deny prod writes, sandbox the rest (E4); (5) window — place the task spec at an edge, load files JIT, checkpoint long runs (C1/C2); (6) persist — commit the project constitution, let auto-memory hold session preferences (C3). The weakest-evidence spots are usually the Skill efficacy (first-party-only) and any memory layer (unsolved) — keep both reversible: skills are easy to remove, and durable facts live in the committed doc you control, so a memory failure degrades gracefully.

Solution ↑ Exercise

Example — “loses the thread across sessions”: it’s a context failure that spans chapters. Diagnose with C1 (the window doesn’t carry prior state) and C3 (nothing durable persisted it). Resolve via the decision points: where a fact lives (the durable project state belongs in a committed doc, not ephemeral memory) and engineer the window (checkpoint-and-restart from a progress file rather than relying on a giant carried-over transcript). Fix as a sequence: (env) add a progress/notes file the agent reads on start; (context) checkpoint at session end and restart from the file with a small stable prefix; (persist) commit the durable identity/constraints so they’re reloaded deterministically. The failure wasn’t one chapter’s — it was a path from rot (C1) through assembly (C2) to memory (C3), which is exactly how the book is meant to be used.

Part 2 Chapter 12 Last verified 2026-06-13 Fresh

Beyond One Agent, One Tool

The spine of the Tools & Orchestration volume. Two axes organize everything that follows — capability is a context cost (so the default is to subtract), and coordination is a context-isolation move (so a new unit is a fresh window, not an added skill). The chapter sets the volume's altitude and maps its chapters onto those two axes.

Volatility: architectural-pattern
Tools compared: claude-codecross-tool
Before you start: Vol 1's frame — Agent = Model + Harness, and context as a finite budget. No tool or orchestration specifics assumed; this chapter sets them up.
You will learn
  • The two axes this volume is organized around — capability as a context cost, and coordination as a context-isolation move
  • Why both axes reduce to one currency — the context window — so both have the same default: add only when it earns its place
  • The difference between a primitive (one isolated unit) and a topology (how many units coordinate)
  • The map of the volume, and which chapters sit on which axis

Vol 1 engineered the two layers around the model — the environment an agent acts in and the context it reasons over. This volume takes the harness’s remaining moves: the tools an agent reaches for, and the orchestration of more than one agent. Both look like ways to add power. The thesis of this chapter is that both are better understood as ways to spend context — and that seeing them in that single currency gives both the same governing default.

From one agent to a system of capability and coordination

Vol 1 left two of the harness’s components deliberately unbuilt: the tool interface, and sub-agent orchestration. They are this volume’s subject. An agent, in the working definition the series uses, is a system where models “dynamically direct their own processes and tool usage” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original — so the moment you move past a single bare model, you are making two kinds of decision: what capability to expose (which tools, which protocol, which prompts), and how to coordinate when one agent is not enough.

It is tempting to treat both as additive — more tools, more agents, more power. The rest of this chapter argues the opposite framing, because the additive view is exactly what produces bloated, slow, hard-to-debug systems. Both decisions spend the same scarce resource.

Key idea

Tools and orchestration are not two separate topics — they are the two axes of one design space, and the axis they share is the context window. Capability spends context directly; coordination spends it by multiplying windows. The volume is organized so that every chapter is a move on one of these two axes.

The first axis: capability is a context cost

The first axis is what capability to expose. The naive view treats a capability as free until used. It is not. Context is “a critical but finite resource for AI agents” [Official] Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original — a budget — and every tool you expose draws on it whether or not it fires, because its definition sits in the window and its presence enlarges the model’s selection problem.

This is why the governing default on the capability axis is subtraction: “More tools don’t always lead to better outcomes.” [Official] Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original The same instinct governs the build-vs-buy decision (start direct on the API; add abstraction only when it earns its keep, since “many patterns can be implemented in a few lines of code” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original ), the protocol you wire external capability through, and the way you shape the agent’s text input and output.

Concept · Capability as context cost

Every capability you give an agent — a tool, a framework abstraction, an MCP server, a verbose system prompt — is spent in the context window before it is ever used. The design question is therefore not “what could this agent do?” but “what does this workflow need?” — and the default answer subtracts toward that minimum.

The second axis: coordination is a context-isolation move

The second axis is how to coordinate once one agent is not enough. Here the key reframing is sharper still: a sub-agent does not give the model a new skill — it gives the work a fresh, separate window. Sub-agents “use their own isolated context windows.” [Official] Building agents with the Claude Agent SDK · Thariq Shihipar et al. (2025)T1-official original The value is the isolation, not an added capability: a unit of work runs out of band, untouched by the main conversation’s history, and returns only its relevant result.

So coordination, too, is a context decision — not “add an agent to gain an ability” but “add a window to quarantine context or parallelize work.” That reframing carries the whole orchestration half: you reach for another agent when there is context to isolate or independent work to fan out, and you do not when the coordination cost outweighs the gain.

Both axes, one currency

Capability spends context by what sits in a window; coordination spends it by how many windows you run. Because both reduce to the same finite resource, both inherit the same default — add only when the workflow demonstrably needs it. An architect who internalizes “it’s all context” stops asking “can I add this?” and starts asking “what does this cost me in the window, and is the work worth it?”

Primitive versus topology

One distinction prevents most orchestration confusion: the primitive is not the topology.

  • A sub-agent is the primitive — a single isolated unit: fresh context in, relevant result out.
  • A multi-agent system is a topology — how you coordinate many such units (an orchestrator delegating to workers, say).

Conflating them produces both over- and under-engineering: building a multi-agent topology when one isolated sub-agent would do, or expecting a lone sub-agent to deliver what only a coordinated topology can. The orchestration half of this volume treats them as different objects — the isolation primitive first, the coordination of many second — and gates the topology on cost, because coordinating agents is not free.

[Note]

“Capability axis” and “coordination axis” are this book’s organizing framing, laid over the sources’ separate treatments of tools and orchestration — a lens for reading the volume, not a term the primaries use.

The map of this volume

The chapters move along the two axes — first what capability to expose, then how to coordinate.

Capability — what to expose, and how.

  • Build vs. buy — whether to write thin on the API, configure a harness, or build one; the start-direct default.
  • Tool minimization — the governing subtract-first discipline for the tool set itself.
  • MCP — wiring external capability against a least-privilege, capability-negotiated protocol.
  • Shaping input — the prompting craft — the five moves that shape what goes into the agent.
  • Shaping output — structured & reliable — the levers that force machine-readable output you can trust coming back.

Coordination — how many windows, and how they cooperate.

  • Sub-agents — the context-isolation primitive: a fresh window that inherits nothing and returns only the result.
  • Multi-agent — the coordination topology, and the cost gate that decides whether it is worth it.
The volume's two axes. Horizontal: capability — what to expose — with subtract-first as the default direction (build-vs-buy, tool minimization, MCP, and shaping I/O — split across prompting then structured output). Vertical: coordination — how many isolated windows — from a single agent up through the sub-agent primitive to multi-agent topologies. Both axes are measured in the same currency: context.Two axes crossing at an origin labeled 'one agent, minimal tools'. The horizontal axis is 'Capability — what to expose', with capability themes build-vs-buy, tool minimization, MCP, and shaping I/O (prompting and structured output), and an arrow marked 'subtract-first default'. The vertical axis is 'Coordination — how many isolated windows', rising from one agent through 'sub-agent (primitive)' to 'multi-agent (topology)'. A note at the corner reads 'both axes spend the same currency: context'.
The volume's two axes. Horizontal: capability — what to expose — with subtract-first as the default direction (build-vs-buy, tool minimization, MCP, and shaping I/O — split across prompting then structured output). Vertical: coordination — how many isolated windows — from a single agent up through the sub-agent primitive to multi-agent topologies. Both axes are measured in the same currency: context.
Placing a decision on the two axes Worked example

A team says: “Our agent is slow and unreliable. Should we split it into a multi-agent system?”

Locate the question before answering it:

  • “Slow and unreliable” is often a capability-axis problem first — too many overlapping tools causing wrong-tool selection, or verbose tool responses flooding the window. That is subtract-first territory, and it is cheaper to fix than any topology change.
  • Only if the work genuinely fans out into independent sub-problems is this a coordination-axis question — and then the move is to isolate units (sub-agents), and a multi-agent topology only if the parallel gain clears the coordination cost.
  • Jumping straight to “multi-agent” skips the cheaper axis and adds windows (cost) to paper over a capability problem.

The two axes turned a vague “make it better” into a located decision — capability first, coordination only when the work’s shape demands it.

Treating more as better

The single most common mistake this volume guards against is the additive reflex: more tools, more agents, more abstraction must mean more capable. On both axes the opposite is the default — capability is subtracted toward the workflow, and coordination is added only when there is context to isolate or work to parallelize. “Can we add this?” is the wrong question; “what does it cost in the window, and is it worth it?” is the right one.

Quick reference

  • Two axes, one currency: capability (what’s in a window) and coordination (how many windows) both spend context. Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original
  • Capability default — subtract: expose what the workflow needs, not what the platform offers. Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original
  • Coordination default — isolate, don’t add skill: a sub-agent is a fresh window, valuable for isolation, not capability. Building agents with the Claude Agent SDK · Thariq Shihipar et al. (2025)T1-official original
  • Primitive ≠ topology: the sub-agent is the unit; the multi-agent system is how units coordinate.
  • The map: capability chapters (build-vs-buy, tool minimization, MCP, prompting, structured output) → coordination chapters (sub-agents, multi-agent).
  • The reflex to unlearn: “can I add this?” → “what does it cost in the window, and is the work worth it?”

Practice

Exercise

The chapter claims tools and orchestration “reduce to one currency.” Name the currency, and explain in one sentence each how the capability axis and the coordination axis both spend it. Why does sharing a currency mean they share a default?

Practice ◆◆◇◇

Take an agent you run or have read about. Write down one change you might make on the capability axis (a tool added/removed, an abstraction adopted) and one on the coordination axis (splitting work across windows). For each, state what it costs in the context window and what would have to be true for it to earn that cost. The point is to feel both axes priced in the same currency before committing to either.

Exercise solutions

Solution ↑ Exercise

The currency is the context window (a finite resource). The capability axis spends it directly: every exposed tool, abstraction, or verbose prompt occupies the window and enlarges the model’s selection problem, whether or not it is used. The coordination axis spends it by multiplication: each additional agent is another window to populate, run, and pay for. Because both ultimately draw on the same finite budget, they share a default — add only when the workflow demonstrably needs it (subtract on the capability axis; isolate-only-when-worth-it on the coordination axis) — rather than treating either tools or agents as free additions.

Solution ↑ Exercise

A worked example. Take a code-review agent with twelve tools. Capability-axis change: remove the three overlapping search tools (grep, find_symbol, semantic_search) in favour of one search tool. Cost in the window: each removed tool was ~150 definition tokens at rest and a recurring wrong-tool-selection risk; consolidating reclaims both. What must be true to earn it: the one survivor has to actually cover the workflow’s searches — so the bar is “does the merged tool lose any search the agent genuinely needs?” If no, the change is pure win (less budget, less selection error). Coordination-axis change: split the review into a fan-out of per-file reviewer sub-agents that each return only their findings. Cost in the window: every sub-agent is a fresh window to populate and pay for — a real multiple of the single-agent token spend. What must be true to earn it: the files must be reviewable independently (no cross-file context needed) and the parallel speed-up or context-isolation gain must clear that multiple; if the review needs whole-repo context in one head, the split costs more than it returns. The point is that both changes are priced in the same currency — window tokens — so the decision rule is identical: does the spend buy more than it costs, for this workflow?

Part 2 Chapter 13 Last verified 2026-06-13 Fresh

Build vs. Buy: Choosing a Harness

The first move on the capability axis — start direct on the API and add a harness abstraction only when it earns its keep. Why a framework's convenience is bought with abstraction that obscures prompts and is harder to debug, why a custom harness is a standing maintenance liability as models improve, what the framework landscape offers per each vendor's own docs, and why the realistic answer is the configure-wrap-extend middle path rather than the build-or-buy binary.

Volatility: architectural-pattern
Tools compared: claude-codecross-tool
Before you start: The spine chapter's first axis — capability is a context cost, so the default is to subtract. Vol 1's working definition of an agent as Model + Harness helps but is not required.
You will learn
  • Why the starting default is direct on the API, not a framework — and not a build-from-scratch harness either
  • The core tradeoff a framework forces — abstraction cost vs. convenience — and why the abstraction is what you pay
  • Why building your own harness is not free either: it is a standing maintenance liability as models improve
  • The framework landscape read honestly — what LangGraph, CrewAI, the Claude Agent SDK, and the OpenAI Agents SDK each say they provide, not a cross-vendor ranking
  • The realistic answer — the configure / wrap / extend middle path, and the sequenced rule that follows from it

This is the first move on the volume’s capability axis. The spine framed every capability — a tool, an abstraction, a framework — as a context cost paid before it is used; the build-vs-buy decision is the first place that bites. The thesis is a default: start direct on the API and let the harness earn itself. The middle path between “build it all” and “buy a framework” is configure, wrap, extend — and the rule that organizes the whole decision is sequenced, not one-shot: simplest thing that works first, abstraction added only when a concrete need demands it.

Start direct, start simple

Begin with the recommendation and let it organize everything else. Anthropic’s agent-building guidance states the default flatly: developers should “start by using LLM APIs directly,” because “many patterns can be implemented in a few lines of code.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original This is not a beginner’s on-ramp you graduate out of — it is the recommended starting point for production systems, and the evidence behind it is an aggregated cross-team observation: across the many teams Anthropic worked with, “the most successful implementations weren’t using complex frameworks or specialized libraries.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original

So the decision does not open with “which framework?” It opens with “do I need one at all?” — and the default answer is no.

Key idea

The starting point is neither a big custom harness nor a heavy framework. It is a thin, direct implementation on the API, with abstraction added later and only when a concrete need justifies it. “Can I adopt a framework?” is the wrong opening question; “what does this workflow actually need that direct API calls don’t give me?” is the right one.

This is the same instinct the spine drew as the capability axis, applied to the harness layer instead of the tool set. A framework is a capability you expose to yourself — and like every capability, it is paid for before it is used, in the abstraction it inserts between you and the model. The default is therefore subtraction here too: the smallest harness that covers the workflow beats a feature-complete one.

The tradeoff: abstraction cost vs. convenience

A framework’s appeal is real — it gets you moving faster. The question is what that convenience costs. The cost is abstraction, and the specific damage is to visibility: frameworks can insert layers that obscure the underlying prompts and responses, “making them harder to debug.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original When the agent misbehaves, you are now debugging through the framework’s abstractions instead of reading the prompt and response directly.

That reframes the decision. It is not “which framework is best” — it is “is the convenience worth the lost visibility on this workflow?”

Concept · Abstraction cost

The price a framework charges for its convenience: layers between you and the model that obscure the prompts and responses you would otherwise read directly, making failures harder to diagnose. A framework can be designed to minimize this cost — but it is never zero, and it is the thing you weigh against the convenience.

The cost is not fixed across frameworks, which is exactly why the decision is a spectrum rather than a switch. A framework can be designed to keep the prompts visible: LangGraph, for instance, describes itself as low-level and states that it “does not abstract prompts or architecture.” [Official] LangGraph overview · LangChainT2-release-notes original That is the framework-vendor’s own answer to the debuggability concern — and it reframes the axis from “framework vs. no framework” to “how much abstraction, and is it visible.” Read it as LangGraph’s stated design stance, not as a measured property or a ranking against other frameworks.

[Tip]

A useful litmus test for any framework you are evaluating: when the agent does the wrong thing, how many layers must you read through to see the actual prompt and the actual response? If the answer is “more than one,” you are paying abstraction cost.

The other side: building is a standing liability

It is tempting to read “don’t reach for a framework” as “build your own harness.” But the build side has its own recurring cost, and it is easy to underprice. A harness is not a one-time artifact: “harnesses encode assumptions that go stale as models improve.” [Official] Scaling Managed Agents: Decoupling the brain from the hands · Anthropic (Lance Martin, Gabe Cemaj, Michael Cohen) (2026)T1-official original The behaviors you tune around — how the model handles a long context, when it over-eagerly calls a tool, how it formats output — shift with each model release, and the assumptions your harness baked in decay with them.

Key idea

A custom harness is a standing maintenance liability, not a finished deliverable. Its assumptions about model behavior go stale as models improve, so “build” must be priced as continuous re-tuning against every model release — not as a fixed up-front cost. This is precisely what tilts many teams toward a configurable harness they do not have to re-maintain themselves.

So both ends of the binary carry a cost. A framework charges abstraction cost (lost visibility); a from-scratch build charges ongoing maintenance (staleness as models move). Neither is free, which is why the realistic answer is rarely either extreme.

The framework landscape — read honestly

If you do reach for an existing harness, four options dominate the conversation. The honest way to read this landscape is one framework at a time, against its own documentation — because each framework’s description of itself is authoritative on what it provides and not on how it ranks against the others. There is no independent cross-vendor benchmark here; each line below is a vendor self-description.

  • LangGraph describes itself as “a low-level orchestration framework and runtime for building, managing, and deploying long-running, stateful agents.” [Official] LangGraph overview · LangChainT2-release-notes original Its documented capability set centers on durable execution, streaming, human-in-the-loop, and persistence.
  • CrewAI describes itself as “the leading open-source framework for orchestrating autonomous AI agents and building complex workflows,” organized around role-based crews. [Official] Introduction (What is CrewAI?) · CrewAI (documentation)T2-release-notes original The word “leading” is CrewAI’s own framing, not an independent ranking.
  • The Claude Agent SDK “gives you the same tools, agent loop, and context management that power Claude Code, programmable in Python and TypeScript.” [Official] Agent SDK overview · AnthropicT1-official original Its documented capabilities include built-in tools, hooks, subagents, MCP, permissions, and sessions.
  • The OpenAI Agents SDK describes itself as “a lightweight, easy-to-use package with very few abstractions,” built around a small primitive set. [Official] OpenAI Agents SDK · OpenAI (Agents SDK documentation)T2-release-notes original “Very few abstractions” is the vendor’s own positioning.
Laundering a self-description into a ranking

Each official tag above is official relative to that one vendor — authoritative on what the framework provides, nothing more. The dangerous move is to collapse the four self-descriptions into a cross-vendor verdict: reading CrewAI’s “leading” as “CrewAI is the best,” or OpenAI’s “very few abstractions” as “fewer than LangGraph’s.” Those are not measured comparisons — they are four vendors each describing themselves. The adoption question is “which capability set matches my workflow,” answered from the docs, not “which framework wins,” answered from adjectives.

Because these are framework docs, they move fast — each ships per release, and the self-descriptions drift. Treat the four lines above as a snapshot, not a constant.

[Caveat]

The framework-landscape lines carry a 90-day recheck (the underlying docs were captured 2026-05-27; re-verify the verbatim positioning by ~2026-08-25). Expect renamed products, new entrants, and shifted positioning.

The middle path: configure, wrap, extend

Put the two costs together and the binary dissolves. You rarely face a clean choice between writing everything yourself and surrendering to a framework. The realistic answer is in the middle: start thin on the direct API, and where you do adopt, adopt something configurable that you assemble from primitives.

The Claude Agent SDK is the canonical instance of this stance. It is “the agent harness that powers Claude Code,” [Official] Building agents with the Claude Agent SDK · Thariq Shihipar et al. (2025)T1-official original so adopting it means taking a production-proven harness rather than writing one — and at its core “the SDK gives you the primitives to build agents” [Official] Building agents with the Claude Agent SDK · Thariq Shihipar et al. (2025)T1-official original for whatever workflow you are automating, composed and extended through subagents and hooks. That is the configure/wrap/extend position made concrete: a configurable harness you take, shape, and extend, rather than a from-scratch build you own end-to-end or an opaque framework you cannot see into.

Why the middle path dominates the binary

Configure/wrap/extend gets you most of what the “build” end offers — control over orchestration, tools, and workflow — without the standing maintenance liability of a from-scratch harness, and most of what the “buy” end offers — speed, production-proven plumbing — without surrendering visibility into the prompts. You take a harness someone else maintains against model changes, and you extend it from primitives where your workflow is specific. It is not a compromise between two good options; it is usually the better option than either pure extreme.

This is where the abstraction-cost axis pays off: the SDK route lets you adopt without going opaque, because you are assembling from primitives rather than inheriting a black box. The custom-build route’s maintenance liability is also softened, because the harness’s general plumbing is maintained by its vendor against new models — you only own the thin extension layer your workflow actually needs.

The build-vs-buy spectrum. From left: start direct on the API (the recommended starting point), then configure / wrap / extend a configurable harness (the middle path), then build from scratch (the right end). Control rises left to right — and so does maintenance cost. The default is to start at the left and move right only as far as a concrete need pulls you.A horizontal spectrum of five boxes left to right: 'start direct on the API' (marked 'start here'), 'configure a harness', 'wrap a harness', 'extend from primitives' (the middle three bracketed as 'the middle path: configure / wrap / extend'), and 'build from scratch'. Below run two parallel arrows pointing right: one labeled 'less control' to 'more control', the other 'low maintenance cost' to 'high maintenance cost'.
The build-vs-buy spectrum. From left: start direct on the API (the recommended starting point), then configure / wrap / extend a configurable harness (the middle path), then build from scratch (the right end). Control rises left to right — and so does maintenance cost. The default is to start at the left and move right only as far as a concrete need pulls you.

The sequenced rule

Compose the criteria with the default and the middle path and the whole decision reduces to a sequence — not a one-shot choice, but an order you move through only as far as a real need pulls you.

  1. Start without a framework. Direct API, thin harness, because many patterns are a few lines of code. [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original
  2. Adopt a configurable harness when a concrete need earns the abstraction — preferring one that keeps prompts visible [Official] LangGraph overview · LangChainT2-release-notes original and that you do not have to re-tune against every model release. [Official] Scaling Managed Agents: Decoupling the brain from the hands · Anthropic (Lance Martin, Gabe Cemaj, Michael Cohen) (2026)T1-official original The Claude Agent SDK is the canonical configure/wrap/extend option. [Official] Building agents with the Claude Agent SDK · Thariq Shihipar et al. (2025)T1-official original
  3. Build a fully custom harness only when no configurable option fits — and price it as the continuous maintenance it is, because the staleness cost is now yours to carry. [Official] Scaling Managed Agents: Decoupling the brain from the hands · Anthropic (Lance Martin, Gabe Cemaj, Michael Cohen) (2026)T1-official original
Key idea

The decision is sequenced, not one-shot: simplest thing that works first, abstraction earned incrementally. You do not pick a point on the spectrum at the start; you start at the left end and move right exactly as far as a demonstrated need carries you — and no further.

A note on scope: this chapter is the decision — when to build, buy, or take the middle path. What a harness actually is (the agent loop, context management, the tool interface as components) is the Model + Harness framing the foundation volume develops; the orchestration of more than one agent — sub-agents and multi-agent topologies — is the later, coordination half of this volume. Here the decision stops at the single-harness choice.

Patterns

Start direct on the API. Sketch: implement the workflow with direct LLM API calls before reaching for any framework. When to use: always, as the opening move — most patterns are a few lines of code. Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original Mechanics: write the loop yourself; add a tool call, a retry, a parse — measure whether a real need for abstraction appears. Remember: the default answer to “do I need a framework?” is no; the burden of proof is on the abstraction.

Weigh abstraction cost against convenience. Sketch: before adopting a framework, price the visibility you lose. When to use: any time a framework’s speed is tempting. Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original Mechanics: ask how many layers stand between you and the actual prompt/response when debugging; prefer a framework that keeps prompts visible. LangGraph overview · LangChainT2-release-notes original Remember: convenience is bought with abstraction, and abstraction is paid in debuggability.

Price the build as maintenance, not a one-time cost. Sketch: treat a custom harness as a standing liability. When to use: whenever “just build it ourselves” is on the table. Scaling Managed Agents: Decoupling the brain from the hands · Anthropic (Lance Martin, Gabe Cemaj, Michael Cohen) (2026)T1-official original Mechanics: estimate re-tuning cost per model release, not just the up-front build; that recurring cost is the real comparison. Remember: harnesses encode assumptions that go stale as models improve.

Take the configure/wrap/extend middle path. Sketch: adopt a configurable, production-proven harness and extend it from primitives. When to use: when a concrete need has earned abstraction but a from-scratch build is overkill. Building agents with the Claude Agent SDK · Thariq Shihipar et al. (2025)T1-official original Mechanics: take a harness like the Claude Agent SDK; compose subagents and hooks; own only the thin extension your workflow needs. Agent SDK overview · AnthropicT1-official original Remember: you get control without the full maintenance burden, and adoption without going opaque.

Read the landscape per vendor, not as a ranking. Sketch: evaluate each framework against its own docs and your workflow. When to use: the moment you decide an existing harness is warranted. Introduction (What is CrewAI?) · CrewAI (documentation)T2-release-notes original Mechanics: match documented capability sets to your workflow’s needs; treat “leading” and “few abstractions” as self-descriptions. OpenAI Agents SDK · OpenAI (Agents SDK documentation)T2-release-notes original Remember: there is no independent cross-vendor benchmark in the vendors’ own marketing copy.

Quick reference

  • Default: start direct on the API — most patterns are a few lines of code; the most successful implementations weren’t using complex frameworks. Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original
  • The framework cost: convenience is bought with abstraction that obscures prompts and responses, making them harder to debug. Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original
  • The build cost: a custom harness is a standing maintenance liability — its assumptions go stale as models improve. Scaling Managed Agents: Decoupling the brain from the hands · Anthropic (Lance Martin, Gabe Cemaj, Michael Cohen) (2026)T1-official original
  • Landscape (per each vendor’s own docs, not a ranking): LangGraph = low-level orchestration framework + runtime; LangGraph overview · LangChainT2-release-notes original CrewAI = “leading” open-source role-based orchestration (self-described); Introduction (What is CrewAI?) · CrewAI (documentation)T2-release-notes original Claude Agent SDK = the tools/agent loop/context management that power Claude Code; Agent SDK overview · AnthropicT1-official original OpenAI Agents SDK = “very few abstractions” (self-described). OpenAI Agents SDK · OpenAI (Agents SDK documentation)T2-release-notes original
  • Middle path: configure / wrap / extend a configurable harness — e.g. the Claude Agent SDK, built from primitives. Building agents with the Claude Agent SDK · Thariq Shihipar et al. (2025)T1-official original
  • The rule: sequenced — start without a framework → adopt a configurable one when a need earns it → build from scratch only when nothing fits.
  • Honesty: vendor self-descriptions are not a cross-vendor verdict; the framework lines carry a 90-day recheck.

Practice

Exercise

The chapter says both ends of the build-vs-buy binary carry a cost. Name the cost on each end — the cost of buying a framework, and the cost of building a custom harness from scratch — and say which one is paid continuously rather than up front. Why does having a cost on both ends make the middle path the usual answer?

Practice ◆◆◆◇

You are evaluating two frameworks. One’s docs say it is “the leading framework for orchestrating agents”; the other’s say it offers “very few abstractions.” A teammate concludes the second is more debuggable than the first and the first is more capable. Write two or three sentences explaining what is wrong with both conclusions, and state how you would legitimately decide between the two. The point is to practice the evidence discipline — reading vendor self-descriptions as self-descriptions, not as a cross-vendor ranking.

Practice ◆◆◆◇

A team has built a custom agent harness over six months and it works well on the current model. Argue, in three or four sentences, why “it works well today” is not sufficient grounds to keep maintaining a from-scratch build — and what specifically they should re-examine at the next model release. Then say what the configure/wrap/extend middle path would have changed about their position. Ground your answer in the chapter’s maintenance-cost claim, not in a general preference for frameworks.

Exercise solutions

Solution ↑ Exercise

The buy end’s cost is abstraction cost: a framework’s convenience is bought with layers that obscure the underlying prompts and responses, making failures harder to debug. The build end’s cost is ongoing maintenance: a custom harness encodes assumptions about model behavior that go stale as models improve, so it must be continuously re-tuned. The maintenance cost is the one paid continuously rather than up front — it recurs with every model release, whereas the abstraction cost is a fixed property of the framework you adopted. Because neither extreme is free — one charges lost visibility, the other charges perpetual re-tuning — the realistic answer is usually the middle: take a configurable, production-proven harness (so its general plumbing is maintained against new models for you) and extend it from primitives (so you keep visibility and own only the thin layer your workflow actually needs).

Solution ↑ Exercise

Both conclusions launder a vendor’s self-description into a cross-vendor comparison the docs never make. “Leading” is CrewAI-style marketing framing, not an independent ranking, so it licenses no claim that one framework is “more capable” than the other; “very few abstractions” is the OpenAI-style vendor’s own positioning, authoritative on what it says it provides but not a measured statement that it is more debuggable than some other framework. The legitimate way to decide is to read each framework against its own documented capability set and match those capabilities to my specific workflow — does this one’s durable-execution / role-based / primitive model fit what I actually need to build? — rather than to rank the two on adjectives. If I genuinely need a debuggability comparison, I have to source it independently (or test it myself by debugging a real failure in each), not infer it from the phrase “few abstractions.”

Solution ↑ Exercise

“It works well today” is insufficient because a harness encodes assumptions that go stale as models improve — the behaviors the team tuned around (context handling, tool-call eagerness, output formatting) shift with each model release, so a harness that fits the current model can silently misfit the next one. The maintenance cost is therefore continuous and latent: it is real even on the day everything works, because it is a liability that comes due at the next release, not a problem you can see now. At that release they should re-examine exactly the assumptions baked into the harness against the new model’s behavior — re-running their evals, checking whether their workarounds are now unnecessary or now wrong — and budget that re-tuning as a recurring cost, not a surprise. The configure/wrap/extend middle path would have changed their position by moving most of that staleness-prone plumbing onto a vendor-maintained harness (e.g. the Claude Agent SDK), so the general agent loop and context management are re-tuned against new models for them, and the team owns only the thin extension layer their workflow specifically required — shrinking the surface that goes stale to the part that is genuinely theirs.

Part 2 Chapter 14 Last verified 2026-06-13 Fresh

Tool Minimization: Subtract First

The governing default of the volume's capability axis — the smallest tool set that covers the workflow beats a complete one. Why an extra tool is paid twice (definition tokens at rest, selection errors at runtime), the three independent production reports that converge on subtract-first, the two highest-leverage heuristics (consolidate, return high-signal), and the dynamic complement — load tools on demand when scale forces it.

Volatility: architectural-pattern
Tools compared: claude-codecross-tool
Before you start: The spine chapter's two axes (capability as context cost; coordination as isolation). Vol 1's context-as-budget and progressive-disclosure ideas help but are not required.
You will learn
  • Why more tools is not better — and why the cost of an extra tool is paid twice
  • The three independent production reports that converge on subtract-first, and exactly how far that convergence licenses you to generalize
  • The two highest-leverage heuristics once you have cut count: consolidate, and make each tool’s response high-signal
  • The dynamic complement — when you genuinely need many tools, load them on demand instead of all at once

This is the governing default of the volume’s capability axis. The spine framed every tool as a context cost; this chapter turns that into a discipline. The move is subtraction: start from the smallest tool set that covers the workflow and justify every addition, because each tool you add is paid twice — in definition tokens at rest and in selection errors at runtime. The principle is Anthropic’s; the unusually strong corroboration is three production teams who cut their tool sets and reported back.

Subtract-first: the smallest set that covers the workflow

Start from the counter-intuitive claim and let it organize the rest. Anthropic’s tool-design guidance states it flatly: “More tools don’t always lead to better outcomes.” [Official] Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original The failure mode is specific — “too many tools or overlapping tools can also distract agents from pursuing efficient strategies.” [Official] Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original The corrective is to build a few thoughtful tools that cover the workflow rather than wrap every API endpoint you happen to have. [Official] Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original

The reason this is a discipline and not a preference is that an extra tool charges you twice.

Key idea

Every tool is paid twice: once in definition tokens at rest (its schema sits in the window on every turn, spent whether or not it is called) and once in selection errors at runtime (overlapping or near-duplicate tools make the model pick the wrong one or take a longer path). Adding a tool is never free; the default is the minimum set, and each addition has to earn its place.

This is the same logic the volume’s spine drew as the capability axis, now stated as an action. A “complete” tool set — one tool per endpoint, every capability the platform offers — optimizes for coverage of possibilities. A minimal set optimizes for coverage of the workflow, which is the only thing the agent actually has to do. The two come apart fast: most of a complete set’s tools never fire on a given task, but every one of them is in the window distracting selection.

Concept · Subtract-first

The default tool-set discipline: begin with the smallest set that covers the workflow, and treat each additional tool as a cost to be justified — not a feature to be added. Its complement is consolidation (fold overlapping tools into fewer capable ones) and, when scale forbids subtraction, on-demand loading.

The case studies converge — and how far that licenses you

Subtract-first would be a plausible-but-unproven heuristic if it rested only on the design guidance. It does not. Three first-party engineering reports from 2025, on three different production agents, independently cut their tool sets and reported the same direction.

Convergence claude-codecross-tool

Vercel stripped its internal text-to-SQL agent down to “a single tool: execute arbitrary bash commands” and reported a “100% success rate instead of 80%.” [Practitioner] We removed 80% of our agent's tools · Andrew Qu (Vercel) (2025)T3-practitioner original GitHub trimmed Copilot’s default “40 built-in tools down to 13 core ones” and measured a 2–5 percentage-point success-rate gain across SWE-Lancer and SWE-bench Verified (plus a ~400 ms latency drop in A/B testing). [Practitioner] How we're making GitHub Copilot smarter with fewer tools · Anisha Agarwal and Connor Peet (GitHub) (2025)T3-practitioner original Block consolidated “over 30+ APIs and 200+ endpoints, with only 3 MCP tools” behind a layered tool pattern in its Square MCP Server. [Practitioner] Build MCP Tools Like Ogres... With Layers · Richard Moot (Block) (2025)T3-practitioner original Three independent teams, the same finding: fewer, sharper tools made the agent better. [Convergence]

That is genuine convergence of practice — and it is worth being precise about what kind. These are vendor self-reports, each team measuring its own agent on its own tasks, not independent third-party benchmarks. The figures are real (each is quoted from the primary write-up), but the convergence is in direction, not in a transferable effect size.

[Caveat]

Block’s “30+ APIs → 3 tools” is a consolidation count, not a measured quality gain — the report claims improved reliability and maintainability qualitatively, not a success-rate delta like Vercel’s and GitHub’s. Read it as evidence of the consolidation move, not of a number.

The honest reading is the strongest one here: three independent practitioners pointing the same way is better evidence than any single benchmark, and none of these numbers is a law you can quote for your own agent. Subtract-first is well-corroborated as a direction; the size of the win is yours to measure.

Laundering an unanchored aggregate

There are widely-circulated aggregate figures — a “~85% token reduction,” a “79.5% → 88.1%” eval jump — attached to tool-minimization advice. This book deliberately does not cite them: in the underlying research they were escalated to a secondary mention but never anchored to a verifiable primary. Repeating an unanchored number because it points the right way is exactly the failure subtract-first warns about applied to evidence — coverage of impressive claims over coverage of what you can stand behind. If you want such a figure, re-gather its primary; do not inherit it from a chain of citations.

Consolidate, and make the response high-signal

Once count is under control, two heuristics carry most of the remaining leverage — and both are really the same context-management instinct applied to tools.

Consolidate overlapping functionality into fewer capable tools, and namespace them so the model can tell them apart. [Official] Define tools · AnthropicT1-official original Two tools that do almost the same thing do not add a capability; they add a coin-flip the model has to win on every relevant turn. Folding them into one tool with a clear name removes the ambiguity at its source.

Return high-signal information, not raw dumps. A tool’s response is as much a context-management decision as its existence: returning a 5,000-token raw payload when the agent needs three fields spends the window you just saved by cutting tool count. [Official] Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original

The output is a tool-design decision too

It is easy to treat “fewer tools” as the whole game and then squander the win by having each surviving tool flood the context with raw output. A minimal tool set whose tools each return high-signal, scoped responses is the actual target. Subtract on both axes: how many tools, and how much each one says back.

When you can’t subtract, defer

Some agents genuinely need a large capability surface — a platform with hundreds of legitimate operations. Subtract-first has a dynamic complement for exactly this case: keep the active set small by loading tools on demand rather than presenting all of them upfront. The Tool Search Tool does this — tools are discovered and loaded when relevant, with only the few most-used kept non-deferred in the window. [Official] Tool search tool · AnthropicT1-official original A working default is to keep roughly the 3–5 most-used tools resident and defer the rest. [Official] Tool search tool · AnthropicT1-official original

Key idea

Subtract-first is the static move (have fewer tools); on-demand loading is the dynamic one (have fewer tools resident at once). They are the same principle — keep the active context small — applied to a fixed versus a scaling capability surface. This is progressive disclosure, the mechanic Vol 1 applied to procedural knowledge, applied here to tools.

The order matters: subtract first, then defer what genuinely cannot be cut. On-demand loading is not a license to keep a bloated set and hide it behind search — a hundred half-redundant tools still produce wrong-tool selection once they are loaded. Cut to the workflow, consolidate the overlaps, and only then defer the irreducible remainder.

Measure, don’t guess

Subtract-first becomes empirical, not aesthetic, when you close the loop: build a small agentic eval over realistic tasks, read the tool-calling metrics for redundant or never-used tools, and prune by evidence. That evaluate-then-prune loop is what turns “fewer tools” from a slogan into a measured discipline — and it belongs to the volume’s evaluation material, developed in the Operations volume rather than here. This chapter establishes the principle and its heuristics; the measurement loop that decides which tools to cut is the eval discipline’s job.

Three independent production reports, one direction. Vercel (many tools → 1), GitHub Copilot (40 → 13), and Block (30+ APIs → 3 MCP tools) each cut their tool set; Vercel and GitHub measured success-rate gains, Block reports a consolidation count. The convergence is in direction, not in a transferable number.Three rows, one per team. Vercel: a cluster of many tool boxes on the left, an arrow to a single 'bash' box on the right, annotated '80% to 100% success'. GitHub Copilot: a block of 40 tools, an arrow to a block of 13, annotated '+2 to 5 points'. Block: 30-plus API boxes, an arrow to 3 MCP-tool boxes, annotated 'consolidation count, no measured delta'. A shared downward arrow labels all three 'subtract-first'.
Three independent production reports, one direction. Vercel (many tools → 1), GitHub Copilot (40 → 13), and Block (30+ APIs → 3 MCP tools) each cut their tool set; Vercel and GitHub measured success-rate gains, Block reports a consolidation count. The convergence is in direction, not in a transferable number.
Pruning a bloated integrations agent Worked example

An agent wraps a SaaS platform with 22 tools — one per API endpoint — and keeps calling list_records when it should call search_records, then truncating the result by hand. Where is the fix?

Walk subtract-first:

  • Count. Most of the 22 tools never fire on the real workflow (create-ticket, update-status, search). Cut to the handful the workflow uses — that alone removes most wrong-tool surface.
  • Overlap. list_records and search_records overlap; the model flips between them. Consolidate into one find_records(query) and the coin-flip disappears. Define tools · AnthropicT1-official original
  • Response. find_records should return scoped fields, not the raw record dump the agent was hand-truncating. Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original
  • Only then defer. If the platform genuinely needs all 22 for some workflows, keep the 3–5 most-used resident and load the rest on demand. Tool search tool · AnthropicT1-official original

The wrong move is to add a 23rd tool — a “smart record fetcher” — to paper over the selection error. That is addition where subtraction is the fix.

Patterns

Subtract to the workflow. Sketch: start from the smallest tool set that covers the actual workflow, justify every addition. When to use: always — it is the default. Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original Mechanics: list the workflow’s real operations; expose those, not every endpoint. Remember: a tool is paid twice — definition tokens at rest, selection errors at runtime.

Consolidate overlaps. Sketch: fold near-duplicate tools into one capable, namespaced tool. When to use: whenever two tools could plausibly answer the same call. Define tools · AnthropicT1-official original Mechanics: merge list_*/search_*-style pairs; give the survivor a clear name. Remember: overlap is a runtime coin-flip, not an added capability.

High-signal responses. Sketch: return scoped, relevant fields — not raw dumps. When to use: any tool whose natural output is large. Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original Mechanics: project to the fields the agent needs; paginate or summarize the rest. Remember: the response is a context-management decision as much as the tool’s existence.

Defer the irreducible remainder. Sketch: when many tools are genuinely required, load on demand and keep ~3–5 resident. When to use: large, legitimate capability surfaces you cannot cut. Tool search tool · AnthropicT1-official original Mechanics: Tool Search Tool; mark the most-used non-deferred. Remember: defer after subtracting — search does not fix a bloated set, it only hides it.

Quick reference

  • Default: the smallest tool set that covers the workflow beats a “complete” one. Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original
  • Why: every tool is paid twice — definition tokens at rest + selection errors at runtime.
  • Evidence: Vercel (→1 tool, 80→100%), GitHub Copilot (40→13, +2–5 pts), Block (30+ APIs → 3 tools, consolidation count) — three independent vendor self-reports converging on subtract-first; direction, not a transferable number.
  • Heuristics: consolidate overlapping tools (+ namespace); return high-signal responses, not raw dumps.
  • Scale escape hatch: load tools on demand (keep ~3–5 resident) — but only after subtracting.
  • Honesty: unanchored aggregate token/eval figures are deliberately omitted; measure your own win.

Practice

Exercise

An extra tool is “paid twice.” Name the two costs, and say which one is incurred even on turns where the tool is never called. Why does that make a complete tool set strictly worse than a minimal one for a fixed workflow?

Practice ◆◆◆◇

The three case studies all point to subtract-first, yet the chapter insists you cannot quote their numbers as a law for your own agent. Write two or three sentences that state, honestly, what the convergence does license you to conclude and what it does not — the distinction between convergence-of-direction and a transferable effect size. The point is to practice the evidence discipline the chapter models, not just its tooling advice.

Exercise solutions

Solution ↑ Exercise

The two costs are definition tokens at rest (the tool’s schema occupies the context window on every turn, spent regardless of use) and selection errors at runtime (overlapping or near-duplicate tools raise the chance the model picks the wrong one or takes a longer path). The first cost — definition tokens — is incurred even on turns where the tool is never called, because the schema is in the window either way. For a fixed workflow, a complete set therefore pays for capabilities the workflow never exercises while also enlarging the selection surface, so it is strictly worse than a minimal set that covers the same workflow: more cost, more error opportunity, no added coverage of what the agent must actually do.

Solution ↑ Exercise

The convergence licenses a directional conclusion: three independent teams, on three different production agents, each cut their tool set and reported the same outcome — fewer, sharper tools made the agent better (or, for Block, materially simpler to maintain). That is stronger evidence for the direction of subtract-first than any single benchmark, precisely because the reports are independent. It does not license quoting “80→100%” or “+2–5 points” as an effect size you will see: each is a vendor self-report on its own agent and tasks, Block’s is a consolidation count with no measured quality delta, and none is an independent benchmark. The honest stance is “subtract-first is well-corroborated as a direction; the size of the win on my agent is something I have to measure” — which is exactly why the evaluate-then-prune loop exists.

Part 2 Chapter 15 Last verified 2026-06-13 Fresh

MCP: Designing External Capability

How to wire external capability against a least-privilege, capability-negotiated protocol — and design against a known moving target. The host/client/server split and its design-time isolation, the three primitives as three control modes, the OAuth-2.1 authorization posture, and how to build to MCP's stable core while isolating what the announced 2026-07-28 release candidate changes.

Volatility: feature-surface
Tools compared: claude-codecross-tool
Before you start: The volume's capability axis (capability is a context cost; the default is to subtract) and the tool-minimization discipline. Vol 1's environment-as-boundary and least-privilege ideas help but are not required.
You will learn
  • The host / client / server split and why its isolation is a design-time obligation, not a runtime guarantee the protocol enforces
  • The three primitives as three control modes — tools are model-controlled, resources are app-driven, prompts are user-controlled — so choosing one is a decision about who is in control
  • The OAuth-2.1 authorization posture MCP standardizes on (resource indicators, no token passthrough, mandatory PKCE) — the design-time posture, not the threat model
  • How to design against a moving target — build to MCP’s stable core, isolate what the announced 2026-07-28 release candidate changes

The previous chapter set the discipline for the tools you write yourself; this one is about the tools you reach for across a wire. MCP — the Model Context Protocol — is how an agent connects to external data and tools through a standard interface instead of a bespoke integration each time. The thesis: wire external capability against a least-privilege, capability-negotiated protocol, and treat its security properties as obligations you design to — because the spec is explicit that it “cannot enforce these security principles at the protocol level.” And because MCP is mid-transition, design against a known moving target: build to the stable core, isolate what the release candidate changes.

What MCP is, and where its guarantees stop

MCP is an open protocol whose stated job is the integration problem this volume keeps circling: it “is an open protocol that enables seamless integration between LLM applications and external data sources and tools.” [Official] Specification — Model Context Protocol · Model Context Protocol contributors (2025)T1-official original Structurally it is a JSON-RPC client–host–server model: MCP “follows a client-host-server architecture where each host can run multiple client instances,” [Official] Architecture — Model Context Protocol Specification 2025-11-25 · Model Context Protocol contributors (2025)T1-official original with each client bound one-to-one to a server. The host runs the model and holds the conversation; each client speaks to exactly one server.

The protocol is capability-negotiated: clients and servers declare which features they support at initialization, and “Both parties must respect declared capabilities throughout the session.” [Official] Architecture — Model Context Protocol Specification 2025-11-25 · Model Context Protocol contributors (2025)T1-official original Nothing is assumed to be present; everything in play is negotiated up front and honored for the session’s duration. And the architecture carries a least-privilege isolation principle — servers “should not be able to read the whole conversation, nor ‘see into’ other servers.” [Official] Architecture — Model Context Protocol Specification 2025-11-25 · Model Context Protocol contributors (2025)T1-official original The full conversation stays with the host; a server sees only what is routed to it.

Here is the load-bearing honesty, and it shapes the whole chapter. The spec states plainly that “While MCP itself cannot enforce these security principles at the protocol level,” [Official] Specification — Model Context Protocol · Model Context Protocol contributors (2025)T1-official original the responsibility shifts to implementers to build the consent and authorization flows. Capability negotiation, server isolation, and the auth posture in this chapter are design-time obligations, not runtime guarantees.

Key idea

MCP’s security properties — negotiated capabilities, cross-server isolation, least-privilege auth — are things you design to, not things the protocol enforces for you. The spec says so directly: it “cannot enforce these security principles at the protocol level.” Read every “the protocol does X” in this chapter as “the protocol asks the implementer to do X.” An MCP integration is only as isolated as the host you build around it.

[Note]

The runtime threat model — prompt injection through tool results, malicious servers, token theft, the “lethal trifecta” of private data plus untrusted content plus exfiltration — is a distinct subject. This chapter is the design side (what the protocol asks you to build); the attack surface is an operations-and-security concern, developed in a later volume. Pointing at it is deliberate; developing it here is not.

Three primitives, three control modes

A server exposes capability through exactly three primitives, and the design payload is that each encodes a different answer to who is in control.

Tools are model-controlled. The model itself decides when to call them: the language model can “discover and invoke tools automatically based on its contextual understanding and the user’s prompts.” [Official] Tools — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original This is the primitive that puts the agent in the driver’s seat — and so it is the one tool-minimization’s whole discipline applies to.

Resources are application-driven. They are URI-addressable read context whose inclusion the host decides, “with host applications determining how to incorporate context based on their needs.” [Official] Resources — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original The model does not reach for a resource; the application places it. Resources can be parameterized — “Resource templates allow servers to expose parameterized resources using” [Official] Resources — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original URI templates — so a single resource definition covers a family of addresses.

Prompts are user-controlled. They are templates surfaced “with the intention of the user being able to explicitly select them for use.” [Official] Prompts — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original The user picks them deliberately (a slash command, say) — neither the model nor the application invokes them autonomously.

Concept · Primitive = control authority

The choice among tools, resources, and prompts is not primarily a data-shape choice; it is a control-authority choice. Tools hand initiative to the model, resources to the application, prompts to the user. If a capability should fire only when the user asks, a model-controlled tool is the wrong primitive — it hands the model an initiative you meant to keep. Pick the primitive by who you want holding the trigger.

The tool primitive also fixes two interface details worth carrying into your own server design. The error model is deliberately split: a tool-execution error is reported inside the result with isError true, and such errors “contain actionable feedback that language models can use to self-correct and retry with adjusted parameters,” [Official] Tools — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original whereas a malformed or unknown-tool request is a JSON-RPC protocol error the model is far less likely to recover from. Return business-logic failures as isError results, not protocol errors, so the model can read the feedback and retry. And tool annotations (readOnly, destructive, and the like) are advisory only: clients must “consider tool annotations to be untrusted unless they come from trusted servers.” [Official] Tools — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original An annotation is a hint, not a permission — which is the design-time-obligation theme again, applied to a single field.

The authorization posture: OAuth 2.1, by design

MCP standardizes its authorization on OAuth 2.1 — it “implements a selected subset of their features to ensure security and interoperability while maintaining simplicity:” [Official] Authorization — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original drawing on the established OAuth specification family rather than inventing a new scheme. On that baseline the spec sets four verified design requirements a remote MCP server or host builds to. Three sit on the OAuth posture; the architectural isolation principle above is the fourth, and together they are what “least-privilege by design” means in MCP.

The three OAuth requirements:

  • Resource indicators (RFC 8707). Clients MUST “implement Resource Indicators for OAuth 2.0 as defined in” [Official] Authorization — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original RFC 8707, binding a token to the canonical URI of the resource it is meant for. A token minted for server A cannot be replayed against server B, because it carries its intended audience.
  • No token passthrough. A server making upstream requests MUST NOT “pass through the token it received from the MCP client.” [Official] Authorization — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original When it needs upstream access it acts as a separate OAuth client with its own token — the client’s token never travels further than the server it was issued for.
  • Mandatory PKCE. Clients MUST implement PKCE, which “helps prevent authorization code interception and injection attacks by requiring clients to create a secret verifier-challenge pair” [Official] Authorization — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original (with the S256 method when capable), so only the original requestor can exchange an authorization code for a token.
A posture, not a threat model

Audience-bound tokens, no-passthrough, and PKCE compose a coherent least-privilege defense — but they are the posture you implement, not a catalogue of the attacks they blunt. The confused-deputy problem, token reuse, and code interception that motivate these rules belong to the operations-and-security material in a later volume. Conflating “here is what to build” with “here is the threat model” is the most common framing error around MCP. This chapter is the build side; design to these requirements, and study the attacks where the threat model lives.

There are four key principles in the spec’s overview — user consent and control, data privacy, tool safety, and sampling controls — and the same caveat governs all of them: the protocol asks implementers to honor them; it does not enforce them on the wire. [Official] Specification — Model Context Protocol · Model Context Protocol contributors (2025)T1-official original

MCP is mid-transition — and that is the design problem

The current authoritative revision is 2025-11-25, and it is genuinely current. But MCP is at a dated inflection point, and a responsible design accounts for it rather than pretending the spec is static.

Today the protocol is stateful. The lifecycle opens with an initialize handshake that MUST “be the first interaction between client and server,” [Official] Lifecycle — Model Context Protocol Specification 2025-11-25 · Model Context Protocol contributors (2025)T1-official original inside which client and server negotiate a protocol version. The transport layer defines two standards — stdio, which clients SHOULD “support stdio whenever possible,” [Official] Transports — Model Context Protocol Specification 2025-11-25 · Model Context Protocol contributors (2025)T1-official original and Streamable HTTP, where the server MAY “assign a session ID at initialization time” [Official] Transports — Model Context Protocol Specification 2025-11-25 · Model Context Protocol contributors (2025)T1-official original via an Mcp-Session-Id header the client then echoes on every request. That session ID is the protocol-level session.

The announced 2026-07-28 release candidate prunes much of this toward a stateless core — applying the volume’s subtract-first instinct to the protocol itself. The following are announced for 2026-07-28, not current; recheck each after that date:

  • The initialize/initialized handshake is removed (SEP-2575). [Official] The 2026-07-28 MCP Specification Release Candidate · David Soria Parra and Den Delimarsky (Lead Maintainers)T1-official original
  • The Mcp-Session-Id header and the protocol-level session are removed (SEP-2567). [Official] The 2026-07-28 MCP Specification Release Candidate · David Soria Parra and Den Delimarsky (Lead Maintainers)T1-official original
  • A formal feature-lifecycle policy deprecates Roots, Sampling, and Logging with “at least twelve months between deprecation and the earliest possible removal.” [Official] The 2026-07-28 MCP Specification Release Candidate · David Soria Parra and Den Delimarsky (Lead Maintainers)T1-official original
  • Tasks becomes a server-directed extension and tasks/list “is removed because it can’t be scoped safely without sessions.” [Official] The 2026-07-28 MCP Specification Release Candidate · David Soria Parra and Den Delimarsky (Lead Maintainers)T1-official original (Tasks exists today as an in-core experimental feature, added in the current revision “to enable tracking durable requests with polling and deferred result retrieval.” [Official] Key Changes (Changelog) — MCP Specification 2025-11-25 · Model Context Protocol maintainers (2025)T1-official original )
  • A new MCP Apps extension “lets servers ship interactive HTML interfaces that hosts render in a sandboxed iframe” (SEP-1865). [Official] The 2026-07-28 MCP Specification Release Candidate · David Soria Parra and Den Delimarsky (Lead Maintainers)T1-official original
  • An Extensions framework where “extensions are identified by reverse-DNS IDs, negotiated through an extensions map on client and server capabilities, live in their own ext-* repositories with delegated maintainers, and version independently of the specification” (SEP-2133). [Official] The 2026-07-28 MCP Specification Release Candidate · David Soria Parra and Den Delimarsky (Lead Maintainers)T1-official original
[Caveat]

Every release-candidate item above is announced, not shipped — the RC was locked in mid-2026 and the final publishes 2026-07-28. Treat these as the direction of travel, and re-verify the mechanisms (stateless core, the Tasks extension, MCP Apps) against the final spec after that date. This chapter’s feature-surface volatility tag exists for exactly this section.

Governance moved out of Anthropic — which is why dual-layer reading is the honest default

The reason “current plus announced-coming” is the right way to hold MCP, rather than a quirk, is that the protocol no longer has a single vendor steering it. In December 2025, “Anthropic is donating the Model Context Protocol to the Linux Foundation,” [Official] Donating the Model Context Protocol and establishing the Agentic AI Foundation · Anthropic (2025)T1-official original establishing the Agentic AI Foundation. The donation came with a continuity assurance: the governance model “will remain unchanged: the project’s maintainers will continue to prioritize community input and transparent decision-making.” [Official] Donating the Model Context Protocol and establishing the Agentic AI Foundation · Anthropic (2025)T1-official original

Development now runs through an open process. The current revision moved to “Formalize Working Groups and Interest Groups in MCP governance” (SEP-1302), [Official] Key Changes (Changelog) — MCP Specification 2025-11-25 · Model Context Protocol maintainers (2025)T1-official original and the 2026 roadmap confirms that “Working and Interest Groups are now the primary vehicle for protocol development.” [Official] The 2026 MCP Roadmap · David Soria Parra (Lead Maintainer) (2026)T1-official original The maintainers are candid about what that means for predictability: “A release-oriented roadmap implies a level of predictability that open-standards work rarely has.” [Official] The 2026 MCP Roadmap · David Soria Parra (Lead Maintainer) (2026)T1-official original

Key idea

Because MCP evolves through a public SEP process under open-standards governance, the changes are trackable, not surprising — but they are also not on a fixed schedule a vendor controls. The honest way to represent the spec to yourself and to a codebase is the dual layer: what is authoritative now and what is announced coming, with a recheck date. Designing against the spec means designing against that pair, not against a frozen snapshot.

Designing against a moving target

Put the two halves together — a stable conceptual core, a churning transport-and-feature surface — and the design rule writes itself: build to the stable core, isolate what the RC changes behind adapters.

The durable parts are the conceptual ones: the host/client/server architecture Architecture — Model Context Protocol Specification 2025-11-25 · Model Context Protocol contributors (2025)T1-official original and the three-primitive control model Tools — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original are not what the RC touches. Design your integration’s shape against those and it survives the transition. The parts in motion are the mechanical ones — the transport and lifecycle (the handshake, the session ID) Transports — Model Context Protocol Specification 2025-11-25 · Model Context Protocol contributors (2025)T1-official original Lifecycle — Model Context Protocol Specification 2025-11-25 · Model Context Protocol contributors (2025)T1-official original heading toward a stateless core, and Tasks Key Changes (Changelog) — MCP Specification 2025-11-25 · Model Context Protocol maintainers (2025)T1-official original heading out of core into an extension. Wrap exactly those behind a thin adapter so the migration, when it lands, is contained to one seam instead of smeared across the codebase.

The current-plus-coming convention is the design advice

The discipline this chapter models — separate what is from what’s announced coming, and date the recheck — is not a presentational tic. It is the actual design move: a stateful assumption baked diffusely through your code is a migration you will pay for at the RC; the same assumption isolated behind an adapter is a one-file change. Designing against a known moving target means making the moving parts findable. Build to the core; quarantine the churn.

MCP's host/client/server split with capability negotiation and the three primitives. The host runs the model and holds the full conversation; each client is bound one-to-one to a server, declaring and respecting negotiated capabilities for the session. Servers do not see into one another. Each server exposes capability through three primitives that encode three control modes: tools (model-controlled), resources (application-driven), prompts (user-controlled). The isolation shown is a design-time obligation the spec asks implementers to honor, not a property the protocol enforces.On the left, a gray 'Host (runs the model)' box containing client A and client B, annotated 'full conversation stays here'. Two double-headed arrows labelled 'capability negotiation' link client A to Server A and client B to Server B (teal boxes) one-to-one, with a dashed line between the servers annotated 'servers do not see into each other'. On the right, three orange primitive boxes branch off a server: tools (model-controlled), resources (app-driven), prompts (user-controlled), annotated 'three primitives = three control modes'.
MCP's host/client/server split with capability negotiation and the three primitives. The host runs the model and holds the full conversation; each client is bound one-to-one to a server, declaring and respecting negotiated capabilities for the session. Servers do not see into one another. Each server exposes capability through three primitives that encode three control modes: tools (model-controlled), resources (application-driven), prompts (user-controlled). The isolation shown is a design-time obligation the spec asks implementers to honor, not a property the protocol enforces.
Wiring an internal service over MCP Worked example

A team wants to expose an internal incident-management service to its agent over MCP: list incidents, open one, post an update, and a “run the weekly incident report” workflow the on-call engineer triggers by hand. How should this be designed?

Walk the chapter:

  • Primitives by control authority. “Post an update” and “open an incident” are actions the agent should take in context — tools (model-controlled). Tools — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original The incident records the agent should read are URI-addressable read context the application decides to surface — resources, parameterized with a template like incident://{id}. Resources — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original The weekly report is something the engineer invokes deliberately — a prompt (user-controlled), not a tool the model can fire on its own. Prompts — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original
  • Auth posture. The server authorizes over OAuth 2.1 with audience-bound tokens (RFC 8707) so a token minted for it cannot be replayed elsewhere; when it calls the upstream incident API it uses its own token, never passing through the client’s; PKCE is mandatory. Authorization — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original
  • Error model. A “post update” that fails business validation returns an isError result with the reason, so the agent can self-correct — not a JSON-RPC protocol error. Tools — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original
  • Isolation is on you. The host keeps the full conversation and routes only the relevant slice to this server; that boundary is a design obligation, because the protocol does not enforce it. Specification — Model Context Protocol · Model Context Protocol contributors (2025)T1-official original
  • Design for the move. Put the transport/lifecycle wiring behind an adapter so the announced stateless-core change The 2026-07-28 MCP Specification Release Candidate · David Soria Parra and Den Delimarsky (Lead Maintainers)T1-official original is a one-seam migration.

The wrong design makes everything a model-controlled tool — including the report — handing the model initiative the engineer meant to keep, and forwards the client’s token upstream, breaking the no-passthrough rule.

Patterns

Pick the primitive by control authority. Sketch: map each capability to tool / resource / prompt by who should trigger it. When to use: designing any server’s surface. Tools — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original Mechanics: model-initiated → tool; app-placed read context → resource (template it); user-selected → prompt. Remember: a model-controlled tool hands the model initiative — if the user should decide, that is the wrong primitive.

Design to least-privilege auth. Sketch: OAuth 2.1 with audience-bound tokens, no passthrough, mandatory PKCE. When to use: any remote MCP server. Authorization — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original Mechanics: RFC 8707 resource indicators bind the token; the server uses its own upstream token; PKCE (S256) on every client. Remember: it is the posture you implement, not the threat model — and the protocol does not enforce it for you.

Return self-correctable errors. Sketch: report business-logic failures inside the result, not as protocol errors. When to use: any tool that can fail on bad input. Tools — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original Mechanics: set isError true and include actionable feedback; reserve JSON-RPC errors for malformed/unknown requests. Remember: an isError result is something the model can read and retry; a protocol error usually is not.

Isolate by design, host-side. Sketch: keep the full conversation in the host; route only the relevant slice to each server. When to use: always — especially with multiple servers. Architecture — Model Context Protocol Specification 2025-11-25 · Model Context Protocol contributors (2025)T1-official original Mechanics: one client per server, capabilities negotiated and respected; no cross-server visibility. Remember: the spec “cannot enforce these security principles at the protocol level” — isolation is your obligation.

Build to the core, wrap the churn. Sketch: design the shape against the stable architecture/primitives; adapter-wrap transport, lifecycle, and Tasks. When to use: any integration meant to outlive the 2026-07-28 RC. The 2026-07-28 MCP Specification Release Candidate · David Soria Parra and Den Delimarsky (Lead Maintainers)T1-official original Mechanics: one seam for the stateful-vs-stateless transition; recheck the RC mechanisms after 2026-07-28. Remember: a diffuse stateful assumption is a migration; an isolated one is a one-file change.

Quick reference

  • What MCP is: an open, JSON-RPC, capability-negotiated client–host–server protocol for connecting agents to external data and tools. Specification — Model Context Protocol · Model Context Protocol contributors (2025)T1-official original Architecture — Model Context Protocol Specification 2025-11-25 · Model Context Protocol contributors (2025)T1-official original
  • The honesty: the spec “cannot enforce these security principles at the protocol level” — isolation, negotiation, and auth are design-time obligations, not runtime guarantees. Specification — Model Context Protocol · Model Context Protocol contributors (2025)T1-official original
  • Three primitives = three control modes: tools (model-controlled), resources (app-driven), prompts (user-controlled) — choose by who holds the trigger. Tools — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original Resources — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original Prompts — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original
  • Tool interface: isError results carry self-correctable feedback (distinct from protocol errors); annotations are untrusted unless the server is. Tools — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original
  • Auth posture (four principles): OAuth 2.1 + RFC 8707 resource indicators + no token passthrough + mandatory PKCE — the posture, not the threat model. Authorization — Model Context Protocol Specification (revision 2025-11-25) · Model Context Protocol contributors (2025)T1-official original
  • Mid-transition: today stateful (initialize handshake, Mcp-Session-Id); the 2026-07-28 RC announces a stateless core, deprecation policy, Tasks-as-extension, MCP Apps, and an Extensions framework — announced, recheck after 2026-07-28. The 2026-07-28 MCP Specification Release Candidate · David Soria Parra and Den Delimarsky (Lead Maintainers)T1-official original
  • Governance: donated to the Linux Foundation (Agentic AI Foundation, Dec 2025); developed through Working Groups — trackable, not vendor-scheduled. Donating the Model Context Protocol and establishing the Agentic AI Foundation · Anthropic (2025)T1-official original The 2026 MCP Roadmap · David Soria Parra (Lead Maintainer) (2026)T1-official original
  • Design rule: build to the stable core; isolate what the RC changes behind an adapter.

Practice

Exercise

The chapter insists MCP’s isolation, capability negotiation, and auth posture are “design-time obligations, not runtime guarantees.” Quote the spec line that justifies that framing, and explain in two sentences what changes about how you build an MCP integration once you accept it — versus treating the protocol as if it enforced isolation for you.

Practice ◆◆◆◇

You are designing an MCP server for a code-review service with three capabilities: (a) fetch the diff for a pull request, (b) post an inline review comment, (c) a “summarize this PR for the release notes” action a human editor runs deliberately. Assign each to a primitive (tool / resource / prompt) and justify the assignment by control authority, not by data shape. Then name the one OAuth requirement that most directly stops a token issued to this server from being replayed against a different internal service, and say why.

Practice ◆◆◆◆

The 2026-07-28 release candidate removes the initialize handshake and the Mcp-Session-Id session. Suppose your codebase already calls a live MCP server in production. Describe, in three or four sentences, how you would structure the integration today so that the announced stateless-core change is a contained migration rather than a sprawling one — and state explicitly which parts of MCP you would treat as stable and design directly against, versus which parts you would wrap. Tie your answer to why the dual-layer (current + announced-coming) reading is the honest default given MCP’s governance.

Exercise solutions

Solution ↑ Exercise

The justifying line is the spec’s own: “While MCP itself cannot enforce these security principles at the protocol level,” — the protocol acknowledges it cannot guarantee the consent, isolation, and authorization properties on the wire, and shifts responsibility to the implementer. Once you accept that, you stop assuming the protocol isolates servers or scopes tokens for you and start building those properties yourself: the host must actually keep the full conversation and route only the relevant slice to each server, the server must actually implement audience-bound tokens and no-passthrough, and you treat any “the protocol does X” claim as “I must build X.” Treating the protocol as self-enforcing is the failure mode — it leaves the isolation and auth you assumed were present simply unbuilt.

Solution ↑ Exercise

(a) Fetch the diff → resource. It is URI-addressable read context the application decides to surface (e.g. a template like pr://{id}/diff); the model does not “act,” it reads context the host places. (b) Post an inline review comment → tool. It is an action the agent should take in context based on its understanding of the diff — model-controlled is exactly right. (c) Summarize for the release notes → prompt. The human editor invokes it deliberately; making it a model-controlled tool would hand the model an initiative meant to stay with the user. The assignment is by who holds the trigger (application / model / user), not by whether the payload is text or structured. The OAuth requirement that most directly stops token replay is RFC 8707 resource indicators: they bind the token to the canonical URI of this server as its intended audience, so a token minted here carries an audience that a different internal service will reject — replay fails because the token says who it is for. (No-passthrough is the complementary rule, but it governs the server’s upstream calls rather than replay of the token issued to this server.)

Solution ↑ Exercise

Design the integration’s shape against MCP’s stable conceptual core — the host/client/server architecture and the three-primitive control model — because those are not what the RC touches; an integration whose structure rests on them survives the transition unchanged. Wrap the mechanical, in-motion parts behind a single thin adapter: the transport and lifecycle (the initialize handshake and Mcp-Session-Id session) and Tasks, which are exactly what the stateless core and the Tasks-extension graduation change. With that seam in place, the announced stateless-core change is a one-file swap behind the adapter rather than a stateful assumption you have to chase through the whole codebase. The dual-layer reading is the honest default precisely because MCP is now an open-standards protocol developed through Working Groups rather than a vendor-scheduled product — the maintainers themselves note “a release-oriented roadmap implies a level of predictability that open-standards work rarely has,” so the responsible stance is to hold what is authoritative now and what is announced coming with a recheck date together, and to make the moving parts findable rather than assume a frozen spec.

Part 2 Chapter 16 Last verified 2026-06-13 Fresh

Shaping Input — The Prompting Craft

The craft that shapes what goes into the agent — five moves in the source's own order (be clear, show examples, elicit reasoning, structure with XML and roles, chain). The lead mental model is the brilliant-but-new employee; examples are the most reliable lever; two techniques changed under newer models (manual chain-of-thought is now a fallback, prefill on the last assistant turn is deprecated); and chaining is single-thread decomposition, not orchestration.

Volatility: feature-surface
Tools compared: claude-codecross-tool
Before you start: The spine chapter's capability axis (a tool is a slice of the context budget). Vol 1's context-as-budget framing helps but is not required. This chapter is about the agent's text *input*; forcing its machine-readable *output* is the next chapter's subject.
You will learn
  • The five-move prompting craft in the source’s own order — be clear, show examples, elicit reasoning, structure, chain — and the one mental model that makes the rest self-deriving
  • Why examples are the most reliable single lever, and the one anchored number the craft actually quotes (3–5)
  • What changed under newer models — manual chain-of-thought is now a fallback, and prefilling the last assistant turn is deprecated
  • The boundary that keeps the next half of the volume clean: chaining is single-thread decomposition, not orchestration

The spine framed every tool as a slice of the context budget. This chapter turns to the other thing you put in the window: the prompt itself — the text that shapes what the agent reasons over. Anthropic documents the craft as five moves, and it documents them in an order that is itself a teaching device: clarity first, then examples, then reasoning, then structure, then chaining. This is single-vendor, authoritative-by-construction guidance — Anthropic’s own prompt-engineering docs — so the honest tier is official, not convergence. Two of the five moves have shifted under newer models, and the chapter renders them as the current state, not the technique they once were.

The order is the lesson: five moves on one surface

Anthropic’s prompt-engineering hub lists the techniques as one ordered set — they run “from clarity and examples to XML structuring, role prompting, thinking, and prompt chaining.” [Official] Prompt engineering overview · AnthropicT1-official original That ordering is not a table of contents; it is a gradient of effort. You reach for the cheap, high-leverage move first (say what you want clearly), and you escalate to the expensive, structural moves (chain several prompts) only when the cheaper ones do not carry the task.

Key idea

The craft is five moves in a deliberate order — be clear → show examples → elicit reasoning → structure with XML and roles → chain. Read it as an escalation ladder, not a checklist: each rung is more work and more structure than the last, so you climb only as far as the task forces you. Most prompts never leave the first two rungs.

This matters for an agent system specifically. The prompt is context, and context is the budget the whole volume is about. A prompt that achieves its effect with clarity and three examples spends far less of the window than one that leans on elaborate structure and a multi-step chain — and it leaves more room for the tools, the conversation history, and the work itself.

Clarity is the foundation — brief the brilliant new employee

The single highest-leverage idea in the craft is to be explicit, because the model does not share your context. Anthropic’s mental model is the one to lead with: “Think of Claude as a brilliant but new employee who lacks context on your norms and workflows.” [Official] Prompting best practices · AnthropicT1-official original The metaphor does real work — almost every other clarity technique is a corollary of “brief the new hire properly.”

Two such corollaries are documented directly. Sequence the instruction when order matters: “Provide instructions as sequential steps using numbered lists or bullet points when the order or completeness of steps matters.” [Official] Prompting best practices · AnthropicT1-official original And supply the why, not just the what — “providing context or motivation behind your instructions, such as explaining to Claude why such behavior is important” [Official] Prompting best practices · AnthropicT1-official original lets the model generalize the instruction to cases you did not enumerate.

Concept · The brilliant-but-new-employee model

The organizing metaphor for clarity: the model is capable but uninformed about your norms, so under-specification reads as a gap to be filled with a plausible default — rarely the one you wanted. The fix is the same as onboarding a strong hire: be explicit about the goal, give the steps in order when order matters, and explain the motivation so they can extend it correctly. State the metaphor first and most of the clarity advice derives itself.

Examples are the most reliable lever — and they come with a dosage

Among the five moves, examples carry the strongest reliability claim in the source. Anthropic states that examples are “one of the most reliable ways to steer Claude’s output format, tone, and structure.” [Official] Prompting best practices · AnthropicT1-official original For an agent, that is the cheapest way to pin down a shape you care about — a response format, a tone, a decision boundary — without writing a paragraph of rules the model then has to interpret.

Unusually for prompting guidance, this move comes with a concrete dosage and a selection rule. The dosage is the one verbatim number anchored anywhere in the underlying research: “Include 3–5 examples for best results.” [Official] Prompting best practices · AnthropicT1-official original The selection rule is relevance — make examples that “mirror your actual use case closely” [Official] Prompting best practices · AnthropicT1-official original (the same guidance adds diverse, covering edge cases, and structured, wrapped in tags). Anthropic’s own interactive tutorial independently treats examples as a named foundational technique, dedicating its “Using Examples” chapter to it — corroboration that this is core craft, though as a teaching artifact rather than the normative guidance. [Official] Anthropic's Prompt Engineering Interactive Tutorial · AnthropicT2-release-notes original

[Tip]

The “3–5 examples” figure is the only number this craft actually quotes — and it is a dosage guideline, not a measured effect size. There is no source figure for “how much accuracy examples buy”; the docs make a reliability claim, not a quantified one.

A few good examples beat a long rulebook

When you want a specific output shape, the instinct is to describe it in prose — every field, every edge case. Examples short-circuit that: three to five that mirror the real use case demonstrate the shape instead of describing it, and demonstration is both shorter in the window and less ambiguous than a rulebook the model must interpret. This is the same context-frugality the volume’s spine applies to tools, applied to the prompt.

What changed: reasoning is now elicited by the model, not the prompt

The third move — eliciting step-by-step reasoning — is the first of two that have shifted under newer models, and the shift is a role-reversal. Manual chain-of-thought (telling the model to “think step by step”) used to be the default reasoning lever. It is now documented as a fallback: “When thinking is off, you can still encourage step-by-step reasoning by asking Claude to think through the problem.” [Official] Prompting best practices · AnthropicT1-official original The conditional — when thinking is off — is the whole point. Adaptive thinking now handles most multi-step reasoning internally as a model feature, so the prompting technique survives mainly for the case where that capability is unavailable.

[Caveat]

This is official-but-volatile, current as of the Claude Opus 4.7-era docs — recheck it per model release. As thinking becomes more default-on, the “when thinking is off” condition narrows, and the manual technique may shrink further. Keep the prompting technique (ask the model to reason) cleanly separate from the model feature (the extended-thinking budget); the latter is an API-surface deep-dive, not this chapter’s subject.

The practical instruction for an agent builder: do not reach reflexively for “think step by step” in the system prompt. If thinking is available, it is doing that work already, and the manual instruction is redundant context. Reserve the prompting technique for the case it is now documented for.

Structure and roles — durable craft, with one deprecation inside it

The fourth move is the most stable part of the craft — except for one technique that has been retired outright.

Two of the three structuring techniques are documented-once, durable craft. XML tags “help Claude parse complex prompts unambiguously,” [Official] Prompting best practices: use XML tags · AnthropicT1-official original which matters most when a prompt mixes instructions, context, examples, and variable inputs — wrap each content type in its own tag so the model never has to guess where one ends and the next begins. And a role assignment in the system prompt is a one-line steering tool: “setting a role in the system prompt focuses Claude’s behavior and tone for your use case,” [Official] Prompting best practices · AnthropicT1-official original where even a single sentence makes a difference.

The third technique — prefilling the last assistant turn to steer format or skip a preamble — has been deprecated, not refined. On Claude 4.6+ models, “prefilled responses on the last assistant turn are no longer supported,” [Official] Prompting best practices · AnthropicT1-official original and a request that includes a prefilled assistant message now returns a 400 error. This is a former best-practice that became an error, and it must be rendered as the deprecation it is. The documented migration is to structured outputs for format control and to direct system-prompt instructions for skipping preambles.

Teaching prefill as a current technique

Older prompting material recommends prefilling the assistant turn — opening the model’s response yourself to force a JSON object or strip a preamble. On current models this is not a weaker technique; it is a 400. If you find a guide that teaches “start the assistant message with { to force JSON,” treat it as stale: the format-control job now belongs to structured outputs (the next chapter), and the skip-the-preamble job belongs to a plain system-prompt instruction. Migrate it; do not carry it forward.

[Caveat]

Both the deprecation and the model gate are official-but-volatile — current as of the Claude Opus 4.7-era docs; recheck per model release. The deprecation is scoped to the last assistant turn on 4.6+; assistant messages elsewhere in a conversation are unaffected.

Chaining is the escape hatch — and it is not orchestration

The fifth and most expensive move is to stop trying to do the job in one prompt and decompose it. Prompt chaining “decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one,” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original and the workflow “is ideal for situations where the task can be easily and cleanly decomposed into fixed subtasks.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original Explicit chaining stays worthwhile “when you need to inspect intermediate outputs or enforce a specific pipeline structure,” [Official] Prompting best practices · AnthropicT1-official original and its canonical shape is self-correction — generate a draft, have the model review it against criteria, have it refine based on the review. [Official] Prompting best practices · AnthropicT1-official original

Here is the boundary the rest of the volume depends on. Chaining is single-thread prompt decomposition: one conversation, a fixed sequence of calls, each feeding the next. It is not multi-agent orchestration — there is no second isolated window, no delegation, no runtime-chosen control flow. Reaching for sub-agents when a sequential pipeline would do is exactly the additive reflex the spine warned against, applied to coordination.

Key idea

Chaining ≠ orchestration. A prompt chain is a predefined sequence of calls on one thread — the craft’s escape hatch for a task too big for a single prompt. An orchestration is many isolated windows coordinating. If the task decomposes into fixed steps you can lay out in advance, chain it; you do not need another agent. The coordination half of this volume is where the other case — genuinely independent or quarantined work — gets its primitive.

Where this connects

Two threads run out of this chapter. The first is the prefill deprecation: its migration target is structured outputs, the subject of the next chapter, which takes up forcing reliable machine-readable output where prefill used to. The second is the chaining boundary: the moment a task stops being a fixed sequence on one thread and becomes independent or quarantined work across windows, you have crossed from the prompting craft into coordination — the later, orchestration half of this volume, where the sub-agent and multi-agent patterns live. This chapter shapes the input to a single agent thread; those are the two places its edges hand off.

Patterns

Brief the new hire. Sketch: state the goal explicitly, give steps in order when order matters, and explain the motivation. When to use: first — always, before any heavier move. Prompting best practices · AnthropicT1-official original Mechanics: numbered/bulleted steps for ordered work; one sentence of why behind each non-obvious instruction. Remember: under-specification gets filled with a plausible default, rarely the one you wanted.

Show, don’t describe. Sketch: demonstrate the output shape with 3–5 examples instead of a prose rulebook. When to use: whenever you care about a specific format, tone, or decision boundary. Prompting best practices · AnthropicT1-official original Mechanics: mirror the real use case closely; cover edge cases; wrap each example in a tag. Remember: examples are the most reliable steering lever, and 3–5 is the documented dosage.

Structure the messy prompt. Sketch: tag distinct content types and assign a role. When to use: when a prompt mixes instructions, context, examples, and variable input. Prompting best practices: use XML tags · AnthropicT1-official original Mechanics: <instructions>/<context>/<input> tags; a one-line role in the system prompt. Prompting best practices · AnthropicT1-official original Remember: tags remove parsing ambiguity; a role focuses behavior and tone — durable craft, unlike prefill.

Chain only fixed sequences. Sketch: decompose a too-big task into a predefined sequence of calls, each feeding the next. When to use: the task cleanly decomposes into fixed subtasks, or you must inspect intermediate output. Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original Mechanics: the self-correction shape — generate, review against criteria, refine — as separate calls. Prompting best practices · AnthropicT1-official original Remember: this is single-thread; if the work is independent or needs isolation, that is orchestration, not chaining.

Quick reference

  • The order is the ladder: clarity → examples → reasoning → structure/roles → chaining; climb only as far as the task forces you. Prompt engineering overview · AnthropicT1-official original
  • Lead with clarity: the brilliant-but-new-employee model — be explicit, sequence when order matters, give the why. Prompting best practices · AnthropicT1-official original
  • Examples are the most reliable lever: mirror the use case; documented dosage is 3–5 (the one anchored number). Prompting best practices · AnthropicT1-official original
  • Changed — CoT: manual “think step by step” is now a fallback for when thinking is off; adaptive thinking does it by default (volatile; recheck per model release). Prompting best practices · AnthropicT1-official original
  • Changed — prefill: prefilling the last assistant turn is deprecated on 4.6+ (returns a 400); migrate format control to structured outputs (volatile; recheck per model release). Prompting best practices · AnthropicT1-official original
  • Durable structure: XML tags remove parsing ambiguity; Prompting best practices: use XML tags · AnthropicT1-official original a role in the system prompt focuses behavior. Prompting best practices · AnthropicT1-official original
  • Boundary: chaining is single-thread decomposition, not orchestration. Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original
  • Tier honesty: this is single-vendor official guidance (Anthropic’s own docs + one corroborating tutorial), authoritative by construction — not triangulated across independent practitioners, so no convergence claim.

Practice

Exercise

The chapter calls the five-move order an “escalation ladder” rather than a checklist. Name the five moves in order, and explain in one sentence why reading them as a ladder — climb only as far as the task forces you — is the right framing for an agent system specifically. (Hint: what is a prompt, in the currency the volume cares about?)

Practice ◆◆◆◇

Two of the five moves “changed” under newer models: manual chain-of-thought and prefill. For each, state (a) what the technique used to be for, (b) what its current documented status is, and (c) what you should do instead today. Then say, in one sentence, why this chapter tags both claims volatile and what concrete action that tag obliges before you rely on either. The point is to practice rendering a deprecated technique as the deprecation it is, not the technique it was.

Practice ◆◆◇◇

A teammate says: “Our agent’s report-generation step is unreliable, so let’s split it into a multi-agent system.” The step is actually a fixed pipeline — gather data, draft the report, check it against a rubric, revise. Locate the cheaper fix on the prompting ladder, name the specific pattern, and explain why this is a chaining problem and not an orchestration one. The point is to feel the chaining-≠-orchestration boundary the chapter draws.

Exercise solutions

Solution ↑ Exercise

The five moves, in order, are: be clear and direct, use examples (multishot), elicit reasoning (chain-of-thought), structure with XML tags and roles, and chain complex prompts. Reading them as a ladder is right for an agent system because a prompt is context — the same finite window currency the volume is built around — and each rung up the ladder spends more of that window than the last (a multi-step chain costs far more than a clear instruction with three examples). So you climb only as far as the task forces you: solving it on the bottom two rungs leaves the most window for tools, history, and the actual work, which is the whole capability-axis discipline applied to the prompt rather than to the tool set.

Solution ↑ Exercise

Manual chain-of-thought. (a) It used to be the default reasoning lever — telling the model to “think step by step” before answering to improve multi-step reasoning. (b) It is now documented as a fallback: “when thinking is off, you can still encourage step-by-step reasoning.” (c) Today, rely on adaptive/extended thinking, which handles most multi-step reasoning internally as a model feature; reserve the manual prompt for the case where thinking is unavailable, and do not add a redundant “think step by step” when it is on.

Prefill. (a) It used to steer output format or skip a preamble by opening the assistant’s response for it. (b) It is now deprecated on Claude 4.6+ — prefilled responses on the last assistant turn are no longer supported, and such a request returns a 400 error. (c) Migrate format control to structured outputs and skip-the-preamble to a direct system-prompt instruction.

Both claims are tagged volatile because they are model-version-dependent statements about a moving feature surface (current as of the Claude Opus 4.7-era docs), not durable principles — the “when thinking is off” condition narrows as thinking becomes default-on, and the prefill gate is tied to specific model versions. The tag obliges a concrete action: recheck the source per model release before relying on either, rather than treating the chapter’s snapshot as permanent.

Solution ↑ Exercise

The cheaper fix is the fifth rung of the prompting ladder — prompt chaining, specifically the self-correction pattern (generate a draft, review it against criteria/rubric, refine based on the review), run as a fixed sequence of calls so you can inspect each intermediate output. This is a chaining problem and not an orchestration one because the work is a predefined, single-thread sequence: gather → draft → check → revise, each step feeding the next on one conversation, with no independent or quarantined work that needs its own isolated window. Splitting it into multiple agents adds coordination cost and extra windows to paper over what is really an unstructured single prompt — the additive reflex applied to the wrong axis. Chaining ≠ orchestration: if the task decomposes into fixed steps you can lay out in advance, chain it; reach for separate agents only when the work is genuinely independent or needs context isolation, which the coordination half of the volume takes up.

Part 2 Chapter 17 Last verified 2026-06-13 Fresh

Shaping Output — Structured & Reliable

The output half of shaping I/O — four levers that force reliable machine-readable output, ordered strongest-guarantee to lightest. tool_choice forces the call while strict guarantees the args; structured outputs add a grammar-backed guarantee that holds except for refusals and max_tokens cutoffs and only over the supported schema subset; prevent beats recover, so the retry loop is the fallback, not the primary path.

Volatility: feature-surface
Tools compared: claude-codecross-tool
Before you start: The prompting-craft chapter just before this one (it shapes what goes in; this chapter shapes what comes out) and the spine's capability axis — output shape is a capability-axis decision. Tool use as a mechanism (the MCP and tool-minimization chapters) helps but is not required.
You will learn
  • The four output levers as one continuous surface — strongest guarantee down to lightest weight — and that choosing one is a guarantee-vs-flexibility decision
  • Why tool_choice and strict are two different levers: one forces the call, the other guarantees the arguments
  • The structured-outputs guarantee and its documented limits — so you never write “always valid JSON”
  • Why prevent beats recover, which makes the validation/retry loop the fallback rather than the primary path

The previous chapter shaped what goes into the model — the prompting craft. This one shapes what comes out: how you force output a program can parse instead of prose a human must read. It picks up a loose end from there — prefilling the assistant turn, a classic JSON-extraction trick, is being deprecated on the newest models, and the question “what do I reach for instead?” is exactly this chapter’s subject. The answer is four levers, arranged on one surface from the strongest guarantee to the lightest touch. The mechanics are Anthropic’s own API documentation — authoritative by construction, but a moving target: structured outputs is mid-GA-transition and the prefill gate grows with each model release, so this is a chapter to re-check per release, not to memorize.

One problem, four levers

Forcing reliable machine-readable output looks like several unrelated tricks — tool calls, a strict flag, retry loops, prefilling a brace — but they are one problem with four levers, and the levers form a single continuous surface from strongest guarantee to lightest weight.

  • Tool use turns a tool definition into an output mechanism: a tool’s input_schema is “[a JSON Schema] object defining the expected parameters for the tool,” [Official] Define tools · AnthropicT1-official original so you define one tool whose schema is the shape you want and parse the call it returns. [Official] Tool use with Claude · AnthropicT1-official original
  • Structured outputs / strict: true add a grammar-backed guarantee on top — schema-compliant output by construction. [Official] Structured outputs · AnthropicT1-official original
  • The validation + retry loop is the recover path for when the guarantee is unavailable or insufficient — define a schema, “and the SDK validates the output against it, re-prompting on mismatch.” [Official] Get structured output from agents · AnthropicT1-official original
  • The prompt-craft recipes — specify the format, prefill, stop sequences — are the lightest-weight, guarantee-free option. [Official] Increase output consistency · AnthropicT1-official original
Key idea

Choosing a lever is a guarantee-versus-flexibility decision. Reach for structured outputs / strict when you need a hard schema guarantee; reach for the prompt-craft recipes when you need flexibility beyond a strict schema. The docs draw the boundary explicitly: “Structured outputs provide guaranteed schema compliance and are specifically designed for this use case.” [Official] Increase output consistency · AnthropicT1-official original Everything else trades guarantee for reach.

This is a capability-axis decision in the spine’s sense: the output shape is part of what the agent’s tool surface costs and promises. The rest of the chapter walks the levers in order, and the discipline that orders them — prevent beats recover — falls out at the end.

The four output levers as one ladder, strongest guarantee at the top. Structured outputs / strict give a grammar-constrained guarantee — but only except refusals and max_tokens cutoffs, over the supported schema subset. tool_choice + strict force a call and constrain its arguments. The validation/retry loop recovers after an invalid output rather than preventing one. Prompt-asked JSON shapes likelihood, not the token grammar, so it carries no guarantee. Up the ladder is more guarantee; down it is more flexibility.A vertical ladder of four rungs. Top rung (teal): 'Structured outputs / strict: true — grammar-constrained decoding', annotated 'schema-valid except refusals and max_tokens cutoffs, over the supported schema subset'. Second rung (blue): 'tool_choice + strict — force the call, guarantee the args', annotated 'a call is emitted; strict makes its arguments conform'. Third rung (orange): 'Validation + retry loop — re-prompt on mismatch', annotated 'no prevention — recovers after an invalid output (bounded retries)'. Bottom rung (gray): 'Prompt-asked JSON — specify format, stop sequence', annotated 'no guarantee — shapes likelihood, not the token grammar'. A left-hand axis runs from 'stronger guarantee' at the top to 'more flexibility' at the bottom.
The four output levers as one ladder, strongest guarantee at the top. Structured outputs / strict give a grammar-constrained guarantee — but only except refusals and max_tokens cutoffs, over the supported schema subset. tool_choice + strict force a call and constrain its arguments. The validation/retry loop recovers after an invalid output rather than preventing one. Prompt-asked JSON shapes likelihood, not the token grammar, so it carries no guarantee. Up the ladder is more guarantee; down it is more flexibility.

Tool use: tool_choice forces the call, strict guarantees the args

The first lever is the one most teams already have wired, because they use tool calls for everything else. Using a tool to shape output has two separable moves, and conflating them is the most common framing error.

The first move forces the model to emit a call. The tool_choice option {"type":"tool","name":"..."} “forces Claude to always use a particular tool.” [Official] Define tools · AnthropicT1-official original Give the model a single tool whose input_schema is your target shape, force that tool, and you are guaranteed it calls it.

The second move is not implied by the first. Forcing the call does not make the call’s arguments schema-valid — the model can still emit the wrong types or drop a required field. That guarantee is a separate addition: setting strict: true on the tool “guarantees Claude’s tool inputs match your JSON Schema by constraining the model’s token sampling to schema-valid outputs.” [Official] Strict tool use · AnthropicT1-official original

Key idea

tool_choice and strict are two different levers doing two different jobs. input_schema declares the target shape; tool_choice forces the model to emit a call; strict is what makes the call’s arguments conform to the schema. “I forced the tool, so the output is valid” is wrong — you forced the call, not the contents. Without strict, the docs note, Claude “might return incompatible types or missing required fields.” [Official] Strict tool use · AnthropicT1-official original

[Note]

The conceptual loop the overview names — Claude “returns a structured call that your application executes” Tool use with Claude · AnthropicT1-official original — is why tool use doubles as an output mechanism: the call’s input is JSON conforming to the tool’s input_schema, which your code parses directly.

This chapter treats tool use as an output mechanism only. Which tools to expose, how few, and how to design their surface is the subject of the tool-minimization and MCP chapters earlier in this volume — a different question from “how do I get a known shape back.”

The structured-outputs guarantee — and its limits

The second lever is the strongest, and the one most often overstated. Structured outputs “guarantee schema-compliant responses through constrained decoding,” [Official] Structured outputs · AnthropicT1-official original and the guarantee is mechanistically grounded rather than a strong-prompt effect: structured outputs “use constrained sampling with compiled grammar artifacts” [Official] Structured outputs · AnthropicT1-official original — the schema is compiled into a grammar that constrains which tokens the model may sample. That compiled-grammar mechanism is why the guarantee holds and what distinguishes it from prompt-only JSON, which carries no such constraint.

But the guarantee is conditioned, and the conditions are load-bearing — you must render them every time.

Key idea

The structured-outputs / strict guarantee is: schema-compliant output except for refusals (stop_reason: refusal) and max_tokens cutoffs, and only over the supported JSON-Schema subset. It is conditioned on a normal completion. Never write “always valid JSON in every circumstance” — a refused or truncated generation can still be non-conforming, and a schema feature outside the supported subset is not covered. State the guarantee with its exceptions or you have misstated it.

“It always returns valid JSON”

The single most common error here is dropping the conditions and quoting the headline. Two failure modes survive the guarantee by design: a refusal ends generation with stop_reason: refusal before a complete object exists, and a max_tokens cutoff truncates mid-object. Both can yield output that does not parse, with the guarantee fully in force — because the guarantee is over normal completions, not over every API response. A production path still checks stop_reason and still handles a parse failure; the guarantee shrinks that handling to a rare edge, it does not delete it.

[Caveat]

Structured outputs is now generally available, but mid-transition: the old beta header (structured-outputs-2025-11-13) and the output_format param keep working for a transition period, with the current param being output_config.format. This is feature-surface — confirm the param name and whether the transition has closed against the live docs before relying on it.

This is the bridge to the first lever: tool_choice forces a call, and strict runs this same constrained-decoding pipeline over the call’s arguments. Schema-shaped return value (structured outputs) and schema-valid tool arguments (strict tool use) are the same guarantee applied to two surfaces.

Prevent beats recover

The third lever closes the loop when the output is still malformed — but its place in the ordering is the real lesson. There are two ways to deal with bad output, and they are not equals.

The recover path responds after the fact. On the tool-use API, when a client tool returns a tool_result with is_error: true, “Claude will then incorporate this error into its response,” [Official] Handle tool calls · AnthropicT1-official original and for an invalid or missing-parameter call, “Claude will retry 2-3 times with corrections before apologizing to the user.” [Official] Handle tool calls · AnthropicT1-official original On the Agent SDK, the same instinct is wired as a loop: you define a JSON Schema “and the SDK validates the output against it, re-prompting on mismatch,” [Official] Get structured output from agents · AnthropicT1-official original erroring out — surfaced as error_max_structured_output_retries — if validation does not succeed within the retry limit.

The prevent path is the second lever: strict / structured outputs eliminate the invalid call by construction, so there is nothing to recover from. The handle-tool-calls docs themselves point at strict as the way to “eliminate invalid calls” rather than retry them. [Official] Handle tool calls · AnthropicT1-official original

Key idea

The design ordering is prevent, then recover. Use the grammar-constrained guarantee wherever you can — it removes the failure rather than reacting to it. Fall back to the validation/retry loop only where the guarantee is unavailable (a model or path without structured outputs) or insufficient (a schema richer than the supported subset). A retry loop layered on top of a guarantee you could have used is paying twice — latency on every failure plus the engineering of the loop — for a failure you could have prevented.

Recovery is not free and it is not certain. The SDK names three documented ways generation still fails: “This typically happens when the schema is too complex for the task, the task itself is ambiguous, or the agent hits its retry limit trying to fix validation errors.” [Official] Get structured output from agents · AnthropicT1-official original Each is a reason to prefer prevention: a simpler schema, a less ambiguous task, and a guarantee that needs no retries at all.

The retry loop is a fallback, not a default

It is tempting to reach for a validate-and-re-prompt loop first — it is familiar and it works without the API’s newer features. But a loop is recovery: it spends a round trip on every malformed output and can still exhaust its retries. The guarantee is prevention: no malformed output, no round trip. Order them prevent-first, and keep the loop for exactly the cases the guarantee cannot reach.

The lightest lever: prompt-craft recipes

The fourth lever is the oldest and the weakest — prompt-craft recipes that ask for a shape without constraining the tokens. They carry no schema guarantee at all, which is precisely when you want them: for flexibility beyond a strict schema, or on a path where the guarantee is not available.

The base recipe is to be explicit: “Precisely define your desired output format using JSON, XML, or custom templates so that Claude understands every output formatting element you require.” [Official] Increase output consistency · AnthropicT1-official original Two narrower JSON tricks have long ridden on top: prefilling the assistant turn to “skip the preamble and go straight to the JSON,” [Official] Prompting Claude for JSON mode · AnthropicT1-official original and pairing it with a stop sequence — “You can get rid of text that comes after the JSON by using a stop sequence.” [Official] Prompting Claude for JSON mode · AnthropicT1-official original

Here is where the loose end from the prompting chapter gets tied off. Prefilling the assistant turn is being deprecated on the newest models — it is not supported on Claude Opus 4.7, Opus 4.6, Sonnet 4.6, or Mythos Preview; on those models the documented replacement is structured outputs or system-prompt instructions. [Official] Increase output consistency · AnthropicT1-official original So the classic prefill-{ recipe is now legacy on current models — reach for the guarantee (lever two) or a system-prompt instruction instead. The stop-sequence recipe is unaffected by the gate and remains useful for trimming trailing prose.

[Caveat]

The prefill model gate grows with each release — the unsupported-model list will get longer. Treat the named set (Opus 4.7 / Opus 4.6 / Sonnet 4.6 / Mythos Preview) as a snapshot and re-check the prefill note on the increase-consistency page per model release before using prefill in any example.

These recipes shape the likelihood of a good shape; they do not constrain the grammar. That is the whole reason the docs redirect to structured outputs the moment you need a guarantee — the recipes are for the cases where you deliberately don’t.

A note on evidence

Everything in this chapter is Anthropic’s own API documentation — T1, authoritative by construction. That is the right tier for a vendor-API reference, but it is mono-vendor: there is no independent benchmark of how often structured outputs fail on a complex schema, or of the real-world distribution of refusal-versus-cutoff. This chapter does not invent one. The one number it quotes — “2-3 times” — is the docs’ own documented retry range, not a measured rate. If you want a failure rate for your schemas, you measure it; the docs tell you the guarantee and its conditions, not your distribution.

Patterns

Schema-as-output tool. Sketch: define one tool whose input_schema is your target shape, force it, parse the call. When to use: you already speak tool use and want a known shape back. Define tools · AnthropicT1-official original Mechanics: set tool_choice to {"type":"tool","name":"..."} to force the call; add strict: true to constrain the arguments. Strict tool use · AnthropicT1-official original Remember: tool_choice forces the call; strict is what makes the arguments conform — they are two levers.

Grammar-constrained guarantee. Sketch: use structured outputs / strict for a hard schema guarantee. When to use: you need schema-compliant output by construction, within the supported subset. Structured outputs · AnthropicT1-official original Mechanics: the schema compiles to a grammar that constrains token sampling; same pipeline drives strict tool use. Remember: the guarantee holds except refusals and max_tokens cutoffs, over the supported schema subset — still check stop_reason and a parse failure.

Prevent, then recover. Sketch: prevent invalid output with the guarantee; keep a retry loop for what it can’t reach. When to use: always order it this way. Get structured output from agents · AnthropicT1-official original Mechanics: structured outputs / strict first; fall back to validate-and-re-prompt (the SDK loop, or is_error feedback) where the guarantee is unavailable or insufficient. Handle tool calls · AnthropicT1-official original Remember: recovery costs a round trip per failure and can exhaust retries (complex schema, ambiguous task, retry-limit hit) — prevention costs neither.

Prompt-craft for flexibility. Sketch: specify the format and (optionally) use a stop sequence when you need reach beyond a strict schema. When to use: output flexibility the guarantee can’t express, or a path without it. Increase output consistency · AnthropicT1-official original Mechanics: state the format precisely; stop-sequence to trim trailing prose; do not prefill on current models (Opus 4.7 / Opus 4.6 / Sonnet 4.6 / Mythos Preview) — use system-prompt instructions or structured outputs. Prompting Claude for JSON mode · AnthropicT1-official original Remember: these shape likelihood, not the token grammar — no guarantee.

Quick reference

  • One surface, four levers: tool use → structured outputs / strict → validation/retry → prompt-craft — strongest guarantee to lightest weight. Structured outputs · AnthropicT1-official original
  • Two levers, not one: tool_choice forces the call; strict guarantees the arguments. Strict tool use · AnthropicT1-official original
  • The guarantee, stated honestly: schema-compliant output except refusals and max_tokens cutoffs, over the supported schema subset — never “always valid JSON.” Structured outputs · AnthropicT1-official original
  • Prevent beats recover: use the guarantee first; the retry loop is the fallback, not the default. Handle tool calls · AnthropicT1-official original
  • Prefill is deprecated on current models (Opus 4.7 / Opus 4.6 / Sonnet 4.6 / Mythos Preview) — use structured outputs or system-prompt instructions; the gate grows per release. Increase output consistency · AnthropicT1-official original
  • Volatility: structured-outputs GA is mid-transition and the prefill gate moves — re-check the param name and unsupported-model list per release.
  • Evidence: mono-vendor T1 docs; the only number (“2-3 times”) is the docs’ documented retry range, not a measured failure rate. Handle tool calls · AnthropicT1-official original

Practice

Exercise

A teammate sets tool_choice to force a single extract_invoice tool and says, “Now the output is guaranteed to match the schema.” Name the two distinct levers they have conflated, and state precisely what forcing the tool does and does not guarantee. What one addition closes the gap?

Practice ◆◆◆◇

Write the structured-outputs guarantee in a single sentence a teammate could paste into a design doc — including its exceptions — and then explain in two or three sentences why “it always returns valid JSON” is not just sloppy shorthand but an operational error: name the two completion conditions that break it and say what a production path must still do despite the guarantee being in force.

Practice ◆◆◆◆

You are designing the output path for an agent that must return a moderately complex nested object. Lay out, in order, which levers you reach for and why — and where prompt-craft (specifically prefill) drops out given current models. State explicitly where the validation/retry loop sits in your ordering and the conditions under which you would expect it to still fail even as a fallback. The point is to practice the prevent-then-recover ordering, not to enumerate the API surface.

Exercise solutions

Solution ↑ Exercise

The two conflated levers are tool_choice (forces the model to emit a call) and strict (guarantees the call’s arguments match the JSON Schema by constraining token sampling). Forcing the tool with tool_choice {"type":"tool","name":"extract_invoice"} guarantees only that Claude calls extract_invoice — it does not guarantee the arguments are schema-valid; without strict, the docs note, the model “might return incompatible types or missing required fields.” The one addition that closes the gap is strict: true on the tool definition, which runs the constrained-decoding pipeline over the arguments so they conform to the input_schema. The teammate forced the call and assumed they had also constrained the contents; those are separate levers.

Solution ↑ Exercise

A pasteable sentence: “Structured outputs / strict guarantee schema-compliant output through constrained decoding — except for refusals (stop_reason: refusal) and max_tokens cutoffs, and only over the supported JSON-Schema subset.” “It always returns valid JSON” is an operational error, not just loose phrasing, because the guarantee is conditioned on a normal completion: a refusal ends generation before a complete object exists, and a max_tokens cutoff truncates mid-object — both can yield output that does not parse, with the guarantee fully in force. So a production path must still check stop_reason and still handle a parse failure; the guarantee shrinks that handling to a rare edge case but does not remove the need for it. Treating the guarantee as absolute is what removes those checks and turns a rare refusal or truncation into an unhandled crash.

Solution ↑ Exercise

Order by prevent-then-recover. First, the guarantee: reach for structured outputs / strict (lever two) — for a nested object you want the grammar-constrained guarantee, provided the schema stays inside the supported JSON-Schema subset; if the return is naturally a tool call, use tool_choice to force it and strict to constrain its arguments (the same pipeline). Prompt-craft drops out early: prefill is deprecated on current models (Opus 4.7 / Opus 4.6 / Sonnet 4.6 / Mythos Preview), so the classic prefill-{ recipe is off the table — if you need any prompt-side help, it is a system-prompt format instruction (and a stop sequence to trim trailing prose), used only for flexibility the schema can’t express. The validation/retry loop sits last, as the fallback — engaged when the guarantee is unavailable (a path or model without structured outputs) or insufficient (a schema richer than the supported subset). Even as a fallback it can still fail, and the docs name when: the schema is too complex for the task, the task itself is ambiguous, or the agent exhausts its retry limit — which is exactly the argument for keeping the schema focused and the task unambiguous so prevention carries the load and the loop rarely runs.

Part 2 Chapter 18 Last verified 2026-06-14 Fresh

Sub-Agents: The Context-Isolation Primitive

The first move of the coordination axis — a sub-agent is isolation, not capability. A fresh window that inherits nothing and returns only the relevant result; the fresh-in / result-only-out contract that makes it composable; separation of concerns; roles as description plus system prompt plus scoped tools; and when the isolation earns its keep versus when it is pure overhead.

Volatility: architectural-pattern
Tools compared: claude-codecross-tool
Before you start: The spine's coordination axis (a new unit is a fresh window, not an added skill). The tool-minimization and context-as-budget ideas help; no orchestration specifics assumed.
You will learn
  • Why a sub-agent is a context-isolation primitive, not added capability — the single most important thing to understand before reaching for one
  • The fresh-in / relevant-result-only-out contract that makes isolation composable (and how fork_session inverts the input side)
  • How isolation buys separation of concerns, and how a role is just description + system prompt + a scoped tool set
  • When a sub-agent earns its keep — and when it is pure overhead

The spine’s second axis was coordination, and its reframing was sharp: a new unit of work is not a new skill, it is a fresh window. This chapter develops the primitive that embodies that — the sub-agent. Everything here follows from one claim the rest of the chapter unpacks: a sub-agent’s value is the separate context window, not any capability the model otherwise lacks. Get that backwards and you reach for a sub-agent to “make the model smarter”; get it right and you reach for one to quarantine context.

Isolation, not capability

Start from what a sub-agent is, because the common mental model — “a helper that can do something the main agent can’t” — is wrong and leads to misuse. Across Anthropic’s own descriptions the sub-agent is defined by its isolation, not by any new ability. It is, plainly, “an isolated Claude instance with its own context window.” [Official] How and when to use subagents in Claude Code · Anthropic (2026)T1-official original When a side task would otherwise flood the main conversation, “the subagent does that work in its own context and returns only the summary.” [Official] Create custom subagents · AnthropicT1-official original And from the caller’s side the isolation is total: “The subagent works in its own separate context window. None of its file reads touch yours.” [Official] Explore the context window · AnthropicT1-official original

So the value is the window, not the worker. A sub-agent runs the same model you already have; what it adds is a clean, separate place for that model to do focused work whose verbose middle never lands in your conversation.

Key idea

A sub-agent is a context-isolation primitive, not an added capability. The point is to quarantine context — protect the main window and do focused work out of band — not to give the model a skill it otherwise lacks. Reach for a sub-agent when you have context to isolate, never to “add intelligence.”

[Note]

“Isolation, not capability” is this book’s framing — a lens laid over the sources’ isolation statements. The primaries describe the mechanism (own window, inherits nothing, returns only the summary); the “not a capability” gloss is the interpretation that organizes them.

The contract: fresh in, relevant result only out

Isolation is only composable if the boundary is well-defined in both directions, and it is. On the way in, a sub-agent inherits nothing: “Each subagent starts fresh, unburdened by the history of the conversation or invoked skills.” [Official] How and when to use subagents in Claude Code · Anthropic (2026)T1-official original The SDK is precise about how fresh: a sub-agent “runs in its own fresh conversation” [Official] Subagents in the SDK · AnthropicT1-official original and does not receive the parent’s conversation history, tool results, or system prompt — the only channel from parent to sub-agent is the prompt string you pass it. On the way out, the return is equally narrow: “only its final message returns to the parent” [Official] Subagents in the SDK · AnthropicT1-official original — every intermediate tool call and result stays inside the sub-agent.

Concept · The fresh-in / result-only-out contract

A sub-agent takes one input (a prompt string — not your history, tools, or system prompt) and gives back one output (its final message). The verbose middle — its reads, its tool calls, its dead ends — never crosses either boundary. That narrow contract is exactly what lets you delegate without polluting your own window in either direction.

There is one deliberate exception, and naming it sharpens the rule. Fork mode (fork_session) inverts the input side: it carries the parent’s context into the sub-agent, for cases where the side task genuinely needs the conversation so far. The return contract is unchanged — still only the final message comes back. The configuration mechanics of fork mode belong to the SDK reference, not here; what matters for design is that the default is fresh, and fork is the explicit opt-out when isolation-on-input would cost you more than it saves.

[Note]

fork_session is named here as the deliberate inversion of the fresh-start default; its API surface is the Agent SDK’s subject, not this chapter’s.

Isolation buys separation of concerns

Why is a separate window worth the trouble? Because it makes each unit of work independent and non-interfering. Anthropic’s multi-agent write-up names the payoff directly: a sub-agent “provides separation of concerns—distinct tools, prompts, and exploration trajectories—which reduces path dependency and enables thorough, independent investigations.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original Sub-agents “facilitate compression by operating in parallel with their own context windows,” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original each compressing its slice of the work rather than sharing one crowded window.

The separate window is not just hygiene, then — it is the property that makes the next move (assigning roles) possible. Two investigations in one window contaminate each other’s reasoning; the same two in separate windows stay clean.

Roles: description + system prompt + scoped tools

If isolation gives you independent units, a role is how you specialize one. A role is not a new mechanism — it is three ordinary knobs set deliberately: a description, a system prompt, and a scoped tool set. The docs make the tool-scoping explicit (“limiting which tools a subagent can use” [Official] Create custom subagents · AnthropicT1-official original ) and the design principle blunt: “each subagent should excel at one specific task.” [Official] Create custom subagents · AnthropicT1-official original

The most productive role decomposition the sources demonstrate is generate-then-verify — focused generators plus a separate reviewer. Claude Code’s code review runs it concretely: “Each agent looks for a different class of issue,” [Official] Code Review · AnthropicT1-official original and then “a verification step checks candidates against actual code behavior to filter out false positives.” [Official] Code Review · AnthropicT1-official original The product announcement says the same in plainer words — the agents “look for bugs in parallel” [Official] Bringing Code Review to Claude Code · Anthropic (2026)T1-official original and then “verify bugs to filter out false positives.” [Official] Bringing Code Review to Claude Code · Anthropic (2026)T1-official original

A clean-room verifier beats self-evaluation

The reason the verifier is a separate sub-agent, not the generator checking its own work, is isolation again: a fresh window reviewing candidates against actual behavior has no stake in the path that produced them. Self-evaluation inherits the generator’s blind spots; a clean-room verifier does not. Separating generation from evaluation is the single highest-leverage use of the isolation primitive.

[Note]

The Planner / Generator / Evaluator naming is this book’s framing for that decomposition — a vocabulary laid over the anchored one-task principle and the generate-then-verify split, not a term the primaries use for sub-agents.

When a sub-agent earns its keep

Isolation is not free, and the honest rule is a tradeoff, not a default-on. The benefit side: delegating verbose work (running tests, fetching docs, processing logs) keeps that output in the sub-agent so “only the relevant summary returns to your main conversation,” [Official] Create custom subagents · AnthropicT1-official original and the multi-agent architecture “distributes work across agents with separate context windows to add more capacity for parallel reasoning.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The cost side, stated just as plainly: “Subagents start fresh and may need time to gather context” [Official] Create custom subagents · AnthropicT1-official original — a latency hit — and the task’s value “must be high enough to pay for the increased performance.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original

Key idea

Reach for a sub-agent when there is large or irrelevant context to quarantine, a parallelizable focused subtask, or a clean-room review to run. Stay on the main thread when latency dominates (a fresh agent must re-gather context) or the task value doesn’t clear the cost. Isolation is the lever; the question is always whether this task is worth a fresh window.

The quantified side of that cost — how many more tokens, how much more latency — is the Operations volume’s subject; this chapter asserts the tradeoff qualitatively and leaves the numbers to where they can be measured.

The isolation contract. The main conversation holds the full context; the only channel into a sub-agent is a single prompt string (not history, tool results, or system prompt), so it starts fresh. The sub-agent's reads, tool calls, and intermediate steps stay inside its own window; only its final message returns. fork_session deliberately inverts the input side — carrying the parent's context in — while leaving the result-only-out return unchanged.Two boxes. Left: 'Main conversation' holding the full context. A single arrow labelled 'prompt string (the only input channel)' runs to the right box: 'Sub-agent — fresh window', annotated 'inherits nothing: no history, no tool results, no system prompt'. Inside the sub-agent box, smaller items labelled 'reads · tool calls · dead ends' are marked 'stay here'. A return arrow labelled 'only the final message' runs back to the main conversation. A dashed arrow under the input channel is labelled 'fork_session: carries parent context in (return unchanged)'.
The isolation contract. The main conversation holds the full context; the only channel into a sub-agent is a single prompt string (not history, tool results, or system prompt), so it starts fresh. The sub-agent's reads, tool calls, and intermediate steps stay inside its own window; only its final message returns. fork_session deliberately inverts the input side — carrying the parent's context in — while leaving the result-only-out return unchanged.
Two delegations: one earns the window, one doesn't Worked example

An agent is mid-task and faces two side jobs. Which should be a sub-agent?

  • “Run the full test suite and tell me what failed.” Verbose output (thousands of lines), most of it irrelevant to the main thread. Delegate it: the sub-agent runs the suite in its own window and returns only the relevant summary, Create custom subagents · AnthropicT1-official original keeping the noise out of your context. The isolation pays.
  • “Rename this variable in the file you’re already editing.” Tiny, and it needs the context you already hold. A fresh sub-agent would start cold and “need time to gather context” Create custom subagents · AnthropicT1-official original you already have — pure overhead. Stay on the main thread.

Same primitive, opposite verdicts — because the question was never “can a sub-agent do this?” (it always can; it’s the same model) but “is there context worth quarantining here?”

Reaching for a sub-agent to add intelligence

The misuse that follows from the wrong mental model: spinning up a sub-agent to “think harder” or “be the expert.” It is the same model with no extra ability — all you have added is a fresh window and the latency of filling it. If the task doesn’t have context to isolate, work to parallelize, or a review to run clean, a sub-agent makes the system slower and no smarter. Isolation is the only thing on offer; if you don’t need it, don’t pay for it.

Patterns

Quarantine verbose work. Sketch: delegate high-volume, low-relevance work (tests, doc fetches, log scans) to a sub-agent; keep only the summary. When to use: any side task whose output would crowd the main window. Create custom subagents · AnthropicT1-official original Mechanics: pass a focused prompt; the sub-agent’s verbose middle stays in its window, only the final message returns. Remember: the win is the quarantine, not the work — the main agent could do it, just not without the noise.

Clean-room verify. Sketch: generate with one (or several) focused sub-agents, then review with a separate verifier. When to use: anything where self-evaluation would inherit the generator’s blind spots. Code Review · AnthropicT1-official original Mechanics: focused generators look for distinct issue classes; a verifier checks candidates against actual behavior to filter false positives. Remember: the verifier is separate on purpose — a fresh window has no stake in the path it reviews.

Scope the role, not just the prompt. Sketch: define a role by description + system prompt + a deliberately limited tool set. When to use: whenever a sub-agent should excel at one task and not wander. Create custom subagents · AnthropicT1-official original Mechanics: allowlist/denylist its tools; one task per sub-agent. Remember: the tool scope is part of the role — an unscoped sub-agent is an unfocused one.

Default fresh, fork on purpose. Sketch: let sub-agents start fresh; use fork_session only when the task genuinely needs the parent’s context. When to use: fork when re-gathering context would cost more than carrying it in. Subagents in the SDK · AnthropicT1-official original Mechanics: fresh is the default (one prompt string in); fork inverts the input side, return unchanged. Remember: fork trades isolation-on-input for context — reach for it deliberately, not by habit.

Quick reference

  • What it is: an isolated instance with its own context window — isolation, not capability. How and when to use subagents in Claude Code · Anthropic (2026)T1-official original
  • The contract: fresh in (one prompt string; no inherited history/tools/system prompt), relevant-result-only out (final message only). Subagents in the SDK · AnthropicT1-official original
  • fork_session: the deliberate inversion of the fresh-start default (carries parent context in; return unchanged).
  • Why isolate: separation of concerns — distinct tools/prompts/trajectories, non-interfering. How we built our multi-agent research system · Anthropic (2025)T1-official original
  • Roles: description + system prompt + scoped tools; the productive split is generate-then-verify with a clean-room reviewer. Code Review · AnthropicT1-official original
  • When: quarantine context / parallelize / clean-room review. When not: latency dominates, or task value doesn’t clear the cost. How we built our multi-agent research system · Anthropic (2025)T1-official original (Quantified cost → the Operations volume.)

Practice

Exercise

A teammate says, “Let’s add a sub-agent so the system can handle the hard math it keeps getting wrong.” Name the mental-model error in one sentence, and state what a sub-agent actually adds. Under what different framing (if any) could a sub-agent legitimately help with the math task?

Practice ◆◆◆◇

Take an agent you run. Identify one side task that should be a sub-agent and one that should not, and justify each by the contract — what context gets quarantined (or wastefully re-gathered), and what single result would return. For the one that should be a sub-agent, decide whether it should run fresh or fork_session, and say why. The point is to feel the decision turn on context to isolate, not on capability needed.

Exercise solutions

Solution ↑ Exercise

The error is treating a sub-agent as added capability — it is the same model with no extra ability, so it will get the hard math wrong in its own window just as the main agent does. What a sub-agent actually adds is context isolation: a fresh, separate window. A sub-agent could legitimately help the math task only under an isolation framing, not a capability one — e.g. a clean-room verifier that checks the main agent’s answer against actual computation (running code, not re-deriving by hand), where the value is the independent check in a window with no stake in the original derivation, or quarantining a verbose symbolic-computation step so its output doesn’t flood the main thread. The reframing is the whole lesson: “make it smarter” is the wrong reason; “isolate or independently check” is the right one.

Solution ↑ Exercise

A representative pass. Should be a sub-agent: “summarize the 2,000-line dependency-audit log.” It quarantines a large, low-relevance output — the sub-agent reads the log in its own window and returns only the few findings that matter, so the main conversation never carries the 2,000 lines. It should run fresh (fork_session off): the task needs only the log, not the conversation so far, so carrying parent context in would cost window for no benefit. Should not be a sub-agent: “fix the typo in the function we’re editing.” It needs context the main agent already holds, the output is trivial, and a fresh sub-agent would pay startup latency to re-gather what’s already in hand — pure overhead. The decision turned on context to isolate (lots, in the first; none, in the second), exactly as the contract predicts — never on whether a sub-agent could do it.

Part 2 Chapter 19 Last verified 2026-06-14 Fresh

Multi-Agent: Coordinating Many

Coordinating many agents as one decision chain — topology, then coordinator, then verifier, then a cost gate. Orchestrator-worker and the centralized-to-decentralized axis; the decompose-delegate-aggregate loop two independent first-party posts describe; the in-orchestration verifier; and the genuinely open, unflattened question of when multi-agent is worth its cost — Anthropic ships it, Cognition argues against it, and they share the parallelizability test.

Volatility: architectural-pattern
Tools compared: claude-codecross-tool
Before you start: The sub-agent chapter (the isolation primitive this coordinates) and the spine's coordination axis. The tool-minimization cost instinct helps.
You will learn
  • Multi-agent design as one decision chain — topology → coordinator → verifier → cost gate
  • Orchestrator-worker as the spine, and the centralized↔decentralized axis (supervisor / swarm)
  • The decompose → delegate → aggregate coordinator loop, and the in-orchestration verifier
  • The genuinely open question of when multi-agent is worth its cost — held unflattened, with both camps and the test they share

The sub-agent chapter gave you the unit: a fresh, isolated window. This chapter coordinates many of them. The temptation is to treat “multi-agent” as a capability tier — more agents, more power — but it is better read as one decision chain: choose a topology, implement a coordinator, add a verifier, and gate the whole thing on cost. The last gate is the one that matters most, and it is where the field genuinely disagrees — so this chapter ends not with a verdict but with an honest, dated map of an open question.

One decision chain

Multi-agent design looks like four separate topics — topologies, coordination, verification, cost — but they are four sequential moves in one decision. You pick a topology (how the agents are arranged and who directs them), implement the coordinator (how the lead decomposes and recombines), add a verifier (how worker output is checked), and gate the whole thing on cost (whether the work is parallelizable enough to be worth it). Reading the chapter in order is reading the decision in order.

Key idea

Multi-agent is topology → coordinator → verifier → cost gate, in that order. The first three are mechanics you can get right; the fourth is a go/no-go that most candidate multi-agent systems fail. Design the chain, but expect the cost gate to send you back to a single agent more often than not.

Orchestrator-worker, and the centralized↔decentralized axis

The canonical shape is orchestrator-worker. Anthropic’s research system “uses a multi-agent architecture with an orchestrator-worker pattern, where a lead agent coordinates the process while delegating to specialized subagents that operate in parallel.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original On a query, “the lead agent analyzes it, develops a strategy, and spawns subagents to explore different aspects simultaneously.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original

Framework vocabulary names the two ends of an axis. The centralized end is the supervisor: “The supervisor controls all communication flow and task delegation, making decisions about which agent to invoke based on the current context and task requirements.” [Official] LangGraph Multi-Agent Supervisor · LangChain (langchain-ai)T2-release-notes original The decentralized end is the swarm, “where agents dynamically hand off control to one another based on their specializations” [Official] LangGraph Multi-Agent Swarm · LangChain (langchain-ai)T2-release-notes original with no central coordinator.

Concept · Orchestrator-worker on a centralized↔decentralized axis

Orchestrator-worker and supervisor are the same centralized shape under two vocabularies: one lead directs the workers. The swarm is the decentralized alternative: peers hand off control with no lead. Most production systems sit at the centralized end — a lead is easier to reason about, verify, and cost — so treat orchestrator-worker as the default and the swarm as the exception you justify.

[Note]

The supervisor/swarm names come from a framework (LangGraph) — vendor-authoritative on their own framework, and corroborating Anthropic’s centralized/decentralized shapes (so official-relative-to-the-vendor, not practitioner, which this chapter reserves for Cognition). The names are stable, durable vocabulary; the library APIs behind them are in flux (the LangChain 1.0 transition) — the names anchor, not the API surface.

The coordinator: decompose → delegate → aggregate

Inside the centralized shape, the lead runs one reusable loop. It decomposes the query — “the lead agent decomposes queries into subtasks and describes them to subagents,” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original where each brief carries an objective, an output format, tool guidance, and clear boundaries. It delegates those briefs to workers running in parallel. And it aggregates their results.

That this is a pattern and not one team’s idiom is the strongest evidence in the chapter, because two independent first-party posts describe the same loop.

Convergence claude-codecross-tool

Anthropic’s 2025 research-system post states the lead “decomposes queries into subtasks and describes them to subagents.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The earlier “Building effective agents” taxonomy defines the orchestrator-workers workflow independently as one where “a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original Two independent first-party posts, the same decompose-delegate-aggregate loop. [Convergence]

The convergence is what licenses treating the loop as the reusable coordinator pattern rather than a single system’s design choice.

The verifier: separating generation from review

A coordinator that only generates is incomplete; the pattern’s natural complement is a verifier — a dedicated reviewer, separate from the workers. In practice, Anthropic “used an LLM judge that evaluated each output against criteria in a rubric” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original (factual accuracy, citation accuracy, completeness, source quality, tool efficiency) — LLM-as-judge applied inside the orchestrated system. At the workflow level this is the evaluator-optimizer: “one LLM call generates a response while another provides evaluation and feedback in a loop.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original

This is the same separate-the-reviewer move the sub-agent chapter’s clean-room verifier made, now applied across the orchestrated system: workers generate, a verifier reviews. How to calibrate that judge — its score scale, rubric reliability — is the Operations volume’s evaluation subject, not this chapter’s; here the verifier is a structural role, not a measured instrument.

The cost gate — and a genuinely open question

Now the gate that decides whether any of the above should exist. Multi-agent systems are expensive, and the first-party figure is the one to hold: “agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original

[Caveat]

That ~15× is a single first-party datapoint — Anthropic’s own number for its own research system, quoted verbatim. It is not a measured cross-system law; do not generalize it to “multi-agent always costs 15×.” The deeper token-economics is the Operations volume’s subject.

Anthropic itself draws a boundary on the same page: tasks that “require all agents to share the same context or involve many dependencies between agents are not a good fit for multi-agent systems today.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original So even the camp that ships orchestrator-worker reserves it for fan-out-friendly work.

And here the field does not agree — which is worth presenting honestly rather than resolving.

When is multi-agent worth it? A live, open question

As of mid-2026 this is genuinely unresolved, and the two most-cited positions point opposite ways. The honest move is to lay them side by side, dated, and let the reader weigh them.

AnthropicCognition (Walden Yan)
StanceBuilds and ships orchestrator-worker for a production research systemArgues against multi-agent collaboration for most cases
Recommended defaultOrchestrator-worker for fan-out-friendly tasksA single-threaded linear agent
On the worth-it testReserve it for parallelizable work; shared-context/high-dependency tasks are a poor fitSame boundary, read pessimistically: most real work shares too much context to parallelize cleanly
Provenance (dated)“How we built our multi-agent research system,” 2025-06-13”Don’t Build Multi-Agents,” 2025-06; “Multi-Agents: What’s Actually Working,” 2026-04-22

The positions, in each camp’s own words. Cognition argues that multi-agent collaboration is fragile because “the decision-making ends up being too dispersed and context isn’t able to be shared thoroughly enough between the agents,” [Practitioner] Don't Build Multi-Agents · Walden Yan (Cognition)T3-practitioner original and that “the simplest way to follow the principles is to just use a single-threaded linear agent.” [Practitioner] Don't Build Multi-Agents · Walden Yan (Cognition)T3-practitioner original Its 2026 follow-up refines rather than reverses that: “parallel agents make implicit choices about style, edge cases, and code patterns … these decisions often conflicted with each other, leading to fragile products.” [Practitioner] Multi-Agents: What's Actually Working · Walden Yan (Cognition) (2026)T3-practitioner original

They share the test; they disagree on the window

This is the book’s reading, not a sourced claim. The two camps are not as far apart as the headlines suggest: both converge on a parallelizability test (multi-agent earns its keep only when work fans out into genuinely independent subtasks), and they disagree on how wide that window is. Anthropic finds enough fan-out-friendly work (research) to ship orchestrator-worker; Cognition finds most real work (coding) too interdependent and defaults to single-threaded. The open question is not whether the test is right but how much work passes it — and that is unsettled. Design for the work in front of you, keep the choice reversible, and re-check the field; do not treat either camp as the settled answer.

[Caveat]

This is a moving front: the Cognition follow-up is recent (2026-04-22) and narrows to designs where writes stay single-threaded. Re-check both camps’ current positions before treating this map as current — it is a 2026 snapshot of an open debate, not a resolved one.

Multi-agent as one decision chain. Choose a topology (orchestrator-worker on a centralized supervisor ↔ decentralized swarm axis), implement the coordinator (decompose → delegate → aggregate), add a verifier (an LLM judge separating generation from review), then gate on cost — a single first-party datapoint puts multi-agent at ~15× a chat's tokens, so the work must fan out into genuinely independent subtasks to clear the gate. The gate is where the field disagrees: Anthropic ships orchestrator-worker for fan-out-friendly work; Cognition argues most work is too interdependent and prefers a single-threaded agent.A downward decision chain of four stages. Stage 1 'Topology': a lead box over parallel worker boxes (orchestrator-worker), on a left-to-right axis from 'supervisor (centralized)' to 'swarm (decentralized)'. Stage 2 'Coordinator': decompose to delegate to aggregate. Stage 3 'Verifier': an LLM judge reviewing worker outputs against a rubric. Stage 4 'Cost gate': annotated '~15x tokens (one first-party datapoint) - clear only if the work is genuinely parallelizable', with a fork showing 'fans out -> multi-agent' versus 'shares context -> single-threaded agent'.
Multi-agent as one decision chain. Choose a topology (orchestrator-worker on a centralized supervisor ↔ decentralized swarm axis), implement the coordinator (decompose → delegate → aggregate), add a verifier (an LLM judge separating generation from review), then gate on cost — a single first-party datapoint puts multi-agent at ~15× a chat's tokens, so the work must fan out into genuinely independent subtasks to clear the gate. The gate is where the field disagrees: Anthropic ships orchestrator-worker for fan-out-friendly work; Cognition argues most work is too interdependent and prefers a single-threaded agent.
Walking the chain on a real task Worked example

A team wants a multi-agent system to “modernize our legacy service.” Walk the chain:

  • Topology. If anything, centralized (orchestrator-worker / supervisor) — a swarm’s dispersed control is harder to verify and cost.
  • Coordinator. The lead would decompose “modernize” into subtasks and brief workers. How we built our multi-agent research system · Anthropic (2025)T1-official original But notice the briefs: “update the data layer,” “refactor the API,” “migrate the tests” — these are not independent. They share types, contracts, and patterns.
  • Verifier. You could add an LLM judge over each worker’s diff. Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original
  • Cost gate — and this is where it fails. The subtasks are highly interdependent (shared context, mutual dependencies), exactly the “not a good fit” regime Anthropic names. How we built our multi-agent research system · Anthropic (2025)T1-official original Cognition’s critique bites here too: parallel agents would make conflicting implicit choices about patterns and edge cases. Multi-Agents: What's Actually Working · Walden Yan (Cognition) (2026)T3-practitioner original The work doesn’t pass the parallelizability test, and at ~15× the token cost, How we built our multi-agent research system · Anthropic (2025)T1-official original it isn’t worth it.

The verdict: a single-threaded agent (or sequential sub-agent delegations for the genuinely-isolable bits, per the previous chapter), not a multi-agent system. The chain did its job by sending you back to one agent.

Reaching for multi-agent because the task feels big

Task size is not the gate; parallelizability is. A big, interdependent task (most coding) fails the cost gate — it shares too much context to fan out, so multi-agent multiplies tokens (~15×) and invites conflicting implicit choices without buying parallel speed. Reach for multi-agent when the work decomposes into genuinely independent subtasks (research-style fan-out), not when it is merely large. When in doubt, the field’s skeptical camp defaults to a single-threaded agent — a reasonable prior.

Patterns

Default to orchestrator-worker. Sketch: one lead coordinates parallel workers; reach for a swarm only with cause. When to use: any multi-agent system that clears the cost gate. How we built our multi-agent research system · Anthropic (2025)T1-official original Mechanics: lead analyzes, strategizes, spawns workers; supervisor controls flow + delegation. Remember: centralized is easier to verify and cost than decentralized.

Run the coordinator loop. Sketch: decompose → delegate (focused briefs) → aggregate. When to use: the lead’s core job. Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original Mechanics: each brief carries objective, output format, tool guidance, boundaries; the lead synthesizes results. Remember: two independent first-party posts describe this same loop — it is a pattern, not an idiom.

Add a clean-room verifier. Sketch: a dedicated LLM judge reviews worker output against a rubric. When to use: whenever generation should be checked by something other than the generator. How we built our multi-agent research system · Anthropic (2025)T1-official original Mechanics: rubric dimensions (accuracy, completeness, source quality, …); generation and review separated. Remember: judge calibration is the Operations volume’s job; here it’s a structural role.

Gate hard on parallelizability. Sketch: go multi-agent only when the work fans out into independent subtasks. When to use: the go/no-go before building anything. How we built our multi-agent research system · Anthropic (2025)T1-official original Mechanics: shared-context/high-dependency work is a poor fit; multi-agent runs ~15× a chat’s tokens (one first-party datapoint). Remember: most tasks fail this gate — defaulting back to a single agent is the common, correct outcome.

Quick reference

  • The chain: topology → coordinator → verifier → cost gate.
  • Topology: orchestrator-worker (= supervisor, centralized) is the default; swarm (decentralized) is the exception. LangGraph Multi-Agent Supervisor · LangChain (langchain-ai)T2-release-notes original LangGraph Multi-Agent Swarm · LangChain (langchain-ai)T2-release-notes original
  • Coordinator: decompose → delegate → aggregate — described by two independent first-party posts. Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original
  • Verifier: an in-orchestration LLM judge separates generation from review (calibration → Operations volume). How we built our multi-agent research system · Anthropic (2025)T1-official original
  • Cost: ~15× a chat’s tokens — one first-party datapoint, not a law; don’t generalize. How we built our multi-agent research system · Anthropic (2025)T1-official original
  • The open question: Anthropic ships orchestrator-worker; Cognition argues for single-threaded; both share the parallelizability test, disagree on the window width — unsettled as of 2026, recheck. Don't Build Multi-Agents · Walden Yan (Cognition)T3-practitioner original

Practice

Exercise

State the multi-agent decision chain in order, and say which stage is a go/no-go rather than a mechanic. Then explain why “the task is large” is the wrong trigger for multi-agent and what the right trigger is.

Practice ◆◆◆◆

Take a task you might be tempted to make multi-agent. Walk the chain: pick a topology, sketch the coordinator’s decomposition, name a verifier, then apply the cost gate honestly — are the subtasks genuinely independent, or do they share context? Decide go or no-go, and tie your reasoning to both the ~15× cost figure (without generalizing it) and the parallelizability test. The point is to feel the gate do its work — most honest walks end in “no-go, use one agent.”

Practice ◆◆◆◆

The chapter presents the Anthropic↔Cognition disagreement as a live, open question, not a settled one. In three or four sentences, state both positions fairly (with their dates), name exactly what they agree on and what they disagree on, and explain why a responsible architect treats this as a reversible, re-checkable decision rather than picking a permanent side. The point is to practice holding a genuine disagreement unflattened — neither flattening Cognition into “multi-agent is bad” nor Anthropic into “multi-agent is good.”

Exercise solutions

Solution ↑ Exercise

The chain is topology → coordinator → verifier → cost gate. The first three are mechanics (choosing a shape, implementing the decompose-delegate-aggregate loop, adding a reviewer); the cost gate is the go/no-go — it decides whether the system should exist at all. “The task is large” is the wrong trigger because size is not what makes multi-agent pay: a large but interdependent task shares too much context to fan out, so multiple agents multiply tokens (~15× a chat, on the one first-party datapoint) and risk conflicting implicit choices without buying parallel speed. The right trigger is parallelizability — the work must decompose into genuinely independent subtasks (research-style fan-out), which is exactly the regime both camps’ test points to and which most large coding tasks fail.

Solution ↑ Exercise

The shape of a good answer (the verdict matters less than the honest walk). Take a tempting task — “build a new end-to-end feature across our stack.” Topology: if anything, orchestrator-worker (centralized is easier to verify and cost than a swarm). Coordinator: the lead decomposes into “data layer,” “API,” “UI,” “tests” and briefs a worker each. Verifier: an LLM judge over each worker’s diff. Cost gate — and here it fails: those subtasks are not independent — they share types, contracts, and patterns, so they must share context (the “not a good fit” regime), and parallel workers would make conflicting implicit choices about those shared patterns. The work fails the parallelizability test, and at ~15× a chat’s tokens (one first-party datapoint, not a law to generalize) the spend is not bought back by parallel speed. Verdict: no-go — a single-threaded agent, or sequential sub-agent delegations for the genuinely isolable bits (a self-contained migration script, a doc-generation pass), not a multi-agent system. A task that would pass the gate: “survey ten unrelated libraries and summarize each” — genuinely fan-out, no shared context, the rare go. The exercise’s point is that the honest walk usually ends in no-go, and that the gate — not task size — is what decides.

Solution ↑ Exercise

A fair statement: Anthropic (multi-agent-research, 2025-06-13) builds and ships orchestrator-worker for a production research system, and reserves it for fan-out-friendly work — it explicitly says shared-context/high-dependency tasks are a poor fit. Cognition (Walden Yan: “Don’t Build Multi-Agents,” 2025-06; “Multi-Agents: What’s Actually Working,” 2026-04-22) argues that multi-agent collaboration is fragile because decision-making is too dispersed and context can’t be shared thoroughly enough, and defaults to a single-threaded linear agent. They agree on the underlying test — multi-agent is worth it only when work fans out into independent subtasks; they disagree on how much real work passes that test (Anthropic finds enough in research; Cognition finds most coding too interdependent). A responsible architect treats it as reversible and re-checkable because the question is empirically open and moving (the Cognition follow-up is from 2026-04-22), so betting the architecture permanently on either camp — rather than designing for the work in front of you and re-checking — would be flattening a live disagreement into a false certainty.

Part 2 Chapter 20 Last verified 2026-06-14 Fresh

Composing Tools & Orchestration: The Two Axes as One System

The capstone of the Tools & Orchestration volume — composing its chapters into one sequenced design workflow on the spine's two axes (capability and coordination), the recurring decision points, an honest map of the evidence tiers, and the boundary this volume leaves to Operations.

Volatility: architectural-pattern
Tools compared: claude-codecross-tool
Before you start: The whole volume — the capability chapters (build-vs-buy, tool minimization, MCP, shaping I/O) and the coordination chapters (sub-agents, multi-agent).
You will learn
  • How the volume’s chapters compose into one sequenced design workflow, not eight separate concerns
  • A decision order for designing an agent’s tools and orchestration together
  • The recurring trade-offs, and how the volume resolves them
  • An honest map of the evidence — what is official, converged, single-datapoint, openly contested, and volatile

This chapter is integrative. It introduces no new evidence — it composes the volume’s grounded claims into a design workflow and a decision guide. Where it restates a load-bearing fact, it points back to the chapter that established it; the rest is synthesis.

[Note]

Integrative synthesis, grounded in the prior chapters — not a new evidence chapter. Read it as the “how to put it together,” not as new claims.

The two axes are one system

The volume opened on two axes the spine drew: capability — what you expose to the agent — and coordination — how many isolated windows you run. Eight chapters in, the payoff is that they are not separate subjects but two ways of spending one currency. Context is “a critical but finite resource for AI agents,” [Official] Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original and both axes draw on it: capability spends the window directly (every tool, abstraction, and prompt sits in it), and coordination spends it by multiplication (every agent is another window to fill and pay for).

Key idea

Both axes reduce to the context window, so both inherit the same default: add only when the workflow demonstrably needs it. On the capability axis that means subtract; on the coordination axis it means isolate only when there’s context to quarantine or independent work to fan out. “Can I add this?” is the wrong question on either axis; “what does it cost in the window, and is the work worth it?” is the right one.

A design workflow

The chapters fall into a natural order when you design a real agent’s tools and orchestration together.

  1. Start direct; add a harness only when earned (ch13). Write thin on the API first; configure/wrap/extend a production harness before building one, and treat any framework’s convenience as abstraction you pay for in lost visibility. [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original
  2. Subtract the tool set to the workflow (ch14). The smallest set that covers the work beats a complete one: “more tools don’t always lead to better outcomes.” [Official] Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original Consolidate overlaps; make each tool’s response high-signal; load on demand only when scale forbids subtracting.
  3. Wire external capability least-privilege (ch15). Reach across MCP against a capability-negotiated protocol, designing to its security obligations rather than assuming it enforces them — and against a known moving target.
  4. Shape the I/O (ch16, ch17). Use the prompting craft for what goes in (examples first), and the output levers for what comes back — preferring the grammar-backed guarantee, stated with its limits, over a recover-after-the-fact retry loop.
  5. Reach for coordination only when the work fans out (ch18, ch19). A sub-agent is isolation, not capability — use it to quarantine context, parallelize, or clean-room review. Escalate to a multi-agent topology only when subtasks are genuinely independent and the value clears the cost.
Concept · Capability first, coordination last

Steps 1–4 are the capability axis (what to expose, and how to shape its I/O); step 5 is the coordination axis (how many windows). The order matters: most “my agent is slow/unreliable” problems are capability-axis problems (too many overlapping tools, verbose responses) that are cheaper to fix than any topology change. Exhaust the capability axis before spending windows on coordination.

Decision points

The recurring trade-offs, and how the volume resolves them:

  • Add vs. subtract (capability). Default to subtract: the smallest tool set that covers the workflow, the minimal harness, the prompt that achieves its effect with examples rather than elaborate structure. Every addition is paid in the window whether or not it fires.
  • Build vs. buy (harness). Start direct; adopt a configurable harness when a concrete need earns the abstraction; build from scratch only when nothing fits — because a custom harness is a standing maintenance cost as models move.
  • Guarantee vs. flexibility (output). Reach for structured outputs / strict when you need a hard schema guarantee (stated with its refusal/max_tokens/supported-subset limits); reach for prompt-craft when you need flexibility beyond a strict schema. Prevent beats recover.
  • Primitive vs. topology (coordination). A sub-agent is the unit (one isolated window); a multi-agent system is how units coordinate. Don’t build a topology where one isolated sub-agent would do, and don’t expect a lone sub-agent to deliver what only coordination can.
  • The cost gate (multi-agent). Go multi-agent only when the work is genuinely parallelizable — a single first-party datapoint puts the cost at ~15× a chat’s tokens, [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original so interdependent, shared-context work fails the gate.
The one question that locates most decisions

“What does this cost in the context window, and is the work worth it?” Every move in this volume — a tool, an abstraction, an MCP server, a verbose prompt, a sub-agent, a whole topology — is a debit against the same finite window. Locating a decision means pricing it in that currency and asking whether the workflow’s need clears the price. Most do not, which is why subtract and stay single-agent are the defaults.

An honest map of the evidence

The volume’s claims sit at different evidence tiers, and designing well means weighting them accordingly.

  • Official mechanics (authoritative by construction). The tool-design guidance, the MCP spec, the structured-outputs guarantee, the sub-agent and orchestrator-worker mechanics are first-party Anthropic — authoritative on what they are. Much of it is single-vendor; treat it as the platform’s design, not independently-benchmarked efficacy.
  • Converged (two kinds, of different strength). Two places earn a convergence tag: tool-minimization’s three independent vendor self-reports (Vercel, GitHub, Block — three separate companies), and the decompose-delegate-aggregate loop that two Anthropic posts state independently of each other Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original (same vendor, two publications — a weaker independence than three separate companies). Convergence of direction, not transferable numbers.
  • A single datapoint (hold loosely). The ~15× multi-agent token figure is one first-party number for one system, How we built our multi-agent research system · Anthropic (2025)T1-official original quoted verbatim — a cost gate, not a law to generalize.
  • Openly contested. Whether multi-agent is worth it is a live 2026 disagreement: Anthropic ships orchestrator-worker, Cognition argues for single-threaded, and the two share a parallelizability test while disagreeing on how much work passes it. Design for the work in front of you; keep the choice reversible.
  • Volatile (re-check). The MCP release candidate (2026-07-28) and the prefill deprecation move per release. Build to the stable core; date your snapshots.
Designing as if it's all settled

The biggest capstone error is flattening the tiers — quoting the ~15× as a universal law, treating a first-party mechanic as proven efficacy, or picking a permanent side in the multi-agent debate. Design to the evidence you have: lean on the converged practices, hold single datapoints as single datapoints, keep the contested and volatile parts (the worth-it window, the MCP RC, prefill) behind reversible, re-checkable choices.

The boundary of this volume

This volume engineers two of the harness’s moves — the tools an agent reaches for and the orchestration of more than one. It stops, deliberately, at measuring and operating them. How to evaluate an agent (the harness, the suite, judge calibration), how to model cost beyond the single ~15× datapoint, how to make a system observable, how to keep a human in the loop, and how to defend against adversarial input (the MCP threat model this volume only pointed at) are the Operations volume’s subject, not this one’s. What this volume owns of them is only their footprint — the token cost a sub-agent or topology incurs, the design-time security posture MCP asks for — flagged where it lands.

The volume composed as one sequenced workflow. The capability axis comes first — start direct and add a harness only when earned, subtract the tool set to the workflow, wire external capability least-privilege over MCP, shape the I/O — and the coordination axis comes last: reach for a sub-agent (isolation) or a multi-agent topology only when the work genuinely fans out and clears the ~15× cost gate. Every step is a debit against the same finite context window.A left-to-right workflow. A 'Capability axis' group of four sequential boxes: 'build vs. buy — start direct', 'tool minimization — subtract', 'MCP — least-privilege', 'shape I/O — prompting + structured output'. An arrow leads to a 'Coordination axis' gate: 'work genuinely fans out?' forking to 'sub-agent (isolation) / multi-agent topology' on yes and 'stay single-agent' on no, the multi-agent branch annotated '~15x cost gate'. A banner underneath reads 'every step debits one finite context window'.
The volume composed as one sequenced workflow. The capability axis comes first — start direct and add a harness only when earned, subtract the tool set to the workflow, wire external capability least-privilege over MCP, shape the I/O — and the coordination axis comes last: reach for a sub-agent (isolation) or a multi-agent topology only when the work genuinely fans out and clears the ~15× cost gate. Every step is a debit against the same finite context window.

Quick reference

  • Two axes, one currency: capability (what’s in a window) and coordination (how many windows) both spend context. Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original
  • Workflow: start direct → subtract the tools → wire MCP least-privilege → shape the I/O → coordinate only when the work fans out.
  • The locating question: what does it cost in the window, and is the work worth it?
  • Defaults: subtract on capability; stay single-agent on coordination.
  • Weight the evidence: official-mechanic → converged → single-datapoint → contested → volatile.
  • Boundary: evaluation, cost-modeling, observability, human-in-the-loop, and security are the Operations volume.

Practice

Exercise

Walk the five-step workflow for an agent you’d actually build. At each step name the one decision you’d make and which chapter grounds it. Where is your design leaning on the weakest-tier evidence (a single datapoint, or the contested multi-agent question), and how would you keep that choice reversible?

Practice ◆◆◆◆

Take a real or imagined agent and design it end-to-end across both axes: the harness (build vs. buy), the tool set (what survives subtraction), any MCP wiring, the output shape, and whether any coordination (sub-agent or multi-agent) is justified. For each decision, price it in the context-window currency and state what would have to be true to spend more. Then identify the single place your design is most exposed to weak-tier evidence and make that choice reversible. The point is to feel the volume as one discipline — capability priced first, coordination only when the work earns it.

Exercise solutions

Solution ↑ Exercise

A representative pass for a code-review agent: (1) build vs. buy — configure an existing harness, don’t build (start direct, grounded in ch13); (2) tool set — a small set (read diff, post comment, run tests), consolidating any overlapping search tools (ch14’s subtract-first); (3) MCP — if it reaches an external code host, wire it least-privilege with audience-bound tokens (ch15); (4) output — use structured outputs / strict for the machine-read review payload, stated with its limits (ch17); (5) coordination — a clean-room verifier sub-agent to filter false positives is justified (isolation, ch18), but a full multi-agent topology is not — review subtasks share too much context to fan out and would fail the cost gate (ch19). The weakest-tier exposure is the coordination choice (the multi-agent worth-it question is contested) and any reliance on the ~15× figure; keep it reversible by starting single-agent-plus-verifier and only escalating if a genuinely parallel workload appears — re-checking the field rather than committing to a topology up front.

Solution ↑ Exercise

The shape of a good answer, not a single right one: the design names each decision and its price. Harness: configure, not build — the price of building is standing maintenance as models move, only worth it if no configurable option fits. Tools: the minimal set covering the workflow — each extra tool is paid in definition tokens at rest plus selection risk, so the bar for adding one is “the workflow genuinely needs it.” MCP: only if external capability is required, designed least-privilege. Output: the strongest guarantee the schema allows, retry loop only as fallback. Coordination: stay single-agent unless subtasks are genuinely independent — every agent is another ~15× window, so the bar is real parallelizability, not task size. The most weak-tier-exposed choice is almost always the coordination one (contested) or any quoted cost number (single datapoint); making it reversible means defaulting to the cheaper option (single agent, fewer tools) and escalating only on demonstrated need — which is exactly the volume’s two defaults, subtract and stay single-agent, applied as one discipline.

Part 3 Chapter 21 Last verified 2026-06-14 Fresh

Measuring & Operating Agents: The Discipline

The spine of the Evaluation & Operations volume. Once an agent is built, the discipline shifts from construction to operation — and the first move is to make what counts as good measurable before scaling. The chapter maps the volume's five operational surfaces (eval, observability, cost, oversight, security) and states the volume's evidence-honesty rule up front — that five of the six rest on first-party-authoritative evidence rather than triangulation, with security the one genuine convergence.

Volatility: stable-principle
Tools compared: claude-codecross-tool
Before you start: Vols 1–2 — the agent as Model + Harness, context as a finite budget, and the tools and orchestration the harness coordinates. This volume operates what those volumes built; no new construction is assumed.
You will learn
  • The shift from building an agent to operating one — and why measurement comes first
  • The volume’s five operational surfaces — eval, observability, cost, oversight, security — and the one mental model each owns
  • The evidence-honesty rule this volume runs on: why most of operations is first-party-authoritative rather than triangulated, and where the one genuine convergence lives
  • The map of the volume, and what each chapter owns

Vols 1 and 2 built the agent — the environment it acts in, the context it reasons over, the tools and orchestration its harness coordinates. This volume takes what is left once the thing actually runs: how you know it works, see what it did, pay for it, keep a human over it, and defend it against forged instructions. The thesis of this chapter is that all five are one discipline wearing five faces — and that the discipline begins with measurement, because you cannot operate what you cannot measure.

From building an agent to operating one

The first two volumes were construction. Vol 1 engineered the environment and the context; Vol 2 took the harness’s tools and orchestration. Both answered how do I build this? Once an agent is in production, the questions change shape entirely: Is it actually good? What did it just do? What is it costing? Who approves the irreversible step? Who is really issuing the instruction it just followed?

These are operational questions, and they share a precondition. An agent, in the working definition the series uses, is a system where models “dynamically direct their own processes and tool usage” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original — which is exactly what makes operation hard. The behavior is not fixed by the spec; the agent decides at run time. So you cannot read whether it works off the source the way you read a function’s contract — you have to measure it. Operation is the discipline of measuring a system whose behavior you deliberately did not pin down.

Key idea

You cannot operate what you cannot measure. Every operational surface in this volume is downstream of a measurable target: eval defines what “good” means, observability shows what happened against it, cost prices the run, oversight gates the risky step, and security defends the input. Skip the measurement and the other four become guesswork dressed up as dashboards.

Measure before you scale

If operation begins with measurement, the first move is eval — and the ordering matters more than it looks. The cost of inverting it is concrete: “evals get harder to build the longer you wait.” [Official] Demystifying evals for AI agents · Grace, Hadfield, Olivares and De Jonghe (Anthropic)T1-official original Once an agent already exists, any measurable target you retrofit tends to be shaped around the behavior the agent currently has — survivorship bias baked into the ruler. Define the target first, and let it pull the rest of the system into existence.

That is why this volume opens with two evaluation chapters before any of the operational surfaces: the eval is the target the other surfaces serve. Observability shows runs against that target; cost is the price of hitting it; oversight gates the steps that could miss it expensively; security defends the inputs that could subvert it. None of them means anything until “good” is something you can run, not something you feel.

Concept · Operating an agent

Operating an agent is everything that happens after it is built and running: knowing whether it is good (eval), seeing what it did (observability), pricing what it cost (cost), keeping a human over its irreversible moves (oversight), and defending it against forged instructions (security). Each surface is downstream of a measurable target — which is why the discipline begins with eval, not with the dashboards.

The five operational surfaces

The volume is organized as five surfaces, in the order the work naturally flows — measure → see → spend → oversee → defend:

  • Eval — measure. Define what “good” means and make it runnable: for a single prompt (ch22) and for an agent’s whole trajectory (ch23).
  • Observability — see. The session log is the ground truth; tracing, attribution, and cost-surfacing all derive from it (ch24).
  • Cost — spend. Input context, not output, is the cost driver; four composable levers manage it (ch25).
  • Oversight (human-in-the-loop) — oversee. Keep a human in control of the irreversible or wrong action (ch26).
  • Security — defend. Establish who is really issuing the instruction — prompt injection and the lethal trifecta (ch27).
One discipline, five faces

The surfaces are not five separate subjects bolted together — they are one loop. Eval defines the target; observability shows the run against it; cost prices it; oversight gates it; security defends it; and the failures every surface exposes flow back into the eval suite to harden it (the loop ch28 closes). The unifying move is measurement: each face turns an operational question into something you read off an instrument instead of guessing at.

[Note]

“Measure → see → spend → oversee → defend” is this book’s framing of the five surfaces, laid over the dossiers’ separate treatments — a reading order, not a phrase the primary sources use.

The evidence this volume runs on

One thing must be said before the chapters begin, because it changes how every claim in them should be read. Most of this volume rests on a different evidence base than Vol 2 did.

Vol 2 could often point to several independent voices agreeing — Anthropic, framework vendors, and third-party practitioners converging on the same move. Operations is not like that. Five of the six bodies of evidence behind these chapters are single-vendor or first-party by construction: Anthropic’s evaluation methodology, Claude Code’s observability, cost, and oversight mechanics, and the OpenTelemetry specification. These are authoritative — the vendor and the spec are the definitive sources for how their own systems behave. But authoritative is not the same as triangulated.

So this volume refuses to dress first-party authority up as independent agreement. The eval, observability, cost, and oversight chapters cite official sources and say they are official — <Tag kind="official">, not <Tag kind="convergence">. There is exactly one exception, and it is earned: in security (ch27), the principle that you defend by construction rather than by detection is asserted by multiple independent research groups, and there — and only there — the book tags genuine convergence.

This inversion is itself a finding worth stating up front: operations is the part of the discipline where the evidence is most authoritative and least triangulated at the same time. Naming that lets a reader calibrate every downstream claim by the company it keeps, rather than assuming a uniform standard of proof that does not hold.

What each chapter owns

The chapters move along the five surfaces, eval first.

Eval — defining the target.

  • Evaluating a prompt (ch22) — the four-step loop that tells you a prompt is good and lets you iterate it. Unit of analysis: a prompt.
  • Evaluating an agent (ch23) — harnesses, task suites, and the LLM judge for a trajectory. Unit of analysis: a run.

Operations — running against the target.

  • Observability (ch24) — four surfaces over one session log: tracing, attribution, and cost-surfacing.
  • Cost (ch25) — input context as the cost driver, and the four levers that manage it.
  • Human-in-the-loop (ch26) — the oversight workflow layered on top of Vol-1’s permission model.
  • Security (ch27) — the lethal trifecta as the threat model, and design-by-construction as the defense.

Closing.

  • Operating the whole (ch28) — the five surfaces as one operate-and-improve loop, with the unsolved trade-offs stated honestly.

Each chapter owns a precise slice, and the boundaries are deliberate: ch23 owns the judge’s calibration, while ch22 only uses a judge; ch24 records what ran, while ch23 scores whether it was correct; ch25 models the economics of the numbers ch24 merely surfaces; ch26 is the oversight workflow on top of the permission model Vol 1 already built; and ch27 is the authorized-but-forged instruction, the counterpart to Vol 1’s authorized-but-risky one. Holding those seams keeps each surface a single, measurable idea.

The volume's five operational surfaces as one left-to-right arc: measure (eval) → see (observability) → spend (cost) → oversee (human-in-the-loop) → defend (security). Eval sits first because every later surface is downstream of a measurable target; the dashed return arrow shows the failures each surface exposes flowing back into the eval suite — the operate-and-improve loop ch28 closes.A left-to-right arc of five labeled stages. Stage 1 'Measure — Eval (ch22–23)'; stage 2 'See — Observability (ch24)'; stage 3 'Spend — Cost (ch25)'; stage 4 'Oversee — Human-in-the-loop (ch26)'; stage 5 'Defend — Security (ch27)'. A dashed arrow loops from the right end back to stage 1, labeled 'failures harden the eval suite (ch28)'. A caption strip beneath reads that you cannot operate what you cannot measure, and that every surface is downstream of a measurable target.
The volume's five operational surfaces as one left-to-right arc: measure (eval) → see (observability) → spend (cost) → oversee (human-in-the-loop) → defend (security). Eval sits first because every later surface is downstream of a measurable target; the dashed return arrow shows the failures each surface exposes flowing back into the eval suite — the operate-and-improve loop ch28 closes.
Locating an operational question Worked example

A team says: “Our agent feels worse since last week’s prompt change, and the bill went up. What do we do?”

Locate each part on a surface before acting on any of it:

  • “Feels worse” is an eval question (ch22/ch23). A feeling is not a measurement. Without a suite that scores the old prompt against the new one, “worse” is a vibe — and the first move is to make the regression measurable, not to revert on a hunch.
  • “What did it do differently” is an observability question (ch24). The session logs of the failing runs are the ground truth; read them before theorizing about causes.
  • “The bill went up” is a cost question (ch25) — and probably a context one: a longer prompt or more verbose tool output inflates input tokens, which is the cost driver.
  • Notice what is not here: no irreversible action is waiting on a human gate (ch26), and nothing suggests a forged instruction (ch27) — so those two surfaces stay quiet. Naming a surface also means knowing when it does not apply.

The five surfaces turned a panicked “what do we do?” into four located, instrument-able questions — and the eval one comes first, because until “worse” is measurable, everything after it is guesswork.

Operating on vibes

The most common operational failure is acting on impressions instead of instruments: reverting a prompt because the output “feels worse,” guessing at cost from a single invoice, trusting an agent’s irreversible action because it “seemed fine.” Every surface in this volume exists to replace a vibe with a reading. The discipline is not the dashboards themselves — it is refusing to operate on anything you have not measured, starting with the definition of “good.”

Quick reference

  • The shift: Vols 1–2 build the agent; Vol 3 operates it — eval, observability, cost, oversight, security.
  • The premise: you cannot operate what you cannot measure, so eval comes first; “evals get harder to build the longer you wait.” Demystifying evals for AI agents · Grace, Hadfield, Olivares and De Jonghe (Anthropic)T1-official original
  • The arc: measure (eval) → see (observability) → spend (cost) → oversee (HITL) → defend (security).
  • The evidence rule: five surfaces are first-party-authoritative, not triangulated — stated as such, never dressed as convergence; security is the one genuine convergence.
  • Boundary discipline: each chapter owns a precise slice — ch23 calibrates the judge ch22 only uses; ch24 records what ran, ch23 scores whether it was correct; ch25 prices what ch24 surfaces.
  • The reflex to build: turn every “feels worse” or “seemed fine” into “measured against what?”

Practice

Exercise

The chapter claims the five operational surfaces “are one discipline wearing five faces.” Name the five surfaces in the measure → see → spend → oversee → defend order, and state in one sentence why eval is placed first. Then explain what specifically goes wrong if you invert the order and build the eval after the agent.

Practice ◆◆◇◇

Take an agent you run or have read about. For each of the five surfaces, write one question you cannot currently answer with an instrument — for example, “I have no suite that scores this prompt,” “I can’t see which tool calls actually ran,” “I don’t know what a typical run costs,” “there’s no gate before it touches main,” “I’ve never asked whether a malicious input could redirect it.” Then rank the five gaps by which unanswered question is costing you most right now. The point is to feel that operating an agent is a set of measurable questions — and to notice which instruments you are missing.

Exercise solutions

Solution ↑ Exercise

The five surfaces are eval (measure), observability (see), cost (spend), oversight / human-in-the-loop (oversee), and security (defend). Eval is first because every other surface is downstream of a measurable notion of “good”: observability shows what ran against that target, cost prices it, oversight gates the steps that could miss it expensively, and security defends the inputs that could subvert it — but none of them can tell you whether the agent is actually working without an eval that defines working. Inverting the order — building the eval after the agent — bakes in survivorship bias, because “evals get harder to build the longer you wait”: you end up retrofitting the measurable target around the agent’s current behavior, so the ruler is shaped to pass what the agent already does instead of defining what it must achieve. The eval should pull the agent into existence, not the reverse.

Solution ↑ Exercise

A worked example. Take a documentation-writing agent. Eval (measure): “I have no suite that scores whether a generated doc is accurate and complete — I read a few by hand and trust my impression.” Observability (see): “When a doc comes out wrong I can’t see which sources the agent actually read; the run is opaque after the fact.” Cost (spend): “I don’t know what one doc costs — the monthly bill is a single number I can’t attribute to runs.” Oversight (oversee): “The agent can open a pull request automatically, and I’m not certain there’s a gate before it touches the main branch.” Security (defend): “The agent ingests arbitrary web pages, and I’ve never asked whether a malicious page could redirect it.” Ranking: if the agent is autonomously opening PRs, the oversight gap is the most expensive — an irreversible, wrong action can ship unreviewed — so close that gate first; then the eval gap, so “is it any good?” stops being a hand-wave; cost and observability are the diagnostics you will reach for the moment either of the first two misbehaves; security ranks by how exposed the ingested content is. The exercise’s value is that it turns “operating the agent” from a vague responsibility into five concrete, instrument-shaped gaps — and forces a priority among them rather than a dashboard for each.

Part 3 Chapter 22 Last verified 2026-06-14 Fresh

Evaluating a Prompt: The Four-Step Loop

How you know a prompt is good and iterate it — a four-step loop, not a one-shot check. Define measurable criteria, build a representative test set, iterate with tooling, and grade by reliability-per-effort, with criteria and tests fixed before you touch the prompt. The unit of analysis is a single prompt, and the LLM judge here is merely used — its calibration belongs to the next chapter.

Volatility: feature-surface
Tools compared: claude-codecross-tool
Before you start: Ch21 — measurement comes first, and eval is the target every other operational surface serves. This chapter is the prompt half of that target; ch23 is the agent half.
You will learn
  • The four-step loop that tells you a prompt is good and lets you iterate it — define criteria, build test cases, iterate with tooling, grade outputs
  • Why criteria and tests are preconditions — fixed before you touch the prompt, so improvement is measured rather than asserted
  • How to engineer the eval set for the real tension — real-world fidelity against automatable volume
  • How tool-assisted iteration works — the prompt improver drafts the change, the Console eval tool measures the variant
  • The grading hierarchy ranked by reliability-per-effort — and why the LLM judge here is used, not calibrated (that is ch23’s job)

Ch21 said the discipline begins with eval, because you cannot operate what you cannot measure. This chapter is the first and smallest unit of that measurement: a single prompt. The question “is this prompt good?” has a deceptively simple-looking answer — run it and see — but a one-shot look is exactly the vibe ch21 warned against. The thesis here is that knowing a prompt is good is a loop: you commit to what “good” means, build a way to measure it, iterate, and grade — and then go round again. The unit is a prompt; ch23 will take the same shape up to a whole agent trajectory.

The four-step loop

Anthropic frames building an LLM application as a cycle: first define success criteria, then build evaluations to measure against them — and “This cycle is central to prompt engineering.” [Official] Define success criteria and build evaluations · AnthropicT1-official original That single sentence is the spine of this chapter. Evaluating a prompt is not a checkpoint you pass once; it is a feedback loop you stay inside while the prompt is alive.

The loop has four steps, and they run in order:

  1. Define success criteria — pin down what “good” means for this prompt, measurably.
  2. Build test cases — assemble a representative set of inputs to run the prompt against, favoring automatable volume over a small hand-curated set (the opposite of ch23’s small, expensive trajectory suites).
  3. Iterate with tooling — change the prompt (the prompt improver drafts a candidate) and re-run.
  4. Grade outputs — score each run against the criteria, with a grader matched to the criteria.

Then you loop. The grade tells you whether the last change helped; if not, you iterate again. The shape is identical to how you treat code: a measurable target, a test set, an iteration tool, and a grader — looped until the bar is met. The rest of this chapter takes the four steps in turn, but the order itself carries the first lesson: steps 1 and 2 are not interchangeable with 3 and 4.

Key idea

Evaluating a prompt is a four-step loop — define criteria → build test cases → iterate with tooling → grade outputs — not a one-shot check. The first two steps are preconditions you fix before touching the prompt; the last two are the iteration cycle you run against them. Skip the preconditions and the cycle has nothing to measure against, so “better” collapses back into a feeling.

Criteria and tests are preconditions

The reason steps 1 and 2 come first is that Anthropic’s prompt-engineering overview lists them as prerequisites before prompt engineering begins. The first listed prerequisite is “A clear definition of the success criteria for your use case” [Official] Prompt engineering overview · AnthropicT1-official original , and the second is “Some ways to empirically test against those criteria.” [Official] Prompt engineering overview · AnthropicT1-official original (The third is a first-draft prompt to improve — the thing the loop then iterates.) The criteria and the test set are the entry gate to the loop, not artifacts you produce along the way.

And the criteria have to be measurable. The guidance is explicit: good criteria “Use quantitative metrics or well-defined qualitative scales.” [Official] Define success criteria and build evaluations · AnthropicT1-official original They are typically multidimensional, too — accuracy, output format, latency, and cost are different axes, and a prompt can win on one while losing on another. A criterion you cannot express as a number or a consistently applied scale is not a criterion; it is a hope.

This is the anti-vibes move, and it is the whole reason the loop exists. If you fix “what good is” and “how to measure it” before you start changing the prompt, then every later change is judged against a target that does not move. Improvement becomes something you measure, not something you assert. Invert the order — tweak the prompt first, then decide whether you like the output — and you have no fixed reference, so “it feels better now” is the best you can honestly say. It is the same attribute-first discipline good engineering applies everywhere: name the target, then chase it.

Concept · Success criteria (for a prompt)

The measurable definition of “good” for a single prompt, fixed before iteration begins. Criteria are expressed as quantitative metrics or well-defined qualitative scales — not impressions — and are usually multidimensional (accuracy, format, latency, cost). They are a precondition of the loop, alongside an empirical test set: together they are the fixed reference every later change is measured against, which is what lets “better” mean something other than “feels better.”

Engineering the eval set

The test set is the second precondition, and it is engineered, not merely collected. Two design principles do most of the work, and they pull against each other.

The first is fidelity: be task-specific. “Design evals that mirror your real-world task distribution” [Official] Define success criteria and build evaluations · AnthropicT1-official original — the set should look like the inputs the prompt will actually see in production, weighted the way they actually occur. And it must deliberately include edge cases: irrelevant or nonexistent input, overly long input, harmful input, ambiguous cases. A test set that only contains the happy path tells you nothing about the inputs that break things.

The second is throughput, and it is where most teams flinch: “More questions with slightly lower signal automated grading is better than fewer questions with high-quality human hand-graded evals.” [Official] Define success criteria and build evaluations · AnthropicT1-official original A large set you can grade automatically beats a small set you grade by hand — even though each automated grade is individually noisier. Volume buys statistical signal; hand-grading caps volume at whatever a human can sustain. So you structure the questions to be machine-gradable where possible (multiple-choice, string match, code-graded, or LLM-graded), and you prioritize covering the distribution over polishing a handful of items.

The tension between these two — real-world fidelity against automatable volume — is the central design problem of the eval set. You want it representative and large enough to grade cheaply at scale, and those goals trade against each other at the margin. Resolving that tension is exactly why the grading step matters so much, which is where the loop is heading.

Volume over polish, automation over hand-grading

The instinct to hand-curate a small, pristine eval set is the wrong default. A representative set graded mostly by machine beats a tiny hand-graded one, because the noise of automated grading averages out over volume while the ceiling on hand-grading does not move. This is why “engineer the eval set” and “choose the grader” are the same decision viewed twice: you design the test cases to be automatically gradable so you can afford enough of them to mirror the real distribution. Fidelity and volume both depend on grading being cheap.

Tool-assisted iteration

With the preconditions fixed, the loop’s third step is iteration — and Anthropic ships two complementary tools for it, one to draft the change and one to measure it.

The drafting tool is the prompt improver, which “helps you quickly iterate and improve your prompts through automated analysis and enhancement.” [Official] Console prompting tools · AnthropicT1-official original It proposes the next version of the prompt — it is reported to excel at making prompts more robust for complex, high-accuracy tasks, enhancing a prompt in steps (identifying examples, drafting, chain-of-thought refinement, example enhancement). Its companion on the same page, the prompt generator, drafts a first prompt from a task description. The improver is the generate half of iteration: it gives you a candidate to test.

The measuring tool is the Console Evaluation tool, which closes the loop on variants. Its prompt-versioning affordance lets you “Create new versions of your prompt and re-run the test suite to quickly iterate and improve results” [Official] Using the Evaluation Tool · AnthropicT1-official original , and it offers side-by-side comparison as the A/B mechanism — you put two or more prompt versions next to each other on the same test cases and read which one scores better. That is the measure half: the eval set plus versioning decides whether the candidate is actually an improvement.

So iteration is generate-then-measure. A tool drafts the candidate; the test set and the version comparison decide whether it earns the change. Neither half is optional — a draft you do not measure is a guess, and a measurement with no candidate to score is idle.

[Caveat]

The prompt improver and the Console Evaluation tool are product surfaces that ship per release; the exact tool names, the versioning UI, and the side-by-side affordance can change. Verify the current Console UI before relying on a specific button — recheck after 2026-08-25.

Grading by reliability-per-effort

The fourth step is grading: turning each run into a score against the criteria. The guidance for which grader to use is a single optimization — when deciding how to grade, “choose the fastest, most reliable, most scalable method.” [Official] Define success criteria and build evaluations · AnthropicT1-official original The score itself is “A score, generated by one of the grading methods discussed below” [Official] Building evals · Anthropic (2024)T1-official original — one part of an eval’s four-part anatomy (input prompt, output, golden answer, score), produced by comparing the output to the golden answer.

The methods rank by reliability-per-effort:

  • Code-based grading comes first. It “is by far the best grading method if you can design an eval that allows for it” [Official] Building evals · Anthropic (2024)T1-official original — exact match, string-contains, or a regex over the output — because it is fast and highly reliable. If the criterion can be checked by code, check it by code; nothing else is cheaper or more dependable.
  • Human grading comes next, for quality that code cannot capture. In the Console, this is concrete: quality grading lets you “Grade response quality on a 5-point scale to track improvements in response quality per prompt.” [Official] Using the Evaluation Tool · AnthropicT1-official original It is reliable but does not scale — a human caps the volume.
  • LLM-based grading comes last, for judgement at scale. Its profile is “Fast and flexible, scalable and suitable for complex judgement. Test to ensure reliability first then scale.” [Official] Define success criteria and build evaluations · AnthropicT1-official original It can grade nuanced quality that code cannot express, across far more items than a human can — once you trust it.

That final clause — “Test to ensure reliability first then scale” — is the seam between this chapter and the next, and it is worth being precise about what it does and does not say. Here, the LLM judge is used: you pick it because the criterion needs judgement, you sanity-check that it agrees with you on a sample, and then you scale it across the set. That is the prompt-grading use of a judge. It is emphatically not a calibration project. Measuring the judge as an instrument — its agreement rate against human graders, its biases, the error bars on its scores — is a different and heavier discipline. This chapter only borrows the judge; ch23 calibrates it.

[Note]

“Reliability-per-effort” is this book’s framing of the source’s grader ranking (code-based, then human, then LLM-based) — a lens for reading why the order is what it is, not a phrase the primary uses.

The four-step prompt-evaluation loop. Two fixed preconditions — (1) define criteria (measurable: metrics or scales) and (2) build test cases (mirror the real distribution) — gate entry into a two-step iterate/grade cycle: (3) iterate with tooling (the prompt improver drafts; the Console eval measures) produces a variant, (4) grade outputs by reliability-per-effort returns a measured result, and the cycle repeats. A dashed return arrow carries findings back to revise the criteria and tests. A caption strip reads that criteria and tests are fixed before you touch the prompt, so improvement is measured, not asserted.A diagram in two zones. On the left, two blue boxes stacked and joined by a brace labeled 'fixed preconditions': box 1 'Define criteria — measurable: metrics or scales' and box 2 'Build test cases — mirror the real distribution'. Arrows labeled 'gate' lead right into a two-step cycle of teal boxes: box 3 'Iterate with tooling — prompt improver drafts; Console eval measures' and box 4 'Grade outputs — by reliability-per-effort'. An arrow labeled 'variant' runs from box 3 down to box 4, and an arrow labeled 'measured' loops box 4 back to box 3. A dashed orange arrow runs from box 3 back to box 1, labeled 'findings revise criteria + tests'. A caption strip beneath reads that criteria and tests are fixed before you touch the prompt, so improvement is measured, not asserted.
The four-step prompt-evaluation loop. Two fixed preconditions — (1) define criteria (measurable: metrics or scales) and (2) build test cases (mirror the real distribution) — gate entry into a two-step iterate/grade cycle: (3) iterate with tooling (the prompt improver drafts; the Console eval measures) produces a variant, (4) grade outputs by reliability-per-effort returns a measured result, and the cycle repeats. A dashed return arrow carries findings back to revise the criteria and tests. A caption strip reads that criteria and tests are fixed before you touch the prompt, so improvement is measured, not asserted.
Iterating a support-ticket classifier prompt Worked example

A team has a prompt that classifies an incoming support ticket into one of eight categories. Someone says: “I rewrote the prompt and the outputs look sharper — ship it?” Run the four-step loop instead of trusting “looks sharper.”

  • Define criteria (precondition). “Good” here is mostly measurable in numbers: top-1 accuracy against the correct category, plus an output-format constraint (the answer must be exactly one of the eight category strings, nothing else). Both are measurable — a metric and a well-defined constraint — so the criteria are real, not a vibe.
  • Build test cases (precondition). Assemble several hundred real tickets, weighted the way they actually arrive (mostly billing and login, rarely the obscure categories), and deliberately seed edge cases: empty tickets, a 5,000-word rant, a ticket in the wrong language, an ambiguous one that could be two categories. The set mirrors the real distribution and stresses the corners. It is large because the next step lets it be graded automatically.
  • Iterate with tooling. Keep the old prompt as version A and the rewrite as version B. Run both over the same test set and compare side by side; if neither clearly wins, let the prompt improver draft a version C and add it to the comparison.
  • Grade by reliability-per-effort. Top-1 accuracy and the format constraint are perfect for code-based grading — exact string match against the golden category and a membership check — fast and highly reliable, no judgement needed. Only if you later add a fuzzy criterion (“did it pick a reasonable category for a genuinely ambiguous ticket?”) do you reach for an LLM judge, and then you sanity-check that judge on a sample before trusting it across the set.

The result is a number: version B is two points more accurate but violates the format constraint on long inputs three percent of the time, while version C is even on accuracy and clean on format. “Looks sharper” never enters it. Notice that the grader was decided by the criteria — a code check, because the criteria were code-checkable — which is the reliability-per-effort rule doing its job.

Iterating before the criteria and tests exist

The most common way to break this loop is to start at step 3 — change the prompt, look at a few outputs, and decide by eye whether it improved. Without fixed criteria and a representative test set, there is no reference, so “better” is just the last output you happened to like, and you are tuning toward a moving target shaped by whatever you looked at most recently. The discipline is non-negotiable order: criteria and tests before the first edit. A second, subtler failure is over-claiming the LLM judge — treating its score as ground truth before sanity-checking it. Here the judge is only used; its calibration is a separate job (ch23). Trusting an unchecked judge is just vibes wearing a number.

Quick reference

  • The loop: define criteria → build test cases → iterate with tooling → grade outputs — then repeat. “This cycle is central to prompt engineering.” Define success criteria and build evaluations · AnthropicT1-official original
  • Preconditions: criteria and tests are fixed before iterating — they are prerequisites, not by-products. Prompt engineering overview · AnthropicT1-official original
  • Measurable criteria: “Use quantitative metrics or well-defined qualitative scales” Define success criteria and build evaluations · AnthropicT1-official original — multidimensional (accuracy, format, latency, cost).
  • Eval-set tension: mirror the real distribution Define success criteria and build evaluations · AnthropicT1-official original and prioritize automatable volume — “More questions with slightly lower signal automated grading is better than fewer questions with high-quality human hand-graded evals.” Define success criteria and build evaluations · AnthropicT1-official original
  • Iterate = generate-then-measure: the prompt improver drafts the candidate Console prompting tools · AnthropicT1-official original ; the Console eval tool versions and compares it. Using the Evaluation Tool · AnthropicT1-official original (Console UI is volatile — recheck after 2026-08-25.)
  • Grading hierarchy: code-based first (the “best grading method if you can design an eval that allows for it” Building evals · Anthropic (2024)T1-official original ), then human (5-point scale Using the Evaluation Tool · AnthropicT1-official original ), then LLM-based (“Test to ensure reliability first then scale” Define success criteria and build evaluations · AnthropicT1-official original ).
  • The ch23 seam: the LLM judge here is used, not calibrated — calibrating the judge as an instrument is the next chapter.
  • Unit of analysis: a prompt. Switch to ch23 the moment the thing under test is an agent’s behavior over a task suite.

Practice

Exercise

The chapter insists the four steps run in a fixed order — criteria and tests before iteration and grading. Name the four steps in order, state which two are the “preconditions,” and explain in one or two sentences what specifically goes wrong if you start at step 3 (iterate) before steps 1 and 2 exist.

Exercise

The chapter says the LLM judge in this chapter is “used, not calibrated,” and that calibration belongs to ch23. In your own words, distinguish using an LLM judge to grade a prompt’s outputs from calibrating it as an instrument. Why is it dishonest to report an LLM judge’s score as ground truth on the strength of this chapter alone?

Practice ◆◆◇◇

Take a prompt you run or have read about — a classifier, a summarizer, an extraction prompt, anything. Write down (a) two measurable success criteria for it (a metric or a well-defined scale, not “good output”); (b) what its test set would need to look like to mirror the real input distribution, plus two edge cases you would deliberately include; and (c) for each of your two criteria, which grading method fits — code-based, human, or LLM-based — and why. The point is to feel the reliability-per-effort rule fall out of the criteria: a code-checkable criterion wants a code grader, a judgement criterion wants a sanity-checked LLM grader.

Exercise solutions

Solution ↑ Exercise

The four steps in order are (1) define success criteria → (2) build test cases → (3) iterate with tooling → (4) grade outputs, then loop. The two preconditions are steps 1 and 2 — they are fixed before you touch the prompt, because Anthropic’s prompt-engineering overview lists “a clear definition of the success criteria” and “some ways to empirically test against those criteria” as prerequisites before prompt engineering begins. If you start at step 3 (iterate) before 1 and 2 exist, you have no fixed reference against which to judge the change, so “better” reduces to whichever output you happened to like most recently — you are tuning toward a moving target shaped by what you looked at, which is exactly the vibes-driven failure the loop is designed to prevent. Improvement can only be measured once the target and the measurement are pinned down first.

Solution ↑ Exercise

Using an LLM judge means treating it as a convenient grader for this prompt: you pick it because the criterion needs judgement code cannot express, you sanity-check that it agrees with you on a sample of outputs, and then you scale it across the test set — the source’s “Test to ensure reliability first then scale.” Calibrating it means treating the judge itself as the object of measurement: quantifying its agreement rate against human graders, characterizing its biases, and putting error bars on its scores so you know how much to trust the number it produces. This chapter only does the former; it never establishes how reliable the judge actually is, only that you should sanity-check it before scaling. Reporting the judge’s score as ground truth on that basis is dishonest because the sanity check confirms the judge is plausible, not that it is accurate — without the calibration work (ch23’s job), an unchecked judge’s number is a vibe dressed up as a measurement, which is precisely what the discipline forbids.

Solution ↑ Exercise

A worked example. Take a meeting-notes summarizer prompt. (a) Two measurable criteria. (1) Action-item recall: the fraction of action items present in the transcript that appear in the summary — a metric, gradable as a number against a golden list. (2) Format conformance: the summary must contain exactly the three required sections (Decisions, Action items, Open questions) with no others — a well-defined constraint. Neither is “good summary”; both are checkable. (b) Test set. Several dozen real transcripts weighted the way meetings actually occur (mostly short stand-ups, occasionally a long planning session), with golden summaries written once by hand. Two deliberate edge cases: a transcript with no action items at all (does the summary correctly produce an empty Action-items section rather than inventing one?), and an extremely long, rambling transcript (does recall collapse when the input is huge?). (c) Grader per criterion. Format conformance is pure code-based grading — a structural check for exactly the three section headers — fast and highly reliable, no judgement. Action-item recall is trickier: matching a summarized action to a transcript action involves paraphrase, so a strict string match under-counts; this is the LLM-based case — have a model judge whether each golden action item is covered, after you sanity-check the judge against your own labels on a sample. The reliability-per-effort rule falls straight out of the criteria: the structural criterion got a code grader because it was code-checkable, and the semantic criterion got a sanity-checked LLM grader because it needed judgement — and you would only trust that judge’s aggregate number after the calibration work the next chapter covers.

Part 3 Chapter 23 Last verified 2026-06-14 Fresh

Evaluating an Agent: Harnesses, Suites & the Judge

Evaluating a whole agent rather than a single prompt — the unit of analysis is a trajectory, a run. The chapter builds the eval before the harness, keeps the task suite small and failure-derived, reads every result as a measurement with uncertainty rather than a point score, and treats the LLM judge as a calibrated instrument with known error rather than an oracle.

Volatility: stable-principle
Tools compared: claude-codecross-tool
Before you start: ch21's premise — you cannot operate what you cannot measure, and eval comes first. ch22 evaluated a single prompt; this chapter evaluates a whole agent. No statistics background assumed beyond the idea that a measurement can be noisy.
You will learn
  • Why evals come before harnesses — the ordering is the discipline, and inverting it bakes in survivorship bias
  • What a good task suite looks like — small, discriminating, drawn from real failures — and why the grader is half the design
  • Why an eval result is a measurement with uncertainty, not a point score — resample, put error bars on it, test whether a difference is real
  • How to treat the LLM judge as a calibrated instrument with known error, not an oracle — and where the eval ends and the harness begins

ch22 scored a single prompt; this chapter scores a whole agent. The unit of analysis changes from a prompt to a trajectory — one complete run, with its tool calls, its detours, and its final state. Evaluating a trajectory is harder than grading a prompt’s output, because the thing under measurement decides its own steps. The thesis of this chapter is that you tame that with discipline in a fixed order: define what “good” means and make it runnable first, keep the suite small and drawn from real failures, read every number as a measurement that carries uncertainty, and treat the judge as an instrument you have calibrated — never as an oracle.

Evals before harnesses: the ordering is the discipline

The chapter’s title lists three things — harnesses, suites, the judge — but the first lesson is about none of them. It is about sequence. An evaluation harness “is the infrastructure that runs evals end-to-end” [Official] Demystifying evals for AI agents · Grace, Hadfield, Olivares and De Jonghe (Anthropic)T1-official original — it provides the instructions and tools, runs tasks, records steps, grades outputs, and aggregates results. That definition contains the ordering: the harness runs evals, so the eval is the target and the harness is built toward it. Reverse the two and you have built a beautiful runner with nothing well-defined to run.

The actionable form of the principle is eval-driven development: “build evals to define planned capabilities before agents can fulfill them.” [Official] Demystifying evals for AI agents · Grace, Hadfield, Olivares and De Jonghe (Anthropic)T1-official original Write the measurable target for a capability the agent does not yet have, and let it pull the agent into existence. The cost of doing it the other way is concrete and stated plainly: “evals get harder to build the longer you wait.” [Official] Demystifying evals for AI agents · Grace, Hadfield, Olivares and De Jonghe (Anthropic)T1-official original The reason is survivorship bias. Once the agent already runs, any target you retrofit is shaped — consciously or not — around the behavior the agent currently produces. The ruler ends up calibrated to pass what already happens, instead of defining what must happen. Build the eval first and the ruler is honest.

Key idea

The eval is the target; the harness is built toward it, never the reverse. Define what “good” means and make it runnable before the agent can satisfy it — because evals get harder to build the longer you wait, and an eval retrofitted to a running agent measures what the agent already does instead of what it should do.

This is also where the unit of analysis shifts. ch22 measured a prompt — one input, one output, graded. Here the unit is a trajectory: the full run an agent takes from a task to a final state, including the tool calls it chose and the order it chose them in. A trajectory can reach the right answer by a wrong path, or a defensible path to a wrong answer, and a serious agent eval has to be able to say which. That is the harder measurement, and it is why the rest of this chapter is about keeping it disciplined.

A good suite is small, discriminating, and failure-derived — and the grader is half the design

The instinct when building an eval suite is to chase coverage — hundreds of tasks spanning everything the agent might meet. That instinct is wrong, and the corrective is specific: “20–50 simple tasks drawn from real failures is a great start.” [Official] Demystifying evals for AI agents · Grace, Hadfield, Olivares and De Jonghe (Anthropic)T1-official original Two words in that sentence carry the weight. Real failures — the tasks come from behavior you have actually observed going wrong, not from imagined coverage; each task earns its place by having caught a bug. Simple — a small, discriminating suite that separates good runs from bad ones beats a large redundant one that mostly re-tests what already passes.

The quality bar for an individual task is inter-rater reproducibility. A well-posed task is one where “two domain experts would independently reach the same pass/fail verdict.” [Official] Demystifying evals for AI agents · Grace, Hadfield, Olivares and De Jonghe (Anthropic)T1-official original This is the test for whether a task is discriminating or merely vague. If two experts who both understand the domain disagree on whether a run passed, the task is underspecified — the fault is in the task, not the model. Tightening it until the verdict is unambiguous is most of the work of suite design, and it is what makes the resulting number trustworthy rather than just available.

The other half of the design is the grader — and it is genuinely half, not an afterthought. “An essential component of effective evaluation design is to choose the right graders for the job.” [Official] Demystifying evals for AI agents · Grace, Hadfield, Olivares and De Jonghe (Anthropic)T1-official original A task is the case; a grader is how that case is scored; and the two are separate design decisions. Checkable outcomes — did the function compile, did the test pass, does the JSON parse — call for a programmatic grader, which is deterministic and free. Open-ended outcomes — is this summary faithful, is this explanation clear — call for a model-based grader, an LLM judge (the subject of the section after next). Some call for a human. Picking the wrong grader is how a suite produces numbers nobody should trust: a programmatic exact-match grader on an open-ended task fails correct answers for trivial wording differences, and a judge on a task with a checkable answer adds cost and noise where a string comparison would have been exact.

Concept · Task, grader, and harness

Three layers a serious agent eval keeps distinct. The task is the case — a specific scenario, ideally drawn from a real failure, posed so two domain experts would reach the same pass/fail verdict. The grader is how that one case is scored — programmatic for checkable outcomes, a model-based judge for open-ended ones, a human where neither reaches. The harness is the runner — the infrastructure that runs the tasks end-to-end, records each trajectory, applies the graders, and aggregates the results. Confusing the three is how teams build a runner with no well-posed tasks, or score good tasks with the wrong grader.

Results are measurements with uncertainty, not point scores

A single run of an agent is a noisy sample, not a fact. The agent is stochastic; rerun the same task and you may get a different trajectory and a different verdict. So an eval result is a measurement with uncertainty, and reading it as a bare point score is the most common statistical error in the whole discipline. The corrective is a three-move loop, and Anthropic’s statistical guidance states each move.

Resample. Do not run each task once. For evals that use chain-of-thought reasoning, the recommendation is to “resampl[e] answers from the same model several times, and using the question-level averages as the question scores fed into the Central Limit Theorem.” [Official] A statistical approach to model evaluations · Anthropic (2024)T1-official original Run each task several times, average per task, and the averages behave well enough statistically to reason about. Report error bars. When you compare two agents, report “mean differences, standard errors, confidence intervals, and correlations” [Official] A statistical approach to model evaluations · Anthropic (2024)T1-official original — not bare percentages. Test whether the difference is real. Before believing “B beats A,” ask the question the guidance poses directly: “could a measured difference between two models be due to the specific choice of questions in the eval, and randomness in the models’ answers?” [Official] A statistical approach to model evaluations · Anthropic (2024)T1-official original If a two-point gap sits inside the noise floor, it is not a result — it is a coin flip you mistook for a finding.

None of this is exotic, and it is not hand-built either: a production eval framework exposes the resampling step as a first-class knob. Inspect’s --epochs option is “Number of times to repeat each sample (defaults to 1)” [Official] Inspect — Options · UK AI Security InstituteT1-official original — set it above one and the framework runs each task that many times so you can average and quantify. The mechanism is right there in the runner; the discipline is choosing to use it instead of trusting a single pass.

[Warning]

The default is the trap. Inspect’s epochs defaults to 1 — one run per task — so the out-of-the-box behavior produces exactly the bare point scores this section warns against. Resampling is a deliberate choice you have to make, not a thing the runner does for you.

A score without an error bar is not a result

“Agent B scored 84%, Agent A scored 82%” is not a finding — it is two numbers with no stated uncertainty. Until you know the noise floor, you cannot tell whether B is better or whether you reran a coin flip and it landed differently. The discipline is to treat every eval number as a value with a confidence interval, and to refuse to act on a difference you have not shown is larger than the noise. Resample to estimate the noise; report the interval; test the gap. A point score that hides its own uncertainty is worse than no score, because it invites a confident decision the data does not support.

The LLM judge is a calibrated instrument, not an oracle

When the outcome is open-ended — faithfulness, helpfulness, tone — no programmatic grader reaches it, and the grader has to be a model: an LLM judge. The encouraging evidence is real and worth stating precisely. A peer-reviewed study of LLM-as-a-judge found that “strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans.” Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena · Zheng et al.T3-practitioner original Read that figure for exactly what it is. It is a measured result from one study — GPT-4 on the MT-Bench and Chatbot Arena benchmarks — not an Anthropic-stated guarantee and not a universal law about every judge on every task. It says an LLM judge can be good enough to use. It does not say the judge is right.

[Caveat]

The “>80% agreement” figure is Zheng et al.’s own result for GPT-4 on MT-Bench and Chatbot Arena — a peer-reviewed academic finding, not an Anthropic-official statement and not a property of every judge. Cited here as evidence that a judge can be useful, never as a guarantee that one is correct.

The framing that follows is the whole point of the section: 80% agreement describes an instrument with known error, not an oracle. An instrument you trust blindly is a liability; an instrument whose error you have measured is a tool. And that is precisely why the judge must be wrapped in the statistical discipline of the previous section. The judge is itself a stochastic measurement, so its verdicts get the same treatment as any other noisy reading: resample them across epochs and report confidence intervals on the judge’s pass-rate, not a single judged number. [Official] A statistical approach to model evaluations · Anthropic (2024)T1-official original But resampling only quantifies the judge’s consistency — how stable its verdicts are — not whether they are correct; accuracy is a separate question, and only calibration against ground truth answers it. So the practical obligation is to calibrate: score a sample of trajectories with both the judge and human labels, measure the judge’s agreement rate on your task, and report it alongside the judge’s verdicts — so a reader can discount the score by the instrument’s known error rather than assuming it is truth.

This is also the chapter that owns the judge’s calibration. ch22 used a judge to grade a prompt; it did not have to ask how reliable the judge was. Here the judge is the instrument under examination — its agreement rate is the thing you measure and report — which is why the calibration discipline lives in this chapter and not the last one.

The eval-first ordering and the measurement-with-uncertainty discipline, left to right. Stage 1 'Define the target — the eval (what good means)'; stage 2 'Build the harness — toward the target (the runner)', with a dashed back-arrow labeled 'built toward, not the reverse' running from the harness to the target; stage 3 'Small task suite — 20–50 tasks from real failures'; stage 4 'Resample — multiple epochs per task'; stage 5 'Report with error bar — value ± CI, not a point score', drawn with a value-and-error-bar glyph. A caption strip beneath reads that a score without an error bar is not a result, and that the unit measured is a trajectory (a run).A left-to-right flow of five stages. Stage 1 'Define the target — the eval (what good means)'; stage 2 'Build the harness — toward the target (the runner)'; stage 3 'Small task suite — 20–50 tasks from real failures'; stage 4 'Resample — multiple epochs per task'; stage 5 'Report with error bar — value ± CI, not a point score'. A dashed orange back-arrow loops from stage 2 to stage 1 labeled 'built toward, not the reverse'. Below stage 5 is a small glyph of a dot with a vertical error bar through it. A dashed caption strip beneath reads that a score without an error bar is not a result, and that the unit measured is a trajectory, a run.
The eval-first ordering and the measurement-with-uncertainty discipline, left to right. Stage 1 'Define the target — the eval (what good means)'; stage 2 'Build the harness — toward the target (the runner)', with a dashed back-arrow labeled 'built toward, not the reverse' running from the harness to the target; stage 3 'Small task suite — 20–50 tasks from real failures'; stage 4 'Resample — multiple epochs per task'; stage 5 'Report with error bar — value ± CI, not a point score', drawn with a value-and-error-bar glyph. A caption strip beneath reads that a score without an error bar is not a result, and that the unit measured is a trajectory (a run).

The eval/harness boundary

It is worth holding the seam between the two words in the title, because conflating them is a real source of confusion. The harness is the runner — it runs tasks, records trajectories, applies graders, aggregates results. [Official] Demystifying evals for AI agents · Grace, Hadfield, Olivares and De Jonghe (Anthropic)T1-official original The eval is the measurement — the tasks, the graders, the statistical reading. The harness is plumbing you can reuse across projects; the eval is the judgment about what “good” means for this agent, and it cannot be reused, because it is specific to the capability you are building. When a team says “our evals are weak,” the fix is almost never a better runner — it is better-posed tasks, the right graders, and error bars on the numbers. Build the eval first, and the harness is the easy part.

A note on how strong this guidance is, so you can calibrate it. The spine here is Anthropic’s first-party evaluation methodology — authoritative for how Anthropic recommends evaluating agents, and tagged as official because that is what it is. But authoritative is not the same as independently triangulated: the eval-first ordering and the small-suite heuristic are first-party guidance, not yet corroborated by independent practitioner or academic studies of the methodology itself. This book does not dress that up as agreement-across-sources. The one genuinely external result in the chapter — the judge’s >80% human agreement — is a peer-reviewed academic finding, cited as such, and never laundered into an official endorsement. Naming the difference lets you weight each claim by the evidence behind it.

Reading an agent eval result honestly Worked example

A team reports: “We swapped the agent’s model. On our 30-task suite, the new model scored 87% and the old one 84%. Ship the new one?”

Walk the discipline before answering:

  • Is the suite the right shape? Thirty tasks is in the 20–50 range, and the question to ask is whether they are failure-derived and discriminating — would two domain experts agree on every pass/fail verdict? If some tasks are vague, those points are noise before any statistics enter.
  • How many runs per task? If each task ran once, 87% and 84% are two single noisy samples — possibly the runner’s epochs default of 1. There is no error bar, so there is no result yet. Resample: run each task several times, average per task.
  • Is the three-point gap real? With error bars in hand, ask the load-bearing question — could the gap be due to the specific tasks chosen and the randomness in the models’ answers? If the confidence intervals overlap heavily, “87 beats 84” is inside the noise floor and the honest answer is “we cannot tell yet,” not “ship it.”
  • Were any tasks judge-graded? If open-ended tasks used an LLM judge, the judge is itself a noisy instrument. Has its agreement with human labels been measured on these tasks? An unbenchmarked judge’s verdicts carry an unknown error that propagates straight into the 87/84 comparison.

The disciplined answer is not yes or no — it is “that is not a result yet.” Resample, put intervals on both numbers, test the gap, and confirm any judge is calibrated. Only then does “ship it” become a decision the data can support, rather than a coin flip dressed as a finding.

Trusting the judge as an oracle

The seductive failure in agent eval is treating the LLM judge’s pass/fail as ground truth. The judge agrees with humans most of the time — over 80% in one study — which means it is wrong a meaningful fraction of the time, and you do not know which fraction unless you have measured it on your tasks. Trusting it blindly imports that unmeasured error straight into every decision you make from the score. The discipline is the opposite: calibrate the judge against human labels, report its agreement rate, resample its verdicts, and read its output as an instrument with known error — never as an oracle.

Quick reference

  • Unit of analysis: a trajectory — one full run, not a single prompt (that was ch22).
  • Ordering is the discipline: build the eval first, the harness toward it; “evals get harder to build the longer you wait.” Demystifying evals for AI agents · Grace, Hadfield, Olivares and De Jonghe (Anthropic)T1-official original
  • Suite shape: small, discriminating, failure-derived — “20–50 simple tasks drawn from real failures”; a good task is one two experts would score the same. Demystifying evals for AI agents · Grace, Hadfield, Olivares and De Jonghe (Anthropic)T1-official original
  • The grader is half the design: programmatic for checkable outcomes, a model judge for open-ended ones — choosing the right grader is essential, not an afterthought. Demystifying evals for AI agents · Grace, Hadfield, Olivares and De Jonghe (Anthropic)T1-official original
  • Results carry uncertainty: resample (Inspect --epochs), report confidence intervals, test significance — a score without an error bar is not a result. A statistical approach to model evaluations · Anthropic (2024)T1-official original Inspect — Options · UK AI Security InstituteT1-official original
  • The judge is a calibrated instrument: over 80% human agreement is known error, not an oracle — calibrate it, report its agreement rate, wrap it in the statistics. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena · Zheng et al.T3-practitioner original

Practice

Exercise

The chapter insists you “build evals before harnesses” and that the unit of analysis is a trajectory. In one sentence each: (a) state what the harness is and why the eval has to come first, and (b) explain what specifically goes wrong if you build the eval after the agent already runs. Then say in one sentence how a trajectory differs from the prompt that ch22 evaluated.

Exercise

A colleague says: “Agent B got 84% on our suite and Agent A got 82%, so B is better — let’s switch.” Using only this chapter, give the two questions you must answer before accepting “B is better,” and explain why “84 beats 82” is not yet a result.

Practice ◆◆◇◇

Take an agent you run or have read about. Sketch a 5-task eval suite for it where every task is drawn from a real (or plausible) failure you have seen. For each task, write the one-sentence scenario and decide whether its grader should be programmatic (a checkable outcome) or a model judge (an open-ended outcome), and justify the choice. Then, for any task you assigned to a judge, state how you would calibrate that judge — what you would measure and report before trusting its verdicts. The point is to feel the three layers (task, grader, harness) come apart, and to see that the grader choice is a real design decision, not a default.

Exercise solutions

Solution ↑ Exercise

(a) The harness is the runner — the infrastructure that runs evals end-to-end, recording each trajectory, applying the graders, and aggregating results — and the eval has to come first because the harness runs the eval, so the eval is the target the harness is built toward, not the other way around. (b) Building the eval after the agent runs bakes in survivorship bias: because “evals get harder to build the longer you wait,” any target you retrofit is shaped around the behavior the agent currently produces, so the ruler is calibrated to pass what already happens instead of defining what should happen. A trajectory differs from ch22’s prompt in that it is a whole run — the sequence of tool calls and steps the agent chose on the way to a final state — so it can reach the right answer by a wrong path (or a defensible path to a wrong answer), which a single prompt’s input-output pair cannot express.

Solution ↑ Exercise

The two questions are: (1) How many runs produced each number? If each task ran once, 84% and 82% are two single noisy samples with no error bars — quite possibly the runner’s default of one epoch per task — so the first move is to resample each task several times and average per task. (2) Is the gap larger than the noise floor? With confidence intervals in hand, ask whether the two-point difference could be due to the specific tasks chosen and the randomness in the models’ answers; if the intervals overlap, the gap is inside the noise. “84 beats 82” is not yet a result because a single run of a stochastic agent is a sample, not a fact — until you have resampled, reported intervals, and shown the gap exceeds the uncertainty, switching agents on a two-point difference is acting on a coin flip you have mistaken for a finding. (If any tasks were judge-graded, a third question follows: has the judge’s agreement with human labels been measured on these tasks, since its unmeasured error propagates into the comparison too.)

Solution ↑ Exercise

A worked example. Take a documentation-writing agent that turns a code module into a reference page. Five failure-derived tasks. (1) “Module exports a function the agent omitted from the docs last time” — grader: programmatic, assert every exported symbol appears in the output (checkable). (2) “A function whose signature changed; the agent documented the old signature” — grader: programmatic, diff documented signatures against the source (checkable). (3) “Code block in the generated doc didn’t compile” — grader: programmatic, extract and compile every code block (checkable). (4) “The overview paragraph was technically correct but unreadable” — grader: model judge, score clarity for a target reader (open-ended, no string match reaches it). (5) “The doc described behavior the code doesn’t have — a hallucinated guarantee” — grader: model judge for faithfulness against the source, since “is this claim supported by the code?” is a judgment, not an exact match. Calibrating the judges (tasks 4 and 5): before trusting either judge, hand-label a sample of, say, 30 generated docs as clear/unclear and faithful/unfaithful, run the judge on the same sample, and compute its agreement rate with my labels; I report that agreement rate alongside the judge’s verdicts and resample the judge across several epochs so its pass-rate carries a confidence interval — so a reader discounts the score by the judge’s known error rather than treating it as truth. The exercise’s value is that it forces the task/grader split into the open: three tasks have checkable outcomes a programmatic grader nails for free, two are genuinely open-ended and need a calibrated judge — and choosing wrong (a judge on task 1, or exact-match on task 4) would produce numbers nobody should trust.

Part 3 Chapter 24 Last verified 2026-06-14 Fresh

Observability: Seeing What the Agent Did

Observability is four instrumentation surfaces stacked on one ground truth — the session-log transcript. Logging persists it, OpenTelemetry GenAI conventions trace it, attribution ties a diff back to it, and cost-surfacing shows the price. The chapter holds two boundaries — attribution is a provenance hook not an approval gate, and surfacing a cost number is not modeling the economics.

Volatility: feature-surface
Tools compared: claude-codecross-tool
Before you start: ch21's frame — observability is the 'see' surface, downstream of an eval target. ch23 scores whether a run was correct; this chapter only records what ran. Vol 1's permission model is assumed; the oversight workflow on top of it is ch26.
You will learn
  • Why observability is four surfaces over one ground truth — the session-log transcript — and why everything else derives from it
  • The two retention stories in logging — the local 30-day sweep versus the SDK’s external mirror — and why conflating them loses run history
  • Why you instrument to the OpenTelemetry GenAI convention, not a vendor schema — and why those names are still moving
  • The two boundaries that keep the surface honest: attribution is a provenance hook, not an approval gate, and surfacing a cost number is not modeling the economics

ch21 placed observability as the see surface: once an eval defines what “good” means, observability shows what actually happened against it. This chapter takes that surface apart. The thesis is that all of agent observability is four instrumentation surfaces stacked on one ground truth — the session log — and that getting the layering right (what is the record, what derives from it, and what each derived surface does and does not claim) is the whole discipline. Two of those surfaces are easy to over-read, so the chapter spends its honesty budget keeping their boundaries crisp.

Four surfaces over one ground truth

An agent run produces exactly one authoritative record, and everything you later want to see is a different view of it. Claude Code writes a session transcript for each run — “every message, tool call, and tool result” [Official] Explore the .claude directory · AnthropicT1-official original — as a per-session JSONL file, by default under ~/.claude/projects/, one JSON-safe object per line. [Official] Explore the .claude directory · AnthropicT1-official original That transcript is the ground truth. It is not a summary, not a dashboard, not a metric — it is the literal sequence the agent emitted and received, persisted to disk.

The other three things people mean by “observability” — tracing, attribution, cost-surfacing — are not separate sources of truth. They are surfaces derived from that one log, or pointers back to it. A trace re-renders the run as spans; an attributed commit points back to the run that produced it; a cost figure is computed from the tokens the run consumed. So “what did the agent do?” is, in order, first a logging question (is the transcript captured and kept?), then a tracing / attribution / surfacing question (how do I view it, link to it, and price it?). Skip the log and the other three have nothing underneath them.

Key idea

Agent observability is four instrumentation surfaces over one ground truth. The session log — the per-session JSONL transcript of every message, tool call, and result — is the record; tracing, attribution, and cost-surfacing all derive from it or point back to it. Get the layering right and “what did the agent do?” has a single authoritative answer the other three surfaces only re-view.

Concept · The session log as ground truth

The session log is the per-session JSONL transcript Claude Code writes for each run — every message, tool call, and tool result, one JSON-safe object per line. It is the authoritative record of what happened: tracing re-renders it as spans, attribution links a diff back to it, and cost-surfacing prices the tokens in it. Treat the log as primary and the other three surfaces as views, and you never confuse “I have a dashboard” with “I have the record.”

Logging: two records, two retention stories

The single most common logging error is treating “the transcript” as one thing. There are two records, with two different retention owners, and conflating them is how teams lose run history they assumed was safe.

The first is the CLI local record: the JSONL files under ~/.claude/projects/. These are swept automatically — the cleanupPeriodDays setting deletes local transcript files older than a threshold whose “default is 30 days.” [Official] Explore the .claude directory · AnthropicT1-official original That sweep is a feature, not a bug: it keeps a developer’s disk from filling with months of transcripts. But it means the local files are not a durable archive. Run history older than the window is gone unless something else kept it.

The second is the SDK record. From the Agent SDK, transcripts are still written to JSONL by default, but the SessionStore interface lets a deployment mirror those entries — “JSON-safe objects, one per line in the local JSONL” [Official] Persist sessions to external storage · AnthropicT1-official original — to external storage such as S3, Redis, or a database. The retention of that mirror is the adapter’s responsibility, not Claude Code’s. So a production deployment that needs durable run history cannot lean on the local files and their 30-day sweep; it must mirror via SessionStore and own the retention itself.

Two records, two owners

The local files and the SDK mirror are the same transcript with different retention owners. The local sweep (cleanupPeriodDays, 30-day default) is automatic and time-bounded — Claude Code owns it. The SessionStore mirror’s retention is whatever the adapter implements — you own it. Decide which record is your durable archive before an incident, not during one: if the answer is “the local files,” your archive evaporates on a rolling 30-day window.

OpenTelemetry GenAI conventions as the substrate

When you trace an agent — turn the transcript into spans and metrics a backend can query — the design question is which vocabulary do you instrument to? The answer is a vendor-neutral convention, not a vendor-specific schema.

Claude Code exports three OpenTelemetry signals — “metrics as time series data via the standard metrics protocol, events via the logs/events protocol, and optionally distributed traces” [Official] Monitoring · AnthropicT1-official original — and in the trace tree, “each user prompt starts a” [Official] Monitoring · AnthropicT1-official original claude_code.interaction root span, with API calls, tool calls, and hook executions as its children. Crucially, the per-LLM-request span’s attributes align to the “OpenTelemetry GenAI semantic convention.” [Official] Monitoring · AnthropicT1-official original That alignment is the whole point: Claude Code’s span tree is one realization of a standard the spec defines independently.

On the spec side, the OpenTelemetry GenAI semantic conventions define the same vocabulary from the other direction. The standard token-usage metric is gen_ai.client.token.usage, Semantic conventions for generative AI metrics · OpenTelemetry AuthorsT1-official original documented as the “Number of input and output tokens used,” Semantic conventions for generative AI metrics · OpenTelemetry AuthorsT1-official original and the agent-span operation names are create_agent Semantic Conventions for GenAI agent and framework spans · OpenTelemetry AuthorsT1-official original and invoke_agent. Semantic Conventions for GenAI agent and framework spans · OpenTelemetry AuthorsT1-official original Instrument to those names and your backend — any OpenTelemetry collector — reads the run without knowing it came from Claude Code. The vendor’s spans are swappable; the convention is the contract.

Two caveats are load-bearing, and both move on a release cadence. First, the OpenTelemetry GenAI semantic conventions carry Status: Development — the span and metric names above (gen_ai.client.token.usage, create_agent, invoke_agent) may still change before the convention stabilizes, so treat them as current-as-of, not final. Second, of Claude Code’s three signals, metrics and events are GA while distributed traces are beta; [Official] Monitoring · AnthropicT1-official original a team relying on the claude_code.interaction span tree should track the beta-to-GA transition. Both warrant a recheck after 2026-08-25.

[Caveat]

The OTel GenAI conventions are Status: Development and the names here are feature-surface — re-verify gen_ai.client.token.usage, create_agent, and invoke_agent against the spec after 2026-08-25; they may drift before stabilization.

Attribution is the provenance hook, not the approval gate

The third surface ties an agent’s output back to the run that produced it. In Claude Code, attribution to git commits and pull requests is a configurable setting [Official] Claude Code settings · AnthropicT1-official original — by default, commits carry a Co-Authored-By git trailer “which can be customized or disabled.” [Official] Claude Code settings · AnthropicT1-official original The commit itself becomes the handle: from a merged diff you can walk back to the session log and trace that produced it. In CI the same hook holds — a @claude mention triggers an Action so that “Claude can analyze your code, create pull requests, implement features, and fix bugs,” [Official] Claude Code GitHub Actions · AnthropicT1-official original and the commit is stamped with the GitHub-App actor identity rather than the generic Actions user, which is why CI must run “using the GitHub App or custom app (not Actions user)” [Official] Claude Code GitHub Actions · AnthropicT1-official original for those commits to be attributable.

Here is the boundary the chapter will not let blur: attribution is a provenance hook, not an approval gate. It records which run produced this diff — it does not decide whether the diff may be merged. That decision — the human-in-the-loop review and the gate before an irreversible action — is the oversight workflow, and ch26 owns it. The two are easy to conflate because they touch the same pull request, but they answer different questions: provenance is “where did this come from?”, approval is “may this proceed?”. Read a Co-Authored-By trailer as a gate and you have mistaken a label for a checkpoint.

[Note]

Provenance and approval both attach to the same PR, which is why they get confused. Attribution (here) is “which run made this”; the gate (ch26) is “may this proceed.” A trailer is a label, not a checkpoint.

Surfacing cost at three altitudes — surfacing is not modeling

The fourth surface is the cost and usage a team actually watches, and it lives at three altitudes. The most local is the in-CLI /usage command, whose Session block “shows detailed token usage statistics for your current session” [Official] Manage costs effectively · AnthropicT1-official original — what one developer reads mid-run. Above that is the Team/Enterprise analytics dashboard, which surfaces usage and adoption metrics behind a viewer-role gate — “Admins and Owners can view the dashboard.” [Official] Track team usage with analytics · AnthropicT1-official original And at the top is the Console spend view, which surfaces “daily API costs in dollars alongside user count.” [Official] Track team usage with analytics · AnthropicT1-official original

[Tip]

The in-CLI usage command was renamed from /cost to /usage. Command names are feature-surface and drift per release — re-verify after 2026-08-25.

But the surfaced number carries a caveat that defines the boundary. The dollar figure in /usage “is an estimate computed locally from token counts and may differ from your actual bill” [Official] Manage costs effectively · AnthropicT1-official original — the authoritative figure lives in the Console. That single sentence draws the line: surfacing shows the number and points to where the authoritative one lives; it does not model the economics. The per-developer dollar-per-day modeling, the token-reduction tactics, the question of which lever actually moves the bill — that is ch25’s subject, and the input-context cost driver ch25 unpacks. Observability tells you what a run cost as a local estimate; cost modeling tells you how to make it cost less. Mistake the surfaced estimate for the bill, or for an economic model, and you will optimize against a number that was never authoritative.

Four observability surfaces over one ground truth. At the base, the session log — the per-session JSONL transcript of every message, tool call, and tool result. Deriving from it: tracing (OTel GenAI spans and metrics), attribution (diff/PR back to the run), cost-surfacing (three altitudes — /usage, team dashboard, Console spend), and logging/retention (the local 30-day sweep versus the SDK SessionStore). The four surfaces are built on the one log.A wide blue box at the base labeled 'Session log — the ground truth, per-session JSONL transcript: every message, tool call, tool result'. Four teal boxes sit above it, each with an arrow pointing down to the log: 'Tracing — OTel GenAI spans & metrics', 'Attribution — diff / PR back to the run', 'Cost-surfacing — /usage to dashboard to Console spend', and 'Logging / retention — 30-day sweep vs SDK SessionStore'. A dashed caption strip beneath reads 'four surfaces over one ground truth — instrument what the agent did, not just whether it finished'.
Four observability surfaces over one ground truth. At the base, the session log — the per-session JSONL transcript of every message, tool call, and tool result. Deriving from it: tracing (OTel GenAI spans and metrics), attribution (diff/PR back to the run), cost-surfacing (three altitudes — /usage, team dashboard, Console spend), and logging/retention (the local 30-day sweep versus the SDK SessionStore). The four surfaces are built on the one log.
Routing an observability question to its surface Worked example

A team says: “We shipped a bad PR last week. We can’t reconstruct what the agent was reasoning about, the trace backend doesn’t recognize our spans, and the bill looks high — what failed and where?”

Route each part to its surface before fixing anything:

  • “Can’t reconstruct what it was reasoning about” is a logging question. The transcript is the ground truth — but if the run was a week old and lived only in the local files, cleanupPeriodDays may already be irrelevant (30 days is the default), yet the deeper issue is whether it was ever mirrored. If there is no SessionStore mirror, the record may simply not exist to reconstruct from. The fix is durable logging, not a better dashboard.
  • “Trace backend doesn’t recognize our spans” is a tracing-convention question. If the backend was wired to vendor-specific names, it cannot read a standard collector. Instrument to the OTel GenAI convention (gen_ai.*, create_agent/invoke_agent) so any collector reads the run — while remembering those names are still Development-status.
  • “We shipped a bad PR” has an attribution part and a non-attribution part. Attribution can tell you which run produced the diff (the Co-Authored-By trailer, the GitHub-App identity) — that is provenance. Whether a human gate should have stopped it is not an observability question at all; it is ch26’s oversight workflow.
  • “The bill looks high” is a cost-surfacing question only up to the number. /usage surfaces a local estimate that “may differ from your actual bill”; the Console holds the authoritative figure. Why it is high and how to reduce it is ch25’s modeling — surfacing stops at showing the number.

Four surfaces turned one panicked failure into four located questions — and pulled the two that aren’t observability (the approval gate, the cost model) out to their real owners.

Reading a derived surface as the record

The recurring observability failure is treating a view as the record. Three forms: trusting a dashboard while the underlying transcript was swept on the 30-day local window and never mirrored; reading a Co-Authored-By trailer as an approval gate when it is only provenance; and treating the /usage dollar figure as the bill when it is “an estimate computed locally from token counts.” In every case the fix is the same — go back to the ground truth. The session log is the record; tracing, attribution, and surfacing are views of it, and a view is only as durable, as authoritative, and as load-bearing as the record beneath it.

Quick reference

  • One ground truth: the session log — the per-session JSONL transcript of every message, tool call, and result — is the record; tracing, attribution, and cost-surfacing all derive from it. Explore the .claude directory · AnthropicT1-official original
  • Two retention stories: local files swept on a 30-day default (cleanupPeriodDays) versus the SDK SessionStore mirror whose retention the adapter owns — don’t rely on the local files for durable history. Persist sessions to external storage · AnthropicT1-official original
  • Trace to the convention: instrument to the OTel GenAI names (gen_ai.client.token.usage, create_agent, invoke_agent), not a vendor schema; Claude Code’s claude_code.interaction tree is one realization. Semantic conventions for generative AI metrics · OpenTelemetry AuthorsT1-official original
  • Moving target: the GenAI conventions are Status: Development and Claude Code’s traces signal is beta — recheck names after 2026-08-25. Monitoring · AnthropicT1-official original
  • Attribution = provenance, not approval: the Co-Authored-By trailer ties a diff to its run; the gate is ch26. Claude Code settings · AnthropicT1-official original
  • Surfacing ≠ modeling: /usage shows a local estimate that “may differ from your actual bill”; modeling the economics is ch25. Manage costs effectively · AnthropicT1-official original

Practice

Exercise

The chapter claims observability is “four surfaces over one ground truth.” Name the ground truth and the four surfaces, and state in one sentence why the ground truth is primary and the other three are views. Then explain the boundary the chapter draws around attribution — what it does claim, what it does not claim, and which chapter owns the thing it does not claim.

Practice ◆◆◇◇

Take an agent you run or have read about that writes code or files. For each of the four surfaces, write down the current state of your instrumentation: (1) logging — is the transcript captured, and where is it durably kept beyond the local 30-day sweep? (2) tracing — are you emitting to the OTel GenAI convention, a vendor schema, or nothing? (3) attribution — can you walk from a merged diff back to the run that produced it? (4) cost-surfacing — where do you watch the cost, and do you know which displayed number is an estimate versus authoritative? Then mark which gap would hurt most during an incident — and note explicitly any place where you were about to write down “the gate” (ch26) or “reduce the bill” (ch25), because those are not observability gaps.

Exercise solutions

Solution ↑ Exercise

The ground truth is the session log — the per-session JSONL transcript Claude Code writes for each run, recording every message, tool call, and tool result. The four surfaces are logging (capturing and retaining that transcript), tracing (re-rendering the run as OTel GenAI spans and metrics), attribution (tying a commit/PR back to the run that produced it), and cost-surfacing (showing token/dollar usage at the CLI, team-dashboard, and Console altitudes). The ground truth is primary because it is the literal, authoritative record of what the agent did; the other three are views — a trace re-renders it, an attributed commit points back to it, a cost figure is computed from the tokens in it — so each is only as durable and authoritative as the log beneath it. On attribution: it claims provenance — which run produced this diff, via the Co-Authored-By trailer and the GitHub-App commit identity — and it does not claim approval, i.e. it does not decide whether the diff may be merged; that gate is the human-in-the-loop oversight workflow, which ch26 owns. Conflating the two mistakes a label for a checkpoint.

Solution ↑ Exercise

A worked example. Take a documentation-writing agent that opens PRs. Logging: “Transcripts are written to the local ~/.claude/projects/ JSONL, but we never set up a SessionStore mirror — so anything past the 30-day cleanupPeriodDays sweep is gone. That is our durable-history gap.” Tracing: “We enabled telemetry but pointed it at a vendor-named dashboard; a standard OTel collector wouldn’t recognize our spans. Re-instrumenting to the GenAI convention (gen_ai.*, claude_code.interaction aligning to it) would make the backend swappable — though the names are Development-status, so I’ll pin a recheck.” Attribution: “Commits carry the default Co-Authored-By trailer, so I can walk from a merged doc-PR back to the run — provenance is fine.” Cost-surfacing: “I watch /usage mid-run and the Console for the monthly figure, but I’d been treating the /usage dollar number as the bill — it’s a local estimate that ‘may differ from your actual bill.’” Most painful during an incident: the logging gap — if a bad PR shipped and the run is older than 30 days with no mirror, there is no transcript to reconstruct from, and every other surface is a view of a record that no longer exists, so I’d mirror via SessionStore first. Pulled out as non-observability: “add a human gate before the PR merges” is ch26 (oversight), not a logging/tracing gap; and “the bill is too high, reduce it” is ch25 (cost modeling), not cost-surfacing — surfacing only shows the number. The exercise’s value is feeling that two of the four surfaces are easy to over-read into decisions they don’t make.

Part 3 Chapter 25 Last verified 2026-06-14 Fresh

Cost: The Economics of Running Agents

The economics of running an agent, on one premise — context is compute, so the input context an agent reprocesses each turn is the dominant cost driver, not output generation. Four composable levers manage that spend — reduce input context, cache the stable prefix, route by model tier, and batch the non-urgent work — and they stack rather than compete. Cache economics are stated as ratios and the model ladder qualitatively, because the underlying pricing surface is volatile.

Volatility: feature-surface
Tools compared: claude-codecross-tool
Before you start: ch21's framing of cost as one of five operational surfaces, and Vol 1's account of context as a finite budget. ch24's observability surfaces the cost numbers; this chapter models their economics.
You will learn
  • Why input context, not output generation, is the cost driver — the “context is compute” premise the whole chapter rests on
  • Prompt-cache economics as a read-vs-write asymmetry — why a one-time write premium buys order-of-magnitude-cheaper reads, and why a high read-to-creation ratio means caching is working
  • The multi-agent token multiplier as a cost-modeling input, not a flat anti-pattern — Anthropic’s own ~15× measurement, and why spend can be the rational choice
  • The two cheapen-the-spend levers — model-tier routing and the Batch API — and how all four levers compose into one playbook

ch24 surfaced the numbers; this chapter asks what they mean. Cost is the third operational surface, and it has a single organizing premise: the money is in the input. An agent does not spend its budget mostly on what it writes — it spends it on the context it reprocesses every turn. Once you see that, four levers fall out, and they are not rivals competing for the same fix — they stack. This chapter states the premise, then each lever, then composes them into one cost discipline.

Context is compute

Start with where the money actually goes, because the intuition is usually wrong. It is tempting to picture an agent’s cost as the text it produces — the long answer, the generated code, the written report. But generation is the small side of the ledger for an agent: output tokens are individually pricier than input, yet an agent reprocesses far more input than it ever writes, so the input dominates the bill — the context the model has to read and re-read on every turn.

The premise has a first-party name. Context is “a critical but finite resource for AI agents” [Official] Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original — a budget, not a free input slot. The reason is mechanical: like a person with limited working memory, an LLM has a finite attention budget “that they draw on when parsing large volumes of context.” [Official] Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original Every token in the window is a token the model spends attention parsing, and every token parsed is billed.

What makes this the cost story for agents — rather than for a one-shot prompt — is accumulation. An agent runs a loop: it calls a tool, the result lands in the context, it reasons over the now-larger context, calls another tool, and so on. The conversation history, the tool outputs, the system instructions — all of it is reprocessed each turn. So the input the agent pays for grows as the run goes on, and on a long trajectory the input side dwarfs anything the model writes back. That is the sense in which “context is compute”: the tokens you feed the model are the spend you engineer down.

Key idea

The input context an agent reprocesses each turn — not the output it generates — is the dominant cost driver. Context is a finite resource, billed by the attention spent parsing it, and it accumulates across an agent’s loop. Every lever in this chapter is therefore a move on the input side of the bill; none is about generating less output, because for the context-heavy agent loops this volume concerns, the reprocessed input dominates the bill.

[Caveat]

The source asserts no first-party numeric input-to-output ratio, so this chapter states none. “Context is compute” is a qualitative pattern — input dominates — not a fixed multiple. A concrete ratio is a property of one workload, not a universal constant.

Concept · Context is compute

A framing of agent cost in which the input context the model parses each turn is treated as the spend — because context is a finite resource billed by the attention spent on it, and an agent’s loop makes that input accumulate. The design question shifts from “how much will it write?” to “how much context am I making it reprocess, and how do I make that cheaper?”

Prompt-cache economics: a read-vs-write asymmetry

If input context is the spend, the first lever is the one that makes repeated input cheap. Prompt caching stores a prefix of the context server-side so it does not have to be reprocessed from scratch on the next turn — and its economics are a sharp asymmetry between writing the cache and reading it.

The two figures are first-party and worth stating exactly. Writing the 5-minute cache costs about 1.25× the base input price [Official] Prompt caching · AnthropicT1-official original — a one-time premium you pay to populate it. Reading from the cache (a cache hit) costs about 0.1× the base input price [Official] Prompt caching · AnthropicT1-official original — roughly a tenth. And the cache “has a 5-minute lifetime” that “is refreshed for no additional cost each time the cached content is used,” [Official] Prompt caching · AnthropicT1-official original so under sustained traffic the timer keeps resetting and the prefix stays warm for free.

The break-even falls straight out of those numbers. Because a hit costs roughly a tenth of the input price, “caching pays off after just one cache read for the 5-minute duration.” [Official] Pricing - Claude API Docs · AnthropicT1-official original You pay the 1.25× write once; the very first reuse already comes back at 0.1×, and every reuse after that is gravy. The design move this licenses is structural: stabilize a long shared prefix — system instructions, tool definitions, a fixed document set — so it is written to the cache once and then read many times across the run.

A high read-to-creation ratio means caching is working

The cache turns a per-run health metric into a cost signal. Claude Code surfaces per-turn cache reads “billed at roughly 10% of the standard input rate” [Official] How Claude Code uses prompt caching - Claude Code Docs · AnthropicT1-official original — the harness-side restatement of the 0.1× read multiplier. And the signal to watch is the ratio between them: “A high read-to-creation ratio means caching is working well.” [Official] How Claude Code uses prompt caching - Claude Code Docs · AnthropicT1-official original Lots of reads per write means a stable prefix the cache can amortize; persistently high creation means the prefix is changing turn after turn, so you keep paying the write premium and never collect the 0.1× reads. The cost lever and the health metric are the same number read two ways.

[Warning]

The cache ratios (write ~1.25×, read ~0.1×) and the break-even are stated as multiples of the base input price, never as dollars. The pricing surface they sit on is volatile — recheck after 2026-06-26 — so the ~10× cached-versus-uncached gap is the durable takeaway, not any absolute figure.

The ~10× gap between a cold read (full input price) and a warm read (0.1× of it) is the economic core of “context is compute.” It is what makes carrying a large, stable context affordable at all: the first turn pays to write it, and every turn after rides at a tenth of the price.

The multi-agent token multiplier — a modeling input, not a verdict

The second thing the cost surface has to handle is the one the orchestration chapters deferred: multi-agent systems are expensive, and the honest question is when that expense is worth paying. Anthropic’s first-party measurement on its own research system is concrete: “agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original A multi-agent topology burns through tokens fast, because each sub-agent runs its own context window — and on the cost surface, that ~15× is a number you have to plan around.

But the same measurement reframes the burn. On their benchmark, “token usage by itself explains 80% of the variance” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original in performance — token spend was the single dominant driver of how well the system did. Read together, the two findings turn the multiplier from an indictment into a lever: if spending more tokens is the strongest predictor of doing better, then on a high-value, genuinely parallelizable task the 15× spend can be the rational choice, not waste. The cost-modeling question is not “is multi-agent wasteful?” — it is “does this task’s value clear the 15× multiplier?”

Reading the 15× as a universal anti-pattern

The ~15× is one first-party datapoint, measured on Anthropic’s own multi-agent research system, and the “80% of the variance” is on the BrowseComp benchmark specifically — both are measurements on one workload, not constants you can carry to yours. Reading “15×” as a flat verdict (“multi-agent is wasteful, avoid it”) misuses it in one direction; reading “80% of variance” as “always spend more tokens” misuses it in the other. The right use is as a cost-modeling input: it tells you the burn is large and that the burn buys performance, so you size the spend against the task’s value rather than reaching for a rule. A different topology or task mix will have different multipliers — gather your own before betting on a number.

This is also why this chapter, not the orchestration ones, is the honest home for the ~15×. Whether to build a multi-agent topology is an orchestration question; what it costs and when that cost is justified is an economics question — and economics is the surface that can hold “expensive” and “worth it” in the same hand without collapsing into a slogan either way.

Cheapening the spend: model tiers and the Batch API

The first two levers reduce how much input you pay for. The last two reduce the price of what you do spend — without touching the architecture at all.

The first is model-tier routing. Anthropic’s guidance is to “Choose Haiku for simple tasks, Sonnet for most production workloads, and Opus for the most complex reasoning.” [Official] Pricing - Claude API Docs · AnthropicT1-official original The tiers form a cost-and-capability ladder — Haiku < Sonnet < Opus, cheapest and least capable up to most capable and most expensive — and the lever is to route the cheapest model that clears each subtask. A classification step, a quick extraction, a routine summarization does not need the top tier; reserve Opus for the reasoning that genuinely requires it. The point is per-subtask: one agent run can dispatch cheap work to a cheap model and keep the expensive model for the part that earns it.

[Tip]

Render the tier ladder as an ordering, not a price list. This chapter prints Haiku < Sonnet < Opus and no per-MTok dollars on purpose — the pricing page is volatile (recheck after 2026-06-26). If you need absolute prices, fetch them live; do not carry a remembered figure into a cost model. The lineup also evolves — treat the three named tiers as the cited example of the ordering, not a complete or fixed roster.

The second is the Batch API, for work that is not time-sensitive. It “allows asynchronous processing of large volumes of requests with a 50% discount on both input and output tokens” [Official] Pricing - Claude API Docs · AnthropicT1-official original — corroborated on the feature’s own documentation, which describes processing large volumes “while reducing costs by 50% and increasing throughput.” [Official] Batch processing — Claude Docs · AnthropicT1-official original The trade is latency for price: you give up real-time response and get half off. An overnight evaluation run, a bulk re-classification, a backfill — none of these needs to answer in seconds, and all of them halve in cost by going through the batch path.

The two levers are orthogonal to caching and to each other: tier-routing changes which model parses the input, batching changes when the work runs, caching changes how much of the input is reprocessed. That orthogonality is what lets them stack.

The four levers, composed

The chapter’s payoff is that these are not four competing fixes you choose between — they are four moves you apply together, each on a different part of the spend.

  • Reduce the input context (the driver). Spend less to begin with: trim the prompt, keep tool outputs lean, do not carry context the turn does not need. This is the lever that attacks the premise directly.
  • Cache the stable prefix. Make the input you do carry cheap to reprocess — write it once at ~1.25×, read it at ~0.1× thereafter, and watch the read-to-creation ratio to confirm it is amortizing.
  • Model the multi-agent burn against value. Decide whether to spend the ~15× at all — pay it only when the task’s value clears the multiplier.
  • Route and batch to cheapen what’s left. Send each subtask to the cheapest model that clears it (Haiku < Sonnet < Opus), and push non-urgent work through the 50% Batch API.

They compose because they act on different variables — tier-routing per subtask, caching on repeated context, batch on deferrable work. One boundary is worth stating: the Batch API is a 50%-off asynchronous path, so it applies to work you can defer (offline evals, bulk jobs), not to a live interactive turn. A cost-disciplined agent runs every applicable lever against one bill — caching and tier-routing on its live turns, batch on whatever can go async.

Four composable cost levers feeding one bill, over the 'context is compute' base. The base reads that input context is the spend, not output generation. Four levers stack onto it: (1) reduce input context — the driver; (2) prompt caching — write ~1.25× / read ~0.1×; (3) model-tier routing — Haiku < Sonnet < Opus; (4) Batch API — 50% discount, async. A note marks them as composable, not competing, and the base feeds one bill — the token spend you pay. No dollar figures appear; ratios and qualitative ordering only.A base bar labeled 'Context is compute — input context is the spend (not output generation)'. Four boxes sit above it and feed into it with arrows: box 1 'Reduce input context — the driver, spend less to begin with'; box 2 'Prompt caching — write ~1.25x / read ~0.1x'; box 3 'Model-tier routing — Haiku < Sonnet < Opus'; box 4 'Batch API — 50% discount (async)'. A note above the four reads 'four composable levers — they stack, they do not compete'. An arrow runs down from the base to a box labeled 'one bill — the token spend you pay'. No dollar amounts appear anywhere.
Four composable cost levers feeding one bill, over the 'context is compute' base. The base reads that input context is the spend, not output generation. Four levers stack onto it: (1) reduce input context — the driver; (2) prompt caching — write ~1.25× / read ~0.1×; (3) model-tier routing — Haiku < Sonnet < Opus; (4) Batch API — 50% discount, async. A note marks them as composable, not competing, and the base feeds one bill — the token spend you pay. No dollar figures appear; ratios and qualitative ordering only.
Modeling an agent's bill with the four levers Worked example

A team says: “Our research agent’s bill tripled this month and we don’t know why. Do we kill the multi-agent design?”

Walk the four levers in order before touching the architecture:

  • Reduce input context (the driver). First ask what is in the window each turn. If the agent is now carrying larger tool outputs or a longer history than last month, the input it reprocesses every turn has grown — and since input is the cost driver, that alone can triple a bill without any change to the design. Trim here first; it is the cheapest fix.
  • Cache the stable prefix. Check the read-to-creation ratio. If creation is persistently high, the cached prefix is changing turn after turn — so the agent keeps paying the ~1.25× write premium and rarely collects the ~0.1× reads. A prompt or tool-set change that destabilized the prefix would show up exactly as a cost spike. Re-stabilize the prefix and the reads come back cheap.
  • Model the multi-agent burn against value. Now weigh the topology. The ~15× multiplier is real, but it is not automatically the culprit — and the team’s own data says token spend strongly predicts performance, so cutting agents to save tokens may cut quality with it. Ask whether the research task’s value clears the 15× before killing the design; the answer might be “the design is fine, the prefix regressed.”
  • Route and batch to cheapen what’s left. If some sub-agents are doing routine extraction on the top model tier, route them down to Haiku/Sonnet. If any of the work is a non-urgent backfill, send it through the 50% Batch API.

Notice the architecture question came last. The cost surface turned “do we kill multi-agent?” into four located moves — and three of them are cheaper than a redesign. The four levers compose into a diagnosis, not a single guess.

Quick reference

  • The premise: input context, not output, is the cost driver — context is a finite resource billed by the attention spent parsing it, and it accumulates across an agent’s loop. Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025)T1-official original
  • Cache asymmetry: a write costs ~1.25× base input, a read ~0.1×, with a 5-minute free-refresh TTL — so “caching pays off after just one cache read.” Prompt caching · AnthropicT1-official original Pricing - Claude API Docs · AnthropicT1-official original
  • Cache health = cost signal: a high read-to-creation ratio means caching is working; persistently high creation means you keep paying the write premium. How Claude Code uses prompt caching - Claude Code Docs · AnthropicT1-official original
  • Multi-agent burn: Anthropic measured ~4× (single agent) and ~15× (multi-agent) the tokens of a chat, with “token usage by itself explains 80% of the variance” — a cost-modeling input on their workload, not a universal constant or a flat anti-pattern. How we built our multi-agent research system · Anthropic (2025)T1-official original
  • Cheapen-the-spend levers: route the cheapest model that clears each subtask (Haiku < Sonnet < Opus, qualitative — no dollars, recheck after 2026-06-26) and push non-urgent work through the 50% Batch API. Pricing - Claude API Docs · AnthropicT1-official original Batch processing — Claude Docs · AnthropicT1-official original
  • The playbook: reduce input → cache the stable prefix → model the burn against value → route and batch — four levers that stack, all against the input side of one bill.

Practice

Exercise

The chapter claims the four cost levers “stack rather than compete.” Name the four levers, state in one sentence which variable each one acts on, and explain why acting on different variables is what lets them compose. Then say which single lever attacks the “context is compute” premise most directly, and why.

Exercise

The chapter calls the ~15× multi-agent multiplier “a cost-modeling input, not a verdict.” Explain what the “80% of the variance” finding adds to the bare ~15× figure, and why the two together turn the multiplier into a lever rather than an indictment. Then state one way a reader could misuse each of the two numbers, and name the honest qualifier the chapter attaches to both.

Practice ◆◆◇◇

Take an agent you run or have read about. Estimate, qualitatively, where its bill goes: roughly how much of each turn’s tokens are input it reprocesses (history, tool outputs, instructions) versus output it generates. Then identify, for each of the four levers, one concrete move you could make on that agent — what input you would trim, what prefix you would stabilize for caching, whether (and why) the topology’s spend is justified, and which subtasks could drop a model tier or move to the Batch API. The point is to feel the bill as an input problem, and to size each lever before reaching for any one.

Practice ◆◆◆◇

The chapter states cache economics as ratios (~1.25× write, ~0.1× read) and the model ladder qualitatively (Haiku < Sonnet < Opus), and prints no dollar figures — with a “recheck after 2026-06-26” note. Argue for this editorial discipline: what would go wrong if the chapter hard-coded per-MTok dollar prices and a precise input-to-output ratio instead? Then argue the other side — what does a reader lose by getting ratios and an ordering rather than absolute numbers, and how would you responsibly recover the absolute figures when a real cost model needs them?

Exercise solutions

Solution ↑ Exercise

The four levers are (1) reduce input context, (2) prompt caching, (3) model-tier routing, and (4) the Batch API. Each acts on a different variable: lever 1 reduces how much input is reprocessed each turn (it attacks the spend at the source); lever 2 reduces the price of reprocessing the input you do carry (write once at ~1.25×, read at ~0.1×); lever 3 changes which model parses the input, per subtask (route the cheapest tier that clears the task); lever 4 changes when the work runs (non-urgent work goes async for a 50% discount). Because they act on different variables — quantity of input, price-per-reprocess, model, and timing — they do not contend for the same fix; they multiply through independently, and some compose further (tier-routing sits underneath caching; the batch discount then applies to whatever work runs async, where its cache hits are best-effort rather than guaranteed). The lever that attacks the “context is compute” premise most directly is reduce input context: the premise says input is the dominant cost, and trimming the input lowers the very quantity the other three only make cheaper to carry, route, or schedule. It is the cheapest move precisely because it removes spend rather than discounting it.

Solution ↑ Exercise

The bare ~15× figure tells you only that a multi-agent system is expensive — it burns roughly fifteen times the tokens of a chat. On its own that reads as an indictment (“multi-agent is wasteful”). The “token usage by itself explains 80% of the variance” finding adds the missing half: on that benchmark, how much a system spent was the single strongest predictor of how well it performed. Put together, the two say the burn is large and that the burn buys performance — so the spend is not waste but a lever you pull when a task’s value justifies it; the modeling question becomes “does this task clear the 15×?” rather than “avoid multi-agent.” A reader could misuse the ~15× by treating it as a flat verdict that multi-agent should always be avoided; and could misuse the “80%” by concluding one should always spend more tokens. The honest qualifier the chapter attaches to both is that they are measurements on one workload — the ~15× is one first-party datapoint from Anthropic’s own multi-agent research system, and the “80%” is specific to the BrowseComp benchmark — so neither is a universal constant, and a different topology or task mix will have different numbers. The right move is to gather your own before betting on a figure.

Solution ↑ Exercise

A worked example. Take a customer-support triage agent that reads a ticket, searches a knowledge base over several tool calls, and drafts a reply. Where the bill goes: the drafted reply is a few hundred output tokens, but each turn re-feeds the system prompt, the tool definitions, the growing tool-result history, and the ticket — so the input it reprocesses dominates, and it grows as the search deepens. That is the bill as an input problem. Lever 1 — reduce input: stop carrying full knowledge-base articles in the window once they have been read; keep only the extracted snippet the draft needs. Lever 2 — cache: the system prompt and tool definitions are a stable prefix — write them to the cache once and collect ~0.1× reads on every subsequent turn; confirm via a high read-to-creation ratio. Lever 3 — model the burn: if this is a single agent (~4×), there is no multi-agent topology to justify — but if triage fanned out into parallel sub-agents per knowledge source, ask whether the support volume’s value clears the ~15× before keeping it. Lever 4 — route and batch: the initial “which category is this ticket?” classification can run on Haiku, reserving the top tier for the draft; and any overnight bulk re-tagging of old tickets goes through the 50% Batch API. The value of the exercise is seeing that the largest, cheapest win (trimming carried articles, caching the prefix) sits on the input side — exactly where the premise says the money is — long before any architecture change.

Solution ↑ Exercise

For the discipline. Hard-coded per-MTok dollar prices would go stale the moment the pricing page changes — and the chapter’s own sourcing flags that page as volatile (recheck after 2026-06-26). A printed dollar figure in a reference book becomes a quietly-wrong number that readers trust precisely because it looks precise; worse, a stale absolute price corrupts any cost model built on it. A precise input-to-output ratio has the same defect with an added one: no first-party source asserts such a ratio, so printing one would be inventing a constant and laundering it as fact — and any real ratio is a property of one workload, not a universal. Ratios and an ordering, by contrast, are the durable part: the ~10× cached-versus-uncached gap and the Haiku < Sonnet < Opus ladder survive a pricing change, because a repricing typically moves the absolute levels while preserving the asymmetry and the ordering. The other side. A reader genuinely loses the ability to compute an absolute budget from the chapter alone — “ratios” cannot tell you whether next month’s bill is $40 or $4,000, only how the levers move it. The responsible recovery is explicit in the chapter’s own discipline: when a real cost model needs absolute figures, fetch them live from the pricing surface at the moment you build the model, treat them as that-day’s volatile numbers, and re-verify on the cadence the volatility implies — rather than carrying a remembered or book-printed price into a decision. The book’s job is to teach the shape of the economics that survives repricing; the live pricing page’s job is to supply the day’s absolute numbers.

Part 3 Chapter 26 Last verified 2026-06-14 Fresh

Human-in-the-Loop: Keeping a Human in Control

The oversight surface of the Evaluation & Operations volume. Keeping a human in control of a production agent is one move — control over the irreversible or wrong action — expressed four ways (the approval gate, plan mode, calibration, escalation in automation), all of them a workflow layered on top of Vol-1's permission model. The chapter draws the workflow-on-model line sharply, names the default-ask versus approval-fatigue trade-off as genuinely open, and treats agent self-calibration as a sparse, explicitly imperfect pattern.

Volatility: feature-surface
Tools compared: claude-codecross-tool
Before you start: Vol 1's permission model — the Always / Ask / Never rules and permissions.deny that classify which actions an agent may take freely, ask about, or never attempt. This chapter is the oversight workflow that rides on top of that model; it restates the model in a sentence but does not re-derive it.
You will learn
  • The one move this chapter is about — human control over the irreversible or wrong action — and the four ways it shows up: the approval gate, plan mode, calibration, and escalation in automation
  • The workflow-on-model split: this chapter owns when a human is consulted and what they review; Vol 1’s guardrails own which actions need consulting
  • Why the default ask-before-acting posture and approval fatigue are an open, unsolved trade-off, not a settled best practice
  • Why agent self-calibration — the agent deciding to stop and ask — is a suggestive, explicitly imperfect pattern on thin evidence, not a guarantee
  • How the gate survives into automation by failing closed when no human is present

Vol 1 built the permission model: the rules that decide which actions an agent may take freely, must ask about, or must never attempt. This chapter takes what sits on top of that model — the oversight workflow that keeps a human in control once the agent is actually running. The thesis is that oversight is one move — a human’s control over the irreversible or wrong action — wearing four faces. See the move once and the four faces stop looking like four separate features and start looking like four places to insert the same human decision.

One move, four expressions

It is tempting to read agent oversight as a checklist of features: approval prompts, a plan mode, some uncertainty signalling, a CI gate. That framing hides the thing they share. Every one of them inserts a human decision at a risky transition — the moment the agent is about to do something irreversible, expensive, or wrong. The whole subject is that single move, applied at four different points in the agent’s life.

The four expressions are: the approval gate (the agent pauses before an irreversible action and waits for a human to approve), plan mode (the same gate moved earlier — the agent stays read-only and proposes a plan a human approves before any edit), calibration (the agent itself decides to stop and ask when it is uncertain), and escalation in automation (the human checkpoint that survives into headless and CI runs). The first two are human-initiated boundaries the operator sets; the third flips initiative to the agent; the fourth is how the gate degrades safely when no human is watching in real time.

Key idea

Oversight is not four features — it is one move expressed four ways. The move is human control over the irreversible or wrong action, inserted at a risky transition. The approval gate inserts it before a single tool call; plan mode inserts it before a whole change-set; calibration lets the agent insert it when it is unsure; escalation keeps it present in automation. Learn the move and the four expressions become one idea you apply in four places.

The approval gate: blocking, default-on, on irreversible actions

The first expression is the approval gate. Out of the box, “Claude Code asks users for approval before running commands or modifying files.” [Official] How we built Claude Code auto mode: a safer way to skip permissions · John Hughes (Anthropic) (2026)T1-official original The gate fires precisely where an action is irreversible or ambiguous — the agent “might need permission before deleting files, or need to ask which database to use for a new project.” [Official] Handle approvals and user input · AnthropicT1-official original When it fires, it is synchronous and blocking: programmatically, the approval callback fires whenever the agent needs input and “pauses execution until you return a response.” [Official] Handle approvals and user input · AnthropicT1-official original Interactively, when Claude wants to edit a file, run a shell command, or make a network request, “it pauses and asks you to approve the action.” [Official] Choose a permission mode · AnthropicT1-official original

The security model frames the same gate as a transparency-and-control safeguard. Anthropic states that “we require approval for bash commands before executing them” [Official] Security · AnthropicT1-official original — the gate sits in front of the command, not after it. And it is explicit about whose job the gate is: “Claude Code only has the permissions you grant it. You’re responsible for reviewing proposed code and commands for safety before approval.” [Official] Security · AnthropicT1-official original The human at the gate is a reviewer, not a rubber stamp — the gate only does its work if the human actually reads what they are approving.

Concept · Approval gate

A blocking, default-on checkpoint placed in front of an irreversible or high-risk action. When it fires, the agent halts the action and waits for a human to approve or deny before proceeding — it “pauses execution until you return a response.” The human’s role at the gate is to review the proposed command or edit for safety, not merely to acknowledge it. The gate is the workflow (when the agent stops, what the human reviews); it is not the rule that decided the action was risky in the first place.

The default-ask posture is an open trade-off, not a solved one

Here is where honesty matters more than tidiness. The default ask-before-acting posture has a well-documented cost, and the same first-party source that states the default also names the cost: asking for approval before every command or file change creates approval fatigue, which is exactly what motivates an auto mode that lets users skip permissions — Anthropic’s engineering write-up is titled “How we built Claude Code auto mode: a safer way to skip permissions.” [Official] How we built Claude Code auto mode: a safer way to skip permissions · John Hughes (Anthropic) (2026)T1-official original

So “keep a human in control” and “let the agent run” pull against each other, and the tension is real and unresolved. A gate that fires too often trains the human to approve reflexively — which is worse than no gate, because it manufactures the appearance of oversight without the substance. A gate that fires too rarely lets an irreversible action slip through unreviewed. The product ships both a default-on gate and a mechanism to skip it, which is the clearest possible signal that the right firing rate is not a settled question. Present the gate and the fatigue cost together; do not pretend the trade-off is closed.

A gate the human ignores is not oversight

The failure mode of the approval gate is not that it is absent — it is that it fires so often the human stops reading and approves on reflex. At that point the gate produces the appearance of control without the substance, which is more dangerous than no gate, because it licenses a false sense of safety. This is why the default-ask posture is in genuine tension with autonomy: tune which actions are gated (a permission-model question, owned by Vol 1) so the gate is rare enough to stay meaningful, rather than gating everything and training the human to ignore it.

Plan mode: the gate moved earlier

The second expression takes the same approval move and slides it earlier in time. Plan mode “tells Claude to research and propose changes without making them. Claude reads files, runs shell commands to explore, and writes a plan, but does not edit your source.” [Official] Choose a permission mode · AnthropicT1-official original The posture is read-only-until-approved: in plan mode Claude “uses read-only tools only, creating a plan you can approve before execution,” [Official] How Claude Code works · AnthropicT1-official original and the everyday recipe is the same — Claude “reads files and proposes a plan but makes no edits until you approve.” [Official] Common workflows · AnthropicT1-official original

The difference from the approval gate is the unit being gated. The approval gate stops a single risky tool call at the moment it would execute; plan mode stops the whole change-set up front, before any of it is irreversible. The human reviews the proposed plan as a plan — separating the research-and-propose phase from the irreversible coding phase — and approves the direction before a single edit lands. It is the proactive form of the same human-control move: rather than catching risky actions one at a time as they arrive, you put the human’s judgment in front of the entire intended change.

[Tip]

Approval gate and plan mode are the same move at two grains: the gate stops one tool call as it fires; plan mode stops the whole change-set before any of it begins. Reach for plan mode when you want to review direction before edits; rely on the gate to catch individual risky calls along the way.

Calibration: agent-initiated escalation, and imperfect

The first two expressions are boundaries the operator sets. The third flips initiative: calibration is the agent deciding, on its own, when to stop and hand back. This is the thinnest-evidenced part of the chapter, and it must be read that way.

Two Anthropic Research findings — and only two — point at it. The autonomy study reports that “on the most complex tasks, Claude Code asks for clarification more than twice as often as on minimal-complexity tasks, suggesting Claude has some calibration about its own uncertainty.” Measuring AI agent autonomy in practice · McCain, Millar, Huang et al. (Anthropic) (2026)T1-official original The trustworthy-agents principle states the design intent behind that behavior: an agent “can only act on what users actually want if it knows when to stop and ask for clarification when it’s uncertain, or when it’s about to make a mistake.” Trustworthy agents in practice · Anthropic (2026)T1-official original The direction is an agent that escalates itself — surfacing a low-confidence decision for review rather than waiting to be stopped.

But the same research is candid that the calibration is imperfect: the autonomy work notes the agent “may not be stopping at the right moments.” Measuring AI agent autonomy in practice · McCain, Millar, Huang et al. (Anthropic) (2026)T1-official original So this is a suggestive pattern Anthropic is measuring, not an established mechanism you can lean on. Two first-party findings, both honest about their own limits, do not make a guarantee. Treat agent self-calibration as a promising direction that supplements the operator-set gates — never as a substitute for them. The reason the approval gate and plan mode exist as human-initiated boundaries is precisely that you cannot yet trust the agent to know, reliably, when it is about to be wrong.

[Caveat]

Calibration rests on exactly two Anthropic Research posts — and even one of them finds the agent “may not be stopping at the right moments.” It is a measured direction, not a proven feature — do not design a system that depends on the agent reliably escalating itself.

Escalation in automation: fail closed when no human is present

The fourth expression asks what happens to the gate when there is no human watching in real time — in headless runs and CI. The answer is a deliberate design with three parts.

First, the managed Code Review check is non-blocking by default. It always completes with a “neutral conclusion so it never blocks merging through branch protection rules.” [Official] Code Review · AnthropicT1-official original It posts findings; it does not, by itself, stop a merge. A reader could easily misread the review bot as “the thing that blocks bad merges” — it is not, unless a team wires it to be. Gating is an explicit opt-in: if you want to gate merges on findings, you “read the severity breakdown from the check run output in your own CI.” [Official] Code Review · AnthropicT1-official original Second, the merge itself stays a human action by design — the documented security best practice for the GitHub integration is to “review Claude’s suggestions before merging.” [Official] Claude Code GitHub Actions · AnthropicT1-official original The agent opens pull requests; humans merge them.

Third, and most important for safety: when there is genuinely no human to prompt, the gate fails closed. In a non-interactive run, a tool call not covered by the allowlist does not silently proceed — “otherwise the run aborts when one is attempted.” [Official] Run Claude Code programmatically · AnthropicT1-official original The unapproved action is refused, forcing a human to widen the allowlist or re-run with approval. The principle is consistent across all three parts: don’t auto-block by default, but — short of a deliberate bypass-permissions override — never let an unapproved action proceed unattended. The human gate does not vanish in automation; it degrades to fail-closed (unless an operator explicitly turns it off).

[Warning]

The managed Code Review check is non-blocking by default — it posts findings but does not gate the merge unless you explicitly wire a CI gate on its severity output. Do not assume the bot stops bad merges; it does not until you make it.

The workflow-on-model split

One distinction underlies all four expressions and is the actionable takeaway of the chapter. This chapter owns the oversight workflow; Vol 1 owns the permission model. The workflow decides when a human is consulted and what they review — the gate, plan mode, the escalation checkpoints. The model decides which actions need consulting in the first place — the Always / Ask / Never rules and the permissions.deny list that Vol 1’s guardrails established.

The two layers are easy to conflate because the same documentation pages carry both: the permission-modes page describes the pause-and-ask workflow and the rule catalogue side by side. But they are different objects, and designing them as two layers is the whole point. You tune which actions are gated in the permission model — so the gate fires on the genuinely irreversible step and not on every read — and you design how the human is brought in here, in the workflow. Conflating them is the most common framing error in agent oversight: teams either bury control logic in the wrong layer or assume that setting permission rules is the same as designing the human’s role at the gate. It is not. The model says what is risky; the workflow says what the human does about it.

Human-in-the-loop oversight as one move expressed four ways, layered on the permission model. The base layer is the permission MODEL — which actions need consulting (Always / Ask / Never; permissions.deny), owned by Vol-1 guardrails. Layered on top, the oversight WORKFLOW: (1) the approval gate (blocking, default-on, on irreversible actions); (2) plan mode (the gate moved earlier — read-only, propose, approve); (3) calibration (agent-initiated escalation, imperfect); (4) escalation in automation (non-blocking by default, opt-in merge gate, fail-closed when no human). Arrows down to the base show each expression riding on the permission model; the move running across the four is human control over the irreversible or wrong action.A base box spanning the width labeled 'The permission MODEL — which actions need consulting: Always / Ask / Never, permissions.deny, owned by Vol-1 guardrails'. Above it, under a banner reading 'The oversight WORKFLOW — one move, four expressions (when a human is consulted, what they review)', sit four boxes left to right: '1. Approval gate — blocking, default-on, on irreversible actions'; '2. Plan mode — the gate moved earlier: read-only, propose, approve'; '3. Calibration — agent-initiated escalation (imperfect)'; '4. Escalation in automation — non-blocking by default, opt-in merge gate, fail-closed when no human'. A thick arrow drops from each of the four boxes down to the permission-model base. A horizontal arrow runs across the four boxes labeled 'one move: human control over the irreversible or wrong action'.
Human-in-the-loop oversight as one move expressed four ways, layered on the permission model. The base layer is the permission MODEL — which actions need consulting (Always / Ask / Never; permissions.deny), owned by Vol-1 guardrails. Layered on top, the oversight WORKFLOW: (1) the approval gate (blocking, default-on, on irreversible actions); (2) plan mode (the gate moved earlier — read-only, propose, approve); (3) calibration (agent-initiated escalation, imperfect); (4) escalation in automation (non-blocking by default, opt-in merge gate, fail-closed when no human). Arrows down to the base show each expression riding on the permission model; the move running across the four is human control over the irreversible or wrong action.
Placing an oversight decision Worked example

A team says: “Our agent keeps doing things we didn’t want, but the approval prompts are so constant that everyone just hits ‘yes’ without reading. How do we fix oversight?”

Locate each part on the right layer before changing anything:

  • “Everyone just hits ‘yes’” is approval fatigue — the open trade-off, not a bug to patch. The gate is firing too often, so it has stopped being oversight and become a reflex. The fix is not a better prompt; it is to make the gate rare enough to stay meaningful.
  • Which actions fire the gate is a permission-model question — Vol 1’s territory. The move is to tighten the Always / Ask / Never classification so the gate fires on the genuinely irreversible step (a force-push, a DROP TABLE, a deploy) and not on every file read or safe edit. That is a model change, not a workflow change.
  • Reviewing direction before edits land is a plan-mode question — a workflow change. If the agent “keeps doing things we didn’t want,” moving the human decision earlier — review the proposed plan before any edit — catches a wrong direction up front instead of one wrong tool call at a time.
  • The agent stopping itself when unsure is calibration — and the honest answer is that you cannot rely on it. It is a sparse, imperfect pattern; it may help at the margin, but it is not the fix here.

The framing turns a panicked “fix oversight” into located moves: thin the gate in the model so it stays meaningful, and move the human decision earlier in the workflow. Notice the fix lives mostly in the layer this chapter does not own — which is exactly why the workflow-on-model split is the load-bearing distinction.

Conflating the workflow with the permission model

The most common oversight error is treating “set the permission rules” and “design the human’s role” as one task. They are two layers. The permission model (Vol 1) decides which actions are risky — Always / Ask / Never, permissions.deny. The oversight workflow (this chapter) decides what the human does about it — when the agent pauses, what the reviewer reads, when it fails closed. Bury the firing logic in the wrong layer and you get either a gate that fires on everything (approval fatigue, reflexive approval) or rules with no human meaningfully in the loop. Tune which actions are gated in the model; design how the human is consulted in the workflow.

Quick reference

  • One move, four expressions: human control over the irreversible or wrong action, inserted at a risky transition — as the approval gate, plan mode, calibration, and escalation in automation.
  • Approval gate: blocking, default-on; fires before irreversible actions; “pauses execution until you return a response”; the human is the reviewer, not a rubber stamp. Handle approvals and user input · AnthropicT1-official original
  • Open trade-off — not solved: the default ask-before-acting posture causes approval fatigue, which motivates skipping permissions; present the gate and the fatigue cost together. How we built Claude Code auto mode: a safer way to skip permissions · John Hughes (Anthropic) (2026)T1-official original
  • Plan mode = the gate moved earlier: read-only, “proposes a plan but makes no edits until you approve” — gates the whole change-set, not one call. Common workflows · AnthropicT1-official original
  • Calibration is sparse and imperfect: two Anthropic Research findings suggest the agent asks for clarification “more than twice as often” on the hardest tasks, but it “may not be stopping at the right moments” — a direction, not a guarantee. Measuring AI agent autonomy in practice · McCain, Millar, Huang et al. (Anthropic) (2026)T1-official original
  • Escalation fails closed: managed Code Review is non-blocking by default (gating is opt-in); the merge stays human; a headless run “aborts” on an unapproved tool call. Code Review · AnthropicT1-official original Run Claude Code programmatically · AnthropicT1-official original
  • Workflow-on-model: this chapter owns when a human is consulted and what they review; Vol 1’s permission model owns which actions need consulting. Tune the model; design the workflow.

Practice

Exercise

The chapter claims oversight is “one move expressed four ways.” Name the four expressions, and state in one sentence each what point in the agent’s action they insert the human decision at. Then explain the difference between the oversight workflow (this chapter) and the permission model (Vol 1) in a single sentence — and say which one decides which actions trip a gate.

Exercise

The chapter insists the default ask-before-acting posture is an open trade-off, not a solved best practice. Explain the tension in two or three sentences: what goes wrong if the approval gate fires too often, and what goes wrong if it fires too rarely? Why is the existence of an auto mode that skips permissions evidence that the right firing rate is unsettled?

Practice ◆◆◇◇

Take an agent you run or have read about. For each of the four oversight expressions, decide whether it currently applies and how: (1) Is there a blocking approval gate before its irreversible actions, and does it fire often enough to be ignored? (2) Could a read-only plan-mode review of the whole change-set replace catching risky calls one at a time? (3) Does the agent ever stop and ask when unsure — and would you trust it to? (4) If it runs in CI or headless, does an unapproved action fail closed or slip through? Then name the one change that would most improve oversight, and say whether that change lives in the workflow (this chapter) or the permission model (Vol 1).

Exercise solutions

Solution ↑ Exercise

The four expressions are the approval gate, plan mode, calibration, and escalation in automation. The approval gate inserts the human decision before a single irreversible tool call, at the moment it would execute. Plan mode inserts it before a whole change-set, while the agent is still read-only and has only proposed a plan. Calibration inserts it whenever the agent itself judges it is uncertain — the agent, not the operator, initiates the pause. Escalation in automation inserts it at the merge or the unapproved-tool boundary in a headless or CI run, where it fails closed if no human is present. The difference between the two layers: the oversight workflow (this chapter) decides when a human is consulted and what they review, while the permission model (Vol 1) decides which actions need consulting — so it is the model, not the workflow, that decides which actions trip a gate. The workflow rides on top of the model: the model classifies risk, the workflow brings the human in.

Solution ↑ Exercise

If the gate fires too often, the human is trained to approve on reflex — they stop reading the proposed command and click “yes” by habit, which produces the appearance of oversight without the substance and is more dangerous than no gate, because it licenses a false sense of safety. If the gate fires too rarely, an irreversible or wrong action slips through unreviewed, which is the exact failure the gate exists to prevent. The right firing rate sits between these, and the chapter’s evidence that it is unsettled is that the product ships both a default-on gate and a documented auto mode whose explicit purpose is to skip permissions because the default causes “approval fatigue”: if the default firing rate were correct, there would be no need to build a sanctioned way around it. The tension between “keep a human in control” and “let the agent run” is therefore a genuine, ongoing product trade-off — present the gate and its fatigue cost together rather than treating the default as a settled best practice. (Note the practical resolution lives mostly in the permission model — gating fewer, genuinely-risky actions — not in a cleverer prompt.)

Solution ↑ Exercise

A worked example. Take an agent that triages incident reports and can comment on tickets, run read-only diagnostic queries, and restart a service. (1) Approval gate: the service restart is irreversible-enough to warrant a blocking gate; commenting on a ticket is not — if the gate currently fires on every comment, it is firing too often and the on-call engineer will approve restarts on the same reflex they approve comments, which is the fatigue failure. (2) Plan mode: for a multi-step remediation, a read-only plan (“I will restart service X, then re-run check Y”) reviewed up front catches a wrong remediation direction before any restart happens — better than approving each step as it arrives. (3) Calibration: the agent might stop and ask when a diagnostic is ambiguous, but you would not trust it to reliably catch the case where it is about to restart the wrong service — calibration is sparse and imperfect, so it supplements the gate, it does not replace it. (4) Headless: if this runs unattended overnight, an un-allowlisted action must fail closed (abort) rather than restart a service no human approved. The single highest-value change: tighten the permission model so the blocking gate fires on the restart and not on comments — which makes the gate rare enough to stay meaningful. That change lives in the permission model (Vol 1), not the workflow — which is exactly why the workflow-on-model split is the load-bearing distinction: the most important oversight fix here is a model change, surfaced by a workflow symptom (reflexive approval).

Part 3 Chapter 27 Last verified 2026-06-14 Fresh

Security: The Adversarial-Input Layer

The adversarial-input layer — who is really issuing the instruction. Prompt injection and Willison's lethal trifecta as the necessary-conditions threat model; the incidents (EchoLeak, Comet, ShadowPrompt) as one attack shape; why detection-only fails by construction and design-by-construction is this volume's one genuine convergence; the honest residual that defenses reduce, not eliminate; and a supply chain whose trust the registry delegates to you. The authorized-but-forged counterpart to Vol 1's authorized-but-risky guardrails.

Volatility: architectural-pattern
Tools compared: claude-codecross-tool
Before you start: Vol 1's guardrails — the permission model that decides which actions an agent may attempt (Always / Ask / Never, the deny-list). This chapter is its counterpart: not what the agent is authorized to do, but whether an attacker can forge the instruction it follows.
You will learn
  • Why prompt injection is a structural problem, not a bug to be patched — the agent cannot reliably tell its principal’s instructions from untrusted content
  • The lethal trifecta as the necessary-conditions threat model, and why every robust defense cuts one of its three legs
  • Why the real incidents (EchoLeak, Comet, ShadowPrompt) are one attack shape, not three
  • Why detection-only fails by construction, and why design-by-construction is the one place this volume finds genuine convergence
  • The honest residual: defenses reduce, not eliminate — and a supply chain whose trust the registry hands to you

Vol 1’s guardrails answered “what may this agent attempt?” — the authorized-but-risky question, governed by the permission model. This chapter answers a different one: “who is really issuing the instruction the agent just followed?” That is the authorized-but-forged question. When an agent reads a web page, an email, or a tool result, it ingests text an attacker may control — and the thesis of this chapter is that the agent cannot, by construction, reliably tell that text apart from its operator’s commands. Security here is the discipline of defending a system that trusts its inputs more than it should.

The authorized-but-forged problem

Start with the definition. A prompt-injection vulnerability “occurs when user prompts alter the LLM’s behavior or output in unintended ways.” [Practitioner] LLM01:2025 Prompt Injection · OWASP Gen AI Security ProjectT3-practitioner original The community standard splits it in two: a direct injection is supplied by the user, while an indirect injection “occur[s] when an LLM accepts input from external sources, such as websites or files.” [Practitioner] LLM01:2025 Prompt Injection · OWASP Gen AI Security ProjectT3-practitioner original The indirect case is this chapter’s threat model, because it is the one the operator did not type: the dangerous content arrives through a web page the agent browses, a document it retrieves, or a tool result it reads back.

The reason this is not “just a bug to be patched” is structural. The model receives one undifferentiated stream of tokens; the operator’s instructions and the ingested content are the same kind of thing to it. There is no reliable, built-in channel that says this half is my principal and that half is data. Patching one injection string does not change that — the next phrasing slips through. This is why the rest of the chapter is about cutting the attack’s preconditions by design rather than spotting its signature after the fact.

Key idea

Prompt injection is not a string to blocklist; it is a consequence of the agent treating untrusted input as trusted instruction. So the question that organizes all of agent security is not “how do I detect the attack?” but “which of the conditions that make the attack catastrophic can I remove?”

The lethal trifecta

The sharpest statement of those conditions is Simon Willison’s lethal trifecta. [Practitioner] The lethal trifecta for AI agents: private data, untrusted content, and external communication · Simon Willison (2025)T3-practitioner original An agent becomes exfiltratable when it simultaneously has three capabilities: access to private data, exposure to untrusted content — “any mechanism by which text (or images) controlled by a malicious attacker could become available to your LLM” [Practitioner] The lethal trifecta for AI agents: private data, untrusted content, and external communication · Simon Willison (2025)T3-practitioner original — and a path for external communication. With all three present, an attacker can trick the agent “into accessing your private data and sending it to that attacker.” [Practitioner] The lethal trifecta for AI agents: private data, untrusted content, and external communication · Simon Willison (2025)T3-practitioner original

The framing is load-bearing for the whole chapter because it is a necessary-conditions model: the catastrophe needs all three legs at once. That has a direct design consequence — the cleanest defenses work by removing one leg: deny the private data, isolate the untrusted content so it cannot become instruction, or block the exfiltration path (and where no leg can be removed outright, hardening the model against the combination is the weaker fallback the next section covers). It also makes the incident landscape legible, because every real case below is the same three legs in a different costume.

Concept · The lethal trifecta

A coinage of Simon Willison’s (a practitioner framing, not a vendor or standards definition): an AI agent is dangerous to expose when it combines (1) access to private data, (2) exposure to attacker-controllable content, and (3) the ability to communicate externally. Any one leg alone is survivable; all three together let injected content read secrets and ship them out. Defenses are best understood as cutting a specific leg.

The lethal trifecta (Willison, a practitioner coinage). Three legs — private-data access, exposure to untrusted content, and an external-communication path — converge on catastrophic exfiltration only when all three are present. Each robust defense cuts one leg: deny the data, isolate the untrusted content, or block the exfiltration channel. Any single leg removed defuses the combination.A triangle with three corner nodes labeled 'Private-data access', 'Untrusted content (attacker-controllable)', and 'External-communication path'. Edges from all three converge on a central red node 'Catastrophic exfiltration', annotated 'needs all three legs'. Each corner carries a dashed 'cut here' label naming the defense that removes that leg: 'deny / scope the data', 'isolate — never let it become instruction', and 'block the exfiltration channel'. A footer reads 'remove any one leg and the combination is defused'.
The lethal trifecta (Willison, a practitioner coinage). Three legs — private-data access, exposure to untrusted content, and an external-communication path — converge on catastrophic exfiltration only when all three are present. Each robust defense cuts one leg: deny the data, isolate the untrusted content, or block the exfiltration channel. Any single leg removed defuses the combination.

The incidents are one attack shape

Indirect injection is not hypothetical, and the public incidents are best read as one attack instantiated three ways. EchoLeak is the keystone: its authoritative record describes an “Ai command injection in M365 Copilot [that] allows an unauthorized attacker to disclose information over a network,” CVE-2025-32711 · NVD (NIST National Vulnerability Database)T1-official original and the disclosing researchers reported that the chains “automatically exfiltrate sensitive and proprietary information from M365 Copilot context, without the user’s awareness or relying on any specific victim behavior” [Practitioner] Breaking down 'EchoLeak', the First Zero-Click AI Vulnerability Enabling Data Exfiltration from Microsoft 365 Copilot · Itay Ravia (Cato Networks / Aim Labs)T3-practitioner original — a zero-click realization of all three legs. Comet, an agentic browser, on a “summarize this page” action fed page content to its model “without distinguishing between the user’s instructions and untrusted content from the webpage,” [Practitioner] Agentic Browser Security: Indirect Prompt Injection in Perplexity Comet · Artem Chaikin and Shivan Kaul Sahib (Brave)T3-practitioner original the instruction/data confusion made concrete. ShadowPrompt disclosed “a vulnerability that allowed any website to silently inject prompts into [Claude’s Chrome extension] as if the user wrote them.” [Practitioner] ShadowPrompt: How Any Website Could Have Hijacked Claude's Chrome Extension · Oren Yomtov (Koi Research) (2026)T3-practitioner original And the generic exfiltration leg is old: the markdown-image channel, where “the individual controlling the data a plugin retrieves can exfiltrate chat history due to ChatGPT’s rendering of markdown images,” [Practitioner] ChatGPT Plugins: Data Exfiltration via Images and Cross Plugin Request Forgery · Johann Rehberger (wunderwuzzi) (2023)T3-practitioner original shows the “external communication” leg needs nothing more exotic than an auto-rendered image URL.

[Note]

Sourcing is deliberately tiered here. EchoLeak’s CVE identifier comes only from the NVD primary; the “zero-click” narrative is the disclosing security vendor. The published severity score is unsettled across sources, so this book quotes the description and omits the number.

Read together they are not a zoo of exotic exploits; they are the trifecta, again and again — private context plus attacker-controllable content plus a way out.

Why detection-only fails by construction

The tempting response is to add a classifier that flags malicious input. The literature is blunt that this is the wrong primary control. A formal analysis of known-answer detection “uncover[s] a structural vulnerability that invalidates its core security premise,” How Not to Detect Prompt Injections with an LLM · Choudhary, Anshumaan, Palumbo, Jha (2025)T3-practitioner original and the authors build an adaptive attack, DataFlip, that “consistently evades KAD defenses.” How Not to Detect Prompt Injections with an LLM · Choudhary, Anshumaan, Palumbo, Jha (2025)T3-practitioner original Independent empirical work against deployed commercial detectors reaches, “in some instances[,] up to 100% evasion success.” Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems · Hackett, Birch, Trawicki, Suri, GarraghanT3-practitioner original The lesson is not that detectors are merely imperfect in practice — it is that they are evadable by construction: a classifier that can be probed can be defeated, because the attacker optimizes against exactly the signal the classifier reads.

Design-by-construction is the backbone

If you cannot reliably spot the attack, you must build the defense in by construction — so that untrusted content cannot reliably act as instruction in the first place. This is the one place in this volume where multiple independent research groups converge on the same principle, so it is the one place the book tags genuine convergence.

Convergence

Multiple independent research groups — two academic, two from a model vendor (Meta), but mutually independent — agree that prompt injection is defended by construction, not detection: CaMeL separates control flow from data flow so “the untrusted data retrieved by the LLM can never impact the program flow” Defeating Prompt Injections by Design · Debenedetti, Shumailov, Fan, Hayes, Carlini, et al.T3-practitioner original and secures the agent “even when underlying models are susceptible to attacks” Defeating Prompt Injections by Design · Debenedetti, Shumailov, Fan, Hayes, Carlini, et al.T3-practitioner original ; Beurer-Kellner et al. propose “principled design patterns for building AI agents with provable resistance to prompt injection” Design Patterns for Securing LLM Agents against Prompt Injections · Beurer-Kellner, Buesser, Creţu, Debenedetti, et al.T3-practitioner original ; and Meta SecAlign pushes the same idea to the model itself, “the first fully open-source LLM with built-in model-level defense.” Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks · Chen, Zharmagambetov, Wagner, Guo (Meta)T3-practitioner original Three independent groups, three layers — system, pattern, model — one principle: build the defense in (cut a trifecta leg architecturally, or harden the model itself) rather than classify the input after the fact.

The actionable form is a single rule, and it is genuine convergence, not one vendor’s house style: [Convergence] Defeating Prompt Injections by Design · Debenedetti, Shumailov, Fan, Hayes, Carlini, et al.T3-practitioner original Design Patterns for Securing LLM Agents against Prompt Injections · Beurer-Kellner, Buesser, Creţu, Debenedetti, et al.T3-practitioner original Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks · Chen, Zharmagambetov, Wagner, Guo (Meta)T3-practitioner original do not buy a prompt-injection “detector” as your primary control — defend by construction, whether by cutting a trifecta leg architecturally or hardening the model itself. Runtime monitors still have a place, but as one layer among several: LlamaFirewall is described by its authors as “a final layer of defense,” LlamaFirewall: An open source guardrail system for building secure AI agents · Chennabasappa, Nikolaidis, Song, et al. (Meta)T3-practitioner original not a solution. The honest counterweight even from the design side is that the patterns “discuss their trade-offs in terms of utility and security” Design Patterns for Securing LLM Agents against Prompt Injections · Beurer-Kellner, Buesser, Creţu, Debenedetti, et al.T3-practitioner original — cutting a leg constrains what the agent can do, so security here is bought with capability, not for free.

Defenses reduce, not eliminate

No control in this chapter takes the risk to zero, and the most honest reading is that today’s safety margin is partly accidental. Anthropic’s own browser red-team is the cleanest illustration: “Browser use without our safety mitigations showed a 23.6% attack success rate when deliberately targeted by malicious actors,” [Official] Piloting Claude in Chrome · Anthropic (2025)T1-official original and with mitigations “we reduced the attack success rate of 23.6% to 11.2%.” [Official] Piloting Claude in Chrome · Anthropic (2025)T1-official original Those are first-party, self-reported figures, and the load-bearing fact is the residual: 11.2% is a large reduction, but it is not zero. A benchmark of web-agent security states the point even more starkly — “attacks partially succeed in up to 86% of the case[s], even [as] state-of-the-art agents often struggle to fully complete the attacker goals,” which the authors name “security by incompetence.” WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks · Evtimov, Zharmagambetov, Grattafiori, Guo, ChaudhuriT3-practitioner original In other words, part of why agents are not catastrophically exploited today is that they are not yet good enough at completing an attacker’s goal — a margin that erodes as agents improve. Treat prompt-injection defense as risk reduction and defense-in-depth, never as solved.

The honest tiering is itself the finding

This chapter’s evidence is deliberately mixed, and saying so is part of the result. The design-by-construction principle is genuine convergence across independent academic primaries — the strongest claim here. The incidents are illustrative secondary write-ups, with EchoLeak’s identifier anchored to the NVD primary. Anthropic’s attack-success-rate figures are first-party and self-reported, not an independent benchmark. And the lethal trifecta is an attributed practitioner coinage, not a vendor or standards definition. A reader should weight each claim by the tier it carries, not assume one uniform standard of proof.

The supply chain delegates trust to you

The last surface is the components an agent depends on. OWASP frames the integrity risk across “training data, models, and deployment platforms,” [Practitioner] LLM03:2025 Supply Chain · OWASP Gen AI Security ProjectT3-practitioner original and the modern instance is third-party MCP servers (with third-party skills an adjacent surface this chapter does not quantify). The load-bearing fact is first-party and candid: Anthropic reviews connectors against listing criteria “but does not security-audit or manage any MCP server,” [Official] Security · AnthropicT1-official original so the recommended posture is to write “your own MCP servers or [use] MCP servers from providers that you trust.” [Official] Security · AnthropicT1-official original Academic work corroborates the gap: malicious MCP servers are “easy to implement, difficult to detect with current tools, and capable of causing concrete damage.” When MCP Servers Attack: Taxonomy, Feasibility, and Mitigation · Zhao, Liu, Ruan, Li, Liang (2025)T3-practitioner original The conclusion is sharp — listing in a registry is not vetting. Install-time trust is the operator’s responsibility, and allowlisting by provenance is the control, not the marketplace.

[Caveat]

A figure of “~12–20% of skills are malicious” circulates in some discussions. This book does not assert it — no reliable primary supports that band. The honest, citable claim is qualitative: the registry does not audit, and malicious servers are hard to detect.

Buying a detector and calling it solved

Two failures travel together. The first is making a prompt-injection classifier your primary defense — it is evadable by construction, so an attacker who can probe it can pass it. The second is treating any mitigation as a solution: the residual is non-zero (11.2% in Anthropic’s own browser numbers), and part of today’s safety is “security by incompetence” that improving agents will erode. The defensible posture is design-by-construction (cut a trifecta leg) plus defense-in-depth (layer the imperfect controls), held with the explicit knowledge that the risk is reduced, not removed.

Quick reference

  • Threat model: prompt injection is structural — the agent cannot reliably separate its principal’s instructions from ingested content; LLM01:2025 Prompt Injection · OWASP Gen AI Security ProjectT3-practitioner original the indirect (external-content) case is the danger.
  • The hinge — the lethal trifecta: private data + untrusted content + an exfiltration path; all three are needed, so cut one leg. The lethal trifecta for AI agents: private data, untrusted content, and external communication · Simon Willison (2025)T3-practitioner original
  • One attack shape: EchoLeak, CVE-2025-32711 · NVD (NIST National Vulnerability Database)T1-official original Comet, Agentic Browser Security: Indirect Prompt Injection in Perplexity Comet · Artem Chaikin and Shivan Kaul Sahib (Brave)T3-practitioner original ShadowPrompt, ShadowPrompt: How Any Website Could Have Hijacked Claude's Chrome Extension · Oren Yomtov (Koi Research) (2026)T3-practitioner original and the markdown-image channel ChatGPT Plugins: Data Exfiltration via Images and Cross Plugin Request Forgery · Johann Rehberger (wunderwuzzi) (2023)T3-practitioner original are the same three legs.
  • Detection fails by construction: known-answer detection is structurally evadable; How Not to Detect Prompt Injections with an LLM · Choudhary, Anshumaan, Palumbo, Jha (2025)T3-practitioner original deployed detectors hit “up to 100%” evasion in some instances. Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems · Hackett, Birch, Trawicki, Suri, GarraghanT3-practitioner original
  • The one convergence — design-by-construction: cut a leg architecturally (CaMeL / design patterns / model-level) Defeating Prompt Injections by Design · Debenedetti, Shumailov, Fan, Hayes, Carlini, et al.T3-practitioner original Design Patterns for Securing LLM Agents against Prompt Injections · Beurer-Kellner, Buesser, Creţu, Debenedetti, et al.T3-practitioner original Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks · Chen, Zharmagambetov, Wagner, Guo (Meta)T3-practitioner original — don’t buy a detector as your primary control. (The convergence tag sits on this claim in the body.)
  • Reduce, not eliminate: Anthropic’s browser ASR fell 23.6% → 11.2%, not to zero; Piloting Claude in Chrome · Anthropic (2025)T1-official original WASP’s “security by incompetence” is a margin that erodes as agents improve. WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks · Evtimov, Zharmagambetov, Grattafiori, Guo, ChaudhuriT3-practitioner original
  • Supply chain: the registry “does not security-audit … any MCP server” Security · AnthropicT1-official original — listing is not vetting; allowlist by provenance.

Practice

Exercise

State the lethal trifecta’s three legs, and explain why it is a necessary-conditions model rather than a list of three separate risks. Then take one incident from the chapter (EchoLeak, Comet, or ShadowPrompt) and name which leg each part of it supplies.

Exercise

The chapter says detection-only defenses fail “by construction,” not merely “in practice.” Explain the difference, and why it implies design-by-construction must carry the load. Why is it still defensible to run a detector like a runtime monitor at all?

Practice ◆◆◆◇

Take an agent you run or have read about that browses the web or reads external documents. Locate it on the lethal trifecta: does it have (1) access to private data, (2) exposure to attacker-controllable content, and (3) an external-communication path? For whichever legs are present, write down the single most practical leg you could cut by design (not detect) — and state honestly what capability you would lose by cutting it. If all three legs are present and you cannot cut any, say so: that is the finding.

Exercise solutions

Solution ↑ Exercise

The three legs are access to private data, exposure to untrusted (attacker-controllable) content, and a path for external communication. It is a necessary-conditions model because the catastrophe — an attacker reading your secrets and shipping them out — requires all three at once: private data with no untrusted content has nothing to hijack it; untrusted content with no private-data access has nothing to steal; either of those with no exfiltration path cannot get the data out. That is exactly why the defensive move is to remove one leg rather than to harden all three. Mapping EchoLeak: the private data is the M365 Copilot context; the untrusted content is the attacker’s email/content that Copilot ingests; the exfiltration path is the network disclosure the CVE describes (“disclose information over a network”). The same trio appears in Comet (page content as untrusted input, browser tools as the exfiltration path) and ShadowPrompt (any website injecting prompts “as if the user wrote them,” with the extension’s reach as the data/exfiltration surface).

Solution ↑ Exercise

“Fails in practice” would mean a detector that is merely imperfect — catches most attacks, misses some, and improves with more training. “Fails by construction” is stronger: the design of a detection-only defense contains the vulnerability, independent of how good the classifier is. Known-answer detection has a “structural vulnerability that invalidates its core security premise,” and an adaptive attacker who can probe the classifier optimizes directly against the signal it reads — DataFlip “consistently evades” it, and deployed detectors have been evaded “up to 100%” in some instances. Because the failure is structural, throwing a better classifier at it does not close the hole; you must instead make untrusted content structurally unable to act as instruction (cut a trifecta leg) — design-by-construction. A detector is still defensible as one layer of defense-in-depth — a runtime monitor that raises the attacker’s cost and catches the unsophisticated cases — exactly as LlamaFirewall positions itself as “a final layer of defense.” What is indefensible is making that evadable layer your primary control.

Solution ↑ Exercise

A worked example. Take a research agent that browses the web and can post to an internal Slack. Trifecta audit: (1) private data — yes, it has the team’s internal context and Slack history; (2) untrusted content — yes, it reads arbitrary web pages; (3) external communication — yes, web fetches and Slack posts can both carry data out. All three legs are present, so it is exfiltratable. The most practical leg to cut by design is usually (2)→instruction: run the browsing in a mode where retrieved page text is structurally treated as data, never as instruction (e.g., a control/data-flow separation so fetched content cannot trigger tool calls), which is the CaMeL-style move. The capability cost is real and must be named: the agent can no longer act on instructions it finds on a page — including legitimate ones like “see the linked doc for the full spec” — so some autonomy is lost. If your harness cannot enforce that separation, the honest fallbacks are to cut leg (1) (scope the agent’s data access so a successful injection steals little) or leg (3) (remove the outbound channel — no Slack post, no arbitrary fetch — so exfiltration has nowhere to go). If you genuinely cannot cut any leg, the correct output of this exercise is to say so plainly: the agent is exposed, and the residual risk must be accepted, escalated to a human gate (ch26), or the deployment reconsidered — not papered over with a detector.

Part 3 Chapter 28 Last verified 2026-06-14 Fresh

Operating the Whole: Eval + Ops as One Loop

The Volume 3 capstone — the five operational surfaces as one closed operate-and-improve loop. A production failure surfaces in the session log, becomes an eval case, and drives a fix bounded by cost, oversight, and security, then it is measured again. An honest map of where Vol 3's evidence stands, the unsolved trade-offs the discipline navigates rather than solves, and a short close on Design v1.0.

Volatility: architectural-pattern
Tools compared: claude-codecross-tool
Before you start: The five surfaces of this volume — eval (ch22–23), observability (ch24), cost (ch25), human-in-the-loop (ch26), and security (ch27). This capstone composes them; it introduces no new evidence.
You will learn
  • How the five operational surfaces close into one operate-and-improve loop, not five separate checklists
  • The loop in motion — how a production failure becomes an eval case becomes a bounded fix
  • An honest map of where Vol 3’s evidence actually stands, tier by tier
  • The unsolved trade-offs the discipline navigates rather than solves
  • Where Design v1.0 ends, and what it sets up

This is the capstone of the Evaluation & Operations volume, and of Design v1.0. It adds no new evidence — every citation points back to a source an earlier chapter already established; its job is to show that the five surfaces ch21 introduced — measure, see, spend, oversee, defend — are not a list you tick once but a loop you run continuously. The argument of the whole volume reduces to one shape: a production failure becomes a measurement becomes a fix becomes a new measurement, and the operational surfaces are the instruments that close that loop.

The five surfaces close a loop

ch21 laid the surfaces out as an arc — measure → see → spend → oversee → defend. Read once, left to right, that looks like a pipeline with an end. It is not. The output of operating an agent is information about how it failed, and that information flows back to the start: a wrong answer or a near-miss in production is exactly the raw material an eval is made of.

So the surfaces close. Observability (ch24) is where a failure first becomes visible — the session log is the ground truth a regression is read from. Monitoring · AnthropicT1-official original Eval (ch22–23) is where that observed failure is turned into a repeatable measurement — a new test case, derived from a real failure, exactly as the eval discipline prescribes. Demystifying evals for AI agents · Grace, Hadfield, Olivares and De Jonghe (Anthropic)T1-official original The fix that follows is bounded by the other three surfaces: it has a cost (ch25), it may need a human gate if it touches something irreversible (ch26), and it must not open a trifecta leg (ch27). Then you measure again. That is the loop.

Key idea

Operating an agent is a closed loop, not a one-time setup. Failures seen in production (observability) become measurements (eval) that drive fixes — fixes priced by cost, gated by oversight, and constrained by security — whose effect is then measured again. The five surfaces are the instruments; the loop is the discipline.

Concept · The operate-and-improve loop

The closed cycle that operating an agent actually is: a production failure is seen (observability), turned into a repeatable measurement (eval), and fixed within the budgets that cost, oversight, and security impose — after which the fix is measured again. Each pass should leave the eval suite stronger, because every real failure becomes a permanent test. Skipping the loop-back step — fixing without adding the eval — is how the same regression returns.

The loop in motion

Trace one turn of it. An agent ships a subtly wrong result. It surfaces because someone reads the session log — the transcript is the one record every other surface derives from (ch24). You reproduce the failure, and instead of patching by hand and moving on, you make it a case: a small, unambiguous eval task drawn from this real failure (ch22 for a prompt, ch23 for a trajectory), so that “fixed” becomes something you can measure rather than something you feel. You change the prompt, the tool, or the guardrail. Now the other three surfaces bound the change: you check that the fix has not ballooned the input context that drives cost (ch25); if the fix lets the agent take a more irreversible action, you put a human gate in front of it (ch26); and you confirm the fix has not handed an attacker a new leg of the trifecta (ch27). Finally you re-run the eval. The suite is now one case stronger, and the loop is ready for the next failure.

The operate-and-improve loop. A production run's failure is seen in the session log (observability, ch24), turned into a repeatable eval case (ch22–23), and fixed — with the fix bounded by cost (ch25), oversight (ch26), and security (ch27) — after which it is measured again and the loop returns to production. Each pass leaves the eval suite one case stronger.A clockwise cycle of four stages. Stage 1 'Production run'. Stage 2 'See the failure — session log (ch24)'. Stage 3 'Make it an eval case (ch22–23)'. Stage 4 'Fix it', with three bounding labels attached — 'cost (ch25)', 'oversight (ch26)', 'security (ch27)'. An arrow returns from stage 4 to stage 1 labeled 'measure again'. A note in the centre reads 'each pass leaves the eval suite one case stronger'.
The operate-and-improve loop. A production run's failure is seen in the session log (observability, ch24), turned into a repeatable eval case (ch22–23), and fixed — with the fix bounded by cost (ch25), oversight (ch26), and security (ch27) — after which it is measured again and the loop returns to production. Each pass leaves the eval suite one case stronger.

An honest map of the evidence

A capstone should say plainly how well-founded its own volume is, because Vol 3’s evidence is deliberately uneven, and ch21 promised to keep saying so. This is the book’s reading of where the evidence stands — a synthesis, not a new sourced claim.

  • Eval, observability, cost, and oversight are first-party-authoritative, not triangulated. The eval discipline is Anthropic methodology; Demystifying evals for AI agents · Grace, Hadfield, Olivares and De Jonghe (Anthropic)T1-official original observability is Claude Code mechanics over the OpenTelemetry spec; cost is Anthropic’s own economics. These are the definitive sources for how their systems behave — but a single authoritative voice is not the same as independent agreement, and the volume never dressed it as such.
  • The cost multipliers are one workload’s measurements. The roughly fifteen-times token figure for multi-agent systems is a single first-party datapoint on Anthropic’s research workload, How we built our multi-agent research system · Anthropic (2025)T1-official original a modeling input, not a universal constant.
  • Security is the one genuine convergence — and still unsolved. That you defend by construction rather than detection is asserted by multiple independent research groups, and the lethal trifecta The lethal trifecta for AI agents: private data, untrusted content, and external communication · Simon Willison (2025)T3-practitioner original names why the architectural move works; Defeating Prompt Injections by Design · Debenedetti, Shumailov, Fan, Hayes, Carlini, et al.T3-practitioner original this is the volume’s one place to claim agreement across sources. Yet even there the residual is non-zero — Anthropic’s own browser attack-success rate fell to 11.2%, not to nothing Piloting Claude in Chrome · Anthropic (2025)T1-official original — and supply-chain trust is delegated to the operator, since the registry “does not security-audit … any MCP server.” Security · AnthropicT1-official original

The shape of the evidence is itself a finding: operations is the part of building agents where the guidance is most authoritative and least triangulated at once.

The unsolved trade-offs

The loop runs against three tensions this volume could not dissolve, only name — and navigating them per workload is the actual skill.

  • Autonomy ↔ control. Every gate that keeps a human over an irreversible action also slows the agent and risks approval fatigue; ch26 presented this as an open trade-off, not a solved one. More autonomy is more throughput and more unreviewed risk; the right point is workload-specific.
  • Cost ↔ performance. Token spend buys capability — a multi-agent system can be worth its roughly fifteen-times burn on a high-value task How we built our multi-agent research system · Anthropic (2025)T1-official original — but the same spend is pure waste on a task that never needed it. The lever is the same; only the task’s value decides.
  • Utility ↔ security. Cutting a trifecta leg by construction is the robust defense (ch27), but it constrains what the agent may do — the design patterns come with explicit utility/security trade-offs. A perfectly safe agent that cannot act is as useless as a capable one that leaks.
The trade-offs are the discipline

None of these three has a universal answer, and a reference book that pretended otherwise would be lying. The discipline is not resolving autonomy-vs-control, cost-vs-performance, or utility-vs-security once and for all — it is locating this workload on each axis, with the instruments of the five surfaces, and re-locating it as the workload changes. The operate-and-improve loop is how you keep that calibration honest over time.

[Note]

The “operate-and-improve loop” and the “measure → see → spend → oversee → defend” arc are this book’s framing of the dossiers’ separate treatments — a way to read the volume as one system, not terms the primary sources use.

Design v1.0, complete

This chapter closes the third volume, and with it Design v1.0. The arc was deliberate. Vol 1 — Environment & Context engineered what surrounds the model: the environment an agent acts in and the context it reasons over. Vol 2 — Tools & Orchestration took the harness’s two remaining axes: the capability an agent reaches for, and the coordination of more than one agent. Vol 3 — Evaluation & Operations took what is left once the system runs: how you know it works, see what it did, pay for it, keep a human over it, and defend it. Together they are one engineering discipline — building agentic systems that are not just capable but measured, operable, and honest about their limits.

What v1.0 does not do is re-traverse this material through specific real-world problems — the applied, problem-first volume that comes next. But the discipline is the foundation that volume will stand on: you cannot operate what you cannot measure, and you cannot improve what you do not operate as a loop.

Treating operations as setup

The most common way the loop breaks is treating its surfaces as one-time configuration — stand up a dashboard, write a few evals, set a permission policy, ship. Operations is not setup; it is the continuous conversion of production failures into measurements into fixes. The tell is a regression that keeps coming back: it returns because the failure was patched but never made into an eval case, so nothing measures whether it stays fixed. Close the loop, or it is not a loop.

Quick reference

  • One loop, not five checklists: see the failure (ch24) → make it a measurement (ch22–23) → fix it within cost/oversight/security budgets (ch25/26/27) → measure again.
  • Every failure becomes a permanent eval case — that is what leaves the suite stronger each pass; skipping it is why regressions return.
  • The evidence map: eval/observability/cost/oversight are first-party-authoritative, not triangulated; Demystifying evals for AI agents · Grace, Hadfield, Olivares and De Jonghe (Anthropic)T1-official original security is the one genuine convergence; Defeating Prompt Injections by Design · Debenedetti, Shumailov, Fan, Hayes, Carlini, et al.T3-practitioner original the ~15× is one workload’s datapoint; How we built our multi-agent research system · Anthropic (2025)T1-official original defenses reduce, not eliminate. Piloting Claude in Chrome · Anthropic (2025)T1-official original
  • Three unsolved trade-offs: autonomy ↔ control, cost ↔ performance, utility ↔ security — navigated per workload, not solved.
  • Design v1.0 = Vols 1–3: environment & context → tools & orchestration → evaluation & operations, one engineering discipline.

Practice

Exercise

The chapter claims the five surfaces “close a loop” rather than forming a pipeline. Describe the loop in one sentence per stage, and name the single step whose omission turns the loop back into a one-time setup. Why does omitting that step cause regressions to recur?

Practice ◆◆◆◇

Take an agent you operate or have read about and walk it once around the operate-and-improve loop for a single real (or plausible) failure: where would the failure first be seen, what eval case would you derive from it, what fix would you make, and how would each of cost, oversight, and security bound that fix? Then place that agent on the three unsolved trade-offs — where does it sit on autonomy↔control, cost↔performance, utility↔security, and what would move it? The goal is to feel operations as a continuous loop with named tensions, not a setup checklist.

Exercise solutions

Solution ↑ Exercise

The loop: (1) a production run produces a failure; (2) the failure is seen in the session log (observability); (3) it is turned into a repeatable eval case derived from the real failure; (4) it is fixed, with the fix bounded by cost (don’t balloon input context), oversight (gate it if it is irreversible), and security (don’t open a trifecta leg); and then you measure again, which returns to step 1 with a stronger suite. The step whose omission breaks the loop is turning the failure into an eval case (step 3): if you fix the failure without adding a test that measures whether it stays fixed, nothing closes the loop back to measurement. Regressions then recur because the only record that the bug was ever fixed is the patch itself — there is no standing measurement that fails if a later change reintroduces it, so the same failure can return unnoticed until it shows up in production again. The eval case is what converts a one-time patch into a permanent guarantee.

Solution ↑ Exercise

A worked example. Take a customer-support agent that drafts replies and can issue refunds. Failure seen: a customer reports the agent promised a refund it should have escalated; you find it by reading the session log of that conversation (ch24) — the transcript shows the tool call and the reasoning. Eval case derived: a trajectory eval (ch23) built from that exact transcript — given this customer message and account state, the agent must escalate, not auto-refund — plus, if the root cause was prompt wording, a prompt-level case (ch22) on the instruction that misfired. Fix: tighten the policy in the system prompt and require a tool precondition. Bounded by: cost (ch25) — the tighter prompt adds context tokens on every call, a small permanent cost to weigh; oversight (ch26) — a refund above a threshold is irreversible, so it now hits an approval gate rather than firing autonomously; security (ch27) — confirm the refund tool cannot be triggered by injected content in a customer message (an untrusted-content leg), or the fix has opened a hole. Re-measure: run the new eval cases; “fixed” is now a green test, not a hope. Trade-offs: on autonomy↔control the refund gate moves it toward control (slower, safer); on cost↔performance the richer prompt is a deliberate small spend for accuracy; on utility↔security gating refunds costs some self-service utility to close an exfiltration-adjacent risk. What would move it: higher refund volume might justify a calibrated auto-approve threshold (back toward autonomy) once the eval suite is trusted enough to catch regressions. The point of the exercise is that the fix is never just a prompt edit — it is a loop pass with three budgets and three tensions, all named.