⌕
The Claude Architect's Reference Claude Certified Architect — Foundations (D1–D5)
All chapters

Part 1

  1. Ch1 Agentic Loops: stop_reason and Tool-Result Handling
  2. Ch2 Coordinator–Subagent Patterns: Hub-and-Spoke and Isolated Context
  3. Ch3 Subagent Invocation: AgentDefinition, the Agent Tool, and allowedTools
  4. Ch4 Multi-Step Workflows: Programmatic vs Prompt-Based Handoff
  5. Ch5 Agent SDK Hooks: Intercepting, Gating, and Normalizing the Loop
  6. Ch6 Task Decomposition: Sequential Pipelines vs Adaptive
  7. Ch7 Session State: resume, fork, and Scratchpads

Part 2

  1. Ch1 Effective Tool Interfaces: Descriptions, Boundaries, and Naming
  2. Ch2 Structured Error Responses: isError, Retryability, and the Protocol-Error Split
  3. Ch3 Tool Distribution and tool_choice: auto, any, Forced, and none
  4. Ch4 MCP Server Configuration: .mcp.json, Scopes, and Env-Var Expansion
  5. Ch5 Built-in Tools: The Roster, Execution Order, and Permission Gating

Part 3

  1. Ch1 CLAUDE.md Hierarchy & @import: Four Scopes That Concatenate
  2. Ch2 Slash Commands & Skills: Stored Prompts, Lazy-Loaded Capabilities
  3. Ch3 Path-Scoped Rules: Modular, Glob-Triggered Instructions
  4. Ch4 Plan Mode vs Direct Execution: Research Before You Edit
  5. Ch5 Iterative Refinement: The Loop, the Interview, and Test-Driven Prompting
  6. Ch6 CI/CD Integration: Headless Runs, Output Formats, and GitHub Actions

Part 4

  1. Ch1 Explicit Criteria over Vague Instructions
  2. Ch2 Few-Shot Prompting for Ambiguous Cases
  3. Ch3 Structured Output via Tool Use and JSON Schema
  4. Ch4 Validation, Retry, and Feedback Loops
  5. Ch5 Batch Processing: The Message Batches API
  6. Ch6 Multi-Pass Review: Independent Reviewers and Attention Dilution

Part 5

  1. Ch1 Long-Conversation Context: Accumulation, Degradation, Compaction
  2. Ch2 Escalation and Ambiguity Resolution
  3. Ch3 Error Propagation Across Multi-Agent Systems
  4. Ch4 Large-Codebase Context: Compaction, Scratchpads, Delegation
  5. Ch5 Human Review and Confidence Calibration
  6. Ch6 Information Provenance: Citations and Temporal Validity
Part 1 Chapter 1 Last verified 2026-06-02 Fresh

Agentic Loops: stop_reason and Tool-Result Handling

The first chapter of Domain 1 and the substrate the rest of the book assumes — the agent loop as a control structure whose branch condition is stop_reason. Teaches the tool-use round-trip from first principles, the turn model, error and parallel tool-result handling, termination budgets, and every stop_reason value an architect must recognize.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: You can call the Claude Messages API (or the Agent SDK) and have seen a tool definition. No orchestration experience assumed — this chapter builds the loop from the ground up.
You will learn
  • Define what an agent is, and why “a model calling tools in a loop” is the whole architecture
  • Trace one full pass of the loop: the tool_use → execute → tool_result round-trip, and the condition that ends it
  • Distinguish a turn from an API request, and predict how max_turns counts
  • Handle the two tool-result cases the exam probes: a failed call (is_error) and parallel calls (return all results together)
  • Identify every stop_reason value and say what it means for whether the loop continues
  • Design a termination policy (max_turns / max_budget_usd + a subtype check) that fails safe

Domain 1 is the largest slice of the exam (27%), and this chapter is its floor: every later topic — subagents, workflows, hooks, session state — is a variation on the loop defined here. The exam tests whether you can read the loop’s control flow, not whether you can recite an SDK signature. We build the loop from the definition of an agent, trace it end to end with a worked example, name the stop_reason values that branch it, handle the error and parallel result cases, and fix the termination contract an architect owns.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. In one sentence, what is an agent, and what is the single branch condition that drives its loop?
  2. A session makes three tool calls and then returns a text answer. How many turns is that, and what is the smallest max_turns that lets it finish?
  3. A tool throws an exception inside your handler. What do you send back to the model, and what field do you set?
  4. The model returns two tool_use blocks in one response. May you answer one now and the other next turn?
  5. Your result object has subtype: "error_max_turns". Is .result populated? What should your code check before reading it?
Check your answers
  1. An agent is an ordinary model in a loop — “LLMs using tools based on environmental feedback in a loop.” The branch condition is stop_reason: tool_use continues the loop, end_turn ends it.
  2. Three turns — turns count tool-use round-trips, and the final text-only answer is not a turn. The smallest budget that finishes is max_turns = 3.
  3. A normal tool_result block carrying the error text as content, with is_error: true — never throw out of the loop; the model reads the failure and adapts.
  4. No — every tool_use block in a response needs its tool_result in the next user message, all returned together, keyed by tool_use_id.
  5. .result is empty on every error_* subtype. Check subtype == "success" before reading it.
[Tip]

A “Do I know this already?” pass is pre-testing: trying to retrieve before reading strengthens memory even when you guess wrong. Don’t skip it to save time.

What an agent is

Start with the definition, because the loop falls out of it. Anthropic’s Building Effective Agents draws the line that organizes this entire domain: “Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original A workflow runs on rails you laid down in code; an agent decides its own next step at runtime.

Operationally, that autonomy has a simple shape: “Agents are typically just LLMs using tools based on environmental feedback in a loop.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original The model proposes an action, your code runs it, the result of running it becomes the model’s next input, and the model decides again. That feedback cycle — act, observe, decide — is the agent. Everything else in Domain 1 is a way of shaping, bounding, or distributing it.

Key idea

An agent is not a special model; it is an ordinary model placed in a loop where its tool calls are executed and the results fed back. “Workflow vs agent” is the architect’s first design decision: a predefined code path (a workflow) when the steps are known, a model-driven loop (an agent) when they are not.

The loop is a control structure

Because the agent is the loop, the loop is the thing you reason about — and it has exactly one branch condition. At the Messages API level, Claude returns stop_reason: "tool_use" together with one or more tool_use blocks; your application executes each call and returns tool_result blocks on the next user turn. [Official] Tool use with Claude · AnthropicT1-official original The loop repeats that exchange until Claude responds without a tool call.

Concept ·
  1. Send tools + the user message.
  2. Claude returns stop_reason: "tool_use" + a tool_use block (id, name, input).
  3. Your code executes the call.
  4. The next request appends the assistant turn (echoing the tool_use) and a user turn carrying tool_result blocks ({type, tool_use_id, content, is_error?}).
  5. Repeat until stop_reason: "end_turn".

The architectural point is that the model decides what happens next, but your code decides whether it gets to. Tool access is “one of the highest-leverage primitives you can give an agent,” [Official] Tool use with Claude · AnthropicT1-official original and the loop is where that leverage is either contained or left unbounded. Owning the loop — not authoring any single tool — is the orchestration discipline the rest of Domain 1 elaborates.

Tracing the loop: 'fix the failing tests' Worked example

Claude Code’s own documentation walks one task through the loop. Asked to fix failing tests, the model chains tool calls, each result informing the next decision:

  1. Run the test suite — Bash(npm test) → stop_reason: "tool_use"; the result is the failure output.
  2. Read the error output — the model now knows which tests fail and why.
  3. Search for the relevant source files — Grep for the failing symbol.
  4. Read those files — Read to understand the code.
  5. Edit the files to fix the issue — Edit.
  6. Run the tests again to verify — Bash(npm test); green this time → the model responds with text and stop_reason: "end_turn".

“Each tool use gives Claude new information that informs the next step.” [Official] How Claude Code works · AnthropicT1-official original Read top to bottom, this is five tool-use turns (steps 1–6 minus the final text) followed by one free text answer. Nothing in the path was pre-programmed — the model chose each tool from the previous result. That is an agent.

A turn is one tool-use round-trip

The word turn has a precise meaning, and the exam leans on it. A turn is “one round trip inside the loop: Claude produces output that includes tool calls, the SDK executes those tools, and the results feed back to Claude automatically … Turns continue until Claude produces output with no tool calls.” [Official] How the agent loop works · AnthropicT1-official original The Agent SDK embeds this loop for you — it ships “the same tools, agent loop, and context management that power Claude Code” [Official] Agent SDK overview · AnthropicT1-official original — so at the SDK level you observe messages, not the raw branch.

The consequence that trips candidates: a text-only final response is not a turn. A four-message session is three tool-use turns plus one final text answer, so max_turns=2 would stop before that final step. [Official] How the agent loop works · AnthropicT1-official original Size a turn budget to the tool calls a task needs, not to the messages you expect to see.

[Tip]

Exam reflex: count turns by tool-use round-trips, not by API requests or messages. The final text answer never counts against max_turns.

Assuming max_turns counts the final answer

Setting max_turns to the number of messages you expect off-by-ones every budgeted agent. The limit counts tool-use round-trips; size it to the tool calls a task needs, then add headroom — the final text response is free.

Handling tool results: errors and parallel calls

Step 4 of the round-trip hides two cases the exam specifically tests, because both are where a hand-written loop goes wrong.

A failed tool does not raise — it reports. When your handler hits an error (the file is missing, the command exits non-zero, the API times out), you do not throw out of the loop. You return a normal tool_result block with is_error: true and the error text as its content. [Official] Tool use with Claude · AnthropicT1-official original The model sees the failure and adapts — retries with a corrected argument, tries a different tool, or explains the blocker — exactly as it would read any other result. Raising an exception instead severs the loop and throws away the model’s ability to recover.

Throwing on a tool failure instead of returning is_error

A handler that lets an exception propagate turns a recoverable tool failure into a dead agent. The loop’s whole value is that the model reads results and adapts; an error is just a result. Return { "type": "tool_result", "tool_use_id": …, "content": "<error text>", "is_error": true } and let the model decide what to do next.

Parallel calls must be answered together. When a single response contains more than one tool_use block, the API requires every corresponding tool_result in the next user message — you cannot answer one tool now and defer the other to a later turn. Execute them (concurrently when they are read-only and independent; serially when one mutates state another reads), collect all results, and send them as one batch. The mechanics of when to parallelize live in D2.3 — for the loop, the rule to hold is: all of a turn’s results return together, keyed by tool_use_id.

Key idea

The tool_use_id is the join key of the whole protocol. A failed call is a tool_result with is_error: true; parallel calls are several tool_result blocks in one user message, each matched to its tool_use by tool_use_id. Drop or defer one and the next request is malformed.

stop_reason is the branch

Because stop_reason is the loop’s branch condition, recognizing each value — and its loop consequence — is core exam material. On a ResultMessage, stop_reason carries the value from the model’s last response. [Official] How the agent loop works · AnthropicT1-official original

stop_reasonWhat it meansLoop consequence
tool_useThe response contains tool callsContinue — execute tools, return tool_result, request again
end_turnThe model finished naturally, no tool callsStop — deliver the final result
max_tokensOutput hit the token budget mid-responseStop, but the answer is truncated — you may need to continue
refusalThe model declined to generateStop — handle as a non-answer, not a result
[Note]

The full Messages API set includes further values (e.g. stop_sequence, pause_turn); confirm the current list against the API reference before relying on one in production. Knowing where the authoritative list lives is itself an exam-useful habit.

Not branching on tool_use

If your client treats every response as a final answer, it silently drops the model’s tool calls and the agent never acts. The first thing a hand-written loop must do is check stop_reason == "tool_use" and route to tool execution before reading any text.

Termination is the architect’s safety contract

A model-driven loop that can run forever is a production incident waiting to happen, so the architect supplies the stop conditions the model cannot. The SDK exposes two budgets: max_turns (a turn count) and max_budget_usd (a client-side cost estimate). Hitting either ends the loop and sets ResultMessage.subtype to error_max_turns or error_max_budget_usd. [Official] How the agent loop works · AnthropicT1-official original

The subtype is the termination indicator, and it gates whether .result is even populated:

subtypeMeaning.result?
successNormal finishyes
error_max_turnsHit max_turnsno
error_max_budget_usdHit max_budget_usdno
error_during_executionAPI / cancellation errorno
error_max_structured_output_retriesJSON-Schema validation failed past the retry limitno
Reading .result before checking subtype

On every error subtype, .result is empty. Code that prints message.result without first checking subtype == "success" reports a blank “answer” on a budget exhaustion or execution error — a silent failure that hides the real outcome. Branch on subtype first, always.

Key idea

The model owns what to try next; the architect owns whether the loop may continue and how it ends. A safe agent pairs a max_turns / max_budget_usd budget with a subtype-first result check — so exhaustion is reported, never mistaken for an answer.

Where Claude Code’s loop sits

Claude Code is one concrete harness around this loop. Its documentation states the architecture plainly: “The agentic loop is powered by two components: models that reason and tools that act.” [Official] How Claude Code works · AnthropicT1-official original The same source makes the dependency explicit — “Tools are what make Claude Code agentic. Without tools, Claude can only respond with text” [Official] How Claude Code works · AnthropicT1-official original — and the SDK’s account agrees that “Tools are the primary building blocks of execution for your agent.” [Official] Building agents with the Claude Agent SDK · Thariq Shihipar et al. (2025)T1-official original

For the exam, treat the loop in this chapter as the inner cycle. A long-running harness wraps it in an outer cycle that carries state across many context windows — but the branch condition, the turn model, and the termination contract are identical at every scale. Master the inner loop and the rest of Domain 1 is composition.

Practice

Exercise

A session runs: (1) Claude calls Read; (2) Claude calls Grep; (3) Claude calls Edit; (4) Claude returns a text summary with no tool call. How many turns is this? What is the stop_reason on the final response, and what is the smallest max_turns that still lets the session finish?

Practice ◆◆◇◇

An agent’s final ResultMessage carries one of these stop_reason values. Which one means the output can be delivered to the user as-is, with no further loop step and no special handling?

  • A. tool_use
  • B. end_turn
  • C. max_tokens
  • D. refusal
Practice ◆◆◆◇

You are designing a coding agent that occasionally loops re-reading the same files. Write a two-sentence termination policy: which budget(s) you set and why, and how your result-handling code distinguishes a real answer from a budget exhaustion. Name the exact field you check first.

Exercise solutions

Solution ↑ Exercise

Three turns. Turns count tool-use round-trips: Read, Grep, Edit are turns 1–3; the final text summary is not a turn. The final response’s stop_reason is end_turn (no tool call). The smallest budget that still finishes is max_turns = 3 — it permits the three tool-use turns, and the free final text answer follows. max_turns = 2 would stop before the Edit.

Solution ↑ Exercise

B — end_turn. It means the model finished naturally with no tool calls, so the text response is the deliverable. A (tool_use) is the continue branch — there is more loop to run, not a final answer. C (max_tokens) stops the response but the answer is truncated mid-output, so it usually needs a continuation before it is usable. D (refusal) is a non-answer to be handled as a declined request, not delivered as a result. The discriminating idea: only end_turn is both terminal and complete.

Solution ↑ Exercise

A defensible policy: “Set max_turns to the tool calls a normal fix needs plus headroom (say 15), and max_budget_usd as a hard cost ceiling, because the re-reading loop fails by count and by spend and either bound should stop it. My handler checks ResultMessage.subtype first: on success I read .result; on any error_* subtype I surface the exhaustion (and the partial transcript) rather than printing an empty .result.” The exact field checked first is subtype — never .result before it.

Exam essentials

  • An agent is a model in a loop: “LLMs using tools based on environmental feedback in a loop.” Workflow = predefined code path; agent = model directs its own steps.
  • The loop has one branch: stop_reason: "tool_use" → run tools, return tool_result, request again; end_turn → stop.
  • A turn = one tool-use round-trip. The final text-only response does not count against max_turns.
  • Tool-result handling: a failed call returns a tool_result with is_error: true (the model adapts; you do not throw); parallel tool_use blocks need all their tool_results in the next user message, keyed by tool_use_id.
  • stop_reason values: tool_use (continue), end_turn (stop, usable), max_tokens (truncated — maybe continue), refusal (non-answer).
  • Termination is yours to set: max_turns + max_budget_usd; on exhaustion the ResultMessage.subtype becomes an error_* value and .result is empty.
  • Check subtype before reading .result — every error subtype leaves .result unpopulated.

Further reading

The design rationale for the inner/outer split — why owning the loop boundary is what turns a model into a harness — is developed at length in the Agentic Systems Design book, Chapter 1, Agent = Model + Harness. It is optional depth, not required for the exam; this chapter is self-contained.

[Note]

Cross-book link is provisional — it points at the chapter source until the Agentic Systems Design book is deployed, then repoints to its published URL.

Part 1 Chapter 2 Last verified 2026-06-02 Fresh

Coordinator–Subagent Patterns: Hub-and-Spoke and Isolated Context

The coordinator–subagent (orchestrator-worker) pattern — a lead agent that decomposes a task and spawns isolated-context subagents. Teaches why a second agent ever helps, when the pattern earns its 3–10x token cost, the full single-vs-multi trade-off (including reliability and maintainability), why decomposition must split by context and not by role, and the one variant that works across domains.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D1.1 (the agent loop). You have run at least one agent and understand that each agent reasons over a finite context window.
You will learn
  • Explain why a single agent’s one finite context window is the constraint that multiple agents relieve
  • Describe the coordinator–subagent (orchestrator-worker) pattern and the property that defines it — isolated context
  • Decide when the pattern earns its 3–10× token cost, using the three win-conditions, and read the full single-vs-multi trade-off (cost, latency, reliability, maintainability, coherence)
  • Distinguish context-centric decomposition from the role-based anti-pattern that loses information at every handoff
  • Select the verification-subagent pattern — the one multi-agent shape that holds up across domains

Once the agent loop of D1.1 can run tools, the next architectural question is whether to run more than one agent. This chapter develops the canonical multi-agent shape — a coordinator that spawns isolated subagents — and, just as importantly, the discipline of not reaching for it. The exam tests judgment here: when the pattern wins, what it costs on every axis, how to cut the work, and which single variant is reliable.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. What single property defines a subagent and separates it from a plain tool call?
  2. Name the three conditions under which coordinator–subagent earns its token cost.
  3. Roughly what token multiplier does multi-agent cost versus a single agent, and does it buy speed or thoroughness?
  4. Besides cost, name two axes on which a multi-agent system is worse than a single agent.
  5. A teammate proposes splitting a task into planner → implementer → tester → reviewer. What is the name of this anti-pattern and the metaphor for its failure?
Check your answers
  1. Isolated context — a subagent runs in its own context window and does not see the parent’s state, where a tool call returns directly into the calling agent’s context.
  2. Context protection (large, mostly-irrelevant intermediate data stays out of the main window), parallelization (genuinely independent paths), and specialization (tool-set overload, conflicting personas, or deep domain expertise).
  3. 3–10× more tokens than a single agent — and it buys thoroughness, not speed (coordination often makes wall-clock slower).
  4. Any two of: latency (often slower despite parallelism), reliability (multiple failure points), maintainability (multiple prompt sets to keep in sync), context coherence (fragmented at handoffs).
  5. Role-based (problem-centric) decomposition — its failure metaphor is “the telephone game”: context loss at every handoff.

Why run more than one agent

A single agent (D1.1) is one model reasoning over one finite context window. That window is the bottleneck: everything the agent reads, every tool result, every intermediate thought accumulates in it, and a model attends less reliably as it fills. So the motivation for a second agent is not “more brains” — it is more windows. When a subtask would flood the main window with data the final answer doesn’t need, or when independent paths could be explored at once, splitting the work across separate context windows relieves the constraint a single loop cannot.

That is the line Building Effective Agents draws between a workflow and an agent: an agent is a “system where LLMs dynamically direct their own processes and tool usage.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original A coordinator of subagents is one such system — an agent whose tool is “spawn another agent.”

The orchestrator and its workers

The canonical multi-agent shape is hub-and-spoke: a lead agent analyzes the task, plans a strategy, and spawns subagents that explore parts of it independently. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The lead synthesizes their results and decides whether more work is needed — it is an orchestrator, and the subagents are workers.

Concept ·

A lead agent decomposes a task, dispatches subagents (each with its own objective, tools, and context window), and integrates their returned results. The coordinator owns planning and synthesis; the subagents own focused execution.

This is a real architecture with measured stakes, not a toy. In Anthropic’s research system, “a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The number is real but specific to that model pairing and that eval — read it as evidence the pattern can pay off, not as a portable benchmark.

Hub-and-spoke: comparing three regions' climate policy Worked example

A lead agent receives: “Compare the climate-disclosure rules of the EU, the US, and Japan.” Decomposed hub-and-spoke:

  1. Lead plans — three regions are independent research paths with no shared state, so it spawns one subagent per region.
  2. Subagents run in isolation — the EU subagent searches EU sources in its own context window; it never sees the US subagent’s intermediate pages, and neither pollutes the lead’s window.
  3. Artifacts, not transcripts — each subagent writes its full findings to a file and returns a compact reference, so 2,000 tokens of raw sources per region don’t stream back through the lead. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original
  4. Lead synthesizes — it reads the three references and writes the comparison.

The win conditions stack here: context protection (raw sources stay out of the lead) and parallelization (three regions at once). Note what is not split — the final synthesis stays in one agent, because comparing the three regions needs all three in one window.

Isolated context is the whole point

The property that defines the pattern is that each subagent runs in its own context window and does not see the parent’s state. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original A subagent is given a task and returns a result; the intermediate tokens it generates never touch the coordinator’s window. That isolation is the feature: it keeps a subtask’s noise out of the agent that has to reason over the whole problem.

Because results must cross a context boundary, large outputs use the artifacts pattern — a subagent writes its full output to the filesystem or external storage and passes a lightweight reference back, rather than streaming everything through the coordinator. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The coordinator stays lean; the high-fidelity output lives outside its window until needed.

[Tip]

On the exam, “isolated context window” and “does not see parent state” are the phrases that identify a subagent — distinct from a tool call, which returns directly into the calling agent’s context.

When the pattern earns its cost

The capability is bought with tokens. “In our data, agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats,” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original and against a single agent on an equivalent task, “multi-agent implementations typically use 3-10x more tokens than single-agent approaches.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original So the official guidance leads with restraint: “Start with the simplest approach that works, and add complexity only when evidence supports it.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original Try improved prompting, context compaction, and the Tool Search Tool on one agent first.

Reach for coordinator–subagent only when one of three conditions holds: [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original

Win conditionThe signalWhat it buys
Context protectionA subtask generates large, mostly-irrelevant intermediate data (>1000 tokens) that would pollute the main agent’s reasoningA clean main-agent window
ParallelizationGenuinely independent paths to explore concurrentlyThoroughness, not speed (coordination often makes wall-clock slower)
SpecializationTool-set overload (avoid 20+ tools on one agent), conflicting personas, or deep domain expertiseFocused agents that outperform an overloaded generalist

Cost is only the first axis. The full trade-off — the one a scenario question makes you weigh — is worse for multi-agent on most rows, and the architect must be able to name them:

Concept ·
DimensionSingle agentMulti-agent
Token usageBaseline3–10× higher
LatencyFast, sequentialOften slower despite parallelism (coordination + slowest-subagent)
ReliabilityOne point of failureMultiple failure points — more places an error can enter
MaintainabilityOne prompt setMultiple prompt sets to keep in sync
Context coherenceUnifiedFragmented at handoffs

Multi-agent is not “more advanced and therefore better”; it trades cost, latency, reliability, and maintainability for capability on tasks that genuinely need separate windows. Three of those five rows are downsides — which is why the exam frames the decision as restraint first.

Reaching for multi-agent before trying the cheap fixes

Teams “invest months building elaborate multi-agent architectures only to discover that improved prompting on a single agent achieved equivalent results.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original Before splitting: can better prompting, compaction, or the Tool Search Tool (a large tool-count reducer) solve it on one agent? If yes, stay single-agent.

Decompose by context, not by role

When you do split, how you cut the work is the most-tested judgment in this domain — and the most common way to get it wrong. The anti-pattern is role-based / problem-centric decomposition: planner → implementer → tester → reviewer. It feels organized but “creates constant coordination overhead and context loss at handoffs — the telephone game,” spending more tokens coordinating than executing. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original

The reliable alternative is context-centric decomposition: split only at true context boundaries.

Concept ·

Split along independent research paths, separate components with clean interfaces, or blackbox verification — work that shares little state. Keep together tightly-coupled work or anything needing shared state. If a “subagent” would need the parent’s working context to do its job, the boundary is synthetic.

Splitting a pipeline into role-agents

planner/implementer/tester/reviewer is sequential phases of one tightly-coupled task — exactly the split that loses fidelity at each handoff. It is not a context boundary. Decompose by what context is independent, not by what step comes next.

The verification subagent

One multi-agent shape “consistently succeeds across domains”: the verification subagent. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original The main agent does the work; a separate agent blackbox-tests the result with clear success criteria and minimal context transfer. The isolation is the strength — the verifier has no stake in, and no memory of, how the work was produced.

Its failure mode is early victory: verifiers tend to declare success after one or two checks. The documented mitigation is an explicit instruction — “You MUST run the complete test suite before marking as passed.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original

Practice

Exercise

A support agent has 40 tools spanning CRM, order history, messaging, and analytics, and its accuracy on multi-step tickets is falling. A teammate proposes splitting it into four subagents: intake, diagnosis, resolution, follow-up. Walk the decision framework: is multi-agent warranted here, and if so, is this the right way to cut it? State the win-condition (if any) and name the anti-pattern (if any).

Practice ◆◆◆◇

A colleague says “we split by role — planner, implementer, tester, reviewer — for clean separation of concerns.” Name the failure mode this invites, the metaphor the guidance uses for it, and the one-sentence rule for where you should place a split instead.

Practice ◆◆◇◇

Your lead has approved a multi-agent design purely because “it’s more scalable.” Name two trade-off axes (besides token cost) on which the multi-agent design is actually worse than a single agent, and state each in the form “single → multi.”

Exercise solutions

Solution ↑ Exercise

Multi-agent is plausibly warranted — for specialization — but the proposed split is the role-based anti-pattern. The real signal is tool-set overload (40 tools on one agent; the guidance says avoid 20+ and prefer focused agents). So the justified cut is by tool/domain context — e.g. a CRM-and-orders agent vs a messaging-and-analytics agent — each carrying a focused tool set. The proposed intake → diagnosis → resolution → follow-up split is problem-centric: those are sequential phases of one tightly-coupled ticket, so they would lose fidelity at every handoff (the telephone game) and add coordination cost. Decompose by what context is independent (tool domains), not by what step comes next. And first confirm the Tool Search Tool alone can’t relieve the tool overload on a single agent.

Solution ↑ Exercise

The failure mode is context loss / information-fidelity degradation at handoffs, plus constant coordination overhead — the guidance calls it “the telephone game.” It happens because planner/implementer/tester/reviewer are sequential phases of one tightly-coupled task, not independent contexts. The rule: place a split only at a true context boundary — independent paths, clean-interface components, or blackbox verification — never by “what step comes next.”

Solution ↑ Exercise

Any two of: Reliability — single (one point of failure) → multi (multiple failure points). Maintainability — single (one prompt set) → multi (multiple prompt sets to keep in sync). Latency — single (fast sequential) → multi (often slower despite parallelism). Context coherence — single (unified) → multi (fragmented at handoffs). “More scalable” is not free: three of the five trade-off rows move the wrong way when you add agents.

Exam essentials

  • Why multi-agent at all: a single agent has one finite context window; extra agents buy more windows, not more intelligence.
  • Coordinator–subagent = hub-and-spoke: a lead decomposes, spawns subagents, and synthesizes; subagents run in isolated context windows and do not see parent state.
  • Isolation is the feature (context protection); large outputs use the artifacts pattern — write to storage, pass a reference back.
  • Cost is 3–10× tokens (and ~15× vs a chat). The 90.2% gain was Opus 4 lead + Sonnet 4 subagents vs single Opus 4 — not a portable number.
  • The full trade-off is mostly worse for multi-agent: higher tokens, often higher latency, multiple failure points, multiple prompt sets, fragmented coherence. Start single-agent; split only for context protection, parallelization, or specialization.
  • Decompose by context, not role. planner/implementer/tester/reviewer is the telephone-game anti-pattern; split at true context boundaries.
  • The verification subagent is the reliable variant — blackbox-test the result; mitigate early victory with “run the complete test suite before marking as passed.”

Further reading

The environment angle on isolation — how bounding what each agent loads is the same discipline that makes a large codebase legible — is developed in the Agentic Systems Design book, Chapter 7, Environments at Scale. Optional depth; this chapter stands on its own.

[Note]

Cross-book link is provisional — it points at the chapter source until the Agentic Systems Design book is deployed, then repoints to its published URL.

Part 1 Chapter 3 Last verified 2026-06-02 Fresh

Subagent Invocation: AgentDefinition, the Agent Tool, and allowedTools

The mechanics beneath the coordinator–subagent pattern — how a subagent is actually invoked. The Agent tool and its three creation paths, the AgentDefinition contract, why "Agent" must be in allowedTools, the single prompt-string channel into a fresh context, what crosses back out, and what a subagent does not inherit (including parent permissions).

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D1.2 (coordinator–subagent patterns). You understand that a subagent runs in its own isolated context window and returns a result to the coordinator.
You will learn
  • Choose among the three ways to create a subagent — programmatic, filesystem, and the built-in general-purpose agent
  • Configure an AgentDefinition — the required description/prompt contract and a scoped tools set
  • Diagnose why a defined subagent never gets invoked — the two top failure modes
  • Write the Agent-tool prompt that carries everything a fresh-context subagent needs, knowing what it does not inherit
  • Trace what crosses back — the subagent’s final message as the Agent-tool result, and why it may arrive summarized

Chapter D1.2 established when to reach for a subagent and how to cut the work. This chapter drops one level — to the mechanics of actually defining and invoking one, and reading what crosses each way across the context boundary. The exam tests whether you can read an AgentDefinition, predict whether it will be invoked at all, and say what crosses into the subagent’s fresh context and what crosses back — and what silently does not.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. Through which single tool is every subagent invoked, and what was that tool previously named?
  2. Which two AgentDefinition fields are required, and what does each control?
  3. A perfectly-defined subagent never runs. Name the two most likely causes.
  4. A subagent is told “fix the bug we discussed.” Why does it fail, and what is the only channel that could have carried the context in?
  5. When the subagent finishes, what does the parent receive — and what might happen to it on the way back?
Check your answers
  1. The Agent tool — every subagent is invoked through it; it was renamed from Task in Claude Code v2.1.63, so tool-name filters must match both values.
  2. description (natural-language when to use this agent — drives automatic matching) and prompt (the agent’s system prompt: its role and behavior); everything else is optional.
  3. "Agent" is missing from allowedTools (the run gate — the call is never approved) or the description is too vague for Claude to match the task to it (the match gate).
  4. The subagent starts in a fresh context window with no parent conversation, so “the bug we discussed” never crossed; the only inbound channel is the Agent tool’s prompt string.
  5. The parent receives the subagent’s final message as the Agent-tool result — and the parent may summarize it rather than carry it through verbatim.

The Agent tool is the invocation surface

A subagent is invoked through exactly one tool — the Agent tool — and there are three ways to give that tool something to invoke. [Official] Subagents in the SDK · AnthropicT1-official original Everything in this chapter hangs off that one tool: how you define what it runs, how you allow it to run, and what crosses the boundary in each direction.

Concept ·
  • Programmatic — pass an AgentDefinition in the agents option of your query. Recommended for SDK apps.
  • Filesystem — drop a Markdown file in .claude/agents/*.md. Loaded at startup (restart to pick up new files).
  • Built-in — Claude can invoke a default general-purpose subagent through the Agent tool with no definition at all.

One naming wrinkle is load-bearing on the exam and in tool-name filters. The tool was renamed from Task to Agent in Claude Code v2.1.63; current SDK releases emit Agent in tool_use blocks but still report Task in the system:init tools list and in result.permission_denials[].tool_name. [Official] Subagents in the SDK · AnthropicT1-official original Code that filters on the tool name must check both values.

[Tip]

Exam reflex: “subagent invocation” = the Agent tool. If a filter only matches "Task" (or only "Agent"), it misses half the surface — match both.

AgentDefinition is the subagent’s contract

When you create a subagent programmatically, its AgentDefinition is a contract with two required halves: a description that says when to use it and a prompt that says how it behaves. [Official] Subagents in the SDK · AnthropicT1-official original Everything else is optional refinement.

FieldRequiredPurpose
descriptionyesNatural-language when to use this agent — drives automatic matching
promptyesThe agent’s system prompt: its role and behavior
toolsnoAllowed tool names; omit to inherit all of the parent’s tools
modelnoModel override (sonnet / opus / haiku / inherit / full ID)
maxTurnsnoCap the subagent’s agentic turns (its own budget)

The description does double duty — it is also how Claude decides to invoke the agent automatically (below), so write it specific and keyword-rich rather than generic. [Official] Subagents in the SDK · AnthropicT1-official original

[Note]

Need more review? The full field set (disallowedTools, skills, memory, mcpServers, effort, permissionMode, background, …) lives in the Agent SDK reference — recognize the contract shape here, look up the long tail there.

Enabling invocation: Agent in allowedTools

A defined subagent will not run unless the Agent tool itself is approved. Always include "Agent" in allowedTools to auto-approve subagent invocations; without it, the call falls through to your canUseTool callback or — in dontAsk mode — is denied outright. [Official] Subagents in the SDK · AnthropicT1-official original

Key idea

A subagent passes through two independent gates. The description decides whether Claude matches the task to your agent; allowedTools decides whether the resulting Agent call is allowed to run. A perfect description with Agent un-allowed never executes; Agent allowed with a vague description never gets matched. You need both.

The subagent that never delegates

When a defined subagent is ignored and the main agent just does the work inline, the cause is almost always one of two things: "Agent" is missing from allowedTools (so the call is never approved), or the description is too vague for the model to match the task to it. Check the gate and the description before touching the agent’s prompt.

The prompt string is the only channel in

A subagent starts in a fresh context window, and the only thing that crosses from parent to child is the Agent tool’s prompt string. [Official] Subagents in the SDK · AnthropicT1-official original

“A subagent’s context window starts fresh (no parent conversation) but isn’t empty. The only channel from parent to subagent is the Agent tool’s prompt string, so include any file paths, error messages, or decisions the subagent needs directly in that prompt.” [Official] Subagents in the SDK · AnthropicT1-official original

What that means concretely — the subagent receives its own system prompt (AgentDefinition.prompt), the Agent-tool prompt, the project CLAUDE.md (loaded via settingSources), and its tool definitions. It does not receive the parent’s conversation or tool results, the parent’s system prompt, or any preloaded skill content. [Official] Subagents in the SDK · AnthropicT1-official original Permissions are part of what does not cross: a subagent does not inherit the parent’s permissions — each runs its own evaluation chain — so a tool the parent could use is not automatically usable by the child. [Official] Configure permissions · AnthropicT1-official original

Assuming the subagent can see the parent's context

A subagent told to “fix the bug we discussed” has no idea what was discussed — the parent conversation never crossed the boundary. File paths, error text, and prior decisions must be written into the Agent-tool prompt or they do not exist for the child.

What crosses back: the return channel

The boundary is asymmetric, and the exam probes the outbound side too. When the subagent finishes, the parent receives the subagent’s final message as the Agent-tool result — but the parent may summarize it rather than carry it through verbatim. If a downstream step depends on the subagent’s exact output (a precise list, a diff, a structured payload), instruct the main agent to preserve the subagent’s result verbatim. [Official] Subagents in the SDK · AnthropicT1-official original Two more facts ride the return path: every message generated inside the subagent carries a parent_tool_use_id linking it to the invoking Agent call, [Official] How the agent loop works · AnthropicT1-official original and the subagent’s transcript persists independently of the main conversation (it survives main-session compaction). [Official] Subagents in the SDK · AnthropicT1-official original

Trusting that the subagent's exact output reaches the parent

Only the subagent’s final message returns, and the parent may summarize it. A coordinator that needs the subagent’s verbatim output — e.g. a reviewer’s exact line-by-line findings — can silently receive a lossy paraphrase. When fidelity matters, tell the main agent to preserve the result verbatim.

Defining and invoking a read-only reviewer Worked example

A coordinator needs a focused doc reviewer. Programmatic definition, both gates satisfied, both channels handled:

const result = await query({
  prompt: "Use the doc-reviewer agent to check docs/api.md for broken links and stale version numbers.",
  options: {
    allowedTools: ["Read", "Grep", "Glob", "Agent"],   // <-- "Agent" is the RUN gate
    agents: {
      "doc-reviewer": {
        description: "Reviews Markdown docs for broken links, stale refs, and version drift; read-only.",  // MATCH gate: specific + keyword-rich
        prompt: "You are a meticulous documentation reviewer. Report issues as a bulleted list with file:line. Do not edit.",
        tools: ["Read", "Grep", "Glob"],                 // scoped: cannot edit or run commands
      },
    },
  },
});

Trace it: (1) match — the keyword-rich description lets Claude map “check docs/api.md” to the agent; (2) run — "Agent" in allowedTools approves the call; (3) in — the file path docs/api.md travels in the prompt string (the subagent’s fresh window has nothing else); (4) back — the subagent’s bulleted findings return as the Agent-tool result. Because the coordinator will act on exact file:line references, the parent prompt should add: “preserve the reviewer’s findings verbatim.” Drop the "Agent" entry and nothing runs; vague the description to “Reviews things” and nothing matches.

Invocation paths and the one-level limit

Once Agent is allowed, a subagent is invoked one of two ways. [Official] Subagents in the SDK · AnthropicT1-official original

  • Automatic — Claude matches the subagent’s description to the task. This is why the description must be specific and keyword-rich.
  • Explicit — name the agent in the prompt (“Use the code-reviewer agent to check the auth module”), bypassing automatic matching.

There is a hard structural limit: subagents cannot spawn subagents. Don’t include Agent in a subagent’s tools array — delegation is one level deep. [Official] Subagents in the SDK · AnthropicT1-official original

Putting Agent in a subagent's tools

This is an attempt to build a three-level hierarchy — coordinator → subagent → sub-subagent. The depth-1 limit means the nested Agent call won’t produce a grandchild. If a task seems to need that depth, orchestrate the extra layer from the parent, not from inside a child.

Practice

Exercise

You define a read-only doc-reviewer subagent via the agents option — description: "Reviews things", tools: ["Read", "Grep", "Glob"] — but it never triggers; the main agent just reviews inline. The parent’s allowedTools is ["Read", "Edit", "Bash"]. Diagnose the two reasons it won’t delegate, and give the fix for each. Is the tools array itself part of the problem?

Practice ◆◆◇◇

Name the three ways to create a subagent, say which one the SDK docs recommend for SDK apps, and state in one clause why the built-in option needs no AgentDefinition.

Practice ◆◆◆◇

A coordinator spawns a reviewer subagent and then tries to apply its fixes, but keeps acting on the wrong lines. The reviewer’s own transcript shows correct file:line findings. Explain what is most likely happening on the return channel, and the one-line instruction that fixes it.

Exercise solutions

Solution ↑ Exercise

Two faults, neither in the tools array. (1) "Agent" is not in the parent’s allowedTools, so the Agent-tool call is never auto-approved — it falls through to canUseTool (or is denied in dontAsk). Fix: add "Agent" to allowedTools. (2) description: "Reviews things" is too vague for automatic matching — descriptions must be specific and keyword-rich. Fix: rewrite it (e.g. “Review Markdown/docs for accuracy, broken links, and stale references; read-only”), or invoke the agent explicitly by name to bypass matching. The tools: ["Read", "Grep", "Glob"] set is exactly right for read-only review — leave it. The two faults map to the two gates: allowedTools is the run gate, the description is the match gate.

Solution ↑ Exercise

Programmatic (an AgentDefinition in the agents option) — the recommended path for SDK apps; filesystem (.claude/agents/*.md, loaded at startup); and the built-in general-purpose agent. The built-in needs no AgentDefinition because it ships with a default description and prompt — Claude can invoke it through the Agent tool with nothing defined, which is why it is the zero-config fallback.

Solution ↑ Exercise

The return channel is lossy by default: only the subagent’s final message returns to the parent, and the parent may summarize it. So the coordinator is acting on a paraphrase of the reviewer’s findings, not the exact file:line list the subagent produced. The fix is to instruct the main agent to preserve the subagent’s result verbatim (and have the reviewer return a structured, easily-quoted format). The subagent’s own transcript being correct is the tell that the loss happened on the way back, not inside the subagent.

Exam essentials

  • One tool, three creation paths: subagents are invoked via the Agent tool (renamed from Task — filters must match both), created programmatically (agents option, recommended), via filesystem (.claude/agents/*.md), or as the built-in general-purpose agent.
  • AgentDefinition = description + prompt (both required); tools/model/maxTurns optional. The description drives automatic matching, so make it keyword-rich.
  • Agent must be in allowedTools or the subagent never runs. Two gates: description matches, allowedTools runs.
  • The prompt string is the only inbound channel. A subagent gets a fresh context — no parent conversation, no parent system prompt, no preloaded skills — but does get project CLAUDE.md + its tools. Permissions do not inherit; each subagent has its own evaluation chain.
  • The return channel is asymmetric and lossy: the parent receives the subagent’s final message but may summarize it; instruct the main agent to preserve it verbatim when fidelity matters. parent_tool_use_id attributes a message to its subagent.
  • Delegation is one level deep: subagents cannot spawn subagents (no Agent in a child’s tools).
Part 1 Chapter 4 Last verified 2026-06-02 Fresh

Multi-Step Workflows: Programmatic vs Prompt-Based Handoff

A multi-step task's control flow is enforced either in your code (programmatic) or in the model (prompt-based). When to choose each, why every step boundary is a handoff that leaks fidelity, the Writer/Reviewer pattern as the handoff that works, how a written artifact makes a handoff survivable, and how a programmatic validation gate rejects-and-retries a bad step before it propagates.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapters D1.1 (the loop) and D1.3 (subagent invocation). You can invoke a subagent and know that only the prompt string crosses into it.
You will learn
  • Compare the two places a multi-step workflow’s control flow can live — in your code (programmatic) or in the model (prompt-based)
  • Analyze why every step boundary is a handoff, and where fidelity leaks across one
  • Design a handoff contract — and a written artifact (a spec or a test file) that carries it across the boundary
  • Build a programmatic validation gate that rejects a malformed step and retries it before the next step consumes it
  • Evaluate when to enforce a workflow programmatically versus trusting the model to follow it

Most real agent work is several steps, not one. The question this chapter answers is not what the steps are but who enforces the order — your application code, or the model itself — and how you keep a bad step from poisoning the next one. That choice is an Evaluate-level judgment: the exam gives you a workflow and asks where the control flow belongs, how the handoff is specified, and where the gate goes.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. Name the two places a multi-step workflow’s control flow can live, and one property each buys.
  2. Why is every step boundary a place fidelity can leak?
  3. In the Writer/Reviewer pattern, why must the reviewer not inherit the writer’s context?
  4. What three things does a handoff contract specify, and what file can carry it across a boundary?
  5. A programmatic pipeline produces a malformed step-2 output. What does a validation gate do, and what are its two kinds of check?
Check your answers
  1. In your code (programmatic) — which buys determinism and auditability — or in the model (prompt-based) — which buys flexibility and adaptivity.
  2. Each boundary is a handoff — step N’s output becomes step N+1’s input — and a re-narrated handoff erodes detail at every retelling: the telephone game.
  3. Because the absence of inheritance is the feature — a fresh context isn’t biased toward code it just wrote and cannot rationalize choices it never made.
  4. An objective, an output format, and clear task boundaries — carried across the boundary as a written artifact the next step reads, such as a spec file or a test file.
  5. It rejects and retries — re-prompting the failing step with the specific errors instead of passing the bad output downstream — using a schema/structural check (right shape) and a semantic check (content actually right).

Two places to enforce a workflow

A multi-step workflow’s control flow lives in exactly one of two places. Either your code drives the sequence — run a step, take its output, decide the next call — or the model drives it, having been told the steps in a prompt. The Agent SDK frames the split directly: “With the Client SDK, you implement a tool loop. With the Agent SDK, Claude handles it.” [Official] Agent SDK overview · AnthropicT1-official original

Concept ·
  • Programmatic — your application owns the steps. You sequence the calls, pass each output to the next stage, and can insert validation gates between them. Deterministic and auditable.
  • Prompt-based — the model owns the steps. One agent is told the whole procedure (or an orchestrator self-directs), and you trust it to follow. Flexible and adaptive.

These are not rival products — Anthropic notes the same workflow “translate[s] directly” between the CLI and the SDK. [Official] Agent SDK overview · AnthropicT1-official original The architect’s decision is which layer holds the control flow, and it turns on how much the workflow needs determinism versus flexibility.

Every step boundary is a handoff

Whichever layer enforces the steps, each transition between them is a handoff — the output of step N becomes the input of step N+1 — and a handoff is where information is lost. Chapter D1.2 named the worst case: dividing a tightly-coupled task by role (planner → implementer → tester → reviewer) “creates constant coordination overhead and context loss at handoffs — the telephone game,” spending more tokens coordinating than executing. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original Prompt-based handoffs across a long sequential chain are the most fidelity-fragile arrangement: each step re-narrates the last, and detail erodes at every retelling.

A sequential chain of role-agents

A pipeline of one-job agents passing work down the line looks like clean separation of concerns, but every arrow is a lossy handoff. If the steps share state, keep them in one context; reserve cross-agent handoffs for boundaries where little context needs to travel.

The Writer/Reviewer handoff that works

Not every handoff leaks — the canonical multi-step quality workflow depends on one. In the Writer/Reviewer pattern, one session implements and a second reviews: “A fresh context improves code review since Claude won’t be biased toward code it just wrote.” [Official] Best practices for Claude Code · AnthropicT1-official original Session A writes the rate limiter; Session B reviews the file for edge cases, race conditions, and consistency; Session A then addresses the feedback. The same shape works for tests — “have one Claude write tests, then another write code to pass them.” [Official] Best practices for Claude Code · AnthropicT1-official original

Key idea

Here the absence of inheritance is the feature. The reviewer’s value comes from not carrying the writer’s context — it cannot rationalize choices it never made. A handoff that deliberately drops context (fresh-context review) and one that accidentally drops it (the telephone game) differ only in whether the loss was the point.

Letting the author review its own work

A single agent asked to review the code it just wrote is the biased case the Writer/Reviewer pattern exists to avoid. Hand the review to a fresh context — a second session, or a verification subagent (D1.3) in its own window — not back to the writer.

The handoff contract and its artifact

When work does cross a boundary, what crosses must be specified, not assumed. Anthropic’s research system makes each handoff an explicit contract: “Each subagent needs an objective, an output format, guidance on the tools and sources to use, and clear task boundaries.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original That is the same discipline as D1.3’s rule that everything the subagent needs goes in the prompt — applied to every step of a multi-step flow.

The most robust way to carry that contract across a boundary is as a written artifact — a file the next step reads, not prose it re-narrates. Two concrete forms appear in the best-practices guidance:

  • A spec file. After an interview/planning phase, “start a fresh session to execute it … and you have a written spec to reference.” [Official] Best practices for Claude Code · AnthropicT1-official original The spec, not the conversation, is what crosses to the implementation step.
  • A test file. In the test/code split, the tests are the contract: one Claude writes them, another writes code to pass them. The implementer’s target is the file, not a description of it.
Key idea

A handoff survives in proportion to how little it depends on memory. A re-narrated handoff degrades at every retelling; a handoff carried by a written artifact — a spec or a test suite the next step reads from disk — does not. Turn the contract into a file whenever a step boundary matters.

[Tip]

A good handoff contract — objective, output format, boundaries — makes a handoff survivable regardless of who enforces the workflow. Vague handoffs are where both programmatic and prompt-based pipelines fail.

The validation gate: reject before propagating

A precise contract is also what lets a programmatic pipeline put a gate between steps — a check the step-N output must pass before step N+1 is allowed to consume it. The gate does two kinds of check, and the distinction is the one Domain 4 builds on (D4.4):

Concept ·
  • Schema / structural — is the output the right shape? Required fields present, types correct, enums valid, parses cleanly. Cheap and mechanical.
  • Semantic — is the output right? The shape can be perfect and the content wrong (a draft that omits a required claim, a total that doesn’t add up). Needs domain logic, not a schema.

On failure, the gate does not pass the bad output downstream — it rejects and retries: re-prompt the failing step with the specific errors, and only advance when the output passes. That is the difference between a programmatic pipeline and a prompt-based one: the gate is enforced in your code, where a malformed step cannot quietly become the next step’s input.

A gated content pipeline Worked example

A programmatic flow: research → draft → [gate] → publish.

  1. Research runs; its notes are written to research.md (the artifact crossing to draft).
  2. Draft produces an article keyed to a contract: { title, sections[≥3], every claim cites a research.md line }.
  3. The gate (your code, not the model) runs two checks on the draft:
    • Schema: does it have a title and ≥3 sections, and does every claim carry a citation marker? (parse check)
    • Semantic: does each cited line actually exist in research.md? (a fabricated citation passes the schema but fails here)
  4. On failure — say a claim cites a non-existent line — the gate rejects the draft and re-prompts the draft step: “Claim 4 cites research.md:88, which does not exist. Re-cite from real lines.” It loops until the draft passes, then lets publish consume it.

Without the gate, the fabricated citation flows straight into publish — a silent failure caught only by a reader. The gate is the programmatic analogue of D1.1’s rule that a failed step is a result to handle, not something to wave through.

A gate that only checks the schema

A schema check confirms the shape and nothing else — a structurally perfect output can be semantically wrong (valid JSON, fabricated data). A gate that stops at the schema lets the dangerous failures through. Pair every structural check with a semantic one wherever the step’s content can be wrong, not just its form.

Choosing where the control flow lives

The Evaluate-level call: enforce programmatically when the workflow needs determinism, an audit trail, validation gates between steps, or a fixed and repeatable sequence — the steps are known in advance and you want them to run the same way every time. Stay prompt-based when the path is flexible, the model can sensibly self-direct, and the orchestration code would cost more than it saves.

This pairs with two neighboring decisions: whether to split into multiple agents at all (D1.2) and whether the decomposition is a fixed pipeline or an adaptive one (D1.6). Enforcement locus is how the steps are driven; those chapters cover whether and into what shape. The mechanics of the validation/retry loop itself — schema vs semantic errors, bounded retries — are developed in D4.4.

[Note]

Need more review? The fixed-vs-adaptive structure of a multi-step task is D1.6 (Task Decomposition); whether to involve more than one agent is D1.2; the validation/retry loop in depth is D4.4.

Practice

Exercise

A content workflow runs four steps — research → draft → fact-check → polish — today as a single agent in one context. Quality is slipping specifically at the fact-check step: the agent waves through claims it wrote moments earlier. Evaluate two options: (a) keep it one prompt-based agent, or (b) hand the fact-check to a fresh-context reviewer with an explicit programmatic handoff. Which do you choose, where exactly do you place the split, and what would over-splitting into four role-agents cost you?

Practice ◆◆◆◇

In one sentence each, distinguish programmatic from prompt-based workflow enforcement, and name one condition under which each is the right choice.

Practice ◆◆◇◇

A teammate’s pipeline validates every inter-step output against a JSON schema and still ships wrong data. Name the kind of check the schema is doing, the kind it is missing, and what the gate should do on a failure instead of passing the output along.

Exercise solutions

Solution ↑ Exercise

Choose (b), but split at one boundary only. The fact-check is failing for the exact reason the Writer/Reviewer pattern addresses — a context biased toward the draft it just produced rationalizes its own claims. Hand the fact-check to a fresh context (a second session or a verification subagent), passing an explicit handoff contract: the draft, the claims to verify, the success criteria, the output format. That is a programmatic handoff — your code routes the draft to the reviewer and the verdict back. Keep research → draft coupled in one context: they are tightly coupled and share state, so a handoff there would only leak fidelity. And do not split all four steps into role-agents — that is the telephone-game pipeline, four lossy handoffs where you needed one. The skill is placing the single split where fresh context buys independence.

Solution ↑ Exercise

Programmatic enforcement puts the control flow in your code — you sequence the steps, pass each output to the next, and can gate between them; choose it when you need determinism, an audit trail, validation gates, or a fixed repeatable sequence. Prompt-based enforcement puts the control flow in the model — it is told the procedure and self-directs; choose it when the path is flexible, the model can sensibly adapt, and orchestration code would cost more than it saves.

Solution ↑ Exercise

The schema is doing a structural check — right shape, fields present, types valid — and is missing the semantic check: whether the content is actually correct (valid JSON can still carry fabricated or contradictory data). The gate should not pass a failing output along; it should reject and retry — re-prompt the failing step with the specific error and only advance once the output passes both the structural and semantic checks.

Exam essentials

  • Two enforcement loci: a multi-step workflow’s control flow lives in your code (programmatic — you sequence steps and gate between them; deterministic/auditable) or the model (prompt-based — told the steps, self-directs; flexible).
  • Every step boundary is a handoff, and handoffs leak. A sequential chain of role-agents is the most fidelity-fragile arrangement — the telephone game.
  • Writer/Reviewer is the handoff that works because the reviewer has fresh context — it can’t defend code it never wrote. Don’t let an author review its own work.
  • Carry the contract in a written artifact — a spec file or a test file the next step reads from disk — so the handoff doesn’t depend on re-narration.
  • A programmatic validation gate runs a structural check and a semantic check between steps, and on failure rejects and retries rather than propagating the bad output. (The loop in depth: D4.4.)
  • Choose programmatic for determinism / audit / gates / fixed sequence; prompt-based for flexible self-direction. (Whether to split = D1.2; pipeline vs adaptive = D1.6.)
Part 1 Chapter 5 Last verified 2026-06-08 Fresh

Agent SDK Hooks: Intercepting, Gating, and Normalizing the Loop

A hook is a typed callback the SDK fires at a named lifecycle event — the architect's control plane to intercept, gate, and normalize the agent loop without touching the model. The events to know, the two interception modes (PreToolUse gates, PostToolUse normalizes), the four PreToolUse decisions including what defer does, and the deny-beats-defer-beats-ask-beats-allow precedence.

Volatility: feature-surface
Tools compared: claude-code
Before you start: Chapter D1.1 (the agent loop). You know the loop runs tools between model turns. Helpful: D1.3 (subagents) for the subagent-hook events.
You will learn
  • Identify the key hook events and what triggers each
  • Apply the two interception modes — PreToolUse to gate a call, PostToolUse to normalize a result
  • Define the four PreToolUse decisions — including what defer does that the other three don’t
  • Analyze what happens when several hooks fire on one event, using the deny > defer > ask > allow precedence
  • Recognize the hook gotchas that silently disable interception (case, max_turns, ordering, subagent permissions)

The agent loop of D1.1 runs the model’s tool calls automatically. Hooks are how the architect gets between the model and those calls — to block a dangerous one, rewrite its input, or clean up its output — without editing the model’s prompt. The exam tests three things: which events exist, the two modes of intervention, and who wins when hooks disagree.

[Note]

This is a feature-surface chapter: the specific event names and the Python/TS roster change faster than the interception pattern. Treat the lists as a current snapshot and re-verify against the SDK reference before relying on a specific event.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. What is a hook, and does it consume the agent’s context window?
  2. Which event gates a call before it runs, and which normalizes a result after?
  3. Name the four decisions a PreToolUse hook can return — and say what defer does.
  4. Three hooks return ask, allow, and deny on one event. What happens, and why?
  5. A subagent can’t use a tool the parent could. Why — and what hook cleanly pre-approves it?
Check your answers
  1. A hook is a callback that runs your code in response to an agent event, running in your application process — it does not consume the agent’s context window.
  2. PreToolUse gates a call before it runs; PostToolUse normalizes the result before the model sees it.
  3. allow, ask, deny, and defer — defer ends the query so the host can resume it later from the persisted session, a pause-and-hand-back, not an allow or a block.
  4. The call is blocked: matching hooks run in parallel and the most restrictive result wins, per the precedence deny > defer > ask > allow.
  5. Because subagents do not inherit the parent’s permissions — each runs its own evaluation chain; a PreToolUse hook cleanly pre-approves the tool.

Hooks intercept the loop at named events

A hook is a callback that runs your code in response to an agent event: “Hooks are callback functions that run your code in response to agent events, like a tool being called, a session starting, or execution stopping.” [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original They arrive through two channels that share one lifecycle — programmatic hooks (callbacks in your query() options) and filesystem hooks (shell commands in settings.json, loaded via settingSources). [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original

Concept ·

A hook runs in your application process, so it consumes host resources and a host failure can kill the agent. It does not consume the agent’s context window — it is out-of-band control logic, distinct from an in-context system reminder. The model never sees the hook; it only feels its effect.

The events you must recognize

Hooks fire at named lifecycle points. The Python SDK exposes ten; the TypeScript SDK extends the same set to twenty. [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original The ones an architect must recognize:

EventFires whenTypical use
PreToolUsea tool call is requestedblock or rewrite the call before it runs
PostToolUsea tool returns a resultnormalize or replace the result before the model sees it
PostToolUseFailurea tool execution failslog or handle the error
UserPromptSubmita prompt is submittedinject extra context
Stopthe agent stopspersist state before exit
SubagentStart / SubagentStopa subagent begins / endstrack spawned parallel work
PreCompactcompaction is about to runarchive the full transcript first
PermissionRequesta permission dialog would appearcustom permission handling
Notificationan agent status messageforward to Slack / PagerDuty

The TypeScript-only additions (PostToolBatch, SessionStart, SessionEnd, Setup, and others) are why SessionStart / SessionEnd are not available as Python SDK callbacks — Python apps needing them load filesystem hooks from settings instead. [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original

PreToolUse gates; PostToolUse normalizes

The two most-tested events define the two interception modes, and they sit on opposite sides of tool execution. [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original

  • PreToolUse gates — it runs before the tool and returns a permissionDecision of allow, deny, ask, or defer, optionally with updatedInput to rewrite the call. Blocking a write to a .env file is a PreToolUse hook matching Write|Edit that returns deny when the target path ends in .env.
  • PostToolUse normalizes — it runs after the tool and returns either additionalContext (appended to the result) or updatedToolOutput (which replaces the output before Claude sees it). Stripping noise, redacting secrets, or reshaping a tool’s response into a clean form is PostToolUse work.
[Tip]

Exam reflex: PreToolUse = the gate (allow/deny/ask/defer, rewrite input); PostToolUse = the normalizer (updatedToolOutput replaces what the model reads). The “tool interception, normalization” task area is exactly these two.

Matchers select which calls a hook sees: they are regex strings tested against the tool name — "Write|Edit", "^mcp__" for all MCP tools, or omitted to match everything. [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original Matchers do not filter by argument; filter on the file path or command inside the callback.

A PreToolUse gate: forbid writes to .env Worked example

The rule: the agent may never write a .env file. That is a gate (a decision before the tool runs), so it is a PreToolUse hook. The matcher selects the write tools; the callback inspects the path and returns a decision:

{
  "hookSpecificOutput": {
    "permissionDecision": "deny",
    "permissionDecisionReason": "Cannot modify .env files"
  }
}

Wiring: a PreToolUse hook with matcher "Write|Edit"; inside the callback, if the target path ends in .env, return the object above; otherwise return "allow". Note what the matcher does not do — it selects by tool name, not by argument, so the .env test must happen inside the callback on the call’s input. Because this is a gate, it has to be PreToolUse: at PostToolUse the write has already happened.

Precedence: deny beats defer beats ask beats allow

When several hooks (or permission rules) act on one event, the outcome is decided by a fixed precedence, not by who ran first: “When multiple hooks or permission rules apply, deny takes priority over defer, which takes priority over ask, which takes priority over allow. If any hook returns deny, the operation is blocked regardless of other hooks.” [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original The four decisions are not symmetric, and defer is the one candidates miss:

Concept ·
  • allow — permit the call (may carry an updatedInput to rewrite it).
  • ask — surface a permission prompt to the human (may also carry updatedInput).
  • deny — block the call outright; one deny wins over everything else.
  • defer — the special one: it ends the query so the host can resume it later from the persisted session — a pause-and-hand-back, not an allow or a block. Note updatedInput is ignored with defer (it applies only to allow/ask). [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original
Key idea

All matching hooks run in parallel, and the most restrictive result wins — a single deny blocks the call no matter what the others return. The system fails safe: to forbid something, one hook saying so is enough; to permit it, every hook must agree. defer sits just under deny because pausing the whole query is more restrictive than asking or allowing.

Relying on hook execution order

Because matching hooks run in parallel and completion order is non-deterministic, a hook that assumes another has “already run” is a race. Write each hook to act independently; never chain side effects across hooks on the same event.

Hooks and subagents

Two subagent-aware events — SubagentStart and SubagentStop — let you track spawned work, but the operational gotcha is about permissions. As D1.3 established, subagents do not inherit the parent’s permissions; each runs its own evaluation chain. [Official] Configure permissions · AnthropicT1-official original The clean way to pre-approve a subagent’s tools is a PreToolUse hook rather than re-prompting inside every child. [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original

Hooks that silently never fire

Event names are case-sensitive (PreToolUse, not pretooluse), and hooks may not fire when max_turns cuts a session short. A second trap is recursion: a UserPromptSubmit hook that spawns subagents can loop if those subagents re-trigger it — guard with a session-state flag.

Practice

Exercise

You must enforce two rules on an agent: (a) it may never write to any .env file, and (b) every Bash result must have its ANSI color codes stripped before the model reads it. For each rule, name the hook event you would use and the specific return field that does the work. Why can’t (b) be done with the same event as (a)?

Practice ◆◆◇◇

Three hooks fire on one PreToolUse event and return, respectively, ask, allow, and deny. What happens to the tool call, and what is the one-word rule that decides it?

Practice ◆◆◆◇

A teammate sets a PreToolUse hook to return defer with an updatedInput that rewrites a command, expecting the rewritten command to run. Two things are wrong with that expectation. Name what defer actually does to the query, and what happens to the updatedInput.

Exercise solutions

Solution ↑ Exercise

(a) is a gate, (b) is a normalization — opposite sides of tool execution, so they need different events. (a) Use a PreToolUse hook with a matcher of "Write|Edit"; in the callback, inspect the target path and return hookSpecificOutput.permissionDecision: "deny" when it ends in .env. The decision must come before the write runs. (b) Use a PostToolUse hook with a matcher of "Bash"; return hookSpecificOutput.updatedToolOutput containing the result with ANSI codes stripped — updatedToolOutput replaces the output before Claude sees it. (b) can’t reuse PreToolUse because the output does not exist yet when the call is requested; you can only normalize a result after the tool has produced it.

Solution ↑ Exercise

The call is blocked. All matching hooks run in parallel and the most restrictive result wins, so the deny overrides the allow and the ask — the precedence is deny > defer > ask > allow. The one-word rule is restrictive (equivalently, “deny wins”): one hook saying deny is enough to block; permitting requires every hook to agree.

Solution ↑ Exercise

Both expectations are wrong. defer does not run the call — it ends the query so the host can resume it later from the persisted session (a pause-and-hand-back, not an allow). And updatedInput is ignored with defer — that field applies only to allow (or ask). To run a rewritten command, the hook must return allow with updatedInput, not defer.

Exam essentials

  • A hook is a callback at a named lifecycle event, delivered programmatically (query options) or via filesystem settings. It runs in your process and does not consume agent context.
  • Two interception modes: PreToolUse gates (returns allow/deny/ask/defer + optional updatedInput); PostToolUse normalizes (updatedToolOutput replaces the result, additionalContext appends).
  • The four decisions: allow / ask / deny, and defer — ends the query so the host can resume it later from the persisted session (and updatedInput is ignored with defer).
  • Precedence is deny > defer > ask > allow. Matching hooks run in parallel; the most restrictive wins; one deny blocks.
  • Don’t rely on hook order (non-deterministic) — write each hook independently.
  • Subagents don’t inherit permissions; pre-approve their tools with a PreToolUse hook. Watch the silent-failure traps: case-sensitive names, max_turns cut-offs, recursive subagent loops.
Part 1 Chapter 6 Last verified 2026-06-02 Fresh

Task Decomposition: Sequential Pipelines vs Adaptive

Once you've decided to decompose, the structural choice is fixed-in-advance (a sequential pipeline — predictable, cheap, auditable) versus decided-at-runtime (adaptive — the orchestrator scales effort to the task). Why open-ended work can't be hardcoded, why predictable work shouldn't be adaptive, the quantified token cost of choosing adaptive, and the failure modes at both extremes.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D1.2 (coordinator–subagent patterns). You can decide whether a task has true context boundaries worth splitting on.
You will learn
  • Distinguish a fixed sequential pipeline from adaptive decomposition
  • Explain why open-ended, path-dependent work cannot be reduced to a hardcoded pipeline
  • Quantify what adaptive decomposition costs — the token multiplier and what it buys
  • Evaluate which structure a given task needs, trading predictability and cost against capability
  • Select the mitigation for the failure mode at each extreme — over-decomposition and rigid pipelining

Chapter D1.2 settled whether to split a task and where the context boundaries are. This chapter asks a different question about the same task: once it is cut into pieces, is the set of pieces fixed in advance, or decided at runtime? That is the pipeline-versus-adaptive choice, and getting it wrong is expensive in opposite directions — an Evaluate-level judgment the exam probes with concrete tasks.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. What distinguishes a sequential pipeline from adaptive decomposition — and when is each structure set?
  2. Why can’t open-ended research be reduced to a hardcoded pipeline?
  3. Roughly what token multiplier does an adaptive multi-agent flow cost versus a single agent, and does parallelism buy speed or thoroughness?
  4. Name the over-decomposition failure mode and the heuristic that guards against it.
  5. Which structure is usually programmatically enforced, and why?
Check your answers
  1. A sequential pipeline’s steps are hardcoded and fixed at design time; in adaptive decomposition an orchestrator decides the shape at runtime, scaling subtasks to the input.
  2. Open-ended research is “inherently dynamic and path-dependent” — step N+1 depends on what step N discovered, so no design-time sequence can capture it.
  3. Roughly 3–10× more tokens than a single agent (about 15× a chat), and parallelism buys thoroughness, not speed — coordination plus the slowest subagent often make wall-clock slower.
  4. Over-decomposition — e.g. “spawning 50 subagents for simple queries” — guarded by the effort-scaling heuristic (1 agent / 2–4 subagents / 10+, sized to complexity).
  5. The sequential pipeline — your code drives the fixed sequence, because nothing about the structure needs to be decided live.

Two shapes of a decomposed task

A decomposed task takes one of two structural shapes, distinguished by when the structure is determined.

Concept ·
  • Sequential pipeline — the steps are known ahead of time and hardcoded: step 1 → step 2 → step 3, the same every run. The structure is fixed at design time.
  • Adaptive decomposition — an orchestrator decides at runtime how many subtasks to spawn and what each does, scaling the decomposition to the specific input.

The difference is not how many agents run but who decides the shape and when: the author, in advance, or the orchestrator, on the fly. Each is right for a different kind of task.

When the path can’t be hardcoded → adaptive

Some work resists a fixed pipeline by its nature. Anthropic’s research system is explicit about why: “Research work involves open-ended problems where it’s very difficult to predict the required steps in advance. You can’t hardcode a fixed path for exploring complex topics, as the process is inherently dynamic and path-dependent.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original When step N+1 depends on what step N discovered, no design-time sequence can capture it.

Adaptive systems handle this by scaling effort to the input. The research system embeds the heuristic directly in its lead-agent prompt: “Simple fact-finding requires just 1 agent with 3-10 tool calls, direct comparisons might need 2-4 subagents with 10-15 calls each, and complex research might use more than 10 subagents with clearly divided responsibilities.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original

Key idea

Adaptive decomposition buys the ability to handle unpredictable, path-dependent work — but it spends the orchestrator’s judgment to do so. The structure is only as good as the lead agent’s runtime decision about how much effort the task deserves.

[Tip]

The 1 / 2–4 / 10+ effort-scaling ladder (simple / comparison / complex) is the canonical example of adaptive decomposition — recognize it as “the orchestrator sizing the decomposition to the query,” not a fixed rule.

When the path is predictable → pipeline

The opposite case is just as common and far cheaper to run. When a task’s steps are known, repeatable, and the same every time, a fixed sequential pipeline is the right structure: it is deterministic, auditable, and predictable in cost, and it asks the orchestrator to make no runtime judgment at all. A pipeline is typically programmatically enforced (D1.4) — your code drives the fixed sequence — precisely because nothing about the structure needs to be decided live. Reaching for adaptivity here is wasted capability: you pay for an orchestrator’s deliberation to re-derive a structure you already knew at design time.

The cost you are choosing

“Cheaper” and “more expensive” are not hand-waving — the choice has a price tag, and the exam expects the number. An adaptive multi-agent flow “typically use[s] 3-10x more tokens than single-agent approaches for equivalent tasks,” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original and on the absolute scale, “multi-agent systems use about 15× more tokens than chats.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original What that spend buys is thoroughness, not speed — parallel subagents explore a larger space, but coordination plus the slowest subagent often make the wall-clock slower, not faster. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original

Key idea

A sequential pipeline’s cost is bounded and predictable — the same fixed steps every run. An adaptive flow’s cost is variable and can spike: it pays the 3–10× multiplier and depends on the orchestrator sizing effort correctly. You choose adaptive for capability on open-ended work, knowing you are buying thoroughness at several times the token cost — not buying speed.

The same data, two shapes Worked example

Two tasks that look superficially similar decompose oppositely:

Pipeline — summarize 200 support tickets to a fixed JSON shape. The steps are identical for every ticket (read → extract fields → emit JSON), so the structure is fixed at design time. Cost is bounded: 200 × (one fixed path). You fan the same path across all 200 tickets — parallel throughput, not adaptive structure — and your code enforces it (D1.4). No orchestrator judgment, no token-multiplier surprise.

Adaptive — “find out whether any competitor shipped feature X; go as deep as the question needs.” Depth is unknowable up front and each finding changes what to look at next (path-dependent), so a lead agent sizes the decomposition at runtime via the ladder: a quick check might be 1 agent, 3–10 calls; a real comparison 2–4 subagents; a deep dive 10+. The capability to branch is the point — but it costs 3–10× the tokens of a single agent, and the guard is the effort ladder so it doesn’t spawn ten subagents to confirm an obvious “no.”

Same input domain (text about a topic); opposite structures, because one path is predictable and one is not.

The failure mode at each extreme

Both shapes fail, in opposite ways, when matched to the wrong task.

Over-decomposition is the adaptive failure: an orchestrator that misjudges effort produces absurd structures. Among the research system’s documented early failures was “spawning 50 subagents for simple queries” — capability with no judgment behind it, multiplying token cost for nothing. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The effort-scaling heuristic exists to mitigate exactly this.

Rigid pipelining is the inverse: forcing a fixed sequence onto path-dependent work, which then cannot adapt when a step surfaces something the design didn’t anticipate. And a pipeline cut by role rather than context is the telephone-game anti-pattern of D1.2 — sequential phases of one coupled task, losing fidelity at every handoff. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original

Over-decomposing a simple task

Adaptive structure without effort-scaling spawns subagents the task never needed — the “50 subagents for a simple query” failure, paying the 3–10× multiplier for nothing. If you go adaptive, the orchestrator must be told how to size effort to complexity, or it will over-spend.

Forcing a pipeline onto open-ended work

A hardcoded sequence applied to inherently dynamic, path-dependent work is brittle: it cannot branch on what it finds. If you cannot predict the steps in advance, a fixed pipeline is the wrong shape no matter how clean it looks.

Choosing the structure

The Evaluate-level call comes down to predictability. Choose a sequential pipeline when the steps are knowable in advance and you value determinism, auditability, and bounded cost. Choose adaptive decomposition when the task is open-ended and path-dependent and that capability is worth the orchestrator overhead and the 3–10× variable cost. Either way, the orchestrator (or the author) must size effort to the task — adaptivity does not excuse you from judgment; it relocates it to runtime.

[Note]

Need more review? Whether to decompose at all (and where the boundaries are) is D1.2; how a multi-step flow is enforced once structured is D1.4. This chapter is the fixed-vs-adaptive shape only.

Practice

Exercise

Evaluate the right decomposition structure for each task and justify it in one or two sentences. (a) A nightly job that summarizes each of 200 incoming support tickets into the same fixed JSON shape. (b) “Find out whether any of our competitors have shipped feature X — go as deep as the question needs.” For whichever you call adaptive, name the failure mode you must guard against and roughly the token cost you are accepting.

Practice ◆◆◆◇

State, in one sentence, the rule for when you can hardcode a sequential pipeline versus when you must decompose adaptively.

Practice ◆◆◇◇

A lead chose an adaptive multi-agent decomposition for a fixed, predictable nightly job, citing “scalability.” Name the rough token multiplier they are now paying versus a single-agent pipeline, state what that spend actually buys (and does not buy), and say why it is wasted here.

Exercise solutions

Solution ↑ Exercise

(a) Sequential pipeline. (b) Adaptive. (a) The steps are known and identical for every ticket, and the output is a fixed shape — nothing to decide at runtime, so a deterministic pipeline wins on cost, auditability, and predictability. (You fan the same fixed path across 200 tickets — parallel throughput, not adaptive structure.) (b) The depth is unknowable up front and each finding changes what to look at next — “you can’t hardcode a fixed path… inherently dynamic and path-dependent” — so the orchestrator must scale the decomposition to what it discovers. On (b) guard against over-decomposition (don’t spawn ten subagents to confirm an obvious “no” — size effort via the 1 / 2–4 / 10+ ladder), and accept that you are paying roughly 3–10× the tokens of a single agent for the thoroughness.

Solution ↑ Exercise

Hardcode a sequential pipeline when the steps are knowable and identical in advance (predictable, repeatable); decompose adaptively only when the task is open-ended and path-dependent — when step N+1 depends on what step N discovers, so no design-time sequence can capture it.

Solution ↑ Exercise

They are paying roughly 3–10× the tokens of a single-agent pipeline (about 15× a chat). That spend buys thoroughness — a larger explored space — not speed (coordination plus the slowest subagent often make it slower in wall-clock). It is wasted here because a fixed, predictable nightly job has nothing to decide at runtime: the orchestrator’s judgment is re-deriving a structure already known at design time, so you pay the multiplier for capability the task never needed. A deterministic pipeline is the correct, bounded-cost shape.

Exam essentials

  • Two shapes, distinguished by when the structure is set: a sequential pipeline is fixed at design time; adaptive decomposition is decided at runtime by the orchestrator.
  • Adaptive when the path can’t be hardcoded — open-ended, path-dependent work where step N+1 depends on step N. Scale effort to the input (the 1 / 2–4 / 10+ heuristic).
  • Pipeline when steps are predictable — deterministic, auditable, bounded cost, usually programmatically enforced (D1.4). Adaptivity here is wasted.
  • Adaptive costs 3–10× the tokens of a single agent (≈15× a chat), and buys thoroughness, not speed. A pipeline’s cost is bounded; an adaptive flow’s is variable and can spike.
  • Two opposite failure modes: over-decomposition (“50 subagents for a simple query”) for adaptive; rigid/role-based pipelining (the telephone game) for fixed.
  • Choosing turns on predictability — and either way the orchestrator must size effort to the task. (Whether to split = D1.2; how to enforce = D1.4.)
Part 1 Chapter 7 Last verified 2026-06-02 Fresh

Session State: resume, fork, and Scratchpads

A session is the persisted conversation, not the filesystem. The architect's tools for carrying or branching state across context windows — continue, resume, and fork, with their literal Python/TS spellings — plus the encoded-cwd resume trap and the discipline of capturing durable artifacts as application state rather than shipping transcripts.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D1.1 (the loop) — you know what a turn is and that hitting max_turns sets an error subtype on the result.
You will learn
  • Distinguish continue, resume, and fork — and write the literal call for each in Python and TS
  • Apply resume to recover a session that hit error_max_turns, getting the cwd right
  • Explain why forking and resuming branch the conversation, not the filesystem
  • Choose between resuming a transcript and capturing artifacts as application state for cross-host work

The loop of D1.1 runs inside a single context window. Real agents outlive one window — they pause, resume on another host, or branch to try an alternative. The state that survives is the session, and this chapter fixes what a session is, the three controls that carry or branch it, and the one discipline that outlasts the session itself. This closes Domain 1: the loop, scaled across many windows.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. What does a session persist — and what does it not?
  2. Write the literal Python (or TS) call for continue, resume, and fork.
  3. If a forked agent edits a file, is the change isolated from the original session? Why?
  4. A resume call returns an empty, fresh session. What is the single most likely cause?
  5. You must finish an interrupted job on a different host tomorrow. What is more robust than shipping the transcript?
Check your answers
  1. The conversation — prompt, tool calls, results, and responses as JSONL under ~/.claude/projects/<encoded-cwd>/ — not the filesystem; reverting file changes is file checkpointing’s job.
  2. continue_conversation=True / continue: true; resume=sessionId / resume: sessionId; fork_session=True / forkSession: true.
  3. No — forking branches the conversation history, not the filesystem, so both forks share one disk and the edit is real and visible to any session in that directory.
  4. A mismatched cwd — sessions are looked up under ~/.claude/projects/<encoded-cwd>/, so resuming from a different directory derives the wrong path and starts fresh.
  5. Capture the artifacts you care about as application state (analysis output, decisions, file diffs) and pass them into a fresh session’s prompt — more robust than shipping transcript files around.

A session is the persisted conversation

A session is the conversation history — the prompt and every tool call, tool result, and response — persisted as JSONL on disk at ~/.claude/projects/<encoded-cwd>/<session-id>.jsonl. [Official] Work with sessions · AnthropicT1-official original The boundary that matters most is what a session does not include:

“Sessions persist the conversation, not the filesystem. To snapshot and revert file changes the agent made, use file checkpointing.” [Official] Work with sessions · AnthropicT1-official original

Concept ·

The session file is the transcript: messages, tool calls, tool results. It lives under ~/.claude/projects/<encoded-cwd>/, keyed by the working directory. It records what was said and done — not the state of the files the agent touched. Conversation and filesystem are two separate kinds of state.

continue, resume, and fork

Three controls carry or branch a session — and the exam expects you to recognize their literal spellings, which differ between Python and TypeScript. [Official] Work with sessions · AnthropicT1-official original

ControlLiteral call (Python / TS)What it doesWhen to reach for it
continuecontinue_conversation=True / continue: truePicks up the most recent session in the current cwd — no ID neededResume after a process restart in the same directory
resumeresume=sessionId / resume: sessionIdPicks up a specific session by IDMulti-user / multi-conversation apps where “most recent” is ambiguous
forkfork_session=True / forkSession: trueStarts a new session ID from a copy of the original’s history; the original is untouchedTry an alternative direction without losing the original thread

continue and resume extend one thread; fork splits one into two. Capture the ID you’ll need later from ResultMessage.session_id — it is present even on errors. [Official] Work with sessions · AnthropicT1-official original

[Tip]

Exam reflex: resume / continue carry one conversation forward; fork branches it into a second, independent one (new ID, original unchanged). Note fork is fork_session (Python) / forkSession (TS) — same idea, casing differs.

Fork branches the conversation, not the filesystem

The most-missed property of forking is the same boundary from section 1, sharpened:

“Forking branches the conversation history, not the filesystem. If a forked agent edits files, those changes are real and visible to any session working in the same directory.” [Official] Work with sessions · AnthropicT1-official original

Key idea

Forking gives you two conversations, not two worlds. Both forks share one filesystem, so a file edit in either is real and visible to the other. To branch and revert file changes, you need file checkpointing — forking alone does not sandbox the disk.

Assuming fork or resume snapshots the filesystem

Treating a fork as an isolated sandbox — “I’ll fork, experiment destructively, and the original is safe” — corrupts the shared working directory. Only the conversation branches. Pair forking with file checkpointing if the files must branch too.

Resume to recover — and the encoded-cwd trap

resume is the recovery tool for a loop that ended on a budget. When a session stops with error_max_turns (D1.1), you resume it with a higher limit and let it finish rather than restarting from scratch. [Official] Work with sessions · AnthropicT1-official original Because the work hit a budget, not a wall, the transcript is intact and resumable. [Official] How the agent loop works · AnthropicT1-official original

Recovering an error_max_turns session Worked example

A first run is bounded too tight and stops with subtype: "error_max_turns" (D1.1) — its .result is empty, but its transcript is intact. Recover it instead of restarting:

# First run returned error_max_turns; we captured its session_id from ResultMessage.session_id
async for message in query(
    prompt="Continue the refactor where you left off.",
    options=ClaudeAgentOptions(resume=session_id, max_turns=40),  # specific session, bigger budget
):
    ...

Two things must be right or resume silently starts fresh: (1) you pass the specific session_id (captured from the first run’s ResultMessage, which is populated even on the error), and (2) you run from the same cwd as the original, because the lookup path is ~/.claude/projects/<encoded-cwd>/. Bump max_turns so the resumed session can actually finish. Restarting from scratch would re-pay every turn already spent; resuming continues the intact transcript.

The single most common resume bug is the encoded-cwd mismatch:

“If a resume call returns a fresh session instead of the expected history, the most common cause is a mismatched cwd. Sessions are stored under ~/.claude/projects/<encoded-cwd>/*.jsonl, where <encoded-cwd> is the absolute working directory with every non-alphanumeric character replaced by -.” [Official] Work with sessions · AnthropicT1-official original

resume silently returns a fresh session

If resume hands you an empty conversation, you are almost certainly running from a different directory than the original. The lookup path is derived from cwd (/Users/me/proj → -Users-me-proj); resume from the same working directory or the SDK looks in the wrong place and starts fresh.

Scratchpads: durable state beyond the session

Sometimes the session itself is the wrong unit to carry — especially across hosts, where a CI worker or ephemeral container won’t have yesterday’s transcript file. The robust move is to lift the state you care about out of the conversation: “capture the artifacts you care about (analysis output, decisions, file diffs) as application state and pass into a fresh session’s prompt,” which the docs call “often more robust than shipping transcript files around.” [Official] Work with sessions · AnthropicT1-official original A scratchpad — a working file the agent writes to and reads from — is that same discipline applied within a run: the durable artifact, not the transcript, is the thing that survives. A fresh session then starts from that artifact, the way the best-practices guidance recommends executing a written spec in a clean session. [Official] Best practices for Claude Code · AnthropicT1-official original

Key idea

Session controls (resume/fork) carry the transcript; a scratchpad/artifact carries the conclusions. The second is more portable: a transcript is tied to a cwd and a host, while a captured artifact (a decisions file, a diff, a spec) can seed a fresh session anywhere. Reach for session controls to recover in place; reach for artifacts to move across hosts.

This is where session state shades into memory — the design rationale for persisting durable context across many sessions is optional depth, in the Further reading.

Practice

Exercise

A CI job runs an agent that stops with error_max_turns partway through a refactor. The job’s container is torn down, and you must finish the work on a fresh worker tomorrow. Walk two options: (a) resume the original session by ID — what must be in place for that to work? — and (b) capture artifacts as application state. Which is more robust across hosts, and why?

Practice ◆◆◇◇

You want to try a risky alternative refactor without losing your current working thread, and you want the original conversation left exactly as it is. Which control do you use, and what happens to the original session?

  • A. continue — it picks up where you left off
  • B. resume with the original sessionId — it reopens the thread
  • C. fork (fork_session=True / forkSession: true) — a new session ID from a copy; the original is untouched
  • D. file checkpointing — it snapshots the work so you can revert
Practice ◆◆◇◇

A resume call returns an empty, fresh session instead of the expected history. Name the most common cause and the fix.

Exercise solutions

Solution ↑ Exercise

(b) is more robust. (a) To resume by ID, the original session JSONL must be restored to the same path — ~/.claude/projects/<encoded-cwd>/<session-id>.jsonl — and the fresh worker must run from the same cwd, because the encoded-cwd is derived from the working directory; only then does resume=sessionId (with a bumped max_turns) find the transcript. That means shipping the transcript file and reproducing the exact directory on every worker. (b) Instead, capture the artifacts that matter — the decisions made, the diff so far, the remaining plan — as application state and seed a fresh session’s prompt with them. No transcript to ship, no cwd to reproduce; the docs call this “often more robust than shipping transcript files around.” Resume is for recovering in place; cross-host work favors captured artifacts. (And note: neither option restores the files the agent edited — that is file checkpointing, separate from session state.)

Solution ↑ Exercise

C — fork. Forking starts a new session ID from a copy of the original’s history, and the original session is left untouched — exactly “branch to try an alternative without losing the original thread.” A (continue) and B (resume) both extend the same thread, so the alternative attempt would pollute the original conversation, not branch from it. D (file checkpointing) snapshots files, not the conversation — useful if the risky refactor must be revertible on disk, but it does not give you a second conversation. (Reminder: fork branches the conversation, not the filesystem, so pair it with checkpointing if the files must branch too.)

Solution ↑ Exercise

The most common cause is a mismatched cwd. Sessions are stored under ~/.claude/projects/<encoded-cwd>/*.jsonl, where <encoded-cwd> is the absolute working directory with every non-alphanumeric character replaced by -; if you resume from a different directory, the SDK derives a different encoded path, finds nothing, and starts a fresh session. The fix: run the resume from the same working directory as the original session (or otherwise ensure the encoded-cwd path matches).

Exam essentials

  • A session is the persisted conversation, not the filesystem — JSONL under ~/.claude/projects/<encoded-cwd>/<session-id>.jsonl. File state is separate (file checkpointing).
  • Three controls, with literal spellings: continue (continue_conversation=True / continue: true), resume (resume=sessionId / resume: sessionId), fork (fork_session=True / forkSession: true). resume/continue carry; fork branches (new ID from a copy; original untouched).
  • Fork branches the conversation, not the disk — forked file edits are real and shared. Pair with file checkpointing to branch files.
  • resume recovers an error_max_turns session with a bumped budget; the #1 bug is a mismatched cwd → wrong encoded path → a fresh, empty session.
  • For cross-host work, capture artifacts as application state and seed a fresh session’s prompt — more robust than shipping transcripts.

Further reading

The design rationale for persisting durable context across many sessions — scratchpads, memory files, retrieval as a discipline — is developed in the Agentic Systems Design book, Chapter 10, Memory: Persisting Context Across Sessions. Optional depth; this chapter stands on its own.

[Note]

Cross-book link is provisional — it points at the chapter source until the Agentic Systems Design book is deployed, then repoints to its published URL.

Part 1 · D1 Review

7 exercises across 7 chapters — interleaved review.

d1-01-agentic-loops

  1. d1-01-ex-trace A session runs: (1) Claude calls `Read`; (2) Claude calls `Grep`; (3) Claude calls `Edit`; (4) Claude returns a text summary with no tool call. How many *turns* is this? What is the `stop_reason` on the final response, and what is the smallest `max_turns` that still lets the session finish?

d1-02-coordinator-subagent-patterns

  1. d1-02-ex-decide A support agent has 40 tools spanning CRM, order history, messaging, and analytics, and its accuracy on multi-step tickets is falling. A teammate proposes splitting it into four subagents: *intake*, *diagnosis*, *resolution*, *follow-up*. Walk the decision framework: is multi-agent warranted here, and if so, is this the right way to cut it? State the win-condition (if any) and name the anti-pattern (if any).

d1-03-subagent-invocation

  1. d1-03-ex-fix-delegation You define a read-only `doc-reviewer` subagent via the `agents` option — `description: "Reviews things"`, `tools: ["Read", "Grep", "Glob"]` — but it never triggers; the main agent just reviews inline. The parent's `allowedTools` is `["Read", "Edit", "Bash"]`. Diagnose the **two** reasons it won't delegate, and give the fix for each. Is the `tools` array itself part of the problem?

d1-04-multi-step-workflows

  1. d1-04-ex-pipeline A content workflow runs four steps — research → draft → fact-check → polish — today as a single agent in one context. Quality is slipping specifically at the fact-check step: the agent waves through claims it wrote moments earlier. Evaluate two options: (a) keep it one prompt-based agent, or (b) hand the fact-check to a fresh-context reviewer with an explicit programmatic handoff. Which do you choose, where exactly do you place the split, and what would over-splitting into four role-agents cost you?

d1-05-agent-sdk-hooks

  1. d1-05-ex-gate-normalize You must enforce two rules on an agent: (a) it may never write to any `.env` file, and (b) every `Bash` result must have its ANSI color codes stripped before the model reads it. For each rule, name the hook event you would use and the specific return field that does the work. Why can't (b) be done with the same event as (a)?

d1-06-task-decomposition

  1. d1-06-ex-pipeline-vs-adaptive Evaluate the right decomposition structure for each task and justify it in one or two sentences. (a) A nightly job that summarizes each of 200 incoming support tickets into the same fixed JSON shape. (b) "Find out whether any of our competitors have shipped feature X — go as deep as the question needs." For whichever you call adaptive, name the failure mode you must guard against and roughly the token cost you are accepting.

d1-07-session-state

  1. d1-07-ex-resume-ci A CI job runs an agent that stops with `error_max_turns` partway through a refactor. The job's container is torn down, and you must finish the work on a **fresh worker** tomorrow. Walk two options: (a) resume the original session by ID — what must be in place for that to work? — and (b) capture artifacts as application state. Which is more robust across hosts, and why?
Part 2 Chapter 1 Last verified 2026-06-02 Fresh

Effective Tool Interfaces: Descriptions, Boundaries, and Naming

A tool's caller-facing contract — description, input examples, operation boundary, name, and response shape — is what a non-deterministic model reads to select and use it. Why the description is the highest-leverage surface, how input_examples show correct usage, when to consolidate, how to namespace, and the object schemas (input and output) every interface stands on.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D1.1 (the agent loop) — you know a tool is name + description + input schema, called via the tool_use round-trip. No tool-authoring experience assumed.
You will learn
  • Explain why a tool’s description is the highest-leverage factor in its performance
  • Add input_examples to demonstrate correct usage, and state their one hard constraint
  • Apply consolidation and service-namespacing to a set of overlapping or service-spanning tools
  • Evaluate a tool response for high-signal content and a tool name against the documented constraints
  • Identify the object input schema — and the optional output schema — that every tool interface stands on

Part I built the agent and its orchestration; Part II turns to the tools that agent reaches for. A tool is a contract between a deterministic system and a non-deterministic caller — and the architect’s leverage is not the implementation behind it but the surfaces the model actually reads: the description, the input examples, the operation boundary, the name, and the response. Get those right and a capable model selects the tool correctly; get them wrong and no amount of model quality rescues it.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. Which single field on a tool definition moves its performance the most, and what is the documented length floor?
  2. What does input_examples do, and what is the one hard rule every example must satisfy?
  3. You have create_pr, review_pr, and merge_pr. What is the documented redesign, and why?
  4. Give the namespaced names for a GitHub search and a Jira search, and the mcp__ form an MCP tool surfaces as.
  5. What is structurally required of every tool’s input schema — and what optional schema governs its output?
Check your answers
  1. The description — “by far the most important factor in tool performance” — with a documented floor of at least 3–4 sentences per tool description, more if the tool is complex.
  2. input_examples is an array of example argument objects that show the model correct calls; each example must validate against the tool’s input_schema, or the request returns a 400.
  3. Consolidate them into a single tool with an action parameter — fewer, more capable tools reduce selection ambiguity.
  4. github_search and jira_search; an MCP tool surfaces as mcp__<server>__<tool> (e.g. mcp__github__list_issues).
  5. The input schema must be a JSON Schema object (a no-argument tool still declares an empty object); the optional MCP outputSchema governs the output, obligating conforming structuredContent.

The description is the highest-leverage surface

Of every field on a tool definition, the description moves performance the most: detailed descriptions are “by far the most important factor in tool performance.” [Official] Define tools · AnthropicT1-official original A description is not documentation for a human reader — it is the surface the model selects from, so it must spell out what the tool does, when it should be used (and when it should not), what each parameter means, and any caveats. [Official] Define tools · AnthropicT1-official original The guidance even sets a floor: aim for “at least 3-4 sentences per tool description, more if the tool is complex.” [Official] Define tools · AnthropicT1-official original

The gap is concrete. A get_stock_price described as “Retrieves the current stock price for a given ticker symbol… returns the latest trade price in USD… It will not provide any other information” tells the model exactly when to reach for it and what it gets back; the same tool described as “Gets the stock price for a ticker” leaves it guessing about inputs, outputs, and boundaries. [Official] Define tools · AnthropicT1-official original

Key idea

For the agent, the description is the API. The model never reads your implementation; it chooses tools by their descriptions alone — so the description is where an architect spends the first and largest share of design effort.

Show correct usage with input_examples

The description tells the model how to use a tool; input_examples show it. This optional field carries an array of example argument objects that demonstrate correct calls — the documented “Tool Use Examples” feature. [Official] Define tools · AnthropicT1-official original A weather tool can ship three: a full call, a call with a different unit, and a call that omits the optional field — teaching the model the shape by demonstration rather than prose.

The one hard rule: each example must validate against the tool’s input_schema, or the request returns a 400. [Official] Define tools · AnthropicT1-official original Two more facts for the exam: input_examples are for client (user-defined) tools, not server-side tools, and they cost roughly 20–50 tokens for simple examples, 100–200 for complex nested ones — a context cost you pay deliberately where ambiguity is high. [Official] Define tools · AnthropicT1-official original

A description plus input_examples Worked example

A get_weather tool, with the two model-facing surfaces working together:

{
  "name": "get_weather",
  "description": "Get the current weather for a location. Use when the user asks about present conditions; not for forecasts. `unit` is optional and defaults to celsius.",
  "input_schema": {
    "type": "object",
    "properties": {
      "location": { "type": "string" },
      "unit": { "type": "string", "enum": ["celsius", "fahrenheit"] }
    },
    "required": ["location"]
  },
  "input_examples": [
    { "location": "San Francisco, CA", "unit": "fahrenheit" },
    { "location": "Tokyo, Japan", "unit": "celsius" },
    { "location": "New York, NY" }
  ]
}

The third example deliberately omits unit to show it is optional. Every example validates against input_schema — if you typo "unit": "kelvin", the example fails the enum and the whole request 400s, so the examples double as a self-check on your schema. The description draws the boundary (“not for forecasts”); the examples remove any doubt about argument shape.

Consolidate operations to reduce selection ambiguity

The next surface is the operation boundary — how much each tool does. The documented default is to consolidate: “Consolidate related operations into fewer tools. Rather than creating a separate tool for every action (create_pr, review_pr, merge_pr), group them into a single tool with an action parameter. Fewer, more capable tools reduce selection ambiguity.” [Official] Define tools · AnthropicT1-official original Every extra near-equivalent tool is one more line the model can pick wrong.

The deeper principle is to design for the agent’s affordances, not mirror your API’s endpoints: rather than make the model chain list_users + list_events + create_event, give it one schedule_event; rather than get_customer_by_id + list_transactions + list_notes, give it get_customer_context. [Official] Writing tools for agents · AnthropicT1-official original A tool that returns exactly the workflow the agent needs beats three tools it must orchestrate.

Two tools, one job

When two tools carry near-duplicate descriptions — analyze_content vs analyze_document — the model misroutes between them. Either fold them into one tool with a disambiguating parameter, or make each description’s boundary unambiguous (“use analyze_document only when the input is a stored file with a document_id”). Vague, overlapping boundaries are a documented cause of tool-selection errors. [Official] Writing tools for agents · AnthropicT1-official original

Namespace tool names by service

A name is the model’s fastest disambiguator, and the documented convention is to namespace by service: “Use meaningful namespacing in tool names… prefix names with the service (e.g., github_list_prs, slack_send_message). This makes tool selection unambiguous as your library grows.” [Official] Define tools · AnthropicT1-official original Bare search becomes a liability the moment a second search exists; github_search and jira_search never collide.

Names also carry hard constraints that differ by regime. A Claude API tool name must match ^[a-zA-Z0-9_-]{1,64}$. [Official] Define tools · AnthropicT1-official original An MCP tool name should be 1–128 characters of ASCII letters, digits, underscore, hyphen, or dot — no spaces — and unique within its server. [Official] Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original Those MCP tools then reach the agent through a fixed pattern, mcp__<server>__<tool>: a list_issues tool on a server keyed github becomes mcp__github__list_issues. [Official] Connect to external tools with MCP · AnthropicT1-official original

[Tip]

Consistent prefixes pay off later: the wildcard scoping of D2.3 (mcp__github__*) and the server configuration of D2.4 both key off these names. Naming is a distribution decision, not just a readability one.

Return only high-signal information

The response is the half of the contract authors forget. The model reads every token a tool returns, so a tool should “return only high-signal information… semantic, stable identifiers (e.g., slugs or UUIDs) rather than opaque internal references, and include only the fields Claude needs to reason about its next step.” [Official] Define tools · AnthropicT1-official original Bloated responses waste the context window and bury the fields that matter. The shape of the response also shapes the next call: a semantic identifier the model can pass straight into the following tool keeps a multi-step task cheap; an opaque internal handle forces a re-lookup. [Official] Writing tools for agents · AnthropicT1-official original

When the response should be machine-shaped, MCP lets a tool declare an optional outputSchema — and when it does, the server MUST return structuredContent conforming to that schema (mirroring it in a text block for compatibility). [Official] Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original That is the output-side analogue of the required input schema; the structured-output machinery that drives it is Domain 4’s subject (D4.3).

Concept ·

The input schema shapes the call in; the response (and an optional outputSchema) shapes the call out — and the call after it. An interface designed only at the input boundary is half-specified. Treat the response fields as deliberately as the parameters.

The structural floor: an object input schema

Beneath the design judgments sits a requirement no interface can skip. Every tool’s input schema is a JSON Schema object: in the Claude API a tool definition’s three required fields are name, description, and an input_schema object; [Official] Define tools · AnthropicT1-official original in MCP the inputSchema is required and must be a valid JSON Schema object, not null. [Official] Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original A tool that takes no arguments still declares an empty object schema — the object is the floor every interface stands on.

[Note]

A related field, strict: true, makes the model’s inputs conform to that schema exactly — but it belongs to the error contract (preventing malformed calls), which is D2.2’s and D2.3’s subject. Here it is enough that the schema is structurally an object.

Practice

Exercise

A teammate ships three tools — get_customer_by_id, list_customer_transactions, and list_customer_notes — and reports that the agent often calls only the first and then stalls. You may redesign the interface. Propose a single consolidated tool (give its name, a one-sentence description, and the boundary it draws), and name the two interface principles your redesign applies.

Practice ◆◆◇◇

You add three input_examples to a tool and the API rejects the whole request with a 400. Nothing is wrong with your description or your code. What is the most likely cause, and which field do the examples have to agree with?

Practice ◆◆◆◇

You inherit a tool whose entire description reads “Gets data for a record.” List three things a good tool description must add (per the documented best practice), and explain why the model — not a human reader — is the audience that makes this the highest-leverage fix.

Exercise solutions

Solution ↑ Exercise

Consolidate the three into one get_customer_context tool (namespace it — e.g. crm_get_customer_context — if the agent spans services). Its description should state what it returns and when to use it: “Returns a customer’s profile, recent transactions, and notes for a given customer ID; use it whenever you need context about a customer before acting.” The redesign applies consolidation (fewer, more capable tools reduce selection ambiguity) and design-for-affordances (one call returns the context the agent needs instead of three CRUD calls it must chain). The agent stalled because three thin tools forced multi-step chaining the descriptions never made obvious; a single high-signal response also lets any follow-up call reuse the returned identifiers cheaply.

Solution ↑ Exercise

The most likely cause is that one of the examples does not validate against the tool’s input_schema — an invalid input_examples entry returns a 400. Every example must conform to the same input_schema the real calls do (right types, required fields present, enum values legal); a single bad example (a typo’d enum, a missing required field) fails the whole request. The examples have to agree with input_schema — which is also why they double as a check on the schema itself.

Solution ↑ Exercise

A good description must add, at minimum: (1) what the tool does concretely (not “gets data” but which data, in what form); (2) when to use it and when not to — the boundary that prevents misrouting; (3) what each parameter means (and what the response returns). Aim for 3–4 sentences. The audience is the model, which selects tools by description alone and never reads the implementation — so an opaque description is a performance bug the model cannot route around, making the description the single highest-leverage fix (“by far the most important factor in tool performance”). Adding input_examples compounds the gain by showing correct argument shape.

Exam essentials

  • The description is the highest-leverage surface — “by far the most important factor in tool performance.” Say what the tool does, when (and when not) to use it, and what each parameter means; 3–4 sentences minimum.
  • input_examples show correct usage — an array of example argument objects; each must validate against input_schema (invalid → 400). Client tools only, not server tools; ~20–50 / ~100–200 tokens.
  • Consolidate to reduce selection ambiguity — fewer, more capable tools (an action parameter over create_pr/review_pr/merge_pr); design for the agent’s affordances, not your API’s endpoints.
  • Namespace names by service — github_list_prs, not a bare search. API names match ^[a-zA-Z0-9_-]{1,64}$; MCP names are 1–128 ASCII chars and surface as mcp__server__tool.
  • Return only high-signal information — semantic, stable identifiers and only the fields the model needs. MCP’s optional outputSchema governs the machine-shaped output (server must then return conforming structuredContent).
  • The input schema must be an object — the structural floor of every tool; strict: true (D2.2/D2.3) then makes inputs conform to it.
Part 2 Chapter 2 Last verified 2026-06-02 Fresh

Structured Error Responses: isError, Retryability, and the Protocol-Error Split

A tool's failure contract — which channel a failure travels down, what its text says, and whether the schema could have prevented it. The two regimes (Messages-API is_error vs MCP isError + JSON-RPC), why is_error turns a failure into a recoverable signal, the normative execution-vs-protocol error split, and the difference between steering a retry and preventing the error.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D1.1 (the tool_result round-trip and is_error) and Chapter D2.1 (the tool interface). You know a failed tool call returns a tool_result block; this chapter is what goes inside it.
You will learn
  • Distinguish the two error regimes — the Claude Messages API (is_error) and MCP (isError + JSON-RPC) — and not conflate their spellings
  • Apply is_error: true with actionable content so a failed call becomes recoverable
  • Distinguish an execution error (the model self-corrects) from a protocol error (it cannot)
  • Analyze the model’s documented retry behavior and why error content — not a parameter — steers it
  • Choose strict: true to prevent the schema errors you would otherwise have to handle

D2.1 designed a tool’s happy path; this chapter designs its failure path. When a call goes wrong, the architect decides three things: which channel the failure travels down, what the failure text says, and whether the schema could have prevented it at all. But “which channel” depends on which regime you are in — and conflating the Claude Messages API with MCP is the most common mistake here, so we separate them first.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. In the Claude Messages API, what field flags a failed tool result — and what is its exact casing?
  2. In MCP, what are the two error channels, and which one does a validation failure belong in?
  3. Does the Claude Messages API have a JSON-RPC -32602 channel for a bad tool call? If not, where does a protocol-level problem surface?
  4. How many times does Claude retry a bad call, and is that a parameter you can set?
  5. Which failures does strict: true prevent, and which does it not?
Check your answers
  1. is_error: true on the tool_result block — snake_case (is_error); the camelCase isError belongs to MCP.
  2. Execution errors ride isError: true inside a successful result; protocol errors ride a JSON-RPC error response (e.g. -32602). A validation failure belongs in the isError channel (SEP-1303), so the model can correct and retry.
  3. No — the Messages API has one tool-failure signal (is_error); a protocol-level problem surfaces as an HTTP error (e.g. 400), not a JSON-RPC channel.
  4. 2–3 times with corrections before apologizing — documented default behavior, not a parameter you can set; your only lever is the quality of the error content.
  5. strict: true prevents schema violations (missing parameters, type mismatches); it does not prevent runtime API errors, business-logic violations, or semantic constraints a JSON Schema can’t express.

Two regimes, two spellings

“Structured error” means two related-but-distinct things depending on the surface you are on, and the exam (and real code) punish conflating them.

Concept ·
  • Claude Messages API (direct). A failed tool returns a tool_result block with is_error: true (snake_case) and actionable content. There is one error signal for tool failures; a protocol-level problem (a malformed request, tool_result not first in the content array) surfaces as an HTTP error (e.g. 400), not a JSON-RPC channel. [Official] Handle tool calls · AnthropicT1-official original
  • MCP. A tools/call returns a CallToolResult carrying an isError flag (camelCase, default false). MCP has two channels: execution errors ride isError: true inside a successful result; protocol errors (unknown tool, malformed request) ride a JSON-RPC error response such as -32602. [Official] Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original

The casing is the tell: is_error is the Claude Messages API; isError is MCP. The two-channel (isError vs JSON-RPC) split below is an MCP model — the direct Messages API has only the single is_error signal for tool failures.

Key idea

The execution-vs-protocol split is real but lives in MCP. On the Claude Messages API there is one tool-failure signal — is_error — and protocol problems are HTTP-level, not a second in-band channel. Keep the spelling consistent with the regime you are describing, and never present the MCP two-channel model as if the direct API also had a JSON-RPC error class.

is_error: true is the canonical failure signal (Messages API)

On the Claude Messages API, a failed tool still returns a tool_result — but flagged. is_error: true is the canonical signal that a tool call failed: Claude folds the error into its next-turn reasoning and may retry. [Official] Handle tool calls · AnthropicT1-official original The flag is what turns a failure into a message to the model rather than a dead end — a result whose content reads ConnectionError: the weather service API is not available (HTTP 500) with is_error: true lets the next turn reason about what to do. [Official] Handle tool calls · AnthropicT1-official original The design principle is two lines: set is_error: true on the tool_result block, and make the content text actionable. [Official] Writing tools for agents · AnthropicT1-official original

Key idea

An error is a message to the model, not a log line for you. Its only job is to let the next turn recover — so it must be both flagged (so the model knows the call failed) and legible (so the model knows what to do about it).

Write instructive error messages

The flag says that it failed; the content must say what to do next. The documented principle is explicit: “Write instructive error messages. Instead of generic errors like ‘failed’, include what went wrong and what Claude should try next, e.g., ‘Rate limit exceeded. Retry after 60 seconds.’ This gives Claude the context it needs to recover or adapt without guessing.” [Official] Handle tool calls · AnthropicT1-official original

The 'failed' anti-pattern

A tool_result of "failed" or a raw stack trace gives the model nothing to act on — it guesses or gives up. Every error string should name what went wrong and what to try next: not "Error: 429" but "Rate limit exceeded. Retry after 60 seconds." That actionable form is the difference between a recovered task and a stalled one.

Execution vs protocol errors: the MCP channel split

Within MCP, a failure travels down one of two channels, and the choice is normative, not stylistic. The specification draws the line: isError: true inside a successful result is for execution errors the model should self-correct on — input validation, API failures, business-logic errors — while a JSON-RPC error response is for protocol errors the model cannot fix, such as an unknown tool or a malformed request. [Official] Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original

The most-tested trap lives on this line: a validation failure belongs in the isError channel, not in a JSON-RPC -32602. The 2025-11-25 spec (per SEP-1303) is explicit that input-validation errors return as isError: true content — for example, Invalid departure date: must be in the future. Current date is 08/08/2025. — so the model can correct and retry. [Official] Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original

Concept ·

The isError content is addressed to the model — it reads the text and self-corrects. The JSON-RPC error is addressed to the client/host — the model cannot usefully recover from it, so routing a recoverable failure there silently denies the model its chance to retry. Route by who can act on the error, not by how severe it feels.

A past departure date: which channel? Worked example

A booking tool receives departure_date in the past. Two ways to report it — one right, one wrong:

// WRONG (MCP) — a JSON-RPC protocol error for a recoverable input problem:
{ "jsonrpc": "2.0", "id": 7,
  "error": { "code": -32602, "message": "Invalid params" } }   // model can't read/recover

// RIGHT (MCP) — an execution error in the result, addressed to the model:
{ "content": [{ "type": "text",
    "text": "Invalid departure date: must be in the future. Current date is 2026-06-02." }],
  "isError": true }                                            // model corrects and retries

A past date is an input-validation / business-logic failure — exactly what isError: true exists for (SEP-1303). Sending it as -32602 routes a recoverable error to the host, which silently denies the model its retry. On the Claude Messages API the same failure is a tool_result with is_error: true and that actionable text — same idea, different spelling, no JSON-RPC channel involved.

Retryability is documented behavior, not a parameter

The model already retries failed calls on its own: “If a tool request is invalid or missing parameters, Claude will retry 2-3 times with corrections before apologizing to the user.” [Official] Handle tool calls · AnthropicT1-official original That loop is why the channel choice and the content quality matter so much — a failure returned as legible error content feeds each retry something to correct against, whereas a protocol error the model cannot read gives it nothing to adjust and burns the budget toward an apology.

[Note]

The 2–3-times retry is documented default behavior, not an API knob — there is no retry-count parameter. The only lever over the retry is the quality of the error content the model reads on each attempt.

Prevent the error: strict: true

The cheapest error to handle is the one that never happens. For the largest class of failures — malformed inputs — there is a prevention switch: “To eliminate invalid tool calls entirely, use strict tool use with strict: true on your tool definitions. This guarantees that tool inputs will always match your schema exactly, preventing missing parameters and type mismatches.” [Official] Handle tool calls · AnthropicT1-official original Define tools · AnthropicT1-official original With strict: true the schema-violation error class disappears before it can reach your handler.

Handling what you could have prevented

Writing elaborate validation-error content for a parameter that strict: true would have made impossible is wasted effort. Reserve error content for the failures you cannot prevent — runtime API errors, business-logic violations, and semantic constraints a JSON Schema can’t express (like “the date must be in the future”). Prevent the structural errors; make the rest legible.

Practice

Exercise

Your booking tool (exposed over MCP) receives a departure date in the past. A colleague wants to return a JSON-RPC -32602 Invalid params error. (a) Which channel is correct, and why? (b) Write the one-sentence error content the tool should return. (c) Would strict: true have prevented this particular failure? Explain. (d) If the same tool were called directly over the Claude Messages API instead of MCP, what changes about the spelling of the failure flag?

Practice ◆◆◇◇

A scenario describes a tool called directly through the Claude Messages API (not MCP) that hits an internal API timeout. A candidate answer says “return a JSON-RPC -32602 error.” Name the two things wrong with that answer.

Practice ◆◆◆◇

An agent calls a tool with a missing required parameter. In two sentences, describe what Claude does next according to the documented behavior, and explain why making the error content actionable matters even though you cannot set a retry-count parameter.

Exercise solutions

Solution ↑ Exercise

(a) Return isError: true content, not a JSON-RPC error. A past departure date is an input-validation / business-logic failure the model can self-correct, and the MCP spec (SEP-1303) routes validation errors to the isError channel, reserving JSON-RPC errors for protocol problems the model cannot fix. (b) Make it actionable, e.g. “Invalid departure date: must be in the future. Current date is 2026-06-02.” (c) No. strict: true guarantees the input matches the schema (a correctly-typed date string), but “must be in the future” is a semantic constraint a JSON Schema type cannot express, so this failure must be caught at runtime and returned as legible error content. (d) Over the Claude Messages API the flag is is_error (snake_case) on the tool_result block — same meaning, different spelling — and there is no JSON-RPC channel at all (protocol problems would be HTTP 400s).

Solution ↑ Exercise

Two things are wrong. (1) Wrong regime: the Claude Messages API has no JSON-RPC error channel — JSON-RPC -32602 is an MCP protocol error. On the direct API a tool failure is a tool_result with is_error: true; protocol-level problems are HTTP errors (e.g. 400), not JSON-RPC. (2) Wrong channel even in MCP: an internal API timeout is an execution error the model could retry against, so even under MCP it belongs in isError: true content, not a protocol -32602. The candidate conflated the two regimes and mis-routed a recoverable error.

Solution ↑ Exercise

Claude retries the call 2–3 times with corrections before apologizing to the user — that is documented default behavior, not something you configure. Making the error content actionable matters because each retry reads that content to decide its correction; there is no retry-count knob, so the legibility of the error text is the only lever you have over whether those 2–3 attempts converge on a fix or burn down to an apology.

Exam essentials

  • Two regimes, two spellings: Claude Messages API uses is_error (snake_case) on a tool_result, with one tool-failure signal (protocol problems are HTTP errors). MCP uses isError (camelCase) on a CallToolResult, plus a separate JSON-RPC channel for protocol errors. Don’t conflate them.
  • is_error/isError is the canonical failure signal — it turns a failed result into a message the model reasons over and may retry against. Flag it and make the content actionable (“Rate limit exceeded. Retry after 60 seconds.” beats "failed").
  • MCP’s two channels, two audiences — isError: true content = execution errors the model self-corrects (validation, API, business logic); a JSON-RPC error = protocol errors it cannot fix (unknown tool, malformed request). Validation errors go in the isError channel (SEP-1303), never JSON-RPC -32602.
  • Retry is default behavior, not a parameter — Claude retries 2–3 times with corrections; there is no retry-count knob, so error-content quality is your only lever.
  • Prevent with strict: true — eliminates schema-violation errors entirely; reserve error content for runtime, business-logic, and semantic failures you cannot prevent.
Part 2 Chapter 3 Last verified 2026-06-08 Fresh

Tool Distribution and tool_choice: auto, any, Forced, and none

Controlling whether and which tool the model may call — the four tool_choice modes, the extended-thinking constraint, the any+strict guarantee of a schema-valid call, and the prompt-cache invalidation cost — plus allowedTools scoping versus bypassPermissions, with parallel execution as its own orthogonal axis.

Volatility: feature-surface
Tools compared: claude-code
Before you start: Chapter D1.1 (the agent loop) and Chapter D2.1 (tool interfaces). Helpful: Chapter D1.3 (allowedTools on subagents) for the distribution half.
You will learn
  • Identify the four tool_choice modes and what each one forces
  • Apply the right mode — and recognize the one case where extended thinking forbids it
  • Combine tool_choice: any with strict: true to guarantee a schema-valid tool call
  • Distinguish per-request steering (tool_choice) from per-agent tool distribution (allowedTools)
  • Evaluate allowedTools scoping against the broader bypassPermissions for MCP access

Defining a good tool (D2.1) and a good failure (D2.2) is only half the architect’s job; the other half is controlling whether and which tool the model may reach for. Two knobs do this at two different scopes: tool_choice steers a single request — force a tool, free the model, or forbid all tools — while the SDK’s allowedTools defines which tools even exist for an agent. The exam tests the four tool_choice modes (especially the one constraint that trips people up), the any+strict guarantee, and the difference between steering a call and distributing a surface.

[Note]

This is a feature-surface chapter: the tool_choice values, the allowedTools syntax, and the parallel-execution flag are named API/SDK surfaces that can shift between releases. Treat the specifics as a current snapshot and re-verify against the docs before relying on an exact flag.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. Name the four tool_choice modes and what each forces.
  2. Which two modes are unusable when extended thinking is on?
  3. You need a guaranteed schema-valid tool call without forcing one specific tool. What two settings do you combine?
  4. What does changing tool_choice do to your prompt cache?
  5. tool_choice vs allowedTools — which steers a request, and which defines the toolbox?
Check your answers
  1. auto (Claude decides — the default when tools are provided), any (must use some tool), forced tool (must use this tool), and none (no tools this turn).
  2. any and forced tool — only auto and none are compatible with extended (or adaptive) thinking; the forced modes error the request.
  3. tool_choice: {"type": "any"} plus strict: true on the tool definition — any guarantees a tool fires, strict guarantees its inputs match the schema.
  4. It invalidates the cached message blocks — tool definitions and the system prompt stay cached, but message content must be reprocessed; keep tool_choice stable across cached turns.
  5. tool_choice steers a single request; allowedTools defines which tools the agent has at all — one shapes a call, the other shapes the toolbox.

The four tool_choice modes

tool_choice is the per-request control over tool calling, and it has four documented modes: auto (Claude decides — the default when tools are provided), any (Claude must use some tool but picks which), {"type": "tool", "name": …} (forces one specific tool), and none (no tools this turn — the default when none are provided). [Official] Define tools · AnthropicT1-official original Tool use with Claude · AnthropicT1-official original

Concept ·

Read the modes as a spectrum from least to most coercive: none forbids all tools, auto lets the model choose freely, any forces some tool, and forced tool forces this tool. Pick the loosest mode that still guarantees what you need — auto for an open agent, any when an answer must come through a tool, forced when exactly one tool is valid.

Forced modes are incompatible with extended thinking

The constraint the exam loves: only auto and none are compatible with extended thinking; any and forced tool return an error, and adaptive thinking carries the same limitation. [Official] Tool use with Claude · AnthropicT1-official original Define tools · AnthropicT1-official original If you need the model to reason before acting, you cannot also force it to call a tool — the two are mutually exclusive.

Forcing a tool with thinking on

Setting tool_choice to any or a forced tool while extended (or adaptive) thinking is enabled errors the request. When thinking is on, your only tool_choice options are auto and none. If a design seems to require both forced tool use and visible reasoning, the requirements are in tension — resolve one or the other; don’t expect both.

Forced modes prefill the assistant message

Forcing has a second, subtler effect. “When you have tool_choice as any or tool, the API prefills the assistant message to force a tool to be used. This means that the models will not emit a natural language response or explanation before tool_use content blocks, even if explicitly asked to do so.” [Official] Define tools · AnthropicT1-official original A forced call therefore cannot also produce a spoken preamble — there is no room before the tool_use block for one.

[Tip]

Need a natural-language preamble and a specific tool? Use auto and ask for the tool in the user message (“Use the get_weather tool in your response”) rather than forcing it. Forcing buys a guarantee at the cost of the preamble; auto keeps both at the cost of the guarantee.

Guarantee a schema-valid call: any + strict

any guarantees that a tool fires, but not that its inputs are valid — and forcing one specific tool isn’t always what you want. Compose two switches to get both guarantees at once: “Combine tool_choice: {'type': 'any'} with strict tool use to guarantee both that one of your tools will be called AND that the tool inputs strictly follow your schema. Set strict: true on your tool definitions to enable schema validation.” [Official] Define tools · AnthropicT1-official original any covers that a tool is called; strict: true (D2.2) covers that its arguments match the schema. Together they make “some tool, well-formed” a hard guarantee — the right shape for a classifier or extractor that must always emit structured output through a tool.

[Note]

strict: true is a per-tool property, set on a tool definition. It is the same prevention switch D2.2 uses against malformed inputs — here it composes with any to pin both whether a tool fires and how its inputs are shaped.

A classifier that must always emit valid JSON Worked example

A record_decision tool must be called on every turn with a schema-valid payload ({ "label": "approve" | "deny" | "escalate", "reason": string }). Thinking is off for this step.

  • tool_choice: {"type": "any"} guarantees a tool is called — with only record_decision available, that is it.
  • strict: true on record_decision guarantees the label/reason inputs match the schema exactly (no missing field, no out-of-enum label).
// request
{ "tools": [{ "name": "record_decision", "strict": true, "input_schema": { /* label enum + reason */ } }],
  "tool_choice": { "type": "any" } }

The pair makes “some tool, well-formed” a hard guarantee. Note what you cannot add: extended thinking — any is incompatible with it. If you needed the model to reason first, you would drop to auto and lose the hard guarantee, or move the reasoning to a prior, tool-free turn.

[Cost]

Changing tool_choice between turns invalidates cached message blocks under prompt caching — tool definitions and the system prompt stay cached, but message content must be reprocessed. Toggling tool_choice per turn forfeits that cache; keep it stable across cached turns. [Official] Define tools · AnthropicT1-official original

Distribution: scope the surface with allowedTools

tool_choice steers one request; distribution decides which tools an agent has at all, and in the SDK that knob is allowedTools / disallowedTools. The two behave differently: allowed_tools=["Read", "Grep"] pre-approves the listed tools (others still exist and fall through to the permission mode), while disallowed_tools=["Bash"] removes the tool from the request entirely, so the model never sees it. [Official] Configure permissions · AnthropicT1-official original

For MCP access the documented guidance is to scope with allowedTools rather than open the gates with a permission mode: a mcp__github__* wildcard “grants exactly the MCP server you want and nothing more,” whereas permissionMode: "bypassPermissions" auto-approves MCP tools but disables every other safety prompt — broader than necessary. [Official] Connect to external tools with MCP · AnthropicT1-official original

Key idea

tool_choice is per-request steering — which of the available tools fires now. allowedTools is per-agent surface definition — which tools are available at all. Keep the two axes distinct: one shapes a call, the other shapes the toolbox.

Parallel execution is a separate request-level control

A third knob is easy to confuse with tool_choice but is orthogonal to it: disable_parallel_tool_use. Claude 4 models may emit several tool_use blocks in one turn by default; setting disable_parallel_tool_use=true caps that — with tool_choice: auto Claude then uses at most one tool, and with any or forced tool it uses exactly one. [Official] Parallel tool use · AnthropicT1-official original

Conflating 'one tool' with 'one mode'

disable_parallel_tool_use governs how many tools fire per turn; tool_choice governs whether and which. They are different axes and they compose — any plus disable_parallel_tool_use yields exactly one tool call. It is not a fifth tool_choice value; don’t reach for tool_choice when what you actually want is to cap parallelism.

Practice

Exercise

You are building an agent that (a) must produce a JSON classification by calling your record_decision tool on every turn with valid inputs, and (b) you also want extended thinking enabled so it reasons first. A colleague sets tool_choice: {"type": "tool", "name": "record_decision"}. What goes wrong? What configuration gets you the schema-valid guaranteed tool call, and what must you give up to also keep thinking?

Practice ◆◆◇◇

An agent should be able to call every tool on an MCP server named linear but nothing else dangerous. Give the allowedTools entry you would use, and explain in one sentence why this is preferable to permissionMode: "bypassPermissions".

Practice ◆◆◆◇

A latency-sensitive workflow uses prompt caching and toggles tool_choice between auto and a forced tool on alternate turns, then is surprised that caching “barely helps.” Explain what is happening to the cache and the one change that restores most of the benefit.

Exercise solutions

Solution ↑ Exercise

Forced tool mode is incompatible with extended thinking, so the request errors — you cannot both force a specific tool and let the model reason with extended (or adaptive) thinking. To get a schema-valid guaranteed tool call, combine tool_choice: {"type": "any"} (guarantees some tool fires — here only record_decision exists) with strict: true on the tool (guarantees the inputs match the schema). But any is also incompatible with thinking, so to keep the hard guarantee you must give up extended thinking on this turn (or move the reasoning to a prior tool-free turn and force/any the decision on the next). No single configuration gives you a forced, schema-valid call and visible reasoning at once; forcing also prefills the assistant turn and suppresses the preamble.

Solution ↑ Exercise

Use allowedTools: ["mcp__linear__*"] — the wildcard pre-approves exactly the linear server’s tools and nothing else. It is preferable to permissionMode: "bypassPermissions" because bypass auto-approves the MCP tools and disables every other safety prompt across the whole agent (far broader than you need), whereas the scoped wildcard grants exactly the one server and leaves all other gates intact.

Solution ↑ Exercise

Every turn that changes tool_choice invalidates the cached message blocks — tool definitions and the system prompt stay cached, but the message content has to be reprocessed — so alternating auto/forced means roughly every other turn pays full message-processing cost, which is why caching “barely helps.” The fix: keep tool_choice stable across the cached turns (don’t toggle it per turn); if some steps genuinely need a forced tool, group them so the value changes as rarely as possible rather than every turn.

Exam essentials

  • Four tool_choice modes — auto (default; Claude decides), any (must use some tool), forced {"type":"tool","name":…} (this tool), none (no tools). A spectrum from free to coerced.
  • Forced modes break extended thinking — only auto and none work with extended (or adaptive) thinking; any and forced tool error. The single most-tested tool_choice constraint.
  • any + strict: true = a schema-valid guaranteed call — any guarantees a tool fires, strict: true guarantees its inputs match the schema; compose them for classifiers/extractors. (strict is a per-tool property.)
  • Forced modes prefill the assistant turn — no natural-language preamble before the tool_use block; for preamble plus a specific tool, use auto and ask in the user message.
  • tool_choice changes invalidate the prompt cache — message blocks must be reprocessed (tool defs + system prompt stay cached); keep tool_choice stable across cached turns.
  • Distribution ≠ steering — allowedTools defines which tools exist (disallowed_tools removes one entirely); tool_choice steers a request. For MCP, a mcp__server__* wildcard beats bypassPermissions (narrower). Parallelism is its own axis (disable_parallel_tool_use).
Part 2 Chapter 4 Last verified 2026-06-08 Fresh

MCP Server Configuration: .mcp.json, Scopes, and Env-Var Expansion

Wiring an MCP server so it resolves predictably across personal, team, and machine contexts. The two config paths and strictMcpConfig, claude mcp add --scope, the three scopes and their precedence, the local-scope-versus-local-settings trap, env-var expansion for secrets, verifying the connection via system:init, and the transports atop a mid-revision wire protocol.

Volatility: feature-surface
Tools compared: claude-code
Before you start: Chapter D2.1 (MCP tools surface as mcp__server__tool) and Chapter D2.3 (allowedTools wildcards scope an MCP server). No prior MCP-server operations assumed.
You will learn
  • Choose the config location — programmatic (with strictMcpConfig) vs .mcp.json — and the right scope for an audience
  • Install a server with claude mcp add and the --scope / --transport flags
  • Apply env-var expansion to keep secrets out of a committed config
  • Verify a server actually connected by reading its system:init status before the agent runs
  • Distinguish MCP “local scope” from general “local settings,” and recognize the scope-precedence order

D2.1 through D2.3 designed tools, their failures, and their distribution; this chapter is where an external MCP server actually gets connected. Almost every trap here is about location — which file holds the config, which scope it lives in, which directory it resolves against — plus one notorious naming collision and one silent-failure mode: a server that never connected. Get the location right and check the connection, and a server resolves predictably across personal, team, and machine contexts; get it wrong and the agent silently never sees the tools.

[Note]

This is a feature-surface chapter: file paths, scope names, env-var syntax, CLI flags, and the transport list are concrete config surfaces that move between releases — and the MCP wire protocol itself is mid-revision (see the last section). Treat every path and flag as a current snapshot and re-verify before relying on it.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. Name the two ways to configure an MCP server, and what strictMcpConfig: true does.
  2. What flag on claude mcp add chooses local vs project vs user scope?
  3. Where does a local-scoped MCP server live — and how does that differ from “local settings”?
  4. After wiring a server, how do you confirm it actually connected before the agent runs?
  5. What transports can an MCP server use (name the deprecated one and the .mcp.json-only one), and what is the default connection timeout?
Check your answers
  1. Programmatically (mcp_servers in Python, mcpServers in TypeScript) or via a .mcp.json at the project root; strictMcpConfig: true uses only the servers you pass in mcpServers, ignoring .mcp.json, user settings, and plugins.
  2. --scope <local|project|user> — omit it and the default is Local.
  3. In ~/.claude.json (your home directory, under a per-project key) — not .claude/settings.local.json, which is the project’s general local-settings file.
  4. Read the system:init message before the agent runs — each server’s status is one of connected | failed | needs-auth | pending | disabled.
  5. stdio, sse (deprecated), and http (alias streamable-http), plus ws, configurable only via .mcp.json or claude mcp add-json; the default connection timeout is 60 seconds.

Two ways to configure a server

An MCP server reaches an agent through one of two configuration paths. You can register it programmatically — mcp_servers in Python, mcpServers in TypeScript — or declare it in a .mcp.json file at the project root. [Official] Connect to external tools with MCP · AnthropicT1-official original The file-based path is not automatic, though: .mcp.json loads only when the SDK’s settingSources includes "project". [Official] Use Claude Code features in the SDK · AnthropicT1-official original Connect to external tools with MCP · AnthropicT1-official original

For a reproducible, clean-room config there is a third lever: strictMcpConfig: true uses only the servers you pass in mcpServers, ignoring .mcp.json, user settings, and plugins. [Official] Connect to external tools with MCP · AnthropicT1-official original It is how you guarantee an SDK run sees exactly the servers you declared and nothing the machine happens to carry.

Concept ·

Programmatic config lives in your code — explicit, versioned with the application, invisible to the CLI; strictMcpConfig: true makes it the only source. .mcp.json lives in the repo — shareable, committed, and read by both the CLI and any SDK run that opts into the "project" source. Choosing between them is choosing who owns the wiring: the application, or the repository.

Three scopes — and claude mcp add --scope

Claude Code stores MCP servers in three scopes, each a different file with a different audience: Local (~/.claude.json, under a per-project key — current project only, not shared), Project (.mcp.json at the repo root — shared via version control, with a one-time approval prompt on first use), and User (~/.claude.json — available across all your projects). [Official] Connect Claude Code to tools via MCP · AnthropicT1-official original

The CLI is how you actually install one and pick its scope: claude mcp add <name> --scope <local|project|user> --transport <http|stdio|sse> … — for example, claude mcp add --transport http --scope project notion https://mcp.notion.com/mcp registers a project-scoped HTTP server. [Official] Connect Claude Code to tools via MCP · AnthropicT1-official original --scope is the flag that decides who sees the server; omit it and you get Local (the default).

When the same server name appears in more than one scope, Claude Code connects once, using the highest-precedence source: Local → Project → User → plugin-provided → claude.ai connectors (the first three match duplicates by name). [Official] Connect Claude Code to tools via MCP · AnthropicT1-official original

Key idea

Pick the scope by audience, and the file follows. Credential-bearing or experimental servers → Local (private to you, this project). Team-shared servers → Project (a committed .mcp.json). Cross-project personal servers → User. The first question is never “which file?” — it is “who should see this server?"

"Local scope” is not “local settings”

The single most confusing collision in MCP configuration is the word local. “MCP local-scoped servers are stored in ~/.claude.json (your home directory), while general local settings use .claude/settings.local.json (in the project directory).” [Official] Connect Claude Code to tools via MCP · AnthropicT1-official original They are different files in different directories — one in your home, one in the project — and they hold different things.

The two 'local' files

Editing .claude/settings.local.json to change a local-scoped MCP server is a silent no-op: that server lives in ~/.claude.json, not in the project’s local settings file. When a local-scoped server will not load, check the home-directory ~/.claude.json (under its per-project key), not .claude/settings.local.json. Memorize the two paths; the names will not help you.

Env-var expansion keeps secrets out of the file

Because a Project-scoped .mcp.json is committed to version control, secrets must never be written into it literally — they are referenced through env-var expansion instead. .mcp.json supports ${VAR} (expands, or fails the parse if unset) and ${VAR:-default} (expands, or uses the default), and the expansion works inside command, args, env, url, and headers. [Official] Connect Claude Code to tools via MCP · AnthropicT1-official original So a committed config carries "Authorization": "Bearer ${API_KEY}", and the key itself lives only in the environment.

[Tip]

One snag: ${CLAUDE_PROJECT_DIR} in a hand-written .mcp.json must use the default form, ${CLAUDE_PROJECT_DIR:-.}, because it is set in the server’s environment rather than Claude Code’s. Plugin-provided configs are the exception — they substitute it directly, no default needed.

Verify the server connected

A wired server that never connected is the silent failure of this chapter — the agent simply runs without the tools and you find out from a confusing answer. Don’t assume; check. Detect connection failures via the system:init message: each server’s status is one of connected | failed | needs-auth | pending | disabled — read it before letting the agent run. [Official] Connect to external tools with MCP · AnthropicT1-official original The default connection timeout is 60 seconds for server initialization, so a slow-starting server may need pre-warming or a lighter-weight package. [Official] Connect to external tools with MCP · AnthropicT1-official original

Wire a project server, then confirm it connected Worked example

Add a project-scoped server with the CLI, then gate the run on its status:

# the CLI sets scope (--scope) and transport (--transport); project => committed .mcp.json
claude mcp add --transport http --scope project notion https://mcp.notion.com/mcp
# Before letting the agent act, read system:init and refuse to run on a bad status:
async for message in query(prompt="…", options=options):
    if message.type == "system" and message.subtype == "init":
        bad = [s for s in message.data.get("mcp_servers", [])
               if s.get("status") != "connected"]
        if bad:
            raise RuntimeError(f"MCP servers not connected: {bad}")  # status: failed / needs-auth / pending / disabled

If notion comes back needs-auth, the OAuth flow hasn’t completed; failed usually means a missing env var, an uninstalled package, a bad connection string, or an unreachable host — and remember the 60-second init timeout. Checking status turns a silent “the tools just aren’t there” into an explicit, actionable failure.

Transports and the snapshot-dated wire protocol

A server’s type selects its transport: stdio for local processes, sse (Server-Sent Events, now deprecated — use HTTP), and http (Streamable HTTP; JSON configs accept streamable-http as an alias for http). A fourth type, ws (WebSocket), is configurable only through .mcp.json or claude mcp add-json, not the --transport flag (whose values are http/stdio/sse). [Official] Connect Claude Code to tools via MCP · AnthropicT1-official original Separately — and this is not a .mcp.json type — the SDK lets you run an MCP server in-process inside your application (an SDK deployment mode, e.g. a built-in tool server), rather than as an external process or endpoint. [Official] Connect to external tools with MCP · AnthropicT1-official original

Beneath the config sits the MCP wire protocol — and it is mid-revision, so cite it with a date. Under the 2025-11-25 specification, an initialize handshake “MUST be the first interaction between client and server,” negotiating protocol version and capabilities before any tool call. [Official] Lifecycle — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original The 2026-07-28 release candidate (locked May 2026; the final spec ships 2026-07-28) removes that handshake for a stateless model, so the wire details here are a dated snapshot, not a permanent contract. [Official] The 2026-07-28 MCP Specification Release Candidate · Model Context ProtocolT2-release-notes original

[Note]

The handshake is a 2025-11-25 protocol fact on a moving target — the cache flags a 2026-07-28 RC that eliminates the initialize handshake. Re-verify the wire protocol before relying on it; the configuration surface (scopes, files, env vars) is the more stable part of this chapter.

Practice

Exercise

Your team needs a postgres MCP server available to everyone who clones the repo, but the connection string must not be committed. (a) Which scope and file do you use, and what claude mcp add flag sets that scope? (b) Show the env field using the documented expansion syntax. (c) A teammate later adds the same server name in their personal ~/.claude.json. Which definition wins on their machine, and why?

Practice ◆◆◇◇

A colleague says: “I put my MCP server in local settings — in .claude/settings.local.json — but Claude Code can’t find it.” In one or two sentences, explain their mistake and name the file a local-scoped MCP server actually lives in.

Practice ◆◆◆◇

An agent “doesn’t seem to have” its MCP tools, but the config looks right and no error is thrown. Name the message and field you inspect to diagnose this, the status values that would explain it, and the default timeout that could be the cause.

Exercise solutions

Solution ↑ Exercise

(a) Project scope — a .mcp.json at the repo root (committed so every clone gets the server), installed with claude mcp add … --scope project. (b) Reference the secret through env-var expansion rather than inlining it, e.g. "env": { "DATABASE_URL": "${DATABASE_URL}" } (or "${DATABASE_URL:-…}" if a safe default exists) — expansion works in env, url, and headers, so no literal credential is committed. (c) The teammate’s definition wins on their machine. Their ~/.claude.json entry is Local scope, and precedence runs Local → Project → User, matched by name — so the local definition overrides the shared project one for them. That is the intended override path for personal credentials, not a conflict.

Solution ↑ Exercise

The mistake is conflating “MCP local scope” with “general local settings.” A local-scoped MCP server is stored in ~/.claude.json (the home directory, under a per-project key), not in .claude/settings.local.json (the project’s machine-local settings). Editing the latter to change an MCP server is a silent no-op — point them at ~/.claude.json.

Solution ↑ Exercise

Inspect the system:init message and its mcp_servers field — each server reports a status. A status other than connected explains the missing tools: failed (missing env var, uninstalled package, bad connection string, unreachable host), needs-auth (OAuth not completed), pending, or disabled. The 60-second default initialization timeout is a common cause of failed/pending for slow-starting servers — pre-warm or use a lighter package. Always read status before letting the agent run rather than discovering the gap from a wrong answer.

Exam essentials

  • Two config paths — programmatic (mcp_servers/mcpServers) or a .mcp.json at the project root (loads only when settingSources includes "project"). strictMcpConfig: true uses only mcpServers, ignoring .mcp.json/user/plugins.
  • claude mcp add … --scope <local|project|user> --transport <http|stdio|sse> installs a server and picks its scope; --scope defaults to Local.
  • Three scopes, three audiences — Local (~/.claude.json, per-project, private), Project (.mcp.json, committed/shared, approval-prompted), User (~/.claude.json, all projects). Precedence: Local → Project → User → plugin → claude.ai, matched by name.
  • “Local scope” ≠ “local settings” — a local-scoped MCP server lives in ~/.claude.json (home), not .claude/settings.local.json (project).
  • Env-var expansion — ${VAR} / ${VAR:-default} in command/args/env/url/headers; ${CLAUDE_PROJECT_DIR} needs the :- form in hand-written configs.
  • Verify the connection — read the system:init status (connected/failed/needs-auth/pending/disabled) before running; the default init timeout is 60s.
  • Config transports — stdio / sse (deprecated) / http (streamable-http alias), plus ws (WebSocket, .mcp.json-only); the SDK in-process server is a separate deployment mode, not a type. The 2025-11-25 initialize handshake is a dated snapshot the 2026-07-28 RC removes.
Part 2 Chapter 5 Last verified 2026-06-08 Fresh

Built-in Tools: The Roster, Execution Order, and Permission Gating

The fixed roster of built-in tools every agent ships with — Read, Write, Edit, Bash, Grep, Glob — their exact, case-sensitive names, the read-only-versus-state-modifying line that decides which run in parallel, the six permission modes, the five-step evaluation order, and the allow/deny rules that gate them. Closes on the allowlist-is-not-a-sandbox trap.

Volatility: feature-surface
Tools compared: claude-code
Before you start: Chapter D2.3 (allowedTools distributes a tool surface; tool_choice steers selection) and Chapter D1.3 (a subagent runs its own allowedTools). No prior Agent SDK operations assumed.
You will learn
  • Recognize the six core built-in tools and that their names are matched exactly
  • Distinguish which tools run concurrently from which run sequentially — and the one property that decides
  • Apply the six permission modes and the five-step evaluation order to gate those tools
  • Recognize the allowlist-is-not-a-sandbox trap: allowed_tools does not constrain bypassPermissions

D2.1 through D2.4 designed tools, their failure contracts, their distribution, and the wiring of external MCP servers. This chapter steps back to the tools an agent already has on the first turn — the fixed built-in roster — and the permission machinery that decides whether any given one actually fires. The exam angle is recognition: the exact roster, the read-versus-write execution split, the six modes, the evaluation order, and one high-value trap where a developer thinks they have locked an agent down and have not.

[Note]

This is a feature-surface chapter: the tool names, the mode list, and the rule syntax are concrete surfaces that move between releases. Treat every name and flag as a current snapshot and re-verify before relying on it; the principles (read/write execution, evaluation order) are the stable part.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. Name the six core built-in tools. Does allowed_tools=["read"] pre-approve the file reader?
  2. Which built-in tools may run concurrently, and what single property decides it?
  3. Name the six permission modes; which one restricts the agent to read-only tools?
  4. In what order does the SDK evaluate hooks, deny rules, the permission mode, allow rules, and canUseTool?
  5. You set allowed_tools=["Read"] and permission_mode="bypassPermissions". What can the agent actually do?
Check your answers
  1. Read, Write, Edit, Grep, Glob, Bash — names are matched exactly, so allowed_tools=["read"] pre-approves nothing (the tool is Read).
  2. Read-only tools (Read, Glob, Grep) may run concurrently; the deciding property is whether a tool is read-only or state-modifying — Edit, Write, and Bash run sequentially.
  3. default, acceptEdits, plan, dontAsk, bypassPermissions, auto (TypeScript-only) — plan restricts the agent to read-only tools.
  4. Hooks → Deny rules → Permission mode → Allow rules → canUseTool — deny rules and hooks sit above the mode, so they bind even under bypassPermissions.
  5. Every tool, including Bash, Write, and Edit — allowed_tools only pre-approves and never restricts; the allowlist is not a sandbox.

The built-in tool roster

Every agent starts with a fixed roster of built-in tools — the SDK ships roughly fourteen of them, in six categories, identical to those that power Claude Code. [Official] Agent SDK overview · AnthropicT1-official original How the agent loop works · AnthropicT1-official original The six that do the everyday work of reading and changing a codebase are Read, Write, Edit (file operations), Grep, Glob (search), and Bash (execution). The parity with Claude Code is explicit: “The SDK includes the same tools that power Claude Code,” [Official] How the agent loop works · AnthropicT1-official original and “Everything that makes Claude Code powerful is available in the SDK.” [Official] Agent SDK overview · AnthropicT1-official original Beyond this built-in set sit MCP server tools (Chapter D2.4) and your own custom tools; this chapter is about the built-ins every agent has from the start.

These names are exact — they appear verbatim in allowed_tools / allowedTools rules and as the tool_use.name block in messages, so Read is the tool and read is not. [Official] How the agent loop works · AnthropicT1-official original

Concept ·
  • Read — read any file in the working directory (text, with optional offset/limit).
  • Write — create a new file.
  • Edit — make a precise, in-place edit to an existing file.
  • Grep — search file contents with a regular expression.
  • Glob — find files by pattern (**/*.ts, src/**/*.py).
  • Bash — run terminal commands, scripts, and git operations.
Tool names are matched exactly

A rule’s tool name is compared literally against the tool’s registered name. allowed_tools=["read"] or ["BASH"] pre-approves nothing — the tools are Read and Bash. A mis-cased name fails silently: no error, just a tool that keeps falling through to a permission prompt as if it were never listed.

Read-only and state-modifying tools run differently

The roster splits along a line that the runtime cares about: whether a tool reads state or changes it. Read-only tools — Read, Glob, Grep, and MCP tools marked read-only — can run concurrently; tools that modify state — Edit, Write, and Bash — run sequentially to avoid conflicts. Custom tools default to sequential execution and opt into parallelism by setting readOnlyHint in their annotations. [Official] How the agent loop works · AnthropicT1-official original

Key idea

Parallelism here is a property of the tool’s nature, not a switch you flip per request. That makes it orthogonal to disable_parallel_tool_use from Chapter D2.3, which is a request-level cap on how many tools a single turn may invoke. One is about what a tool is (read vs write); the other is about how many calls you allow in a turn. They compose.

Gating the tools: six permission modes

Having a tool in the roster does not mean it fires. When the model requests a tool, the active permission mode is consulted, and there are six: default, acceptEdits, plan, dontAsk, bypassPermissions, and auto (TypeScript-only). [Official] Configure permissions · AnthropicT1-official original Two are worth memorizing for the exam because they change the tool surface directly: plan restricts the agent to read-only tools, so it explores and proposes a plan without editing source files; acceptEdits auto-approves file edits and filesystem commands (mkdir, touch, rm, rmdir, mv, cp, sed) — but only inside cwd plus additionalDirectories, and paths outside that scope, or protected paths, still prompt. [Official] Configure permissions · AnthropicT1-official original

Concept ·
ModeWhat it does to the tool surface
defaultNo auto-approvals; unmatched tools hit your canUseTool callback (no callback ⇒ deny).
acceptEditsAuto-approves edits + filesystem commands (mkdir/touch/rm/rmdir/mv/cp/sed) inside cwd/additionalDirectories; other Bash follows default rules.
planRead-only tools only; no edits to source files.
dontAskAnything not pre-approved by rules is denied; canUseTool is never called.
bypassPermissionsAll tools run without prompts — but deny rules, explicit ask rules, and hooks still apply. Cannot run as root on Unix.
auto (TS only)A model classifier approves or denies each call.

Allow and deny rules — and the five-step order

Within a mode, allow and deny rules pre-approve or block specific tools and calls. A bare name and a scoped pattern behave differently: allowed_tools=["Read", "Grep"] auto-approves those tools; disallowed_tools=["Bash"] removes Bash from the request entirely, so the model never sees it; and disallowed_tools=["Bash(rm *)"] keeps Bash available but denies any rm * call — in every mode, including bypassPermissions. [Official] Configure permissions · AnthropicT1-official original All of this resolves through a fixed sequence: “When Claude requests a tool, the SDK checks permissions in this order: 1. Hooks. 2. Deny rules. 3. Permission mode. 4. Allow rules. 5. canUseTool callback.” [Official] Configure permissions · AnthropicT1-official original

Why a deny rule beats bypassPermissions Worked example

An agent runs under permission_mode="bypassPermissions" (chosen to suppress prompts in a headless run) with a guardrail: disallowed_tools=["Bash(rm -rf *)"]. The model requests Bash(rm -rf /data). Walk the five-step order:

  1. Hooks — any PreToolUse hook runs first; suppose none match.
  2. Deny rules — Bash(rm -rf *) matches → denied, here, before the mode is ever consulted.
  3. (Permission mode — never reached for this call.)
  4. (Allow rules — never reached.)
  5. (canUseTool — never reached.)

The rm -rf is blocked even though the mode is bypassPermissions, because deny rules (step 2) and hooks (step 1) sit above the mode (step 3). Flip it around to see the trap: allowed_tools=["Read"] lives at step 4, below the mode — so under bypassPermissions the mode approves everything at step 3 and the allowlist is never consulted. Order is everything: forbid high (hooks/deny), permit low (allow rules).

The high-value trap lives in the gap between pre-approving and restricting. allowed_tools only pre-approves the tools you list; it does not filter everything else out. Set allowed_tools=["Read"] alongside permission_mode="bypassPermissions" and the agent “still approves every tool, including Bash, Write, and Edit.” [Official] Configure permissions · AnthropicT1-official original The allowlist was never a sandbox.

allowed_tools is an allowlist, not a sandbox

To expand what runs without prompting, list tools in allowed_tools. To restrict what an agent can do, that list is the wrong instrument — under bypassPermissions it is ignored entirely. Read-only confinement comes from plan mode (read-only tools only) or from a deny rule (disallowed_tools), which blocks even under bypassPermissions. Reach for the allowlist to permit; reach for the mode or a deny rule to forbid.

The day-to-day use of these tools — the muscle memory of Read, Edit, and Bash inside a working session — is the handbook’s territory (the Use book’s chapter on Claude Code’s toolset); this chapter is the architect’s exam angle on the roster and the permission surface that gates it.

Practice

Exercise

A developer wants a “read-only” agent for an automated audit. They configure it with allowed_tools=["Read"], reasoning that listing only Read confines the agent to reading. To suppress every approval prompt in the headless run, they also set permission_mode="bypassPermissions". Which tools can the agent actually use, and what is the one-line fix that would genuinely confine it?

  • A. Only Read; bypassPermissions still honors the allowlist, so the agent stays read-only as intended.
  • B. Read plus the filesystem commands (mkdir, rm, mv, cp) that bypassPermissions auto-approves.
  • C. Every tool, including Bash, Write, and Edit; allowed_tools only pre-approves and never restricts.
  • D. No tools; allowed_tools=["Read"] and bypassPermissions conflict, so the query raises an error.
Practice ◆◆◇◇

An agent is asked to read three configuration files and then apply one edit. Which of those four operations may the SDK run concurrently, which must run on its own, and what single property of a tool decides the answer?

Practice ◆◆◆◇

You need an agent that can explore a repository and propose changes but must not modify any source file — and you do not want to hand-maintain an allow/deny list to get there. (a) Which single permission mode gives you this? (b) Which mode would instead auto-approve edits inside the working directory while still prompting for paths outside it?

Exercise solutions

Solution ↑ Exercise

C. allowed_tools pre-approves the tools you list; it never restricts the ones you omit. Paired with bypassPermissions, the configuration “still approves every tool, including Bash, Write, and Edit” — the allowlist is silently irrelevant (it sits at step 4 of the evaluation order, below the mode at step 3). A is the core misconception (treating the allowlist as a filter). B confuses this with acceptEdits, which is the mode that auto-approves filesystem commands — bypassPermissions approves everything, not just filesystem ops. D invents a conflict; the two settings combine without error, which is exactly why the trap is dangerous. The fix: drop bypassPermissions and use permission_mode="plan" (read-only tools only), or keep a stricter mode and add a deny rule such as disallowed_tools=["Write", "Edit", "Bash"] — deny rules block even under bypassPermissions. Reach for the allowlist to permit, and for the mode or a deny rule to forbid.

Solution ↑ Exercise

The three Read calls may run concurrently; the Edit must run on its own (sequentially). The deciding property is whether a tool is read-only or state-modifying: read-only tools (Read, Glob, Grep) can run in parallel because they cannot conflict, while state-modifying tools (Edit, Write, Bash) run sequentially to avoid clobbering each other. So the runtime can fan out the three reads at once, then run the single edit after.

Solution ↑ Exercise

(a) plan mode — it restricts the agent to read-only tools, so it can explore and propose changes but cannot edit any source file, with no allow/deny list to maintain. (b) acceptEdits — it auto-approves file edits and filesystem commands (mkdir/rmdir/mv/…) inside cwd plus additionalDirectories, while paths outside that scope (and protected paths) still prompt. plan forbids edits entirely; acceptEdits permits them but only within the working scope.

Exam essentials

  • The roster is fixed and the names are exact — Read, Write, Edit, Grep, Glob, Bash are the six core built-ins (of ~14), identical to Claude Code’s; they appear verbatim in allow/deny rules and as tool_use.name, and a mis-cased name matches nothing.
  • Read vs write decides parallelism — read-only tools (Read/Glob/Grep) run concurrently; state-modifying tools (Edit/Write/Bash) run sequentially; custom tools default to sequential and opt in via readOnlyHint. Orthogonal to D2.3’s disable_parallel_tool_use.
  • Six permission modes — default, acceptEdits, plan, dontAsk, bypassPermissions, auto (TS). plan = read-only; acceptEdits = auto-approve edits + filesystem ops (mkdir/touch/rm/rmdir/mv/cp/sed) inside cwd/additionalDirectories, prompt outside.
  • Five-step evaluation order — Hooks → Deny rules → Permission mode → Allow rules → canUseTool. Deny rules and hooks fire before the mode, so they bind even under bypassPermissions; allow rules fire after it, so the mode can override them.
  • The allowlist trap — allowed_tools pre-approves, it does not restrict; with bypassPermissions it approves everything regardless. Confine with plan mode or a deny rule, never with the allowlist alone.

Part 2 · D2 Review

5 exercises across 5 chapters — interleaved review.

d2-01-tool-interfaces

  1. d2-01-ex-consolidate-vs-split A teammate ships three tools — `get_customer_by_id`, `list_customer_transactions`, and `list_customer_notes` — and reports that the agent often calls only the first and then stalls. You may redesign the interface. Propose a single consolidated tool (give its name, a one-sentence description, and the boundary it draws), and name the two interface principles your redesign applies.

d2-02-structured-errors

  1. d2-02-ex-channel-and-content Your booking tool (exposed over MCP) receives a departure date in the past. A colleague wants to return a JSON-RPC `-32602 Invalid params` error. (a) Which channel is correct, and why? (b) Write the one-sentence error content the tool should return. (c) Would `strict: true` have prevented this particular failure? Explain. (d) If the same tool were called directly over the Claude Messages API instead of MCP, what changes about the *spelling* of the failure flag?

d2-03-tool-choice-distribution

  1. d2-03-ex-mode-selection You are building an agent that (a) must produce a JSON classification by calling your `record_decision` tool on every turn with valid inputs, and (b) you also want extended thinking enabled so it reasons first. A colleague sets `tool_choice: {"type": "tool", "name": "record_decision"}`. What goes wrong? What configuration gets you the *schema-valid guaranteed* tool call, and what must you give up to also keep thinking?

d2-04-mcp-configuration

  1. d2-04-ex-scope-and-secret Your team needs a `postgres` MCP server available to everyone who clones the repo, but the connection string must not be committed. (a) Which scope and file do you use, and what `claude mcp add` flag sets that scope? (b) Show the `env` field using the documented expansion syntax. (c) A teammate later adds the *same* server name in their personal `~/.claude.json`. Which definition wins on their machine, and why?

d2-05-builtin-tools

  1. d2-05-ex-allowlist-vs-bypass A developer wants a "read-only" agent for an automated audit. They configure it with `allowed_tools=["Read"]`, reasoning that listing only `Read` confines the agent to reading. To suppress every approval prompt in the headless run, they also set `permission_mode="bypassPermissions"`. Which tools can the agent actually use, and what is the one-line fix that would genuinely confine it? - **A.** Only `Read`; `bypassPermissions` still honors the allowlist, so the agent stays read-only as intended. - **B.** `Read` plus the filesystem commands (`mkdir`, `rm`, `mv`, `cp`) that `bypassPermissions` auto-approves. - **C.** Every tool, including `Bash`, `Write`, and `Edit`; `allowed_tools` only pre-approves and never restricts. - **D.** No tools; `allowed_tools=["Read"]` and `bypassPermissions` conflict, so the query raises an error.
Part 3 Chapter 1 Last verified 2026-06-02 Fresh

CLAUDE.md Hierarchy & @import: Four Scopes That Concatenate

How Claude Code assembles persistent instructions from four CLAUDE.md scopes that concatenate without precedence — the opposite of the strict five-level settings hierarchy (Managed > CLI > Local > Project > User) — plus the @import mechanism (depth-5, first-use approval), the AGENTS.md bridge, and the managed claudeMd / claudeMdExcludes controls.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D2.4 (MCP scopes and their precedence; the local-scope-versus-local-settings trap). You have written at least one CLAUDE.md and know each session starts with a fresh context window.
You will learn
  • Distinguish the four CLAUDE.md scopes and the order they load in
  • Explain why CLAUDE.md files concatenate rather than override — and how that inverts the five-level settings precedence
  • Apply @import syntax, including its recursion limit and first-use approval behavior
  • Recognize the AGENTS.md bridge and the managed claudeMd / claudeMdExcludes enforcement controls

D2.4 resolved MCP servers and settings across a strict precedence where the highest scope wins. The instruction layer looks like the same machinery — files at managed, user, project, and local scopes — but it behaves in the opposite way: the files do not compete, they concatenate. This chapter is the exam angle on that distinction and on the @import mechanism that stitches files together. The design rationale for treating the file as a context budget lives in the Further reading.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. Name the four CLAUDE.md scopes and the order they load in.
  2. Two CLAUDE.md files in the chain say contradictory things. Which one “wins”?
  3. Name the five levels of the settings precedence, highest to lowest.
  4. What is the @import recursion depth, and what happens permanently if you decline the first-use prompt?
  5. Claude Code “ignores” a teammate’s AGENTS.md. Why, and what one line fixes it?
Check your answers
  1. Broadest to most specific: Managed policy → User → Project → Local (~/.claude/CLAUDE.md, ./CLAUDE.md or ./.claude/CLAUDE.md, ./CLAUDE.local.md).
  2. Neither — CLAUDE.md files concatenate rather than override; both contradictory instructions sit in context at once, a smell to fix at the source.
  3. Managed → CLI arguments → Local → Project → User — the highest scope wins (exception: permission allow/ask/deny rules merge across scopes).
  4. Maximum recursion depth is 5; declining the first-use approval dialog disables imports permanently — the dialog does not reappear.
  5. Claude Code reads CLAUDE.md, not AGENTS.md; the fix is a one-line bridge — a CLAUDE.md that imports @AGENTS.md.

Four scopes that load broadest-first

Claude Code assembles its persistent instructions from up to four CLAUDE.md scopes, loaded broadest to most specific: Managed policy, then User (~/.claude/CLAUDE.md), then Project (./CLAUDE.md or ./.claude/CLAUDE.md), then Local (./CLAUDE.local.md, which you add to .gitignore). [Official] How Claude remembers your project · AnthropicT1-official original Discovery walks up the directory tree from your working directory; a CLAUDE.md nested below cwd is not loaded at launch but on demand, when Claude first reads a file in that subdirectory. [Official] How Claude remembers your project · AnthropicT1-official original The managed-policy file is the one scope that cannot be excluded by any individual setting — which is exactly what makes it the instrument for org-enforced instructions. [Official] How Claude remembers your project · AnthropicT1-official original

Concept ·
  • Managed policy — a platform path (e.g. /etc/claude-code/CLAUDE.md) or the claudeMd key in managed-settings.json. Org-enforced; cannot be excluded.
  • User — ~/.claude/CLAUDE.md. Your personal defaults, across every project.
  • Project — ./CLAUDE.md or ./.claude/CLAUDE.md. Committed; shared with the team.
  • Local — ./CLAUDE.local.md. Gitignored; your overrides for this one project.
[Tip]

Exam discriminator: of the four scopes, only the managed-policy CLAUDE.md is immune to claudeMdExcludes. If a question asks which instructions a developer cannot opt out of, that is the one.

Concatenation, not precedence

Here is the property that separates the instruction layer from every settings file: the discovered CLAUDE.md files do not override one another. “All discovered files are concatenated into context rather than overriding each other. Across the directory tree, content is ordered from the filesystem root down to your working directory.” [Official] How Claude remembers your project · AnthropicT1-official original Within a single directory, CLAUDE.local.md is appended after CLAUDE.md. [Official] How Claude remembers your project · AnthropicT1-official original

Contrast settings, which resolve by a strict five-level precedence where the highest scope wins. Named in full, highest to lowest: [Official] Claude Code Settings · AnthropicT1-official original

Concept ·
  1. Managed (highest — cannot be overridden by anything)
  2. CLI arguments (--model, --permission-mode, … — session-only)
  3. Local (.claude/settings.local.json — gitignored)
  4. Project (.claude/settings.json — committed)
  5. User (~/.claude/settings.json — lowest)

When the same setting appears in several scopes, the highest one wins. (Exception: permission allow/ask/deny rules merge across scopes rather than override — D2.5’s evaluation order.)

So CLAUDE.md and settings sit at opposite ends: one accumulates, the other overrides — and the override ladder is five rungs, not three, with CLI and Local the two most often forgotten.

Key idea

There is no “winning” CLAUDE.md. Two files that say the same thing don’t conflict — both are in context at once. Settings override; instructions accumulate. A developer who reasons about CLAUDE.md the way they reason about settings.json will predict the wrong behavior every time.

Same conflict, two opposite resolutions Worked example

A user file and a project file each set the same thing. Watch the two layers diverge:

Instructions (CLAUDE.md) — concatenate. ~/.claude/CLAUDE.md says “prefer tabs”; ./CLAUDE.md says “prefer 2-space indent.” Result: both load into context at once (root-down order), and there is no rule about which wins — Claude sees two contradictory instructions. The contradiction is a smell to fix at the source, not something proximity resolves.

Settings (settings.json) — override. ~/.claude/settings.json sets "model": "opus"; .claude/settings.json sets "model": "sonnet". Result: Project wins over User (level 4 beats level 5), so the effective model is sonnet. Add --model haiku on the CLI and that wins (level 2), overriding both files.

Same shape of conflict, opposite outcomes: the instruction files merge and coexist; the settings resolve to exactly one value down the five-level ladder (Managed → CLI → Local → Project → User). Predict CLAUDE.md with the settings model and you will be wrong every time.

Expecting a deeper CLAUDE.md to override a shallower one

A subdirectory’s CLAUDE.md does not replace the project root’s — both load, simultaneously, into the same context. If the root says “use tabs” and a subdirectory says “use spaces,” Claude sees both instructions and no rule about which wins. To actually suppress an ancestor file you need claudeMdExcludes (below), not a closer file.

@import: stitching files together

A CLAUDE.md can pull in other files with @path/to/import. The imported files expand and load at launch alongside the referencing file; relative paths resolve relative to the file containing the import; and the import chain has a maximum recursion depth of 5. [Official] How Claude remembers your project · AnthropicT1-official original The first time a session encounters an import, Claude shows an approval dialog — and declining it disables imports permanently (the dialog does not reappear). [Official] How Claude remembers your project · AnthropicT1-official original

[Note]

What happens when an import chain exceeds depth 5 — silent truncation or an error — is not documented. Don’t design a CLAUDE.md layout that depends on a specific overflow behavior.

AGENTS.md, managed policy, and the budget you don’t develop here

Three controls round out the layer. First, the cross-tool bridge: Claude Code reads CLAUDE.md, not AGENTS.md — to share one instruction set with other agents, create a CLAUDE.md that imports @AGENTS.md. [Official] How Claude remembers your project · AnthropicT1-official original Second, managed settings can deploy instructions with no file at all: the claudeMd key carries inline CLAUDE.md content, honored only in managed/policy settings, and it loads before the user and project files. [Official] How Claude remembers your project · AnthropicT1-official original Claude Code Settings · AnthropicT1-official original Third, claudeMdExcludes — glob patterns matched against absolute paths, merged across layers — skips ancestor CLAUDE.md files, with the single exception that the managed-policy file can never be excluded. [Official] How Claude remembers your project · AnthropicT1-official original

Concept ·

Claude Code has exactly one instruction loader: the CLAUDE.md hierarchy. AGENTS.md does not load on its own — it participates only when a CLAUDE.md imports it. The bridge is a single line (@AGENTS.md), which is why the same file can serve Claude Code and other tools at once without duplicating its contents.

Practice

Exercise

A monorepo has /repo/CLAUDE.md (“use tabs”) and /repo/services/api/CLAUDE.md (“use 2-space indent”). A developer runs Claude Code from /repo/services/api/, and both files are discovered. Which statement describes what Claude Code actually loads, and why?

  • A. Only the services/api/CLAUDE.md loads — the closest file overrides its ancestors.
  • B. Only the root /repo/CLAUDE.md loads — the project-root file takes precedence.
  • C. Both load and concatenate (root first, then api); with no precedence, both instructions sit in context at once.
  • D. Both load, but api’s lines override the root’s, the way settings.local.json overrides user settings.
Practice ◆◆◇◇

A developer sets "model": "sonnet" in ~/.claude/settings.json and "model": "opus" in the project’s .claude/settings.json, then launches with --model haiku. Which model actually runs, and what is the full five-level precedence (highest to lowest) that decides it? Why would the same two-scope setup behave differently if it were two CLAUDE.md files instead?

Practice ◆◆◇◇

You add @./standards/api.md to a CLAUDE.md; api.md itself imports @../shared/base.md, which imports further still. (a) What is the documented maximum recursion depth for an @import chain? (b) The first time this import runs, what does Claude Code show you, and what is the lasting consequence of declining it?

Exercise solutions

Solution ↑ Exercise

C. CLAUDE.md files do not compete: “All discovered files are concatenated into context rather than overriding each other,” ordered “from the filesystem root down to your working directory.” So both “use tabs” and “use 2-space indent” are in context simultaneously — which is itself a smell, because contradictory ancestor instructions are not resolved by proximity. A and B both assume a precedence the instruction layer does not have. D is the high-value trap: it imports the settings model (where settings.local.json overrides user settings) onto memory, where files merge instead. To actually suppress the root file you would use claudeMdExcludes, not a closer CLAUDE.md.

Solution ↑ Exercise

haiku runs. The five-level settings precedence, highest to lowest, is Managed → CLI arguments → Local → Project → User; the --model haiku CLI argument (level 2) beats both the project opus (level 4) and the user sonnet (level 5). The same two-scope setup behaves differently for CLAUDE.md because the instruction layer concatenates instead of overriding — two CLAUDE.md files setting contradictory guidance would both load into context at once, with no “winner,” whereas settings resolve to exactly one value down the ladder.

Solution ↑ Exercise

(a) The maximum @import recursion depth is 5. (b) On first encountering an import, Claude Code shows an approval dialog; declining it disables imports permanently — the dialog does not reappear, so a future import in that environment silently will not expand until the choice is reset. Design import chains shallow (≤5) and approve deliberately.

Exam essentials

  • Four scopes, broadest-first — Managed → User → Project → Local; discovery walks up from cwd, nested files load on demand, and the managed file cannot be excluded.
  • Concatenate, don’t override — there is no precedence between CLAUDE.md files (root → cwd order; CLAUDE.local.md appended after CLAUDE.md). This is the opposite of the strict five-level settings precedence: Managed → CLI → Local → Project → User (CLI and Local are the forgotten two). Conflating the two is the single most common instruction-layer error.
  • @import — @path expands at launch, resolves relative to the importing file, caps at recursion depth 5, and prompts an approval dialog on first use (declining disables imports permanently).
  • AGENTS.md — Claude Code reads CLAUDE.md only; it picks up AGENTS.md solely through a @AGENTS.md import.
  • Managed controls — claudeMd deploys inline policy content (loads before user/project); claudeMdExcludes globs away ancestor files by absolute path — but never the managed-policy CLAUDE.md.

Further reading

The discipline behind what belongs in these files — treating the always-loaded CLAUDE.md as a permanent slice of the context budget rather than documentation, and the controlled study measuring the cost of overstuffing it — is developed in the Agentic Systems Design book, Chapter 4, The Instruction Layer: CLAUDE.md & AGENTS.md. Optional depth; this chapter stands on its own.

[Note]

Cross-book link is provisional — it points at the chapter source until the Agentic Systems Design book is deployed, then repoints to its published URL.

Part 3 Chapter 2 Last verified 2026-06-08 Fresh

Slash Commands & Skills: Stored Prompts, Lazy-Loaded Capabilities

Two ways to extend the workflow — a slash command (a stored prompt recognized at message start) and a skill (a lazy-loaded, auto-invocable, directory-bundled capability). The merged model, the full lazy-load lifecycle (description budget, $ARGUMENTS substitution, compaction carry-forward, live change detection), the SKILL.md frontmatter, the four scopes, and what disable-model-invocation does to the description.

Volatility: feature-surface
Tools compared: claude-code
Before you start: Chapter D3.1 (the always-on CLAUDE.md instruction layer) and D2.5 (tool names are matched exactly). No prior skill-authoring assumed.
You will learn
  • Distinguish a slash command from a skill — and recognize that commands have merged into skills
  • Trace the lazy-load lifecycle — description budget at startup, $ARGUMENTS at invocation, persistence and compaction carry-forward
  • Recognize the SKILL.md frontmatter fields and the four skill scopes
  • Apply the user-invocable / disable-model-invocation switches — and know that the latter keeps the description out of context

D3.1’s CLAUDE.md is always-on context. This chapter is the other half of the instruction layer: capabilities the agent loads only when it needs them. A slash command is a stored prompt you trigger by typing /name; a skill is a richer, directory-bundled capability that Claude can also reach for on its own. The exam-relevant facts are that the two have converged, and that a skill’s lazy-loading — and its budget — are what keep it cheap.

[Note]

This is a feature-surface chapter: the SKILL.md frontmatter fields, the scope paths, and the budget numbers are concrete surfaces that move between releases. Treat each field name and number as a current snapshot and re-verify before relying on it.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. Where does a slash command have to appear in your message, and what becomes of the text after it?
  2. True or false: a skill’s description is always loaded at session start.
  3. How do you pass and reference arguments inside a skill body?
  4. Roughly how much context does a skill cost at startup vs once invoked, and how long does the body persist?
  5. Name the four skill scopes in precedence order.
Check your answers
  1. At the start of the message — text that follows the command name is passed to it as arguments.
  2. False — a disable-model-invocation: true skill carries no description in context at all, and even ordinary descriptions can be dropped under budget pressure (least-invoked first).
  3. $ARGUMENTS expands to all arguments passed, and $ARGUMENTS[N] / $N pick a specific 0-indexed argument; if the body never references them, args are appended as ARGUMENTS: <value>.
  4. A ~100-token description at startup (against a budget defaulting to 1% of the context window); the full body loads on invocation and stays for the rest of the session, carried through compaction within a budget (≤5,000 tokens each, 25,000 combined).
  5. Enterprise > personal > project, plus plugin — namespaced as plugin-name:skill-name, so it never conflicts.

Commands and skills have merged

A slash command controls Claude Code from inside a session, and “a command is only recognized at the start of your message. Text that follows the command name is passed to it as arguments.” [Official] Commands · AnthropicT1-official original Alongside the many built-in commands, some entries are bundled skills: “they use the same mechanism as skills you write yourself: a prompt handed to Claude, which Claude can also invoke automatically when relevant. Everything else is a built-in command whose behavior is coded into the CLI.” [Official] Commands · AnthropicT1-official original

That same mechanism is why the two authoring formats have converged: “Custom commands have been merged into skills. A file at .claude/commands/deploy.md and a skill at .claude/skills/deploy/SKILL.md both create /deploy and work the same way. Your existing .claude/commands/ files keep working.” [Official] Extend Claude with skills · AnthropicT1-official original Old flat-file commands still run; skills are the recommended form for new work because they add directory bundling, frontmatter, and auto-invocation.

Key idea

A slash command and a skill are no longer two systems. A command is the legacy flat-file shape; a skill is the same idea with a directory, frontmatter, and the ability for Claude to invoke it unprompted. Author new work as a skill; reach for .claude/commands/ only to keep an existing one alive.

Skills are lazy-loaded directories

A skill is “a markdown directory: .claude/skills/<name>/SKILL.md,” with optional supporting files (reference.md, scripts/, and the like) alongside. [Official] Extend Claude with skills · AnthropicT1-official original The reason a skill is cheap is its loading model — unlike CLAUDE.md, which loads every session: “skills load on demand. The agent receives skill descriptions at startup and loads the full content when relevant.” [Official] Extend Claude with skills · AnthropicT1-official original Each description is roughly 100 tokens; the full body materializes only when the skill is invoked, “enters the conversation as a single message and stays there for the rest of the session,” and is not re-read on later turns. [Official] Extend Claude with skills · AnthropicT1-official original

Key idea

The description is the skill’s retrieval interface — for a model-invocable skill it is the part loaded at startup, and it is what Claude matches against to decide whether to auto-invoke. But “loaded at startup” is not unconditional: a disable-model-invocation: true skill carries no description in context at all (it loads only when the user types /name), and even ordinary descriptions can be dropped under budget pressure. A vague description is an undiscoverable skill — but so is one Claude was never given.

The lazy-load lifecycle in full

The “cheap” story has a budget and a lifecycle the exam can probe at each stage:

Concept ·
  • Startup (budget). Skill descriptions load into a context budget that defaults to 1% of the model’s context window (skillListingBudgetFraction; per-description cap maxSkillDescriptionChars, default 1,536). On overflow, the least-invoked skills’ descriptions are dropped first — so a rarely-used skill can become invisible. /doctor reports whether the budget is overflowing. [Official] Extend Claude with skills · AnthropicT1-official original
  • Invocation (arguments). Inside the body, $ARGUMENTS expands to all arguments passed (if you never reference it, the args are appended as ARGUMENTS: <value>), and $ARGUMENTS[N] / $N pick a specific 0-indexed argument ($0 first, $1 second), with shell-style quoting. [Official] Extend Claude with skills · AnthropicT1-official original
  • Persistence (compaction). Once loaded the body stays for the session, and compaction carries it forward within a budget — the first 5,000 tokens of each most-recently-invoked skill, up to a combined 25,000 tokens post-compaction. [Official] Extend Claude with skills · AnthropicT1-official original

One operational nicety: live change detection — adding, editing, or removing a skill under ~/.claude/skills/, the project’s .claude/skills/, or an --add-dir directory takes effect within the current session, no restart — the one exception being a brand-new top-level skills/ directory that did not exist at launch. [Official] Extend Claude with skills · AnthropicT1-official original

A deploy skill through its lifecycle Worked example
# .claude/skills/deploy/SKILL.md
---
name: deploy
description: Deploy the app to an environment. Use when asked to deploy, ship, or release a build.
argument-hint: "[environment]"
---
Deploy to $0 by running the runbook in scripts/deploy.sh, then verify the health check…

Trace it: at startup, only the ~100-token description loads (counted against the 1%-of-context budget) — the runbook body does not. On /deploy staging, $0 (and $ARGUMENTS) expand to staging, and the full body enters the conversation as one message. It then persists for the session; if a compaction fires, the body is carried forward (up to 5,000 tokens of it, within the 25,000-token combined skill budget). Flip disable-model-invocation: true and the calculus changes: the description is no longer in context at startup at all, so Claude can’t auto-invoke on “ship it” — only an explicit /deploy loads it.

SKILL.md frontmatter and the four scopes

The SKILL.md frontmatter is where a skill declares its behavior. Among the fields: name (display name in skill listings, defaults to the directory name — the directory name, not this field, sets the /command you type, except for a plugin-root SKILL.md), description (drives auto-invocation; description + when_to_use capped at 1,536 chars by default), argument-hint, disable-model-invocation, user-invocable, allowed-tools (CLI-only), model, effort, context, and paths (glob patterns that limit when the skill activates). [Official] Extend Claude with skills · AnthropicT1-official original

Skills resolve across four scopes by precedence — “enterprise > personal > project; plugin skills use a plugin-name:skill-name namespace and never conflict”: Enterprise (managed settings, all users), Personal (~/.claude/skills/, all your projects), Project (.claude/skills/, auto-discovered from cwd up to repo root), and Plugin (namespaced). [Official] Extend Claude with skills · AnthropicT1-official original

Concept ·
  • Enterprise — managed settings; applies to all users in the org (highest precedence).
  • Personal — ~/.claude/skills/<name>/SKILL.md; all your projects.
  • Project — .claude/skills/<name>/SKILL.md; this repo, discovered from cwd up to the root.
  • Plugin — <plugin>/skills/<name>/SKILL.md; invoked as /plugin-name:skill-name, namespaced so it never collides.
The description is not optional decoration

name defaults to the directory, so it is easy to skip description — but description is what Claude reads at startup to decide whether to auto-invoke. Omit it and the skill is still user-invocable by /name, yet effectively invisible to the model. If you want Claude to reach for a skill on its own, the description carries that entire decision.

Who can invoke it

Two frontmatter switches decide who may call a skill. user-invocable: false hides it from the / menu so only Claude can invoke it; disable-model-invocation: true does the inverse — only the user can trigger it via /, Claude cannot auto-invoke, its description is kept out of context, and it also blocks subagent preloading. [Official] Extend Claude with skills · AnthropicT1-official original Between them you can build a skill that is purely automatic, purely manual, or both.

Expecting Claude to auto-invoke a disable-model-invocation skill

disable-model-invocation: true is not just “hide from the menu” — it removes the skill’s description from startup context entirely, so the model never sees the skill exists. If you wanted Claude to reach for it automatically, this is the wrong switch; it makes the skill user-only. Use it precisely when a skill must never fire without a human typing /name.

Practice

Exercise

Your team has a 2,000-token deployment runbook you want Claude to follow whenever it deploys — ideally without anyone having to remember to paste it. Where should it live, and why?

  • A. In CLAUDE.md, so the runbook is always available to every session.
  • B. As a skill at .claude/skills/deploy/SKILL.md, so only its ~100-token description costs context until it is invoked.
  • C. As a slash command at .claude/commands/deploy.md, since that is the only way to get a /deploy command.
  • D. Pasted inline into each prompt at deploy time, so it is always fresh.
Practice ◆◆◆◇

A teammate marks a risky /force-release skill disable-model-invocation: true so a human must trigger it, then is surprised Claude “doesn’t even seem to know the skill exists” when asked to “just release it.” Explain what that flag did to the skill’s description, and why that is the intended behavior here.

Practice ◆◆◇◇

A skill’s full SKILL.md body is about 3,000 tokens. (a) Roughly how much of it enters context at session start, and against what budget? (b) When does the rest load, how long does it stay, and what happens to it if a compaction fires?

Exercise solutions

Solution ↑ Exercise

B. A skill lazy-loads: only its ~100-token description sits in context at startup, and the full 2,000-token body loads on invocation — and Claude can auto-invoke it when a deploy is relevant. A is the D3.1 budget mistake: a CLAUDE.md is loaded every session, so the whole 2,000 tokens would be spent on every unrelated turn. C is wrong on the “only way” claim — commands have merged into skills, so .claude/commands/deploy.md and .claude/skills/deploy/SKILL.md both create /deploy; the skill form is recommended. D defeats the point of a reusable capability. The skill is the form that is both cheap (lazy) and discoverable (auto-invocable).

Solution ↑ Exercise

disable-model-invocation: true keeps the skill’s description out of startup context entirely — so unlike an ordinary skill (whose ~100-token description Claude sees and can match against), the model is never told this skill exists. That is exactly the point for a risky release action: it forces a human to type /force-release, because Claude cannot auto-invoke something it cannot see. The teammate got the intended safety behavior; “Claude doesn’t know it exists” is the feature, not a bug. (If they wanted Claude to know about it but still gate execution, that is a permission/ask concern, not this flag.)

Solution ↑ Exercise

(a) Only the ~100-token description enters context at session start, counted against the skill-listing budget (default 1% of the model’s context window); the ~3,000-token body does not load yet. (b) The body loads on invocation, enters as a single message, and stays for the rest of the session (it is not re-read each turn). If a compaction fires, the body is carried forward within a budget — up to the first 5,000 tokens of each most-recently-invoked skill, capped at a combined 25,000 tokens post-compaction.

Exam essentials

  • Commands merged into skills — .claude/commands/x.md and .claude/skills/x/SKILL.md both create /x; old commands keep working, skills are recommended (directory bundling, frontmatter, auto-invocation). A command is recognized only at message start; text after it is arguments.
  • Skills lazy-load on a budget — a ~100-token description at session start (budget defaults to 1% of the context window; on overflow least-invoked descriptions drop first; /doctor shows it), the full body on invocation; the body persists and compaction carries it forward (≤5,000 tokens each, 25,000 combined).
  • Arguments — $ARGUMENTS (all args), $ARGUMENTS[N] / $N (0-indexed) substitute into the body; absent references append ARGUMENTS: <value>.
  • The description is the retrieval interface — but not unconditional — disable-model-invocation: true keeps the description out of context (user-only via /), and budget overflow can drop one. user-invocable: false = Claude-only.
  • Four scopes — enterprise > personal > project; plugin skills are namespaced (plugin-name:skill-name) and never conflict. Live change detection: editing a skill takes effect mid-session without restart (except a brand-new top-level skills/ dir).

Further reading

The deeper craft — how to write a description that gets discovered, when to extract a skill out of CLAUDE.md, and the progressive-disclosure design behind all of this — is developed in the Agentic Systems Design book, Chapter 5, Skills & Progressive Disclosure. Optional depth; this chapter stands on its own.

[Note]

Cross-book link is provisional — it points at the chapter source until the Agentic Systems Design book is deployed, then repoints to its published URL.

Part 3 Chapter 3 Last verified 2026-06-08 Fresh

Path-Scoped Rules: Modular, Glob-Triggered Instructions

The .claude/rules/ system — a modular, eager-loaded instruction layer parallel to CLAUDE.md, with optional glob path-scoping so a rule loads only when Claude reads matching files. Covers unconditional vs path-scoped rules, user-level vs project rules and their load order, the glob format, and the directory and symlink mechanics.

Volatility: feature-surface
Tools compared: claude-code
Before you start: Chapter D3.1 (the CLAUDE.md scopes and the concatenation model) and D3.2 (skills also use a paths glob). You have written at least one CLAUDE.md.
You will learn
  • Distinguish .claude/rules/ from CLAUDE.md — a parallel system, not a subsystem
  • Apply the paths frontmatter to scope a rule to files matching a glob
  • Explain when a path-scoped rule actually triggers — on file-read, not at launch or every tool use
  • Distinguish user-level (~/.claude/rules/) from project rules, and the load order between them

D3.1 covered the always-on CLAUDE.md; D3.2 covered the lazy, invocable skill. Rules are the third shape of the instruction layer: modular markdown files that load eagerly like CLAUDE.md, but can be glob-scoped so a rule only enters context when Claude touches the files it governs. The architect’s job here is to know that rules are a separate, equal-priority system, that user and project rules have a load order, and to use path-scoping to keep file-specific guidance out of unrelated work.

[Note]

This is a feature-surface chapter: the .claude/rules/ path, the paths frontmatter field, and the glob syntax are concrete surfaces that move between releases. Treat them as a current snapshot and re-verify before relying on them.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. Is .claude/rules/ a subsystem of CLAUDE.md, or a parallel one? At what priority do un-scoped rules load?
  2. When does a paths-scoped rule actually enter context?
  3. Where do user-level rules live, and do they win over project rules or lose to them?
  4. Write a paths glob for every .ts and .tsx file in the repo.
  5. When should a standing instruction be a paths-scoped rule rather than a line in CLAUDE.md?
Check your answers
  1. A parallel system, not a subsystem — rules without paths load at launch at the same priority as .claude/CLAUDE.md.
  2. When Claude reads a file matching the glob — not at launch, and not on every tool use.
  3. In ~/.claude/rules/, applying to every project; they load before project rules, so project rules win (read last, higher priority).
  4. **/*.{ts,tsx} — ** matches any directory depth, and brace expansion covers both extensions in one pattern.
  5. When the guidance is real but only relevant to part of the tree — path-scoping keeps file-specific instructions out of context on unrelated work.

A system parallel to CLAUDE.md

.claude/rules/*.md is a modular rules system, loaded into context every session, with recursive discovery across subdirectories. [Official] How Claude remembers your project · AnthropicT1-official original The relationship to CLAUDE.md is the fact to fix first: “Rules without paths frontmatter are loaded at launch with the same priority as .claude/CLAUDE.md.” [Official] How Claude remembers your project · AnthropicT1-official original A rule is not nested under CLAUDE.md or overridden by it — the two are parallel instruction sources discovered separately and loaded at equal priority.

Key idea

Rules sit beside CLAUDE.md, not under it. An unconditional rule and your .claude/CLAUDE.md both land in context at the same priority. The reason to reach for a rule instead of more CLAUDE.md lines is modularity and scoping — and, as the next section shows, rule scopes carry a load order just as CLAUDE.md scopes do.

User-level vs project rules — there is a load order

The “no precedence” story needs one refinement the exam can probe: rules come in scopes, and the scopes load in order. User-level rules live in ~/.claude/rules/ and apply to every project on your machine — use them for preferences that aren’t project-specific (your personal style, your workflows). [Official] How Claude remembers your project · AnthropicT1-official original Project rules live in the repo’s .claude/rules/. And the order between them is documented: “user-level rules are loaded before project rules, giving project rules higher priority.” [Official] How Claude remembers your project · AnthropicT1-official original

That is the same recency model as the CLAUDE.md hierarchy (D3.1): files concatenate, but the more-specific scope is read last and so effectively dominates when two instructions tension. So “concatenate, not override” is true within a scope; across scopes there is a load order — user → project, project higher — exactly mirroring CLAUDE.md’s broad-to-specific assembly.

Concept ·
  • User-level — ~/.claude/rules/*.md. Applies to every project; for machine-wide personal preferences. Loaded first (lower priority).
  • Project — .claude/rules/*.md in the repo. Shared with the team; for this project. Loaded after user rules → higher priority.
Key idea

Rules concatenate, but their scopes have an order: user-level loads before project, so project rules win. It is the CLAUDE.md model again — assemble broad-to-specific, most-specific read last. A personal ~/.claude/rules/ preference is a default a project rule can override by being read after it.

Path-scoping with the paths frontmatter

The lever that makes rules more than “another CLAUDE.md” is the optional paths field. Give a rule a paths glob and it stops loading unconditionally: “path-scoped rules trigger when Claude reads files matching the pattern, not on every tool use.” [Official] How Claude remembers your project · AnthropicT1-official original So a rule scoped to src/api/**/*.ts costs nothing until Claude actually reads an API file — and then applies while that work is in scope.

---
paths:
  - "src/api/**/*.ts"
---

# API Development Rules
- All API endpoints must include input validation

The glob format is the same one skills use for their paths field: [Official] How Claude remembers your project · AnthropicT1-official original

Concept ·
  • **/*.ts — all TypeScript files in any directory
  • src/**/* — everything under src/
  • *.md — Markdown in the project root only
  • src/components/*.tsx — React components in one directory
  • src/**/*.{ts,tsx} — brace expansion across multiple extensions
A path-scoped rule is not loaded at launch

Unconditional rules (no paths) load at session start; path-scoped rules do not. They activate when Claude first reads a matching file — so a rule scoped to src/api/** has no effect during work that never touches src/api/. If a rule must always apply, leave paths off; if it should bind only to certain files, scope it and accept that it is silent until one of those files is read.

What's in context, and when Worked example

A developer has, across two scopes:

~/.claude/rules/preferences.md         # user-level, no paths  → personal default
.claude/rules/code-style.md            # project, no paths     → always on, beats user default
.claude/rules/backend/api.md           # project, paths: src/api/**/*.ts

At session start, in context: preferences.md (loaded first) and code-style.md (loaded after → higher priority). api.md is not loaded — it is path-scoped. If preferences.md says “prefer 4-space indent” and code-style.md says “2-space,” the project rule was read last, so it dominates. The moment Claude reads src/api/orders.ts, api.md activates and “all API endpoints must include input validation” enters context — and only then. Work that never touches src/api/ never pays for that rule. Two levers at once: scope order (user → project) decides who wins; path-scoping decides who is even present.

Layout and symlinks

Rules can mix unconditional and path-scoped files in one tree, and discovery recurses into subdirectories:

.claude/
└── rules/
    ├── code-style.md      # no paths: loaded unconditionally
    ├── security.md        # no paths: loaded unconditionally
    ├── frontend/
    │   └── react.md       # paths: src/frontend/**/*.tsx
    └── backend/
        └── api.md         # paths: src/api/**/*.ts

The unconditional files (code-style.md, security.md) are always in context; the nested path-scoped files wait for a matching file-read. [Official] How Claude remembers your project · AnthropicT1-official original Rules can also be shared from a central directory by symlink: “symlinks work in .claude/rules/ — link shared rules from a central dir; circular symlinks are detected gracefully.” [Official] How Claude remembers your project · AnthropicT1-official original

Choosing the shape: rule vs CLAUDE.md vs skill

The three instruction shapes now in view divide cleanly by when they load and what they carry:

Concept ·
  • CLAUDE.md (D3.1) — always-on context, concatenated across scopes. For broadly-applicable rules you can’t infer from code.
  • .claude/rules/ (this chapter) — modular instruction files at the same priority; optionally paths-scoped so file-specific guidance loads only when relevant; user vs project scopes with project winning.
  • Skills (D3.2) — lazy capabilities; only a description until invoked. For invocable workflows, not standing rules.

The practical rule of thumb: reach for a paths-scoped rule when guidance is real but only relevant to part of the tree — it is the one shape that lets you write file-specific instructions without paying for them on every unrelated turn.

Practice

Exercise

You have a one-line standard — “all API endpoints must validate their input” — that is only relevant when someone edits files under src/api/. You want it in front of Claude during that work but not cluttering context the rest of the time. How should you author it?

  • A. As a line in CLAUDE.md, so it is always loaded and never missed.
  • B. As an unconditional rule .claude/rules/api.md with no paths, so it loads every session.
  • C. As a path-scoped rule .claude/rules/api.md with paths: ["src/api/**/*.ts"], so it loads only when Claude reads an API file.
  • D. As a skill at .claude/skills/api/SKILL.md, so Claude can invoke it when needed.
Practice ◆◆◇◇

Write the paths frontmatter value that scopes a rule to every TypeScript and TSX file anywhere in the repository, using a single pattern.

Practice ◆◆◆◇

You keep ~/.claude/rules/preferences.md (“prefer 4-space indent”) on your machine, and a teammate’s repo ships .claude/rules/code-style.md (“2-space indent”). When you work in that repo, which indentation does Claude favor, and why — in terms of where each rule lives and the order they load?

Exercise solutions

Solution ↑ Exercise

C. A path-scoped rule is exactly this case: with paths: ["src/api/**/*.ts"] the standard loads only when Claude reads a matching API file, and stays out of context otherwise. A and B both load the line unconditionally — CLAUDE.md and an un-scoped rule sit in context at the same priority every session, which is the clutter you wanted to avoid. D misuses a skill: skills are invocable capabilities/workflows, not standing rules that should bind automatically while editing a file. The lever that matters is paths, and only a rule (or a skill) offers it — so the rule is the right shape for a standing, file-scoped instruction.

Solution ↑ Exercise

src/**/*.{ts,tsx} — or, to cover the whole repo regardless of directory, **/*.{ts,tsx}. A single paths entry with brace expansion {ts,tsx} matches both extensions; ** matches any directory depth. (You can also list two patterns, but the brace form does it in one.)

Solution ↑ Exercise

Claude favors 2-space indent — the project rule wins. Your preferences.md is a user-level rule (~/.claude/rules/, applies to every project); the repo’s code-style.md is a project rule. The documented order is that user-level rules load before project rules, giving project rules higher priority — so when both are in context and they tension, the project rule (read last) dominates. Your personal preference acts as a default that any project can override for its own repo.

Exam essentials

  • Parallel to CLAUDE.md — .claude/rules/*.md loads every session with recursive subdir discovery; rules without paths load unconditionally at the same priority as .claude/CLAUDE.md (not a subsystem of it).
  • Scopes have a load order — user-level (~/.claude/rules/, all projects) loads before project (.claude/rules/), giving project rules higher priority. Rules concatenate, but the more-specific scope is read last and wins (the CLAUDE.md model).
  • Path-scoping — a paths glob makes a rule trigger only when Claude reads matching files, not at launch and not on every tool use. Glob format same as skill paths (**/*.ts, src/**/*, {ts,tsx}).
  • Mechanics — unconditional and path-scoped rules can mix in one tree; symlinks work, with circular symlinks detected gracefully.
  • Choosing the shape — CLAUDE.md (always-on), rules (modular, optionally glob-gated, user/project order), skills (lazy capability). Reach for a paths-scoped rule for guidance that is real but only relevant to part of the tree.
Part 3 Chapter 4 Last verified 2026-06-08 Fresh

Plan Mode vs Direct Execution: Research Before You Edit

Plan mode restricts Claude to read-only research and a written proposal — no edits — and approving the plan exits the mode into a write mode. Choosing plan versus going direct is a risk-containment decision, not a named-mode toggle; "direct execution" is simply working in a write mode. The opusplan alias pairs the mode with a model-per-phase split — Opus plans, Sonnet executes.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D2.5 (the six permission modes and how they gate tools). No prior plan-mode use assumed.
You will learn
  • Explain what plan mode restricts and how it differs from the write (execution) modes
  • Recognize how to enter and exit plan mode — and that approving a plan ends it
  • Evaluate when to plan first versus go direct, by reversal cost and uncertainty
  • Apply the opusplan alias to spend the strong model on planning and a fast one on execution

D2.5 enumerated the six permission modes as a tool-gating mechanism. This chapter zooms in on the one an architect chooses on purpose: plan. Plan mode is read-only research with a written proposal at the end — and the real exam question is not “what is plan mode” but “when do you plan first versus edit directly.” That is a risk-containment decision, and there is no named “direct execution” mode — going direct just means working in a mode that lets edits through. The chapter closes on opusplan, the alias that makes the strong-model-plans/fast-model-executes split automatic.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. Does plan mode suppress permission prompts? What exactly can Claude do in it, and what can it not?
  2. The instant you approve a plan, what happens to the session’s permission mode?
  3. Name three distinct ways to enter plan mode.
  4. What does the opusplan alias do — and does its plan phase get the automatic 1M-context upgrade?
  5. A two-line fix in code you know well: plan first or go direct?
Check your answers
  1. No — permission prompts still apply the same as default mode; Claude can read files, run shell commands to explore, and write a plan, but it does not edit your source.
  2. Approving a plan exits plan mode — the session switches to the write mode the chosen approve option names, and Claude starts editing.
  3. Any three of: cycle with Shift+Tab, prefix a prompt with /plan, start with claude --permission-mode plan, or set permissions.defaultMode: "plan" in settings.
  4. It runs Opus in plan mode and switches to Sonnet for execution — and no: the automatic 1M-context upgrade is opus-only, so opusplan’s plan phase runs at the standard 200K window.
  5. Go direct — a small diff in familiar code is the going-direct profile; plan first when reversal cost is high or the design space is unclear.

What plan mode restricts

Plan mode is the one mode whose purpose is to change nothing: “Plan mode tells Claude to research and propose changes without making them. Claude reads files, runs shell commands to explore, and writes a plan, but does not edit your source. Permission prompts still apply the same as default mode.” [Official] Choose a permission mode · AnthropicT1-official original It is read-only — best for exploring a codebase before changing it. [Official] Choose a permission mode · AnthropicT1-official original

Its contrast is not a single “direct” mode but the write modes from D2.5 — default (reads auto-approved, edits prompt) and acceptEdits (edits and common filesystem commands auto-approved) — the modes that let changes through. [Official] Configure permissions · AnthropicT1-official original “Going direct” is shorthand for working in one of those, not a toggle of its own.

Key idea

Plan mode is the only mode defined by what it won’t do. Every other mode is a point on the convenience-versus-caution axis for approving edits; plan mode steps off that axis entirely — research and propose, edit nothing — until you approve.

Entering and exiting plan mode

You can enter plan mode four ways: cycle with Shift+Tab (the CLI cycle runs default → acceptEdits → plan), prefix a single prompt with /plan (optionally with a task, /plan fix the auth bug), start the session with claude --permission-mode plan, or set permissions.defaultMode: "plan" in settings. [Official] Choose a permission mode · AnthropicT1-official original

Exit is the part that trips people up: “Approving a plan exits plan mode and switches the session to the permission mode each approve option describes, so Claude starts editing. To plan again, cycle back to plan mode with Shift+Tab, or prefix your next prompt with /plan.” [Official] Choose a permission mode · AnthropicT1-official original When the plan is ready Claude presents it, and each approve option (auto, accept-edits, review-each-edit) names the write mode the session lands in.

Approving the plan ends the read-only guarantee

Plan mode protects you only up to the moment you approve. The approve options aren’t “yes, looks good” — each one is a choice of which write mode to drop into, and the next step can edit immediately. If you want to keep researching, choose “keep planning”; if you approve, you have left read-only behind. Treat the approval screen as the real decision point, not a formality.

The decision: plan first, or go direct?

Because plan mode adds a research-and-review round before any edit, the choice between it and going direct is a bet on reversal cost and uncertainty:

Concept ·
  • Plan first when a wrong turn is expensive to undo or the design space is unclear: an unfamiliar codebase, a change spanning many files, a risky or irreversible operation, or anything where you want to see the blast radius before a single edit lands.
  • Go direct when the change is small and the code is familiar: a one- or two-line diff, an iterative tweak you can eyeball, a tight feedback loop where re-running is cheap.
Key idea

Plan mode converts an unbounded reversal cost into a bounded one. Go direct on a multi-file rename in an unfamiliar repo and you discover missed call sites as broken edits you now have to walk back. Plan first and the same misunderstanding surfaces in a proposal you reject for free — the mistake is caught before it touches the tree.

Model per phase: the opusplan alias

Plan mode separates thinking from doing in time — research first, edits after. That split is exactly where a model-per-phase pays, and Claude Code ships an alias for it. opusplan is one of the eight model aliases, and it “uses Opus in plan mode, switches to Sonnet for execution.” [Official] Model configuration · AnthropicT1-official original Spelled out: “The opusplan model alias provides an automated hybrid approach: In plan mode — Uses opus for complex reasoning and architecture decisions. In execution mode — Automatically switches to sonnet for code generation and implementation.” [Official] Model configuration · AnthropicT1-official original

Set it like any alias — /model opusplan during a session, or claude --model opusplan at startup. [Official] Model configuration · AnthropicT1-official original The reasoning-heavy plan runs on the strong model; the moment execution begins, the fast model takes over. You spend the expensive tokens where the leverage is — on the design — and the cheaper model on the mechanical edits.

opusplan's plan phase does NOT get the 1M-context upgrade

There is one trap worth memorizing: the automatic 1M-context upgrade applies to the opus alias only, not to opusplan — its Opus plan phase runs with the standard 200K window. [Official] Model configuration · AnthropicT1-official original So if your planning step genuinely needs to hold more than 200K of context at once, opusplan is not the alias that gives it to you; reach for opus[1m] (or pin a 1M model) for that phase instead.

One approval, two switches: opusplan through a plan→execute cycle Worked example

A developer starts a session for a multi-file refactor with claude --model opusplan and presses Shift+Tab until the mode reads plan. Two independent settings are now in play: the permission mode is plan (read-only) and the model resolves to Opus (the plan-phase half of opusplan).

  1. Research phase. Claude reads files and runs shell exploration — all permitted, none of it edits — and the heavy reasoning runs on Opus. It writes a proposed change set.
  2. The approval. The developer picks the “accept edits” approve option. One action flips two switches at once: the permission mode leaves plan for acceptEdits (edits now auto-approve), and opusplan leaves its plan phase, so the model switches to Sonnet for execution.
  3. Execution phase. Claude applies the edits on Sonnet under acceptEdits — fast model, mechanical work.

The lesson the exam tests: approval is not a rubber stamp. It simultaneously ends the read-only guarantee (D2.5’s mode axis) and, under opusplan, hands the work to a different model. To plan again — and return to Opus — you must re-enter plan mode (Shift+Tab or /plan).

Where plan mode fits the workflow

Plan mode is the front of the iterative loop — the “explore and plan” phase before “implement and commit” (the rhythm developed in D3.5). The hands-on mechanics — how the approval screen looks, how to drive the loop turn by turn — are the handbook’s territory rather than this book’s: see the Use book, Chapter 3, Your First Working Session.

[Note]

Cross-book link is provisional — it points at the chapter source until the handbook is deployed, then repoints to its published URL.

Practice

Exercise

You are asked to rename a widely-used helper function across an unfamiliar ~40-file service, and you want Claude to carry it out. Which approach best contains the risk, and why?

  • A. Start in acceptEdits and let Claude rename call sites as it finds them — the fastest path to a green tree.
  • B. Start in plan mode so Claude maps every call site and proposes the complete change set before editing; approve once it is complete.
  • C. Use bypassPermissions so nothing interrupts the multi-file edit.
  • D. Work in default mode and approve each edit as it appears, catching mistakes one prompt at a time.
Practice ◆◆◇◇

While in plan mode, which kinds of actions can Claude take, and which can it not? Name what still happens that people sometimes assume plan mode suppresses.

Practice ◆◆◆◇

You are in plan mode, Claude presents a plan, and you approve it by choosing the “accept edits” option. (a) What permission mode is the session in now? (b) What must you do to get back to planning for the next change?

Practice ◆◆◆◇

You set claude --model opusplan and work through a plan-then-execute cycle. (a) Which model runs the planning phase, and which runs execution? (b) Your planning step needs to hold ~400K tokens of context at once — does opusplan give you that? If not, what would?

Exercise solutions

Solution ↑ Exercise

B. The change is unfamiliar, multi-file, and expensive to get wrong — exactly the profile where planning first pays. In plan mode Claude maps the full set of call sites and proposes the complete edit before touching anything, so a missed reference shows up in a proposal you can reject, not as breakage you have to walk back. A and D both start editing before the scope is known: you discover missed call sites as broken edits (A) or as a long stream of one-at-a-time approvals with no view of the whole (D). C removes the safety entirely — and approving plans is the opposite move from skipping permission checks. Plan mode here converts an unbounded reversal cost into a bounded one.

Solution ↑ Exercise

In plan mode Claude can read files and run shell commands to explore, and it writes a plan — but it does not edit your source. The thing people wrongly assume it suppresses is permission prompts: they “still apply the same as default mode,” so plan mode is not a quiet, prompt-free sandbox — it is a no-edit mode that still gates the actions it does allow. Read-only is the guarantee; silence is not.

Solution ↑ Exercise

(a) The session is now in acceptEdits — approving a plan exits plan mode and switches the session to the write mode the chosen approve option names, and Claude starts editing. The read-only guarantee is gone. (b) To plan again you must re-enter plan mode: cycle back with Shift+Tab, or prefix your next prompt with /plan. Approval is one-way; getting back to research is a deliberate re-entry, not an undo.

Solution ↑ Exercise

(a) Under opusplan, the planning phase runs on Opus (complex reasoning and architecture) and execution runs on Sonnet (code generation) — the switch is automatic at the plan→execute boundary. (b) No — opusplan does not give you a 1M planning window. The automatic 1M-context upgrade applies to the opus alias only, not opusplan, whose Opus plan phase runs at the standard 200K. For a ~400K planning context you would use opus[1m] (or pin a 1M model) for that phase instead.

Exam essentials

  • Plan mode = read-only research — Claude reads, explores via shell, and writes a plan, but does not edit source; permission prompts still apply the same as default mode (it is no-edit, not prompt-free).
  • “Direct execution” is not a mode — it is working in a write mode (default/acceptEdits); plan’s contrast is the modes that let edits through.
  • Enter — Shift+Tab cycle (default → acceptEdits → plan), /plan [task], --permission-mode plan, or defaultMode in settings.
  • Exit — approving a plan exits plan mode into the chosen write mode and Claude starts editing; to plan again, cycle back with Shift+Tab or prefix /plan.
  • opusplan — alias that runs Opus in plan mode, Sonnet in execution; set via /model opusplan or --model opusplan. Approval flips both the permission mode and the model. Its plan phase runs at 200K — the 1M upgrade is opus-only, not opusplan.
  • The decision — plan when reversal cost is high or the design space is uncertain (unfamiliar / multi-file / risky); go direct for small diffs in known code. Plan contains misunderstandings before edits.
Part 3 Chapter 5 Last verified 2026-06-02 Fresh

Iterative Refinement: The Loop, the Interview, and Test-Driven Prompting

Agentic work is iterative. The explore-plan-implement-commit rhythm, the interview pattern (Claude interviews you, writes a spec, a fresh session implements), and test-driven prompting are the durable disciplines — methodology that survives any tool rename, hence a stable principle.

Volatility: stable-principle
Tools compared: claude-code
Before you start: Chapter D3.4 (plan mode as the read-only 'plan' phase). You have run at least one multi-step task with Claude.
You will learn
  • Apply the four-phase explore → plan → implement → commit rhythm
  • Recognize the interview pattern — and why a fresh session implements the spec
  • Evaluate when to give Claude verification criteria, and why that is the highest-leverage move
  • Recognize the course-correct-early discipline — when to /clear and restart rather than keep correcting

D3.4’s plan mode is one phase of a larger loop. This chapter is the loop itself — how an architect drives a task from understanding to a committed change, refining as they go. These are disciplines, not features: the explore-plan-implement-commit rhythm, the interview pattern, and test-driven prompting outlast any particular keybinding or tool name, which is why this chapter is a stable principle rather than a feature surface.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. Name the four phases of the recommended workflow, in order. When is it safe to skip the plan phase?
  2. In the interview pattern, who asks the questions, which tool drives them, and what artifact ends the interview?
  3. Why does a fresh session implement the spec rather than the one that ran the interview?
  4. What is described as “the single highest-leverage thing you can do,” and what does withholding it cost?
  5. You have corrected Claude three times on the same issue this session. What is the recommended move, and why?
Check your answers
  1. Explore → Plan → Implement → Commit; skip the plan phase only when the diff is one-sentence-describable.
  2. Claude asks the questions, driven by the AskUserQuestion tool, and the interview ends by writing a complete spec to a file (e.g. SPEC.md).
  3. The interview session is full of question-answer thrash; the fresh session starts on clean context whose only input is the written spec.
  4. A verification criterion — tests, screenshots, or expected outputs so Claude can check itself; without one it “might produce something that looks right but actually doesn’t work,” and you become the only feedback loop.
  5. Past two corrections on the same issue the context is cluttered with failed approaches — run /clear and start fresh with a more specific prompt that incorporates what you learned.

The four-phase rhythm

Letting Claude jump straight to coding produces code that solves the wrong problem; the antidote is a phased loop. “The recommended workflow has four phases: Explore [plan mode, read files] → Plan [create detailed implementation plan] → Implement [switch out of plan mode, verify against plan] → Commit [descriptive message + PR].” [Official] Best practices for Claude Code · AnthropicT1-official original Explore and Plan are the read-only front half (D3.4’s plan mode is exactly this); Implement and Commit are where edits land. Skip the plan phase only when the diff is one-sentence-describable. [Official] Best practices for Claude Code · AnthropicT1-official original

Key idea

The loop’s whole purpose is to separate understanding from editing. Explore and Plan build a shared model of the change before a single line moves; Implement then checks its work against that plan. Collapse the two and Claude optimizes a problem it never confirmed it understood.

The interview pattern

For a large feature with an unclear design space, the most effective opening is to invert the usual flow and let Claude drive the questions: “For larger features, have Claude interview you first. Start with a minimal prompt and ask Claude to interview you using the AskUserQuestion tool. Claude asks about things you might not have considered yet, including technical implementation, UI/UX, edge cases, and tradeoffs.” [Official] Best practices for Claude Code · AnthropicT1-official original The interview ends by writing a complete spec to a file, and then a fresh session implements from that spec — the interview session is full of question-answer thrashing, while the implementation session starts with clean context whose only input is the written spec.

Concept ·
  1. Brief description — enough to anchor what the project is.
  2. Interview directive — explicitly ask Claude to use the AskUserQuestion tool.
  3. Question scope — technical implementation, UI/UX, edge cases, concerns, tradeoffs.
  4. Quality filter — “don’t ask obvious questions, dig into the hard parts I might not have considered.”
  5. Termination + artifact — “keep interviewing until we’ve covered everything, then write a complete spec to SPEC.md.”

The spec file is the bridge between the two sessions: a deliverable you can review and the context bootstrap the implementation session reads.

The interview pattern, end to end Worked example

You need a rate limiter for an API and the design space is fuzzy. Instead of writing a full spec you do not yet have, you run the interview pattern:

  1. Minimal prompt. “I need a rate limiter for our API. Interview me with AskUserQuestion before writing any code — dig into the hard parts I might not have considered, then write a complete spec to SPEC.md.”
  2. Claude interviews you. It asks what you would not have front-loaded: Algorithm — token bucket (allows bursts) or sliding window (smoother)? Scope — per-API-key, per-IP, or global? On exceed — reject with 429, or queue and delay? Storage — in-process, or Redis so it holds across instances? Failure mode — if the limiter’s backing store is down, fail open or fail closed?
  3. You answer, and the hard tradeoffs surface before any code — the fail-open/closed question alone is one most one-shot prompts never raise.
  4. Artifact. Claude writes SPEC.md: token bucket, per-key, 429 on exceed, Redis-backed, fail-closed. You read and correct it — it is a reviewable deliverable, not buried in chat.
  5. Fresh session implements. You /clear (or open a new session) and prompt: “Implement the rate limiter per SPEC.md; write the tests first.” The implementation session starts on clean context whose only input is the reviewed spec — none of the interview’s question-answer thrash.

Why the split matters: the interview session’s context is full of half-formed options and back-and-forth; the implementation session should not inherit that noise. The spec file is the clean hand-off — review gate and context bootstrap in one.

Give Claude a way to verify its work

The highest-return habit in the whole loop is supplying a success criterion: “Include tests, screenshots, or expected outputs so Claude can check itself. This is the single highest-leverage thing you can do. … Without clear success criteria, it might produce something that looks right but actually doesn’t work.” [Official] Best practices for Claude Code · AnthropicT1-official original This is also what turns a vague prompt concrete — “the build is failing” becomes “the build fails with this error: [paste]; fix it, verify the build succeeds, and address the root cause.” Test-driven prompting is the same instinct formalized: for a bug, ask for “a failing test that reproduces the issue, then fix it”; for longer work, have Claude write the tests first and keep them as the persistent contract. [Official] Best practices for Claude Code · AnthropicT1-official original

Key idea

A verification criterion moves the feedback loop off the human. Give Claude a test to pass or a screenshot to match and it can iterate against ground truth on its own turns; withhold one and you become the only thing standing between a plausible-looking result and a working one.

The trust-then-verify gap

A plausible implementation with no way to check it is the most expensive kind of output: it passes a glance and fails in production. If the prompt that produced it had no test, no expected output, and no screenshot to diff, you have not finished the task — you have deferred the verification to whoever notices the bug later. Bake the success criterion into the prompt, not into a future incident.

Course-correct early — and know when to restart

Iteration only pays if the loop stays clean. “The best results come from tight feedback loops” — correct Claude quickly rather than letting a wrong direction run. But there is a threshold: “If you’ve corrected Claude more than twice on the same issue in one session, the context is cluttered with failed approaches. Run /clear and start fresh with a more specific prompt that incorporates what you learned. A clean session with a better prompt almost always outperforms a long session with accumulated corrections.” [Official] Best practices for Claude Code · AnthropicT1-official original

Correcting over and over

Past the second correction on the same point, each further nudge is fighting a context now polluted with failed attempts — the very approaches you are trying to steer away from are still in the window, pulling the model back toward them. The move is not a third correction; it is /clear plus a better initial prompt that folds in what the failed rounds taught you.

The hands-on mechanics of this loop — the session rhythm turn by turn, and the dedicated treatments of the interview pattern and the testing discipline — are the handbook’s territory: see the Use book, Chapter 3, Your First Working Session, with the interview-pattern and testing-and-verification chapters forthcoming there.

[Note]

Cross-book link is provisional — it points at the chapter source until the handbook is deployed, then repoints to its published URL.

Practice

Exercise

You are kicking off a sizeable new feature — a billing system — whose design space you have not fully thought through (proration rules, failure modes, and edge cases are still fuzzy). What is the most effective way to start Claude on it?

  • A. Write one detailed prompt with every requirement and verification criterion you can think of, for maximum autonomy.
  • B. Use the interview pattern: a minimal prompt asking Claude to interview you with AskUserQuestion, ending in a written SPEC.md, then a fresh session implements from it.
  • C. Give a vague one-liner and course-correct turn by turn as the design reveals itself.
  • D. Use plan mode alone — let Claude explore the codebase and propose an approach without interviewing you.
Practice ◆◆◇◇

Name the four phases of the recommended workflow in order, and state the one condition under which you would skip the plan phase.

Practice ◆◆◆◇

For a bug that has a known reproduction, give the test-driven prompt pattern the docs recommend, and explain why it outperforms simply asking Claude to “fix the login bug.”

Exercise solutions

Solution ↑ Exercise

B. A large feature with an unsettled design space is the documented home of the interview pattern: a minimal prompt asks Claude to interview you via AskUserQuestion, surfacing edge cases and tradeoffs you have not considered, and the interview ends in a SPEC.md that a fresh session then implements from clean context. A assumes you already hold a complete spec — but the premise is that you do not, so you would be encoding gaps as requirements. C burns turns thrashing and pollutes context with half-formed direction. D is the closest miss: plan mode (D3.4) makes Claude explore the code, but the interview pattern makes Claude interrogate you about intent and tradeoffs — and here the unknowns live in your requirements, not in the codebase. The interview elicits the spec; plan mode would plan against a spec you have not written yet.

Solution ↑ Exercise

The four phases in order are Explore → Plan → Implement → Commit — explore (read-only / plan mode) builds understanding, plan writes a detailed implementation plan, implement switches out of plan mode and verifies against that plan, and commit writes a descriptive message and PR. You may skip the plan phase only when the diff is one-sentence-describable — a change small and clear enough that there is nothing for a plan to de-risk.

Solution ↑ Exercise

The pattern is: “Write a failing test that reproduces the issue, then fix it” — make Claude first encode the bug as a test that fails, then make that test pass. It outperforms “fix the login bug” for two reasons. First, the failing test is an unambiguous success criterion: Claude can iterate against ground truth on its own turns instead of waiting for you to judge each attempt — the highest-leverage move in the loop. Second, the test persists as a regression contract: it stays green afterward, so the same bug cannot silently return. “Fix the login bug” gives Claude no way to check itself and leaves nothing behind to prove the fix held.

Exam essentials

  • Four-phase rhythm — Explore (read-only / plan mode) → Plan → Implement (verify against plan) → Commit; skip the plan only when the diff is one-sentence-describable.
  • Interview pattern — for large features, Claude interviews you via AskUserQuestion, writes a SPEC.md, and a fresh session implements from it (clean context).
  • Verification is the single highest-leverage move — include tests, screenshots, or expected outputs so Claude checks itself; without criteria, the human is the only feedback loop.
  • Concrete beats vague — replace “the build is failing” with the error plus “address the root cause”; the delta is verb + concrete example/test/file + a verification step.
  • Course-correct early, then restart — tight loops beat drift, but after two corrections on the same issue, /clear and rewrite the prompt incorporating what you learned.
Part 3 Chapter 6 Last verified 2026-06-08 Fresh

CI/CD Integration: Headless Runs, Output Formats, and GitHub Actions

Running Claude Code in CI — the headless `claude -p` entry point and `--bare` for reproducibility, the three output formats, schema-validated structured output via `--json-schema`, the permission flags that lock down a run with no human to prompt, and the GitHub Actions wrapper with its credential model.

Volatility: feature-surface
Tools compared: claude-code
Before you start: Chapter D2.5 (allowed-tools and the permission modes) and D3.4 (dontAsk as the locked-down mode). No prior CI setup assumed.
You will learn
  • Apply the headless invocation (-p) and --bare for reproducible runs
  • Distinguish the three output formats — text / json / stream-json — and when each fits
  • Apply --json-schema for schema-validated structured output
  • Recognize the permission flags that lock down a run with no human to prompt
  • Explain how a CI step gates on a run’s exit code (0 success / non-zero failure)
  • Recognize the GitHub Actions wrapper and its credential model

D2.5 and D3.4 governed permission inside an interactive session. This chapter takes the same agent out of the terminal and into a pipeline. The mechanics change — there is no one at the keyboard to approve a tool or answer a question — so a headless run has to decide its output shape and its permission surface up front. The payoff is that Claude Code becomes a scriptable, gated CI citizen.

[Note]

This is a feature-surface chapter: CLI flag names, output event types, GitHub Actions input fields, and exit semantics are concrete surfaces that move between releases. Treat every flag and field as a current snapshot and re-verify before relying on it in production CI.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. What does --bare skip, and why does omitting it make a headless run non-reproducible?
  2. Which --output-format gives you total_cost_usd and session_id as parseable fields?
  3. With no human at the keyboard, which permission mode is the documented floor for a locked-down CI run, and what does it deny?
  4. What is a CI step actually gating on when it passes or fails a claude -p run? Name two conditions that make the run exit non-zero.
  5. In GitHub Actions v1.0, how do you pass --bare or --allowedTools through to the underlying claude -p?
Check your answers
  1. --bare skips auto-discovery of hooks, skills, plugins, MCP servers, auto memory, and CLAUDE.md; without it the run loads whatever the host machine has, so the same command behaves differently on different runners.
  2. --output-format json — a single payload with result, session_id, and total_cost_usd (plus a per-model cost breakdown).
  3. --permission-mode dontAsk — it denies anything not in permissions.allow or the read-only command set.
  4. The run’s process exit code (0 passes, non-zero fails); hitting --max-turns and piping stdin over the 10 MB cap both exit non-zero.
  5. Through the claude_args passthrough input — v1.0 keeps prompt for instructions and routes all CLI flags via claude_args.

The headless invocation

The entry point for everything in this chapter is one flag: “claude -p "<query>" is the canonical non-interactive invocation; the CLI exits after responding. All standard CLI options work with -p.” [Official] Run Claude Code programmatically · AnthropicT1-official original That single command runs the full agent loop and returns — no prompt, no session UI.

For CI you almost always pair it with --bare: “Add --bare to reduce startup time by skipping auto-discovery of hooks, skills, plugins, MCP servers, auto memory, and CLAUDE.md. Without it, claude -p loads the same context an interactive session would, including anything configured in the working directory or ~/.claude. Bare mode is useful for CI and scripts where you need the same result on every machine.” [Official] Run Claude Code programmatically · AnthropicT1-official original It is the recommended mode for scripted calls and is slated to become the default for -p in a future release. [Official] Run Claude Code programmatically · AnthropicT1-official original

Without --bare, headless runs are not reproducible

A bare claude -p loads whatever the machine happens to have — the repo’s CLAUDE.md, a developer’s personal ~/.claude config, locally-configured MCP servers. So the same command can produce different results on a laptop and a CI runner. If a pipeline must behave identically everywhere, --bare is not optional; it is what makes the run a function of its inputs rather than of the host.

Output formats

A headless run can emit one of three shapes, selected with --output-format: text (default), json (a single payload with result, session_id, and total_cost_usd), and stream-json (newline-delimited events). [Official] Run Claude Code programmatically · AnthropicT1-official original The json form is what makes a run scriptable: “With --output-format json, the response payload includes total_cost_usd and a per-model cost breakdown, so scripted callers can track spend per invocation without consulting the usage dashboard.” [Official] Run Claude Code programmatically · AnthropicT1-official original

Concept ·
  • text (default) — just the final response. For piping into another tool or a human-read log.
  • json — one payload with result, session_id, total_cost_usd, and a per-model cost breakdown. For scripts that parse a field (e.g. capture session_id to resume).
  • stream-json — newline-delimited events as they happen. Typically the first, system/init, reports the model, tools, MCP servers, and a plugin_errors field you can check to fail CI when a plugin did not load (with CLAUDE_CODE_SYNC_PLUGIN_INSTALL set, plugin_install events precede it).

Structured output with --json-schema

When a downstream step needs a specific shape rather than prose, constrain the output to a schema: “To get output conforming to a specific schema, use --output-format json with --json-schema and a JSON Schema definition. The response includes metadata about the request (session ID, usage, etc.) with the structured output in the structured_output field.” [Official] Run Claude Code programmatically · AnthropicT1-official original The flag is --json-schema '<schema>'; the CLI reference describes it as producing “validated JSON output matching a JSON Schema after agent completes its workflow (print mode only).” [Official] CLI reference · AnthropicT1-official original

A minimal schema looks like this (illustration only — the shape is yours to define):

claude -p "Classify this PR's risk" \
  --output-format json \
  --json-schema '{"type":"object","properties":{"severity":{"type":"string"}},"required":["severity"]}'

The schema-conforming result then arrives in structured_output, alongside the usual session_id and usage metadata.

--json-schema is print-mode only

--json-schema is documented as print-mode-only — it works under -p, not in an interactive session. The same is true of --max-turns, --max-budget-usd, and --no-session-persistence. Reaching for any of these is a signal you are (correctly) in headless territory; expecting them to apply to a normal terminal session is a category error.

Permission gates for a run with no human

The defining constraint of CI is that no one is there to approve a tool call, so the permission surface must be settled before the run starts. The locked-down mode from D3.4 is built for exactly this: --permission-mode dontAsk “denies anything not in permissions.allow or the read-only command set,” which the docs call out as useful for locked-down CI runs. [Official] Run Claude Code programmatically · AnthropicT1-official original Pair it with an allowlist: --allowedTools "Bash(git diff *),Read" auto-approves specific tools and supports prefix matching. [Official] Run Claude Code programmatically · AnthropicT1-official original

Two more knobs bound a run: --max-turns N “limits agentic turns and exits with error when reached,” and --max-budget-usd <N> caps dollar spend — both print-mode-only. [Official] CLI reference · AnthropicT1-official original At the far end, --permission-mode bypassPermissions (alias --dangerously-skip-permissions) skips prompts entirely [Official] Run Claude Code programmatically · AnthropicT1-official original — appropriate only inside an isolated container, never as a convenience.

Key idea

Interactive permission is a runtime negotiation; CI permission is a design-time contract. With no human to fall back on, an unguarded tool request does not pause for approval — it is denied or it derails the job. So you decide the exact tool surface up front: dontAsk for the floor, --allowedTools for the precise grants, --max-turns / --max-budget-usd for the ceiling.

Exit codes: what CI actually gates on

Output format decides what a run prints; the exit code decides whether the pipeline step passes. A CI step’s pass/fail is the exit status of the process it ran — so a headless claude -p that runs the full loop and “exits after responding” [Official] Run Claude Code programmatically · AnthropicT1-official original hands its exit code straight to the runner, and 0 means the step succeeds while non-zero fails it. The docs name concrete non-zero triggers you can rely on: --max-turns N “limits agentic turns and exits with error when reached,” [Official] CLI reference · AnthropicT1-official original and an over-cap stdin (10 MB) “returns a clear error and non-zero exit.” [Official] Run Claude Code programmatically · AnthropicT1-official original

The same mechanism gives you a clean pre-flight gate: claude auth status “exits 0 if logged in, 1 if not — useful as a CI gate before the agent step.” [Official] Run Claude Code programmatically · AnthropicT1-official original Run it first and the job fails fast with a clear cause instead of burning a turn on an unauthenticated agent call.

A step that ignores the exit code can pass a failed run

If your CI step only captures stdout and never checks the exit status ($?), a run that hit --max-turns or errored out can still look green — the next step consumes truncated or empty output as if it were a real result. The exit code is the success contract in headless mode; let the shell propagate it (don’t swallow it with || true or a trailing pipe that masks the status) so a failed agent run fails the pipeline.

A read-only review gate that fails correctly Worked example

Goal: a CI job that lets Claude read the repo and run tests, blocks on nothing (no human), bounds cost, and fails the pipeline if the run fails. A shell step:

# 1. Pre-flight: fail fast if the runner isn't authenticated.
claude auth status --text || { echo "Claude not authenticated"; exit 1; }

# 2. The gated run. Exit code propagates to the step.
claude -p "Review the staged diff for regressions; run the test suite." \
  --bare \
  --permission-mode dontAsk \
  --allowedTools "Read,Bash(git diff *),Bash(npm test)" \
  --max-turns 12 \
  --output-format json > result.json
# No `|| true` — if claude exits non-zero, the step (and the job) fails here.

Trace each guard: auth status (step 1) turns a missing credential into an immediate, legible failure — exit 1 — rather than a confusing agent error. --bare makes the run a function of its inputs, not the runner’s stray config. dontAsk + the tight --allowedTools fix the permission surface up front, so no tool request can stall waiting for an approval that will never come. --max-turns 12 is the ceiling: if the agent thrashes past twelve turns it exits with error, and because the step does not mask the status, the job goes red. On success the agent exits 0, result.json holds the parseable payload, and a later step can read total_cost_usd to log spend. The job’s green/red is the agent’s exit code — exactly the contract CI needs.

GitHub Actions and the credential model

The managed CI surface wraps all of the above. Claude Code GitHub Actions is “built on top of the Claude Agent SDK” [Official] Claude Code GitHub Actions · AnthropicT1-official original and wraps claude -p in a GitHub Action runner. Beyond the direct Anthropic API it supports two cloud providers — Amazon Bedrock (use_bedrock) and Google Vertex AI (use_vertex) — each authenticated through GitHub OIDC / Workload Identity Federation, so no static cloud keys are stored. [Official] Claude Code GitHub Actions · AnthropicT1-official original The v1.0 interface is deliberately small: “mode is auto-detected; use prompt for all instructions and claude_args for any CLI passthrough” [Official] Claude Code GitHub Actions · AnthropicT1-official original — so everything from earlier sections (--bare, --output-format, --allowedTools) reaches the runner through claude_args.

Never hardcode a key in a workflow file

Workflow YAML is committed to the repo, so a key written into it is a key published to everyone with read access. Supply the Anthropic key as a GitHub secret, and prefer the OIDC / Workload Identity Federation path for Bedrock and Vertex — that route uses no static keys at all. The same ${VAR}-not-literal discipline from D2.4’s .mcp.json applies to CI configuration.

Practice

Exercise

A CI job should let Claude read the repository and run the test suite — and nothing else: no edits, no pushes, and no hanging on a prompt, because there is no human to answer one. Which invocation best fits?

  • A. claude -p "..." --dangerously-skip-permissions, so the run never blocks on a permission check.
  • B. claude -p "..." --permission-mode dontAsk --allowedTools "Read,Bash(npm test)" --bare.
  • C. claude -p "..." --permission-mode acceptEdits, so it can write any test artifacts it needs.
  • D. claude -p "..." in the default mode, letting it prompt for approval if it needs more access.
Practice ◆◆◇◇

A CI script needs to capture the session ID and the total cost of a headless run so a later step can resume it and log spend. Which --output-format do you use, and which fields do you read?

Practice ◆◆◆◇

Two engineers run the identical claude -p "review this diff" command in two CI pipelines and get noticeably different behavior. Which single flag most likely removes the divergence, and what is the underlying cause it addresses?

Practice ◆◆◆◇

You want a CI job to fail when a headless Claude run does not complete cleanly. (a) What does the job actually gate on to decide pass/fail? (b) Name two conditions the docs say make a run exit non-zero. (c) What common shell mistake would cause a failed run to be reported as green?

Exercise solutions

Solution ↑ Exercise

B. The job has no human to approve anything, so the permission surface must be fixed up front: dontAsk denies anything not pre-approved, --allowedTools "Read,Bash(npm test)" grants exactly the read and test-run capability (prefix matching scopes Bash to the test command), and --bare makes the run reproducible across machines. A does the opposite of locking down — --dangerously-skip-permissions approves everything, including edits and pushes. C auto-approves file edits, but the job is supposed to be read-only. D is the classic headless trap: in CI there is no one to answer the approval prompt, so a tool that falls through to default mode stalls or is denied rather than helpfully pausing. The locked-down combination is dontAsk + a tight allowlist + --bare.

Solution ↑ Exercise

Use --output-format json and read the session_id and total_cost_usd fields. The json form returns a single payload with result, session_id, and total_cost_usd (plus a per-model cost breakdown), so the later step can resume with the captured session_id and log spend from total_cost_usd — no usage-dashboard round-trip. The default text format returns only the final response (nothing parseable), and stream-json would make you reassemble the fields from an event stream.

Solution ↑ Exercise

The flag is --bare, and the cause it addresses is non-reproducible context discovery. Without --bare, claude -p auto-discovers and loads whatever the machine has — the repo’s CLAUDE.md, a developer’s personal ~/.claude config, locally-configured MCP servers, hooks, skills, plugins, auto memory — so the same command becomes a function of the host rather than of its inputs, and two runners diverge. --bare skips that discovery, making the run reproducible across machines (and it is slated to become the -p default).

Solution ↑ Exercise

(a) The job gates on the run’s process exit code: claude -p exits after responding and hands its exit status to the runner — 0 passes the step, non-zero fails it. (b) Two documented non-zero conditions: hitting --max-turns (“exits with error when reached”) and an over-cap stdin (piped input above the 10 MB limit “returns a clear error and non-zero exit”); claude auth status exiting 1 when not logged in is a third, useful as a pre-flight gate. (c) Swallowing the status — e.g. ending the command with || true or masking it behind a pipe — so the shell reports success even though the agent run failed; let the exit code propagate.

Exam essentials

  • Headless entry — claude -p "<query>" runs non-interactively and exits; add --bare to skip discovery (hooks/skills/MCP/CLAUDE.md) for reproducible CI. --bare is slated to become the -p default.
  • Output formats — text (default), json (result, session_id, total_cost_usd, per-model cost), stream-json (newline-delimited events; system/init carries plugin_errors to fail CI).
  • Structured output — --output-format json with --json-schema adds a validated structured_output field; --json-schema is print-mode only (as are --max-turns, --max-budget-usd).
  • Permission gates — CI has no human to prompt, so decide the surface up front: dontAsk (deny anything not pre-approved) + --allowedTools (prefix matching, e.g. Bash(git diff *)); --max-turns / --max-budget-usd cap a run; bypassPermissions / --dangerously-skip-permissions skips all checks — containers only.
  • Exit codes are the CI contract — a step passes/fails on the run’s exit status (0 success, non-zero failure). --max-turns reached and over-cap stdin (10 MB) exit non-zero; claude auth status exits 0/1 as a pre-flight gate. Don’t mask the status (|| true) — a failed run would report green.
  • GitHub Actions — wraps claude -p; v1.0 auto-detects mode (prompt + claude_args); the Anthropic API plus Bedrock + Vertex (the two cloud providers via OIDC, no static keys); supply credentials as secrets, never hardcoded.

Part 3 · D3 Review

6 exercises across 6 chapters — interleaved review.

d3-01-claude-md-hierarchy

  1. d3-01-ex-scope-concat A monorepo has `/repo/CLAUDE.md` ("use tabs") and `/repo/services/api/CLAUDE.md` ("use 2-space indent"). A developer runs Claude Code from `/repo/services/api/`, and both files are discovered. Which statement describes what Claude Code actually loads, and why? - **A.** Only the `services/api/CLAUDE.md` loads — the closest file overrides its ancestors. - **B.** Only the root `/repo/CLAUDE.md` loads — the project-root file takes precedence. - **C.** Both load and concatenate (root first, then `api`); with no precedence, both instructions sit in context at once. - **D.** Both load, but `api`'s lines override the root's, the way `settings.local.json` overrides user settings.

d3-02-slash-commands-skills

  1. d3-02-ex-command-vs-skill Your team has a 2,000-token deployment runbook you want Claude to follow whenever it deploys — ideally without anyone having to remember to paste it. Where should it live, and why? - **A.** In `CLAUDE.md`, so the runbook is always available to every session. - **B.** As a skill at `.claude/skills/deploy/SKILL.md`, so only its ~100-token description costs context until it is invoked. - **C.** As a slash command at `.claude/commands/deploy.md`, since that is the only way to get a `/deploy` command. - **D.** Pasted inline into each prompt at deploy time, so it is always fresh.

d3-03-rules-path-scoping

  1. d3-03-ex-scoped-vs-unconditional You have a one-line standard — "all API endpoints must validate their input" — that is only relevant when someone edits files under `src/api/`. You want it in front of Claude during that work but not cluttering context the rest of the time. How should you author it? - **A.** As a line in `CLAUDE.md`, so it is always loaded and never missed. - **B.** As an unconditional rule `.claude/rules/api.md` with no `paths`, so it loads every session. - **C.** As a path-scoped rule `.claude/rules/api.md` with `paths: ["src/api/**/*.ts"]`, so it loads only when Claude reads an API file. - **D.** As a skill at `.claude/skills/api/SKILL.md`, so Claude can invoke it when needed.

d3-04-plan-mode

  1. d3-04-ex-plan-or-direct You are asked to rename a widely-used helper function across an unfamiliar ~40-file service, and you want Claude to carry it out. Which approach best contains the risk, and why? - **A.** Start in `acceptEdits` and let Claude rename call sites as it finds them — the fastest path to a green tree. - **B.** Start in plan mode so Claude maps every call site and proposes the complete change set before editing; approve once it is complete. - **C.** Use `bypassPermissions` so nothing interrupts the multi-file edit. - **D.** Work in `default` mode and approve each edit as it appears, catching mistakes one prompt at a time.

d3-05-iterative-refinement

  1. d3-05-ex-interview-vs-direct You are kicking off a sizeable new feature — a billing system — whose design space you have not fully thought through (proration rules, failure modes, and edge cases are still fuzzy). What is the most effective way to start Claude on it? - **A.** Write one detailed prompt with every requirement and verification criterion you can think of, for maximum autonomy. - **B.** Use the interview pattern: a minimal prompt asking Claude to interview you with `AskUserQuestion`, ending in a written `SPEC.md`, then a fresh session implements from it. - **C.** Give a vague one-liner and course-correct turn by turn as the design reveals itself. - **D.** Use plan mode alone — let Claude explore the codebase and propose an approach without interviewing you.

d3-06-cicd-integration

  1. d3-06-ex-locked-down-ci A CI job should let Claude read the repository and run the test suite — and nothing else: no edits, no pushes, and no hanging on a prompt, because there is no human to answer one. Which invocation best fits? - **A.** `claude -p "..." --dangerously-skip-permissions`, so the run never blocks on a permission check. - **B.** `claude -p "..." --permission-mode dontAsk --allowedTools "Read,Bash(npm test)" --bare`. - **C.** `claude -p "..." --permission-mode acceptEdits`, so it can write any test artifacts it needs. - **D.** `claude -p "..."` in the default mode, letting it prompt for approval if it needs more access.
Part 4 Chapter 1 Last verified 2026-06-08 Fresh

Explicit Criteria over Vague Instructions

The controllable lever for output quality is the specification, not the model. Name the success criteria and the output shape explicitly; positive instruction beats negative; the model will not infer a requirement you did not state. This is durable methodology — a stable principle, not a feature surface.

Volatility: stable-principle
Tools compared: claude-code
Before you start: None beyond having prompted Claude for a structured result and been surprised by the variation between runs.
You will learn
  • Apply explicit output-format specification instead of relying on the model to infer a shape
  • Distinguish positive instruction (“do this”) from negative instruction (“don’t do that”) and know which steers better
  • Evaluate when an explicit instruction is enough and when to escalate to examples or a hard schema
  • Recognize why this is a durable principle rather than a feature of any one model version

Part IV opens the domain the exam weights at 20% — getting a model to produce what you actually need. The first lever is the cheapest and the most overlooked: the prompt’s own precision. Everything later in this Part (few-shot, structured outputs, validation loops) is an escalation from this baseline. The principle here outlasts every model version — newer models make it more true, not less — which is why this chapter is a stable principle, not a feature surface.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. Two runs of the same prompt return differently-shaped output. Where does the inconsistency actually live?
  2. A more-literal model (Opus 4.8) is given a prompt with one unstated-but-intended requirement. What happens to that requirement?
  3. Which steers more reliably — “don’t use markdown lists” or “respond in flowing prose” — and why?
  4. Name the three rungs of the output-control escalation ladder, cheapest first.
  5. Why does a newer, more capable model make “be explicit” more important rather than less?
Check your answers
  1. In the specification, not the model — the disagreement was latent in the prompt: a degree of freedom you left unstated, which the model resolved differently each run.
  2. It is simply not honored — Opus 4.8 interprets prompts literally and “does not infer requests you didn’t make,” so an unwritten requirement goes unmet.
  3. “Respond in flowing prose” — a positive instruction points straight at the destination, while a prohibition only names a forbidden region without locating the target inside the still-vast permitted one.
  4. Explicit instruction → few-shot examples (D4.2) → structured outputs / strict tools (D4.3) — each rung costs more, so you climb only as far as the stakes require.
  5. Because newer models follow instructions more literally — they invalidate the assumption that the model will generously guess your intent, so an unstated requirement becomes a liability, not a gap the model fills.

Specify the output format; do not leave it to inference

The single most reliable quality lever is to state the output contract explicitly. “Precisely define your desired output format using JSON, XML, or custom templates so that Claude understands every output formatting element you require.” [Official] Increase output consistency · AnthropicT1-official original A vague instruction (“summarize this”) leaves the shape — length, fields, ordering, what to do with missing data — for the model to guess, and a guess varies from run to run. An explicit instruction (“return JSON with keys sentiment, key_issues (list), and action_items (list of objects with team and task)”) removes the variance at its source.

Key idea

Consistency is a property of the specification, not of the model. If two runs of the same prompt disagree, the disagreement was latent in the prompt — you left a degree of freedom unstated and the model resolved it differently each time. Pin every degree of freedom you care about, and the output stops drifting.

The model will not infer what you did not ask for

Modern models follow instructions more literally, which makes implicit expectations a liability. “Claude Opus 4.8 interprets prompts literally and explicitly, particularly at lower effort levels. It does not silently generalize an instruction from one item to another, and it does not infer requests you didn’t make.” [Official] Prompting best practices · AnthropicT1-official original The upside is precision and less thrash; the cost is that a requirement you held in your head but never wrote down will simply not be honored. The fix is not a cleverer prompt that the model “figures out” — it is stating the requirement.

Expecting the model to read your intent

A prompt that works because the model generously guessed your intent is a prompt that will break the moment the model, the effort level, or the input distribution shifts. “It’ll figure out the format from the one example” is precisely the assumption the more-literal models invalidate by design. If a behavior matters, it belongs in the instruction — not in your expectation of the model’s charity.

Tell the model what to do, not what to avoid

When you are steering format or tone, a positive instruction outperforms a prohibition. The docs are explicit that demonstrating the wanted behavior beats forbidding the unwanted one: “Positive examples showing how Claude can communicate with the appropriate level of concision tend to be more effective than negative examples or instructions that tell the model what not to do.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original “Respond in smoothly flowing prose” steers better than “do not use markdown lists,” because a prohibition names a forbidden region without locating the target inside the (still vast) permitted one. A positive instruction points straight at the destination. The same logic applies to eliminating preambles: state “respond directly without preamble” rather than enumerating the openings you dislike. [Official] Prompting best practices · AnthropicT1-official original

Key idea

A negative instruction shrinks the output space; a positive instruction aims within it. “Don’t be verbose” rules out one failure but leaves a thousand acceptable lengths; “answer in one sentence” picks the one you meant. Specify the target, not the anti-target.

The escalation ladder: instruction, then examples, then a hard schema

Explicit instruction is the first rung, not the only one. The documented hierarchy is to ask plainly first and escalate only when you need a stronger guarantee: “Try simply asking the model to conform to your output structure first, as newer models can reliably match complex schemas when told to, especially if implemented with retries. For classification tasks, use either tools with an enum field containing your valid labels or structured outputs.” [Official] Prompting best practices · AnthropicT1-official original Plain instruction handles most cases; a few-shot example (the next chapter) disambiguates edge cases; a hard schema (D4.3) makes a shape unrepresentable to violate. Each rung costs more — context, latency, setup — so you climb only as far as the stakes require.

Concept ·
  1. Explicit instruction — name the format and the criteria in the prompt. Cheapest; handles the common case.
  2. Few-shot examples (D4.2) — demonstrate the desired handling, especially on ambiguous inputs the instruction can’t fully pin down.
  3. Structured outputs / strict tools (D4.3) — constrain decoding so a non-conforming shape cannot be emitted at all. Strongest guarantee; highest setup cost.
Climbing the ladder on one prompt Worked example

A nightly job starts with a vague prompt and hardens it exactly as far as the stakes demand:

  • Rung 0 (vague). "Summarize this support ticket." Output drifts run to run — one line here, three paragraphs there, sentiment sometimes present — and the downstream parser breaks. The variance is latent in the prompt: length, fields, and sentiment-handling were never pinned.
  • Rung 1 (explicit instruction). "Return JSON: summary (≤2 sentences), sentiment (one of positive/neutral/negative), action_needed (boolean). If there is no clear action, set action_needed=false." Now every degree of freedom the parser cares about is fixed. This handles the common case and costs nothing but words.
  • Rung 2 (few-shot, D4.2). A batch of feature-request tickets keeps getting tagged negative because they describe a missing capability. The instruction can’t fully pin that judgment, so you add two worked examples showing a feature request labeled neutral. The example disambiguates what prose could not.
  • Rung 3 (structured outputs / strict tool, D4.3). The pipeline must never see a label outside the three — but the model occasionally emits "mixed". You move sentiment to an enum-constrained tool / structured output, so a fourth value is unrepresentable, not merely discouraged.

The discipline is to stop at the rung the stakes require. Most fields never leave Rung 1; only the crash-on-violation field (sentiment) earns Rung 3. Climbing further than the risk warrants just buys setup cost and latency you don’t need.

The hands-on craft of writing these prompts — the iteration rhythm, the worked before-and-after examples — is the Use book’s territory; its prompt-engineering chapter is the use-side companion to this exam-angle treatment (forthcoming in the handbook).

[Note]

Cross-book pointer is prose-only: the handbook’s prompt-engineering chapter is outlined but not yet published, so there is no stable URL to link yet.

Practice

Exercise

A nightly job asks Claude to “summarize each support ticket.” The summaries come back wildly inconsistent — some one line, some three paragraphs, some with a sentiment label and some without — and a downstream parser keeps breaking. What is the most direct fix?

  • A. Lower the sampling temperature so the model decodes more deterministically and repeats one shape.
  • B. Add “be consistent and don’t write too much” so the prompt tells it to self-regulate length.
  • C. Specify the exact output contract — a fixed set of typed, length-bounded fields the parser can rely on.
  • D. Switch to a larger model that infers the intended format more reliably from the same prompt.
Practice ◆◆◇◇

Rewrite the negative instruction “don’t include any preamble and don’t use markdown headers” as a positive specification, and explain in one sentence why the positive form steers more reliably.

Practice ◆◆◆◇

You need a model to return one of exactly four category labels, and a malformed or out-of-set label will crash the pipeline. Name the three rungs of the escalation ladder in order, and say which rung you would stop at here and why.

Exercise solutions

Solution ↑ Exercise

C. The inconsistency is latent in the prompt: “summarize” leaves length, fields, and sentiment-handling unstated, so the model resolves them differently each run. Specifying the exact contract — fixed fields, each with a type and a length bound — removes those degrees of freedom and is what a downstream parser needs. A (temperature) reduces token-level randomness but does nothing about an underspecified shape; a deterministic model still has to invent a structure you never gave it. B is a negative, vague instruction — “too much” is undefined and “be consistent” names the goal without specifying the target. D is the exact assumption modern literal-following models invalidate: a bigger model will not infer a contract you didn’t write, and may follow your vague prompt more faithfully, not less.

Solution ↑ Exercise

A positive rewrite: “Respond directly with the answer in flowing prose.” That single instruction covers both prohibitions — “directly” eliminates the preamble, “flowing prose” rules out headers — by naming the target instead of the forbidden regions. It steers more reliably because a prohibition (“don’t use headers,” “no preamble”) shrinks the output space without locating the destination inside the still-vast permitted region, whereas the positive form aims straight at the one shape you meant.

Solution ↑ Exercise

The three rungs, cheapest first: (1) explicit instruction — name the four labels in the prompt and ask the model to return exactly one; (2) few-shot examples — demonstrate the labeling on ambiguous inputs; (3) structured outputs / strict tools — constrain decoding so only an in-set label can be emitted. Here you should climb to rung 3: the premise is that a malformed or out-of-set label crashes the pipeline, so you need the shape to be unviolatable, not merely likely-correct. An enum-constrained tool or structured output makes an out-of-set value unrepresentable — the only rung that turns “should be valid” into “cannot be invalid,” which is what a crash-on-violation contract demands.

Exam essentials

  • Consistency lives in the specification — if two runs disagree, a degree of freedom was left unstated; pin the format (fields, types, lengths, missing-data handling) explicitly.
  • Modern models follow instructions literally — Opus 4.8 “does not infer requests you didn’t make,” so an unstated requirement is an unmet one; state it rather than expecting the model to read intent.
  • Positive beats negative — “respond in flowing prose” / “answer in one sentence” steers better than “don’t use lists” / “don’t be verbose”; aim at the target instead of ruling out one failure.
  • Escalation ladder — explicit instruction → few-shot examples (D4.2) → structured outputs / strict tools (D4.3); climb only as far as the stakes require, since each rung costs more.
  • Why stable-principle — “be explicit about what you want” survives every model version; newer, more-literal models make it more load-bearing, not less.
Part 4 Chapter 2 Last verified 2026-06-02 Fresh

Few-Shot Prompting for Ambiguous Cases

Examples are the most reliable way to steer format, tone, and structure — and the only clean way to pin down an ambiguous case. The pattern is 3-5 relevant, diverse, structured examples, with at least one placed on the edge case showing the desired handling.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D4.1 (explicit criteria, and the escalation ladder where few-shot is the second rung).
You will learn
  • Apply the 3-5-example sweet spot and explain what each example is doing
  • Distinguish the three example-quality criteria — relevant, diverse, structured
  • Analyze an extraction or classification task to find the ambiguous case that needs an example
  • Recognize how few-shot composes with structured outputs rather than competing with it

D4.1’s escalation ladder put examples on the second rung: when a plain instruction can’t fully pin a behavior — especially on the messy, ambiguous inputs — a demonstration does what a description cannot. This chapter is that rung. It is an architectural pattern: the 3-5-example construction and its quality criteria are stable across model versions, with the example-tag syntax as the illustration that may shift.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. When does a demonstration outperform a written instruction — what kind of case is few-shot’s home turf?
  2. What is the documented example-count sweet spot, and what fails below it and above it?
  3. Name the three example-quality criteria. Which one, when neglected, silently corrupts the prompt?
  4. To pin that a missing field resolves to null (not "unknown"), do you add a rule or an example — and where in the set?
  5. Few-shot and structured outputs (D4.3): competing or complementary? What does each one control?
Check your answers
  1. A demonstration wins where description is ambiguous — few-shot’s home turf is the case an instruction can’t fully specify, like how a borderline input should resolve.
  2. 3-5 examples is the documented sweet spot: with 1-2 the model latches onto an incidental trait, and at 6+ you burn context and risk disagreeing examples teaching “either is acceptable.”
  3. Relevant, diverse, structured — diverse is the most often neglected and most consequential: examples that all share an irrelevant trait teach that trait as if it were the rule.
  4. An example — show an input with the missing field whose output is null, placed in the middle of the set; the model generalizes from the example treatment, not from a prose rule beside it.
  5. Complementary — the schema locks the shape while examples teach content and edge-case handling.

Examples are the most reliable steering mechanism

When a behavior is hard to describe, demonstrate it. “Examples are one of the most reliable ways to steer Claude’s output format, tone, and structure. A few well-crafted examples (known as few-shot or multishot prompting) can dramatically improve accuracy and consistency.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original An example carries information a sentence struggles to: the exact field ordering, the precise tone, how a borderline input should resolve. The model does not memorize the examples — it extracts the implicit pattern across them and applies it to the new input.

Key idea

An instruction is a rule the model must interpret; an example is a behavior the model can copy. Where description is ambiguous (“handle missing data gracefully”), demonstration is exact (here is an input with missing data and here is the output I want for it). Few-shot wins precisely on the cases instructions can’t fully specify.

The 3-5 sweet spot

The documented count is small and specific: “Include 3-5 examples for best results. You can also ask Claude to evaluate your examples for relevance and diversity, or to generate additional ones based on your initial set.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original The range is not arbitrary — too few examples and the model latches onto an incidental trait; too many and you burn context for no gain and risk contradictory examples confusing it.

Concept ·
  • 0 examples — the model infers shape from the instruction alone; fine for trivial shapes, breaks on edge cases.
  • 1-2 examples — high risk of picking up an incidental pattern (a capitalization habit, a sentence length) instead of the intended one.
  • 3-5 examples — the sweet spot: one canonical case + one or two variants + one edge case. Enough diversity to disambiguate the real pattern, few enough to stay cheap.
  • 6+ examples — diminishing returns, rising context cost, and a growing chance two examples disagree and teach “either is acceptable.”

Relevant, diverse, structured

Quality matters more than count. The three criteria are explicit: “When adding examples, make them: Relevant: Mirror your actual use case closely. Diverse: Cover edge cases and vary enough that Claude doesn’t pick up unintended patterns. Structured: Wrap examples in <example> tags (multiple examples in <examples> tags) so Claude can distinguish them from instructions.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original Diversity is the criterion most often neglected and the most consequential: three examples that all happen to share an irrelevant trait teach that trait as if it were the rule.

Accidental patterns and disagreeing examples

Two failure modes hide in a careless example set. First, an incidental shared trait — every sample input ends in a period, every output capitalizes the first field — gets learned as a requirement, because the model generalizes from whatever is common across the examples, intended or not. Second, two examples that handle the same edge case differently teach the model that either resolution is fine, which is worse than no example at all. Audit the set for both before adding a sixth example.

Target the ambiguous case directly

This is the heart of the cert task area. When an extraction or classification has a messy case — a missing field, an “other” bucket, an unusual variant — do not write a separate rule for it; show an example on that case with the handling you want. The model generalizes from the example treatment, not from a prose rule beside it. Put the ambiguous input in the middle of the set with its desired output:

<examples>
  <example>
    <input>Order #4815 shipped on Apr 3 via UPS tracking 1Z999AA10123456784.</input>
    <output>{"order_id": "4815", "carrier": "UPS", "tracking": "1Z999AA10123456784"}</output>
  </example>
  <example>
    <input>Customer asked about order status yesterday but gave no order number.</input>
    <output>{"order_id": null, "carrier": null, "tracking": null}</output>
  </example>
  <example>
    <input>Shipped today via 'FedEx Express Saver' - see ref 7712-4488-9933.</input>
    <output>{"order_id": null, "carrier": "FedEx", "tracking": "7712-4488-9933"}</output>
  </example>
</examples>

The middle example is the ambiguous case: it teaches that “no order number” resolves to null — not an empty string, not "unknown", not "n/a". [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original Few-shot also composes with the next rung of the ladder: with structured outputs (D4.3), the schema locks the shape while examples still teach content and edge-case handling — the two are complementary, not redundant. [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original A schema can require order_id to be a string-or-null; only an example teaches that this kind of input is the null one.

Constructing a 3-5 set that targets the edge case Worked example

Task: classify a support ticket as one of bug | feature_request | question | other. The hard case is a feature request phrased as a complaint (“Your export is useless without CSV”) — instructions keep mislabeling it bug. Build the set deliberately:

<examples>
  <example>  <!-- canonical bug -->
    <input>Clicking Export throws a 500 error every time.</input>
    <output>{"category": "bug"}</output>
  </example>
  <example>  <!-- the edge case, placed in the middle -->
    <input>Your export is useless without CSV — fix this.</input>
    <output>{"category": "feature_request"}</output>
  </example>
  <example>  <!-- a question, to vary the class -->
    <input>Where do I find the export button?</input>
    <output>{"category": "question"}</output>
  </example>
  <example>  <!-- the "other" bucket -->
    <input>Thanks for the great support last week!</input>
    <output>{"category": "other"}</output>
  </example>
</examples>

Read it against the three criteria. Relevant: every input is a real ticket in your actual phrasing, not a toy sentence. Diverse: four different categories, and the second example deliberately breaks the “angry tone → bug” pattern an instruction-only prompt was learning — that one example is doing the real work. Structured: each pair is wrapped in <example>, the set in <examples>, so the model separates demonstrations from the instruction and the input. Four examples, one per class with the edge case carrying its own slot — squarely in the 3-5 sweet spot. Note what is not here: a prose sentence saying “angry feature requests are not bugs.” The demonstration replaces that rule, and does it more reliably.

The use-side craft of building and iterating on an example set lives in the Use book’s prompt-engineering chapter (forthcoming), alongside D4.1’s hands-on companion material.

[Note]

Cross-book pointer is prose-only — the handbook’s prompt-engineering chapter is outlined but unpublished, so there is no stable URL yet.

Practice

Exercise

You are extracting {invoice_id, amount, due_date} from emails. Most are clean, but some have no due date at all, and the pipeline keeps emitting "due_date": "unknown" for those — which the downstream date parser rejects. You want missing dates to come back as null. What is the most reliable fix?

  • A. Add a sentence to the instruction: “if there is no due date, use null, not ‘unknown’.”
  • B. Include 3-5 examples, one of which is an email with no due date whose output shows "due_date": null.
  • C. Raise the example count to 10 clean invoices so the model has more data to generalize from.
  • D. Post-process every "unknown" string into null after the model returns.
Practice ◆◆◇◇

Name the three example-quality criteria from the docs, and explain why “diverse” is the one whose absence most often silently corrupts a few-shot prompt.

Practice ◆◆◆◇

A colleague uses a single, carefully chosen example and is puzzled that the model copies an odd quirk of it (it always quotes the first field). Explain what is happening in terms of the 3-5 guidance, and what changing the count fixes.

Exercise solutions

Solution ↑ Exercise

B. The ambiguous case — an input with no due date — is exactly what an example should demonstrate: place one in the set whose output shows "due_date": null, and the model generalizes from that treatment. A is the D4.1 instinct (be explicit), and it helps, but a prose rule beside the examples is weaker than a demonstration on the case itself: the model generalizes from the example treatment, not from a separate written rule sitting next to it (the principle established in “Target the ambiguous case” above). C adds volume without diversity: ten clean invoices never show the missing-date case, so the model still has nothing to copy for it. D works mechanically but is a band-aid that hard-codes one symptom (“unknown”); the model may emit “none” or "" next, and you are now maintaining a translation table instead of fixing the prompt. The few-shot fix targets the root cause.

Solution ↑ Exercise

The three criteria are relevant (mirror your actual use case closely), diverse (cover edge cases and vary enough that the model doesn’t latch onto an unintended pattern), and structured (wrap each example in <example> tags, the set in <examples>, so the model separates demonstrations from instructions). Diversity is the silent corrupter: when three examples happen to share an incidental trait — every input ends in a period, every output capitalizes the first field — the model generalizes from whatever is common across the set, so it learns that trait as if it were the rule. The prompt still “looks right,” but it has quietly taught the wrong invariant, and the failure only surfaces on inputs that don’t share the accidental trait.

Solution ↑ Exercise

With a single example, the model cannot tell which of the example’s traits are the pattern and which are incidental — quoting the first field is one concrete trait of that one sample, and with nothing to contrast it against, the model copies it as if it were required. This is the documented failure of 1-2 examples: high risk of picking up an incidental pattern instead of the intended one. Raising the count into the 3-5 range fixes it by adding contrast: across several examples that vary the incidental traits (some quote nothing, different field orders) while holding the intended pattern constant, the first-field-quoting habit no longer appears in every example, so the model stops treating it as the rule. The cure is diversity-via-count, not volume for its own sake.

Exam essentials

  • Few-shot is the most reliable steering mechanism for format, tone, and structure — the model extracts the implicit pattern across examples, so it disambiguates where instructions can’t.
  • 3-5 examples is the documented sweet spot: 1-2 risks learning an incidental trait, 6+ burns context and risks contradictions; budget the 3-5 as one canonical + variants + one edge case.
  • Relevant, diverse, structured — mirror the real use case, vary enough to avoid spurious patterns, and wrap each in <example> (group in <examples>) so the model separates examples from instructions and input.
  • Target the ambiguous case — put an example on the edge case (null field, “other” bucket) showing the desired handling; the example teaches it, a prose rule beside it teaches it less reliably.
  • Composes with structured outputs — the schema locks shape, examples teach content and edge-case handling; complementary, not redundant.
Part 4 Chapter 3 Last verified 2026-06-08 Fresh

Structured Output via Tool Use and JSON Schema

Forcing a known-shape JSON result has two generations. The classic pattern borrows the tool-call channel as a typed output slot; the modern features (strict tool use and output_config.format) use grammar-constrained decoding to make a non-conforming shape unrepresentable. The JSON-Schema subset, additionalProperties false, and the per-request limits are the surfaces to know.

Volatility: feature-surface
Tools compared: claude-code
Before you start: Chapter D4.1 (the escalation ladder — this is its top rung) and D2.1-D2.3 (tool definitions, input_schema, tool_choice).
You will learn
  • Apply the classic tool-use pattern that turns a tool call into a typed output slot
  • Apply strict: true and output_config.format to constrain tool inputs and responses
  • Recognize the JSON-Schema subset and the mandatory additionalProperties: false
  • Analyze the per-request limits, the 24-hour grammar cache, and the two failure modes (refusal and truncation) constrained decoding cannot prevent

D4.1 ended with a top rung: when a shape must not be violated, make it unrepresentable. This chapter is that rung’s machinery. It has two generations — the older tool-use pattern that is still the right tool for open-ended schemas, and the newer grammar-constrained features that eliminate schema-violation retries entirely — and because the substance here is named API fields, schema rules, and numeric limits, it is a feature surface.

[Note]

This is a feature-surface chapter: field names (output_config.format, strict: true), the schema subset, and the 20/24/16 request limits are concrete surfaces that move between releases. The output_format parameter has already migrated once. Treat every field and number as a current snapshot and re-verify before relying on it.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. In the classic pattern, which field of the forced tool call holds your data, and what do you do with the tool’s “result”?
  2. What does strict: true guarantee that the classic pattern alone does not — and through which integration path is it silently dropped?
  3. Which JSON-Schema keyword is mandatory on every object node, and why does the constrained decoder require it?
  4. Name the two failure modes constrained decoding cannot prevent, and the stop_reason value that signals each.
  5. For open-ended extraction (“I don’t know which fields will appear”), do you reach for structured outputs or classic tool use, and why?
Check your answers
  1. The data is the forced call’s tool_use.input — that object is your extracted JSON; the tool’s “result” is discarded entirely.
  2. It grammar-constrains tool inputs to your schema — no wrong types ("2" for 2), no missing required fields; it is silently dropped by the OpenAI SDK compatibility layer, which honors the request but gives no grammar guarantee.
  3. additionalProperties: false — an open object has no closed grammar to compile, so the decoder must know exactly which keys are permitted at each step.
  4. Refusal (stop_reason: "refusal" — a 200 you are billed for, output may not match) and truncation (stop_reason: "max_tokens" — every token schema-valid but the object never closed).
  5. Classic tool use — open-ended extraction needs additionalProperties: true, which structured outputs cannot accept since it requires additionalProperties: false on every object.

The classic mechanism: a tool whose input is your output

The oldest reliable way to get JSON is to borrow the tool-call channel. Define a tool whose input_schema is exactly the shape you want back, force Claude to call it, and read the call’s input — that object is your extracted JSON; you discard the tool’s “result” entirely. The convention is to name the tool print_X (print_summary, print_entities) so the model treats it as committing data rather than taking an action. [Official] Extracting Structured JSON using Claude and Tool Use · AnthropicT1-official original Forcing the call is what guarantees the extraction happens: tool_choice: {type: "tool", name: "print_summary"}. [Official] Extracting Structured JSON using Claude and Tool Use · AnthropicT1-official original

tools = [{
    "name": "print_summary",
    "description": "Prints a summary of the article.",
    "input_schema": {
        "type": "object",
        "properties": {
            "author":  {"type": "string"},
            "topics":  {"type": "array", "items": {"type": "string"}},
            "summary": {"type": "string"},
        },
        "required": ["author", "topics", "summary"],
    },
}]
resp = client.messages.create(model="claude-opus-4-8", max_tokens=1024, tools=tools,
    tool_choice={"type": "tool", "name": "print_summary"}, messages=[...])
json_summary = next(b.input for b in resp.content if b.type == "tool_use")
Key idea

You are not calling a function — you are renting the typed tool-call slot as an output channel. The “tool” never does anything; its input_schema is just the most convenient place the API gives you to declare a shape, and tool_use.input is where the shaped result comes back.

Strict tool use: from shape to guarantee

The classic pattern controls which fields appear, but not their types — Claude could still emit "2" where you need 2. Setting strict: true on the tool definition closes that gap: “Setting strict: true on a tool definition guarantees Claude’s tool inputs match your JSON Schema by constraining the model’s token sampling to schema-valid outputs (a technique called grammar-constrained sampling).” [Official] Strict tool use · AnthropicT1-official original The motivation is operational: “Without strict mode, Claude might return incompatible types (‘2’ instead of 2) or missing required fields, breaking your functions and causing runtime errors.” [Official] Strict tool use · AnthropicT1-official original For “call one of N candidate tools and validate its inputs,” combine tool_choice: {type: "any"} with strict: true on each tool. [Official] Strict tool use · AnthropicT1-official original

strict is ignored through the OpenAI-compatibility layer

strict: true is honored only on the native Claude API. Calling through the OpenAI SDK compatibility layer silently drops it — the request succeeds, but you get no grammar guarantee, so a type mismatch you thought was impossible can reappear. Schema-shape guarantees require the native API path.

Structured outputs: constrain the response itself

Strict tool use constrains a tool call. Its sibling constrains Claude’s response directly: output_config.format coerces the final assistant text to a JSON schema using the same pipeline. The two are “two complementary features: JSON outputs (output_config.format) … Strict tool use (strict: true),” and the payoff is the elimination of retry loops: “Structured outputs guarantee schema-compliant responses through constrained decoding … No retries needed for schema violations.” [Official] Structured outputs · AnthropicT1-official original The request carries output_config: {format: {type: "json_schema", schema: {...}}}, and the conforming JSON arrives in the response text.

Concept ·
  • output_config.format — constrains what Claude says (the response text) to a JSON schema. Use when you want a free-form answer coerced to a fixed shape.
  • strict: true — constrains how Claude calls your tools (the tool_use.input). Use mid-agent-loop, or for forced single-tool extraction.
  • Both compile your JSON Schema into a grammar and constrain token sampling; they compose in one request.

Note the migration: “The output_format parameter has moved to output_config.format, and beta headers are no longer required. The old beta header (structured-outputs-2025-11-13) and output_format parameter will continue working for a transition period.” [Official] Structured outputs · AnthropicT1-official original

The JSON-Schema subset and its one mandatory rule

Both features accept a subset of JSON Schema, not the full draft. Objects, arrays, the scalar types, enum, const, anyOf, and internal $ref are supported; external $ref, recursive schemas, numerical bounds (minimum/maximum), and string-length bounds are not — unsupported features return a 400. [Official] Structured outputs · AnthropicT1-official original The one rule that catches everyone: additionalProperties: false is required on every object node — it is the most common 400 for hand-authored schemas. When you need a numeric or length bound, the SDK helpers strip it from the schema, encode it as description text, and validate it client-side after the call instead. [Official] Structured outputs · AnthropicT1-official original

The missing additionalProperties: false

A schema that validates fine in your editor and then 400s at the API almost always has an object node without additionalProperties: false. The constrained decoder requires it on every object — nested ones included — because an open object has no closed grammar to compile. When a structured-outputs request rejects with a schema error, scan for the object that is missing it before suspecting anything subtler.

Limits, caching, and the failure modes that still get through

Three operational facts complete the picture. Caching: the compiled grammar carries a first-request latency, then is “cached for 24 hours from last use,” and the cache “invalidates if you change the JSON schema structure or set of tools. Changing only name or description fields does NOT invalidate cache.” [Official] Structured outputs · AnthropicT1-official original Limits: a request allows at most 20 strict tools, 24 cumulative optional parameters across strict schemas, and 16 union-typed parameters; beyond that (or an internal grammar-size cap) you get a 400 “Schema is too complex for compilation.” [Official] Structured outputs · AnthropicT1-official original The failures constrained decoding cannot prevent. Grammar constraints guarantee every emitted token is schema-valid — but not that Claude emits a complete result, so two gaps survive and a caller must check stop_reason for both. First, a refusal: if Claude refuses, the response is stop_reason: "refusal" with a 200 status, you are billed, and the output may not match the schema. [Official] Structured outputs · AnthropicT1-official original Second, truncation: max_tokens is a hard cap on output, and stop_reason: "max_tokens" is its output-budget signal. [Official] How the agent loop works · AnthropicT1-official original If generation hits that cap mid-structure, every token emitted was schema-valid but the object never closed — the JSON is cut off, so a parser rejects it just the same. The fix is not a retry on the same budget but a larger max_tokens (or a smaller schema); the constrained decoder cannot finish a structure it ran out of room to write.

This is also why the classic cookbook pattern has a permanent niche: open-ended extraction with additionalProperties: true — “I don’t know which fields will be present” — is something structured outputs cannot do, since it requires additionalProperties: false. For open-ended schemas, plain tool use stays the right tool. [Official] Extracting Structured JSON using Claude and Tool Use · AnthropicT1-official original

Reading a strict extraction without trusting it blindly Worked example

You force a strict print_record extraction. The grammar guarantees the shape — but a naive caller that reads the result directly still crashes on the two failures above. Guard them:

resp = client.messages.create(
    model="claude-opus-4-8", max_tokens=1024, tools=[print_record],
    tool_choice={"type": "tool", "name": "print_record"}, messages=[...])

# strict: true makes each emitted token schema-valid -- not the response complete.
if resp.stop_reason == "refusal":
    raise ExtractionRefused()        # 200 status, you were billed, may not match schema
if resp.stop_reason == "max_tokens":
    raise OutputTruncated()          # valid tokens, but the object never closed
record = next(b.input for b in resp.content if b.type == "tool_use")

Trace the reasoning: the grammar buys you per-token schema-validity, so you will never see "3" where you required 3. It does not buy you a guarantee that a record arrived. A refusal returns a 200 you paid for with no conforming object; a max_tokens stop cuts the JSON off mid-write so tool_use.input is absent or malformed. Reading resp.content before checking stop_reason turns both into a confusing StopIteration or parse error three layers downstream. Two cheap branches convert them into legible, actionable failures — and the max_tokens branch tells you to raise the cap, not to retry the same doomed request.

Practice

Exercise

You extract a fixed record — {customer, plan, seats} with seats an integer — and a downstream billing function will crash on a string where it expects a number. You want exactly this one extraction, type-guaranteed, on the native Claude API. Which approach fits best?

  • A. Force a single print_record tool with tool_choice: {type: "tool", name: "print_record"} and set strict: true on it.
  • B. Use the classic cookbook pattern with additionalProperties: true so the model can include whatever it finds.
  • C. Call through the OpenAI SDK compatibility layer with strict: true for portability.
  • D. Ask in the prompt for “valid JSON with an integer seats field” and parse the response text.
Practice ◆◆◇◇

A hand-authored structured-outputs schema returns a 400 even though it is valid JSON Schema. State the most likely cause and the one-line fix, and explain why the decoder requires it.

Practice ◆◆◆◇

Structured outputs is the modern default, yet the older tool-use pattern still has a niche it cannot be replaced in. Name that niche, the exact schema feature that defines it, and why constrained decoding cannot serve it.

Practice ◆◆◆◇

Your strict extraction usually works, but on unusually long inputs the returned JSON is occasionally cut off mid-object and your parser throws — even though the grammar was supposed to guarantee a valid shape. What is actually happening, which stop_reason confirms it, and what is the fix? Explain why a plain retry of the identical request is not the fix.

Exercise solutions

Solution ↑ Exercise

A. A single forced tool plus strict: true gives both guarantees you need: the forced tool_choice ensures exactly this extraction runs, and strict constrains the inputs by grammar so seats is a real integer, not "3" — exactly the “incompatible types breaking your functions” case strict mode exists to prevent. B is the open-ended pattern; additionalProperties: true is for when you don’t know the fields, and it forgoes the strict guarantee — wrong for a fixed, type-critical record. C silently drops strict through the OpenAI-compatibility layer, so you lose the type guarantee precisely where you needed it. D is the unconstrained baseline D4.1 warned about — it can parse fine and still hand you "3", the failure you are trying to design out.

Solution ↑ Exercise

The most likely cause is an object node missing additionalProperties: false, and the one-line fix is to add it to every object in the schema — nested ones included. The decoder requires it because an open object (one allowing arbitrary extra keys) has no closed set of valid continuations to compile into a grammar: at each decoding step the model must know exactly which keys are permitted, and additionalProperties: false is what closes that set. Without it there is no finite grammar to constrain sampling against, so the API rejects the schema with a 400 rather than allow unconstrained keys. It is the top 400 for hand-authored schemas precisely because standard JSON Schema defaults additionalProperties to true, so a schema that “validates fine in your editor” still fails compilation here.

Solution ↑ Exercise

The niche is open-ended extraction — “I don’t know which fields will be present” — and the schema feature that defines it is additionalProperties: true (an object that may carry arbitrary, unknown keys). Constrained decoding cannot serve it because it requires additionalProperties: false on every object: the grammar must enumerate the permitted keys ahead of time, and an object that allows any key has no closed grammar to compile. So when the set of fields is genuinely unknown in advance, the classic print_X tool-use pattern — which imposes no such closure — remains the right tool; structured outputs is for shapes you can pin down completely.

Solution ↑ Exercise

What is happening is truncation, not a grammar failure. max_tokens is a hard cap on total output, and the generation ran into it partway through writing the object; the grammar did its job — every token emitted was schema-valid — but it cannot guarantee the structure finishes within the budget, so the JSON is cut off before its closing braces and the parser rejects it. The confirming signal is stop_reason: "max_tokens" (the output-budget value), as opposed to end_turn. The fix is to raise max_tokens (or shrink the schema / split the extraction). A plain retry of the identical request is not the fix because it re-runs against the same budget and truncates at the same place — you must enlarge the room before the structure can complete.

Exam essentials

  • Classic tool-use pattern — define a print_X tool whose input_schema is your output shape, force it with tool_choice: {type: "tool", name: ...}, read tool_use.input; the tool result is discarded.
  • strict: true — grammar-constrains tool inputs to the schema (no wrong types, no missing required fields); pair with tool_choice: {type: "any"} for “one-of-N and valid.” Ignored on the OpenAI-compat layer.
  • output_config.format — grammar-constrains Claude’s response to a JSON schema; “no retries needed for schema violations.” Migrated from the output_format param / structured-outputs-2025-11-13 beta header.
  • Schema subset — a subset of Draft 2020-12; additionalProperties: false is mandatory on every object (top 400 cause); no numeric/length bounds (SDKs strip them to descriptions + post-validate); no external $ref or recursion.
  • Limits + failure modes — 20 strict tools / 24 optional params / 16 union types per request; grammar cached 24h from last use (invalidated by schema/tool-set change, not name/description); two failures slip past constrained decoding — a refusal (stop_reason: "refusal"; 200, billed, may not match) and truncation (stop_reason: "max_tokens"; JSON cut off mid-object). Always check stop_reason for both; raise max_tokens for the second rather than retrying.
Part 4 Chapter 4 Last verified 2026-06-02 Fresh

Validation, Retry, and Feedback Loops

Constrained decoding eliminates schema errors, never semantic ones — valid JSON can still hold wrong data. The architect's job is the layer above the schema, discriminating the two error kinds, encoding semantic checks into the schema itself, and closing a bounded validate-feed-back-retry loop that escalates to a human on exhaustion.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D4.3 (structured outputs and what constrained decoding guarantees — and what it doesn't).
You will learn
  • Distinguish schema errors from semantic errors and know which the API can and cannot catch
  • Apply the Agent SDK’s retry loop and discriminate its result by subtype
  • Design schema-level fields that turn an opaque “is it correct?” into a measurable cross-check
  • Evaluate how to bound a validate-feed-back-retry loop and when to escalate to a human

D4.3 closed with a hard guarantee — a schema a response cannot violate — and one caveat: a refusal still gets through. This chapter is about a deeper caveat. Constrained decoding guarantees the shape, never the truth. A perfectly schema-valid record can name the wrong customer or fabricate a total. The pattern for catching that lives above the API, in your validation loop — and the loop, not its current field names, is what this chapter is about, which makes it an architectural pattern.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. A schema-valid record names the wrong customer. Which error class is this, and can the API catch it?
  2. An SDK structured-output run returns. How do you tell success from exhausted retries, and why must you check before reading the payload?
  3. To catch a total that doesn’t match the line items, what schema-level hook makes the error mechanically checkable?
  4. Name the three layers of the validate-feed-back-retry pattern. Which layer catches a semantic error?
  5. Your retry loop keeps exhausting on long inputs that come back cut off. Why won’t more retries help, and what will?
Check your answers
  1. A semantic error — valid JSON, incorrect data — and the API cannot catch it: a schema constrains form, never fact.
  2. Branch on subtype — success carries the payload in message.structured_output, error_max_structured_output_retries means fall back; exhaustion is a result, not an exception, so unchecked code silently reads undefined.
  3. The stated_total vs calculated_total pair — the model copies the document’s total and re-sums the line items, and the caller compares the two.
  4. API constrained decoding, application-code semantic checks, and a bounded feedback loop — the semantic error is caught by layer 2, application-code checks.
  5. The failure is truncation (stop_reason: "max_tokens") — re-prompting on the same budget truncates at the same place; raise the cap (or shrink the schema) instead.

Two kinds of error: schema and semantic

Structured outputs (D4.3) eliminates a whole class of failure: “Always valid: No more JSON.parse() errors. Type safe: Guaranteed field types and required fields. Reliable: No retries needed for schema violations.” [Official] Structured outputs · AnthropicT1-official original What it cannot touch is the other class — the semantic errors: responses that are valid JSON matching your schema but containing incorrect data, the very failures the SDK’s validate-and-feed-back machinery exists to catch. [Official] Get structured output from agents · AnthropicT1-official original A schema can require customer_name to be a non-empty string; it cannot know the source said “Jane” while the model wrote “John.”

Key idea

A schema constrains form, never fact. Constrained decoding makes a wrong-shaped output impossible and a wrong-data output exactly as likely as before. Once you adopt structured outputs, your remaining error budget is entirely semantic — so that is where your validation effort should move.

The SDK retry loop handles the schema layer

For the residual schema mismatches in a multi-tool agentic run, the Agent SDK adds a retry loop: “the SDK validates the output against it, re-prompting on mismatch. If validation does not succeed within the retry limit, the result is an error instead of structured data.” [Official] Get structured output from agents · AnthropicT1-official original Crucially, exhaustion is a result you inspect, not an exception that throws — you discriminate on subtype: success carries the typed payload in message.structured_output; error_max_structured_output_retries means the budget ran out and you must fall back. [Official] Get structured output from agents · AnthropicT1-official original

Not handling the exhausted-retries subtype

Because exhaustion returns a result rather than raising, code that reads message.structured_output without first checking subtype will silently process undefined on the failure path. The two subtypes — success and error_max_structured_output_retries — are the contract; branch on them explicitly and define the fallback (simpler schema, simpler prompt, or human review). The retry count itself is not documented, so never hard-code an assumption about how many attempts you got.

Encode the semantic check into the schema

You cannot retry your way out of a semantic error the SDK never sees — so make the model commit to signals a caller can check. The pattern is to add fields whose only job is verification:

Concept ·
  • detected_pattern — the model declares the format it inferred (e.g. currency_usd); the caller re-checks it against the value with a regex.
  • stated_total vs calculated_total — the model both copies the document’s total and re-sums the line items; the caller compares them, and a mismatch means bad source or a fabricated number.
  • conflict_detected + conflict_reason — the model flags when two source fields contradict; true routes to human review.
  • Nullable “other” + detail — a closed enum plus an always-nullable detail string lets the model refuse to confabulate a category instead of forcing a wrong one.
  • Provenance triple — claim + source.span_quote + confidence; the caller verifies the quoted span actually appears in the cited document (D5.6).

Each pattern converts an un-checkable judgment (“is this right?”) into a mechanical test (“does calculated_total equal the sum?”). [Official] Get structured output from agents · AnthropicT1-official original The model is doing the same extraction either way; you are just asking it to show enough of its work that a downstream check can catch a lie.

Close the loop: validate, feed back, retry, escalate

The full pattern stacks three independent layers, and skipping any one surfaces a different failure. [Official] Get structured output from agents · AnthropicT1-official original Layer 1 is constrained decoding (schema errors gone). Layer 2 is your application code running the semantic cross-checks above. Layer 3 re-prompts with the specific failures (“calculated_total does not equal the sum of line items; re-extract correcting this”) for a bounded number of attempts, then falls back.

Unbounded, schema-thrashing, or truncation-trapped retry loops

Three ways the feedback loop bites back. First, an unbounded loop on a genuinely ambiguous task never converges — bound the attempts and escalate to human review on exhaustion, because the alternative is paying inference forever for an answer that will not come. Second, each retry is a full inference, so a three-retry run costs roughly four times a clean one — and if you mutate the schema between attempts you also invalidate the grammar cache (D4.3) and re-pay compilation; keep the schema fixed across retries and vary only the feedback message. Third, a truncation is the one failure retries cannot fix on their own: if a response stopped at the max_tokens cap — stop_reason: "max_tokens", the output-budget signal — re-prompting on the same budget truncates at the same place and silently burns the whole retry allowance. [Official] How the agent loop works · AnthropicT1-official original Detect that stop reason and raise the cap (or shrink the schema) instead of spending a single retry on it.

The schema-design heuristics fold back into D4.3: keep schemas focused, and mark fields optional when the source might not contain them — an over-required schema turns a missing field into a retry and then an exhausted-budget error. [Official] Get structured output from agents · AnthropicT1-official original

The three layers on one invoice Worked example

A nightly invoice extractor runs the full pattern, and a fabricated total shows where each layer earns its place:

  • Layer 1 — constrained decoding. Structured outputs returns a schema-valid object: { "stated_total": 480.00, "calculated_total": 520.00, "line_items": [...] }, with subtype: "success". No JSON.parse error is possible and both totals are guaranteed floats. The schema layer is done — and it has caught nothing wrong, because nothing is wrong with the shape.
  • Layer 2 — application-code semantic check. Your code re-sums line_items (520.00) and compares it to stated_total (480.00). 480 ≠ 520 → a semantic error the API could never see: both numbers are valid, but they disagree. This catch exists only because you added the stated_total / calculated_total pair to the schema — the model showed enough work for a mechanical test to run.
  • Layer 3 — bounded feedback. The loop re-prompts with the specific failure: “stated_total (480.00) does not match the sum of line_items (520.00); re-extract, correcting the discrepancy.” Bound it to, say, three attempts. If a retry reconciles, return success; if the budget exhausts — subtype: "error_max_structured_output_retries" or a persistent mismatch — route to human review, never silently bill the cheaper of two totals.

The discipline: layer 1 is free and total; layer 2 is where your design effort goes (the verification fields don’t exist until you add them); layer 3 must be bounded with a human backstop. Skip layer 2 and the bad total reaches billing; leave layer 3 unbounded and you pay inference forever on an answer that will not come.

Practice

Exercise

An invoice extractor uses structured outputs and never returns malformed JSON, but once in a while it reports a total that doesn’t match the line items — and those slip through to billing. You want to catch them automatically. What is the most effective design?

  • A. Add strict: true to the extraction tool so the totals are type-guaranteed.
  • B. Add both stated_total and calculated_total to the schema and have application code compare them, routing any mismatch to human review.
  • C. Raise max_tokens so the model has room to compute the total more carefully.
  • D. Tighten the JSON schema with a minimum constraint on total.
Practice ◆◆◇◇

Name the two result subtypes the Agent SDK returns for a structured-output run, state what each means for the caller, and explain why you must branch on subtype before reading the payload.

Practice ◆◆◆◇

A pipeline must catch a fabricated customer_name (valid string, wrong person). Name the three layers of the validation-retry-feedback pattern, say which layer can catch this error and which cannot, and what schema-level hook would make the catch possible.

Practice ◆◆◆◇

Your validate-retry loop keeps hitting error_max_structured_output_retries on your longest documents, which come back as cut-off JSON. (a) What is the actual failure, and which stop_reason confirms it? (b) Why does the retry loop make it worse rather than better? (c) What is the fix?

Exercise solutions

Solution ↑ Exercise

B. A wrong-but-well-typed total is a semantic error — valid JSON, incorrect data — so no schema or type guarantee touches it. The fix is to make the error checkable: have the model emit both the document’s stated_total and its own calculated_total, then let application code compare them and route mismatches to review. A (strict) guarantees the total is a number, which it already was; it does nothing about a number being wrong. C (more tokens) addresses truncation, not arithmetic fabrication. D (minimum) is both unsupported by the structured-outputs subset and irrelevant — a bound on magnitude can’t detect a total that’s internally inconsistent with the line items.

Solution ↑ Exercise

The two subtypes are success — the run validated, and the typed payload is on message.structured_output — and error_max_structured_output_retries — validation failed within the retry budget, so there is no payload and you must fall back (simpler schema, simpler prompt, or human review). You must branch on subtype before reading the payload because exhaustion returns a result, not an exception: code that reads message.structured_output on the error path reads undefined and silently processes garbage downstream. The subtype is the success/failure contract; the payload is present and trustworthy only on success.

Solution ↑ Exercise

The three layers are (1) API constrained decoding (the schema layer — eliminates syntax/type/required/enum errors), (2) application-code semantic checks (the domain layer — cross-checks what the data means), and (3) a bounded feedback loop (re-prompts with the specific failures, then escalates on exhaustion). A fabricated customer_name — a valid string naming the wrong person — is a semantic error: layer 1 cannot catch it (the shape is perfect) and layer 2 is the one that must. The hook that makes the catch possible is a provenance field: have the model emit the source span it drew the name from (claim + source.span_quote + confidence), so application code can verify the quoted span actually appears in the document — turning “is this the right person?” into a mechanical string-containment check (D5.6).

Solution ↑ Exercise

(a) The failure is truncation, not a schema or semantic error: the response hit the max_tokens output cap partway through writing the object, confirmed by stop_reason: "max_tokens" (the output-budget value, versus end_turn). (b) The retry loop makes it worse because each re-prompt runs against the same max_tokens budget, so it truncates at the same place — every attempt fails validation identically, and the loop burns its entire allowance reaching error_max_structured_output_retries without ever being able to succeed. (c) The fix is to detect stop_reason: "max_tokens" and raise the cap (or shrink / split the schema) before retrying — retries cannot manufacture room the budget does not allow.

Exam essentials

  • Schema vs semantic — constrained decoding eliminates syntax/type/required/enum errors; semantic errors (valid JSON, wrong data) are invisible to the API and need domain logic.
  • SDK retry loop — validates and re-prompts on mismatch; the result is success (payload in message.structured_output) or error_max_structured_output_retries (fall back). It’s a result you check, not an exception; the retry count is undocumented.
  • Schema-level semantic hooks — detected_pattern, stated_total vs calculated_total, conflict_detected, nullable “other” + detail, and the provenance triple turn “is it correct?” into a mechanical cross-check.
  • Three layers — API constrained decoding (schema) + application-code semantic checks + a bounded feedback loop that re-prompts with the specific errors; escalate to human review on exhaustion.
  • Loop economics — bound the attempts (each retry is a full inference, ~4× for three retries) and keep the schema stable across retries so you don’t re-pay grammar compilation. A truncation (stop_reason: "max_tokens") is the trap retries can’t fix — re-prompting on the same budget re-truncates; detect it and raise the cap instead.
Part 4 Chapter 5 Last verified 2026-06-08 Fresh

Batch Processing: The Message Batches API

When nothing is waiting on the answer, batch trades latency for half the price. The Message Batches API processes up to 100,000 async requests at a 50 percent discount with a 24-hour SLA, and its one non-negotiable contract is custom_id matching, because results come back in any order.

Volatility: feature-surface
Tools compared: claude-code
Before you start: Chapter D4.3-D4.4 (structured outputs, which compose with batch). Familiarity with the Messages API request shape.
You will learn
  • Evaluate when an async batch beats a real-time call on cost and throughput
  • Apply the custom_id contract to match results that return out of order
  • Recognize the batch envelope — size limits, single-shot requests, no streaming
  • Recognize the billing model and the extended-output beta
  • Distinguish a batch-level result type from a per-message stop_reason (a succeeded result can still be a refusal or a truncation)

The first three chapters of Part IV controlled a single response. This one scales to a hundred thousand of them. A batch is the right tool whenever the work is large and nothing is waiting on the answer — and its surface (the endpoint, the size limits, the custom_id rule, the beta header) is exactly the kind of named detail that shifts between releases, so this is a feature-surface chapter.

[Note]

This is a feature-surface chapter: the /v1/messages/batches endpoint, the 100,000 / 256 MB limits, the 24-hour SLA, and the output-300k-2026-03-24 beta header are concrete surfaces that move between releases. Treat every number and header as a current snapshot and re-verify before relying on it.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. What single factor decides batch versus real-time, and which way does each tolerance point?
  2. Why is custom_id mandatory, and what specifically breaks if you rely on result order?
  3. Name the two size limits on a single batch, and which one an HTTP 413 reports.
  4. A batch result comes back succeeded — does that mean its answer is usable? What must you still check?
  5. Which result types are not billed, and which billed-but-possibly-useless outcome is the trap?
Check your answers
  1. Latency tolerance alone decides: if a human or synchronous system is blocked on the result, batch is wrong; if the work is an overnight job, backfill, or offline evaluation, batch halves the bill.
  2. Results “can be returned in any order,” so the unique custom_id is the only sanctioned join key; relying on positional order silently mismatches outputs to inputs, and nothing in the response flags it.
  3. 100,000 requests or 256 MB, whichever is reached first — an HTTP 413 on creation reports the 256 MB payload limit.
  4. No — succeeded is a batch-level outcome that says the request ran; you must still inspect the message’s own stop_reason, because a refusal ("refusal") or truncation ("max_tokens") arrives as succeeded.
  5. errored, canceled, and expired are not billed; the trap is a succeeded refusal — it returns a 200, you pay for it, and it may not match your schema.

The cost-latency trade

The Message Batches API exists for one trade: give up immediacy, get half off. “The Message Batches API is a powerful, cost-effective way to asynchronously process large volumes of Messages requests. This approach is well-suited to tasks that do not require immediate responses, with most batches finishing in less than 1 hour while reducing costs by 50% and increasing throughput.” [Official] Batch processing (Message Batches API) · AnthropicT1-official original The discount is a flat 50% on both input and output across all tiers, and it stacks with prompt-caching discounts. The cost of the discount is a service-level agreement measured in hours, not milliseconds: most batches finish within an hour, but the guarantee is 24, and a batch that does not complete in 24 hours expires. [Official] Batch processing (Message Batches API) · AnthropicT1-official original Results stay retrievable for 29 days after creation.

Key idea

Batch is not “a faster way to do many calls” — it is a cheaper, slower way, and the decision rule is purely latency tolerance. If a human or a synchronous system is blocked on the result, batch is wrong. If the work is an overnight job, a backfill, or an offline evaluation, batch halves the bill for free.

The custom_id contract

A batch is a set, not a sequence, and that has one non-negotiable consequence: “Batch results can be returned in any order, and may not match the ordering of requests when the batch was created. … To correctly match results with their corresponding requests, always use the custom_id field.” [Official] Batch processing (Message Batches API) · AnthropicT1-official original Every request carries a unique custom_id (1–64 characters, alphanumeric plus - and _), and that id is the only thread connecting an output back to the input that produced it.

Relying on result order, or reusing custom_ids

Two failure modes destroy a batch’s results even when every request succeeds. Relying on positional order — assuming result n answers request n — silently mismatches outputs to inputs, and nothing in the response will flag it. Reusing a custom_id across requests makes two results indistinguishable. Treat the custom_id as a primary key: unique, meaningful, and the only sanctioned way to join results to requests.

The batch envelope: what fits and what it can’t do

A batch is bounded by size and by shape. Size: “A Message Batch is limited to either 100,000 Message requests or 256 MB in size, whichever is reached first.” [Official] Batch processing (Message Batches API) · AnthropicT1-official original Exceed the payload and the create call returns HTTP 413 — break huge datasets into multiple batches. Shape: a batch supports all Messages API features including beta features, “however, streaming is not supported for batch requests,” [Official] Batch processing (Message Batches API) · AnthropicT1-official original and each request is single-shot — there is no follow-up turn inside a batch, so multi-turn tool round-trips do not work. Structured outputs (D4.3), by contrast, compose cleanly: a batched request can carry output_config.format and you get schema-valid JSON at 50% off. [Official] Structured outputs · AnthropicT1-official original

Concept ·
  • Inside — vision, tool use, system prompts, structured outputs, prompt caching, other beta features; a single user→assistant turn per request.
  • Outside — streaming (unsupported); multi-turn tool loops (the request is single-shot, no tool_result round-trip); max_tokens: 0 cache pre-warm (the ephemeral cache would expire before any follow-up).

Billing, result types, and the lifecycle

You pay only for what works: a result is succeeded, errored, canceled, or expired, and “you are not billed for errored, canceled, or expired requests.” [Official] Batch processing (Message Batches API) · AnthropicT1-official original For unusually long generations there is an opt-in: the output-300k-2026-03-24 beta header “raises the max_tokens cap to 300,000 for batch requests using Claude Opus 4.8, Claude Opus 4.7, Claude Opus 4.6, or Claude Sonnet 4.6” — batch-only, and a single 300k generation can itself take over an hour, so submit it with the 24-hour window in mind. [Official] Batch processing (Message Batches API) · AnthropicT1-official original

A succeeded result is not automatically a usable one

The four result types — succeeded, errored, canceled, expired — are batch-level outcomes: they tell you the request ran, not that its answer is good. A succeeded result “includes the message result,” [Official] Batch processing (Message Batches API) · AnthropicT1-official original and that message carries its own stop_reason. Two values still bite — a refusal (stop_reason: "refusal") returns a 200, is billed, and may not match your schema; a truncation (stop_reason: "max_tokens") is incomplete output. [Official] Structured outputs · AnthropicT1-official original Note the cost asymmetry: you are not billed for errored/canceled/expired, but a succeeded refusal you do pay for. So per-result handling must inspect each succeeded message’s stop_reason, not stop at the result type.

Concept ·
  1. Create — POST /v1/messages/batches with a list of {custom_id, params} objects; the response carries processing_status: "in_progress" and expires_at (24h out).
  2. Poll — GET /v1/messages/batches/{id} until processing_status == "ended".
  3. Stream — pull results from results_url as JSONL, one result per line (stream rather than download whole for large batches).
  4. Match — join each result to its request by custom_id.
  5. Clean up (optional) — DELETE the batch before the 29-day retention window closes.
The lifecycle, with the check most callers skip Worked example

Classifying 80,000 tickets overnight. Two layers of result-checking, not one:

# 1. Create -- each request a unique custom_id (the only join key).
batch = client.messages.batches.create(requests=[
    {"custom_id": f"ticket-{t.id}", "params": {...}} for t in tickets])

# 2. Poll until ended (most < 1h; SLA 24h, then expiry).
while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    sleep(60)

# 3. Stream JSONL -- order is NOT guaranteed.
for r in client.messages.batches.results(batch.id):
    if r.result.type != "succeeded":
        handle_failure(r.custom_id, r.result.type)        # errored/canceled/expired -- unbilled
        continue
    msg = r.result.message
    if msg.stop_reason in ("refusal", "max_tokens"):
        handle_unusable(r.custom_id, msg.stop_reason)      # succeeded but NOT usable -- and billed
        continue
    records[r.custom_id] = parse(msg)                       # 4. join by custom_id

The structure is the lesson. The first guard is the batch-level result type — succeeded versus the three unbilled failures. The second, the one most pipelines omit, is the message-level stop_reason inside a succeeded result: a refusal or a truncation reaches you as succeeded yet carries no answer you can use — and you paid for the refusal. Skip it and you silently ingest refused/truncated outputs as if they were classifications. And throughout, custom_id is the only correct join key, because the result stream is unordered.

Practice

Exercise

You must classify 80,000 archived support tickets overnight to populate an analytics dashboard read the next morning. Cost matters; latency does not. Which design fits best?

  • A. A real-time loop calling the Messages API once per ticket, parallelized for throughput.
  • B. One Message Batch of 80,000 requests, each with a unique custom_id, results matched by custom_id after processing_status reaches ended.
  • C. A streaming batch so the dashboard can update ticket-by-ticket as results arrive.
  • D. A single Messages API request containing all 80,000 tickets in one prompt.
Practice ◆◆◇◇

Explain why custom_id is mandatory rather than optional in a batch workflow, and describe the specific failure that occurs if a caller instead assumes results return in submission order.

Practice ◆◆◆◇

State the two limits that bound a single Message Batch and which one is reported by an HTTP 413 on creation. Then name one Messages API capability that does not work inside a batch and why.

Exercise solutions

Solution ↑ Exercise

B. The job is large, offline, and cost-sensitive with no one waiting — the exact profile batch is built for: 50% off, and 80,000 requests sits within the 100,000-request limit. Matching by custom_id after the batch ends is the required pattern because results return unordered. A works but forfeits the 50% discount and adds rate-limit and orchestration overhead for latency nobody needs. C is impossible — streaming is not supported for batch requests. D collapses 80,000 independent classifications into one prompt, which blows past context limits and produces a single entangled response with no per-ticket structure.

Solution ↑ Exercise

custom_id is mandatory because batch results “can be returned in any order” — a batch is a set, not a sequence, so there is no positional correspondence between the request list and the result stream to fall back on. The unique custom_id is the only thread joining an output back to the input that produced it. If a caller instead assumes submission order, the specific failure is a silent mis-join: result n is attributed to request n when it actually answers some other request, so records carry the wrong data and nothing in the response flags it. That is the most dangerous failure class — one that corrupts data without surfacing an error.

Solution ↑ Exercise

The two limits are 100,000 requests or 256 MB in size, whichever is reached first; the 256 MB payload limit is the one an HTTP 413 reports on creation (the fix is to split the dataset into multiple batches). A Messages API capability that does not work inside a batch: streaming (explicitly unsupported), or equally a multi-turn tool loop — each batched request is single-shot, with no tool_result round-trip, because a batch processes each request as one independent user→assistant turn with no follow-up.

Exam essentials

  • The trade — batch is async and 50% off (input and output, all tiers, stacks with caching); most finish under an hour, the SLA is 24 hours, then the batch expires; results retained 29 days. Choose it by latency tolerance alone.
  • custom_id contract — results return in any order; match by the unique custom_id (1–64 chars, alphanumeric + - + _). Never rely on positional order; never reuse an id.
  • Envelope — 100,000 requests or 256 MB per batch (HTTP 413 over payload); streaming unsupported; each request is single-shot (no multi-turn tool loop). Structured outputs compose (schema-valid at 50% off).
  • Billing + beta — billed only for succeeded; errored/canceled/expired are free; the output-300k-2026-03-24 beta raises max_tokens to 300,000 on batch for Opus 4.8/4.7/4.6 and Sonnet 4.6.
  • Succeeded ≠ usable — a succeeded result still carries a per-message stop_reason; a refusal ("refusal", 200, billed, may not match schema) or a truncation ("max_tokens", incomplete) reaches you as succeeded. Check each succeeded message’s stop_reason, not just the result type.
  • Lifecycle — POST /v1/messages/batches → poll until ended → stream results_url JSONL → match by custom_id → optional DELETE before the 29-day window.
Part 4 Chapter 6 Last verified 2026-06-02 Fresh

Multi-Pass Review: Independent Reviewers and Attention Dilution

A fresh context catches what a self-review cannot, because attention dilutes as the window fills and an implementer is biased toward its own code. The same independent-reviewer pattern scales from a two-session Writer/Reviewer pair to a fleet of specialists guarded by a verification pass.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D1.2 (coordinator-subagent patterns, isolated context) and D4.4 (verification as the catch above the schema).
You will learn
  • Analyze why a fresh-context reviewer outperforms a model reviewing its own work
  • Apply the Writer/Reviewer pattern and its single-session verification-subagent form
  • Evaluate how a fleet of specialists plus a verification pass answers attention dilution
  • Recognize the convergence rules that keep multi-pass review from spamming low-value findings
  • Recognize Code Review’s concrete surfaces — availability, the three trigger modes, the severity taxonomy, and the neutral check run

Part IV’s final chapter is about checking the work — at scale. The instinct to “ask the model to double-check itself” is exactly the wrong one, for a structural reason: a model reviewing in the same window that wrote the code is both dilated by a full context and biased toward what it just produced. The fix is independence, and it scales from two sessions to a fleet. The principle is stable; the product that embodies it is the illustration — so this is an architectural pattern.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. Why is asking a model to review its own code in the same session the weakest review — for which two independent reasons?
  2. Name the three scales of the independent-reviewer pattern, cheapest first.
  3. What does the verification pass do in a fleet review, and what fails without it?
  4. Name Code Review’s three severity tags, and which one you would block a merge on.
  5. Code Review’s check run never blocks a merge on its own — so how do you actually gate on its findings?
Check your answers
  1. Attention dilution (performance degrades as the context window fills) and implementer bias (a fresh context won’t be biased toward code it just wrote) — self-review is the most dilated and most biased reviewer you could pick.
  2. Verification subagent (one session, child in its own context window), Writer/Reviewer (two independent sessions), fleet (many parallel specialists, one issue class each).
  3. It re-checks each candidate finding against actual code behavior to filter out false positives; without it, parallel reviewers’ plausible-but-wrong findings accumulate — more reviewers means more noise.
  4. 🔴 Important / 🟡 Nit / 🟣 Pre-existing — block the merge on Important (a bug that should be fixed before merging).
  5. Read the severity breakdown from the check-run output in your own CI and fail the step yourself — exit non-zero when the 🔴 Important count is positive, so your own required check blocks the merge.

Why a fresh context beats self-review

Two independent forces make same-session self-review weak. The first is attention dilution: “LLM performance degrades as context fills. When the context window is getting full, Claude may start ‘forgetting’ earlier instructions or making more mistakes. The context window is the most important resource to manage.” [Official] Best practices for Claude Code · AnthropicT1-official original The second is implementer bias: “A fresh context improves code review since Claude won’t be biased toward code it just wrote.” [Official] Best practices for Claude Code · AnthropicT1-official original A reviewer that never watched the code get written carries neither the polluted context nor the sunk-cost instinct to defend it.

Key idea

Independence buys two things at once. A fresh window is at peak performance instead of operating in a context already crowded with the implementation; and a reviewer with no authorship has nothing to rationalize. Self-review forfeits both — it is the most dilated and the most biased reviewer you could pick.

The Writer/Reviewer pattern and its lightweight form

The canonical realization is two sessions. One writes; a second, with no inherited context, reviews; the first then addresses the feedback. The docs give a worked example: Session A implements a rate limiter, Session B reviews @src/middleware/rateLimiter.ts “looking for edge cases, race conditions, and consistency with existing middleware patterns,” and Session A applies the result. [Official] Best practices for Claude Code · AnthropicT1-official original The same shape works for tests: “have one Claude write tests, then another write code to pass them.” [Official] Best practices for Claude Code · AnthropicT1-official original When spinning up a second session is too heavy, the single-session analog is a verification subagent — “use a subagent to review this code for edge cases” — which runs in its own context window and so inherits none of the parent conversation’s assumptions. [Official] Best practices for Claude Code · AnthropicT1-official original

Concept ·
  • Verification subagent (one session) — a child in its own context window does a lightweight edge-case pass; cheapest, no separate session.
  • Writer/Reviewer (two sessions) — genuinely independent author and reviewer; the quality-critical default.
  • Fleet (many agents) — parallel specialists, each on one issue class, productized as a managed service.

The fleet: parallelism plus a verification pass

At the top of the scale, the pattern becomes a fleet. In Anthropic’s Code Review product, “when a review runs, multiple agents analyze the diff and surrounding code in parallel on Anthropic infrastructure. Each agent looks for a different class of issue, then a verification step checks candidates against actual code behavior to filter out false positives.” [Official] Code Review · AnthropicT1-official original The surviving findings “are deduplicated, ranked by severity, and posted as inline comments.” [Official] Code Review · AnthropicT1-official original Fanning out to specialists is the direct architectural answer to attention dilution: rather than ask one reviewer to hold every bug class in one window at once, each agent owns a single class — and the isolated-context, lead-plus-specialists shape is the same one Anthropic’s multi-agent research system uses. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original

A fleet without a verification pass amplifies false positives

Parallel reviewers have a characteristic failure mode: each independently flags plausible-looking but wrong issues, and with no filter those candidates accumulate — more reviewers means more noise, not just more signal. The verification step (re-checking each candidate against actual code behavior) is not optional polish; it is what makes fan-out a net gain. Add reviewers without a verifier and you have built a faster way to generate false positives.

Code Review in practice: availability, triggers, severity

The productized form of the pattern has a concrete surface the exam can probe. Availability is gated: “Code Review is in research preview, available for Team and Enterprise subscriptions. It is not available for organizations with Zero Data Retention enabled.” [Official] Code Review · AnthropicT1-official original

Triggers come in three modes per repo: once after PR creation, after every push, or manual — invoked by commenting @claude review (which then subscribes to subsequent pushes) or @claude review once (a single one-off pass). [Official] Code Review · AnthropicT1-official original The one-off is the lever for “review this PR now, but don’t enroll it in re-review on every push.”

Severity is a fixed three-tag taxonomy on every finding:

Concept ·
  • 🔴 Important — a bug that should be fixed before merging.
  • 🟡 Nit — a minor issue, worth fixing but not blocking.
  • 🟣 Pre-existing — a bug already in the codebase, not introduced by this PR.

And one subtlety that catches people: the check run “always completes with a neutral conclusion so it never blocks merging through branch protection rules.” [Official] Code Review · AnthropicT1-official original Code Review advises; it does not gate. To actually block a merge on findings, read the severity breakdown from the check-run output in your own CI and fail the step yourself. [Official] Code Review · AnthropicT1-official original

Convergence: keep multi-pass from spamming

More passes is not strictly better — multi-pass review needs convergence rules, and attention dilution applies to the instructions as much as the code. The docs are explicit that “a long REVIEW.md dilutes the rules that matter most,” [Official] Code Review · AnthropicT1-official original so the broadcast instruction block stays short. And re-review on every push needs a damping rule so trivial diffs don’t draw endless commentary: an instruction like “after the first review, suppress new nits and post Important findings only” stops “a one-line fix from reaching round seven on style alone.” [Official] Code Review · AnthropicT1-official original Production-quality fleet review is real work, not free — Code Review averages roughly $15–25 and 20 minutes per review at current figures [Official] Code Review · AnthropicT1-official original — so the convergence rules are also cost control.

Key idea

The same dilution that motivates independent reviewers also governs how you instruct them: a sprawling rule file and an un-damped re-review loop both bury the findings that matter under volume. Independence catches more; convergence rules keep “more” from becoming “noise.”

Making an advisory review actually block a merge Worked example

Your team wants a merge blocked when Code Review finds a real bug — but its check run is neutral by design and never blocks through branch protection. So you build the gate yourself, on top of the surfaces above:

  1. Trigger once. On a PR you want checked but not re-reviewed on every push, comment @claude review once. The fleet runs, the verification pass filters false positives, and findings post as inline comments tagged 🔴 Important / 🟡 Nit / 🟣 Pre-existing.
  2. Read the severity breakdown. Code Review writes a machine-readable severity count into the check-run output; a CI step parses it (e.g. gh + jq) to pull the 🔴 Important count.
  3. Fail on Important. Exit non-zero when it is positive, so your own required check (not Code Review’s) goes red and branch protection blocks the merge:
important=$(gh ... --jq '... | .important')
[ "$important" -gt 0 ] && { echo "Code Review: $important Important finding(s)"; exit 1; }

The reasoning ties three threads together. Code Review deliberately advises rather than gates, so a research-preview tool can never wedge a team’s merge queue. The 🔴/🟡/🟣 taxonomy is what makes a self-built gate precise — block on Important, let Nits and Pre-existing through. And the exit-code discipline from D3.6 is what converts an advisory signal into an enforced one. Note also the cost lever: @claude review once runs a single pass, where “after every push” multiplies the ~$15–25 per-review cost by your push count.

Practice

Exercise

Claude has just implemented a subtle concurrent cache in one long session, and you want the strongest possible bug-catch before merge. Which approach is most likely to surface a race condition?

  • A. Ask the same session, in the next turn, to carefully review its own implementation for bugs.
  • B. Open a fresh session (or dispatch a verification subagent) that reviews the cache with no context from how it was written.
  • C. Re-run the same prompt at a higher temperature to get a different implementation to compare.
  • D. Keep reviewing in the same session but paste the code back in so it is fresh in context.
Practice ◆◆◇◇

Explain, in terms of attention dilution, why fanning a review out to several specialist agents (each looking for one issue class) tends to outperform a single agent asked to find every class of bug in one pass.

Practice ◆◆◆◇

A team runs five parallel reviewer agents but no verification step, and finds the review output noisy and untrustworthy. Name the specific multi-agent failure mode at work and what the verification pass does to address it.

Exercise solutions

Solution ↑ Exercise

B. A fresh session or a verification subagent reviews with no inherited context, which removes both weaknesses of self-review at once — the reviewer is at peak performance in a clean window and has no authorship bias toward the code, exactly the conditions the docs credit for catching what self-review misses. A is self-review: most dilated (the implementation already fills the context) and most biased (the model defends what it wrote). C generates a second implementation, not a review, and gives you two artifacts to reconcile rather than a found bug. D re-pastes code into an already-polluted context; freshening the code does nothing about the accumulated context or the authorship bias — only a fresh, independent context fixes those.

Solution ↑ Exercise

Attention dilution says performance degrades as the context window fills — so a single agent asked to find every class of bug must hold the diff, the surrounding code, and a long checklist of bug categories in one window, and its attention to any one class thins as the others crowd in. Fanning out gives each specialist its own fresh context with a single mandate — race conditions, or injection, or edge cases — so none of them is operating dilated, and each brings full attention to its one class. The fleet trades one over-loaded reviewer for many focused ones; that is the direct architectural answer to dilution (paired, crucially, with a verification pass so the extra candidates don’t become noise).

Solution ↑ Exercise

The failure mode is false-positive amplification: parallel reviewers each independently flag plausible-but-wrong issues, and with no filter those candidates accumulate, so adding reviewers adds noise, not just signal — five agents surface five streams of unverified guesses. The verification pass re-checks each candidate against actual code behavior to filter out false positives before anything is posted, so only findings that survive a behavioral check reach the human. It is what makes fan-out a net gain rather than a faster false-positive generator; the surviving findings are then deduplicated and ranked by severity.

Exam essentials

  • Fresh context beats self-review for two independent reasons: attention dilution (performance degrades as the window fills) and implementer bias (a model defends code it just wrote). Independence removes both.
  • Writer/Reviewer — one session writes, a second independent session reviews, the writer addresses feedback; the test/code split is a variant; the verification subagent is the single-session, isolated-context form.
  • Fleet + verification pass — parallel specialists each own one issue class (the answer to attention dilution); a verification step filters false positives, then dedupe + severity ranking. A fleet without verification amplifies false positives.
  • Convergence rules — keep the broadcast instruction block short (“a long REVIEW.md dilutes the rules that matter most”) and damp re-review (“suppress new nits, post Important findings only”) so trivial diffs don’t draw endless passes; this is also cost control.
  • Code Review surfaces — research preview, Team & Enterprise only, not under Zero Data Retention; three trigger modes (once after PR creation / after every push / manual via @claude review or one-off @claude review once); severity taxonomy 🔴 Important / 🟡 Nit / 🟣 Pre-existing; the check run is neutral and never blocks a merge — gate by reading the severity breakdown in your own CI.

Part 4 · D4 Review

6 exercises across 6 chapters — interleaved review.

d4-01-explicit-criteria

  1. d4-01-ex-vague-vs-explicit A nightly job asks Claude to "summarize each support ticket." The summaries come back wildly inconsistent — some one line, some three paragraphs, some with a sentiment label and some without — and a downstream parser keeps breaking. What is the most direct fix? - **A.** Lower the sampling temperature so the model decodes more deterministically and repeats one shape. - **B.** Add "be consistent and don't write too much" so the prompt tells it to self-regulate length. - **C.** Specify the exact output contract — a fixed set of typed, length-bounded fields the parser can rely on. - **D.** Switch to a larger model that infers the intended format more reliably from the same prompt.

d4-02-few-shot-prompting

  1. d4-02-ex-ambiguous-edge You are extracting `{invoice_id, amount, due_date}` from emails. Most are clean, but some have no due date at all, and the pipeline keeps emitting `"due_date": "unknown"` for those — which the downstream date parser rejects. You want missing dates to come back as `null`. What is the most reliable fix? - **A.** Add a sentence to the instruction: "if there is no due date, use null, not 'unknown'." - **B.** Include 3-5 examples, one of which is an email with no due date whose output shows `"due_date": null`. - **C.** Raise the example count to 10 clean invoices so the model has more data to generalize from. - **D.** Post-process every `"unknown"` string into `null` after the model returns.

d4-03-structured-output-tool-use

  1. d4-03-ex-pick-the-primitive You extract a fixed record — `{customer, plan, seats}` with `seats` an integer — and a downstream billing function will crash on a string where it expects a number. You want exactly this one extraction, type-guaranteed, on the native Claude API. Which approach fits best? - **A.** Force a single `print_record` tool with `tool_choice: {type: "tool", name: "print_record"}` and set `strict: true` on it. - **B.** Use the classic cookbook pattern with `additionalProperties: true` so the model can include whatever it finds. - **C.** Call through the OpenAI SDK compatibility layer with `strict: true` for portability. - **D.** Ask in the prompt for "valid JSON with an integer seats field" and parse the response text.

d4-04-validation-retry-feedback

  1. d4-04-ex-semantic-catch An invoice extractor uses structured outputs and never returns malformed JSON, but once in a while it reports a `total` that doesn't match the line items — and those slip through to billing. You want to catch them automatically. What is the most effective design? - **A.** Add `strict: true` to the extraction tool so the totals are type-guaranteed. - **B.** Add both `stated_total` and `calculated_total` to the schema and have application code compare them, routing any mismatch to human review. - **C.** Raise `max_tokens` so the model has room to compute the total more carefully. - **D.** Tighten the JSON schema with a `minimum` constraint on `total`.

d4-05-batch-processing

  1. d4-05-ex-batch-or-realtime You must classify 80,000 archived support tickets overnight to populate an analytics dashboard read the next morning. Cost matters; latency does not. Which design fits best? - **A.** A real-time loop calling the Messages API once per ticket, parallelized for throughput. - **B.** One Message Batch of 80,000 requests, each with a unique `custom_id`, results matched by `custom_id` after `processing_status` reaches `ended`. - **C.** A streaming batch so the dashboard can update ticket-by-ticket as results arrive. - **D.** A single Messages API request containing all 80,000 tickets in one prompt.

d4-06-multi-pass-review

  1. d4-06-ex-independent-review Claude has just implemented a subtle concurrent cache in one long session, and you want the strongest possible bug-catch before merge. Which approach is most likely to surface a race condition? - **A.** Ask the same session, in the next turn, to carefully review its own implementation for bugs. - **B.** Open a fresh session (or dispatch a verification subagent) that reviews the cache with no context from how it was written. - **C.** Re-run the same prompt at a higher temperature to get a different implementation to compare. - **D.** Keep reviewing in the same session but paste the code back in so it is fresh in context.
Part 5 Chapter 1 Last verified 2026-06-08 Fresh

Long-Conversation Context: Accumulation, Degradation, Compaction

Context is a finite, accumulating resource, and a long conversation degrades before it overflows. This chapter frames the exam angle — cumulative context, the lost-in-the-middle and summarization failure modes, and lossy automatic compaction — and points to the design book where the degradation mechanics are proven in depth.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D1.1 (the agent loop and what accumulates per turn). Part V opens here; this is the context-management domain.
You will learn
  • Recognize context as a finite, cumulative resource and what consumes it
  • Distinguish the long-context failure modes — lost-in-the-middle and summarization loss — from simple overflow
  • Explain what automatic compaction does and the information it can drop
  • Apply the rule that durable instructions belong in re-injected context, not the opening prompt

Part V is the reliability domain — context management, escalation, error propagation, provenance. It opens with the most basic constraint behind all of them: the context window is finite, and a long conversation gets worse before it gets full. This chapter is the cert-exam angle; the mechanics of why long context degrades are proven in depth in the design book, to which it points. It is an architectural pattern — the accumulation-and-compaction shape is stable, while the window sizes and message types are the moving surface.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. List what shares the cumulative context budget within a session.
  2. “Fits in the window” and “well-attended” — why are these different claims?
  3. What does automatic compaction do, and which part of the conversation does it tend to drop first?
  4. A rule given only in the opening prompt stops being followed right after a compaction summary. Where should it have lived, and why?
  5. Given a misbehaving long session, which three failure modes should you discriminate among before reaching for a fix?
Check your answers
  1. The system prompt, tool definitions, CLAUDE.md, conversation history, and tool inputs/outputs all accumulate in one finite budget that never refills within a session.
  2. The token limit is a capacity bound while attention is a quality that declines as the window fills — a conversation comfortably under the limit can still have lost the thread of an instruction given fifty turns ago (lost-in-the-middle).
  3. Near the context limit it replaces older messages with a summary — it summarizes the oldest turns first, so specific instructions from early in the conversation are most at risk.
  4. In CLAUDE.md, because its content is re-injected on every request and so survives compaction, unlike an opening-prompt rule whose survival depends on a summarizer’s discretion.
  5. Accumulation pressure, lost-in-the-middle, and post-compaction loss — three distinct conditions with three distinct remedies.

Context is a finite, accumulating resource

Everything in a session shares one budget. “Context window is cumulative within a session. System prompt, tool definitions, CLAUDE.md, conversation history, tool inputs/outputs all accumulate.” [Official] How the agent loop works · AnthropicT1-official original And the budget is concrete: current windows are 1M tokens on Opus 4.8 and Sonnet 4.6 and 200k on Haiku 4.5 — though Opus 4.8’s tokenizer can consume up to 35% more tokens for the same text, so the same conversation costs more of the budget on one model than another. [Official] Models overview · AnthropicT1-official original

Key idea

Every turn spends from a fixed pool that never refills within a session. A conversation that “still fits” is not free — it has spent budget that shapes how well the model attends to everything still in the window. Long context is a resource to manage, not a capacity to fill.

Degradation comes before overflow

The failure that matters is not hitting the limit — it is the quiet decline well before it. As a window fills, a model attends less reliably to material buried in the middle of a long context (the “lost-in-the-middle” effect), and any progressive summarization of earlier turns discards detail that may later turn out to matter. These degradation mechanisms — context rot, lost-in-the-middle, summarization loss — are the subject of the Agentic Systems Design book’s chapter on context rot, where they are established against the research; this chapter’s job is to make you recognize them on the exam.

Key idea

“Fits in the window” and “well-attended” are different claims. A conversation can be comfortably under the token limit and still have lost the thread of an instruction given fifty turns ago. Treat the onset of degradation, not the overflow error, as the thing to design against.

Compaction: the automatic defense, and its cost

When a session approaches the limit, the loop defends itself: “Automatic compaction triggers near the context limit.” [Official] How the agent loop works · AnthropicT1-official original The defense is lossy by construction: “Compaction replaces older messages with a summary, so specific instructions from early in the conversation may not be preserved. Persistent rules belong in CLAUDE.md (loaded via settingSources) rather than in the initial prompt, because CLAUDE.md content is re-injected on every request.” [Official] How the agent loop works · AnthropicT1-official original Compaction buys room by trading away fidelity to the early conversation — exactly the region most at risk from lost-in-the-middle in the first place.

A durable instruction stranded in the opening prompt

An instruction given only in the first message is the most likely thing compaction discards — it summarizes the oldest turns first, and a one-line rule from turn one rarely survives the summary intact. If a constraint must hold for the whole session, it belongs where it is re-injected on every request (CLAUDE.md), not in the opening prompt where its survival depends on a summarizer’s discretion.

Where the depth lives

This chapter is the exam-angle surface; the design book owns the mechanism. The degradation research, the measurement of context rot, and the assembly strategies that fight it live in the Agentic Systems Design book — its chapter on context rot for the failure modes and its chapter on context assembly for the deliberate construction of what goes in the window. The exam-relevant skill is diagnostic: given a long-session scenario, name whether it is accumulation pressure, lost-in-the-middle, or post-compaction loss, and reach for the matching mitigation.

[Note]

Cross-book links are provisional — they point at the chapter sources until the design book deploys, then repoint to published URLs.

Diagnosing three long-session failures Worked example

The exam-relevant skill is to name the failure mode before reaching for a fix. Three scenarios, three different diagnoses:

  • Scenario A — the rule dropped after a summary. A constraint you gave at turn 1 stops being honored around turn 40, just as a compaction summary appears. Diagnosis: post-compaction loss — the oldest turns were summarized and the early rule didn’t survive intact. Fix: move it to CLAUDE.md, which is re-injected every request.
  • Scenario B — a buried fact misremembered, no compaction. The session is comfortably under the token limit and never compacted, yet Claude misremembers a detail established ~60 turns ago in a long stretch of context. Diagnosis: lost-in-the-middle — material buried in the middle of a long window is attended to least reliably. Fix: re-surface the fact near the end of the context (restate it), or assemble the working context deliberately rather than letting it sprawl.
  • Scenario C — the session simply fills up. Turns keep growing until the window is near full and the loop compacts (or you hit the limit). Diagnosis: accumulation pressure — nothing is degraded yet; the budget is just spent. Fix: reduce what accumulates (tighter tool outputs, /clear and restart with a focused prompt) rather than treating it as a quality bug.

The discipline: “fits,” “buried,” and “summarized away” are three distinct conditions with three distinct remedies. Reaching for a bigger window (Scenario C’s instinct) does nothing for A or B — a larger context still loses its middle and still compacts eventually. Name the mode first.

Practice

Exercise

Early in a long session you tell Claude “always cite a file:line for any behavioral claim.” Forty turns later it stops doing so, right around when a compaction summary appeared. What is the most reliable fix?

  • A. Re-paste the instruction manually every few turns to keep it fresh in context.
  • B. Move the instruction into CLAUDE.md, which is re-injected on every request and so survives compaction.
  • C. Switch to a model with a larger context window so compaction never triggers.
  • D. Raise max_tokens so the model has more room to comply.
Practice ◆◆◇◇

List what accumulates in the context window over a session, and explain why “the conversation still fits” is not the same as “the model is attending to all of it well.”

Practice ◆◆◆◇

Explain what automatic compaction does when a session nears the context limit, why an instruction placed only in the opening prompt is at risk, and where a durable rule belongs instead.

Exercise solutions

Solution ↑ Exercise

B. CLAUDE.md content is re-injected on every request, so a rule placed there is present in the context after compaction just as before it — exactly what a session-long constraint needs. The timing (failure right after a compaction summary) is the tell that the original instruction was summarized away. A works for a few turns but fights the symptom by hand and will fail again at the next compaction. C delays the limit but does not address degradation — a larger window still loses the middle, and a long enough session compacts anyway. D governs output length, not whether an early instruction is retained, so it is unrelated to the failure.

Solution ↑ Exercise

What accumulates: the system prompt, tool definitions, CLAUDE.md, the full conversation history (every user and assistant turn), and all tool inputs and outputs — everything shares one cumulative budget within the session. “Still fits” is not “well-attended” because the token limit is a capacity bound while attention is a quality that declines as the window fills: a model attends less reliably to material buried in the middle of a long context, so a conversation comfortably under the limit can still have effectively lost an instruction given fifty turns ago. Fitting is necessary but not sufficient for the model to be using all of it well.

Solution ↑ Exercise

When a session nears the context limit, automatic compaction triggers and replaces older messages with a summary to buy room. An instruction placed only in the opening prompt is at risk because compaction summarizes the oldest turns first, and a one-line rule from turn one rarely survives the summary intact — so the constraint silently stops being honored (the failure often shows up right after a summary appears). A durable rule belongs where it is re-injected on every request: CLAUDE.md (loaded via settingSources), whose content is re-added to context each request and so is present after compaction exactly as before it — its survival no longer depends on a summarizer’s discretion.

Exam essentials

  • Cumulative budget — system prompt, tool defs, CLAUDE.md, conversation history, and tool I/O all accumulate in one finite window (1M tokens on Opus 4.8 / Sonnet 4.6, 200k on Haiku 4.5; tokenizer density varies by model).
  • Degradation precedes overflow — lost-in-the-middle and summarization loss erode a long context before it hits the limit; “fits” is not “well-attended.” Depth lives in the design book’s context-rot chapter.
  • Compaction is lossy — it triggers near the limit and replaces older messages with a summary, so early-conversation specifics may not be preserved.
  • Durable rules belong in re-injected context — put session-long constraints in CLAUDE.md (re-injected every request), not the opening prompt, so compaction cannot strand them.
Part 5 Chapter 2 Last verified 2026-06-02 Fresh

Escalation and Ambiguity Resolution

When an agent is uncertain or blocked, the reliable architecture makes it surface the decision rather than guess at intent. AskUserQuestion is the structured mechanism, the interview pattern is its proactive form, and the check-in is a control point where a human resolves what the model cannot.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D3.4-D3.5 (plan mode and the interview pattern) and D1.3 (subagents and their isolation).
You will learn
  • Apply the escalate-don’t-guess principle when a task is ambiguous or blocked
  • Recognize the AskUserQuestion structured-clarification mechanism and its limits
  • Distinguish AskUserQuestion (the tool Claude calls) from canUseTool (the callback your app implements), and the six response patterns
  • Distinguish proactive escalation (the interview pattern) from reactive check-ins
  • Analyze why a subagent cannot escalate and how to design around it

Reliability is not only about catching errors after the fact; it is about not committing to a wrong interpretation in the first place. When intent is ambiguous, a well-built agent asks rather than assumes. This chapter is the exam angle on that discipline — the mechanism and its limits — and points to the handbook for the hands-on use-side workflow. The principle is durable; the tool that carries it is the illustration, so this is an architectural pattern.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. When a task is genuinely ambiguous, what is the reliable move — and what is the cost of guessing instead?
  2. What are the structural limits of one AskUserQuestion call (questions per call, options per question, header)?
  3. AskUserQuestion is the tool Claude calls; what is canUseTool, and which two triggers fire it?
  4. Beyond a plain approve/deny, name two of canUseTool’s richer response patterns.
  5. Why can a subagent not escalate with AskUserQuestion, and how do you design the work around that?
Check your answers
  1. Escalate — surface the decision rather than silently pick an interpretation; asking costs one round trip, guessing wrong costs the whole task built on the wrong branch.
  2. One call carries 1–4 questions, each with 2–4 options (label + description) and a header of ≤12 characters.
  3. canUseTool is the callback your application implements; it fires when Claude wants to use a tool (a permission check) and when Claude calls AskUserQuestion (a clarification).
  4. Any two of: approve-with-changes (updatedInput), approve-and-remember (PermissionUpdate), suggest-alternative (deny with guidance), or redirect-entirely (inject a new instruction via streaming input).
  5. Subagents cannot call AskUserQuestion, so they must guess or fail; have the coordinator resolve the open questions first, then hand the subagent a fully-specified task.

Escalate, don’t guess

The foundational move is to surface a decision the model cannot make on its own. “While working on a task, Claude sometimes needs to check in with users. It might need permission before deleting files, or need to ask which database to use for a new project. Your application needs to surface these requests to users so Claude can continue with their input.” [Official] Handle approvals and user input · AnthropicT1-official original An architect’s job is to make those check-ins possible and routine — to design the agent so that hitting an ambiguity raises a question instead of silently resolving it.

Key idea

A silently-resolved ambiguity is a coin flip on intent that nobody chose to take. Escalation converts that flip into a decision made by the one party who actually knows the answer. The cost of asking is one round trip; the cost of guessing wrong is the whole task built on the wrong branch.

AskUserQuestion: structured clarification

The mechanism is a built-in tool with a deliberately bounded shape. Each AskUserQuestion call carries 1–4 questions, each question a header (≤12 characters) and 2–4 options with a label and a description; the response maps each question to the chosen label, and free-text is handled by offering an “Other” choice and passing the typed text rather than the literal "Other". [Official] Handle approvals and user input · AnthropicT1-official original The structure is the point: bounded multiple-choice makes the human’s answer fast to give and unambiguous to route back into the agent’s flow.

Concept ·
  • Per call — 1 to 4 questions.
  • Per question — a question, a short header (≤12 chars), 2 to 4 options, and a multiSelect flag.
  • Per option — a label and a description (optionally a preview for a visual mockup).
  • Answer — a map from each question to its chosen label (an array or comma-joined string when multiSelect); free-text rides in on an “Other” option.
A subagent that hits ambiguity cannot ask

Subagents cannot call AskUserQuestion. A subagent that runs into an ambiguous requirement has no way to escalate — it must guess or fail, the exact failure escalation exists to prevent. Resolve the open questions before delegating: have the coordinator clarify intent, then hand the subagent a fully-specified task. Design the decomposition so the part that needs a human stays with the agent that can reach one.

The application side: the canUseTool callback

AskUserQuestion is the tool Claude calls; canUseTool is the callback your application implements to receive these interruptions and answer them. It fires on two triggers: when Claude wants to use a tool (a permission check) and when Claude calls AskUserQuestion (a clarification). [Official] Handle approvals and user input · AnthropicT1-official original So one callback is the single surface through which your app both gates tools and answers questions; it returns PermissionResultAllow / PermissionResultDeny (Python) or { behavior: "allow" | "deny" } (TS). [Official] Handle approvals and user input · AnthropicT1-official original

And its response is far richer than yes/no — the docs document six patterns:

Concept ·
  1. Approve — pass the tool input through unchanged.
  2. Approve with changes — modify updatedInput; Claude sees the result but is not told you changed it.
  3. Approve and remember — echo a PermissionUpdate (e.g. destination: "localSettings") so future identical calls skip the prompt.
  4. Reject — return behavior: "deny" with a message.
  5. Suggest alternative — deny with guidance in the message; Claude reads it and adjusts its next step.
  6. Redirect entirely — inject a wholly new instruction via streaming input.
Key idea

AskUserQuestion and canUseTool are two ends of one wire. Claude raises an interruption — a tool it wants to run, or a question it needs answered — and your canUseTool callback resolves it. The six patterns mean the human’s role is not a binary gate but a steering wheel: approve, edit the inputs, remember the decision, refuse, counter-propose, or redirect entirely.

Two interactions are worth memorizing. The callback is skipped in dontAsk mode — anything not pre-approved is denied without ever calling it (it is the last step of the permission chain, and dontAsk short-circuits before reaching it). [Official] Handle approvals and user input · AnthropicT1-official original And for a human who is not watching the terminal, the PermissionRequest hook can fire an external notification (Slack, email, push) when Claude is waiting on approval. [Official] Handle approvals and user input · AnthropicT1-official original

The interview pattern: escalate before you start

Escalation is strongest when it is proactive — ask the questions before any work depends on the answers. That is the interview pattern from D3.5: Claude interviews you with AskUserQuestion, writes a spec, and a fresh session implements from it. [Official] Best practices for Claude Code · AnthropicT1-official original The natural home for this is plan mode, since “clarifying questions are especially common in plan mode, where Claude explores the codebase and asks questions before proposing a plan.” [Official] Handle approvals and user input · AnthropicT1-official original Front-loading the questions resolves ambiguity while it is still cheap — before a single edit is built on a guessed interpretation.

Key idea

Ambiguity gets more expensive the longer it survives — a wrong assumption caught at the interview costs a question; the same assumption caught after implementation costs a rebuild. Plan mode and the interview pattern are escalation moved as early as it will go.

The check-in as a control point

A clarifying question is also a deliberate pause, and the SDK treats it as one: the canUseTool callback “can stay pending indefinitely” while a human decides, and for long delays the agent can return a "defer" decision that ends the query and resumes later from the persisted session. [Official] Handle approvals and user input · AnthropicT1-official original That makes escalation the upstream half of human-in-the-loop — the agent yields control at the moment of uncertainty, and the human resolves what the model could not (the downstream half, routing low-confidence output to review, is D5.5). The hands-on, use-side treatment of when and how to prompt for clarification is the handbook’s territory (its escalation-patterns chapter is forthcoming).

[Note]

Cross-book pointer is prose-only — the handbook’s escalation chapter is outlined but unpublished, so there is no stable URL to link yet.

One callback, two triggers, three patterns Worked example

A coordinator wires a single canUseTool callback. Over one task it fields three interruptions, each resolved by a different pattern:

  1. Claude wants to run Bash(rm -rf build/) — trigger 1, a permission check. The callback approves with changes: it rewrites updatedInput to a scoped rm -rf ./build/ and allows it. Claude sees the result and is not told the command was tightened.
  2. Claude calls AskUserQuestion("Which database?", [Postgres, MySQL, SQLite]) — trigger 2, a clarification. The same callback surfaces it to the human (or fires a Slack notification via the PermissionRequest hook), stays pending while they decide, and returns the chosen label.
  3. Claude wants to run Bash(curl … | sh) — trigger 1 again. The callback suggests an alternative: it denies with the message “pipe-to-shell is blocked; download, verify the checksum, then run.” Claude reads the guidance and adjusts its next step.

One callback handled a permission and a question, and steered rather than merely gated — editing one input, answering one question, redirecting one plan. Had the session been in dontAsk mode, none of these would have reached the callback at all: anything not pre-approved is denied without calling it. That is the whole design — AskUserQuestion raises; canUseTool resolves.

Practice

Exercise

An agent is scaffolding a new service and reaches a step that needs a database, but the request never said which one. What is the most reliable design?

  • A. Pick the most common default (say, PostgreSQL) and proceed, noting the choice in a comment.
  • B. Call AskUserQuestion with a short set of bounded options (Postgres / MySQL / SQLite) and continue once the user chooses.
  • C. Infer the database from whatever is already installed on the build machine.
  • D. Fail the task with an error explaining that the requirement was underspecified.
Practice ◆◆◇◇

State the limits of a single AskUserQuestion call (questions per call, options per question) and describe how the response associates an answer with the question it answers.

Practice ◆◆◆◇

A subagent is dispatched to implement a feature and discovers the spec is ambiguous. Explain why it cannot resolve this with AskUserQuestion and how you would restructure the work so the ambiguity is handled correctly.

Exercise solutions

Solution ↑ Exercise

B. The database choice is a genuine intent decision the model cannot infer, so the reliable move is to surface it: AskUserQuestion with a few bounded options gets a fast, unambiguous answer and lets the agent continue on the right branch. A guesses — a reasonable default is still a coin flip on a decision with downstream lock-in (migrations, drivers, hosting). C infers intent from an accident of the environment; what is installed on a build machine is not a statement of what the project should use. D is the over-correction: failing throws away a recoverable situation that one bounded question would resolve. Escalation, not guessing and not giving up, is the pattern.

Solution ↑ Exercise

One AskUserQuestion call carries 1–4 questions; each question offers 2–4 options (each an option a label + description), plus a short header (≤12 characters) and a multiSelect flag. The response associates an answer with its question by mapping the question to the chosen option’s label — { "answers": { "<question text>": "<label>" } } — so the agent routes each selection back to the specific question it answers (a multiSelect answer returns as an array or a comma-joined string). Free-text is handled by offering an “Other” option and passing the typed text rather than the literal "Other".

Solution ↑ Exercise

A subagent cannot call AskUserQuestion — it is an explicit SDK limitation, and a subagent runs in an isolated context with no channel back to the user. So a subagent that discovers an ambiguous spec has no way to escalate: it must guess or fail, the precise outcome escalation exists to prevent. The fix is to restructure the decomposition so the part that needs a human stays with the agent that can reach one: have the coordinator resolve the open questions first (via AskUserQuestion, ideally during a plan-mode interview), then hand the subagent a fully-specified, unambiguous task. Delegate only after intent is pinned — never push an unresolved decision down to an agent that cannot ask about it.

Exam essentials

  • Escalate, don’t guess — design the agent to surface an ambiguous or blocked decision rather than silently pick an interpretation; the cost of asking is one round trip, the cost of guessing wrong is the whole task.
  • AskUserQuestion — 1–4 questions per call, 2–4 bounded options each (label + description, short header); the answer maps question to chosen label; free-text via an “Other” option. The bounded shape makes answers fast and routable.
  • canUseTool — the app-side callback — fires on two triggers (Claude wants a tool / Claude calls AskUserQuestion); six response patterns: approve, approve-with-changes (updatedInput), approve-and-remember (PermissionUpdate), reject, suggest-alternative, redirect-entirely. Skipped in dontAsk mode; the PermissionRequest hook sends external notifications while waiting.
  • Proactive beats reactive — the interview pattern and plan mode front-load clarifying questions, resolving ambiguity while it is still cheap, before work is built on a guess.
  • Subagents cannot escalate — AskUserQuestion is unavailable in subagents, so resolve open questions in the coordinator before delegating a fully-specified task.
  • Check-in as control point — a clarifying question is a deliberate pause; the callback can stay pending or defer-and-resume, making escalation the upstream half of human-in-the-loop (D5.5 is the downstream half).
Part 5 Chapter 3 Last verified 2026-06-02 Fresh

Error Propagation Across Multi-Agent Systems

In a chain of agents an error does not stay local — an upstream ambiguity becomes a downstream wrong decision, and concurrent faults compound into degradation no single component test reproduces. The defenses are structured error context across boundaries, independent validation, and circuit breakers.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D1.2 (coordinator-subagent patterns), D2.2 (structured error responses), and D5.2 (escalation, which a mid-pipeline agent often cannot do).
You will learn
  • Analyze how an error or ambiguity propagates and compounds across agent handoffs
  • Explain why a mid-pipeline agent turns an upstream ambiguity into a silent wrong decision
  • Recognize why a compounding cross-system failure evades per-component testing
  • Apply the defenses — structured error context, independent validation, circuit breakers

A single agent fails visibly; a pipeline of agents fails by degrees. An error introduced at one stage rarely announces itself — it rides the handoff to the next stage as if it were sound input, and concurrent faults aggregate into a degradation that looks like nothing in particular. This chapter is about that propagation and the architecture that contains it. The pattern is durable; the percentages that illustrate it are evidence, so this is an architectural pattern.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. Why is a pipeline of agents less reliable than its single most reliable agent?
  2. Why does a mid-pipeline agent turn an upstream ambiguity into a silent wrong decision, where an interactive one would not?
  3. The MAST taxonomy sorts multi-agent failures into which three categories — and which is the largest?
  4. Overlapping production bugs degraded quality for weeks, yet every component eval stayed green. Why couldn’t the unit tests see it?
  5. Name two boundary defenses that contain error propagation across agents.
Check your answers
  1. Because a chain’s reliability is the product of its handoffs, not its best agent’s — each boundary is both a place an error can enter and a place an existing error passes through unexamined.
  2. A mid-pipeline agent has no one to ask, so where an interactive agent would pause and clarify, it resolves the ambiguity itself and hands the guess downstream as settled fact.
  3. Specification problems (41.77%) — the largest — then coordination failures (36.94%) and verification gaps (21.30%).
  4. The degradation lived between components, not inside any one — each part passed its own eval, and the combined effect appeared only in the interaction, on traffic slices no single test exercises.
  5. Any two of: structured error context across boundaries, independent validation by an isolated judge, and circuit breakers that isolate a misbehaving agent before it cascades.

Failures compound across agent boundaries

Multi-agent systems fail far more often than their individual agents do. One practitioner analysis, drawing on the MAST taxonomy of 1,600-plus execution traces, reports that “multi-agent LLM systems fail at rates between 41-86.7% in production because specification ambiguity and unstructured coordination protocols cause agents to misinterpret roles, duplicate work, and skip verification.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original That taxonomy sorts the breakdowns into three categories covering most of them: specification problems (41.77%), coordination failures (36.94%), and verification gaps (21.30%). [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original

Key idea

A chain’s reliability is not its weakest agent’s reliability — it is the product of the handoffs. Each boundary between agents is both a place an error can enter and a place an existing error passes through unexamined. Adding agents multiplies the surfaces where intent can be dropped, so a pipeline is less reliable than any single stage unless the boundaries are actively defended.

An upstream ambiguity becomes a downstream wrong decision

The propagation mechanism is specific. “Agents cannot read between lines, infer context, or ask clarifying questions during execution. Every ambiguity becomes a decision point where agents explore all possible interpretations, selecting suboptimal ones.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original This is the multi-agent counterpart to D5.2’s escalation problem: an interactive agent can pause and ask, but a mid-pipeline agent usually has no one to ask, so it resolves the ambiguity — and a quietly-wrong resolution is handed downstream as if it were settled fact. The next agent has no signal that the input it received was a guess.

Role-based decomposition with under-specified handoffs

The combination that manufactures silent errors is a role-split pipeline (“planner → coder → reviewer”) whose handoffs are prose rather than contract. Each role re-interprets the loosely-worded output of the last, and because no agent can stop to confirm, every interpretation gap becomes a committed decision. The fix is at the boundary, not the agent: specify what crosses it precisely enough that there is nothing left to guess.

Compounding failures evade per-component testing

When multiple faults run at once, their aggregate is not the sum of their symptoms — and that is what makes them hard to catch. Anthropic’s April-23 postmortem describes three production bugs with distinct, partly-overlapping windows: a reasoning-effort default change (Mar 4–Apr 7), a caching bug that broke thinking blocks (Mar 26–Apr 10), and a verbosity-reduction prompt (Apr 16–Apr 20). [Official] An update on recent Claude Code quality reports · AnthropicT1-official original Those windows union to a roughly seven-week span (Mar 4–Apr 20) — but that is the aggregate reach, not a stretch in which all three ran at once: the first two overlapped, while the third began only after both had been fixed. Even so, the combined effect “looked like broad, inconsistent degradation” that no single bug’s symptom resembled, and the most stubborn one — the caching bug — crossed context management, the API layer, and the extended-thinking system. [Official] An update on recent Claude Code quality reports · AnthropicT1-official original Detection was the central lesson: the bugs hit different traffic slices, and neither internal usage nor the existing eval suite reproduced them. [Official] An update on recent Claude Code quality reports · AnthropicT1-official original

Key idea

A compounding failure is invisible to component tests precisely because it lives between components. Each part passes its own eval; the degradation only appears in the interaction, on traffic slices no single test exercises. This is why “every unit test is green” is not evidence that a multi-stage system is healthy — the failure mode is the integration, not the unit.

Why three overlapping bugs passed every unit test Worked example

The April-23 incident is the canonical compounding failure. Three independent bugs, three windows:

BugWindowLayer(s) it touched
Reasoning-effort default changeMar 4 – Apr 7inference defaults
Caching bug (broke thinking blocks)Mar 26 – Apr 10context mgmt × API × extended thinking
Verbosity-reduction promptApr 16 – Apr 20system prompt

Read the dates carefully: the first two overlapped (Mar 26 – Apr 7), but the third started after both were already fixed. So “seven weeks” is the union of the windows (Mar 4 – Apr 20), not a span of three-way concurrency — a distinction worth getting right, because it changes what a responder is actually hunting for (one persistent fault versus a shifting set).

Now the reason it hid: each bug, tested in isolation, passes. The reasoning-effort change is a valid config; the cache works on most paths; the verbosity prompt is well-formed. The degradation lived in the interaction and in which traffic slices each bug touched — so neither internal usage nor the existing eval suite reproduced it, and “broad, inconsistent degradation” was all the aggregate looked like. The postmortem’s remedy is integration-level: per-model evaluations on every prompt change, ablation testing, and soak periods before rollout — exercising the system as it actually runs, because the failure was never inside any one unit.

Defenses: structured context, independent validation, circuit breakers

The countermeasures all push against under-specified, unchecked boundaries. The practitioner remedies are to convert prose specs into machine-validatable schemas, to enforce typed and schema-validated messages between agents (with MCP named as the schema-enforced substrate), to deploy isolated judge agents for independent validation, and to add circuit breakers that isolate a misbehaving agent before it cascades. [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original The payoff is concrete: “PwC achieved a 7x accuracy improvement (10% to 70%) through structured validation loops.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original

Concept ·
  • Structured error context — the error that crosses a boundary must carry machine-readable context (D2.2), not a prose apology, so the next stage can route on it instead of mis-reading it as input.
  • Independent validation — an isolated judge between stages (the D4.6 reviewer pattern) catches a wrong-but-plausible handoff before it propagates.
  • Circuit breakers — isolate a misbehaving agent so one stage’s failure does not cascade through the chain.
  • Escalation where possible — keep the human-reachable decision (D5.2) at the coordinator, not buried in a subagent that cannot ask.

The single-agent equivalent is the postmortem’s own remedy — broad per-model evaluations on every prompt change, ablation testing, and soak periods before rollout [Official] An update on recent Claude Code quality reports · AnthropicT1-official original — the same instinct as independent validation, applied to a pipeline of one.

Practice

Exercise

In a planner → coder → reviewer pipeline, the planner occasionally emits an ambiguous spec, the coder silently picks a wrong interpretation, and the reviewer — reading only the code — rates it fine. Bad output ships. What is the most effective fix?

  • A. Upgrade the coder to a more capable model so it interprets ambiguous specs correctly.
  • B. Make the planner emit a typed, schema-validated spec and add an independent validation step that checks the coder’s output against that spec before the reviewer runs.
  • C. Lengthen the reviewer’s prompt to tell it to look harder for problems.
  • D. Retry the whole pipeline a second time and compare the two outputs.
Practice ◆◆◇◇

Name the three MAST failure categories reported for multi-agent systems and which is the largest. Explain in one sentence why specification problems propagate especially badly in a pipeline.

Practice ◆◆◆◇

Using the April-23 postmortem as the example, explain why a compounding cross-system failure can pass every per-component evaluation, and what kind of testing the postmortem concluded was needed instead.

Exercise solutions

Solution ↑ Exercise

B. The failure is a boundary failure: an ambiguous handoff that no stage is positioned to catch. Fixing the boundary — a typed, schema-validated spec so there is less to misinterpret, plus an independent validation step that checks the coder’s output against that spec — intercepts the wrong interpretation before it reaches the reviewer. A makes one agent smarter but leaves the ambiguous interface intact; a better coder still has to guess at an under-specified spec. C asks the reviewer to work harder while still reading only the code, blind to the spec it was meant to satisfy. D doubles cost and gives two artifacts to compare with no oracle for which is right — a wrong-but-consistent interpretation reproduces on the retry.

Solution ↑ Exercise

The three MAST categories are specification problems (41.77%), coordination failures (36.94%), and verification gaps (21.30%); specification problems are the largest. Specification problems propagate especially badly in a pipeline because a mid-pipeline agent cannot pause to ask — an under-specified handoff becomes a guess the agent resolves and passes downstream as settled fact, so a single ambiguity at the top seeds a wrong decision that every later stage treats as valid input.

Solution ↑ Exercise

A compounding cross-system failure can pass every per-component evaluation because it lives between components, not inside any one: each part passes its own eval in isolation, and the degradation emerges only from their interaction, on traffic slices no single test exercises. In the April-23 case three bugs with overlapping windows (the union running Mar 4 – Apr 20) produced “broad, inconsistent degradation” that no single bug’s symptom resembled, and neither internal usage nor the existing eval suite reproduced it. The postmortem concluded that integration-level testing was needed instead — broad per-model evaluations on every prompt change, ablation testing, and soak periods before rollout — testing the system as it actually runs rather than each unit alone.

Exam essentials

  • Failures compound across boundaries — multi-agent systems fail at much higher rates than their parts (the MAST taxonomy: specification 41.77%, coordination 36.94%, verification 21.30%); a chain’s reliability is the product of its handoffs, not its best agent.
  • Ambiguity propagates silently — a mid-pipeline agent cannot pause to ask (unlike D5.2’s interactive escalation), so it resolves an ambiguity and passes the guess downstream as settled fact.
  • Compounding failures evade unit tests — they live between components, on traffic slices no single test exercises; all-green component evals do not prove a multi-stage system healthy.
  • Defenses — structured error context across boundaries (D2.2), independent validation / isolated judges (D4.6), circuit breakers to stop cascades, and keeping the escalable decision at the coordinator (D5.2); the single-agent analog is broad evals + ablation + soak periods.
Part 5 Chapter 4 Last verified 2026-06-02 Fresh

Large-Codebase Context: Compaction, Scratchpads, Delegation

A large codebase has more relevant files than any window holds, and reading them all is the trap, not the solution. The three levers that extend the horizon are compaction, scratchpads that externalize state to disk, and subagent delegation that pays exploration cost in a separate context.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D5.1 (cumulative, finite context) and D1.7 (session state and scratchpads).
You will learn
  • Recognize why reading an entire large codebase into context is a failure mode, not a strategy
  • Apply compaction and its customization knobs to extend a long working session
  • Explain how subagent delegation pays exploration cost in a separate context window
  • Distinguish which lever — compaction, scratchpads, delegation — fits which situation
  • Distinguish /compact (condense, same task) from /clear (reset, new task), and why a disk scratchpad survives both

A large codebase has more relevant code than any context window can hold, so the question is never “how do I fit it all in” but “how do I keep the right slice in and the rest out.” This chapter is the exam-angle inventory of the levers; the design book owns the at-scale mechanics. It is an architectural pattern — the levers are stable, the commands and hooks that drive them are the surface.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. Why is “read every possibly-relevant file into the main session” a failure mode rather than caution?
  2. Name the three knobs that customize or trigger compaction, and which one steers what content survives.
  3. /compact and /clear both free context — what is the difference, and when do you reach for each?
  4. A subagent explores twenty files; what does the main agent receive back, and where was the context cost paid?
  5. You wrote a PLAN.md scratchpad, then ran /compact, then later /clear. Is the plan still available? Why?
Check your answers
  1. Because context is cumulative and finite, reading everything “to be safe” fills the window with material that crowds out the work and degrades attention — the goal is to load the task’s slice, not the codebase.
  2. A CLAUDE.md “Summary instructions” section, the PreCompact hook, and manual /compact — the CLAUDE.md “Summary instructions” section steers what content survives the summary.
  3. /compact condenses the conversation in place so you continue the same task; /clear starts a fresh conversation to switch to an unrelated one — the decision rule is continuity (the previous session stays available via /resume).
  4. The main agent receives only a summary — the three-line answer, not the twenty files — because the exploration cost was paid in the subagent’s separate context window and discarded with it.
  5. Yes — a scratchpad on disk survives both: compaction only summarizes the window and /clear only wipes the window, so the agent re-reads PLAN.md exactly as it left it.

The codebase does not fit, and reading it all is the trap

Context is cumulative and finite (D5.1): conversation history and every tool input and output accumulate in the window over a session. [Official] How the agent loop works · AnthropicT1-official original A large codebase has far more relevant files than that window holds, and the naive response — have the agent read everything “to be safe” — is itself the failure: it fills the window with material that crowds out the work and degrades the model’s attention to what matters.

Key idea

The goal is not to load the codebase; it is to load the slice the task needs and keep the rest out. Every irrelevant file read is context spent against the budget and attention diluted across noise. On a large codebase, what you decline to read is as much a design decision as what you read.

Compaction extends a long session

When a working session approaches the limit, compaction reclaims room by summarizing older turns, and it is both automatic and steerable. Automatic compaction triggers near the context limit, [Official] How the agent loop works · AnthropicT1-official original and three knobs customize it: a “Summary instructions” section in CLAUDE.md that tells the compactor what to preserve, the PreCompact hook that runs before compaction, and a manual /compact sent on demand. [Official] How the agent loop works · AnthropicT1-official original

Concept ·
  • CLAUDE.md “Summary instructions” — the compactor reads CLAUDE.md like any context; a section describing what to preserve steers what survives the summary.
  • PreCompact hook — runs before compaction (e.g. to archive the full transcript first); receives whether the trigger was manual or automatic.
  • Manual /compact — trigger compaction on demand, before the automatic threshold, at a natural breakpoint you choose.

Compaction is lossy (D5.1), so the same caution applies: durable rules belong in re-injected CLAUDE.md, not in turns the summary may discard.

/compact vs /clear, and the scratchpad beneath both

Two commands free context, and confusing them wastes work. /compact [instructions] “frees context by summarizing” — it condenses the conversation in place, so you keep going on the same task with a shorter history, and the optional instructions focus what the summary keeps. [Official] Commands · AnthropicT1-official original /clear instead starts a fresh conversation — it discards the working context entirely, with the previous one still available via /resume (aliases /reset, /new). [Official] Commands · AnthropicT1-official original

The decision rule is continuity. Reach for /compact to continue a task whose history has grown long but is still relevant. Reach for /clear to switch to an unrelated task — or when a session is cluttered with failed approaches, the D3.5 rule that after more than two corrections on the same issue, “a clean session with a better prompt almost always outperforms a long session with accumulated corrections.” [Official] Best practices for Claude Code · AnthropicT1-official original Compaction keeps a lossy summary; clearing keeps nothing in the window at all.

Key idea

/compact condenses, /clear resets — and a scratchpad on disk survives both. Compaction only summarizes the window and /clear only wipes the window, so state you wrote to a file (D1.7) is untouched by either: the agent re-reads PLAN.md after a compaction, or in a freshly-cleared session, exactly as it left it. The durable layer of a long task does not live in the conversation at all.

Delegation pays exploration cost in another window

The lever that matters most for breadth is delegation. “Since context is your fundamental constraint, subagents are one of the most powerful tools available. When Claude researches a codebase it reads lots of files, all of which consume your context. Subagents run in separate context windows and report back summaries.” [Official] Best practices for Claude Code · AnthropicT1-official original A subagent can read the twenty files that answer “where is auth enforced,” and the main agent receives the three-line answer rather than the twenty files — the exploration cost is paid in the child’s window and discarded with it. Scratchpads are the complementary move: state written to a file (D1.7) lives on disk, not in the window, and the main agent reads it back only when needed.

Key idea

Compaction, delegation, and scratchpads all do one thing: move bulk out of the main context. Compaction summarizes it away, delegation spends it in a child window that returns only a summary, and a scratchpad parks it on disk. The main agent’s window stays scoped to the decision in front of it.

Four levers across one long task Worked example

A refactor spanning a 5,000-file service, in one sitting:

  1. Delegate the exploration. Instead of reading auth’s twenty files into the main window, dispatch a subagent — “find every place auth is enforced; report the files and the enforcement pattern.” It reads twenty files in its window and returns a three-line summary; the main context never absorbed the bulk.
  2. Externalize the plan. Write the refactor plan to PLAN.md — a scratchpad on disk, not in the window. State now lives somewhere compaction and clearing cannot reach.
  3. Condense, don’t reset, mid-task. Halfway through, the window nears the limit. Run /compact (optionally focused: “keep the auth findings and the plan”). The history condenses, you continue the same refactor, and PLAN.md is untouched on disk.
  4. Reset for the unrelated bug. A production bug interrupts — unrelated to the refactor. Run /clear for a fresh window (the refactor session stays in /resume), fix the bug, then return and re-read PLAN.md to resume the refactor exactly where you left it.

The lesson is matching the lever to the move: delegation for breadth, a scratchpad for state that must outlive the window, /compact to continue with less, /clear to switch cleanly. Confusing the two commands is the common error — compacting when you meant to switch drags stale context forward; clearing when you meant to continue throws away the thread (and only /resume can recover it).

Where the depth lives

This chapter is the exam-angle inventory; the design book owns the at-scale mechanics. The engineering of context at codebase scale — retrieval, the discipline of what to assemble into a window, and the cost trade-offs of fan-out — lives in the Agentic Systems Design book’s chapters on the environment at scale and context assembly. The exam-relevant skill is selecting the lever: compaction for a long single session, delegation for breadth of exploration, scratchpads for state that must outlive a compaction.

[Note]

Cross-book links are provisional — they point at the chapter sources until the design book deploys, then repoint to published URLs.

Practice

Exercise

An agent must understand how authentication flows through a 5,000-file codebase before making a change, and the relevant code is spread across dozens of files. What keeps the main session’s context usable?

  • A. Read every file that might be relevant into the main session so nothing is missed.
  • B. Dispatch subagents to explore subsystems and report back summaries, keeping the main context scoped to the change itself.
  • C. Raise max_tokens so the main session can hold more of the codebase.
  • D. Run /compact repeatedly while reading files so the window never fills.
Practice ◆◆◇◇

Name the three ways to customize or trigger compaction, and state which one steers what content survives the summary.

Practice ◆◆◆◇

Explain how dispatching a subagent to investigate part of a codebase protects the main agent’s context, and what the main agent receives back instead of the files the subagent read.

Exercise solutions

Solution ↑ Exercise

B. Delegation pays the exploration cost in the subagents’ separate context windows and returns only summaries, so the main session learns where authentication flows without absorbing dozens of files — exactly the “subagents run in separate context windows and report back summaries” pattern. A is the trap this chapter names: reading everything into the main window fills it with material that crowds out the actual change and dilutes attention. C buys a bigger budget but still spends it on noise, and a larger window degrades on irrelevant bulk just the same. D fights symptoms — compacting mid-exploration repeatedly summarizes away the very findings you are gathering, and is no substitute for never loading the bulk into the main window in the first place.

Solution ↑ Exercise

The three are (1) a CLAUDE.md “Summary instructions” section, (2) the PreCompact hook, and (3) manual /compact. The one that steers what content survives the summary is the CLAUDE.md “Summary instructions” section — the compactor reads CLAUDE.md like any other context, so a section describing what to preserve directs what the summary keeps. The PreCompact hook runs before compaction (e.g. to archive the full transcript) and manual /compact controls when it happens, not what survives — though /compact’s optional focus instructions also nudge the summary’s content.

Solution ↑ Exercise

Dispatching a subagent protects the main agent’s context because the subagent runs in its own separate context window: it reads the files needed to answer the question there, spending that exploration cost against its own budget, and that window is discarded when it returns. The main agent receives back only a summary — the three-line answer (“auth is enforced in middleware X via pattern Y”), not the twenty files the subagent read — so the main context learns the conclusion without absorbing the bulk that produced it. Exploration cost is paid in the child window and thrown away with it.

Exam essentials

  • Reading it all is the trap — a large codebase exceeds any window; loading everything “to be safe” fills the context with noise and degrades attention. Load the task’s slice, keep the rest out.
  • Compaction — triggers automatically near the limit and is steerable three ways: a CLAUDE.md “Summary instructions” section (steers what survives), the PreCompact hook, and manual /compact.
  • /compact vs /clear — /compact condenses the conversation to continue the same task (keeps a lossy summary); /clear starts a fresh conversation for an unrelated task or after >2 failed corrections (previous in /resume; aliases /reset, /new). A disk scratchpad (D1.7) survives both — neither command touches a file.
  • Delegation — subagents read files in their own context windows and report back summaries, so exploration cost is paid in the child window, not the main one; scratchpads (D1.7) park state on disk.
  • Pick the lever — compaction for a long single session, delegation for breadth of exploration, scratchpads for state that must outlive a compaction; depth lives in the design book’s at-scale and context-assembly chapters.
Part 5 Chapter 5 Last verified 2026-06-02 Fresh

Human Review and Confidence Calibration

Not every output earns automatic trust. The architect calibrates which results proceed and which route to a human, using checkable confidence signals and a tiered funnel — cheap auto-checks, then an isolated judge, then a person — so the human sees only the decisions where their judgment changes the outcome.

Volatility: architectural-pattern
Tools compared: claude-code
Before you start: Chapter D4.4 (semantic checks and the schema hooks), D4.6 (isolated reviewers), and D5.2 (escalation as the upstream half of human-in-the-loop).
You will learn
  • Evaluate which outputs can proceed automatically and which must route to a human
  • Apply checkable confidence signals as routing criteria rather than trusting self-reported confidence
  • Design a tiered review funnel that escalates only what each tier cannot resolve
  • Calibrate the human-review threshold by stakes and uncertainty

D4.4 closed the validation loop with the model — detect a semantic error, feed it back, retry. This chapter closes the other loop: when automation is not enough, route to a human. The architect’s job is calibration — deciding, per output, whether to trust it, verify it automatically, or escalate it to a person. The routing-and-funnel pattern is durable; the specific fields are illustration, so this is an architectural pattern.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. What makes the decision to auto-accept vs route-to-human an economic one rather than a quality slogan?
  2. Why is a model’s self-reported “high confidence” not a signal you can route on directly?
  3. What does it mean for a confidence signal to be calibrated in the measurement sense, and how would you check it?
  4. Describe the three tiers of the review funnel and what each one escalates.
  5. What two factors set where the human-review threshold falls?
Check your answers
  1. Because for each output you weigh the cost of a wrong auto-accept against the cost of a human glance — cheap and reversible automates, expensive or irreversible routes to a person; “review everything” and “trust everything” are both failures to calibrate.
  2. Self-reported confidence is a claim, not a measurement — a confidently-wrong output reports high confidence too, so route on checkable signals (e.g. calculated_total ≠ stated_total, conflict_detected) instead.
  3. Calibrated means the stated value tracks actual accuracy — “90% confident” outputs are right about 90% of the time; check it by measuring, over real labeled data, the accuracy at each stated-confidence level.
  4. Auto-check → isolated judge → human: cheap automated checks handle the obvious cases, the isolated judge (fresh context, no authorship bias) catches the wrong-but-plausible ones, and each tier escalates only what it cannot resolve, so the human sees only what survives both.
  5. Stakes × uncertainty — low-stakes, high-confidence proceeds automatically; high-stakes, low-confidence goes to a person; the middle is where the judge tier earns its keep.

Not every output earns automatic trust

The cheapest reliability move is to give the model a way to check itself: “Include tests, screenshots, or expected outputs so Claude can check itself. This is the single highest-leverage thing you can do.” [Official] Best practices for Claude Code · AnthropicT1-official original But some judgments cannot be auto-verified — a wrong-but-plausible extraction, a borderline classification, a high-stakes decision with no ground truth to diff against. Those are where a human belongs. Confidence calibration is the discipline of deciding, output by output, which path each one takes.

Key idea

Calibration is an economic decision, not a quality slogan. For each output, weigh the cost of a wrong auto-accept against the cost of a human glance. Where the wrong-accept is cheap and reversible, automate; where it is expensive or irreversible, route to a person. “Review everything” and “trust everything” are both failures to calibrate.

Confidence as a routing signal

To route, you need a signal you can act on — and the reliable signals are checkable, not self-reported. The schema hooks from D4.4 double as confidence signals. (These are a design pattern this book recommends, not a built-in platform field — you add them to your schema.) When a model’s calculated_total disagrees with the document’s stated_total, both demand human review; when a conflict_detected flag is true, the record routes to a person; and a structured confidence field (high / medium / low) on an extraction gives the caller an explicit value to threshold on. Each is a place where the system can say “I am not sure” in a form a router can read.

Concept ·
  • Cross-check mismatch — calculated_total ≠ stated_total: either the source is bad or a value was fabricated; both need a human.
  • Self-flagged conflict — conflict_detected: true: the model found contradictory source fields and is asking for adjudication.
  • Low stated confidence — a confidence of medium or low on a high-stakes field.
  • Failed provenance — a cited span that does not appear in the source document (D5.6), meaning the citation was fabricated.
Trusting self-reported confidence as ground truth

A model’s own “high confidence” is a claim, not a measurement — a confidently-wrong output reports high confidence too. Treat self-reported confidence as one input, and calibrate it against checkable signals (the cross-checks above) rather than letting it gate the human queue by itself. The signals worth routing on are the ones a caller can independently verify.

Two senses of “calibration”

The word is doing double duty in this chapter, and the distinction is worth making sharp. There is the routing calibration the chapter is built on — which output goes to which tier — and there is measurement calibration: whether a confidence value actually tracks accuracy. A model is well-calibrated, in the measurement sense, only if its “90% confident” outputs are right about 90% of the time. Self-reported confidence usually fails this test — models tend to be over-confident, reporting high certainty on answers that are wrong — which is precisely why a raw “high” cannot gate the human queue on its own.

Key idea

Two outputs both marked “high confidence” are not equally trustworthy unless the signal is calibrated. You have two honest options: route on checkable signals instead (a cross-check either matches or it does not — no calibration required), or empirically calibrate the threshold — measure, over real labeled data, the accuracy at each stated-confidence level and set the bar where the signal starts reliably predicting correctness. Never assume the model’s number is calibrated out of the box, and re-measure when the model, prompt, or input distribution changes.

Calibrating a confidence signal against reality Worked example

An extraction pipeline emits a confidence field (high / medium / low). Before trusting it to route, you measure it on a labeled sample of 1,000 past extractions:

Stated confidenceCountActually correct
high70094%
medium22071%
low8038%

(Illustrative numbers — the method is the point.) Two readings follow. First, the signal is informative: accuracy falls monotonically from high to low, so it does carry real information about correctness. Second, it is not perfectly calibrated: “high” is 94%, not ~100%, so roughly six in a hundred high-confidence extractions are wrong — and on a high-stakes clinical field that residual is unacceptable. The routing decision now follows from the numbers, not the label: auto-accept high only if a ~6% error rate is tolerable for this field; otherwise send even high through the isolated judge, and route medium/low to a human. Had you trusted the word “high” as if it meant “certain,” you would have shipped that 6% silently.

The discipline: a confidence signal earns its routing role by measurement, not by its name — and you re-measure when the model, the prompt, or the input distribution shifts, because calibration is not permanent.

Independent validation before the human

Between auto-accept and the human sits an automated reviewer tier: an isolated judge. Independent validation — “deploy isolated judge agents” — is one of the practitioner-recommended defenses, and the gains are real: “PwC achieved a 7x accuracy improvement (10% to 70%) through structured validation loops.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original The judge works for the same reason the D4.6 reviewer does: a fresh, isolated context has no authorship bias toward the output it is checking. Its job is to filter — resolve what it can, escalate only what it cannot — so the human queue stays small and high-value.

Key idea

Build review as a funnel, not a gate. Cheap automated checks handle the obvious cases; an isolated judge catches the wrong-but-plausible ones; the human sees only what survives both. Each tier escalates only what it cannot resolve, so the most expensive reviewer — the person — spends attention exclusively on the decisions that actually need it.

Calibrating the threshold

Where the human-review line falls is the calibration, and it is set by stakes times uncertainty. A low-stakes, high-confidence output proceeds automatically; a high-stakes, low-confidence one goes to a person; the middle is where the judge tier earns its keep. This makes human review the downstream half of human-in-the-loop, with escalation (D5.2) as the upstream half — the agent asks before acting when intent is unclear, and a human checks after producing when confidence is low. Calibrate the thresholds so a reviewer sees the few outputs where their judgment changes the outcome, and nothing it does not.

Practice

Exercise

A clinical-data extraction pipeline auto-accepts every result. Most are fine, but occasionally a high-stakes field is extracted wrong with no flag, and it reaches a patient record. You want to catch these without manually reviewing all output. What is the best design?

  • A. Retry every extraction with the model a second time and accept it if the two runs agree.
  • B. Have the model emit a confidence field and cross-check signals (e.g. conflict_detected), route low-confidence and flagged records through an isolated judge, and send what the judge cannot clear to a human.
  • C. Trust the model’s self-reported confidence and auto-accept anything it marks “high.”
  • D. Send every extracted record to a human reviewer to be safe.
Practice ◆◆◇◇

Name three signals an extraction pipeline can surface that should trigger human review, and explain why a model’s self-reported “high confidence” is not, by itself, a reliable one.

Practice ◆◆◆◇

Describe the tiered review funnel (auto-check → isolated judge → human), state what each tier escalates, and explain why the isolated judge catches errors that the cheap auto-checks miss.

Exercise solutions

Solution ↑ Exercise

B. The design calibrates: checkable signals (confidence, conflict_detected) select which records are uncertain, an isolated judge clears the merely-plausible ones, and only what the judge cannot resolve reaches a human — so high-stakes errors are caught without reviewing everything. A retries with the same model, and a confidently-wrong extraction reproduces on the second run, so agreement is not evidence of correctness. C trusts self-reported confidence, which a confidently-wrong output also reports as “high” — the exact trap. D is safe but uncalibrated: it spends the most expensive reviewer on every record, most of which need no human, and does not scale.

Solution ↑ Exercise

Three routable signals: a cross-check mismatch (calculated_total ≠ stated_total), a self-flagged conflict_detected: true, and a low stated confidence (a failed provenance check — a cited span absent from the source — is a fourth). A model’s self-reported “high confidence” is not reliable on its own because it is a claim, not a measurement: a confidently-wrong output reports high confidence too, and models tend to be over-confident, so stated confidence often fails to track actual accuracy. The signals worth routing on are the checkable ones — a cross-check either matches or it does not, independent of what the model believes — whereas self-reported confidence must first be empirically calibrated against observed accuracy before it can gate anything.

Solution ↑ Exercise

The funnel is auto-check → isolated judge → human. Tier 1 (cheap automated checks) handles the obvious cases — a cross-check mismatch or a thresholded signal — and escalates anything it cannot clear. Tier 2 (an isolated judge agent) reviews the merely-plausible cases in a fresh, independent context, resolving what it can and escalating only what it cannot. Tier 3 (the human) sees only what survived both. The isolated judge catches errors the cheap auto-checks miss because those errors are wrong-but-plausible — they pass the mechanical checks (valid shape, no flagged conflict) yet are semantically wrong, and a fresh-context reviewer with no authorship bias can judge correctness where a regex or equality test cannot. Each tier escalates only what it cannot resolve, so the most expensive reviewer spends attention only on the decisions that truly need human judgment.

Exam essentials

  • Calibrate, don’t blanket-trust or blanket-review — decide per output by weighing the cost of a wrong auto-accept against a human glance; verification is the highest-leverage habit where it is possible.
  • Route on checkable signals — cross-check mismatches (calculated_total ≠ stated_total), self-flagged conflict_detected, low confidence, and failed provenance are routable signals; self-reported confidence alone is a claim, not a measurement. Two senses of calibration: routing (which output to which tier) and measurement (does “90% confident” mean 90% correct?). Empirically calibrate a confidence field — accuracy per stated level — or prefer checkable signals that need no calibration.
  • Tiered funnel — cheap auto-checks → isolated judge (fresh context, no authorship bias) → human; each tier escalates only what it cannot resolve, keeping the human queue small (structured validation loops drove a documented 7× accuracy gain).
  • Threshold by stakes × uncertainty — high-stakes + low-confidence routes to a human; human review is the downstream half of human-in-the-loop, escalation (D5.2) the upstream half.
Part 5 Chapter 6 Last verified 2026-06-08 Fresh

Information Provenance: Citations and Temporal Validity

An architect tracks where each claim came from and when its data is valid. The native Citations API ties quoted text to real document spans so a source cannot be fabricated; the provenance triple is the schema-friendly fallback; and a model's knowledge cutoff bounds what it can be trusted to know without a dated source.

Volatility: feature-surface
Tools compared: claude-code
Before you start: Chapter D4.4 (the provenance schema hook) and D5.5 (failed provenance as a human-review signal). Closes Part V and the book.
You will learn
  • Apply the Citations API to map each claim to a verifiable source span
  • Distinguish the three citation location modes by document type
  • Recognize the provenance triple as the schema-friendly fallback when Citations cannot be used
  • Evaluate temporal provenance — when a model’s knowledge cutoff requires a dated source instead

The book closes on the question that underlies trust in any agent output: where did this claim come from, and is it still true? Provenance answers the first — a claim mapped to its source is auditable, one without a source is a trust-me. Temporal validity answers the second. This chapter is the exam-angle treatment; the named features — the Citations API surface, the location modes — are the moving parts, so it is a feature surface.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

  1. What does the Citations API actually guarantee about a cited quote — and what does it not guarantee (e.g. a JSON grammar)?
  2. Name the three citation location modes by document type.
  3. You need structured JSON and per-claim attribution. Why does combining Citations with structured outputs fail, and what is the fallback?
  4. Is a model’s reliable knowledge cutoff earlier or later than its training-data cutoff — and why?
  5. Within one request, on how many of the documents must citations be enabled?
Check your answers
  1. It guarantees the cited text is tied to an actual span in the source document — the model cannot fabricate a citation to text that is not there; it is span-bound, not grammar-constrained, so it guarantees nothing about output shape.
  2. Plain text → char_location, PDF → page_location, custom content → content_block_location.
  3. Cited text must interleave with the response prose, which a strict JSON schema forbids, so the API returns a 400; the fallback is the provenance triple (document_id, span_quote, confidence) verified caller-side.
  4. Earlier (or equal) — data near the training cutoff is sparse, so the model’s reliable knowledge stops before its training does.
  5. All or none — you cannot mix cited and uncited documents within a request.
[Note]

This is a feature-surface chapter: the Citations API toggle, the three location-mode field names, and the per-model knowledge cutoffs are concrete surfaces that move between releases. Treat every field and date as a current snapshot and re-verify before relying on it.

Provenance maps every claim to its source

The point of provenance is verifiability. “Claude is capable of providing detailed citations when answering questions about documents, helping you track and verify information sources in responses. All active models support citations, with the exception of Haiku 3.” [Official] Citations · AnthropicT1-official original You enable it per document with citations: {"enabled": true} on the document block, and each cited claim in the response carries a sibling citations array pointing back to the exact span of the source it came from. [Official] Citations · AnthropicT1-official original One enablement rule to memorize: citations must be enabled on all or none of the documents within a request — you cannot mix cited and uncited documents. [Official] Citations · AnthropicT1-official original

Key idea

A claim with a verifiable source is auditable; a claim without one is an assertion you have to take on faith. The Citations API’s real value is not formatting — it is that the cited text is tied to an actual span in the source document, so the model cannot fabricate a citation to text that is not there. Provenance turns “trust me” into “check line 14.”

The Citations API and its location modes

How a citation points at its source depends on the document type, and there are three modes. Plain text is chunked to sentences and cited by char_location; a PDF is cited by page_location; custom content, where you supply the chunks, is cited by content_block_location. [Official] Citations · AnthropicT1-official original The feature is also output-cheap: “the cited_text field is provided for convenience and does not count towards output tokens.” [Official] Citations · AnthropicT1-official original

Concept ·
  • Plain text → char_location — sentence-chunked; start/end character indices (0-indexed, exclusive end).
  • PDF → page_location — sentence-chunked; start/end page numbers (1-indexed, exclusive end). Scanned images without extractable text are not citable.
  • Custom content → content_block_location — you supply the chunks; start/end block indices (0-indexed, exclusive end).
Citations and Structured Outputs are mutually exclusive

“Citations cannot be used together with Structured Outputs … the API will return a 400 error,” [Official] Citations · AnthropicT1-official original because cited text must interleave with the response prose, which a strict JSON schema forbids. So you cannot get the Citations API’s span-bound attribution and a grammar-constrained JSON shape in the same call — and image citations are not yet supported at all. When you need both provenance and a schema, you fall to the next section.

The provenance triple: schema-friendly fallback

When the output must be structured JSON, the native Citations API is off the table, so you encode provenance into the schema yourself. This is the D4.4 hook applied to attribution — a design pattern this book recommends, not a platform feature: each extracted claim carries a source object with a document_id, a span_quote, and a confidence, and the caller verifies that span_quote actually appears in document_id. If it does not, the model fabricated the citation. It is a manual, checkable provenance you can drop inside any schema.

Key idea

The two provenance mechanisms trade the same way as everything in Part IV. Native Citations binds each cited quote to a real document span but cannot coexist with structured outputs; the provenance triple is structurally compatible with any schema but only as trustworthy as the verification you run on it. Choose by which constraint binds — a strict output shape, or a span-bound citation — because you cannot have both in one call.

Temporal provenance: knowing when data is valid

Provenance is not only where a claim came from but when it can be trusted. Each model has a reliable knowledge cutoff — Opus 4.8 at January 2026, Sonnet 4.6 at August 2025, Haiku 4.5 at February 2025 — and that reliable cutoff is earlier than (or equal to) the model’s training-data cutoff, not later: Sonnet 4.6 trained on data through January 2026 but is reliable only to August 2025, and Haiku 4.5 trained through July 2025 but is reliable to February 2025. [Official] Models overview · AnthropicT1-official original Data near the training cutoff is sparse, so the model’s reliable knowledge stops before its training does. Past the reliable cutoff the model has no dependable knowledge, so a time-sensitive fact must come from a dated source supplied at request time (retrieval with a citation), not from the model’s memory. The use-side workflow for recording claim sources and decision dates is the handbook’s territory (its provenance and ADR material is forthcoming).

[Note]

Cross-book pointer is prose-only — the handbook’s provenance chapter is outlined but unpublished, so there is no stable URL to link yet.

Provenance on both axes: where and when Worked example

A RAG assistant on Sonnet 4.6 (reliable knowledge cutoff August 2025) is asked: “What did the Q4 2025 earnings report say about revenue?” Both provenance axes are in play:

  1. When — temporal validity first. Q4 2025 is after the model’s reliable cutoff (August 2025), so the model has no dependable knowledge of it — answering from memory risks a confident fabrication. The fact must come from a dated source supplied at request time, not the model’s weights. (Note it is irrelevant that Sonnet 4.6 trained through January 2026: the reliable cutoff is the earlier date, and it is what bounds trust.)
  2. Where — bind the answer to the source. Supply the earnings PDF as a document with citations: {"enabled": true}. Because it is a PDF, the response cites by page_location, and each revenue claim carries the exact page span — auditable, not “trust me.” The cited_text echoes the source span back without costing output tokens.
  3. If the pipeline also needs structured JSON, you hit the wall: Citations + output_config.format returns a 400. Fall to the provenance triple — emit {claim, source: {document_id, span_quote, confidence}} per fact and verify each span_quote appears in the cited PDF caller-side.

The closing synthesis of the book: a trustworthy claim needs where (a source span, via Citations or the triple) and when (a dated source, because the reliable cutoff — earlier than training — bounds what the model knows). Retrieval supplies the dated source; a citation binds the answer to it; verification proves the binding is real.

Practice

Exercise

You are building an extraction pipeline that must return structured JSON and attribute each extracted fact to a source span. You try the Citations API with output_config.format and get a 400. What is the right design?

  • A. Drop the JSON schema and use the Citations API alone, parsing the prose response downstream.
  • B. Keep the schema and add a provenance triple per claim — document_id, span_quote, confidence — then have the caller verify each span_quote appears in its source.
  • C. Trust the model’s cited text without verification, since it was instructed to quote the source.
  • D. Send the request twice, once with citations and once with the schema, and merge the two responses.
Practice ◆◆◇◇

Name the three Citations API document modes and the location type each one emits, and state why cited_text is attractive on output cost.

Practice ◆◆◆◇

Explain why a question about an event after a model’s knowledge cutoff should be answered from a supplied dated source rather than the model’s own knowledge, and how that connects to provenance more broadly.

Exercise solutions

Solution ↑ Exercise

B. Citations and Structured Outputs are mutually exclusive — the 400 is the API telling you so — so when the structured shape is required, you encode provenance into the schema with a triple (document_id, span_quote, confidence) and verify each span against its source caller-side. A abandons the structured output the pipeline requires, trading one requirement for the other. C is the fabrication trap: an unverified span_quote may quote text that is not in the document, which is exactly the failure provenance exists to catch. D doubles cost and leaves you reconciling two responses with no guarantee the cited run and the schema run extracted the same facts.

Solution ↑ Exercise

The three modes are plain text → char_location (sentence-chunked; start/end character indices), PDF → page_location (start/end page numbers; scanned images without extractable text are not citable), and custom content → content_block_location (you supply the chunks; start/end block indices). cited_text is attractive on output cost because the field “is provided for convenience and does not count towards output tokens” — you get the quoted source span echoed back for verification without paying output tokens for it.

Solution ↑ Exercise

A question about an event after the model’s reliable knowledge cutoff should be answered from a supplied dated source because past that cutoff the model has no dependable knowledge — it may produce a plausible but fabricated answer. Crucially, the reliable cutoff is earlier than the training-data cutoff, so even data the model technically trained on near the boundary is unreliable; the earlier date is the one that bounds trust. Supplying the fact as a dated source at request time (retrieval plus a citation) makes the answer both correct and auditable. That connects to provenance broadly: provenance answers two questions — where a claim came from (a source span, via Citations or the triple) and when it is valid (a dated source past the cutoff). A time-sensitive claim needs both: an external dated source, bound to the answer by a verifiable citation.

Exam essentials

  • Provenance is verifiability — the Citations API (enable per document with citations: {"enabled": true}) ties each claim to a source span so a citation cannot be fabricated (span-bound, not grammar-constrained); cited_text does not count toward output tokens. Citations must be enabled on all or none of a request’s documents.
  • Three location modes — plain text → char_location, PDF → page_location, custom content → content_block_location; image citations are not yet supported.
  • Mutual exclusion — Citations + Structured Outputs return 400; when you need both a schema and provenance, use the provenance triple (document_id + span_quote + confidence) and verify the span caller-side.
  • Temporal provenance — each model has a reliable knowledge cutoff (Opus 4.8 Jan 2026, Sonnet 4.6 Aug 2025, Haiku 4.5 Feb 2025), which is earlier than (or equal to) the training-data cutoff — the model trains on later data but is only reliable to the earlier date (Sonnet 4.6 trained to Jan 2026, reliable to Aug 2025). Past the reliable cutoff, answer time-sensitive questions from a dated source, not the model’s memory.

Part 5 · D5 Review

6 exercises across 6 chapters — interleaved review.

d5-01-long-conversation-context

  1. d5-01-ex-durable-instruction Early in a long session you tell Claude "always cite a file:line for any behavioral claim." Forty turns later it stops doing so, right around when a compaction summary appeared. What is the most reliable fix? - **A.** Re-paste the instruction manually every few turns to keep it fresh in context. - **B.** Move the instruction into CLAUDE.md, which is re-injected on every request and so survives compaction. - **C.** Switch to a model with a larger context window so compaction never triggers. - **D.** Raise `max_tokens` so the model has more room to comply.

d5-02-escalation-ambiguity

  1. d5-02-ex-ambiguous-db An agent is scaffolding a new service and reaches a step that needs a database, but the request never said which one. What is the most reliable design? - **A.** Pick the most common default (say, PostgreSQL) and proceed, noting the choice in a comment. - **B.** Call `AskUserQuestion` with a short set of bounded options (Postgres / MySQL / SQLite) and continue once the user chooses. - **C.** Infer the database from whatever is already installed on the build machine. - **D.** Fail the task with an error explaining that the requirement was underspecified.

d5-03-error-propagation

  1. d5-03-ex-silent-propagation In a planner → coder → reviewer pipeline, the planner occasionally emits an ambiguous spec, the coder silently picks a wrong interpretation, and the reviewer — reading only the code — rates it fine. Bad output ships. What is the most effective fix? - **A.** Upgrade the coder to a more capable model so it interprets ambiguous specs correctly. - **B.** Make the planner emit a typed, schema-validated spec and add an independent validation step that checks the coder's output against that spec before the reviewer runs. - **C.** Lengthen the reviewer's prompt to tell it to look harder for problems. - **D.** Retry the whole pipeline a second time and compare the two outputs.

d5-04-large-codebase-context

  1. d5-04-ex-explore-large An agent must understand how authentication flows through a 5,000-file codebase before making a change, and the relevant code is spread across dozens of files. What keeps the main session's context usable? - **A.** Read every file that might be relevant into the main session so nothing is missed. - **B.** Dispatch subagents to explore subsystems and report back summaries, keeping the main context scoped to the change itself. - **C.** Raise `max_tokens` so the main session can hold more of the codebase. - **D.** Run `/compact` repeatedly while reading files so the window never fills.

d5-05-human-review-confidence

  1. d5-05-ex-route-to-human A clinical-data extraction pipeline auto-accepts every result. Most are fine, but occasionally a high-stakes field is extracted wrong with no flag, and it reaches a patient record. You want to catch these without manually reviewing all output. What is the best design? - **A.** Retry every extraction with the model a second time and accept it if the two runs agree. - **B.** Have the model emit a `confidence` field and cross-check signals (e.g. `conflict_detected`), route low-confidence and flagged records through an isolated judge, and send what the judge cannot clear to a human. - **C.** Trust the model's self-reported confidence and auto-accept anything it marks "high." - **D.** Send every extracted record to a human reviewer to be safe.

d5-06-information-provenance

  1. d5-06-ex-provenance-with-schema You are building an extraction pipeline that must return structured JSON *and* attribute each extracted fact to a source span. You try the Citations API with `output_config.format` and get a 400. What is the right design? - **A.** Drop the JSON schema and use the Citations API alone, parsing the prose response downstream. - **B.** Keep the schema and add a provenance triple per claim — `document_id`, `span_quote`, `confidence` — then have the caller verify each `span_quote` appears in its source. - **C.** Trust the model's cited text without verification, since it was instructed to quote the source. - **D.** Send the request twice, once with citations and once with the schema, and merge the two responses.