Part 1 Chapter 1 Last verified 2026-06-02 Fresh

Agentic Loops: stop_reason and Tool-Result Handling

The first chapter of Domain 1 and the substrate the rest of the book assumes — the agent loop as a control structure whose branch condition is stop_reason. Teaches the tool-use round-trip from first principles, the turn model, error and parallel tool-result handling, termination budgets, and every stop_reason value an architect must recognize.

Volatility: architectural-pattern

Tools compared: claude-code

Before you start: You can call the Claude Messages API (or the Agent SDK) and have seen a tool definition. No orchestration experience assumed — this chapter builds the loop from the ground up.

You will learn

Define what an agent is, and why “a model calling tools in a loop” is the whole architecture
Trace one full pass of the loop: the tool_use → execute → tool_result round-trip, and the condition that ends it
Distinguish a turn from an API request, and predict how max_turns counts
Handle the two tool-result cases the exam probes: a failed call (is_error) and parallel calls (return all results together)
Identify every stop_reason value and say what it means for whether the loop continues
Design a termination policy (max_turns / max_budget_usd + a subtype check) that fails safe

Domain 1 is the largest slice of the exam (27%), and this chapter is its floor: every later topic — subagents, workflows, hooks, session state — is a variation on the loop defined here. The exam tests whether you can read the loop’s control flow, not whether you can recite an SDK signature. We build the loop from the definition of an agent, trace it end to end with a worked example, name the stop_reason values that branch it, handle the error and parallel result cases, and fix the termination contract an architect owns.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

In one sentence, what is an agent, and what is the single branch condition that drives its loop?
A session makes three tool calls and then returns a text answer. How many turns is that, and what is the smallest max_turns that lets it finish?
A tool throws an exception inside your handler. What do you send back to the model, and what field do you set?
The model returns two tool_use blocks in one response. May you answer one now and the other next turn?
Your result object has subtype: "error_max_turns". Is .result populated? What should your code check before reading it?

Check your answers

An agent is an ordinary model in a loop — “LLMs using tools based on environmental feedback in a loop.” The branch condition is stop_reason: tool_use continues the loop, end_turn ends it.
Three turns — turns count tool-use round-trips, and the final text-only answer is not a turn. The smallest budget that finishes is max_turns = 3.
A normal tool_result block carrying the error text as content, with is_error: true — never throw out of the loop; the model reads the failure and adapts.
No — every tool_use block in a response needs its tool_result in the next user message, all returned together, keyed by tool_use_id.
.result is empty on every error_* subtype. Check subtype == "success" before reading it.

What an agent is

Start with the definition, because the loop falls out of it. Anthropic’s Building Effective Agents draws the line that organizes this entire domain: “Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original A workflow runs on rails you laid down in code; an agent decides its own next step at runtime.

Operationally, that autonomy has a simple shape: “Agents are typically just LLMs using tools based on environmental feedback in a loop.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original The model proposes an action, your code runs it, the result of running it becomes the model’s next input, and the model decides again. That feedback cycle — act, observe, decide — is the agent. Everything else in Domain 1 is a way of shaping, bounding, or distributing it.

The loop is a control structure

Because the agent is the loop, the loop is the thing you reason about — and it has exactly one branch condition. At the Messages API level, Claude returns stop_reason: "tool_use" together with one or more tool_use blocks; your application executes each call and returns tool_result blocks on the next user turn. [Official] Tool use with Claude · AnthropicT1-official original The loop repeats that exchange until Claude responds without a tool call.

The architectural point is that the model decides what happens next, but your code decides whether it gets to. Tool access is “one of the highest-leverage primitives you can give an agent,” [Official] Tool use with Claude · AnthropicT1-official original and the loop is where that leverage is either contained or left unbounded. Owning the loop — not authoring any single tool — is the orchestration discipline the rest of Domain 1 elaborates.

Tracing the loop: 'fix the failing tests' Worked example

Claude Code’s own documentation walks one task through the loop. Asked to fix failing tests, the model chains tool calls, each result informing the next decision:

Run the test suite — Bash(npm test) → stop_reason: "tool_use"; the result is the failure output.
Read the error output — the model now knows which tests fail and why.
Search for the relevant source files — Grep for the failing symbol.
Read those files — Read to understand the code.
Edit the files to fix the issue — Edit.
Run the tests again to verify — Bash(npm test); green this time → the model responds with text and stop_reason: "end_turn".

“Each tool use gives Claude new information that informs the next step.” [Official] How Claude Code works · AnthropicT1-official original Read top to bottom, this is five tool-use turns (steps 1–6 minus the final text) followed by one free text answer. Nothing in the path was pre-programmed — the model chose each tool from the previous result. That is an agent.

A turn is one tool-use round-trip

The word turn has a precise meaning, and the exam leans on it. A turn is “one round trip inside the loop: Claude produces output that includes tool calls, the SDK executes those tools, and the results feed back to Claude automatically … Turns continue until Claude produces output with no tool calls.” [Official] How the agent loop works · AnthropicT1-official original The Agent SDK embeds this loop for you — it ships “the same tools, agent loop, and context management that power Claude Code” [Official] Agent SDK overview · AnthropicT1-official original — so at the SDK level you observe messages, not the raw branch.

The consequence that trips candidates: a text-only final response is not a turn. A four-message session is three tool-use turns plus one final text answer, so max_turns=2 would stop before that final step. [Official] How the agent loop works · AnthropicT1-official original Size a turn budget to the tool calls a task needs, not to the messages you expect to see.

Handling tool results: errors and parallel calls

Step 4 of the round-trip hides two cases the exam specifically tests, because both are where a hand-written loop goes wrong.

A failed tool does not raise — it reports. When your handler hits an error (the file is missing, the command exits non-zero, the API times out), you do not throw out of the loop. You return a normal tool_result block with is_error: true and the error text as its content. [Official] Tool use with Claude · AnthropicT1-official original The model sees the failure and adapts — retries with a corrected argument, tries a different tool, or explains the blocker — exactly as it would read any other result. Raising an exception instead severs the loop and throws away the model’s ability to recover.

Parallel calls must be answered together. When a single response contains more than one tool_use block, the API requires every corresponding tool_result in the next user message — you cannot answer one tool now and defer the other to a later turn. Execute them (concurrently when they are read-only and independent; serially when one mutates state another reads), collect all results, and send them as one batch. The mechanics of when to parallelize live in D2.3 — for the loop, the rule to hold is: all of a turn’s results return together, keyed by tool_use_id.

stop_reason is the branch

Because stop_reason is the loop’s branch condition, recognizing each value — and its loop consequence — is core exam material. On a ResultMessage, stop_reason carries the value from the model’s last response. [Official] How the agent loop works · AnthropicT1-official original

`stop_reason`	What it means	Loop consequence
`tool_use`	The response contains tool calls	Continue — execute tools, return `tool_result`, request again
`end_turn`	The model finished naturally, no tool calls	Stop — deliver the final result
`max_tokens`	Output hit the token budget mid-response	Stop, but the answer is truncated — you may need to continue
`refusal`	The model declined to generate	Stop — handle as a non-answer, not a result

Termination is the architect’s safety contract

A model-driven loop that can run forever is a production incident waiting to happen, so the architect supplies the stop conditions the model cannot. The SDK exposes two budgets: max_turns (a turn count) and max_budget_usd (a client-side cost estimate). Hitting either ends the loop and sets ResultMessage.subtype to error_max_turns or error_max_budget_usd. [Official] How the agent loop works · AnthropicT1-official original

The subtype is the termination indicator, and it gates whether .result is even populated:

`subtype`	Meaning	`.result`?
`success`	Normal finish	yes
`error_max_turns`	Hit `max_turns`	no
`error_max_budget_usd`	Hit `max_budget_usd`	no
`error_during_execution`	API / cancellation error	no
`error_max_structured_output_retries`	JSON-Schema validation failed past the retry limit	no

Where Claude Code’s loop sits

Claude Code is one concrete harness around this loop. Its documentation states the architecture plainly: “The agentic loop is powered by two components: models that reason and tools that act.” [Official] How Claude Code works · AnthropicT1-official original The same source makes the dependency explicit — “Tools are what make Claude Code agentic. Without tools, Claude can only respond with text” [Official] How Claude Code works · AnthropicT1-official original — and the SDK’s account agrees that “Tools are the primary building blocks of execution for your agent.” [Official] Building agents with the Claude Agent SDK · Thariq Shihipar et al. (2025)T1-official original

For the exam, treat the loop in this chapter as the inner cycle. A long-running harness wraps it in an outer cycle that carries state across many context windows — but the branch condition, the turn model, and the termination contract are identical at every scale. Master the inner loop and the rest of Domain 1 is composition.

Practice

Exercise solutions

Solution ↑ Exercise

Three turns. Turns count tool-use round-trips: Read, Grep, Edit are turns 1–3; the final text summary is not a turn. The final response’s stop_reason is end_turn (no tool call). The smallest budget that still finishes is max_turns = 3 — it permits the three tool-use turns, and the free final text answer follows. max_turns = 2 would stop before the Edit.

Solution ↑ Exercise

B — end_turn. It means the model finished naturally with no tool calls, so the text response is the deliverable. A (tool_use) is the continue branch — there is more loop to run, not a final answer. C (max_tokens) stops the response but the answer is truncated mid-output, so it usually needs a continuation before it is usable. D (refusal) is a non-answer to be handled as a declined request, not delivered as a result. The discriminating idea: only end_turn is both terminal and complete.

Solution ↑ Exercise

A defensible policy: “Set max_turns to the tool calls a normal fix needs plus headroom (say 15), and max_budget_usd as a hard cost ceiling, because the re-reading loop fails by count and by spend and either bound should stop it. My handler checks ResultMessage.subtype first: on success I read .result; on any error_* subtype I surface the exhaustion (and the partial transcript) rather than printing an empty .result.” The exact field checked first is subtype — never .result before it.

Exam essentials

An agent is a model in a loop: “LLMs using tools based on environmental feedback in a loop.” Workflow = predefined code path; agent = model directs its own steps.
The loop has one branch: stop_reason: "tool_use" → run tools, return tool_result, request again; end_turn → stop.
A turn = one tool-use round-trip. The final text-only response does not count against max_turns.
Tool-result handling: a failed call returns a tool_result with is_error: true (the model adapts; you do not throw); parallel tool_use blocks need all their tool_results in the next user message, keyed by tool_use_id.
stop_reason values: tool_use (continue), end_turn (stop, usable), max_tokens (truncated — maybe continue), refusal (non-answer).
Termination is yours to set: max_turns + max_budget_usd; on exhaustion the ResultMessage.subtype becomes an error_* value and .result is empty.
Check subtype before reading .result — every error subtype leaves .result unpopulated.

Coordinator–Subagent Patterns: Hub-and-Spoke and Isolated Context

The coordinator–subagent (orchestrator-worker) pattern — a lead agent that decomposes a task and spawns isolated-context subagents. Teaches why a second agent ever helps, when the pattern earns its 3–10x token cost, the full single-vs-multi trade-off (including reliability and maintainability), why decomposition must split by context and not by role, and the one variant that works across domains.

Volatility: architectural-pattern

Tools compared: claude-code

Once the agent loop of D1.1 can run tools, the next architectural question is whether to run more than one agent. This chapter develops the canonical multi-agent shape — a coordinator that spawns isolated subagents — and, just as importantly, the discipline of not reaching for it. The exam tests judgment here: when the pattern wins, what it costs on every axis, how to cut the work, and which single variant is reliable.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

What single property defines a subagent and separates it from a plain tool call?
Name the three conditions under which coordinator–subagent earns its token cost.
Roughly what token multiplier does multi-agent cost versus a single agent, and does it buy speed or thoroughness?
Besides cost, name two axes on which a multi-agent system is worse than a single agent.
A teammate proposes splitting a task into planner → implementer → tester → reviewer. What is the name of this anti-pattern and the metaphor for its failure?

Check your answers

Isolated context — a subagent runs in its own context window and does not see the parent’s state, where a tool call returns directly into the calling agent’s context.
Context protection (large, mostly-irrelevant intermediate data stays out of the main window), parallelization (genuinely independent paths), and specialization (tool-set overload, conflicting personas, or deep domain expertise).
3–10× more tokens than a single agent — and it buys thoroughness, not speed (coordination often makes wall-clock slower).
Any two of: latency (often slower despite parallelism), reliability (multiple failure points), maintainability (multiple prompt sets to keep in sync), context coherence (fragmented at handoffs).
Role-based (problem-centric) decomposition — its failure metaphor is “the telephone game”: context loss at every handoff.

Why run more than one agent

A single agent (D1.1) is one model reasoning over one finite context window. That window is the bottleneck: everything the agent reads, every tool result, every intermediate thought accumulates in it, and a model attends less reliably as it fills. So the motivation for a second agent is not “more brains” — it is more windows. When a subtask would flood the main window with data the final answer doesn’t need, or when independent paths could be explored at once, splitting the work across separate context windows relieves the constraint a single loop cannot.

That is the line Building Effective Agents draws between a workflow and an agent: an agent is a “system where LLMs dynamically direct their own processes and tool usage.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original A coordinator of subagents is one such system — an agent whose tool is “spawn another agent.”

The orchestrator and its workers

The canonical multi-agent shape is hub-and-spoke: a lead agent analyzes the task, plans a strategy, and spawns subagents that explore parts of it independently. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The lead synthesizes their results and decides whether more work is needed — it is an orchestrator, and the subagents are workers.

This is a real architecture with measured stakes, not a toy. In Anthropic’s research system, “a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The number is real but specific to that model pairing and that eval — read it as evidence the pattern can pay off, not as a portable benchmark.

Hub-and-spoke: comparing three regions' climate policy Worked example

A lead agent receives: “Compare the climate-disclosure rules of the EU, the US, and Japan.” Decomposed hub-and-spoke:

Lead plans — three regions are independent research paths with no shared state, so it spawns one subagent per region.
Subagents run in isolation — the EU subagent searches EU sources in its own context window; it never sees the US subagent’s intermediate pages, and neither pollutes the lead’s window.
Artifacts, not transcripts — each subagent writes its full findings to a file and returns a compact reference, so 2,000 tokens of raw sources per region don’t stream back through the lead. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original
Lead synthesizes — it reads the three references and writes the comparison.

The win conditions stack here: context protection (raw sources stay out of the lead) and parallelization (three regions at once). Note what is not split — the final synthesis stays in one agent, because comparing the three regions needs all three in one window.

Isolated context is the whole point

The property that defines the pattern is that each subagent runs in its own context window and does not see the parent’s state. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original A subagent is given a task and returns a result; the intermediate tokens it generates never touch the coordinator’s window. That isolation is the feature: it keeps a subtask’s noise out of the agent that has to reason over the whole problem.

Because results must cross a context boundary, large outputs use the artifacts pattern — a subagent writes its full output to the filesystem or external storage and passes a lightweight reference back, rather than streaming everything through the coordinator. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The coordinator stays lean; the high-fidelity output lives outside its window until needed.

When the pattern earns its cost

The capability is bought with tokens. “In our data, agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats,” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original and against a single agent on an equivalent task, “multi-agent implementations typically use 3-10x more tokens than single-agent approaches.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original So the official guidance leads with restraint: “Start with the simplest approach that works, and add complexity only when evidence supports it.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original Try improved prompting, context compaction, and the Tool Search Tool on one agent first.

Reach for coordinator–subagent only when one of three conditions holds: [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original

Win condition	The signal	What it buys
Context protection	A subtask generates large, mostly-irrelevant intermediate data (>1000 tokens) that would pollute the main agent’s reasoning	A clean main-agent window
Parallelization	Genuinely independent paths to explore concurrently	Thoroughness, not speed (coordination often makes wall-clock slower)
Specialization	Tool-set overload (avoid 20+ tools on one agent), conflicting personas, or deep domain expertise	Focused agents that outperform an overloaded generalist

Cost is only the first axis. The full trade-off — the one a scenario question makes you weigh — is worse for multi-agent on most rows, and the architect must be able to name them:

Concept ·

Dimension	Single agent	Multi-agent
Token usage	Baseline	3–10× higher
Latency	Fast, sequential	Often slower despite parallelism (coordination + slowest-subagent)
Reliability	One point of failure	Multiple failure points — more places an error can enter
Maintainability	One prompt set	Multiple prompt sets to keep in sync
Context coherence	Unified	Fragmented at handoffs

Multi-agent is not “more advanced and therefore better”; it trades cost, latency, reliability, and maintainability for capability on tasks that genuinely need separate windows. Three of those five rows are downsides — which is why the exam frames the decision as restraint first.

Decompose by context, not by role

When you do split, how you cut the work is the most-tested judgment in this domain — and the most common way to get it wrong. The anti-pattern is role-based / problem-centric decomposition: planner → implementer → tester → reviewer. It feels organized but “creates constant coordination overhead and context loss at handoffs — the telephone game,” spending more tokens coordinating than executing. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original

The reliable alternative is context-centric decomposition: split only at true context boundaries.

The verification subagent

One multi-agent shape “consistently succeeds across domains”: the verification subagent. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original The main agent does the work; a separate agent blackbox-tests the result with clear success criteria and minimal context transfer. The isolation is the strength — the verifier has no stake in, and no memory of, how the work was produced.

Its failure mode is early victory: verifiers tend to declare success after one or two checks. The documented mitigation is an explicit instruction — “You MUST run the complete test suite before marking as passed.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original

Practice

Exercise solutions

Solution ↑ Exercise

Multi-agent is plausibly warranted — for specialization — but the proposed split is the role-based anti-pattern. The real signal is tool-set overload (40 tools on one agent; the guidance says avoid 20+ and prefer focused agents). So the justified cut is by tool/domain context — e.g. a CRM-and-orders agent vs a messaging-and-analytics agent — each carrying a focused tool set. The proposed intake → diagnosis → resolution → follow-up split is problem-centric: those are sequential phases of one tightly-coupled ticket, so they would lose fidelity at every handoff (the telephone game) and add coordination cost. Decompose by what context is independent (tool domains), not by what step comes next. And first confirm the Tool Search Tool alone can’t relieve the tool overload on a single agent.

Solution ↑ Exercise

The failure mode is context loss / information-fidelity degradation at handoffs, plus constant coordination overhead — the guidance calls it “the telephone game.” It happens because planner/implementer/tester/reviewer are sequential phases of one tightly-coupled task, not independent contexts. The rule: place a split only at a true context boundary — independent paths, clean-interface components, or blackbox verification — never by “what step comes next.”

Solution ↑ Exercise

Any two of: Reliability — single (one point of failure) → multi (multiple failure points). Maintainability — single (one prompt set) → multi (multiple prompt sets to keep in sync). Latency — single (fast sequential) → multi (often slower despite parallelism). Context coherence — single (unified) → multi (fragmented at handoffs). “More scalable” is not free: three of the five trade-off rows move the wrong way when you add agents.

Exam essentials

Why multi-agent at all: a single agent has one finite context window; extra agents buy more windows, not more intelligence.
Coordinator–subagent = hub-and-spoke: a lead decomposes, spawns subagents, and synthesizes; subagents run in isolated context windows and do not see parent state.
Isolation is the feature (context protection); large outputs use the artifacts pattern — write to storage, pass a reference back.
Cost is 3–10× tokens (and ~15× vs a chat). The 90.2% gain was Opus 4 lead + Sonnet 4 subagents vs single Opus 4 — not a portable number.
The full trade-off is mostly worse for multi-agent: higher tokens, often higher latency, multiple failure points, multiple prompt sets, fragmented coherence. Start single-agent; split only for context protection, parallelization, or specialization.
Decompose by context, not role. planner/implementer/tester/reviewer is the telephone-game anti-pattern; split at true context boundaries.
The verification subagent is the reliable variant — blackbox-test the result; mitigate early victory with “run the complete test suite before marking as passed.”

Subagent Invocation: AgentDefinition, the Agent Tool, and allowedTools

The mechanics beneath the coordinator–subagent pattern — how a subagent is actually invoked. The Agent tool and its three creation paths, the AgentDefinition contract, why "Agent" must be in allowedTools, the single prompt-string channel into a fresh context, what crosses back out, and what a subagent does not inherit (including parent permissions).

Volatility: architectural-pattern

Tools compared: claude-code

Chapter D1.2 established when to reach for a subagent and how to cut the work. This chapter drops one level — to the mechanics of actually defining and invoking one, and reading what crosses each way across the context boundary. The exam tests whether you can read an AgentDefinition, predict whether it will be invoked at all, and say what crosses into the subagent’s fresh context and what crosses back — and what silently does not.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Through which single tool is every subagent invoked, and what was that tool previously named?
Which two AgentDefinition fields are required, and what does each control?
A perfectly-defined subagent never runs. Name the two most likely causes.
A subagent is told “fix the bug we discussed.” Why does it fail, and what is the only channel that could have carried the context in?
When the subagent finishes, what does the parent receive — and what might happen to it on the way back?

Check your answers

The Agent tool — every subagent is invoked through it; it was renamed from Task in Claude Code v2.1.63, so tool-name filters must match both values.
description (natural-language when to use this agent — drives automatic matching) and prompt (the agent’s system prompt: its role and behavior); everything else is optional.
"Agent" is missing from allowedTools (the run gate — the call is never approved) or the description is too vague for Claude to match the task to it (the match gate).
The subagent starts in a fresh context window with no parent conversation, so “the bug we discussed” never crossed; the only inbound channel is the Agent tool’s prompt string.
The parent receives the subagent’s final message as the Agent-tool result — and the parent may summarize it rather than carry it through verbatim.

The Agent tool is the invocation surface

A subagent is invoked through exactly one tool — the Agent tool — and there are three ways to give that tool something to invoke. [Official] Subagents in the SDK · AnthropicT1-official original Everything in this chapter hangs off that one tool: how you define what it runs, how you allow it to run, and what crosses the boundary in each direction.

One naming wrinkle is load-bearing on the exam and in tool-name filters. The tool was renamed from Task to Agent in Claude Code v2.1.63; current SDK releases emit Agent in tool_use blocks but still report Task in the system:init tools list and in result.permission_denials[].tool_name. [Official] Subagents in the SDK · AnthropicT1-official original Code that filters on the tool name must check both values.

AgentDefinition is the subagent’s contract

When you create a subagent programmatically, its AgentDefinition is a contract with two required halves: a description that says when to use it and a prompt that says how it behaves. [Official] Subagents in the SDK · AnthropicT1-official original Everything else is optional refinement.

Field	Required	Purpose
`description`	yes	Natural-language when to use this agent — drives automatic matching
`prompt`	yes	The agent’s system prompt: its role and behavior
`tools`	no	Allowed tool names; omit to inherit all of the parent’s tools
`model`	no	Model override (`sonnet` / `opus` / `haiku` / `inherit` / full ID)
`maxTurns`	no	Cap the subagent’s agentic turns (its own budget)

The description does double duty — it is also how Claude decides to invoke the agent automatically (below), so write it specific and keyword-rich rather than generic. [Official] Subagents in the SDK · AnthropicT1-official original

Enabling invocation: `Agent` in `allowedTools`

A defined subagent will not run unless the Agent tool itself is approved. Always include "Agent" in allowedTools to auto-approve subagent invocations; without it, the call falls through to your canUseTool callback or — in dontAsk mode — is denied outright. [Official] Subagents in the SDK · AnthropicT1-official original

The prompt string is the only channel in

A subagent starts in a fresh context window, and the only thing that crosses from parent to child is the Agent tool’s prompt string. [Official] Subagents in the SDK · AnthropicT1-official original

“A subagent’s context window starts fresh (no parent conversation) but isn’t empty. The only channel from parent to subagent is the Agent tool’s prompt string, so include any file paths, error messages, or decisions the subagent needs directly in that prompt.” [Official] Subagents in the SDK · AnthropicT1-official original

What that means concretely — the subagent receives its own system prompt (AgentDefinition.prompt), the Agent-tool prompt, the project CLAUDE.md (loaded via settingSources), and its tool definitions. It does not receive the parent’s conversation or tool results, the parent’s system prompt, or any preloaded skill content. [Official] Subagents in the SDK · AnthropicT1-official original Permissions are part of what does not cross: a subagent does not inherit the parent’s permissions — each runs its own evaluation chain — so a tool the parent could use is not automatically usable by the child. [Official] Configure permissions · AnthropicT1-official original

What crosses back: the return channel

The boundary is asymmetric, and the exam probes the outbound side too. When the subagent finishes, the parent receives the subagent’s final message as the Agent-tool result — but the parent may summarize it rather than carry it through verbatim. If a downstream step depends on the subagent’s exact output (a precise list, a diff, a structured payload), instruct the main agent to preserve the subagent’s result verbatim. [Official] Subagents in the SDK · AnthropicT1-official original Two more facts ride the return path: every message generated inside the subagent carries a parent_tool_use_id linking it to the invoking Agent call, [Official] How the agent loop works · AnthropicT1-official original and the subagent’s transcript persists independently of the main conversation (it survives main-session compaction). [Official] Subagents in the SDK · AnthropicT1-official original

Defining and invoking a read-only reviewer Worked example

A coordinator needs a focused doc reviewer. Programmatic definition, both gates satisfied, both channels handled:

const result = await query({
  prompt: "Use the doc-reviewer agent to check docs/api.md for broken links and stale version numbers.",
  options: {
    allowedTools: ["Read", "Grep", "Glob", "Agent"],   // <-- "Agent" is the RUN gate
    agents: {
      "doc-reviewer": {
        description: "Reviews Markdown docs for broken links, stale refs, and version drift; read-only.",  // MATCH gate: specific + keyword-rich
        prompt: "You are a meticulous documentation reviewer. Report issues as a bulleted list with file:line. Do not edit.",
        tools: ["Read", "Grep", "Glob"],                 // scoped: cannot edit or run commands
      },
    },
  },
});

Trace it: (1) match — the keyword-rich description lets Claude map “check docs/api.md” to the agent; (2) run — "Agent" in allowedTools approves the call; (3) in — the file path docs/api.md travels in the prompt string (the subagent’s fresh window has nothing else); (4) back — the subagent’s bulleted findings return as the Agent-tool result. Because the coordinator will act on exact file:line references, the parent prompt should add: “preserve the reviewer’s findings verbatim.” Drop the "Agent" entry and nothing runs; vague the description to “Reviews things” and nothing matches.

Invocation paths and the one-level limit

Once Agent is allowed, a subagent is invoked one of two ways. [Official] Subagents in the SDK · AnthropicT1-official original

Automatic — Claude matches the subagent’s description to the task. This is why the description must be specific and keyword-rich.
Explicit — name the agent in the prompt (“Use the code-reviewer agent to check the auth module”), bypassing automatic matching.

There is a hard structural limit: subagents cannot spawn subagents. Don’t include Agent in a subagent’s tools array — delegation is one level deep. [Official] Subagents in the SDK · AnthropicT1-official original

Practice

Exercise solutions

Solution ↑ Exercise

Two faults, neither in the tools array. (1) "Agent" is not in the parent’s allowedTools, so the Agent-tool call is never auto-approved — it falls through to canUseTool (or is denied in dontAsk). Fix: add "Agent" to allowedTools. (2) description: "Reviews things" is too vague for automatic matching — descriptions must be specific and keyword-rich. Fix: rewrite it (e.g. “Review Markdown/docs for accuracy, broken links, and stale references; read-only”), or invoke the agent explicitly by name to bypass matching. The tools: ["Read", "Grep", "Glob"] set is exactly right for read-only review — leave it. The two faults map to the two gates: allowedTools is the run gate, the description is the match gate.

Solution ↑ Exercise

Programmatic (an AgentDefinition in the agents option) — the recommended path for SDK apps; filesystem (.claude/agents/*.md, loaded at startup); and the built-in general-purpose agent. The built-in needs no AgentDefinition because it ships with a default description and prompt — Claude can invoke it through the Agent tool with nothing defined, which is why it is the zero-config fallback.

Solution ↑ Exercise

The return channel is lossy by default: only the subagent’s final message returns to the parent, and the parent may summarize it. So the coordinator is acting on a paraphrase of the reviewer’s findings, not the exact file:line list the subagent produced. The fix is to instruct the main agent to preserve the subagent’s result verbatim (and have the reviewer return a structured, easily-quoted format). The subagent’s own transcript being correct is the tell that the loss happened on the way back, not inside the subagent.

Exam essentials

One tool, three creation paths: subagents are invoked via the Agent tool (renamed from Task — filters must match both), created programmatically (agents option, recommended), via filesystem (.claude/agents/*.md), or as the built-in general-purpose agent.
AgentDefinition = description + prompt (both required); tools/model/maxTurns optional. The description drives automatic matching, so make it keyword-rich.
Agent must be in allowedTools or the subagent never runs. Two gates: description matches, allowedTools runs.
The prompt string is the only inbound channel. A subagent gets a fresh context — no parent conversation, no parent system prompt, no preloaded skills — but does get project CLAUDE.md + its tools. Permissions do not inherit; each subagent has its own evaluation chain.
The return channel is asymmetric and lossy: the parent receives the subagent’s final message but may summarize it; instruct the main agent to preserve it verbatim when fidelity matters. parent_tool_use_id attributes a message to its subagent.
Delegation is one level deep: subagents cannot spawn subagents (no Agent in a child’s tools).

Part 1 Chapter 4 Last verified 2026-06-02 Fresh

Multi-Step Workflows: Programmatic vs Prompt-Based Handoff

A multi-step task's control flow is enforced either in your code (programmatic) or in the model (prompt-based). When to choose each, why every step boundary is a handoff that leaks fidelity, the Writer/Reviewer pattern as the handoff that works, how a written artifact makes a handoff survivable, and how a programmatic validation gate rejects-and-retries a bad step before it propagates.

Volatility: architectural-pattern

Tools compared: claude-code

Most real agent work is several steps, not one. The question this chapter answers is not what the steps are but who enforces the order — your application code, or the model itself — and how you keep a bad step from poisoning the next one. That choice is an Evaluate-level judgment: the exam gives you a workflow and asks where the control flow belongs, how the handoff is specified, and where the gate goes.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Name the two places a multi-step workflow’s control flow can live, and one property each buys.
Why is every step boundary a place fidelity can leak?
In the Writer/Reviewer pattern, why must the reviewer not inherit the writer’s context?
What three things does a handoff contract specify, and what file can carry it across a boundary?
A programmatic pipeline produces a malformed step-2 output. What does a validation gate do, and what are its two kinds of check?

Check your answers

In your code (programmatic) — which buys determinism and auditability — or in the model (prompt-based) — which buys flexibility and adaptivity.
Each boundary is a handoff — step N’s output becomes step N+1’s input — and a re-narrated handoff erodes detail at every retelling: the telephone game.
Because the absence of inheritance is the feature — a fresh context isn’t biased toward code it just wrote and cannot rationalize choices it never made.
An objective, an output format, and clear task boundaries — carried across the boundary as a written artifact the next step reads, such as a spec file or a test file.
It rejects and retries — re-prompting the failing step with the specific errors instead of passing the bad output downstream — using a schema/structural check (right shape) and a semantic check (content actually right).

Two places to enforce a workflow

A multi-step workflow’s control flow lives in exactly one of two places. Either your code drives the sequence — run a step, take its output, decide the next call — or the model drives it, having been told the steps in a prompt. The Agent SDK frames the split directly: “With the Client SDK, you implement a tool loop. With the Agent SDK, Claude handles it.” [Official] Agent SDK overview · AnthropicT1-official original

These are not rival products — Anthropic notes the same workflow “translate[s] directly” between the CLI and the SDK. [Official] Agent SDK overview · AnthropicT1-official original The architect’s decision is which layer holds the control flow, and it turns on how much the workflow needs determinism versus flexibility.

Every step boundary is a handoff

Whichever layer enforces the steps, each transition between them is a handoff — the output of step N becomes the input of step N+1 — and a handoff is where information is lost. Chapter D1.2 named the worst case: dividing a tightly-coupled task by role (planner → implementer → tester → reviewer) “creates constant coordination overhead and context loss at handoffs — the telephone game,” spending more tokens coordinating than executing. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original Prompt-based handoffs across a long sequential chain are the most fidelity-fragile arrangement: each step re-narrates the last, and detail erodes at every retelling.

The Writer/Reviewer handoff that works

Not every handoff leaks — the canonical multi-step quality workflow depends on one. In the Writer/Reviewer pattern, one session implements and a second reviews: “A fresh context improves code review since Claude won’t be biased toward code it just wrote.” [Official] Best practices for Claude Code · AnthropicT1-official original Session A writes the rate limiter; Session B reviews the file for edge cases, race conditions, and consistency; Session A then addresses the feedback. The same shape works for tests — “have one Claude write tests, then another write code to pass them.” [Official] Best practices for Claude Code · AnthropicT1-official original

The handoff contract and its artifact

When work does cross a boundary, what crosses must be specified, not assumed. Anthropic’s research system makes each handoff an explicit contract: “Each subagent needs an objective, an output format, guidance on the tools and sources to use, and clear task boundaries.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original That is the same discipline as D1.3’s rule that everything the subagent needs goes in the prompt — applied to every step of a multi-step flow.

The most robust way to carry that contract across a boundary is as a written artifact — a file the next step reads, not prose it re-narrates. Two concrete forms appear in the best-practices guidance:

A spec file. After an interview/planning phase, “start a fresh session to execute it … and you have a written spec to reference.” [Official] Best practices for Claude Code · AnthropicT1-official original The spec, not the conversation, is what crosses to the implementation step.
A test file. In the test/code split, the tests are the contract: one Claude writes them, another writes code to pass them. The implementer’s target is the file, not a description of it.

The validation gate: reject before propagating

A precise contract is also what lets a programmatic pipeline put a gate between steps — a check the step-N output must pass before step N+1 is allowed to consume it. The gate does two kinds of check, and the distinction is the one Domain 4 builds on (D4.4):

On failure, the gate does not pass the bad output downstream — it rejects and retries: re-prompt the failing step with the specific errors, and only advance when the output passes. That is the difference between a programmatic pipeline and a prompt-based one: the gate is enforced in your code, where a malformed step cannot quietly become the next step’s input.

A gated content pipeline Worked example

A programmatic flow: research → draft → [gate] → publish.

Research runs; its notes are written to research.md (the artifact crossing to draft).
Draft produces an article keyed to a contract: { title, sections[≥3], every claim cites a research.md line }.
The gate (your code, not the model) runs two checks on the draft:
- Schema: does it have a title and ≥3 sections, and does every claim carry a citation marker? (parse check)
- Semantic: does each cited line actually exist in research.md? (a fabricated citation passes the schema but fails here)
On failure — say a claim cites a non-existent line — the gate rejects the draft and re-prompts the draft step: “Claim 4 cites research.md:88, which does not exist. Re-cite from real lines.” It loops until the draft passes, then lets publish consume it.

Without the gate, the fabricated citation flows straight into publish — a silent failure caught only by a reader. The gate is the programmatic analogue of D1.1’s rule that a failed step is a result to handle, not something to wave through.

Choosing where the control flow lives

The Evaluate-level call: enforce programmatically when the workflow needs determinism, an audit trail, validation gates between steps, or a fixed and repeatable sequence — the steps are known in advance and you want them to run the same way every time. Stay prompt-based when the path is flexible, the model can sensibly self-direct, and the orchestration code would cost more than it saves.

This pairs with two neighboring decisions: whether to split into multiple agents at all (D1.2) and whether the decomposition is a fixed pipeline or an adaptive one (D1.6). Enforcement locus is how the steps are driven; those chapters cover whether and into what shape. The mechanics of the validation/retry loop itself — schema vs semantic errors, bounded retries — are developed in D4.4.

Practice

Exercise solutions

Solution ↑ Exercise

Choose (b), but split at one boundary only. The fact-check is failing for the exact reason the Writer/Reviewer pattern addresses — a context biased toward the draft it just produced rationalizes its own claims. Hand the fact-check to a fresh context (a second session or a verification subagent), passing an explicit handoff contract: the draft, the claims to verify, the success criteria, the output format. That is a programmatic handoff — your code routes the draft to the reviewer and the verdict back. Keep research → draft coupled in one context: they are tightly coupled and share state, so a handoff there would only leak fidelity. And do not split all four steps into role-agents — that is the telephone-game pipeline, four lossy handoffs where you needed one. The skill is placing the single split where fresh context buys independence.

Solution ↑ Exercise

Programmatic enforcement puts the control flow in your code — you sequence the steps, pass each output to the next, and can gate between them; choose it when you need determinism, an audit trail, validation gates, or a fixed repeatable sequence. Prompt-based enforcement puts the control flow in the model — it is told the procedure and self-directs; choose it when the path is flexible, the model can sensibly adapt, and orchestration code would cost more than it saves.

Solution ↑ Exercise

The schema is doing a structural check — right shape, fields present, types valid — and is missing the semantic check: whether the content is actually correct (valid JSON can still carry fabricated or contradictory data). The gate should not pass a failing output along; it should reject and retry — re-prompt the failing step with the specific error and only advance once the output passes both the structural and semantic checks.

Exam essentials

Two enforcement loci: a multi-step workflow’s control flow lives in your code (programmatic — you sequence steps and gate between them; deterministic/auditable) or the model (prompt-based — told the steps, self-directs; flexible).
Every step boundary is a handoff, and handoffs leak. A sequential chain of role-agents is the most fidelity-fragile arrangement — the telephone game.
Writer/Reviewer is the handoff that works because the reviewer has fresh context — it can’t defend code it never wrote. Don’t let an author review its own work.
Carry the contract in a written artifact — a spec file or a test file the next step reads from disk — so the handoff doesn’t depend on re-narration.
A programmatic validation gate runs a structural check and a semantic check between steps, and on failure rejects and retries rather than propagating the bad output. (The loop in depth: D4.4.)
Choose programmatic for determinism / audit / gates / fixed sequence; prompt-based for flexible self-direction. (Whether to split = D1.2; pipeline vs adaptive = D1.6.)

Part 1 Chapter 5 Last verified 2026-06-08 Fresh

Agent SDK Hooks: Intercepting, Gating, and Normalizing the Loop

A hook is a typed callback the SDK fires at a named lifecycle event — the architect's control plane to intercept, gate, and normalize the agent loop without touching the model. The events to know, the two interception modes (PreToolUse gates, PostToolUse normalizes), the four PreToolUse decisions including what defer does, and the deny-beats-defer-beats-ask-beats-allow precedence.

Volatility: feature-surface

Tools compared: claude-code

The agent loop of D1.1 runs the model’s tool calls automatically. Hooks are how the architect gets between the model and those calls — to block a dangerous one, rewrite its input, or clean up its output — without editing the model’s prompt. The exam tests three things: which events exist, the two modes of intervention, and who wins when hooks disagree.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

What is a hook, and does it consume the agent’s context window?
Which event gates a call before it runs, and which normalizes a result after?
Name the four decisions a PreToolUse hook can return — and say what defer does.
Three hooks return ask, allow, and deny on one event. What happens, and why?
A subagent can’t use a tool the parent could. Why — and what hook cleanly pre-approves it?

Check your answers

A hook is a callback that runs your code in response to an agent event, running in your application process — it does not consume the agent’s context window.
PreToolUse gates a call before it runs; PostToolUse normalizes the result before the model sees it.
allow, ask, deny, and defer — defer ends the query so the host can resume it later from the persisted session, a pause-and-hand-back, not an allow or a block.
The call is blocked: matching hooks run in parallel and the most restrictive result wins, per the precedence deny > defer > ask > allow.
Because subagents do not inherit the parent’s permissions — each runs its own evaluation chain; a PreToolUse hook cleanly pre-approves the tool.

Hooks intercept the loop at named events

A hook is a callback that runs your code in response to an agent event: “Hooks are callback functions that run your code in response to agent events, like a tool being called, a session starting, or execution stopping.” [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original They arrive through two channels that share one lifecycle — programmatic hooks (callbacks in your query() options) and filesystem hooks (shell commands in settings.json, loaded via settingSources). [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original

The events you must recognize

Hooks fire at named lifecycle points. The Python SDK exposes ten; the TypeScript SDK extends the same set to twenty. [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original The ones an architect must recognize:

Event	Fires when	Typical use
`PreToolUse`	a tool call is requested	block or rewrite the call before it runs
`PostToolUse`	a tool returns a result	normalize or replace the result before the model sees it
`PostToolUseFailure`	a tool execution fails	log or handle the error
`UserPromptSubmit`	a prompt is submitted	inject extra context
`Stop`	the agent stops	persist state before exit
`SubagentStart` / `SubagentStop`	a subagent begins / ends	track spawned parallel work
`PreCompact`	compaction is about to run	archive the full transcript first
`PermissionRequest`	a permission dialog would appear	custom permission handling
`Notification`	an agent status message	forward to Slack / PagerDuty

The TypeScript-only additions (PostToolBatch, SessionStart, SessionEnd, Setup, and others) are why SessionStart / SessionEnd are not available as Python SDK callbacks — Python apps needing them load filesystem hooks from settings instead. [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original

`PreToolUse` gates; `PostToolUse` normalizes

The two most-tested events define the two interception modes, and they sit on opposite sides of tool execution. [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original

PreToolUse gates — it runs before the tool and returns a permissionDecision of allow, deny, ask, or defer, optionally with updatedInput to rewrite the call. Blocking a write to a .env file is a PreToolUse hook matching Write|Edit that returns deny when the target path ends in .env.
PostToolUse normalizes — it runs after the tool and returns either additionalContext (appended to the result) or updatedToolOutput (which replaces the output before Claude sees it). Stripping noise, redacting secrets, or reshaping a tool’s response into a clean form is PostToolUse work.

Matchers select which calls a hook sees: they are regex strings tested against the tool name — "Write|Edit", "^mcp__" for all MCP tools, or omitted to match everything. [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original Matchers do not filter by argument; filter on the file path or command inside the callback.

A PreToolUse gate: forbid writes to .env Worked example

The rule: the agent may never write a .env file. That is a gate (a decision before the tool runs), so it is a PreToolUse hook. The matcher selects the write tools; the callback inspects the path and returns a decision:

{
  "hookSpecificOutput": {
    "permissionDecision": "deny",
    "permissionDecisionReason": "Cannot modify .env files"
  }
}

Wiring: a PreToolUse hook with matcher "Write|Edit"; inside the callback, if the target path ends in .env, return the object above; otherwise return "allow". Note what the matcher does not do — it selects by tool name, not by argument, so the .env test must happen inside the callback on the call’s input. Because this is a gate, it has to be PreToolUse: at PostToolUse the write has already happened.

Precedence: `deny` beats `defer` beats `ask` beats `allow`

When several hooks (or permission rules) act on one event, the outcome is decided by a fixed precedence, not by who ran first: “When multiple hooks or permission rules apply, deny takes priority over defer, which takes priority over ask, which takes priority over allow. If any hook returns deny, the operation is blocked regardless of other hooks.” [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original The four decisions are not symmetric, and defer is the one candidates miss:

Hooks and subagents

Two subagent-aware events — SubagentStart and SubagentStop — let you track spawned work, but the operational gotcha is about permissions. As D1.3 established, subagents do not inherit the parent’s permissions; each runs its own evaluation chain. [Official] Configure permissions · AnthropicT1-official original The clean way to pre-approve a subagent’s tools is a PreToolUse hook rather than re-prompting inside every child. [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original

Practice

Exercise solutions

Solution ↑ Exercise

(a) is a gate, (b) is a normalization — opposite sides of tool execution, so they need different events. (a) Use a PreToolUse hook with a matcher of "Write|Edit"; in the callback, inspect the target path and return hookSpecificOutput.permissionDecision: "deny" when it ends in .env. The decision must come before the write runs. (b) Use a PostToolUse hook with a matcher of "Bash"; return hookSpecificOutput.updatedToolOutput containing the result with ANSI codes stripped — updatedToolOutput replaces the output before Claude sees it. (b) can’t reuse PreToolUse because the output does not exist yet when the call is requested; you can only normalize a result after the tool has produced it.

Solution ↑ Exercise

The call is blocked. All matching hooks run in parallel and the most restrictive result wins, so the deny overrides the allow and the ask — the precedence is deny > defer > ask > allow. The one-word rule is restrictive (equivalently, “deny wins”): one hook saying deny is enough to block; permitting requires every hook to agree.

Solution ↑ Exercise

Both expectations are wrong. defer does not run the call — it ends the query so the host can resume it later from the persisted session (a pause-and-hand-back, not an allow). And updatedInput is ignored with defer — that field applies only to allow (or ask). To run a rewritten command, the hook must return allow with updatedInput, not defer.

Exam essentials

A hook is a callback at a named lifecycle event, delivered programmatically (query options) or via filesystem settings. It runs in your process and does not consume agent context.
Two interception modes: PreToolUse gates (returns allow/deny/ask/defer + optional updatedInput); PostToolUse normalizes (updatedToolOutput replaces the result, additionalContext appends).
The four decisions: allow / ask / deny, and defer — ends the query so the host can resume it later from the persisted session (and updatedInput is ignored with defer).
Precedence is deny > defer > ask > allow. Matching hooks run in parallel; the most restrictive wins; one deny blocks.
Don’t rely on hook order (non-deterministic) — write each hook independently.
Subagents don’t inherit permissions; pre-approve their tools with a PreToolUse hook. Watch the silent-failure traps: case-sensitive names, max_turns cut-offs, recursive subagent loops.

Part 1 Chapter 6 Last verified 2026-06-02 Fresh

Task Decomposition: Sequential Pipelines vs Adaptive

Once you've decided to decompose, the structural choice is fixed-in-advance (a sequential pipeline — predictable, cheap, auditable) versus decided-at-runtime (adaptive — the orchestrator scales effort to the task). Why open-ended work can't be hardcoded, why predictable work shouldn't be adaptive, the quantified token cost of choosing adaptive, and the failure modes at both extremes.

Volatility: architectural-pattern

Tools compared: claude-code

Chapter D1.2 settled whether to split a task and where the context boundaries are. This chapter asks a different question about the same task: once it is cut into pieces, is the set of pieces fixed in advance, or decided at runtime? That is the pipeline-versus-adaptive choice, and getting it wrong is expensive in opposite directions — an Evaluate-level judgment the exam probes with concrete tasks.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

What distinguishes a sequential pipeline from adaptive decomposition — and when is each structure set?
Why can’t open-ended research be reduced to a hardcoded pipeline?
Roughly what token multiplier does an adaptive multi-agent flow cost versus a single agent, and does parallelism buy speed or thoroughness?
Name the over-decomposition failure mode and the heuristic that guards against it.
Which structure is usually programmatically enforced, and why?

Check your answers

A sequential pipeline’s steps are hardcoded and fixed at design time; in adaptive decomposition an orchestrator decides the shape at runtime, scaling subtasks to the input.
Open-ended research is “inherently dynamic and path-dependent” — step N+1 depends on what step N discovered, so no design-time sequence can capture it.
Roughly 3–10× more tokens than a single agent (about 15× a chat), and parallelism buys thoroughness, not speed — coordination plus the slowest subagent often make wall-clock slower.
Over-decomposition — e.g. “spawning 50 subagents for simple queries” — guarded by the effort-scaling heuristic (1 agent / 2–4 subagents / 10+, sized to complexity).
The sequential pipeline — your code drives the fixed sequence, because nothing about the structure needs to be decided live.

Two shapes of a decomposed task

A decomposed task takes one of two structural shapes, distinguished by when the structure is determined.

The difference is not how many agents run but who decides the shape and when: the author, in advance, or the orchestrator, on the fly. Each is right for a different kind of task.

When the path can’t be hardcoded → adaptive

Some work resists a fixed pipeline by its nature. Anthropic’s research system is explicit about why: “Research work involves open-ended problems where it’s very difficult to predict the required steps in advance. You can’t hardcode a fixed path for exploring complex topics, as the process is inherently dynamic and path-dependent.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original When step N+1 depends on what step N discovered, no design-time sequence can capture it.

Adaptive systems handle this by scaling effort to the input. The research system embeds the heuristic directly in its lead-agent prompt: “Simple fact-finding requires just 1 agent with 3-10 tool calls, direct comparisons might need 2-4 subagents with 10-15 calls each, and complex research might use more than 10 subagents with clearly divided responsibilities.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original

When the path is predictable → pipeline

The opposite case is just as common and far cheaper to run. When a task’s steps are known, repeatable, and the same every time, a fixed sequential pipeline is the right structure: it is deterministic, auditable, and predictable in cost, and it asks the orchestrator to make no runtime judgment at all. A pipeline is typically programmatically enforced (D1.4) — your code drives the fixed sequence — precisely because nothing about the structure needs to be decided live. Reaching for adaptivity here is wasted capability: you pay for an orchestrator’s deliberation to re-derive a structure you already knew at design time.

The cost you are choosing

“Cheaper” and “more expensive” are not hand-waving — the choice has a price tag, and the exam expects the number. An adaptive multi-agent flow “typically use[s] 3-10x more tokens than single-agent approaches for equivalent tasks,” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original and on the absolute scale, “multi-agent systems use about 15× more tokens than chats.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original What that spend buys is thoroughness, not speed — parallel subagents explore a larger space, but coordination plus the slowest subagent often make the wall-clock slower, not faster. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original

The same data, two shapes Worked example

Two tasks that look superficially similar decompose oppositely:

Pipeline — summarize 200 support tickets to a fixed JSON shape. The steps are identical for every ticket (read → extract fields → emit JSON), so the structure is fixed at design time. Cost is bounded: 200 × (one fixed path). You fan the same path across all 200 tickets — parallel throughput, not adaptive structure — and your code enforces it (D1.4). No orchestrator judgment, no token-multiplier surprise.

Adaptive — “find out whether any competitor shipped feature X; go as deep as the question needs.” Depth is unknowable up front and each finding changes what to look at next (path-dependent), so a lead agent sizes the decomposition at runtime via the ladder: a quick check might be 1 agent, 3–10 calls; a real comparison 2–4 subagents; a deep dive 10+. The capability to branch is the point — but it costs 3–10× the tokens of a single agent, and the guard is the effort ladder so it doesn’t spawn ten subagents to confirm an obvious “no.”

Same input domain (text about a topic); opposite structures, because one path is predictable and one is not.

The failure mode at each extreme

Both shapes fail, in opposite ways, when matched to the wrong task.

Over-decomposition is the adaptive failure: an orchestrator that misjudges effort produces absurd structures. Among the research system’s documented early failures was “spawning 50 subagents for simple queries” — capability with no judgment behind it, multiplying token cost for nothing. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The effort-scaling heuristic exists to mitigate exactly this.

Rigid pipelining is the inverse: forcing a fixed sequence onto path-dependent work, which then cannot adapt when a step surfaces something the design didn’t anticipate. And a pipeline cut by role rather than context is the telephone-game anti-pattern of D1.2 — sequential phases of one coupled task, losing fidelity at every handoff. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original

Choosing the structure

The Evaluate-level call comes down to predictability. Choose a sequential pipeline when the steps are knowable in advance and you value determinism, auditability, and bounded cost. Choose adaptive decomposition when the task is open-ended and path-dependent and that capability is worth the orchestrator overhead and the 3–10× variable cost. Either way, the orchestrator (or the author) must size effort to the task — adaptivity does not excuse you from judgment; it relocates it to runtime.

Practice

Exercise solutions

Solution ↑ Exercise

(a) Sequential pipeline. (b) Adaptive. (a) The steps are known and identical for every ticket, and the output is a fixed shape — nothing to decide at runtime, so a deterministic pipeline wins on cost, auditability, and predictability. (You fan the same fixed path across 200 tickets — parallel throughput, not adaptive structure.) (b) The depth is unknowable up front and each finding changes what to look at next — “you can’t hardcode a fixed path… inherently dynamic and path-dependent” — so the orchestrator must scale the decomposition to what it discovers. On (b) guard against over-decomposition (don’t spawn ten subagents to confirm an obvious “no” — size effort via the 1 / 2–4 / 10+ ladder), and accept that you are paying roughly 3–10× the tokens of a single agent for the thoroughness.

Solution ↑ Exercise

Hardcode a sequential pipeline when the steps are knowable and identical in advance (predictable, repeatable); decompose adaptively only when the task is open-ended and path-dependent — when step N+1 depends on what step N discovers, so no design-time sequence can capture it.

Solution ↑ Exercise

They are paying roughly 3–10× the tokens of a single-agent pipeline (about 15× a chat). That spend buys thoroughness — a larger explored space — not speed (coordination plus the slowest subagent often make it slower in wall-clock). It is wasted here because a fixed, predictable nightly job has nothing to decide at runtime: the orchestrator’s judgment is re-deriving a structure already known at design time, so you pay the multiplier for capability the task never needed. A deterministic pipeline is the correct, bounded-cost shape.

Exam essentials

Two shapes, distinguished by when the structure is set: a sequential pipeline is fixed at design time; adaptive decomposition is decided at runtime by the orchestrator.
Adaptive when the path can’t be hardcoded — open-ended, path-dependent work where step N+1 depends on step N. Scale effort to the input (the 1 / 2–4 / 10+ heuristic).
Pipeline when steps are predictable — deterministic, auditable, bounded cost, usually programmatically enforced (D1.4). Adaptivity here is wasted.
Adaptive costs 3–10× the tokens of a single agent (≈15× a chat), and buys thoroughness, not speed. A pipeline’s cost is bounded; an adaptive flow’s is variable and can spike.
Two opposite failure modes: over-decomposition (“50 subagents for a simple query”) for adaptive; rigid/role-based pipelining (the telephone game) for fixed.
Choosing turns on predictability — and either way the orchestrator must size effort to the task. (Whether to split = D1.2; how to enforce = D1.4.)

Part 1 Chapter 7 Last verified 2026-06-02 Fresh

Session State: resume, fork, and Scratchpads

A session is the persisted conversation, not the filesystem. The architect's tools for carrying or branching state across context windows — continue, resume, and fork, with their literal Python/TS spellings — plus the encoded-cwd resume trap and the discipline of capturing durable artifacts as application state rather than shipping transcripts.

Volatility: architectural-pattern

Tools compared: claude-code

The loop of D1.1 runs inside a single context window. Real agents outlive one window — they pause, resume on another host, or branch to try an alternative. The state that survives is the session, and this chapter fixes what a session is, the three controls that carry or branch it, and the one discipline that outlasts the session itself. This closes Domain 1: the loop, scaled across many windows.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

What does a session persist — and what does it not?
Write the literal Python (or TS) call for continue, resume, and fork.
If a forked agent edits a file, is the change isolated from the original session? Why?
A resume call returns an empty, fresh session. What is the single most likely cause?
You must finish an interrupted job on a different host tomorrow. What is more robust than shipping the transcript?

Check your answers

The conversation — prompt, tool calls, results, and responses as JSONL under ~/.claude/projects/<encoded-cwd>/ — not the filesystem; reverting file changes is file checkpointing’s job.
continue_conversation=True / continue: true; resume=sessionId / resume: sessionId; fork_session=True / forkSession: true.
No — forking branches the conversation history, not the filesystem, so both forks share one disk and the edit is real and visible to any session in that directory.
A mismatched cwd — sessions are looked up under ~/.claude/projects/<encoded-cwd>/, so resuming from a different directory derives the wrong path and starts fresh.
Capture the artifacts you care about as application state (analysis output, decisions, file diffs) and pass them into a fresh session’s prompt — more robust than shipping transcript files around.

A session is the persisted conversation

A session is the conversation history — the prompt and every tool call, tool result, and response — persisted as JSONL on disk at ~/.claude/projects/<encoded-cwd>/<session-id>.jsonl. [Official] Work with sessions · AnthropicT1-official original The boundary that matters most is what a session does not include:

“Sessions persist the conversation, not the filesystem. To snapshot and revert file changes the agent made, use file checkpointing.” [Official] Work with sessions · AnthropicT1-official original

continue, resume, and fork

Three controls carry or branch a session — and the exam expects you to recognize their literal spellings, which differ between Python and TypeScript. [Official] Work with sessions · AnthropicT1-official original

Control	Literal call (Python / TS)	What it does	When to reach for it
`continue`	`continue_conversation=True` / `continue: true`	Picks up the most recent session in the current `cwd` — no ID needed	Resume after a process restart in the same directory
`resume`	`resume=sessionId` / `resume: sessionId`	Picks up a specific session by ID	Multi-user / multi-conversation apps where “most recent” is ambiguous
`fork`	`fork_session=True` / `forkSession: true`	Starts a new session ID from a copy of the original’s history; the original is untouched	Try an alternative direction without losing the original thread

continue and resume extend one thread; fork splits one into two. Capture the ID you’ll need later from ResultMessage.session_id — it is present even on errors. [Official] Work with sessions · AnthropicT1-official original

Fork branches the conversation, not the filesystem

The most-missed property of forking is the same boundary from section 1, sharpened:

“Forking branches the conversation history, not the filesystem. If a forked agent edits files, those changes are real and visible to any session working in the same directory.” [Official] Work with sessions · AnthropicT1-official original

Resume to recover — and the encoded-cwd trap

resume is the recovery tool for a loop that ended on a budget. When a session stops with error_max_turns (D1.1), you resume it with a higher limit and let it finish rather than restarting from scratch. [Official] Work with sessions · AnthropicT1-official original Because the work hit a budget, not a wall, the transcript is intact and resumable. [Official] How the agent loop works · AnthropicT1-official original

Recovering an error_max_turns session Worked example

A first run is bounded too tight and stops with subtype: "error_max_turns" (D1.1) — its .result is empty, but its transcript is intact. Recover it instead of restarting:

# First run returned error_max_turns; we captured its session_id from ResultMessage.session_id
async for message in query(
    prompt="Continue the refactor where you left off.",
    options=ClaudeAgentOptions(resume=session_id, max_turns=40),  # specific session, bigger budget
):
    ...

Two things must be right or resume silently starts fresh: (1) you pass the specific session_id (captured from the first run’s ResultMessage, which is populated even on the error), and (2) you run from the same cwd as the original, because the lookup path is ~/.claude/projects/<encoded-cwd>/. Bump max_turns so the resumed session can actually finish. Restarting from scratch would re-pay every turn already spent; resuming continues the intact transcript.

The single most common resume bug is the encoded-cwd mismatch:

“If a resume call returns a fresh session instead of the expected history, the most common cause is a mismatched cwd. Sessions are stored under ~/.claude/projects/<encoded-cwd>/*.jsonl, where <encoded-cwd> is the absolute working directory with every non-alphanumeric character replaced by -.” [Official] Work with sessions · AnthropicT1-official original

Scratchpads: durable state beyond the session

Sometimes the session itself is the wrong unit to carry — especially across hosts, where a CI worker or ephemeral container won’t have yesterday’s transcript file. The robust move is to lift the state you care about out of the conversation: “capture the artifacts you care about (analysis output, decisions, file diffs) as application state and pass into a fresh session’s prompt,” which the docs call “often more robust than shipping transcript files around.” [Official] Work with sessions · AnthropicT1-official original A scratchpad — a working file the agent writes to and reads from — is that same discipline applied within a run: the durable artifact, not the transcript, is the thing that survives. A fresh session then starts from that artifact, the way the best-practices guidance recommends executing a written spec in a clean session. [Official] Best practices for Claude Code · AnthropicT1-official original

This is where session state shades into memory — the design rationale for persisting durable context across many sessions is optional depth, in the Further reading.

Practice

Exercise solutions

Solution ↑ Exercise

(b) is more robust. (a) To resume by ID, the original session JSONL must be restored to the same path — ~/.claude/projects/<encoded-cwd>/<session-id>.jsonl — and the fresh worker must run from the same cwd, because the encoded-cwd is derived from the working directory; only then does resume=sessionId (with a bumped max_turns) find the transcript. That means shipping the transcript file and reproducing the exact directory on every worker. (b) Instead, capture the artifacts that matter — the decisions made, the diff so far, the remaining plan — as application state and seed a fresh session’s prompt with them. No transcript to ship, no cwd to reproduce; the docs call this “often more robust than shipping transcript files around.” Resume is for recovering in place; cross-host work favors captured artifacts. (And note: neither option restores the files the agent edited — that is file checkpointing, separate from session state.)

Solution ↑ Exercise

C — fork. Forking starts a new session ID from a copy of the original’s history, and the original session is left untouched — exactly “branch to try an alternative without losing the original thread.” A (continue) and B (resume) both extend the same thread, so the alternative attempt would pollute the original conversation, not branch from it. D (file checkpointing) snapshots files, not the conversation — useful if the risky refactor must be revertible on disk, but it does not give you a second conversation. (Reminder: fork branches the conversation, not the filesystem, so pair it with checkpointing if the files must branch too.)

Solution ↑ Exercise

The most common cause is a mismatched cwd. Sessions are stored under ~/.claude/projects/<encoded-cwd>/*.jsonl, where <encoded-cwd> is the absolute working directory with every non-alphanumeric character replaced by -; if you resume from a different directory, the SDK derives a different encoded path, finds nothing, and starts a fresh session. The fix: run the resume from the same working directory as the original session (or otherwise ensure the encoded-cwd path matches).

Exam essentials

A session is the persisted conversation, not the filesystem — JSONL under ~/.claude/projects/<encoded-cwd>/<session-id>.jsonl. File state is separate (file checkpointing).
Three controls, with literal spellings: continue (continue_conversation=True / continue: true), resume (resume=sessionId / resume: sessionId), fork (fork_session=True / forkSession: true). resume/continue carry; fork branches (new ID from a copy; original untouched).
Fork branches the conversation, not the disk — forked file edits are real and shared. Pair with file checkpointing to branch files.
resume recovers an error_max_turns session with a bumped budget; the #1 bug is a mismatched cwd → wrong encoded path → a fresh, empty session.
For cross-host work, capture artifacts as application state and seed a fresh session’s prompt — more robust than shipping transcripts.

Part 1 · D1 Review

7 exercises across 7 chapters — interleaved review.

d1-01-agentic-loops

d1-01-ex-trace A session runs: (1) Claude calls `Read`; (2) Claude calls `Grep`; (3) Claude calls `Edit`; (4) Claude returns a text summary with no tool call. How many *turns* is this? What is the `stop_reason` on the final response, and what is the smallest `max_turns` that still lets the session finish?

d1-02-coordinator-subagent-patterns

d1-02-ex-decide A support agent has 40 tools spanning CRM, order history, messaging, and analytics, and its accuracy on multi-step tickets is falling. A teammate proposes splitting it into four subagents: *intake*, *diagnosis*, *resolution*, *follow-up*. Walk the decision framework: is multi-agent warranted here, and if so, is this the right way to cut it? State the win-condition (if any) and name the anti-pattern (if any).

d1-03-subagent-invocation

d1-03-ex-fix-delegation You define a read-only `doc-reviewer` subagent via the `agents` option — `description: "Reviews things"`, `tools: ["Read", "Grep", "Glob"]` — but it never triggers; the main agent just reviews inline. The parent's `allowedTools` is `["Read", "Edit", "Bash"]`. Diagnose the **two** reasons it won't delegate, and give the fix for each. Is the `tools` array itself part of the problem?

d1-04-multi-step-workflows

d1-04-ex-pipeline A content workflow runs four steps — research → draft → fact-check → polish — today as a single agent in one context. Quality is slipping specifically at the fact-check step: the agent waves through claims it wrote moments earlier. Evaluate two options: (a) keep it one prompt-based agent, or (b) hand the fact-check to a fresh-context reviewer with an explicit programmatic handoff. Which do you choose, where exactly do you place the split, and what would over-splitting into four role-agents cost you?

d1-05-agent-sdk-hooks

d1-05-ex-gate-normalize You must enforce two rules on an agent: (a) it may never write to any `.env` file, and (b) every `Bash` result must have its ANSI color codes stripped before the model reads it. For each rule, name the hook event you would use and the specific return field that does the work. Why can't (b) be done with the same event as (a)?

d1-06-task-decomposition

d1-06-ex-pipeline-vs-adaptive Evaluate the right decomposition structure for each task and justify it in one or two sentences. (a) A nightly job that summarizes each of 200 incoming support tickets into the same fixed JSON shape. (b) "Find out whether any of our competitors have shipped feature X — go as deep as the question needs." For whichever you call adaptive, name the failure mode you must guard against and roughly the token cost you are accepting.

d1-07-session-state

d1-07-ex-resume-ci A CI job runs an agent that stops with `error_max_turns` partway through a refactor. The job's container is torn down, and you must finish the work on a **fresh worker** tomorrow. Walk two options: (a) resume the original session by ID — what must be in place for that to work? — and (b) capture artifacts as application state. Which is more robust across hosts, and why?

Part 2 Chapter 1 Last verified 2026-06-02 Fresh

Effective Tool Interfaces: Descriptions, Boundaries, and Naming

A tool's caller-facing contract — description, input examples, operation boundary, name, and response shape — is what a non-deterministic model reads to select and use it. Why the description is the highest-leverage surface, how input_examples show correct usage, when to consolidate, how to namespace, and the object schemas (input and output) every interface stands on.

Volatility: architectural-pattern

Tools compared: claude-code

Part I built the agent and its orchestration; Part II turns to the tools that agent reaches for. A tool is a contract between a deterministic system and a non-deterministic caller — and the architect’s leverage is not the implementation behind it but the surfaces the model actually reads: the description, the input examples, the operation boundary, the name, and the response. Get those right and a capable model selects the tool correctly; get them wrong and no amount of model quality rescues it.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Which single field on a tool definition moves its performance the most, and what is the documented length floor?
What does input_examples do, and what is the one hard rule every example must satisfy?
You have create_pr, review_pr, and merge_pr. What is the documented redesign, and why?
Give the namespaced names for a GitHub search and a Jira search, and the mcp__ form an MCP tool surfaces as.
What is structurally required of every tool’s input schema — and what optional schema governs its output?

Check your answers

The description — “by far the most important factor in tool performance” — with a documented floor of at least 3–4 sentences per tool description, more if the tool is complex.
input_examples is an array of example argument objects that show the model correct calls; each example must validate against the tool’s input_schema, or the request returns a 400.
Consolidate them into a single tool with an action parameter — fewer, more capable tools reduce selection ambiguity.
github_search and jira_search; an MCP tool surfaces as mcp__<server>__<tool> (e.g. mcp__github__list_issues).
The input schema must be a JSON Schema object (a no-argument tool still declares an empty object); the optional MCP outputSchema governs the output, obligating conforming structuredContent.

The description is the highest-leverage surface

Of every field on a tool definition, the description moves performance the most: detailed descriptions are “by far the most important factor in tool performance.” [Official] Define tools · AnthropicT1-official original A description is not documentation for a human reader — it is the surface the model selects from, so it must spell out what the tool does, when it should be used (and when it should not), what each parameter means, and any caveats. [Official] Define tools · AnthropicT1-official original The guidance even sets a floor: aim for “at least 3-4 sentences per tool description, more if the tool is complex.” [Official] Define tools · AnthropicT1-official original

The gap is concrete. A get_stock_price described as “Retrieves the current stock price for a given ticker symbol… returns the latest trade price in USD… It will not provide any other information” tells the model exactly when to reach for it and what it gets back; the same tool described as “Gets the stock price for a ticker” leaves it guessing about inputs, outputs, and boundaries. [Official] Define tools · AnthropicT1-official original

Show correct usage with input_examples

The description tells the model how to use a tool; input_examples show it. This optional field carries an array of example argument objects that demonstrate correct calls — the documented “Tool Use Examples” feature. [Official] Define tools · AnthropicT1-official original A weather tool can ship three: a full call, a call with a different unit, and a call that omits the optional field — teaching the model the shape by demonstration rather than prose.

The one hard rule: each example must validate against the tool’s input_schema, or the request returns a 400. [Official] Define tools · AnthropicT1-official original Two more facts for the exam: input_examples are for client (user-defined) tools, not server-side tools, and they cost roughly 20–50 tokens for simple examples, 100–200 for complex nested ones — a context cost you pay deliberately where ambiguity is high. [Official] Define tools · AnthropicT1-official original

A description plus input_examples Worked example

A get_weather tool, with the two model-facing surfaces working together:

{
  "name": "get_weather",
  "description": "Get the current weather for a location. Use when the user asks about present conditions; not for forecasts. `unit` is optional and defaults to celsius.",
  "input_schema": {
    "type": "object",
    "properties": {
      "location": { "type": "string" },
      "unit": { "type": "string", "enum": ["celsius", "fahrenheit"] }
    },
    "required": ["location"]
  },
  "input_examples": [
    { "location": "San Francisco, CA", "unit": "fahrenheit" },
    { "location": "Tokyo, Japan", "unit": "celsius" },
    { "location": "New York, NY" }
  ]
}

The third example deliberately omits unit to show it is optional. Every example validates against input_schema — if you typo "unit": "kelvin", the example fails the enum and the whole request 400s, so the examples double as a self-check on your schema. The description draws the boundary (“not for forecasts”); the examples remove any doubt about argument shape.

Consolidate operations to reduce selection ambiguity

The next surface is the operation boundary — how much each tool does. The documented default is to consolidate: “Consolidate related operations into fewer tools. Rather than creating a separate tool for every action (create_pr, review_pr, merge_pr), group them into a single tool with an action parameter. Fewer, more capable tools reduce selection ambiguity.” [Official] Define tools · AnthropicT1-official original Every extra near-equivalent tool is one more line the model can pick wrong.

The deeper principle is to design for the agent’s affordances, not mirror your API’s endpoints: rather than make the model chain list_users + list_events + create_event, give it one schedule_event; rather than get_customer_by_id + list_transactions + list_notes, give it get_customer_context. [Official] Writing tools for agents · AnthropicT1-official original A tool that returns exactly the workflow the agent needs beats three tools it must orchestrate.

Namespace tool names by service

A name is the model’s fastest disambiguator, and the documented convention is to namespace by service: “Use meaningful namespacing in tool names… prefix names with the service (e.g., github_list_prs, slack_send_message). This makes tool selection unambiguous as your library grows.” [Official] Define tools · AnthropicT1-official original Bare search becomes a liability the moment a second search exists; github_search and jira_search never collide.

Names also carry hard constraints that differ by regime. A Claude API tool name must match ^[a-zA-Z0-9_-]{1,64}$. [Official] Define tools · AnthropicT1-official original An MCP tool name should be 1–128 characters of ASCII letters, digits, underscore, hyphen, or dot — no spaces — and unique within its server. [Official] Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original Those MCP tools then reach the agent through a fixed pattern, mcp__<server>__<tool>: a list_issues tool on a server keyed github becomes mcp__github__list_issues. [Official] Connect to external tools with MCP · AnthropicT1-official original

Return only high-signal information

The response is the half of the contract authors forget. The model reads every token a tool returns, so a tool should “return only high-signal information… semantic, stable identifiers (e.g., slugs or UUIDs) rather than opaque internal references, and include only the fields Claude needs to reason about its next step.” [Official] Define tools · AnthropicT1-official original Bloated responses waste the context window and bury the fields that matter. The shape of the response also shapes the next call: a semantic identifier the model can pass straight into the following tool keeps a multi-step task cheap; an opaque internal handle forces a re-lookup. [Official] Writing tools for agents · AnthropicT1-official original

When the response should be machine-shaped, MCP lets a tool declare an optional outputSchema — and when it does, the server MUST return structuredContent conforming to that schema (mirroring it in a text block for compatibility). [Official] Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original That is the output-side analogue of the required input schema; the structured-output machinery that drives it is Domain 4’s subject (D4.3).

The structural floor: an object input schema

Beneath the design judgments sits a requirement no interface can skip. Every tool’s input schema is a JSON Schema object: in the Claude API a tool definition’s three required fields are name, description, and an input_schema object; [Official] Define tools · AnthropicT1-official original in MCP the inputSchema is required and must be a valid JSON Schema object, not null. [Official] Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original A tool that takes no arguments still declares an empty object schema — the object is the floor every interface stands on.

Practice

Exercise solutions

Solution ↑ Exercise

Consolidate the three into one get_customer_context tool (namespace it — e.g. crm_get_customer_context — if the agent spans services). Its description should state what it returns and when to use it: “Returns a customer’s profile, recent transactions, and notes for a given customer ID; use it whenever you need context about a customer before acting.” The redesign applies consolidation (fewer, more capable tools reduce selection ambiguity) and design-for-affordances (one call returns the context the agent needs instead of three CRUD calls it must chain). The agent stalled because three thin tools forced multi-step chaining the descriptions never made obvious; a single high-signal response also lets any follow-up call reuse the returned identifiers cheaply.

Solution ↑ Exercise

The most likely cause is that one of the examples does not validate against the tool’s input_schema — an invalid input_examples entry returns a 400. Every example must conform to the same input_schema the real calls do (right types, required fields present, enum values legal); a single bad example (a typo’d enum, a missing required field) fails the whole request. The examples have to agree with input_schema — which is also why they double as a check on the schema itself.

Solution ↑ Exercise

A good description must add, at minimum: (1) what the tool does concretely (not “gets data” but which data, in what form); (2) when to use it and when not to — the boundary that prevents misrouting; (3) what each parameter means (and what the response returns). Aim for 3–4 sentences. The audience is the model, which selects tools by description alone and never reads the implementation — so an opaque description is a performance bug the model cannot route around, making the description the single highest-leverage fix (“by far the most important factor in tool performance”). Adding input_examples compounds the gain by showing correct argument shape.

Exam essentials

The description is the highest-leverage surface — “by far the most important factor in tool performance.” Say what the tool does, when (and when not) to use it, and what each parameter means; 3–4 sentences minimum.
input_examples show correct usage — an array of example argument objects; each must validate against input_schema (invalid → 400). Client tools only, not server tools; ~20–50 / ~100–200 tokens.
Consolidate to reduce selection ambiguity — fewer, more capable tools (an action parameter over create_pr/review_pr/merge_pr); design for the agent’s affordances, not your API’s endpoints.
Namespace names by service — github_list_prs, not a bare search. API names match ^[a-zA-Z0-9_-]{1,64}$; MCP names are 1–128 ASCII chars and surface as mcp__server__tool.
Return only high-signal information — semantic, stable identifiers and only the fields the model needs. MCP’s optional outputSchema governs the machine-shaped output (server must then return conforming structuredContent).
The input schema must be an object — the structural floor of every tool; strict: true (D2.2/D2.3) then makes inputs conform to it.

Part 2 Chapter 2 Last verified 2026-06-02 Fresh

Structured Error Responses: isError, Retryability, and the Protocol-Error Split

A tool's failure contract — which channel a failure travels down, what its text says, and whether the schema could have prevented it. The two regimes (Messages-API is_error vs MCP isError + JSON-RPC), why is_error turns a failure into a recoverable signal, the normative execution-vs-protocol error split, and the difference between steering a retry and preventing the error.

Volatility: architectural-pattern

Tools compared: claude-code

D2.1 designed a tool’s happy path; this chapter designs its failure path. When a call goes wrong, the architect decides three things: which channel the failure travels down, what the failure text says, and whether the schema could have prevented it at all. But “which channel” depends on which regime you are in — and conflating the Claude Messages API with MCP is the most common mistake here, so we separate them first.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

In the Claude Messages API, what field flags a failed tool result — and what is its exact casing?
In MCP, what are the two error channels, and which one does a validation failure belong in?
Does the Claude Messages API have a JSON-RPC -32602 channel for a bad tool call? If not, where does a protocol-level problem surface?
How many times does Claude retry a bad call, and is that a parameter you can set?
Which failures does strict: true prevent, and which does it not?

Check your answers

is_error: true on the tool_result block — snake_case (is_error); the camelCase isError belongs to MCP.
Execution errors ride isError: true inside a successful result; protocol errors ride a JSON-RPC error response (e.g. -32602). A validation failure belongs in the isError channel (SEP-1303), so the model can correct and retry.
No — the Messages API has one tool-failure signal (is_error); a protocol-level problem surfaces as an HTTP error (e.g. 400), not a JSON-RPC channel.
2–3 times with corrections before apologizing — documented default behavior, not a parameter you can set; your only lever is the quality of the error content.
strict: true prevents schema violations (missing parameters, type mismatches); it does not prevent runtime API errors, business-logic violations, or semantic constraints a JSON Schema can’t express.

Two regimes, two spellings

“Structured error” means two related-but-distinct things depending on the surface you are on, and the exam (and real code) punish conflating them.

The casing is the tell: is_error is the Claude Messages API; isError is MCP. The two-channel (isError vs JSON-RPC) split below is an MCP model — the direct Messages API has only the single is_error signal for tool failures.

`is_error: true` is the canonical failure signal (Messages API)

On the Claude Messages API, a failed tool still returns a tool_result — but flagged. is_error: true is the canonical signal that a tool call failed: Claude folds the error into its next-turn reasoning and may retry. [Official] Handle tool calls · AnthropicT1-official original The flag is what turns a failure into a message to the model rather than a dead end — a result whose content reads ConnectionError: the weather service API is not available (HTTP 500) with is_error: true lets the next turn reason about what to do. [Official] Handle tool calls · AnthropicT1-official original The design principle is two lines: set is_error: true on the tool_result block, and make the content text actionable. [Official] Writing tools for agents · AnthropicT1-official original

Write instructive error messages

The flag says that it failed; the content must say what to do next. The documented principle is explicit: “Write instructive error messages. Instead of generic errors like ‘failed’, include what went wrong and what Claude should try next, e.g., ‘Rate limit exceeded. Retry after 60 seconds.’ This gives Claude the context it needs to recover or adapt without guessing.” [Official] Handle tool calls · AnthropicT1-official original

Execution vs protocol errors: the MCP channel split

Within MCP, a failure travels down one of two channels, and the choice is normative, not stylistic. The specification draws the line: isError: true inside a successful result is for execution errors the model should self-correct on — input validation, API failures, business-logic errors — while a JSON-RPC error response is for protocol errors the model cannot fix, such as an unknown tool or a malformed request. [Official] Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original

The most-tested trap lives on this line: a validation failure belongs in the isError channel, not in a JSON-RPC -32602. The 2025-11-25 spec (per SEP-1303) is explicit that input-validation errors return as isError: true content — for example, Invalid departure date: must be in the future. Current date is 08/08/2025. — so the model can correct and retry. [Official] Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original

A past departure date: which channel? Worked example

A booking tool receives departure_date in the past. Two ways to report it — one right, one wrong:

// WRONG (MCP) — a JSON-RPC protocol error for a recoverable input problem:
{ "jsonrpc": "2.0", "id": 7,
  "error": { "code": -32602, "message": "Invalid params" } }   // model can't read/recover

// RIGHT (MCP) — an execution error in the result, addressed to the model:
{ "content": [{ "type": "text",
    "text": "Invalid departure date: must be in the future. Current date is 2026-06-02." }],
  "isError": true }                                            // model corrects and retries

A past date is an input-validation / business-logic failure — exactly what isError: true exists for (SEP-1303). Sending it as -32602 routes a recoverable error to the host, which silently denies the model its retry. On the Claude Messages API the same failure is a tool_result with is_error: true and that actionable text — same idea, different spelling, no JSON-RPC channel involved.

Retryability is documented behavior, not a parameter

The model already retries failed calls on its own: “If a tool request is invalid or missing parameters, Claude will retry 2-3 times with corrections before apologizing to the user.” [Official] Handle tool calls · AnthropicT1-official original That loop is why the channel choice and the content quality matter so much — a failure returned as legible error content feeds each retry something to correct against, whereas a protocol error the model cannot read gives it nothing to adjust and burns the budget toward an apology.

Prevent the error: `strict: true`

The cheapest error to handle is the one that never happens. For the largest class of failures — malformed inputs — there is a prevention switch: “To eliminate invalid tool calls entirely, use strict tool use with strict: true on your tool definitions. This guarantees that tool inputs will always match your schema exactly, preventing missing parameters and type mismatches.” [Official] Handle tool calls · AnthropicT1-official original Define tools · AnthropicT1-official original With strict: true the schema-violation error class disappears before it can reach your handler.

Practice

Exercise solutions

Solution ↑ Exercise

(a) Return isError: true content, not a JSON-RPC error. A past departure date is an input-validation / business-logic failure the model can self-correct, and the MCP spec (SEP-1303) routes validation errors to the isError channel, reserving JSON-RPC errors for protocol problems the model cannot fix. (b) Make it actionable, e.g. “Invalid departure date: must be in the future. Current date is 2026-06-02.” (c) No. strict: true guarantees the input matches the schema (a correctly-typed date string), but “must be in the future” is a semantic constraint a JSON Schema type cannot express, so this failure must be caught at runtime and returned as legible error content. (d) Over the Claude Messages API the flag is is_error (snake_case) on the tool_result block — same meaning, different spelling — and there is no JSON-RPC channel at all (protocol problems would be HTTP 400s).

Solution ↑ Exercise

Two things are wrong. (1) Wrong regime: the Claude Messages API has no JSON-RPC error channel — JSON-RPC -32602 is an MCP protocol error. On the direct API a tool failure is a tool_result with is_error: true; protocol-level problems are HTTP errors (e.g. 400), not JSON-RPC. (2) Wrong channel even in MCP: an internal API timeout is an execution error the model could retry against, so even under MCP it belongs in isError: true content, not a protocol -32602. The candidate conflated the two regimes and mis-routed a recoverable error.

Solution ↑ Exercise

Claude retries the call 2–3 times with corrections before apologizing to the user — that is documented default behavior, not something you configure. Making the error content actionable matters because each retry reads that content to decide its correction; there is no retry-count knob, so the legibility of the error text is the only lever you have over whether those 2–3 attempts converge on a fix or burn down to an apology.

Exam essentials

Two regimes, two spellings: Claude Messages API uses is_error (snake_case) on a tool_result, with one tool-failure signal (protocol problems are HTTP errors). MCP uses isError (camelCase) on a CallToolResult, plus a separate JSON-RPC channel for protocol errors. Don’t conflate them.
is_error/isError is the canonical failure signal — it turns a failed result into a message the model reasons over and may retry against. Flag it and make the content actionable (“Rate limit exceeded. Retry after 60 seconds.” beats "failed").
MCP’s two channels, two audiences — isError: true content = execution errors the model self-corrects (validation, API, business logic); a JSON-RPC error = protocol errors it cannot fix (unknown tool, malformed request). Validation errors go in the isError channel (SEP-1303), never JSON-RPC -32602.
Retry is default behavior, not a parameter — Claude retries 2–3 times with corrections; there is no retry-count knob, so error-content quality is your only lever.
Prevent with strict: true — eliminates schema-violation errors entirely; reserve error content for runtime, business-logic, and semantic failures you cannot prevent.

Part 2 Chapter 3 Last verified 2026-06-08 Fresh

Tool Distribution and tool_choice: auto, any, Forced, and none

Controlling whether and which tool the model may call — the four tool_choice modes, the extended-thinking constraint, the any+strict guarantee of a schema-valid call, and the prompt-cache invalidation cost — plus allowedTools scoping versus bypassPermissions, with parallel execution as its own orthogonal axis.

Volatility: feature-surface

Tools compared: claude-code

Defining a good tool (D2.1) and a good failure (D2.2) is only half the architect’s job; the other half is controlling whether and which tool the model may reach for. Two knobs do this at two different scopes: tool_choice steers a single request — force a tool, free the model, or forbid all tools — while the SDK’s allowedTools defines which tools even exist for an agent. The exam tests the four tool_choice modes (especially the one constraint that trips people up), the any+strict guarantee, and the difference between steering a call and distributing a surface.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Name the four tool_choice modes and what each forces.
Which two modes are unusable when extended thinking is on?
You need a guaranteed schema-valid tool call without forcing one specific tool. What two settings do you combine?
What does changing tool_choice do to your prompt cache?
tool_choice vs allowedTools — which steers a request, and which defines the toolbox?

Check your answers

auto (Claude decides — the default when tools are provided), any (must use some tool), forced tool (must use this tool), and none (no tools this turn).
any and forced tool — only auto and none are compatible with extended (or adaptive) thinking; the forced modes error the request.
tool_choice: {"type": "any"} plus strict: true on the tool definition — any guarantees a tool fires, strict guarantees its inputs match the schema.
It invalidates the cached message blocks — tool definitions and the system prompt stay cached, but message content must be reprocessed; keep tool_choice stable across cached turns.
tool_choice steers a single request; allowedTools defines which tools the agent has at all — one shapes a call, the other shapes the toolbox.

The four `tool_choice` modes

tool_choice is the per-request control over tool calling, and it has four documented modes: auto (Claude decides — the default when tools are provided), any (Claude must use some tool but picks which), {"type": "tool", "name": …} (forces one specific tool), and none (no tools this turn — the default when none are provided). [Official] Define tools · AnthropicT1-official original Tool use with Claude · AnthropicT1-official original

Forced modes are incompatible with extended thinking

The constraint the exam loves: only auto and none are compatible with extended thinking; any and forced tool return an error, and adaptive thinking carries the same limitation. [Official] Tool use with Claude · AnthropicT1-official original Define tools · AnthropicT1-official original If you need the model to reason before acting, you cannot also force it to call a tool — the two are mutually exclusive.

Forced modes prefill the assistant message

Forcing has a second, subtler effect. “When you have tool_choice as any or tool, the API prefills the assistant message to force a tool to be used. This means that the models will not emit a natural language response or explanation before tool_use content blocks, even if explicitly asked to do so.” [Official] Define tools · AnthropicT1-official original A forced call therefore cannot also produce a spoken preamble — there is no room before the tool_use block for one.

Guarantee a schema-valid call: `any` + `strict`

any guarantees that a tool fires, but not that its inputs are valid — and forcing one specific tool isn’t always what you want. Compose two switches to get both guarantees at once: “Combine tool_choice: {'type': 'any'} with strict tool use to guarantee both that one of your tools will be called AND that the tool inputs strictly follow your schema. Set strict: true on your tool definitions to enable schema validation.” [Official] Define tools · AnthropicT1-official original any covers that a tool is called; strict: true (D2.2) covers that its arguments match the schema. Together they make “some tool, well-formed” a hard guarantee — the right shape for a classifier or extractor that must always emit structured output through a tool.

A classifier that must always emit valid JSON Worked example

A record_decision tool must be called on every turn with a schema-valid payload ({ "label": "approve" | "deny" | "escalate", "reason": string }). Thinking is off for this step.

tool_choice: {"type": "any"} guarantees a tool is called — with only record_decision available, that is it.
strict: true on record_decision guarantees the label/reason inputs match the schema exactly (no missing field, no out-of-enum label).

// request
{ "tools": [{ "name": "record_decision", "strict": true, "input_schema": { /* label enum + reason */ } }],
  "tool_choice": { "type": "any" } }

The pair makes “some tool, well-formed” a hard guarantee. Note what you cannot add: extended thinking — any is incompatible with it. If you needed the model to reason first, you would drop to auto and lose the hard guarantee, or move the reasoning to a prior, tool-free turn.

Distribution: scope the surface with `allowedTools`

tool_choice steers one request; distribution decides which tools an agent has at all, and in the SDK that knob is allowedTools / disallowedTools. The two behave differently: allowed_tools=["Read", "Grep"] pre-approves the listed tools (others still exist and fall through to the permission mode), while disallowed_tools=["Bash"] removes the tool from the request entirely, so the model never sees it. [Official] Configure permissions · AnthropicT1-official original

For MCP access the documented guidance is to scope with allowedTools rather than open the gates with a permission mode: a mcp__github__* wildcard “grants exactly the MCP server you want and nothing more,” whereas permissionMode: "bypassPermissions" auto-approves MCP tools but disables every other safety prompt — broader than necessary. [Official] Connect to external tools with MCP · AnthropicT1-official original

Parallel execution is a separate request-level control

A third knob is easy to confuse with tool_choice but is orthogonal to it: disable_parallel_tool_use. Claude 4 models may emit several tool_use blocks in one turn by default; setting disable_parallel_tool_use=true caps that — with tool_choice: auto Claude then uses at most one tool, and with any or forced tool it uses exactly one. [Official] Parallel tool use · AnthropicT1-official original

Practice

Exercise solutions

Solution ↑ Exercise

Forced tool mode is incompatible with extended thinking, so the request errors — you cannot both force a specific tool and let the model reason with extended (or adaptive) thinking. To get a schema-valid guaranteed tool call, combine tool_choice: {"type": "any"} (guarantees some tool fires — here only record_decision exists) with strict: true on the tool (guarantees the inputs match the schema). But any is also incompatible with thinking, so to keep the hard guarantee you must give up extended thinking on this turn (or move the reasoning to a prior tool-free turn and force/any the decision on the next). No single configuration gives you a forced, schema-valid call and visible reasoning at once; forcing also prefills the assistant turn and suppresses the preamble.

Solution ↑ Exercise

Use allowedTools: ["mcp__linear__*"] — the wildcard pre-approves exactly the linear server’s tools and nothing else. It is preferable to permissionMode: "bypassPermissions" because bypass auto-approves the MCP tools and disables every other safety prompt across the whole agent (far broader than you need), whereas the scoped wildcard grants exactly the one server and leaves all other gates intact.

Solution ↑ Exercise

Every turn that changes tool_choice invalidates the cached message blocks — tool definitions and the system prompt stay cached, but the message content has to be reprocessed — so alternating auto/forced means roughly every other turn pays full message-processing cost, which is why caching “barely helps.” The fix: keep tool_choice stable across the cached turns (don’t toggle it per turn); if some steps genuinely need a forced tool, group them so the value changes as rarely as possible rather than every turn.

Exam essentials

Four tool_choice modes — auto (default; Claude decides), any (must use some tool), forced {"type":"tool","name":…} (this tool), none (no tools). A spectrum from free to coerced.
Forced modes break extended thinking — only auto and none work with extended (or adaptive) thinking; any and forced tool error. The single most-tested tool_choice constraint.
any + strict: true = a schema-valid guaranteed call — any guarantees a tool fires, strict: true guarantees its inputs match the schema; compose them for classifiers/extractors. (strict is a per-tool property.)
Forced modes prefill the assistant turn — no natural-language preamble before the tool_use block; for preamble plus a specific tool, use auto and ask in the user message.
tool_choice changes invalidate the prompt cache — message blocks must be reprocessed (tool defs + system prompt stay cached); keep tool_choice stable across cached turns.
Distribution ≠ steering — allowedTools defines which tools exist (disallowed_tools removes one entirely); tool_choice steers a request. For MCP, a mcp__server__* wildcard beats bypassPermissions (narrower). Parallelism is its own axis (disable_parallel_tool_use).

Part 2 Chapter 4 Last verified 2026-06-08 Fresh

MCP Server Configuration: .mcp.json, Scopes, and Env-Var Expansion

Wiring an MCP server so it resolves predictably across personal, team, and machine contexts. The two config paths and strictMcpConfig, claude mcp add --scope, the three scopes and their precedence, the local-scope-versus-local-settings trap, env-var expansion for secrets, verifying the connection via system:init, and the transports atop a mid-revision wire protocol.

Volatility: feature-surface

Tools compared: claude-code

D2.1 through D2.3 designed tools, their failures, and their distribution; this chapter is where an external MCP server actually gets connected. Almost every trap here is about location — which file holds the config, which scope it lives in, which directory it resolves against — plus one notorious naming collision and one silent-failure mode: a server that never connected. Get the location right and check the connection, and a server resolves predictably across personal, team, and machine contexts; get it wrong and the agent silently never sees the tools.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Name the two ways to configure an MCP server, and what strictMcpConfig: true does.
What flag on claude mcp add chooses local vs project vs user scope?
Where does a local-scoped MCP server live — and how does that differ from “local settings”?
After wiring a server, how do you confirm it actually connected before the agent runs?
What transports can an MCP server use (name the deprecated one and the .mcp.json-only one), and what is the default connection timeout?

Check your answers

Programmatically (mcp_servers in Python, mcpServers in TypeScript) or via a .mcp.json at the project root; strictMcpConfig: true uses only the servers you pass in mcpServers, ignoring .mcp.json, user settings, and plugins.
--scope <local|project|user> — omit it and the default is Local.
In ~/.claude.json (your home directory, under a per-project key) — not .claude/settings.local.json, which is the project’s general local-settings file.
Read the system:init message before the agent runs — each server’s status is one of connected | failed | needs-auth | pending | disabled.
stdio, sse (deprecated), and http (alias streamable-http), plus ws, configurable only via .mcp.json or claude mcp add-json; the default connection timeout is 60 seconds.

Two ways to configure a server

An MCP server reaches an agent through one of two configuration paths. You can register it programmatically — mcp_servers in Python, mcpServers in TypeScript — or declare it in a .mcp.json file at the project root. [Official] Connect to external tools with MCP · AnthropicT1-official original The file-based path is not automatic, though: .mcp.json loads only when the SDK’s settingSources includes "project". [Official] Use Claude Code features in the SDK · AnthropicT1-official original Connect to external tools with MCP · AnthropicT1-official original

For a reproducible, clean-room config there is a third lever: strictMcpConfig: true uses only the servers you pass in mcpServers, ignoring .mcp.json, user settings, and plugins. [Official] Connect to external tools with MCP · AnthropicT1-official original It is how you guarantee an SDK run sees exactly the servers you declared and nothing the machine happens to carry.

Three scopes — and `claude mcp add --scope`

Claude Code stores MCP servers in three scopes, each a different file with a different audience: Local (~/.claude.json, under a per-project key — current project only, not shared), Project (.mcp.json at the repo root — shared via version control, with a one-time approval prompt on first use), and User (~/.claude.json — available across all your projects). [Official] Connect Claude Code to tools via MCP · AnthropicT1-official original

The CLI is how you actually install one and pick its scope: claude mcp add <name> --scope <local|project|user> --transport <http|stdio|sse> … — for example, claude mcp add --transport http --scope project notion https://mcp.notion.com/mcp registers a project-scoped HTTP server. [Official] Connect Claude Code to tools via MCP · AnthropicT1-official original --scope is the flag that decides who sees the server; omit it and you get Local (the default).

When the same server name appears in more than one scope, Claude Code connects once, using the highest-precedence source: Local → Project → User → plugin-provided → claude.ai connectors (the first three match duplicates by name). [Official] Connect Claude Code to tools via MCP · AnthropicT1-official original

"Local scope” is not “local settings”

The single most confusing collision in MCP configuration is the word local. “MCP local-scoped servers are stored in ~/.claude.json (your home directory), while general local settings use .claude/settings.local.json (in the project directory).” [Official] Connect Claude Code to tools via MCP · AnthropicT1-official original They are different files in different directories — one in your home, one in the project — and they hold different things.

Env-var expansion keeps secrets out of the file

Because a Project-scoped .mcp.json is committed to version control, secrets must never be written into it literally — they are referenced through env-var expansion instead. .mcp.json supports ${VAR} (expands, or fails the parse if unset) and ${VAR:-default} (expands, or uses the default), and the expansion works inside command, args, env, url, and headers. [Official] Connect Claude Code to tools via MCP · AnthropicT1-official original So a committed config carries "Authorization": "Bearer ${API_KEY}", and the key itself lives only in the environment.

Verify the server connected

A wired server that never connected is the silent failure of this chapter — the agent simply runs without the tools and you find out from a confusing answer. Don’t assume; check. Detect connection failures via the system:init message: each server’s status is one of connected | failed | needs-auth | pending | disabled — read it before letting the agent run. [Official] Connect to external tools with MCP · AnthropicT1-official original The default connection timeout is 60 seconds for server initialization, so a slow-starting server may need pre-warming or a lighter-weight package. [Official] Connect to external tools with MCP · AnthropicT1-official original

Wire a project server, then confirm it connected Worked example

Add a project-scoped server with the CLI, then gate the run on its status:

# the CLI sets scope (--scope) and transport (--transport); project => committed .mcp.json
claude mcp add --transport http --scope project notion https://mcp.notion.com/mcp

# Before letting the agent act, read system:init and refuse to run on a bad status:
async for message in query(prompt="…", options=options):
    if message.type == "system" and message.subtype == "init":
        bad = [s for s in message.data.get("mcp_servers", [])
               if s.get("status") != "connected"]
        if bad:
            raise RuntimeError(f"MCP servers not connected: {bad}")  # status: failed / needs-auth / pending / disabled

If notion comes back needs-auth, the OAuth flow hasn’t completed; failed usually means a missing env var, an uninstalled package, a bad connection string, or an unreachable host — and remember the 60-second init timeout. Checking status turns a silent “the tools just aren’t there” into an explicit, actionable failure.

Transports and the snapshot-dated wire protocol

A server’s type selects its transport: stdio for local processes, sse (Server-Sent Events, now deprecated — use HTTP), and http (Streamable HTTP; JSON configs accept streamable-http as an alias for http). A fourth type, ws (WebSocket), is configurable only through .mcp.json or claude mcp add-json, not the --transport flag (whose values are http/stdio/sse). [Official] Connect Claude Code to tools via MCP · AnthropicT1-official original Separately — and this is not a .mcp.json type — the SDK lets you run an MCP server in-process inside your application (an SDK deployment mode, e.g. a built-in tool server), rather than as an external process or endpoint. [Official] Connect to external tools with MCP · AnthropicT1-official original

Beneath the config sits the MCP wire protocol — and it is mid-revision, so cite it with a date. Under the 2025-11-25 specification, an initialize handshake “MUST be the first interaction between client and server,” negotiating protocol version and capabilities before any tool call. [Official] Lifecycle — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original The 2026-07-28 release candidate (locked May 2026; the final spec ships 2026-07-28) removes that handshake for a stateless model, so the wire details here are a dated snapshot, not a permanent contract. [Official] The 2026-07-28 MCP Specification Release Candidate · Model Context ProtocolT2-release-notes original

Practice

Exercise solutions

Solution ↑ Exercise

(a) Project scope — a .mcp.json at the repo root (committed so every clone gets the server), installed with claude mcp add … --scope project. (b) Reference the secret through env-var expansion rather than inlining it, e.g. "env": { "DATABASE_URL": "${DATABASE_URL}" } (or "${DATABASE_URL:-…}" if a safe default exists) — expansion works in env, url, and headers, so no literal credential is committed. (c) The teammate’s definition wins on their machine. Their ~/.claude.json entry is Local scope, and precedence runs Local → Project → User, matched by name — so the local definition overrides the shared project one for them. That is the intended override path for personal credentials, not a conflict.

Solution ↑ Exercise

The mistake is conflating “MCP local scope” with “general local settings.” A local-scoped MCP server is stored in ~/.claude.json (the home directory, under a per-project key), not in .claude/settings.local.json (the project’s machine-local settings). Editing the latter to change an MCP server is a silent no-op — point them at ~/.claude.json.

Solution ↑ Exercise

Inspect the system:init message and its mcp_servers field — each server reports a status. A status other than connected explains the missing tools: failed (missing env var, uninstalled package, bad connection string, unreachable host), needs-auth (OAuth not completed), pending, or disabled. The 60-second default initialization timeout is a common cause of failed/pending for slow-starting servers — pre-warm or use a lighter package. Always read status before letting the agent run rather than discovering the gap from a wrong answer.

Exam essentials

Two config paths — programmatic (mcp_servers/mcpServers) or a .mcp.json at the project root (loads only when settingSources includes "project"). strictMcpConfig: true uses only mcpServers, ignoring .mcp.json/user/plugins.
claude mcp add … --scope <local|project|user> --transport <http|stdio|sse> installs a server and picks its scope; --scope defaults to Local.
Three scopes, three audiences — Local (~/.claude.json, per-project, private), Project (.mcp.json, committed/shared, approval-prompted), User (~/.claude.json, all projects). Precedence: Local → Project → User → plugin → claude.ai, matched by name.
“Local scope” ≠ “local settings” — a local-scoped MCP server lives in ~/.claude.json (home), not .claude/settings.local.json (project).
Env-var expansion — ${VAR} / ${VAR:-default} in command/args/env/url/headers; ${CLAUDE_PROJECT_DIR} needs the :- form in hand-written configs.
Verify the connection — read the system:init status (connected/failed/needs-auth/pending/disabled) before running; the default init timeout is 60s.
Config transports — stdio / sse (deprecated) / http (streamable-http alias), plus ws (WebSocket, .mcp.json-only); the SDK in-process server is a separate deployment mode, not a type. The 2025-11-25 initialize handshake is a dated snapshot the 2026-07-28 RC removes.

Part 2 Chapter 5 Last verified 2026-06-08 Fresh

Built-in Tools: The Roster, Execution Order, and Permission Gating

The fixed roster of built-in tools every agent ships with — Read, Write, Edit, Bash, Grep, Glob — their exact, case-sensitive names, the read-only-versus-state-modifying line that decides which run in parallel, the six permission modes, the five-step evaluation order, and the allow/deny rules that gate them. Closes on the allowlist-is-not-a-sandbox trap.

Volatility: feature-surface

Tools compared: claude-code

D2.1 through D2.4 designed tools, their failure contracts, their distribution, and the wiring of external MCP servers. This chapter steps back to the tools an agent already has on the first turn — the fixed built-in roster — and the permission machinery that decides whether any given one actually fires. The exam angle is recognition: the exact roster, the read-versus-write execution split, the six modes, the evaluation order, and one high-value trap where a developer thinks they have locked an agent down and have not.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Name the six core built-in tools. Does allowed_tools=["read"] pre-approve the file reader?
Which built-in tools may run concurrently, and what single property decides it?
Name the six permission modes; which one restricts the agent to read-only tools?
In what order does the SDK evaluate hooks, deny rules, the permission mode, allow rules, and canUseTool?
You set allowed_tools=["Read"] and permission_mode="bypassPermissions". What can the agent actually do?

Check your answers

Read, Write, Edit, Grep, Glob, Bash — names are matched exactly, so allowed_tools=["read"] pre-approves nothing (the tool is Read).
Read-only tools (Read, Glob, Grep) may run concurrently; the deciding property is whether a tool is read-only or state-modifying — Edit, Write, and Bash run sequentially.
default, acceptEdits, plan, dontAsk, bypassPermissions, auto (TypeScript-only) — plan restricts the agent to read-only tools.
Hooks → Deny rules → Permission mode → Allow rules → canUseTool — deny rules and hooks sit above the mode, so they bind even under bypassPermissions.
Every tool, including Bash, Write, and Edit — allowed_tools only pre-approves and never restricts; the allowlist is not a sandbox.

The built-in tool roster

Every agent starts with a fixed roster of built-in tools — the SDK ships roughly fourteen of them, in six categories, identical to those that power Claude Code. [Official] Agent SDK overview · AnthropicT1-official original How the agent loop works · AnthropicT1-official original The six that do the everyday work of reading and changing a codebase are Read, Write, Edit (file operations), Grep, Glob (search), and Bash (execution). The parity with Claude Code is explicit: “The SDK includes the same tools that power Claude Code,” [Official] How the agent loop works · AnthropicT1-official original and “Everything that makes Claude Code powerful is available in the SDK.” [Official] Agent SDK overview · AnthropicT1-official original Beyond this built-in set sit MCP server tools (Chapter D2.4) and your own custom tools; this chapter is about the built-ins every agent has from the start.

These names are exact — they appear verbatim in allowed_tools / allowedTools rules and as the tool_use.name block in messages, so Read is the tool and read is not. [Official] How the agent loop works · AnthropicT1-official original

Read-only and state-modifying tools run differently

The roster splits along a line that the runtime cares about: whether a tool reads state or changes it. Read-only tools — Read, Glob, Grep, and MCP tools marked read-only — can run concurrently; tools that modify state — Edit, Write, and Bash — run sequentially to avoid conflicts. Custom tools default to sequential execution and opt into parallelism by setting readOnlyHint in their annotations. [Official] How the agent loop works · AnthropicT1-official original

Gating the tools: six permission modes

Having a tool in the roster does not mean it fires. When the model requests a tool, the active permission mode is consulted, and there are six: default, acceptEdits, plan, dontAsk, bypassPermissions, and auto (TypeScript-only). [Official] Configure permissions · AnthropicT1-official original Two are worth memorizing for the exam because they change the tool surface directly: plan restricts the agent to read-only tools, so it explores and proposes a plan without editing source files; acceptEdits auto-approves file edits and filesystem commands (mkdir, touch, rm, rmdir, mv, cp, sed) — but only inside cwd plus additionalDirectories, and paths outside that scope, or protected paths, still prompt. [Official] Configure permissions · AnthropicT1-official original

Concept ·

Mode	What it does to the tool surface
`default`	No auto-approvals; unmatched tools hit your `canUseTool` callback (no callback ⇒ deny).
`acceptEdits`	Auto-approves edits + filesystem commands (`mkdir`/`touch`/`rm`/`rmdir`/`mv`/`cp`/`sed`) inside `cwd`/`additionalDirectories`; other `Bash` follows default rules.
`plan`	Read-only tools only; no edits to source files.
`dontAsk`	Anything not pre-approved by rules is denied; `canUseTool` is never called.
`bypassPermissions`	All tools run without prompts — but deny rules, explicit `ask` rules, and hooks still apply. Cannot run as root on Unix.
`auto` (TS only)	A model classifier approves or denies each call.

Allow and deny rules — and the five-step order

Within a mode, allow and deny rules pre-approve or block specific tools and calls. A bare name and a scoped pattern behave differently: allowed_tools=["Read", "Grep"] auto-approves those tools; disallowed_tools=["Bash"] removes Bash from the request entirely, so the model never sees it; and disallowed_tools=["Bash(rm *)"] keeps Bash available but denies any rm * call — in every mode, including bypassPermissions. [Official] Configure permissions · AnthropicT1-official original All of this resolves through a fixed sequence: “When Claude requests a tool, the SDK checks permissions in this order: 1. Hooks. 2. Deny rules. 3. Permission mode. 4. Allow rules. 5. canUseTool callback.” [Official] Configure permissions · AnthropicT1-official original

Why a deny rule beats bypassPermissions Worked example

An agent runs under permission_mode="bypassPermissions" (chosen to suppress prompts in a headless run) with a guardrail: disallowed_tools=["Bash(rm -rf *)"]. The model requests Bash(rm -rf /data). Walk the five-step order:

Hooks — any PreToolUse hook runs first; suppose none match.
Deny rules — Bash(rm -rf *) matches → denied, here, before the mode is ever consulted.
(Permission mode — never reached for this call.)
(Allow rules — never reached.)
(canUseTool — never reached.)

The rm -rf is blocked even though the mode is bypassPermissions, because deny rules (step 2) and hooks (step 1) sit above the mode (step 3). Flip it around to see the trap: allowed_tools=["Read"] lives at step 4, below the mode — so under bypassPermissions the mode approves everything at step 3 and the allowlist is never consulted. Order is everything: forbid high (hooks/deny), permit low (allow rules).

The high-value trap lives in the gap between pre-approving and restricting. allowed_tools only pre-approves the tools you list; it does not filter everything else out. Set allowed_tools=["Read"] alongside permission_mode="bypassPermissions" and the agent “still approves every tool, including Bash, Write, and Edit.” [Official] Configure permissions · AnthropicT1-official original The allowlist was never a sandbox.

The day-to-day use of these tools — the muscle memory of Read, Edit, and Bash inside a working session — is the handbook’s territory (the Use book’s chapter on Claude Code’s toolset); this chapter is the architect’s exam angle on the roster and the permission surface that gates it.

Practice

Exercise solutions

Solution ↑ Exercise

C. allowed_tools pre-approves the tools you list; it never restricts the ones you omit. Paired with bypassPermissions, the configuration “still approves every tool, including Bash, Write, and Edit” — the allowlist is silently irrelevant (it sits at step 4 of the evaluation order, below the mode at step 3). A is the core misconception (treating the allowlist as a filter). B confuses this with acceptEdits, which is the mode that auto-approves filesystem commands — bypassPermissions approves everything, not just filesystem ops. D invents a conflict; the two settings combine without error, which is exactly why the trap is dangerous. The fix: drop bypassPermissions and use permission_mode="plan" (read-only tools only), or keep a stricter mode and add a deny rule such as disallowed_tools=["Write", "Edit", "Bash"] — deny rules block even under bypassPermissions. Reach for the allowlist to permit, and for the mode or a deny rule to forbid.

Solution ↑ Exercise

The three Read calls may run concurrently; the Edit must run on its own (sequentially). The deciding property is whether a tool is read-only or state-modifying: read-only tools (Read, Glob, Grep) can run in parallel because they cannot conflict, while state-modifying tools (Edit, Write, Bash) run sequentially to avoid clobbering each other. So the runtime can fan out the three reads at once, then run the single edit after.

Solution ↑ Exercise

(a) plan mode — it restricts the agent to read-only tools, so it can explore and propose changes but cannot edit any source file, with no allow/deny list to maintain. (b) acceptEdits — it auto-approves file edits and filesystem commands (mkdir/rmdir/mv/…) inside cwd plus additionalDirectories, while paths outside that scope (and protected paths) still prompt. plan forbids edits entirely; acceptEdits permits them but only within the working scope.

Exam essentials

The roster is fixed and the names are exact — Read, Write, Edit, Grep, Glob, Bash are the six core built-ins (of ~14), identical to Claude Code’s; they appear verbatim in allow/deny rules and as tool_use.name, and a mis-cased name matches nothing.
Read vs write decides parallelism — read-only tools (Read/Glob/Grep) run concurrently; state-modifying tools (Edit/Write/Bash) run sequentially; custom tools default to sequential and opt in via readOnlyHint. Orthogonal to D2.3’s disable_parallel_tool_use.
Six permission modes — default, acceptEdits, plan, dontAsk, bypassPermissions, auto (TS). plan = read-only; acceptEdits = auto-approve edits + filesystem ops (mkdir/touch/rm/rmdir/mv/cp/sed) inside cwd/additionalDirectories, prompt outside.
Five-step evaluation order — Hooks → Deny rules → Permission mode → Allow rules → canUseTool. Deny rules and hooks fire before the mode, so they bind even under bypassPermissions; allow rules fire after it, so the mode can override them.
The allowlist trap — allowed_tools pre-approves, it does not restrict; with bypassPermissions it approves everything regardless. Confine with plan mode or a deny rule, never with the allowlist alone.

Part 2 · D2 Review

5 exercises across 5 chapters — interleaved review.

d2-01-tool-interfaces

d2-01-ex-consolidate-vs-split A teammate ships three tools — `get_customer_by_id`, `list_customer_transactions`, and `list_customer_notes` — and reports that the agent often calls only the first and then stalls. You may redesign the interface. Propose a single consolidated tool (give its name, a one-sentence description, and the boundary it draws), and name the two interface principles your redesign applies.

d2-02-structured-errors

d2-02-ex-channel-and-content Your booking tool (exposed over MCP) receives a departure date in the past. A colleague wants to return a JSON-RPC `-32602 Invalid params` error. (a) Which channel is correct, and why? (b) Write the one-sentence error content the tool should return. (c) Would `strict: true` have prevented this particular failure? Explain. (d) If the same tool were called directly over the Claude Messages API instead of MCP, what changes about the *spelling* of the failure flag?

d2-03-tool-choice-distribution

d2-03-ex-mode-selection You are building an agent that (a) must produce a JSON classification by calling your `record_decision` tool on every turn with valid inputs, and (b) you also want extended thinking enabled so it reasons first. A colleague sets `tool_choice: {"type": "tool", "name": "record_decision"}`. What goes wrong? What configuration gets you the *schema-valid guaranteed* tool call, and what must you give up to also keep thinking?

d2-04-mcp-configuration

d2-04-ex-scope-and-secret Your team needs a `postgres` MCP server available to everyone who clones the repo, but the connection string must not be committed. (a) Which scope and file do you use, and what `claude mcp add` flag sets that scope? (b) Show the `env` field using the documented expansion syntax. (c) A teammate later adds the *same* server name in their personal `~/.claude.json`. Which definition wins on their machine, and why?

d2-05-builtin-tools

d2-05-ex-allowlist-vs-bypass A developer wants a "read-only" agent for an automated audit. They configure it with `allowed_tools=["Read"]`, reasoning that listing only `Read` confines the agent to reading. To suppress every approval prompt in the headless run, they also set `permission_mode="bypassPermissions"`. Which tools can the agent actually use, and what is the one-line fix that would genuinely confine it? - **A.** Only `Read`; `bypassPermissions` still honors the allowlist, so the agent stays read-only as intended. - **B.** `Read` plus the filesystem commands (`mkdir`, `rm`, `mv`, `cp`) that `bypassPermissions` auto-approves. - **C.** Every tool, including `Bash`, `Write`, and `Edit`; `allowed_tools` only pre-approves and never restricts. - **D.** No tools; `allowed_tools=["Read"]` and `bypassPermissions` conflict, so the query raises an error.

Part 3 Chapter 1 Last verified 2026-06-02 Fresh

CLAUDE.md Hierarchy & @import: Four Scopes That Concatenate

How Claude Code assembles persistent instructions from four CLAUDE.md scopes that concatenate without precedence — the opposite of the strict five-level settings hierarchy (Managed > CLI > Local > Project > User) — plus the @import mechanism (depth-5, first-use approval), the AGENTS.md bridge, and the managed claudeMd / claudeMdExcludes controls.

Volatility: architectural-pattern

Tools compared: claude-code

D2.4 resolved MCP servers and settings across a strict precedence where the highest scope wins. The instruction layer looks like the same machinery — files at managed, user, project, and local scopes — but it behaves in the opposite way: the files do not compete, they concatenate. This chapter is the exam angle on that distinction and on the @import mechanism that stitches files together. The design rationale for treating the file as a context budget lives in the Further reading.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Name the four CLAUDE.md scopes and the order they load in.
Two CLAUDE.md files in the chain say contradictory things. Which one “wins”?
Name the five levels of the settings precedence, highest to lowest.
What is the @import recursion depth, and what happens permanently if you decline the first-use prompt?
Claude Code “ignores” a teammate’s AGENTS.md. Why, and what one line fixes it?

Check your answers

Broadest to most specific: Managed policy → User → Project → Local (~/.claude/CLAUDE.md, ./CLAUDE.md or ./.claude/CLAUDE.md, ./CLAUDE.local.md).
Neither — CLAUDE.md files concatenate rather than override; both contradictory instructions sit in context at once, a smell to fix at the source.
Managed → CLI arguments → Local → Project → User — the highest scope wins (exception: permission allow/ask/deny rules merge across scopes).
Maximum recursion depth is 5; declining the first-use approval dialog disables imports permanently — the dialog does not reappear.
Claude Code reads CLAUDE.md, not AGENTS.md; the fix is a one-line bridge — a CLAUDE.md that imports @AGENTS.md.

Four scopes that load broadest-first

Claude Code assembles its persistent instructions from up to four CLAUDE.md scopes, loaded broadest to most specific: Managed policy, then User (~/.claude/CLAUDE.md), then Project (./CLAUDE.md or ./.claude/CLAUDE.md), then Local (./CLAUDE.local.md, which you add to .gitignore). [Official] How Claude remembers your project · AnthropicT1-official original Discovery walks up the directory tree from your working directory; a CLAUDE.md nested below cwd is not loaded at launch but on demand, when Claude first reads a file in that subdirectory. [Official] How Claude remembers your project · AnthropicT1-official original The managed-policy file is the one scope that cannot be excluded by any individual setting — which is exactly what makes it the instrument for org-enforced instructions. [Official] How Claude remembers your project · AnthropicT1-official original

Concatenation, not precedence

Here is the property that separates the instruction layer from every settings file: the discovered CLAUDE.md files do not override one another. “All discovered files are concatenated into context rather than overriding each other. Across the directory tree, content is ordered from the filesystem root down to your working directory.” [Official] How Claude remembers your project · AnthropicT1-official original Within a single directory, CLAUDE.local.md is appended after CLAUDE.md. [Official] How Claude remembers your project · AnthropicT1-official original

Contrast settings, which resolve by a strict five-level precedence where the highest scope wins. Named in full, highest to lowest: [Official] Claude Code Settings · AnthropicT1-official original

So CLAUDE.md and settings sit at opposite ends: one accumulates, the other overrides — and the override ladder is five rungs, not three, with CLI and Local the two most often forgotten.

Same conflict, two opposite resolutions Worked example

A user file and a project file each set the same thing. Watch the two layers diverge:

Instructions (CLAUDE.md) — concatenate. ~/.claude/CLAUDE.md says “prefer tabs”; ./CLAUDE.md says “prefer 2-space indent.” Result: both load into context at once (root-down order), and there is no rule about which wins — Claude sees two contradictory instructions. The contradiction is a smell to fix at the source, not something proximity resolves.

Settings (settings.json) — override. ~/.claude/settings.json sets "model": "opus"; .claude/settings.json sets "model": "sonnet". Result: Project wins over User (level 4 beats level 5), so the effective model is sonnet. Add --model haiku on the CLI and that wins (level 2), overriding both files.

Same shape of conflict, opposite outcomes: the instruction files merge and coexist; the settings resolve to exactly one value down the five-level ladder (Managed → CLI → Local → Project → User). Predict CLAUDE.md with the settings model and you will be wrong every time.

@import: stitching files together

A CLAUDE.md can pull in other files with @path/to/import. The imported files expand and load at launch alongside the referencing file; relative paths resolve relative to the file containing the import; and the import chain has a maximum recursion depth of 5. [Official] How Claude remembers your project · AnthropicT1-official original The first time a session encounters an import, Claude shows an approval dialog — and declining it disables imports permanently (the dialog does not reappear). [Official] How Claude remembers your project · AnthropicT1-official original

AGENTS.md, managed policy, and the budget you don’t develop here

Three controls round out the layer. First, the cross-tool bridge: Claude Code reads CLAUDE.md, not AGENTS.md — to share one instruction set with other agents, create a CLAUDE.md that imports @AGENTS.md. [Official] How Claude remembers your project · AnthropicT1-official original Second, managed settings can deploy instructions with no file at all: the claudeMd key carries inline CLAUDE.md content, honored only in managed/policy settings, and it loads before the user and project files. [Official] How Claude remembers your project · AnthropicT1-official original Claude Code Settings · AnthropicT1-official original Third, claudeMdExcludes — glob patterns matched against absolute paths, merged across layers — skips ancestor CLAUDE.md files, with the single exception that the managed-policy file can never be excluded. [Official] How Claude remembers your project · AnthropicT1-official original

Practice

Exercise solutions

Solution ↑ Exercise

C. CLAUDE.md files do not compete: “All discovered files are concatenated into context rather than overriding each other,” ordered “from the filesystem root down to your working directory.” So both “use tabs” and “use 2-space indent” are in context simultaneously — which is itself a smell, because contradictory ancestor instructions are not resolved by proximity. A and B both assume a precedence the instruction layer does not have. D is the high-value trap: it imports the settings model (where settings.local.json overrides user settings) onto memory, where files merge instead. To actually suppress the root file you would use claudeMdExcludes, not a closer CLAUDE.md.

Solution ↑ Exercise

haiku runs. The five-level settings precedence, highest to lowest, is Managed → CLI arguments → Local → Project → User; the --model haiku CLI argument (level 2) beats both the project opus (level 4) and the user sonnet (level 5). The same two-scope setup behaves differently for CLAUDE.md because the instruction layer concatenates instead of overriding — two CLAUDE.md files setting contradictory guidance would both load into context at once, with no “winner,” whereas settings resolve to exactly one value down the ladder.

Solution ↑ Exercise

(a) The maximum @import recursion depth is 5. (b) On first encountering an import, Claude Code shows an approval dialog; declining it disables imports permanently — the dialog does not reappear, so a future import in that environment silently will not expand until the choice is reset. Design import chains shallow (≤5) and approve deliberately.

Exam essentials

Four scopes, broadest-first — Managed → User → Project → Local; discovery walks up from cwd, nested files load on demand, and the managed file cannot be excluded.
Concatenate, don’t override — there is no precedence between CLAUDE.md files (root → cwd order; CLAUDE.local.md appended after CLAUDE.md). This is the opposite of the strict five-level settings precedence: Managed → CLI → Local → Project → User (CLI and Local are the forgotten two). Conflating the two is the single most common instruction-layer error.
@import — @path expands at launch, resolves relative to the importing file, caps at recursion depth 5, and prompts an approval dialog on first use (declining disables imports permanently).
AGENTS.md — Claude Code reads CLAUDE.md only; it picks up AGENTS.md solely through a @AGENTS.md import.
Managed controls — claudeMd deploys inline policy content (loads before user/project); claudeMdExcludes globs away ancestor files by absolute path — but never the managed-policy CLAUDE.md.

Slash Commands & Skills: Stored Prompts, Lazy-Loaded Capabilities

Two ways to extend the workflow — a slash command (a stored prompt recognized at message start) and a skill (a lazy-loaded, auto-invocable, directory-bundled capability). The merged model, the full lazy-load lifecycle (description budget, $ARGUMENTS substitution, compaction carry-forward, live change detection), the SKILL.md frontmatter, the four scopes, and what disable-model-invocation does to the description.

Volatility: feature-surface

Tools compared: claude-code

D3.1’s CLAUDE.md is always-on context. This chapter is the other half of the instruction layer: capabilities the agent loads only when it needs them. A slash command is a stored prompt you trigger by typing /name; a skill is a richer, directory-bundled capability that Claude can also reach for on its own. The exam-relevant facts are that the two have converged, and that a skill’s lazy-loading — and its budget — are what keep it cheap.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Where does a slash command have to appear in your message, and what becomes of the text after it?
True or false: a skill’s description is always loaded at session start.
How do you pass and reference arguments inside a skill body?
Roughly how much context does a skill cost at startup vs once invoked, and how long does the body persist?
Name the four skill scopes in precedence order.

Check your answers

At the start of the message — text that follows the command name is passed to it as arguments.
False — a disable-model-invocation: true skill carries no description in context at all, and even ordinary descriptions can be dropped under budget pressure (least-invoked first).
$ARGUMENTS expands to all arguments passed, and $ARGUMENTS[N] / $N pick a specific 0-indexed argument; if the body never references them, args are appended as ARGUMENTS: <value>.
A ~100-token description at startup (against a budget defaulting to 1% of the context window); the full body loads on invocation and stays for the rest of the session, carried through compaction within a budget (≤5,000 tokens each, 25,000 combined).
Enterprise > personal > project, plus plugin — namespaced as plugin-name:skill-name, so it never conflicts.

Commands and skills have merged

A slash command controls Claude Code from inside a session, and “a command is only recognized at the start of your message. Text that follows the command name is passed to it as arguments.” [Official] Commands · AnthropicT1-official original Alongside the many built-in commands, some entries are bundled skills: “they use the same mechanism as skills you write yourself: a prompt handed to Claude, which Claude can also invoke automatically when relevant. Everything else is a built-in command whose behavior is coded into the CLI.” [Official] Commands · AnthropicT1-official original

That same mechanism is why the two authoring formats have converged: “Custom commands have been merged into skills. A file at .claude/commands/deploy.md and a skill at .claude/skills/deploy/SKILL.md both create /deploy and work the same way. Your existing .claude/commands/ files keep working.” [Official] Extend Claude with skills · AnthropicT1-official original Old flat-file commands still run; skills are the recommended form for new work because they add directory bundling, frontmatter, and auto-invocation.

Skills are lazy-loaded directories

A skill is “a markdown directory: .claude/skills/<name>/SKILL.md,” with optional supporting files (reference.md, scripts/, and the like) alongside. [Official] Extend Claude with skills · AnthropicT1-official original The reason a skill is cheap is its loading model — unlike CLAUDE.md, which loads every session: “skills load on demand. The agent receives skill descriptions at startup and loads the full content when relevant.” [Official] Extend Claude with skills · AnthropicT1-official original Each description is roughly 100 tokens; the full body materializes only when the skill is invoked, “enters the conversation as a single message and stays there for the rest of the session,” and is not re-read on later turns. [Official] Extend Claude with skills · AnthropicT1-official original

The lazy-load lifecycle in full

The “cheap” story has a budget and a lifecycle the exam can probe at each stage:

Concept ·

Startup (budget). Skill descriptions load into a context budget that defaults to 1% of the model’s context window (skillListingBudgetFraction; per-description cap maxSkillDescriptionChars, default 1,536). On overflow, the least-invoked skills’ descriptions are dropped first — so a rarely-used skill can become invisible. /doctor reports whether the budget is overflowing. [Official] Extend Claude with skills · AnthropicT1-official original
Invocation (arguments). Inside the body, $ARGUMENTS expands to all arguments passed (if you never reference it, the args are appended as ARGUMENTS: <value>), and $ARGUMENTS[N] / $N pick a specific 0-indexed argument ($0 first, $1 second), with shell-style quoting. [Official] Extend Claude with skills · AnthropicT1-official original
Persistence (compaction). Once loaded the body stays for the session, and compaction carries it forward within a budget — the first 5,000 tokens of each most-recently-invoked skill, up to a combined 25,000 tokens post-compaction. [Official] Extend Claude with skills · AnthropicT1-official original

One operational nicety: live change detection — adding, editing, or removing a skill under ~/.claude/skills/, the project’s .claude/skills/, or an --add-dir directory takes effect within the current session, no restart — the one exception being a brand-new top-level skills/ directory that did not exist at launch. [Official] Extend Claude with skills · AnthropicT1-official original

A deploy skill through its lifecycle Worked example

# .claude/skills/deploy/SKILL.md
---
name: deploy
description: Deploy the app to an environment. Use when asked to deploy, ship, or release a build.
argument-hint: "[environment]"
---
Deploy to $0 by running the runbook in scripts/deploy.sh, then verify the health check…

Trace it: at startup, only the ~100-token description loads (counted against the 1%-of-context budget) — the runbook body does not. On /deploy staging, $0 (and $ARGUMENTS) expand to staging, and the full body enters the conversation as one message. It then persists for the session; if a compaction fires, the body is carried forward (up to 5,000 tokens of it, within the 25,000-token combined skill budget). Flip disable-model-invocation: true and the calculus changes: the description is no longer in context at startup at all, so Claude can’t auto-invoke on “ship it” — only an explicit /deploy loads it.

SKILL.md frontmatter and the four scopes

The SKILL.md frontmatter is where a skill declares its behavior. Among the fields: name (display name in skill listings, defaults to the directory name — the directory name, not this field, sets the /command you type, except for a plugin-root SKILL.md), description (drives auto-invocation; description + when_to_use capped at 1,536 chars by default), argument-hint, disable-model-invocation, user-invocable, allowed-tools (CLI-only), model, effort, context, and paths (glob patterns that limit when the skill activates). [Official] Extend Claude with skills · AnthropicT1-official original

Skills resolve across four scopes by precedence — “enterprise > personal > project; plugin skills use a plugin-name:skill-name namespace and never conflict”: Enterprise (managed settings, all users), Personal (~/.claude/skills/, all your projects), Project (.claude/skills/, auto-discovered from cwd up to repo root), and Plugin (namespaced). [Official] Extend Claude with skills · AnthropicT1-official original

Who can invoke it

Two frontmatter switches decide who may call a skill. user-invocable: false hides it from the / menu so only Claude can invoke it; disable-model-invocation: true does the inverse — only the user can trigger it via /, Claude cannot auto-invoke, its description is kept out of context, and it also blocks subagent preloading. [Official] Extend Claude with skills · AnthropicT1-official original Between them you can build a skill that is purely automatic, purely manual, or both.

Practice

Exercise solutions

Solution ↑ Exercise

B. A skill lazy-loads: only its ~100-token description sits in context at startup, and the full 2,000-token body loads on invocation — and Claude can auto-invoke it when a deploy is relevant. A is the D3.1 budget mistake: a CLAUDE.md is loaded every session, so the whole 2,000 tokens would be spent on every unrelated turn. C is wrong on the “only way” claim — commands have merged into skills, so .claude/commands/deploy.md and .claude/skills/deploy/SKILL.md both create /deploy; the skill form is recommended. D defeats the point of a reusable capability. The skill is the form that is both cheap (lazy) and discoverable (auto-invocable).

Solution ↑ Exercise

disable-model-invocation: true keeps the skill’s description out of startup context entirely — so unlike an ordinary skill (whose ~100-token description Claude sees and can match against), the model is never told this skill exists. That is exactly the point for a risky release action: it forces a human to type /force-release, because Claude cannot auto-invoke something it cannot see. The teammate got the intended safety behavior; “Claude doesn’t know it exists” is the feature, not a bug. (If they wanted Claude to know about it but still gate execution, that is a permission/ask concern, not this flag.)

Solution ↑ Exercise

(a) Only the ~100-token description enters context at session start, counted against the skill-listing budget (default 1% of the model’s context window); the ~3,000-token body does not load yet. (b) The body loads on invocation, enters as a single message, and stays for the rest of the session (it is not re-read each turn). If a compaction fires, the body is carried forward within a budget — up to the first 5,000 tokens of each most-recently-invoked skill, capped at a combined 25,000 tokens post-compaction.

Exam essentials

Commands merged into skills — .claude/commands/x.md and .claude/skills/x/SKILL.md both create /x; old commands keep working, skills are recommended (directory bundling, frontmatter, auto-invocation). A command is recognized only at message start; text after it is arguments.
Skills lazy-load on a budget — a ~100-token description at session start (budget defaults to 1% of the context window; on overflow least-invoked descriptions drop first; /doctor shows it), the full body on invocation; the body persists and compaction carries it forward (≤5,000 tokens each, 25,000 combined).
Arguments — $ARGUMENTS (all args), $ARGUMENTS[N] / $N (0-indexed) substitute into the body; absent references append ARGUMENTS: <value>.
The description is the retrieval interface — but not unconditional — disable-model-invocation: true keeps the description out of context (user-only via /), and budget overflow can drop one. user-invocable: false = Claude-only.
Four scopes — enterprise > personal > project; plugin skills are namespaced (plugin-name:skill-name) and never conflict. Live change detection: editing a skill takes effect mid-session without restart (except a brand-new top-level skills/ dir).

Path-Scoped Rules: Modular, Glob-Triggered Instructions

The .claude/rules/ system — a modular, eager-loaded instruction layer parallel to CLAUDE.md, with optional glob path-scoping so a rule loads only when Claude reads matching files. Covers unconditional vs path-scoped rules, user-level vs project rules and their load order, the glob format, and the directory and symlink mechanics.

Volatility: feature-surface

Tools compared: claude-code

D3.1 covered the always-on CLAUDE.md; D3.2 covered the lazy, invocable skill. Rules are the third shape of the instruction layer: modular markdown files that load eagerly like CLAUDE.md, but can be glob-scoped so a rule only enters context when Claude touches the files it governs. The architect’s job here is to know that rules are a separate, equal-priority system, that user and project rules have a load order, and to use path-scoping to keep file-specific guidance out of unrelated work.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Is .claude/rules/ a subsystem of CLAUDE.md, or a parallel one? At what priority do un-scoped rules load?
When does a paths-scoped rule actually enter context?
Where do user-level rules live, and do they win over project rules or lose to them?
Write a paths glob for every .ts and .tsx file in the repo.
When should a standing instruction be a paths-scoped rule rather than a line in CLAUDE.md?

Check your answers

A parallel system, not a subsystem — rules without paths load at launch at the same priority as .claude/CLAUDE.md.
When Claude reads a file matching the glob — not at launch, and not on every tool use.
In ~/.claude/rules/, applying to every project; they load before project rules, so project rules win (read last, higher priority).
**/*.{ts,tsx} — ** matches any directory depth, and brace expansion covers both extensions in one pattern.
When the guidance is real but only relevant to part of the tree — path-scoping keeps file-specific instructions out of context on unrelated work.

A system parallel to CLAUDE.md

.claude/rules/*.md is a modular rules system, loaded into context every session, with recursive discovery across subdirectories. [Official] How Claude remembers your project · AnthropicT1-official original The relationship to CLAUDE.md is the fact to fix first: “Rules without paths frontmatter are loaded at launch with the same priority as .claude/CLAUDE.md.” [Official] How Claude remembers your project · AnthropicT1-official original A rule is not nested under CLAUDE.md or overridden by it — the two are parallel instruction sources discovered separately and loaded at equal priority.

User-level vs project rules — there is a load order

The “no precedence” story needs one refinement the exam can probe: rules come in scopes, and the scopes load in order. User-level rules live in ~/.claude/rules/ and apply to every project on your machine — use them for preferences that aren’t project-specific (your personal style, your workflows). [Official] How Claude remembers your project · AnthropicT1-official original Project rules live in the repo’s .claude/rules/. And the order between them is documented: “user-level rules are loaded before project rules, giving project rules higher priority.” [Official] How Claude remembers your project · AnthropicT1-official original

That is the same recency model as the CLAUDE.md hierarchy (D3.1): files concatenate, but the more-specific scope is read last and so effectively dominates when two instructions tension. So “concatenate, not override” is true within a scope; across scopes there is a load order — user → project, project higher — exactly mirroring CLAUDE.md’s broad-to-specific assembly.

Path-scoping with the `paths` frontmatter

The lever that makes rules more than “another CLAUDE.md” is the optional paths field. Give a rule a paths glob and it stops loading unconditionally: “path-scoped rules trigger when Claude reads files matching the pattern, not on every tool use.” [Official] How Claude remembers your project · AnthropicT1-official original So a rule scoped to src/api/**/*.ts costs nothing until Claude actually reads an API file — and then applies while that work is in scope.

---
paths:
  - "src/api/**/*.ts"
---

# API Development Rules
- All API endpoints must include input validation

The glob format is the same one skills use for their paths field: [Official] How Claude remembers your project · AnthropicT1-official original

What's in context, and when Worked example

A developer has, across two scopes:

~/.claude/rules/preferences.md         # user-level, no paths  → personal default
.claude/rules/code-style.md            # project, no paths     → always on, beats user default
.claude/rules/backend/api.md           # project, paths: src/api/**/*.ts

At session start, in context: preferences.md (loaded first) and code-style.md (loaded after → higher priority). api.md is not loaded — it is path-scoped. If preferences.md says “prefer 4-space indent” and code-style.md says “2-space,” the project rule was read last, so it dominates. The moment Claude reads src/api/orders.ts, api.md activates and “all API endpoints must include input validation” enters context — and only then. Work that never touches src/api/ never pays for that rule. Two levers at once: scope order (user → project) decides who wins; path-scoping decides who is even present.

Layout and symlinks

Rules can mix unconditional and path-scoped files in one tree, and discovery recurses into subdirectories:

.claude/
└── rules/
    ├── code-style.md      # no paths: loaded unconditionally
    ├── security.md        # no paths: loaded unconditionally
    ├── frontend/
    │   └── react.md       # paths: src/frontend/**/*.tsx
    └── backend/
        └── api.md         # paths: src/api/**/*.ts

The unconditional files (code-style.md, security.md) are always in context; the nested path-scoped files wait for a matching file-read. [Official] How Claude remembers your project · AnthropicT1-official original Rules can also be shared from a central directory by symlink: “symlinks work in .claude/rules/ — link shared rules from a central dir; circular symlinks are detected gracefully.” [Official] How Claude remembers your project · AnthropicT1-official original

Choosing the shape: rule vs CLAUDE.md vs skill

The three instruction shapes now in view divide cleanly by when they load and what they carry:

The practical rule of thumb: reach for a paths-scoped rule when guidance is real but only relevant to part of the tree — it is the one shape that lets you write file-specific instructions without paying for them on every unrelated turn.

Practice

Exercise solutions

Solution ↑ Exercise

C. A path-scoped rule is exactly this case: with paths: ["src/api/**/*.ts"] the standard loads only when Claude reads a matching API file, and stays out of context otherwise. A and B both load the line unconditionally — CLAUDE.md and an un-scoped rule sit in context at the same priority every session, which is the clutter you wanted to avoid. D misuses a skill: skills are invocable capabilities/workflows, not standing rules that should bind automatically while editing a file. The lever that matters is paths, and only a rule (or a skill) offers it — so the rule is the right shape for a standing, file-scoped instruction.

Solution ↑ Exercise

src/**/*.{ts,tsx} — or, to cover the whole repo regardless of directory, **/*.{ts,tsx}. A single paths entry with brace expansion {ts,tsx} matches both extensions; ** matches any directory depth. (You can also list two patterns, but the brace form does it in one.)

Solution ↑ Exercise

Claude favors 2-space indent — the project rule wins. Your preferences.md is a user-level rule (~/.claude/rules/, applies to every project); the repo’s code-style.md is a project rule. The documented order is that user-level rules load before project rules, giving project rules higher priority — so when both are in context and they tension, the project rule (read last) dominates. Your personal preference acts as a default that any project can override for its own repo.

Exam essentials

Parallel to CLAUDE.md — .claude/rules/*.md loads every session with recursive subdir discovery; rules without paths load unconditionally at the same priority as .claude/CLAUDE.md (not a subsystem of it).
Scopes have a load order — user-level (~/.claude/rules/, all projects) loads before project (.claude/rules/), giving project rules higher priority. Rules concatenate, but the more-specific scope is read last and wins (the CLAUDE.md model).
Path-scoping — a paths glob makes a rule trigger only when Claude reads matching files, not at launch and not on every tool use. Glob format same as skill paths (**/*.ts, src/**/*, {ts,tsx}).
Mechanics — unconditional and path-scoped rules can mix in one tree; symlinks work, with circular symlinks detected gracefully.
Choosing the shape — CLAUDE.md (always-on), rules (modular, optionally glob-gated, user/project order), skills (lazy capability). Reach for a paths-scoped rule for guidance that is real but only relevant to part of the tree.

Part 3 Chapter 4 Last verified 2026-06-08 Fresh

Plan Mode vs Direct Execution: Research Before You Edit

Plan mode restricts Claude to read-only research and a written proposal — no edits — and approving the plan exits the mode into a write mode. Choosing plan versus going direct is a risk-containment decision, not a named-mode toggle; "direct execution" is simply working in a write mode. The opusplan alias pairs the mode with a model-per-phase split — Opus plans, Sonnet executes.

Volatility: architectural-pattern

Tools compared: claude-code

D2.5 enumerated the six permission modes as a tool-gating mechanism. This chapter zooms in on the one an architect chooses on purpose: plan. Plan mode is read-only research with a written proposal at the end — and the real exam question is not “what is plan mode” but “when do you plan first versus edit directly.” That is a risk-containment decision, and there is no named “direct execution” mode — going direct just means working in a mode that lets edits through. The chapter closes on opusplan, the alias that makes the strong-model-plans/fast-model-executes split automatic.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Does plan mode suppress permission prompts? What exactly can Claude do in it, and what can it not?
The instant you approve a plan, what happens to the session’s permission mode?
Name three distinct ways to enter plan mode.
What does the opusplan alias do — and does its plan phase get the automatic 1M-context upgrade?
A two-line fix in code you know well: plan first or go direct?

Check your answers

No — permission prompts still apply the same as default mode; Claude can read files, run shell commands to explore, and write a plan, but it does not edit your source.
Approving a plan exits plan mode — the session switches to the write mode the chosen approve option names, and Claude starts editing.
Any three of: cycle with Shift+Tab, prefix a prompt with /plan, start with claude --permission-mode plan, or set permissions.defaultMode: "plan" in settings.
It runs Opus in plan mode and switches to Sonnet for execution — and no: the automatic 1M-context upgrade is opus-only, so opusplan’s plan phase runs at the standard 200K window.
Go direct — a small diff in familiar code is the going-direct profile; plan first when reversal cost is high or the design space is unclear.

What plan mode restricts

Plan mode is the one mode whose purpose is to change nothing: “Plan mode tells Claude to research and propose changes without making them. Claude reads files, runs shell commands to explore, and writes a plan, but does not edit your source. Permission prompts still apply the same as default mode.” [Official] Choose a permission mode · AnthropicT1-official original It is read-only — best for exploring a codebase before changing it. [Official] Choose a permission mode · AnthropicT1-official original

Its contrast is not a single “direct” mode but the write modes from D2.5 — default (reads auto-approved, edits prompt) and acceptEdits (edits and common filesystem commands auto-approved) — the modes that let changes through. [Official] Configure permissions · AnthropicT1-official original “Going direct” is shorthand for working in one of those, not a toggle of its own.

Entering and exiting plan mode

You can enter plan mode four ways: cycle with Shift+Tab (the CLI cycle runs default → acceptEdits → plan), prefix a single prompt with /plan (optionally with a task, /plan fix the auth bug), start the session with claude --permission-mode plan, or set permissions.defaultMode: "plan" in settings. [Official] Choose a permission mode · AnthropicT1-official original

Exit is the part that trips people up: “Approving a plan exits plan mode and switches the session to the permission mode each approve option describes, so Claude starts editing. To plan again, cycle back to plan mode with Shift+Tab, or prefix your next prompt with /plan.” [Official] Choose a permission mode · AnthropicT1-official original When the plan is ready Claude presents it, and each approve option (auto, accept-edits, review-each-edit) names the write mode the session lands in.

The decision: plan first, or go direct?

Because plan mode adds a research-and-review round before any edit, the choice between it and going direct is a bet on reversal cost and uncertainty:

Model per phase: the `opusplan` alias

Plan mode separates thinking from doing in time — research first, edits after. That split is exactly where a model-per-phase pays, and Claude Code ships an alias for it. opusplan is one of the eight model aliases, and it “uses Opus in plan mode, switches to Sonnet for execution.” [Official] Model configuration · AnthropicT1-official original Spelled out: “The opusplan model alias provides an automated hybrid approach: In plan mode — Uses opus for complex reasoning and architecture decisions. In execution mode — Automatically switches to sonnet for code generation and implementation.” [Official] Model configuration · AnthropicT1-official original

Set it like any alias — /model opusplan during a session, or claude --model opusplan at startup. [Official] Model configuration · AnthropicT1-official original The reasoning-heavy plan runs on the strong model; the moment execution begins, the fast model takes over. You spend the expensive tokens where the leverage is — on the design — and the cheaper model on the mechanical edits.

One approval, two switches: opusplan through a plan→execute cycle Worked example

A developer starts a session for a multi-file refactor with claude --model opusplan and presses Shift+Tab until the mode reads plan. Two independent settings are now in play: the permission mode is plan (read-only) and the model resolves to Opus (the plan-phase half of opusplan).

Research phase. Claude reads files and runs shell exploration — all permitted, none of it edits — and the heavy reasoning runs on Opus. It writes a proposed change set.
The approval. The developer picks the “accept edits” approve option. One action flips two switches at once: the permission mode leaves plan for acceptEdits (edits now auto-approve), and opusplan leaves its plan phase, so the model switches to Sonnet for execution.
Execution phase. Claude applies the edits on Sonnet under acceptEdits — fast model, mechanical work.

The lesson the exam tests: approval is not a rubber stamp. It simultaneously ends the read-only guarantee (D2.5’s mode axis) and, under opusplan, hands the work to a different model. To plan again — and return to Opus — you must re-enter plan mode (Shift+Tab or /plan).

Where plan mode fits the workflow

Plan mode is the front of the iterative loop — the “explore and plan” phase before “implement and commit” (the rhythm developed in D3.5). The hands-on mechanics — how the approval screen looks, how to drive the loop turn by turn — are the handbook’s territory rather than this book’s: see the Use book, Chapter 3, Your First Working Session.

Practice

Exercise solutions

Solution ↑ Exercise

B. The change is unfamiliar, multi-file, and expensive to get wrong — exactly the profile where planning first pays. In plan mode Claude maps the full set of call sites and proposes the complete edit before touching anything, so a missed reference shows up in a proposal you can reject, not as breakage you have to walk back. A and D both start editing before the scope is known: you discover missed call sites as broken edits (A) or as a long stream of one-at-a-time approvals with no view of the whole (D). C removes the safety entirely — and approving plans is the opposite move from skipping permission checks. Plan mode here converts an unbounded reversal cost into a bounded one.

Solution ↑ Exercise

In plan mode Claude can read files and run shell commands to explore, and it writes a plan — but it does not edit your source. The thing people wrongly assume it suppresses is permission prompts: they “still apply the same as default mode,” so plan mode is not a quiet, prompt-free sandbox — it is a no-edit mode that still gates the actions it does allow. Read-only is the guarantee; silence is not.

Solution ↑ Exercise

(a) The session is now in acceptEdits — approving a plan exits plan mode and switches the session to the write mode the chosen approve option names, and Claude starts editing. The read-only guarantee is gone. (b) To plan again you must re-enter plan mode: cycle back with Shift+Tab, or prefix your next prompt with /plan. Approval is one-way; getting back to research is a deliberate re-entry, not an undo.

Solution ↑ Exercise

(a) Under opusplan, the planning phase runs on Opus (complex reasoning and architecture) and execution runs on Sonnet (code generation) — the switch is automatic at the plan→execute boundary. (b) No — opusplan does not give you a 1M planning window. The automatic 1M-context upgrade applies to the opus alias only, not opusplan, whose Opus plan phase runs at the standard 200K. For a ~400K planning context you would use opus[1m] (or pin a 1M model) for that phase instead.

Exam essentials

Plan mode = read-only research — Claude reads, explores via shell, and writes a plan, but does not edit source; permission prompts still apply the same as default mode (it is no-edit, not prompt-free).
“Direct execution” is not a mode — it is working in a write mode (default/acceptEdits); plan’s contrast is the modes that let edits through.
Enter — Shift+Tab cycle (default → acceptEdits → plan), /plan [task], --permission-mode plan, or defaultMode in settings.
Exit — approving a plan exits plan mode into the chosen write mode and Claude starts editing; to plan again, cycle back with Shift+Tab or prefix /plan.
opusplan — alias that runs Opus in plan mode, Sonnet in execution; set via /model opusplan or --model opusplan. Approval flips both the permission mode and the model. Its plan phase runs at 200K — the 1M upgrade is opus-only, not opusplan.
The decision — plan when reversal cost is high or the design space is uncertain (unfamiliar / multi-file / risky); go direct for small diffs in known code. Plan contains misunderstandings before edits.

Part 3 Chapter 5 Last verified 2026-06-02 Fresh

Iterative Refinement: The Loop, the Interview, and Test-Driven Prompting

Agentic work is iterative. The explore-plan-implement-commit rhythm, the interview pattern (Claude interviews you, writes a spec, a fresh session implements), and test-driven prompting are the durable disciplines — methodology that survives any tool rename, hence a stable principle.

Volatility: stable-principle

Tools compared: claude-code

D3.4’s plan mode is one phase of a larger loop. This chapter is the loop itself — how an architect drives a task from understanding to a committed change, refining as they go. These are disciplines, not features: the explore-plan-implement-commit rhythm, the interview pattern, and test-driven prompting outlast any particular keybinding or tool name, which is why this chapter is a stable principle rather than a feature surface.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Name the four phases of the recommended workflow, in order. When is it safe to skip the plan phase?
In the interview pattern, who asks the questions, which tool drives them, and what artifact ends the interview?
Why does a fresh session implement the spec rather than the one that ran the interview?
What is described as “the single highest-leverage thing you can do,” and what does withholding it cost?
You have corrected Claude three times on the same issue this session. What is the recommended move, and why?

Check your answers

Explore → Plan → Implement → Commit; skip the plan phase only when the diff is one-sentence-describable.
Claude asks the questions, driven by the AskUserQuestion tool, and the interview ends by writing a complete spec to a file (e.g. SPEC.md).
The interview session is full of question-answer thrash; the fresh session starts on clean context whose only input is the written spec.
A verification criterion — tests, screenshots, or expected outputs so Claude can check itself; without one it “might produce something that looks right but actually doesn’t work,” and you become the only feedback loop.
Past two corrections on the same issue the context is cluttered with failed approaches — run /clear and start fresh with a more specific prompt that incorporates what you learned.

The four-phase rhythm

Letting Claude jump straight to coding produces code that solves the wrong problem; the antidote is a phased loop. “The recommended workflow has four phases: Explore [plan mode, read files] → Plan [create detailed implementation plan] → Implement [switch out of plan mode, verify against plan] → Commit [descriptive message + PR].” [Official] Best practices for Claude Code · AnthropicT1-official original Explore and Plan are the read-only front half (D3.4’s plan mode is exactly this); Implement and Commit are where edits land. Skip the plan phase only when the diff is one-sentence-describable. [Official] Best practices for Claude Code · AnthropicT1-official original

The interview pattern

For a large feature with an unclear design space, the most effective opening is to invert the usual flow and let Claude drive the questions: “For larger features, have Claude interview you first. Start with a minimal prompt and ask Claude to interview you using the AskUserQuestion tool. Claude asks about things you might not have considered yet, including technical implementation, UI/UX, edge cases, and tradeoffs.” [Official] Best practices for Claude Code · AnthropicT1-official original The interview ends by writing a complete spec to a file, and then a fresh session implements from that spec — the interview session is full of question-answer thrashing, while the implementation session starts with clean context whose only input is the written spec.

The spec file is the bridge between the two sessions: a deliverable you can review and the context bootstrap the implementation session reads.

The interview pattern, end to end Worked example

You need a rate limiter for an API and the design space is fuzzy. Instead of writing a full spec you do not yet have, you run the interview pattern:

Minimal prompt. “I need a rate limiter for our API. Interview me with AskUserQuestion before writing any code — dig into the hard parts I might not have considered, then write a complete spec to SPEC.md.”
Claude interviews you. It asks what you would not have front-loaded: Algorithm — token bucket (allows bursts) or sliding window (smoother)? Scope — per-API-key, per-IP, or global? On exceed — reject with 429, or queue and delay? Storage — in-process, or Redis so it holds across instances? Failure mode — if the limiter’s backing store is down, fail open or fail closed?
You answer, and the hard tradeoffs surface before any code — the fail-open/closed question alone is one most one-shot prompts never raise.
Artifact. Claude writes SPEC.md: token bucket, per-key, 429 on exceed, Redis-backed, fail-closed. You read and correct it — it is a reviewable deliverable, not buried in chat.
Fresh session implements. You /clear (or open a new session) and prompt: “Implement the rate limiter per SPEC.md; write the tests first.” The implementation session starts on clean context whose only input is the reviewed spec — none of the interview’s question-answer thrash.

Why the split matters: the interview session’s context is full of half-formed options and back-and-forth; the implementation session should not inherit that noise. The spec file is the clean hand-off — review gate and context bootstrap in one.

Give Claude a way to verify its work

The highest-return habit in the whole loop is supplying a success criterion: “Include tests, screenshots, or expected outputs so Claude can check itself. This is the single highest-leverage thing you can do. … Without clear success criteria, it might produce something that looks right but actually doesn’t work.” [Official] Best practices for Claude Code · AnthropicT1-official original This is also what turns a vague prompt concrete — “the build is failing” becomes “the build fails with this error: [paste]; fix it, verify the build succeeds, and address the root cause.” Test-driven prompting is the same instinct formalized: for a bug, ask for “a failing test that reproduces the issue, then fix it”; for longer work, have Claude write the tests first and keep them as the persistent contract. [Official] Best practices for Claude Code · AnthropicT1-official original

Course-correct early — and know when to restart

Iteration only pays if the loop stays clean. “The best results come from tight feedback loops” — correct Claude quickly rather than letting a wrong direction run. But there is a threshold: “If you’ve corrected Claude more than twice on the same issue in one session, the context is cluttered with failed approaches. Run /clear and start fresh with a more specific prompt that incorporates what you learned. A clean session with a better prompt almost always outperforms a long session with accumulated corrections.” [Official] Best practices for Claude Code · AnthropicT1-official original

The hands-on mechanics of this loop — the session rhythm turn by turn, and the dedicated treatments of the interview pattern and the testing discipline — are the handbook’s territory: see the Use book, Chapter 3, Your First Working Session, with the interview-pattern and testing-and-verification chapters forthcoming there.

Practice

Exercise solutions

Solution ↑ Exercise

B. A large feature with an unsettled design space is the documented home of the interview pattern: a minimal prompt asks Claude to interview you via AskUserQuestion, surfacing edge cases and tradeoffs you have not considered, and the interview ends in a SPEC.md that a fresh session then implements from clean context. A assumes you already hold a complete spec — but the premise is that you do not, so you would be encoding gaps as requirements. C burns turns thrashing and pollutes context with half-formed direction. D is the closest miss: plan mode (D3.4) makes Claude explore the code, but the interview pattern makes Claude interrogate you about intent and tradeoffs — and here the unknowns live in your requirements, not in the codebase. The interview elicits the spec; plan mode would plan against a spec you have not written yet.

Solution ↑ Exercise

The four phases in order are Explore → Plan → Implement → Commit — explore (read-only / plan mode) builds understanding, plan writes a detailed implementation plan, implement switches out of plan mode and verifies against that plan, and commit writes a descriptive message and PR. You may skip the plan phase only when the diff is one-sentence-describable — a change small and clear enough that there is nothing for a plan to de-risk.

Solution ↑ Exercise

The pattern is: “Write a failing test that reproduces the issue, then fix it” — make Claude first encode the bug as a test that fails, then make that test pass. It outperforms “fix the login bug” for two reasons. First, the failing test is an unambiguous success criterion: Claude can iterate against ground truth on its own turns instead of waiting for you to judge each attempt — the highest-leverage move in the loop. Second, the test persists as a regression contract: it stays green afterward, so the same bug cannot silently return. “Fix the login bug” gives Claude no way to check itself and leaves nothing behind to prove the fix held.

Exam essentials

Four-phase rhythm — Explore (read-only / plan mode) → Plan → Implement (verify against plan) → Commit; skip the plan only when the diff is one-sentence-describable.
Interview pattern — for large features, Claude interviews you via AskUserQuestion, writes a SPEC.md, and a fresh session implements from it (clean context).
Verification is the single highest-leverage move — include tests, screenshots, or expected outputs so Claude checks itself; without criteria, the human is the only feedback loop.
Concrete beats vague — replace “the build is failing” with the error plus “address the root cause”; the delta is verb + concrete example/test/file + a verification step.
Course-correct early, then restart — tight loops beat drift, but after two corrections on the same issue, /clear and rewrite the prompt incorporating what you learned.

Part 3 Chapter 6 Last verified 2026-06-08 Fresh

CI/CD Integration: Headless Runs, Output Formats, and GitHub Actions

Running Claude Code in CI — the headless `claude -p` entry point and `--bare` for reproducibility, the three output formats, schema-validated structured output via `--json-schema`, the permission flags that lock down a run with no human to prompt, and the GitHub Actions wrapper with its credential model.

Volatility: feature-surface

Tools compared: claude-code

D2.5 and D3.4 governed permission inside an interactive session. This chapter takes the same agent out of the terminal and into a pipeline. The mechanics change — there is no one at the keyboard to approve a tool or answer a question — so a headless run has to decide its output shape and its permission surface up front. The payoff is that Claude Code becomes a scriptable, gated CI citizen.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

What does --bare skip, and why does omitting it make a headless run non-reproducible?
Which --output-format gives you total_cost_usd and session_id as parseable fields?
With no human at the keyboard, which permission mode is the documented floor for a locked-down CI run, and what does it deny?
What is a CI step actually gating on when it passes or fails a claude -p run? Name two conditions that make the run exit non-zero.
In GitHub Actions v1.0, how do you pass --bare or --allowedTools through to the underlying claude -p?

Check your answers

--bare skips auto-discovery of hooks, skills, plugins, MCP servers, auto memory, and CLAUDE.md; without it the run loads whatever the host machine has, so the same command behaves differently on different runners.
--output-format json — a single payload with result, session_id, and total_cost_usd (plus a per-model cost breakdown).
--permission-mode dontAsk — it denies anything not in permissions.allow or the read-only command set.
The run’s process exit code (0 passes, non-zero fails); hitting --max-turns and piping stdin over the 10 MB cap both exit non-zero.
Through the claude_args passthrough input — v1.0 keeps prompt for instructions and routes all CLI flags via claude_args.

The headless invocation

The entry point for everything in this chapter is one flag: “claude -p "<query>" is the canonical non-interactive invocation; the CLI exits after responding. All standard CLI options work with -p.” [Official] Run Claude Code programmatically · AnthropicT1-official original That single command runs the full agent loop and returns — no prompt, no session UI.

For CI you almost always pair it with --bare: “Add --bare to reduce startup time by skipping auto-discovery of hooks, skills, plugins, MCP servers, auto memory, and CLAUDE.md. Without it, claude -p loads the same context an interactive session would, including anything configured in the working directory or ~/.claude. Bare mode is useful for CI and scripts where you need the same result on every machine.” [Official] Run Claude Code programmatically · AnthropicT1-official original It is the recommended mode for scripted calls and is slated to become the default for -p in a future release. [Official] Run Claude Code programmatically · AnthropicT1-official original

Output formats

A headless run can emit one of three shapes, selected with --output-format: text (default), json (a single payload with result, session_id, and total_cost_usd), and stream-json (newline-delimited events). [Official] Run Claude Code programmatically · AnthropicT1-official original The json form is what makes a run scriptable: “With --output-format json, the response payload includes total_cost_usd and a per-model cost breakdown, so scripted callers can track spend per invocation without consulting the usage dashboard.” [Official] Run Claude Code programmatically · AnthropicT1-official original

Structured output with `--json-schema`

When a downstream step needs a specific shape rather than prose, constrain the output to a schema: “To get output conforming to a specific schema, use --output-format json with --json-schema and a JSON Schema definition. The response includes metadata about the request (session ID, usage, etc.) with the structured output in the structured_output field.” [Official] Run Claude Code programmatically · AnthropicT1-official original The flag is --json-schema '<schema>'; the CLI reference describes it as producing “validated JSON output matching a JSON Schema after agent completes its workflow (print mode only).” [Official] CLI reference · AnthropicT1-official original

A minimal schema looks like this (illustration only — the shape is yours to define):

claude -p "Classify this PR's risk" \
  --output-format json \
  --json-schema '{"type":"object","properties":{"severity":{"type":"string"}},"required":["severity"]}'

The schema-conforming result then arrives in structured_output, alongside the usual session_id and usage metadata.

Permission gates for a run with no human

The defining constraint of CI is that no one is there to approve a tool call, so the permission surface must be settled before the run starts. The locked-down mode from D3.4 is built for exactly this: --permission-mode dontAsk “denies anything not in permissions.allow or the read-only command set,” which the docs call out as useful for locked-down CI runs. [Official] Run Claude Code programmatically · AnthropicT1-official original Pair it with an allowlist: --allowedTools "Bash(git diff *),Read" auto-approves specific tools and supports prefix matching. [Official] Run Claude Code programmatically · AnthropicT1-official original

Two more knobs bound a run: --max-turns N “limits agentic turns and exits with error when reached,” and --max-budget-usd <N> caps dollar spend — both print-mode-only. [Official] CLI reference · AnthropicT1-official original At the far end, --permission-mode bypassPermissions (alias --dangerously-skip-permissions) skips prompts entirely [Official] Run Claude Code programmatically · AnthropicT1-official original — appropriate only inside an isolated container, never as a convenience.

Exit codes: what CI actually gates on

Output format decides what a run prints; the exit code decides whether the pipeline step passes. A CI step’s pass/fail is the exit status of the process it ran — so a headless claude -p that runs the full loop and “exits after responding” [Official] Run Claude Code programmatically · AnthropicT1-official original hands its exit code straight to the runner, and 0 means the step succeeds while non-zero fails it. The docs name concrete non-zero triggers you can rely on: --max-turns N “limits agentic turns and exits with error when reached,” [Official] CLI reference · AnthropicT1-official original and an over-cap stdin (10 MB) “returns a clear error and non-zero exit.” [Official] Run Claude Code programmatically · AnthropicT1-official original

The same mechanism gives you a clean pre-flight gate: claude auth status “exits 0 if logged in, 1 if not — useful as a CI gate before the agent step.” [Official] Run Claude Code programmatically · AnthropicT1-official original Run it first and the job fails fast with a clear cause instead of burning a turn on an unauthenticated agent call.

A read-only review gate that fails correctly Worked example

Goal: a CI job that lets Claude read the repo and run tests, blocks on nothing (no human), bounds cost, and fails the pipeline if the run fails. A shell step:

# 1. Pre-flight: fail fast if the runner isn't authenticated.
claude auth status --text || { echo "Claude not authenticated"; exit 1; }

# 2. The gated run. Exit code propagates to the step.
claude -p "Review the staged diff for regressions; run the test suite." \
  --bare \
  --permission-mode dontAsk \
  --allowedTools "Read,Bash(git diff *),Bash(npm test)" \
  --max-turns 12 \
  --output-format json > result.json
# No `|| true` — if claude exits non-zero, the step (and the job) fails here.

Trace each guard: auth status (step 1) turns a missing credential into an immediate, legible failure — exit 1 — rather than a confusing agent error. --bare makes the run a function of its inputs, not the runner’s stray config. dontAsk + the tight --allowedTools fix the permission surface up front, so no tool request can stall waiting for an approval that will never come. --max-turns 12 is the ceiling: if the agent thrashes past twelve turns it exits with error, and because the step does not mask the status, the job goes red. On success the agent exits 0, result.json holds the parseable payload, and a later step can read total_cost_usd to log spend. The job’s green/red is the agent’s exit code — exactly the contract CI needs.

GitHub Actions and the credential model

The managed CI surface wraps all of the above. Claude Code GitHub Actions is “built on top of the Claude Agent SDK” [Official] Claude Code GitHub Actions · AnthropicT1-official original and wraps claude -p in a GitHub Action runner. Beyond the direct Anthropic API it supports two cloud providers — Amazon Bedrock (use_bedrock) and Google Vertex AI (use_vertex) — each authenticated through GitHub OIDC / Workload Identity Federation, so no static cloud keys are stored. [Official] Claude Code GitHub Actions · AnthropicT1-official original The v1.0 interface is deliberately small: “mode is auto-detected; use prompt for all instructions and claude_args for any CLI passthrough” [Official] Claude Code GitHub Actions · AnthropicT1-official original — so everything from earlier sections (--bare, --output-format, --allowedTools) reaches the runner through claude_args.

Practice

Exercise solutions

Solution ↑ Exercise

B. The job has no human to approve anything, so the permission surface must be fixed up front: dontAsk denies anything not pre-approved, --allowedTools "Read,Bash(npm test)" grants exactly the read and test-run capability (prefix matching scopes Bash to the test command), and --bare makes the run reproducible across machines. A does the opposite of locking down — --dangerously-skip-permissions approves everything, including edits and pushes. C auto-approves file edits, but the job is supposed to be read-only. D is the classic headless trap: in CI there is no one to answer the approval prompt, so a tool that falls through to default mode stalls or is denied rather than helpfully pausing. The locked-down combination is dontAsk + a tight allowlist + --bare.

Solution ↑ Exercise

Use --output-format json and read the session_id and total_cost_usd fields. The json form returns a single payload with result, session_id, and total_cost_usd (plus a per-model cost breakdown), so the later step can resume with the captured session_id and log spend from total_cost_usd — no usage-dashboard round-trip. The default text format returns only the final response (nothing parseable), and stream-json would make you reassemble the fields from an event stream.

Solution ↑ Exercise

The flag is --bare, and the cause it addresses is non-reproducible context discovery. Without --bare, claude -p auto-discovers and loads whatever the machine has — the repo’s CLAUDE.md, a developer’s personal ~/.claude config, locally-configured MCP servers, hooks, skills, plugins, auto memory — so the same command becomes a function of the host rather than of its inputs, and two runners diverge. --bare skips that discovery, making the run reproducible across machines (and it is slated to become the -p default).

Solution ↑ Exercise

(a) The job gates on the run’s process exit code: claude -p exits after responding and hands its exit status to the runner — 0 passes the step, non-zero fails it. (b) Two documented non-zero conditions: hitting --max-turns (“exits with error when reached”) and an over-cap stdin (piped input above the 10 MB limit “returns a clear error and non-zero exit”); claude auth status exiting 1 when not logged in is a third, useful as a pre-flight gate. (c) Swallowing the status — e.g. ending the command with || true or masking it behind a pipe — so the shell reports success even though the agent run failed; let the exit code propagate.

Exam essentials

Headless entry — claude -p "<query>" runs non-interactively and exits; add --bare to skip discovery (hooks/skills/MCP/CLAUDE.md) for reproducible CI. --bare is slated to become the -p default.
Output formats — text (default), json (result, session_id, total_cost_usd, per-model cost), stream-json (newline-delimited events; system/init carries plugin_errors to fail CI).
Structured output — --output-format json with --json-schema adds a validated structured_output field; --json-schema is print-mode only (as are --max-turns, --max-budget-usd).
Permission gates — CI has no human to prompt, so decide the surface up front: dontAsk (deny anything not pre-approved) + --allowedTools (prefix matching, e.g. Bash(git diff *)); --max-turns / --max-budget-usd cap a run; bypassPermissions / --dangerously-skip-permissions skips all checks — containers only.
Exit codes are the CI contract — a step passes/fails on the run’s exit status (0 success, non-zero failure). --max-turns reached and over-cap stdin (10 MB) exit non-zero; claude auth status exits 0/1 as a pre-flight gate. Don’t mask the status (|| true) — a failed run would report green.
GitHub Actions — wraps claude -p; v1.0 auto-detects mode (prompt + claude_args); the Anthropic API plus Bedrock + Vertex (the two cloud providers via OIDC, no static keys); supply credentials as secrets, never hardcoded.

Part 3 · D3 Review

6 exercises across 6 chapters — interleaved review.

d3-01-claude-md-hierarchy

d3-01-ex-scope-concat A monorepo has `/repo/CLAUDE.md` ("use tabs") and `/repo/services/api/CLAUDE.md` ("use 2-space indent"). A developer runs Claude Code from `/repo/services/api/`, and both files are discovered. Which statement describes what Claude Code actually loads, and why? - **A.** Only the `services/api/CLAUDE.md` loads — the closest file overrides its ancestors. - **B.** Only the root `/repo/CLAUDE.md` loads — the project-root file takes precedence. - **C.** Both load and concatenate (root first, then `api`); with no precedence, both instructions sit in context at once. - **D.** Both load, but `api`'s lines override the root's, the way `settings.local.json` overrides user settings.

d3-02-slash-commands-skills

d3-02-ex-command-vs-skill Your team has a 2,000-token deployment runbook you want Claude to follow whenever it deploys — ideally without anyone having to remember to paste it. Where should it live, and why? - **A.** In `CLAUDE.md`, so the runbook is always available to every session. - **B.** As a skill at `.claude/skills/deploy/SKILL.md`, so only its ~100-token description costs context until it is invoked. - **C.** As a slash command at `.claude/commands/deploy.md`, since that is the only way to get a `/deploy` command. - **D.** Pasted inline into each prompt at deploy time, so it is always fresh.

d3-03-rules-path-scoping

d3-03-ex-scoped-vs-unconditional You have a one-line standard — "all API endpoints must validate their input" — that is only relevant when someone edits files under `src/api/`. You want it in front of Claude during that work but not cluttering context the rest of the time. How should you author it? - **A.** As a line in `CLAUDE.md`, so it is always loaded and never missed. - **B.** As an unconditional rule `.claude/rules/api.md` with no `paths`, so it loads every session. - **C.** As a path-scoped rule `.claude/rules/api.md` with `paths: ["src/api/**/*.ts"]`, so it loads only when Claude reads an API file. - **D.** As a skill at `.claude/skills/api/SKILL.md`, so Claude can invoke it when needed.

d3-04-plan-mode

d3-04-ex-plan-or-direct You are asked to rename a widely-used helper function across an unfamiliar ~40-file service, and you want Claude to carry it out. Which approach best contains the risk, and why? - **A.** Start in `acceptEdits` and let Claude rename call sites as it finds them — the fastest path to a green tree. - **B.** Start in plan mode so Claude maps every call site and proposes the complete change set before editing; approve once it is complete. - **C.** Use `bypassPermissions` so nothing interrupts the multi-file edit. - **D.** Work in `default` mode and approve each edit as it appears, catching mistakes one prompt at a time.

d3-05-iterative-refinement

d3-05-ex-interview-vs-direct You are kicking off a sizeable new feature — a billing system — whose design space you have not fully thought through (proration rules, failure modes, and edge cases are still fuzzy). What is the most effective way to start Claude on it? - **A.** Write one detailed prompt with every requirement and verification criterion you can think of, for maximum autonomy. - **B.** Use the interview pattern: a minimal prompt asking Claude to interview you with `AskUserQuestion`, ending in a written `SPEC.md`, then a fresh session implements from it. - **C.** Give a vague one-liner and course-correct turn by turn as the design reveals itself. - **D.** Use plan mode alone — let Claude explore the codebase and propose an approach without interviewing you.

d3-06-cicd-integration

d3-06-ex-locked-down-ci A CI job should let Claude read the repository and run the test suite — and nothing else: no edits, no pushes, and no hanging on a prompt, because there is no human to answer one. Which invocation best fits? - **A.** `claude -p "..." --dangerously-skip-permissions`, so the run never blocks on a permission check. - **B.** `claude -p "..." --permission-mode dontAsk --allowedTools "Read,Bash(npm test)" --bare`. - **C.** `claude -p "..." --permission-mode acceptEdits`, so it can write any test artifacts it needs. - **D.** `claude -p "..."` in the default mode, letting it prompt for approval if it needs more access.

Part 4 Chapter 1 Last verified 2026-06-08 Fresh

Explicit Criteria over Vague Instructions

The controllable lever for output quality is the specification, not the model. Name the success criteria and the output shape explicitly; positive instruction beats negative; the model will not infer a requirement you did not state. This is durable methodology — a stable principle, not a feature surface.

Volatility: stable-principle

Tools compared: claude-code

Part IV opens the domain the exam weights at 20% — getting a model to produce what you actually need. The first lever is the cheapest and the most overlooked: the prompt’s own precision. Everything later in this Part (few-shot, structured outputs, validation loops) is an escalation from this baseline. The principle here outlasts every model version — newer models make it more true, not less — which is why this chapter is a stable principle, not a feature surface.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Two runs of the same prompt return differently-shaped output. Where does the inconsistency actually live?
A more-literal model (Opus 4.8) is given a prompt with one unstated-but-intended requirement. What happens to that requirement?
Which steers more reliably — “don’t use markdown lists” or “respond in flowing prose” — and why?
Name the three rungs of the output-control escalation ladder, cheapest first.
Why does a newer, more capable model make “be explicit” more important rather than less?

Check your answers

In the specification, not the model — the disagreement was latent in the prompt: a degree of freedom you left unstated, which the model resolved differently each run.
It is simply not honored — Opus 4.8 interprets prompts literally and “does not infer requests you didn’t make,” so an unwritten requirement goes unmet.
“Respond in flowing prose” — a positive instruction points straight at the destination, while a prohibition only names a forbidden region without locating the target inside the still-vast permitted one.
Explicit instruction → few-shot examples (D4.2) → structured outputs / strict tools (D4.3) — each rung costs more, so you climb only as far as the stakes require.
Because newer models follow instructions more literally — they invalidate the assumption that the model will generously guess your intent, so an unstated requirement becomes a liability, not a gap the model fills.

Specify the output format; do not leave it to inference

The single most reliable quality lever is to state the output contract explicitly. “Precisely define your desired output format using JSON, XML, or custom templates so that Claude understands every output formatting element you require.” [Official] Increase output consistency · AnthropicT1-official original A vague instruction (“summarize this”) leaves the shape — length, fields, ordering, what to do with missing data — for the model to guess, and a guess varies from run to run. An explicit instruction (“return JSON with keys sentiment, key_issues (list), and action_items (list of objects with team and task)”) removes the variance at its source.

The model will not infer what you did not ask for

Modern models follow instructions more literally, which makes implicit expectations a liability. “Claude Opus 4.8 interprets prompts literally and explicitly, particularly at lower effort levels. It does not silently generalize an instruction from one item to another, and it does not infer requests you didn’t make.” [Official] Prompting best practices · AnthropicT1-official original The upside is precision and less thrash; the cost is that a requirement you held in your head but never wrote down will simply not be honored. The fix is not a cleverer prompt that the model “figures out” — it is stating the requirement.

Tell the model what to do, not what to avoid

When you are steering format or tone, a positive instruction outperforms a prohibition. The docs are explicit that demonstrating the wanted behavior beats forbidding the unwanted one: “Positive examples showing how Claude can communicate with the appropriate level of concision tend to be more effective than negative examples or instructions that tell the model what not to do.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original “Respond in smoothly flowing prose” steers better than “do not use markdown lists,” because a prohibition names a forbidden region without locating the target inside the (still vast) permitted one. A positive instruction points straight at the destination. The same logic applies to eliminating preambles: state “respond directly without preamble” rather than enumerating the openings you dislike. [Official] Prompting best practices · AnthropicT1-official original

The escalation ladder: instruction, then examples, then a hard schema

Explicit instruction is the first rung, not the only one. The documented hierarchy is to ask plainly first and escalate only when you need a stronger guarantee: “Try simply asking the model to conform to your output structure first, as newer models can reliably match complex schemas when told to, especially if implemented with retries. For classification tasks, use either tools with an enum field containing your valid labels or structured outputs.” [Official] Prompting best practices · AnthropicT1-official original Plain instruction handles most cases; a few-shot example (the next chapter) disambiguates edge cases; a hard schema (D4.3) makes a shape unrepresentable to violate. Each rung costs more — context, latency, setup — so you climb only as far as the stakes require.

Climbing the ladder on one prompt Worked example

A nightly job starts with a vague prompt and hardens it exactly as far as the stakes demand:

Rung 0 (vague). "Summarize this support ticket." Output drifts run to run — one line here, three paragraphs there, sentiment sometimes present — and the downstream parser breaks. The variance is latent in the prompt: length, fields, and sentiment-handling were never pinned.
Rung 1 (explicit instruction). "Return JSON: summary (≤2 sentences), sentiment (one of positive/neutral/negative), action_needed (boolean). If there is no clear action, set action_needed=false." Now every degree of freedom the parser cares about is fixed. This handles the common case and costs nothing but words.
Rung 2 (few-shot, D4.2). A batch of feature-request tickets keeps getting tagged negative because they describe a missing capability. The instruction can’t fully pin that judgment, so you add two worked examples showing a feature request labeled neutral. The example disambiguates what prose could not.
Rung 3 (structured outputs / strict tool, D4.3). The pipeline must never see a label outside the three — but the model occasionally emits "mixed". You move sentiment to an enum-constrained tool / structured output, so a fourth value is unrepresentable, not merely discouraged.

The discipline is to stop at the rung the stakes require. Most fields never leave Rung 1; only the crash-on-violation field (sentiment) earns Rung 3. Climbing further than the risk warrants just buys setup cost and latency you don’t need.

The hands-on craft of writing these prompts — the iteration rhythm, the worked before-and-after examples — is the Use book’s territory; its prompt-engineering chapter is the use-side companion to this exam-angle treatment (forthcoming in the handbook).

Practice

Exercise solutions

Solution ↑ Exercise

C. The inconsistency is latent in the prompt: “summarize” leaves length, fields, and sentiment-handling unstated, so the model resolves them differently each run. Specifying the exact contract — fixed fields, each with a type and a length bound — removes those degrees of freedom and is what a downstream parser needs. A (temperature) reduces token-level randomness but does nothing about an underspecified shape; a deterministic model still has to invent a structure you never gave it. B is a negative, vague instruction — “too much” is undefined and “be consistent” names the goal without specifying the target. D is the exact assumption modern literal-following models invalidate: a bigger model will not infer a contract you didn’t write, and may follow your vague prompt more faithfully, not less.

Solution ↑ Exercise

A positive rewrite: “Respond directly with the answer in flowing prose.” That single instruction covers both prohibitions — “directly” eliminates the preamble, “flowing prose” rules out headers — by naming the target instead of the forbidden regions. It steers more reliably because a prohibition (“don’t use headers,” “no preamble”) shrinks the output space without locating the destination inside the still-vast permitted region, whereas the positive form aims straight at the one shape you meant.

Solution ↑ Exercise

The three rungs, cheapest first: (1) explicit instruction — name the four labels in the prompt and ask the model to return exactly one; (2) few-shot examples — demonstrate the labeling on ambiguous inputs; (3) structured outputs / strict tools — constrain decoding so only an in-set label can be emitted. Here you should climb to rung 3: the premise is that a malformed or out-of-set label crashes the pipeline, so you need the shape to be unviolatable, not merely likely-correct. An enum-constrained tool or structured output makes an out-of-set value unrepresentable — the only rung that turns “should be valid” into “cannot be invalid,” which is what a crash-on-violation contract demands.

Exam essentials

Consistency lives in the specification — if two runs disagree, a degree of freedom was left unstated; pin the format (fields, types, lengths, missing-data handling) explicitly.
Modern models follow instructions literally — Opus 4.8 “does not infer requests you didn’t make,” so an unstated requirement is an unmet one; state it rather than expecting the model to read intent.
Positive beats negative — “respond in flowing prose” / “answer in one sentence” steers better than “don’t use lists” / “don’t be verbose”; aim at the target instead of ruling out one failure.
Escalation ladder — explicit instruction → few-shot examples (D4.2) → structured outputs / strict tools (D4.3); climb only as far as the stakes require, since each rung costs more.
Why stable-principle — “be explicit about what you want” survives every model version; newer, more-literal models make it more load-bearing, not less.

Part 4 Chapter 2 Last verified 2026-06-02 Fresh

Few-Shot Prompting for Ambiguous Cases

Examples are the most reliable way to steer format, tone, and structure — and the only clean way to pin down an ambiguous case. The pattern is 3-5 relevant, diverse, structured examples, with at least one placed on the edge case showing the desired handling.

Volatility: architectural-pattern

Tools compared: claude-code

D4.1’s escalation ladder put examples on the second rung: when a plain instruction can’t fully pin a behavior — especially on the messy, ambiguous inputs — a demonstration does what a description cannot. This chapter is that rung. It is an architectural pattern: the 3-5-example construction and its quality criteria are stable across model versions, with the example-tag syntax as the illustration that may shift.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

When does a demonstration outperform a written instruction — what kind of case is few-shot’s home turf?
What is the documented example-count sweet spot, and what fails below it and above it?
Name the three example-quality criteria. Which one, when neglected, silently corrupts the prompt?
To pin that a missing field resolves to null (not "unknown"), do you add a rule or an example — and where in the set?
Few-shot and structured outputs (D4.3): competing or complementary? What does each one control?

Check your answers

A demonstration wins where description is ambiguous — few-shot’s home turf is the case an instruction can’t fully specify, like how a borderline input should resolve.
3-5 examples is the documented sweet spot: with 1-2 the model latches onto an incidental trait, and at 6+ you burn context and risk disagreeing examples teaching “either is acceptable.”
Relevant, diverse, structured — diverse is the most often neglected and most consequential: examples that all share an irrelevant trait teach that trait as if it were the rule.
An example — show an input with the missing field whose output is null, placed in the middle of the set; the model generalizes from the example treatment, not from a prose rule beside it.
Complementary — the schema locks the shape while examples teach content and edge-case handling.

Examples are the most reliable steering mechanism

When a behavior is hard to describe, demonstrate it. “Examples are one of the most reliable ways to steer Claude’s output format, tone, and structure. A few well-crafted examples (known as few-shot or multishot prompting) can dramatically improve accuracy and consistency.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original An example carries information a sentence struggles to: the exact field ordering, the precise tone, how a borderline input should resolve. The model does not memorize the examples — it extracts the implicit pattern across them and applies it to the new input.

The 3-5 sweet spot

The documented count is small and specific: “Include 3-5 examples for best results. You can also ask Claude to evaluate your examples for relevance and diversity, or to generate additional ones based on your initial set.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original The range is not arbitrary — too few examples and the model latches onto an incidental trait; too many and you burn context for no gain and risk contradictory examples confusing it.

Relevant, diverse, structured

Quality matters more than count. The three criteria are explicit: “When adding examples, make them: Relevant: Mirror your actual use case closely. Diverse: Cover edge cases and vary enough that Claude doesn’t pick up unintended patterns. Structured: Wrap examples in <example> tags (multiple examples in <examples> tags) so Claude can distinguish them from instructions.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original Diversity is the criterion most often neglected and the most consequential: three examples that all happen to share an irrelevant trait teach that trait as if it were the rule.

Target the ambiguous case directly

This is the heart of the cert task area. When an extraction or classification has a messy case — a missing field, an “other” bucket, an unusual variant — do not write a separate rule for it; show an example on that case with the handling you want. The model generalizes from the example treatment, not from a prose rule beside it. Put the ambiguous input in the middle of the set with its desired output:

<examples>
  <example>
    <input>Order #4815 shipped on Apr 3 via UPS tracking 1Z999AA10123456784.</input>
    <output>{"order_id": "4815", "carrier": "UPS", "tracking": "1Z999AA10123456784"}</output>
  </example>
  <example>
    <input>Customer asked about order status yesterday but gave no order number.</input>
    <output>{"order_id": null, "carrier": null, "tracking": null}</output>
  </example>
  <example>
    <input>Shipped today via 'FedEx Express Saver' - see ref 7712-4488-9933.</input>
    <output>{"order_id": null, "carrier": "FedEx", "tracking": "7712-4488-9933"}</output>
  </example>
</examples>

The middle example is the ambiguous case: it teaches that “no order number” resolves to null — not an empty string, not "unknown", not "n/a". [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original Few-shot also composes with the next rung of the ladder: with structured outputs (D4.3), the schema locks the shape while examples still teach content and edge-case handling — the two are complementary, not redundant. [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original A schema can require order_id to be a string-or-null; only an example teaches that this kind of input is the null one.

Constructing a 3-5 set that targets the edge case Worked example

Task: classify a support ticket as one of bug | feature_request | question | other. The hard case is a feature request phrased as a complaint (“Your export is useless without CSV”) — instructions keep mislabeling it bug. Build the set deliberately:

<examples>
  <example>  <!-- canonical bug -->
    <input>Clicking Export throws a 500 error every time.</input>
    <output>{"category": "bug"}</output>
  </example>
  <example>  <!-- the edge case, placed in the middle -->
    <input>Your export is useless without CSV — fix this.</input>
    <output>{"category": "feature_request"}</output>
  </example>
  <example>  <!-- a question, to vary the class -->
    <input>Where do I find the export button?</input>
    <output>{"category": "question"}</output>
  </example>
  <example>  <!-- the "other" bucket -->
    <input>Thanks for the great support last week!</input>
    <output>{"category": "other"}</output>
  </example>
</examples>

Read it against the three criteria. Relevant: every input is a real ticket in your actual phrasing, not a toy sentence. Diverse: four different categories, and the second example deliberately breaks the “angry tone → bug” pattern an instruction-only prompt was learning — that one example is doing the real work. Structured: each pair is wrapped in <example>, the set in <examples>, so the model separates demonstrations from the instruction and the input. Four examples, one per class with the edge case carrying its own slot — squarely in the 3-5 sweet spot. Note what is not here: a prose sentence saying “angry feature requests are not bugs.” The demonstration replaces that rule, and does it more reliably.

The use-side craft of building and iterating on an example set lives in the Use book’s prompt-engineering chapter (forthcoming), alongside D4.1’s hands-on companion material.

Practice

Exercise solutions

Solution ↑ Exercise

B. The ambiguous case — an input with no due date — is exactly what an example should demonstrate: place one in the set whose output shows "due_date": null, and the model generalizes from that treatment. A is the D4.1 instinct (be explicit), and it helps, but a prose rule beside the examples is weaker than a demonstration on the case itself: the model generalizes from the example treatment, not from a separate written rule sitting next to it (the principle established in “Target the ambiguous case” above). C adds volume without diversity: ten clean invoices never show the missing-date case, so the model still has nothing to copy for it. D works mechanically but is a band-aid that hard-codes one symptom (“unknown”); the model may emit “none” or "" next, and you are now maintaining a translation table instead of fixing the prompt. The few-shot fix targets the root cause.

Solution ↑ Exercise

The three criteria are relevant (mirror your actual use case closely), diverse (cover edge cases and vary enough that the model doesn’t latch onto an unintended pattern), and structured (wrap each example in <example> tags, the set in <examples>, so the model separates demonstrations from instructions). Diversity is the silent corrupter: when three examples happen to share an incidental trait — every input ends in a period, every output capitalizes the first field — the model generalizes from whatever is common across the set, so it learns that trait as if it were the rule. The prompt still “looks right,” but it has quietly taught the wrong invariant, and the failure only surfaces on inputs that don’t share the accidental trait.

Solution ↑ Exercise

With a single example, the model cannot tell which of the example’s traits are the pattern and which are incidental — quoting the first field is one concrete trait of that one sample, and with nothing to contrast it against, the model copies it as if it were required. This is the documented failure of 1-2 examples: high risk of picking up an incidental pattern instead of the intended one. Raising the count into the 3-5 range fixes it by adding contrast: across several examples that vary the incidental traits (some quote nothing, different field orders) while holding the intended pattern constant, the first-field-quoting habit no longer appears in every example, so the model stops treating it as the rule. The cure is diversity-via-count, not volume for its own sake.

Exam essentials

Few-shot is the most reliable steering mechanism for format, tone, and structure — the model extracts the implicit pattern across examples, so it disambiguates where instructions can’t.
3-5 examples is the documented sweet spot: 1-2 risks learning an incidental trait, 6+ burns context and risks contradictions; budget the 3-5 as one canonical + variants + one edge case.
Relevant, diverse, structured — mirror the real use case, vary enough to avoid spurious patterns, and wrap each in <example> (group in <examples>) so the model separates examples from instructions and input.
Target the ambiguous case — put an example on the edge case (null field, “other” bucket) showing the desired handling; the example teaches it, a prose rule beside it teaches it less reliably.
Composes with structured outputs — the schema locks shape, examples teach content and edge-case handling; complementary, not redundant.

Part 4 Chapter 3 Last verified 2026-06-08 Fresh

Structured Output via Tool Use and JSON Schema

Forcing a known-shape JSON result has two generations. The classic pattern borrows the tool-call channel as a typed output slot; the modern features (strict tool use and output_config.format) use grammar-constrained decoding to make a non-conforming shape unrepresentable. The JSON-Schema subset, additionalProperties false, and the per-request limits are the surfaces to know.

Volatility: feature-surface

Tools compared: claude-code

D4.1 ended with a top rung: when a shape must not be violated, make it unrepresentable. This chapter is that rung’s machinery. It has two generations — the older tool-use pattern that is still the right tool for open-ended schemas, and the newer grammar-constrained features that eliminate schema-violation retries entirely — and because the substance here is named API fields, schema rules, and numeric limits, it is a feature surface.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

In the classic pattern, which field of the forced tool call holds your data, and what do you do with the tool’s “result”?
What does strict: true guarantee that the classic pattern alone does not — and through which integration path is it silently dropped?
Which JSON-Schema keyword is mandatory on every object node, and why does the constrained decoder require it?
Name the two failure modes constrained decoding cannot prevent, and the stop_reason value that signals each.
For open-ended extraction (“I don’t know which fields will appear”), do you reach for structured outputs or classic tool use, and why?

Check your answers

The data is the forced call’s tool_use.input — that object is your extracted JSON; the tool’s “result” is discarded entirely.
It grammar-constrains tool inputs to your schema — no wrong types ("2" for 2), no missing required fields; it is silently dropped by the OpenAI SDK compatibility layer, which honors the request but gives no grammar guarantee.
additionalProperties: false — an open object has no closed grammar to compile, so the decoder must know exactly which keys are permitted at each step.
Refusal (stop_reason: "refusal" — a 200 you are billed for, output may not match) and truncation (stop_reason: "max_tokens" — every token schema-valid but the object never closed).
Classic tool use — open-ended extraction needs additionalProperties: true, which structured outputs cannot accept since it requires additionalProperties: false on every object.

The classic mechanism: a tool whose input is your output

The oldest reliable way to get JSON is to borrow the tool-call channel. Define a tool whose input_schema is exactly the shape you want back, force Claude to call it, and read the call’s input — that object is your extracted JSON; you discard the tool’s “result” entirely. The convention is to name the tool print_X (print_summary, print_entities) so the model treats it as committing data rather than taking an action. [Official] Extracting Structured JSON using Claude and Tool Use · AnthropicT1-official original Forcing the call is what guarantees the extraction happens: tool_choice: {type: "tool", name: "print_summary"}. [Official] Extracting Structured JSON using Claude and Tool Use · AnthropicT1-official original

tools = [{
    "name": "print_summary",
    "description": "Prints a summary of the article.",
    "input_schema": {
        "type": "object",
        "properties": {
            "author":  {"type": "string"},
            "topics":  {"type": "array", "items": {"type": "string"}},
            "summary": {"type": "string"},
        },
        "required": ["author", "topics", "summary"],
    },
}]
resp = client.messages.create(model="claude-opus-4-8", max_tokens=1024, tools=tools,
    tool_choice={"type": "tool", "name": "print_summary"}, messages=[...])
json_summary = next(b.input for b in resp.content if b.type == "tool_use")

Strict tool use: from shape to guarantee

The classic pattern controls which fields appear, but not their types — Claude could still emit "2" where you need 2. Setting strict: true on the tool definition closes that gap: “Setting strict: true on a tool definition guarantees Claude’s tool inputs match your JSON Schema by constraining the model’s token sampling to schema-valid outputs (a technique called grammar-constrained sampling).” [Official] Strict tool use · AnthropicT1-official original The motivation is operational: “Without strict mode, Claude might return incompatible types (‘2’ instead of 2) or missing required fields, breaking your functions and causing runtime errors.” [Official] Strict tool use · AnthropicT1-official original For “call one of N candidate tools and validate its inputs,” combine tool_choice: {type: "any"} with strict: true on each tool. [Official] Strict tool use · AnthropicT1-official original

Structured outputs: constrain the response itself

Strict tool use constrains a tool call. Its sibling constrains Claude’s response directly: output_config.format coerces the final assistant text to a JSON schema using the same pipeline. The two are “two complementary features: JSON outputs (output_config.format) … Strict tool use (strict: true),” and the payoff is the elimination of retry loops: “Structured outputs guarantee schema-compliant responses through constrained decoding … No retries needed for schema violations.” [Official] Structured outputs · AnthropicT1-official original The request carries output_config: {format: {type: "json_schema", schema: {...}}}, and the conforming JSON arrives in the response text.

Note the migration: “The output_format parameter has moved to output_config.format, and beta headers are no longer required. The old beta header (structured-outputs-2025-11-13) and output_format parameter will continue working for a transition period.” [Official] Structured outputs · AnthropicT1-official original

The JSON-Schema subset and its one mandatory rule

Both features accept a subset of JSON Schema, not the full draft. Objects, arrays, the scalar types, enum, const, anyOf, and internal $ref are supported; external $ref, recursive schemas, numerical bounds (minimum/maximum), and string-length bounds are not — unsupported features return a 400. [Official] Structured outputs · AnthropicT1-official original The one rule that catches everyone: additionalProperties: false is required on every object node — it is the most common 400 for hand-authored schemas. When you need a numeric or length bound, the SDK helpers strip it from the schema, encode it as description text, and validate it client-side after the call instead. [Official] Structured outputs · AnthropicT1-official original

Limits, caching, and the failure modes that still get through

Three operational facts complete the picture. Caching: the compiled grammar carries a first-request latency, then is “cached for 24 hours from last use,” and the cache “invalidates if you change the JSON schema structure or set of tools. Changing only name or description fields does NOT invalidate cache.” [Official] Structured outputs · AnthropicT1-official original Limits: a request allows at most 20 strict tools, 24 cumulative optional parameters across strict schemas, and 16 union-typed parameters; beyond that (or an internal grammar-size cap) you get a 400 “Schema is too complex for compilation.” [Official] Structured outputs · AnthropicT1-official original The failures constrained decoding cannot prevent. Grammar constraints guarantee every emitted token is schema-valid — but not that Claude emits a complete result, so two gaps survive and a caller must check stop_reason for both. First, a refusal: if Claude refuses, the response is stop_reason: "refusal" with a 200 status, you are billed, and the output may not match the schema. [Official] Structured outputs · AnthropicT1-official original Second, truncation: max_tokens is a hard cap on output, and stop_reason: "max_tokens" is its output-budget signal. [Official] How the agent loop works · AnthropicT1-official original If generation hits that cap mid-structure, every token emitted was schema-valid but the object never closed — the JSON is cut off, so a parser rejects it just the same. The fix is not a retry on the same budget but a larger max_tokens (or a smaller schema); the constrained decoder cannot finish a structure it ran out of room to write.

This is also why the classic cookbook pattern has a permanent niche: open-ended extraction with additionalProperties: true — “I don’t know which fields will be present” — is something structured outputs cannot do, since it requires additionalProperties: false. For open-ended schemas, plain tool use stays the right tool. [Official] Extracting Structured JSON using Claude and Tool Use · AnthropicT1-official original

Reading a strict extraction without trusting it blindly Worked example

You force a strict print_record extraction. The grammar guarantees the shape — but a naive caller that reads the result directly still crashes on the two failures above. Guard them:

resp = client.messages.create(
    model="claude-opus-4-8", max_tokens=1024, tools=[print_record],
    tool_choice={"type": "tool", "name": "print_record"}, messages=[...])

# strict: true makes each emitted token schema-valid -- not the response complete.
if resp.stop_reason == "refusal":
    raise ExtractionRefused()        # 200 status, you were billed, may not match schema
if resp.stop_reason == "max_tokens":
    raise OutputTruncated()          # valid tokens, but the object never closed
record = next(b.input for b in resp.content if b.type == "tool_use")

Trace the reasoning: the grammar buys you per-token schema-validity, so you will never see "3" where you required 3. It does not buy you a guarantee that a record arrived. A refusal returns a 200 you paid for with no conforming object; a max_tokens stop cuts the JSON off mid-write so tool_use.input is absent or malformed. Reading resp.content before checking stop_reason turns both into a confusing StopIteration or parse error three layers downstream. Two cheap branches convert them into legible, actionable failures — and the max_tokens branch tells you to raise the cap, not to retry the same doomed request.

Practice

Exercise solutions

Solution ↑ Exercise

A. A single forced tool plus strict: true gives both guarantees you need: the forced tool_choice ensures exactly this extraction runs, and strict constrains the inputs by grammar so seats is a real integer, not "3" — exactly the “incompatible types breaking your functions” case strict mode exists to prevent. B is the open-ended pattern; additionalProperties: true is for when you don’t know the fields, and it forgoes the strict guarantee — wrong for a fixed, type-critical record. C silently drops strict through the OpenAI-compatibility layer, so you lose the type guarantee precisely where you needed it. D is the unconstrained baseline D4.1 warned about — it can parse fine and still hand you "3", the failure you are trying to design out.

Solution ↑ Exercise

The most likely cause is an object node missing additionalProperties: false, and the one-line fix is to add it to every object in the schema — nested ones included. The decoder requires it because an open object (one allowing arbitrary extra keys) has no closed set of valid continuations to compile into a grammar: at each decoding step the model must know exactly which keys are permitted, and additionalProperties: false is what closes that set. Without it there is no finite grammar to constrain sampling against, so the API rejects the schema with a 400 rather than allow unconstrained keys. It is the top 400 for hand-authored schemas precisely because standard JSON Schema defaults additionalProperties to true, so a schema that “validates fine in your editor” still fails compilation here.

Solution ↑ Exercise

The niche is open-ended extraction — “I don’t know which fields will be present” — and the schema feature that defines it is additionalProperties: true (an object that may carry arbitrary, unknown keys). Constrained decoding cannot serve it because it requires additionalProperties: false on every object: the grammar must enumerate the permitted keys ahead of time, and an object that allows any key has no closed grammar to compile. So when the set of fields is genuinely unknown in advance, the classic print_X tool-use pattern — which imposes no such closure — remains the right tool; structured outputs is for shapes you can pin down completely.

Solution ↑ Exercise

What is happening is truncation, not a grammar failure. max_tokens is a hard cap on total output, and the generation ran into it partway through writing the object; the grammar did its job — every token emitted was schema-valid — but it cannot guarantee the structure finishes within the budget, so the JSON is cut off before its closing braces and the parser rejects it. The confirming signal is stop_reason: "max_tokens" (the output-budget value), as opposed to end_turn. The fix is to raise max_tokens (or shrink the schema / split the extraction). A plain retry of the identical request is not the fix because it re-runs against the same budget and truncates at the same place — you must enlarge the room before the structure can complete.

Exam essentials

Classic tool-use pattern — define a print_X tool whose input_schema is your output shape, force it with tool_choice: {type: "tool", name: ...}, read tool_use.input; the tool result is discarded.
strict: true — grammar-constrains tool inputs to the schema (no wrong types, no missing required fields); pair with tool_choice: {type: "any"} for “one-of-N and valid.” Ignored on the OpenAI-compat layer.
output_config.format — grammar-constrains Claude’s response to a JSON schema; “no retries needed for schema violations.” Migrated from the output_format param / structured-outputs-2025-11-13 beta header.
Schema subset — a subset of Draft 2020-12; additionalProperties: false is mandatory on every object (top 400 cause); no numeric/length bounds (SDKs strip them to descriptions + post-validate); no external $ref or recursion.
Limits + failure modes — 20 strict tools / 24 optional params / 16 union types per request; grammar cached 24h from last use (invalidated by schema/tool-set change, not name/description); two failures slip past constrained decoding — a refusal (stop_reason: "refusal"; 200, billed, may not match) and truncation (stop_reason: "max_tokens"; JSON cut off mid-object). Always check stop_reason for both; raise max_tokens for the second rather than retrying.

Part 4 Chapter 4 Last verified 2026-06-02 Fresh

Validation, Retry, and Feedback Loops

Constrained decoding eliminates schema errors, never semantic ones — valid JSON can still hold wrong data. The architect's job is the layer above the schema, discriminating the two error kinds, encoding semantic checks into the schema itself, and closing a bounded validate-feed-back-retry loop that escalates to a human on exhaustion.

Volatility: architectural-pattern

Tools compared: claude-code

D4.3 closed with a hard guarantee — a schema a response cannot violate — and one caveat: a refusal still gets through. This chapter is about a deeper caveat. Constrained decoding guarantees the shape, never the truth. A perfectly schema-valid record can name the wrong customer or fabricate a total. The pattern for catching that lives above the API, in your validation loop — and the loop, not its current field names, is what this chapter is about, which makes it an architectural pattern.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

A schema-valid record names the wrong customer. Which error class is this, and can the API catch it?
An SDK structured-output run returns. How do you tell success from exhausted retries, and why must you check before reading the payload?
To catch a total that doesn’t match the line items, what schema-level hook makes the error mechanically checkable?
Name the three layers of the validate-feed-back-retry pattern. Which layer catches a semantic error?
Your retry loop keeps exhausting on long inputs that come back cut off. Why won’t more retries help, and what will?

Check your answers

A semantic error — valid JSON, incorrect data — and the API cannot catch it: a schema constrains form, never fact.
Branch on subtype — success carries the payload in message.structured_output, error_max_structured_output_retries means fall back; exhaustion is a result, not an exception, so unchecked code silently reads undefined.
The stated_total vs calculated_total pair — the model copies the document’s total and re-sums the line items, and the caller compares the two.
API constrained decoding, application-code semantic checks, and a bounded feedback loop — the semantic error is caught by layer 2, application-code checks.
The failure is truncation (stop_reason: "max_tokens") — re-prompting on the same budget truncates at the same place; raise the cap (or shrink the schema) instead.

Two kinds of error: schema and semantic

Structured outputs (D4.3) eliminates a whole class of failure: “Always valid: No more JSON.parse() errors. Type safe: Guaranteed field types and required fields. Reliable: No retries needed for schema violations.” [Official] Structured outputs · AnthropicT1-official original What it cannot touch is the other class — the semantic errors: responses that are valid JSON matching your schema but containing incorrect data, the very failures the SDK’s validate-and-feed-back machinery exists to catch. [Official] Get structured output from agents · AnthropicT1-official original A schema can require customer_name to be a non-empty string; it cannot know the source said “Jane” while the model wrote “John.”

The SDK retry loop handles the schema layer

For the residual schema mismatches in a multi-tool agentic run, the Agent SDK adds a retry loop: “the SDK validates the output against it, re-prompting on mismatch. If validation does not succeed within the retry limit, the result is an error instead of structured data.” [Official] Get structured output from agents · AnthropicT1-official original Crucially, exhaustion is a result you inspect, not an exception that throws — you discriminate on subtype: success carries the typed payload in message.structured_output; error_max_structured_output_retries means the budget ran out and you must fall back. [Official] Get structured output from agents · AnthropicT1-official original

Encode the semantic check into the schema

You cannot retry your way out of a semantic error the SDK never sees — so make the model commit to signals a caller can check. The pattern is to add fields whose only job is verification:

Each pattern converts an un-checkable judgment (“is this right?”) into a mechanical test (“does calculated_total equal the sum?”). [Official] Get structured output from agents · AnthropicT1-official original The model is doing the same extraction either way; you are just asking it to show enough of its work that a downstream check can catch a lie.

Close the loop: validate, feed back, retry, escalate

The full pattern stacks three independent layers, and skipping any one surfaces a different failure. [Official] Get structured output from agents · AnthropicT1-official original Layer 1 is constrained decoding (schema errors gone). Layer 2 is your application code running the semantic cross-checks above. Layer 3 re-prompts with the specific failures (“calculated_total does not equal the sum of line items; re-extract correcting this”) for a bounded number of attempts, then falls back.

Unbounded, schema-thrashing, or truncation-trapped retry loops

Three ways the feedback loop bites back. First, an unbounded loop on a genuinely ambiguous task never converges — bound the attempts and escalate to human review on exhaustion, because the alternative is paying inference forever for an answer that will not come. Second, each retry is a full inference, so a three-retry run costs roughly four times a clean one — and if you mutate the schema between attempts you also invalidate the grammar cache (D4.3) and re-pay compilation; keep the schema fixed across retries and vary only the feedback message. Third, a truncation is the one failure retries cannot fix on their own: if a response stopped at the max_tokens cap — stop_reason: "max_tokens", the output-budget signal — re-prompting on the same budget truncates at the same place and silently burns the whole retry allowance. [Official] How the agent loop works · AnthropicT1-official original Detect that stop reason and raise the cap (or shrink the schema) instead of spending a single retry on it.

The schema-design heuristics fold back into D4.3: keep schemas focused, and mark fields optional when the source might not contain them — an over-required schema turns a missing field into a retry and then an exhausted-budget error. [Official] Get structured output from agents · AnthropicT1-official original

The three layers on one invoice Worked example

A nightly invoice extractor runs the full pattern, and a fabricated total shows where each layer earns its place:

Layer 1 — constrained decoding. Structured outputs returns a schema-valid object: { "stated_total": 480.00, "calculated_total": 520.00, "line_items": [...] }, with subtype: "success". No JSON.parse error is possible and both totals are guaranteed floats. The schema layer is done — and it has caught nothing wrong, because nothing is wrong with the shape.
Layer 2 — application-code semantic check. Your code re-sums line_items (520.00) and compares it to stated_total (480.00). 480 ≠ 520 → a semantic error the API could never see: both numbers are valid, but they disagree. This catch exists only because you added the stated_total / calculated_total pair to the schema — the model showed enough work for a mechanical test to run.
Layer 3 — bounded feedback. The loop re-prompts with the specific failure: “stated_total (480.00) does not match the sum of line_items (520.00); re-extract, correcting the discrepancy.” Bound it to, say, three attempts. If a retry reconciles, return success; if the budget exhausts — subtype: "error_max_structured_output_retries" or a persistent mismatch — route to human review, never silently bill the cheaper of two totals.

The discipline: layer 1 is free and total; layer 2 is where your design effort goes (the verification fields don’t exist until you add them); layer 3 must be bounded with a human backstop. Skip layer 2 and the bad total reaches billing; leave layer 3 unbounded and you pay inference forever on an answer that will not come.

Practice

Exercise solutions

Solution ↑ Exercise

B. A wrong-but-well-typed total is a semantic error — valid JSON, incorrect data — so no schema or type guarantee touches it. The fix is to make the error checkable: have the model emit both the document’s stated_total and its own calculated_total, then let application code compare them and route mismatches to review. A (strict) guarantees the total is a number, which it already was; it does nothing about a number being wrong. C (more tokens) addresses truncation, not arithmetic fabrication. D (minimum) is both unsupported by the structured-outputs subset and irrelevant — a bound on magnitude can’t detect a total that’s internally inconsistent with the line items.

Solution ↑ Exercise

The two subtypes are success — the run validated, and the typed payload is on message.structured_output — and error_max_structured_output_retries — validation failed within the retry budget, so there is no payload and you must fall back (simpler schema, simpler prompt, or human review). You must branch on subtype before reading the payload because exhaustion returns a result, not an exception: code that reads message.structured_output on the error path reads undefined and silently processes garbage downstream. The subtype is the success/failure contract; the payload is present and trustworthy only on success.

Solution ↑ Exercise

The three layers are (1) API constrained decoding (the schema layer — eliminates syntax/type/required/enum errors), (2) application-code semantic checks (the domain layer — cross-checks what the data means), and (3) a bounded feedback loop (re-prompts with the specific failures, then escalates on exhaustion). A fabricated customer_name — a valid string naming the wrong person — is a semantic error: layer 1 cannot catch it (the shape is perfect) and layer 2 is the one that must. The hook that makes the catch possible is a provenance field: have the model emit the source span it drew the name from (claim + source.span_quote + confidence), so application code can verify the quoted span actually appears in the document — turning “is this the right person?” into a mechanical string-containment check (D5.6).

Solution ↑ Exercise

(a) The failure is truncation, not a schema or semantic error: the response hit the max_tokens output cap partway through writing the object, confirmed by stop_reason: "max_tokens" (the output-budget value, versus end_turn). (b) The retry loop makes it worse because each re-prompt runs against the same max_tokens budget, so it truncates at the same place — every attempt fails validation identically, and the loop burns its entire allowance reaching error_max_structured_output_retries without ever being able to succeed. (c) The fix is to detect stop_reason: "max_tokens" and raise the cap (or shrink / split the schema) before retrying — retries cannot manufacture room the budget does not allow.

Exam essentials

Schema vs semantic — constrained decoding eliminates syntax/type/required/enum errors; semantic errors (valid JSON, wrong data) are invisible to the API and need domain logic.
SDK retry loop — validates and re-prompts on mismatch; the result is success (payload in message.structured_output) or error_max_structured_output_retries (fall back). It’s a result you check, not an exception; the retry count is undocumented.
Schema-level semantic hooks — detected_pattern, stated_total vs calculated_total, conflict_detected, nullable “other” + detail, and the provenance triple turn “is it correct?” into a mechanical cross-check.
Three layers — API constrained decoding (schema) + application-code semantic checks + a bounded feedback loop that re-prompts with the specific errors; escalate to human review on exhaustion.
Loop economics — bound the attempts (each retry is a full inference, ~4× for three retries) and keep the schema stable across retries so you don’t re-pay grammar compilation. A truncation (stop_reason: "max_tokens") is the trap retries can’t fix — re-prompting on the same budget re-truncates; detect it and raise the cap instead.

Part 4 Chapter 5 Last verified 2026-06-08 Fresh

Batch Processing: The Message Batches API

When nothing is waiting on the answer, batch trades latency for half the price. The Message Batches API processes up to 100,000 async requests at a 50 percent discount with a 24-hour SLA, and its one non-negotiable contract is custom_id matching, because results come back in any order.

Volatility: feature-surface

Tools compared: claude-code

The first three chapters of Part IV controlled a single response. This one scales to a hundred thousand of them. A batch is the right tool whenever the work is large and nothing is waiting on the answer — and its surface (the endpoint, the size limits, the custom_id rule, the beta header) is exactly the kind of named detail that shifts between releases, so this is a feature-surface chapter.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

What single factor decides batch versus real-time, and which way does each tolerance point?
Why is custom_id mandatory, and what specifically breaks if you rely on result order?
Name the two size limits on a single batch, and which one an HTTP 413 reports.
A batch result comes back succeeded — does that mean its answer is usable? What must you still check?
Which result types are not billed, and which billed-but-possibly-useless outcome is the trap?

Check your answers

Latency tolerance alone decides: if a human or synchronous system is blocked on the result, batch is wrong; if the work is an overnight job, backfill, or offline evaluation, batch halves the bill.
Results “can be returned in any order,” so the unique custom_id is the only sanctioned join key; relying on positional order silently mismatches outputs to inputs, and nothing in the response flags it.
100,000 requests or 256 MB, whichever is reached first — an HTTP 413 on creation reports the 256 MB payload limit.
No — succeeded is a batch-level outcome that says the request ran; you must still inspect the message’s own stop_reason, because a refusal ("refusal") or truncation ("max_tokens") arrives as succeeded.
errored, canceled, and expired are not billed; the trap is a succeeded refusal — it returns a 200, you pay for it, and it may not match your schema.

The cost-latency trade

The Message Batches API exists for one trade: give up immediacy, get half off. “The Message Batches API is a powerful, cost-effective way to asynchronously process large volumes of Messages requests. This approach is well-suited to tasks that do not require immediate responses, with most batches finishing in less than 1 hour while reducing costs by 50% and increasing throughput.” [Official] Batch processing (Message Batches API) · AnthropicT1-official original The discount is a flat 50% on both input and output across all tiers, and it stacks with prompt-caching discounts. The cost of the discount is a service-level agreement measured in hours, not milliseconds: most batches finish within an hour, but the guarantee is 24, and a batch that does not complete in 24 hours expires. [Official] Batch processing (Message Batches API) · AnthropicT1-official original Results stay retrievable for 29 days after creation.

The custom_id contract

A batch is a set, not a sequence, and that has one non-negotiable consequence: “Batch results can be returned in any order, and may not match the ordering of requests when the batch was created. … To correctly match results with their corresponding requests, always use the custom_id field.” [Official] Batch processing (Message Batches API) · AnthropicT1-official original Every request carries a unique custom_id (1–64 characters, alphanumeric plus - and _), and that id is the only thread connecting an output back to the input that produced it.

The batch envelope: what fits and what it can’t do

A batch is bounded by size and by shape. Size: “A Message Batch is limited to either 100,000 Message requests or 256 MB in size, whichever is reached first.” [Official] Batch processing (Message Batches API) · AnthropicT1-official original Exceed the payload and the create call returns HTTP 413 — break huge datasets into multiple batches. Shape: a batch supports all Messages API features including beta features, “however, streaming is not supported for batch requests,” [Official] Batch processing (Message Batches API) · AnthropicT1-official original and each request is single-shot — there is no follow-up turn inside a batch, so multi-turn tool round-trips do not work. Structured outputs (D4.3), by contrast, compose cleanly: a batched request can carry output_config.format and you get schema-valid JSON at 50% off. [Official] Structured outputs · AnthropicT1-official original

Billing, result types, and the lifecycle

You pay only for what works: a result is succeeded, errored, canceled, or expired, and “you are not billed for errored, canceled, or expired requests.” [Official] Batch processing (Message Batches API) · AnthropicT1-official original For unusually long generations there is an opt-in: the output-300k-2026-03-24 beta header “raises the max_tokens cap to 300,000 for batch requests using Claude Opus 4.8, Claude Opus 4.7, Claude Opus 4.6, or Claude Sonnet 4.6” — batch-only, and a single 300k generation can itself take over an hour, so submit it with the 24-hour window in mind. [Official] Batch processing (Message Batches API) · AnthropicT1-official original

A succeeded result is not automatically a usable one

The four result types — succeeded, errored, canceled, expired — are batch-level outcomes: they tell you the request ran, not that its answer is good. A succeeded result “includes the message result,” [Official] Batch processing (Message Batches API) · AnthropicT1-official original and that message carries its own stop_reason. Two values still bite — a refusal (stop_reason: "refusal") returns a 200, is billed, and may not match your schema; a truncation (stop_reason: "max_tokens") is incomplete output. [Official] Structured outputs · AnthropicT1-official original Note the cost asymmetry: you are not billed for errored/canceled/expired, but a succeeded refusal you do pay for. So per-result handling must inspect each succeeded message’s stop_reason, not stop at the result type.

The lifecycle, with the check most callers skip Worked example

Classifying 80,000 tickets overnight. Two layers of result-checking, not one:

# 1. Create -- each request a unique custom_id (the only join key).
batch = client.messages.batches.create(requests=[
    {"custom_id": f"ticket-{t.id}", "params": {...}} for t in tickets])

# 2. Poll until ended (most < 1h; SLA 24h, then expiry).
while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    sleep(60)

# 3. Stream JSONL -- order is NOT guaranteed.
for r in client.messages.batches.results(batch.id):
    if r.result.type != "succeeded":
        handle_failure(r.custom_id, r.result.type)        # errored/canceled/expired -- unbilled
        continue
    msg = r.result.message
    if msg.stop_reason in ("refusal", "max_tokens"):
        handle_unusable(r.custom_id, msg.stop_reason)      # succeeded but NOT usable -- and billed
        continue
    records[r.custom_id] = parse(msg)                       # 4. join by custom_id

The structure is the lesson. The first guard is the batch-level result type — succeeded versus the three unbilled failures. The second, the one most pipelines omit, is the message-level stop_reason inside a succeeded result: a refusal or a truncation reaches you as succeeded yet carries no answer you can use — and you paid for the refusal. Skip it and you silently ingest refused/truncated outputs as if they were classifications. And throughout, custom_id is the only correct join key, because the result stream is unordered.

Practice

Exercise solutions

Solution ↑ Exercise

B. The job is large, offline, and cost-sensitive with no one waiting — the exact profile batch is built for: 50% off, and 80,000 requests sits within the 100,000-request limit. Matching by custom_id after the batch ends is the required pattern because results return unordered. A works but forfeits the 50% discount and adds rate-limit and orchestration overhead for latency nobody needs. C is impossible — streaming is not supported for batch requests. D collapses 80,000 independent classifications into one prompt, which blows past context limits and produces a single entangled response with no per-ticket structure.

Solution ↑ Exercise

custom_id is mandatory because batch results “can be returned in any order” — a batch is a set, not a sequence, so there is no positional correspondence between the request list and the result stream to fall back on. The unique custom_id is the only thread joining an output back to the input that produced it. If a caller instead assumes submission order, the specific failure is a silent mis-join: result n is attributed to request n when it actually answers some other request, so records carry the wrong data and nothing in the response flags it. That is the most dangerous failure class — one that corrupts data without surfacing an error.

Solution ↑ Exercise

The two limits are 100,000 requests or 256 MB in size, whichever is reached first; the 256 MB payload limit is the one an HTTP 413 reports on creation (the fix is to split the dataset into multiple batches). A Messages API capability that does not work inside a batch: streaming (explicitly unsupported), or equally a multi-turn tool loop — each batched request is single-shot, with no tool_result round-trip, because a batch processes each request as one independent user→assistant turn with no follow-up.

Exam essentials

The trade — batch is async and 50% off (input and output, all tiers, stacks with caching); most finish under an hour, the SLA is 24 hours, then the batch expires; results retained 29 days. Choose it by latency tolerance alone.
custom_id contract — results return in any order; match by the unique custom_id (1–64 chars, alphanumeric + - + _). Never rely on positional order; never reuse an id.
Envelope — 100,000 requests or 256 MB per batch (HTTP 413 over payload); streaming unsupported; each request is single-shot (no multi-turn tool loop). Structured outputs compose (schema-valid at 50% off).
Billing + beta — billed only for succeeded; errored/canceled/expired are free; the output-300k-2026-03-24 beta raises max_tokens to 300,000 on batch for Opus 4.8/4.7/4.6 and Sonnet 4.6.
Succeeded ≠ usable — a succeeded result still carries a per-message stop_reason; a refusal ("refusal", 200, billed, may not match schema) or a truncation ("max_tokens", incomplete) reaches you as succeeded. Check each succeeded message’s stop_reason, not just the result type.
Lifecycle — POST /v1/messages/batches → poll until ended → stream results_url JSONL → match by custom_id → optional DELETE before the 29-day window.

Part 4 Chapter 6 Last verified 2026-06-02 Fresh

Multi-Pass Review: Independent Reviewers and Attention Dilution

A fresh context catches what a self-review cannot, because attention dilutes as the window fills and an implementer is biased toward its own code. The same independent-reviewer pattern scales from a two-session Writer/Reviewer pair to a fleet of specialists guarded by a verification pass.

Volatility: architectural-pattern

Tools compared: claude-code

Part IV’s final chapter is about checking the work — at scale. The instinct to “ask the model to double-check itself” is exactly the wrong one, for a structural reason: a model reviewing in the same window that wrote the code is both dilated by a full context and biased toward what it just produced. The fix is independence, and it scales from two sessions to a fleet. The principle is stable; the product that embodies it is the illustration — so this is an architectural pattern.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Why is asking a model to review its own code in the same session the weakest review — for which two independent reasons?
Name the three scales of the independent-reviewer pattern, cheapest first.
What does the verification pass do in a fleet review, and what fails without it?
Name Code Review’s three severity tags, and which one you would block a merge on.
Code Review’s check run never blocks a merge on its own — so how do you actually gate on its findings?

Check your answers

Attention dilution (performance degrades as the context window fills) and implementer bias (a fresh context won’t be biased toward code it just wrote) — self-review is the most dilated and most biased reviewer you could pick.
Verification subagent (one session, child in its own context window), Writer/Reviewer (two independent sessions), fleet (many parallel specialists, one issue class each).
It re-checks each candidate finding against actual code behavior to filter out false positives; without it, parallel reviewers’ plausible-but-wrong findings accumulate — more reviewers means more noise.
🔴 Important / 🟡 Nit / 🟣 Pre-existing — block the merge on Important (a bug that should be fixed before merging).
Read the severity breakdown from the check-run output in your own CI and fail the step yourself — exit non-zero when the 🔴 Important count is positive, so your own required check blocks the merge.

Why a fresh context beats self-review

Two independent forces make same-session self-review weak. The first is attention dilution: “LLM performance degrades as context fills. When the context window is getting full, Claude may start ‘forgetting’ earlier instructions or making more mistakes. The context window is the most important resource to manage.” [Official] Best practices for Claude Code · AnthropicT1-official original The second is implementer bias: “A fresh context improves code review since Claude won’t be biased toward code it just wrote.” [Official] Best practices for Claude Code · AnthropicT1-official original A reviewer that never watched the code get written carries neither the polluted context nor the sunk-cost instinct to defend it.

The Writer/Reviewer pattern and its lightweight form

The canonical realization is two sessions. One writes; a second, with no inherited context, reviews; the first then addresses the feedback. The docs give a worked example: Session A implements a rate limiter, Session B reviews @src/middleware/rateLimiter.ts “looking for edge cases, race conditions, and consistency with existing middleware patterns,” and Session A applies the result. [Official] Best practices for Claude Code · AnthropicT1-official original The same shape works for tests: “have one Claude write tests, then another write code to pass them.” [Official] Best practices for Claude Code · AnthropicT1-official original When spinning up a second session is too heavy, the single-session analog is a verification subagent — “use a subagent to review this code for edge cases” — which runs in its own context window and so inherits none of the parent conversation’s assumptions. [Official] Best practices for Claude Code · AnthropicT1-official original

The fleet: parallelism plus a verification pass

At the top of the scale, the pattern becomes a fleet. In Anthropic’s Code Review product, “when a review runs, multiple agents analyze the diff and surrounding code in parallel on Anthropic infrastructure. Each agent looks for a different class of issue, then a verification step checks candidates against actual code behavior to filter out false positives.” [Official] Code Review · AnthropicT1-official original The surviving findings “are deduplicated, ranked by severity, and posted as inline comments.” [Official] Code Review · AnthropicT1-official original Fanning out to specialists is the direct architectural answer to attention dilution: rather than ask one reviewer to hold every bug class in one window at once, each agent owns a single class — and the isolated-context, lead-plus-specialists shape is the same one Anthropic’s multi-agent research system uses. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original

Code Review in practice: availability, triggers, severity

The productized form of the pattern has a concrete surface the exam can probe. Availability is gated: “Code Review is in research preview, available for Team and Enterprise subscriptions. It is not available for organizations with Zero Data Retention enabled.” [Official] Code Review · AnthropicT1-official original

Triggers come in three modes per repo: once after PR creation, after every push, or manual — invoked by commenting @claude review (which then subscribes to subsequent pushes) or @claude review once (a single one-off pass). [Official] Code Review · AnthropicT1-official original The one-off is the lever for “review this PR now, but don’t enroll it in re-review on every push.”

Severity is a fixed three-tag taxonomy on every finding:

And one subtlety that catches people: the check run “always completes with a neutral conclusion so it never blocks merging through branch protection rules.” [Official] Code Review · AnthropicT1-official original Code Review advises; it does not gate. To actually block a merge on findings, read the severity breakdown from the check-run output in your own CI and fail the step yourself. [Official] Code Review · AnthropicT1-official original

Convergence: keep multi-pass from spamming

More passes is not strictly better — multi-pass review needs convergence rules, and attention dilution applies to the instructions as much as the code. The docs are explicit that “a long REVIEW.md dilutes the rules that matter most,” [Official] Code Review · AnthropicT1-official original so the broadcast instruction block stays short. And re-review on every push needs a damping rule so trivial diffs don’t draw endless commentary: an instruction like “after the first review, suppress new nits and post Important findings only” stops “a one-line fix from reaching round seven on style alone.” [Official] Code Review · AnthropicT1-official original Production-quality fleet review is real work, not free — Code Review averages roughly $15–25 and 20 minutes per review at current figures [Official] Code Review · AnthropicT1-official original — so the convergence rules are also cost control.

Making an advisory review actually block a merge Worked example

Your team wants a merge blocked when Code Review finds a real bug — but its check run is neutral by design and never blocks through branch protection. So you build the gate yourself, on top of the surfaces above:

Trigger once. On a PR you want checked but not re-reviewed on every push, comment @claude review once. The fleet runs, the verification pass filters false positives, and findings post as inline comments tagged 🔴 Important / 🟡 Nit / 🟣 Pre-existing.
Read the severity breakdown. Code Review writes a machine-readable severity count into the check-run output; a CI step parses it (e.g. gh + jq) to pull the 🔴 Important count.
Fail on Important. Exit non-zero when it is positive, so your own required check (not Code Review’s) goes red and branch protection blocks the merge:

important=$(gh ... --jq '... | .important')
[ "$important" -gt 0 ] && { echo "Code Review: $important Important finding(s)"; exit 1; }

The reasoning ties three threads together. Code Review deliberately advises rather than gates, so a research-preview tool can never wedge a team’s merge queue. The 🔴/🟡/🟣 taxonomy is what makes a self-built gate precise — block on Important, let Nits and Pre-existing through. And the exit-code discipline from D3.6 is what converts an advisory signal into an enforced one. Note also the cost lever: @claude review once runs a single pass, where “after every push” multiplies the ~$15–25 per-review cost by your push count.

Practice

Exercise solutions

Solution ↑ Exercise

B. A fresh session or a verification subagent reviews with no inherited context, which removes both weaknesses of self-review at once — the reviewer is at peak performance in a clean window and has no authorship bias toward the code, exactly the conditions the docs credit for catching what self-review misses. A is self-review: most dilated (the implementation already fills the context) and most biased (the model defends what it wrote). C generates a second implementation, not a review, and gives you two artifacts to reconcile rather than a found bug. D re-pastes code into an already-polluted context; freshening the code does nothing about the accumulated context or the authorship bias — only a fresh, independent context fixes those.

Solution ↑ Exercise

Attention dilution says performance degrades as the context window fills — so a single agent asked to find every class of bug must hold the diff, the surrounding code, and a long checklist of bug categories in one window, and its attention to any one class thins as the others crowd in. Fanning out gives each specialist its own fresh context with a single mandate — race conditions, or injection, or edge cases — so none of them is operating dilated, and each brings full attention to its one class. The fleet trades one over-loaded reviewer for many focused ones; that is the direct architectural answer to dilution (paired, crucially, with a verification pass so the extra candidates don’t become noise).

Solution ↑ Exercise

The failure mode is false-positive amplification: parallel reviewers each independently flag plausible-but-wrong issues, and with no filter those candidates accumulate, so adding reviewers adds noise, not just signal — five agents surface five streams of unverified guesses. The verification pass re-checks each candidate against actual code behavior to filter out false positives before anything is posted, so only findings that survive a behavioral check reach the human. It is what makes fan-out a net gain rather than a faster false-positive generator; the surviving findings are then deduplicated and ranked by severity.

Exam essentials

Fresh context beats self-review for two independent reasons: attention dilution (performance degrades as the window fills) and implementer bias (a model defends code it just wrote). Independence removes both.
Writer/Reviewer — one session writes, a second independent session reviews, the writer addresses feedback; the test/code split is a variant; the verification subagent is the single-session, isolated-context form.
Fleet + verification pass — parallel specialists each own one issue class (the answer to attention dilution); a verification step filters false positives, then dedupe + severity ranking. A fleet without verification amplifies false positives.
Convergence rules — keep the broadcast instruction block short (“a long REVIEW.md dilutes the rules that matter most”) and damp re-review (“suppress new nits, post Important findings only”) so trivial diffs don’t draw endless passes; this is also cost control.
Code Review surfaces — research preview, Team & Enterprise only, not under Zero Data Retention; three trigger modes (once after PR creation / after every push / manual via @claude review or one-off @claude review once); severity taxonomy 🔴 Important / 🟡 Nit / 🟣 Pre-existing; the check run is neutral and never blocks a merge — gate by reading the severity breakdown in your own CI.

Part 4 · D4 Review

6 exercises across 6 chapters — interleaved review.

d4-01-explicit-criteria

d4-01-ex-vague-vs-explicit A nightly job asks Claude to "summarize each support ticket." The summaries come back wildly inconsistent — some one line, some three paragraphs, some with a sentiment label and some without — and a downstream parser keeps breaking. What is the most direct fix? - **A.** Lower the sampling temperature so the model decodes more deterministically and repeats one shape. - **B.** Add "be consistent and don't write too much" so the prompt tells it to self-regulate length. - **C.** Specify the exact output contract — a fixed set of typed, length-bounded fields the parser can rely on. - **D.** Switch to a larger model that infers the intended format more reliably from the same prompt.

d4-02-few-shot-prompting

d4-02-ex-ambiguous-edge You are extracting `{invoice_id, amount, due_date}` from emails. Most are clean, but some have no due date at all, and the pipeline keeps emitting `"due_date": "unknown"` for those — which the downstream date parser rejects. You want missing dates to come back as `null`. What is the most reliable fix? - **A.** Add a sentence to the instruction: "if there is no due date, use null, not 'unknown'." - **B.** Include 3-5 examples, one of which is an email with no due date whose output shows `"due_date": null`. - **C.** Raise the example count to 10 clean invoices so the model has more data to generalize from. - **D.** Post-process every `"unknown"` string into `null` after the model returns.

d4-03-structured-output-tool-use

d4-03-ex-pick-the-primitive You extract a fixed record — `{customer, plan, seats}` with `seats` an integer — and a downstream billing function will crash on a string where it expects a number. You want exactly this one extraction, type-guaranteed, on the native Claude API. Which approach fits best? - **A.** Force a single `print_record` tool with `tool_choice: {type: "tool", name: "print_record"}` and set `strict: true` on it. - **B.** Use the classic cookbook pattern with `additionalProperties: true` so the model can include whatever it finds. - **C.** Call through the OpenAI SDK compatibility layer with `strict: true` for portability. - **D.** Ask in the prompt for "valid JSON with an integer seats field" and parse the response text.

d4-04-validation-retry-feedback

d4-04-ex-semantic-catch An invoice extractor uses structured outputs and never returns malformed JSON, but once in a while it reports a `total` that doesn't match the line items — and those slip through to billing. You want to catch them automatically. What is the most effective design? - **A.** Add `strict: true` to the extraction tool so the totals are type-guaranteed. - **B.** Add both `stated_total` and `calculated_total` to the schema and have application code compare them, routing any mismatch to human review. - **C.** Raise `max_tokens` so the model has room to compute the total more carefully. - **D.** Tighten the JSON schema with a `minimum` constraint on `total`.

d4-05-batch-processing

d4-05-ex-batch-or-realtime You must classify 80,000 archived support tickets overnight to populate an analytics dashboard read the next morning. Cost matters; latency does not. Which design fits best? - **A.** A real-time loop calling the Messages API once per ticket, parallelized for throughput. - **B.** One Message Batch of 80,000 requests, each with a unique `custom_id`, results matched by `custom_id` after `processing_status` reaches `ended`. - **C.** A streaming batch so the dashboard can update ticket-by-ticket as results arrive. - **D.** A single Messages API request containing all 80,000 tickets in one prompt.

d4-06-multi-pass-review

d4-06-ex-independent-review Claude has just implemented a subtle concurrent cache in one long session, and you want the strongest possible bug-catch before merge. Which approach is most likely to surface a race condition? - **A.** Ask the same session, in the next turn, to carefully review its own implementation for bugs. - **B.** Open a fresh session (or dispatch a verification subagent) that reviews the cache with no context from how it was written. - **C.** Re-run the same prompt at a higher temperature to get a different implementation to compare. - **D.** Keep reviewing in the same session but paste the code back in so it is fresh in context.

Part 5 Chapter 1 Last verified 2026-06-08 Fresh

Long-Conversation Context: Accumulation, Degradation, Compaction

Context is a finite, accumulating resource, and a long conversation degrades before it overflows. This chapter frames the exam angle — cumulative context, the lost-in-the-middle and summarization failure modes, and lossy automatic compaction — and points to the design book where the degradation mechanics are proven in depth.

Volatility: architectural-pattern

Tools compared: claude-code

Part V is the reliability domain — context management, escalation, error propagation, provenance. It opens with the most basic constraint behind all of them: the context window is finite, and a long conversation gets worse before it gets full. This chapter is the cert-exam angle; the mechanics of why long context degrades are proven in depth in the design book, to which it points. It is an architectural pattern — the accumulation-and-compaction shape is stable, while the window sizes and message types are the moving surface.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

List what shares the cumulative context budget within a session.
“Fits in the window” and “well-attended” — why are these different claims?
What does automatic compaction do, and which part of the conversation does it tend to drop first?
A rule given only in the opening prompt stops being followed right after a compaction summary. Where should it have lived, and why?
Given a misbehaving long session, which three failure modes should you discriminate among before reaching for a fix?

Check your answers

The system prompt, tool definitions, CLAUDE.md, conversation history, and tool inputs/outputs all accumulate in one finite budget that never refills within a session.
The token limit is a capacity bound while attention is a quality that declines as the window fills — a conversation comfortably under the limit can still have lost the thread of an instruction given fifty turns ago (lost-in-the-middle).
Near the context limit it replaces older messages with a summary — it summarizes the oldest turns first, so specific instructions from early in the conversation are most at risk.
In CLAUDE.md, because its content is re-injected on every request and so survives compaction, unlike an opening-prompt rule whose survival depends on a summarizer’s discretion.
Accumulation pressure, lost-in-the-middle, and post-compaction loss — three distinct conditions with three distinct remedies.

Context is a finite, accumulating resource

Everything in a session shares one budget. “Context window is cumulative within a session. System prompt, tool definitions, CLAUDE.md, conversation history, tool inputs/outputs all accumulate.” [Official] How the agent loop works · AnthropicT1-official original And the budget is concrete: current windows are 1M tokens on Opus 4.8 and Sonnet 4.6 and 200k on Haiku 4.5 — though Opus 4.8’s tokenizer can consume up to 35% more tokens for the same text, so the same conversation costs more of the budget on one model than another. [Official] Models overview · AnthropicT1-official original

Degradation comes before overflow

The failure that matters is not hitting the limit — it is the quiet decline well before it. As a window fills, a model attends less reliably to material buried in the middle of a long context (the “lost-in-the-middle” effect), and any progressive summarization of earlier turns discards detail that may later turn out to matter. These degradation mechanisms — context rot, lost-in-the-middle, summarization loss — are the subject of the Agentic Systems Design book’s chapter on context rot, where they are established against the research; this chapter’s job is to make you recognize them on the exam.

Compaction: the automatic defense, and its cost

When a session approaches the limit, the loop defends itself: “Automatic compaction triggers near the context limit.” [Official] How the agent loop works · AnthropicT1-official original The defense is lossy by construction: “Compaction replaces older messages with a summary, so specific instructions from early in the conversation may not be preserved. Persistent rules belong in CLAUDE.md (loaded via settingSources) rather than in the initial prompt, because CLAUDE.md content is re-injected on every request.” [Official] How the agent loop works · AnthropicT1-official original Compaction buys room by trading away fidelity to the early conversation — exactly the region most at risk from lost-in-the-middle in the first place.

Where the depth lives

This chapter is the exam-angle surface; the design book owns the mechanism. The degradation research, the measurement of context rot, and the assembly strategies that fight it live in the Agentic Systems Design book — its chapter on context rot for the failure modes and its chapter on context assembly for the deliberate construction of what goes in the window. The exam-relevant skill is diagnostic: given a long-session scenario, name whether it is accumulation pressure, lost-in-the-middle, or post-compaction loss, and reach for the matching mitigation.

Diagnosing three long-session failures Worked example

The exam-relevant skill is to name the failure mode before reaching for a fix. Three scenarios, three different diagnoses:

Scenario A — the rule dropped after a summary. A constraint you gave at turn 1 stops being honored around turn 40, just as a compaction summary appears. Diagnosis: post-compaction loss — the oldest turns were summarized and the early rule didn’t survive intact. Fix: move it to CLAUDE.md, which is re-injected every request.
Scenario B — a buried fact misremembered, no compaction. The session is comfortably under the token limit and never compacted, yet Claude misremembers a detail established ~60 turns ago in a long stretch of context. Diagnosis: lost-in-the-middle — material buried in the middle of a long window is attended to least reliably. Fix: re-surface the fact near the end of the context (restate it), or assemble the working context deliberately rather than letting it sprawl.
Scenario C — the session simply fills up. Turns keep growing until the window is near full and the loop compacts (or you hit the limit). Diagnosis: accumulation pressure — nothing is degraded yet; the budget is just spent. Fix: reduce what accumulates (tighter tool outputs, /clear and restart with a focused prompt) rather than treating it as a quality bug.

The discipline: “fits,” “buried,” and “summarized away” are three distinct conditions with three distinct remedies. Reaching for a bigger window (Scenario C’s instinct) does nothing for A or B — a larger context still loses its middle and still compacts eventually. Name the mode first.

Practice

Exercise solutions

Solution ↑ Exercise

B. CLAUDE.md content is re-injected on every request, so a rule placed there is present in the context after compaction just as before it — exactly what a session-long constraint needs. The timing (failure right after a compaction summary) is the tell that the original instruction was summarized away. A works for a few turns but fights the symptom by hand and will fail again at the next compaction. C delays the limit but does not address degradation — a larger window still loses the middle, and a long enough session compacts anyway. D governs output length, not whether an early instruction is retained, so it is unrelated to the failure.

Solution ↑ Exercise

What accumulates: the system prompt, tool definitions, CLAUDE.md, the full conversation history (every user and assistant turn), and all tool inputs and outputs — everything shares one cumulative budget within the session. “Still fits” is not “well-attended” because the token limit is a capacity bound while attention is a quality that declines as the window fills: a model attends less reliably to material buried in the middle of a long context, so a conversation comfortably under the limit can still have effectively lost an instruction given fifty turns ago. Fitting is necessary but not sufficient for the model to be using all of it well.

Solution ↑ Exercise

When a session nears the context limit, automatic compaction triggers and replaces older messages with a summary to buy room. An instruction placed only in the opening prompt is at risk because compaction summarizes the oldest turns first, and a one-line rule from turn one rarely survives the summary intact — so the constraint silently stops being honored (the failure often shows up right after a summary appears). A durable rule belongs where it is re-injected on every request: CLAUDE.md (loaded via settingSources), whose content is re-added to context each request and so is present after compaction exactly as before it — its survival no longer depends on a summarizer’s discretion.

Exam essentials

Cumulative budget — system prompt, tool defs, CLAUDE.md, conversation history, and tool I/O all accumulate in one finite window (1M tokens on Opus 4.8 / Sonnet 4.6, 200k on Haiku 4.5; tokenizer density varies by model).
Degradation precedes overflow — lost-in-the-middle and summarization loss erode a long context before it hits the limit; “fits” is not “well-attended.” Depth lives in the design book’s context-rot chapter.
Compaction is lossy — it triggers near the limit and replaces older messages with a summary, so early-conversation specifics may not be preserved.
Durable rules belong in re-injected context — put session-long constraints in CLAUDE.md (re-injected every request), not the opening prompt, so compaction cannot strand them.

Part 5 Chapter 2 Last verified 2026-06-02 Fresh

Escalation and Ambiguity Resolution

When an agent is uncertain or blocked, the reliable architecture makes it surface the decision rather than guess at intent. AskUserQuestion is the structured mechanism, the interview pattern is its proactive form, and the check-in is a control point where a human resolves what the model cannot.

Volatility: architectural-pattern

Tools compared: claude-code

Reliability is not only about catching errors after the fact; it is about not committing to a wrong interpretation in the first place. When intent is ambiguous, a well-built agent asks rather than assumes. This chapter is the exam angle on that discipline — the mechanism and its limits — and points to the handbook for the hands-on use-side workflow. The principle is durable; the tool that carries it is the illustration, so this is an architectural pattern.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

When a task is genuinely ambiguous, what is the reliable move — and what is the cost of guessing instead?
What are the structural limits of one AskUserQuestion call (questions per call, options per question, header)?
AskUserQuestion is the tool Claude calls; what is canUseTool, and which two triggers fire it?
Beyond a plain approve/deny, name two of canUseTool’s richer response patterns.
Why can a subagent not escalate with AskUserQuestion, and how do you design the work around that?

Check your answers

Escalate — surface the decision rather than silently pick an interpretation; asking costs one round trip, guessing wrong costs the whole task built on the wrong branch.
One call carries 1–4 questions, each with 2–4 options (label + description) and a header of ≤12 characters.
canUseTool is the callback your application implements; it fires when Claude wants to use a tool (a permission check) and when Claude calls AskUserQuestion (a clarification).
Any two of: approve-with-changes (updatedInput), approve-and-remember (PermissionUpdate), suggest-alternative (deny with guidance), or redirect-entirely (inject a new instruction via streaming input).
Subagents cannot call AskUserQuestion, so they must guess or fail; have the coordinator resolve the open questions first, then hand the subagent a fully-specified task.

Escalate, don’t guess

The foundational move is to surface a decision the model cannot make on its own. “While working on a task, Claude sometimes needs to check in with users. It might need permission before deleting files, or need to ask which database to use for a new project. Your application needs to surface these requests to users so Claude can continue with their input.” [Official] Handle approvals and user input · AnthropicT1-official original An architect’s job is to make those check-ins possible and routine — to design the agent so that hitting an ambiguity raises a question instead of silently resolving it.

AskUserQuestion: structured clarification

The mechanism is a built-in tool with a deliberately bounded shape. Each AskUserQuestion call carries 1–4 questions, each question a header (≤12 characters) and 2–4 options with a label and a description; the response maps each question to the chosen label, and free-text is handled by offering an “Other” choice and passing the typed text rather than the literal "Other". [Official] Handle approvals and user input · AnthropicT1-official original The structure is the point: bounded multiple-choice makes the human’s answer fast to give and unambiguous to route back into the agent’s flow.

The application side: the `canUseTool` callback

AskUserQuestion is the tool Claude calls; canUseTool is the callback your application implements to receive these interruptions and answer them. It fires on two triggers: when Claude wants to use a tool (a permission check) and when Claude calls AskUserQuestion (a clarification). [Official] Handle approvals and user input · AnthropicT1-official original So one callback is the single surface through which your app both gates tools and answers questions; it returns PermissionResultAllow / PermissionResultDeny (Python) or { behavior: "allow" | "deny" } (TS). [Official] Handle approvals and user input · AnthropicT1-official original

And its response is far richer than yes/no — the docs document six patterns:

Two interactions are worth memorizing. The callback is skipped in dontAsk mode — anything not pre-approved is denied without ever calling it (it is the last step of the permission chain, and dontAsk short-circuits before reaching it). [Official] Handle approvals and user input · AnthropicT1-official original And for a human who is not watching the terminal, the PermissionRequest hook can fire an external notification (Slack, email, push) when Claude is waiting on approval. [Official] Handle approvals and user input · AnthropicT1-official original

The interview pattern: escalate before you start

Escalation is strongest when it is proactive — ask the questions before any work depends on the answers. That is the interview pattern from D3.5: Claude interviews you with AskUserQuestion, writes a spec, and a fresh session implements from it. [Official] Best practices for Claude Code · AnthropicT1-official original The natural home for this is plan mode, since “clarifying questions are especially common in plan mode, where Claude explores the codebase and asks questions before proposing a plan.” [Official] Handle approvals and user input · AnthropicT1-official original Front-loading the questions resolves ambiguity while it is still cheap — before a single edit is built on a guessed interpretation.

The check-in as a control point

A clarifying question is also a deliberate pause, and the SDK treats it as one: the canUseTool callback “can stay pending indefinitely” while a human decides, and for long delays the agent can return a "defer" decision that ends the query and resumes later from the persisted session. [Official] Handle approvals and user input · AnthropicT1-official original That makes escalation the upstream half of human-in-the-loop — the agent yields control at the moment of uncertainty, and the human resolves what the model could not (the downstream half, routing low-confidence output to review, is D5.5). The hands-on, use-side treatment of when and how to prompt for clarification is the handbook’s territory (its escalation-patterns chapter is forthcoming).

One callback, two triggers, three patterns Worked example

A coordinator wires a single canUseTool callback. Over one task it fields three interruptions, each resolved by a different pattern:

Claude wants to run Bash(rm -rf build/) — trigger 1, a permission check. The callback approves with changes: it rewrites updatedInput to a scoped rm -rf ./build/ and allows it. Claude sees the result and is not told the command was tightened.
Claude calls AskUserQuestion("Which database?", [Postgres, MySQL, SQLite]) — trigger 2, a clarification. The same callback surfaces it to the human (or fires a Slack notification via the PermissionRequest hook), stays pending while they decide, and returns the chosen label.
Claude wants to run Bash(curl … | sh) — trigger 1 again. The callback suggests an alternative: it denies with the message “pipe-to-shell is blocked; download, verify the checksum, then run.” Claude reads the guidance and adjusts its next step.

One callback handled a permission and a question, and steered rather than merely gated — editing one input, answering one question, redirecting one plan. Had the session been in dontAsk mode, none of these would have reached the callback at all: anything not pre-approved is denied without calling it. That is the whole design — AskUserQuestion raises; canUseTool resolves.

Practice

Exercise solutions

Solution ↑ Exercise

B. The database choice is a genuine intent decision the model cannot infer, so the reliable move is to surface it: AskUserQuestion with a few bounded options gets a fast, unambiguous answer and lets the agent continue on the right branch. A guesses — a reasonable default is still a coin flip on a decision with downstream lock-in (migrations, drivers, hosting). C infers intent from an accident of the environment; what is installed on a build machine is not a statement of what the project should use. D is the over-correction: failing throws away a recoverable situation that one bounded question would resolve. Escalation, not guessing and not giving up, is the pattern.

Solution ↑ Exercise

One AskUserQuestion call carries 1–4 questions; each question offers 2–4 options (each an option a label + description), plus a short header (≤12 characters) and a multiSelect flag. The response associates an answer with its question by mapping the question to the chosen option’s label — { "answers": { "<question text>": "<label>" } } — so the agent routes each selection back to the specific question it answers (a multiSelect answer returns as an array or a comma-joined string). Free-text is handled by offering an “Other” option and passing the typed text rather than the literal "Other".

Solution ↑ Exercise

A subagent cannot call AskUserQuestion — it is an explicit SDK limitation, and a subagent runs in an isolated context with no channel back to the user. So a subagent that discovers an ambiguous spec has no way to escalate: it must guess or fail, the precise outcome escalation exists to prevent. The fix is to restructure the decomposition so the part that needs a human stays with the agent that can reach one: have the coordinator resolve the open questions first (via AskUserQuestion, ideally during a plan-mode interview), then hand the subagent a fully-specified, unambiguous task. Delegate only after intent is pinned — never push an unresolved decision down to an agent that cannot ask about it.

Exam essentials

Escalate, don’t guess — design the agent to surface an ambiguous or blocked decision rather than silently pick an interpretation; the cost of asking is one round trip, the cost of guessing wrong is the whole task.
AskUserQuestion — 1–4 questions per call, 2–4 bounded options each (label + description, short header); the answer maps question to chosen label; free-text via an “Other” option. The bounded shape makes answers fast and routable.
canUseTool — the app-side callback — fires on two triggers (Claude wants a tool / Claude calls AskUserQuestion); six response patterns: approve, approve-with-changes (updatedInput), approve-and-remember (PermissionUpdate), reject, suggest-alternative, redirect-entirely. Skipped in dontAsk mode; the PermissionRequest hook sends external notifications while waiting.
Proactive beats reactive — the interview pattern and plan mode front-load clarifying questions, resolving ambiguity while it is still cheap, before work is built on a guess.
Subagents cannot escalate — AskUserQuestion is unavailable in subagents, so resolve open questions in the coordinator before delegating a fully-specified task.
Check-in as control point — a clarifying question is a deliberate pause; the callback can stay pending or defer-and-resume, making escalation the upstream half of human-in-the-loop (D5.5 is the downstream half).

Part 5 Chapter 3 Last verified 2026-06-02 Fresh

Error Propagation Across Multi-Agent Systems

In a chain of agents an error does not stay local — an upstream ambiguity becomes a downstream wrong decision, and concurrent faults compound into degradation no single component test reproduces. The defenses are structured error context across boundaries, independent validation, and circuit breakers.

Volatility: architectural-pattern

Tools compared: claude-code

A single agent fails visibly; a pipeline of agents fails by degrees. An error introduced at one stage rarely announces itself — it rides the handoff to the next stage as if it were sound input, and concurrent faults aggregate into a degradation that looks like nothing in particular. This chapter is about that propagation and the architecture that contains it. The pattern is durable; the percentages that illustrate it are evidence, so this is an architectural pattern.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Why is a pipeline of agents less reliable than its single most reliable agent?
Why does a mid-pipeline agent turn an upstream ambiguity into a silent wrong decision, where an interactive one would not?
The MAST taxonomy sorts multi-agent failures into which three categories — and which is the largest?
Overlapping production bugs degraded quality for weeks, yet every component eval stayed green. Why couldn’t the unit tests see it?
Name two boundary defenses that contain error propagation across agents.

Check your answers

Because a chain’s reliability is the product of its handoffs, not its best agent’s — each boundary is both a place an error can enter and a place an existing error passes through unexamined.
A mid-pipeline agent has no one to ask, so where an interactive agent would pause and clarify, it resolves the ambiguity itself and hands the guess downstream as settled fact.
Specification problems (41.77%) — the largest — then coordination failures (36.94%) and verification gaps (21.30%).
The degradation lived between components, not inside any one — each part passed its own eval, and the combined effect appeared only in the interaction, on traffic slices no single test exercises.
Any two of: structured error context across boundaries, independent validation by an isolated judge, and circuit breakers that isolate a misbehaving agent before it cascades.

Failures compound across agent boundaries

Multi-agent systems fail far more often than their individual agents do. One practitioner analysis, drawing on the MAST taxonomy of 1,600-plus execution traces, reports that “multi-agent LLM systems fail at rates between 41-86.7% in production because specification ambiguity and unstructured coordination protocols cause agents to misinterpret roles, duplicate work, and skip verification.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original That taxonomy sorts the breakdowns into three categories covering most of them: specification problems (41.77%), coordination failures (36.94%), and verification gaps (21.30%). [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original

An upstream ambiguity becomes a downstream wrong decision

The propagation mechanism is specific. “Agents cannot read between lines, infer context, or ask clarifying questions during execution. Every ambiguity becomes a decision point where agents explore all possible interpretations, selecting suboptimal ones.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original This is the multi-agent counterpart to D5.2’s escalation problem: an interactive agent can pause and ask, but a mid-pipeline agent usually has no one to ask, so it resolves the ambiguity — and a quietly-wrong resolution is handed downstream as if it were settled fact. The next agent has no signal that the input it received was a guess.

Compounding failures evade per-component testing

When multiple faults run at once, their aggregate is not the sum of their symptoms — and that is what makes them hard to catch. Anthropic’s April-23 postmortem describes three production bugs with distinct, partly-overlapping windows: a reasoning-effort default change (Mar 4–Apr 7), a caching bug that broke thinking blocks (Mar 26–Apr 10), and a verbosity-reduction prompt (Apr 16–Apr 20). [Official] An update on recent Claude Code quality reports · AnthropicT1-official original Those windows union to a roughly seven-week span (Mar 4–Apr 20) — but that is the aggregate reach, not a stretch in which all three ran at once: the first two overlapped, while the third began only after both had been fixed. Even so, the combined effect “looked like broad, inconsistent degradation” that no single bug’s symptom resembled, and the most stubborn one — the caching bug — crossed context management, the API layer, and the extended-thinking system. [Official] An update on recent Claude Code quality reports · AnthropicT1-official original Detection was the central lesson: the bugs hit different traffic slices, and neither internal usage nor the existing eval suite reproduced them. [Official] An update on recent Claude Code quality reports · AnthropicT1-official original

Why three overlapping bugs passed every unit test Worked example

The April-23 incident is the canonical compounding failure. Three independent bugs, three windows:

Bug	Window	Layer(s) it touched
Reasoning-effort default change	Mar 4 – Apr 7	inference defaults
Caching bug (broke thinking blocks)	Mar 26 – Apr 10	context mgmt × API × extended thinking
Verbosity-reduction prompt	Apr 16 – Apr 20	system prompt

Read the dates carefully: the first two overlapped (Mar 26 – Apr 7), but the third started after both were already fixed. So “seven weeks” is the union of the windows (Mar 4 – Apr 20), not a span of three-way concurrency — a distinction worth getting right, because it changes what a responder is actually hunting for (one persistent fault versus a shifting set).

Now the reason it hid: each bug, tested in isolation, passes. The reasoning-effort change is a valid config; the cache works on most paths; the verbosity prompt is well-formed. The degradation lived in the interaction and in which traffic slices each bug touched — so neither internal usage nor the existing eval suite reproduced it, and “broad, inconsistent degradation” was all the aggregate looked like. The postmortem’s remedy is integration-level: per-model evaluations on every prompt change, ablation testing, and soak periods before rollout — exercising the system as it actually runs, because the failure was never inside any one unit.

Defenses: structured context, independent validation, circuit breakers

The countermeasures all push against under-specified, unchecked boundaries. The practitioner remedies are to convert prose specs into machine-validatable schemas, to enforce typed and schema-validated messages between agents (with MCP named as the schema-enforced substrate), to deploy isolated judge agents for independent validation, and to add circuit breakers that isolate a misbehaving agent before it cascades. [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original The payoff is concrete: “PwC achieved a 7x accuracy improvement (10% to 70%) through structured validation loops.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original

The single-agent equivalent is the postmortem’s own remedy — broad per-model evaluations on every prompt change, ablation testing, and soak periods before rollout [Official] An update on recent Claude Code quality reports · AnthropicT1-official original — the same instinct as independent validation, applied to a pipeline of one.

Practice

Exercise solutions

Solution ↑ Exercise

B. The failure is a boundary failure: an ambiguous handoff that no stage is positioned to catch. Fixing the boundary — a typed, schema-validated spec so there is less to misinterpret, plus an independent validation step that checks the coder’s output against that spec — intercepts the wrong interpretation before it reaches the reviewer. A makes one agent smarter but leaves the ambiguous interface intact; a better coder still has to guess at an under-specified spec. C asks the reviewer to work harder while still reading only the code, blind to the spec it was meant to satisfy. D doubles cost and gives two artifacts to compare with no oracle for which is right — a wrong-but-consistent interpretation reproduces on the retry.

Solution ↑ Exercise

The three MAST categories are specification problems (41.77%), coordination failures (36.94%), and verification gaps (21.30%); specification problems are the largest. Specification problems propagate especially badly in a pipeline because a mid-pipeline agent cannot pause to ask — an under-specified handoff becomes a guess the agent resolves and passes downstream as settled fact, so a single ambiguity at the top seeds a wrong decision that every later stage treats as valid input.

Solution ↑ Exercise

A compounding cross-system failure can pass every per-component evaluation because it lives between components, not inside any one: each part passes its own eval in isolation, and the degradation emerges only from their interaction, on traffic slices no single test exercises. In the April-23 case three bugs with overlapping windows (the union running Mar 4 – Apr 20) produced “broad, inconsistent degradation” that no single bug’s symptom resembled, and neither internal usage nor the existing eval suite reproduced it. The postmortem concluded that integration-level testing was needed instead — broad per-model evaluations on every prompt change, ablation testing, and soak periods before rollout — testing the system as it actually runs rather than each unit alone.

Exam essentials

Failures compound across boundaries — multi-agent systems fail at much higher rates than their parts (the MAST taxonomy: specification 41.77%, coordination 36.94%, verification 21.30%); a chain’s reliability is the product of its handoffs, not its best agent.
Ambiguity propagates silently — a mid-pipeline agent cannot pause to ask (unlike D5.2’s interactive escalation), so it resolves an ambiguity and passes the guess downstream as settled fact.
Compounding failures evade unit tests — they live between components, on traffic slices no single test exercises; all-green component evals do not prove a multi-stage system healthy.
Defenses — structured error context across boundaries (D2.2), independent validation / isolated judges (D4.6), circuit breakers to stop cascades, and keeping the escalable decision at the coordinator (D5.2); the single-agent analog is broad evals + ablation + soak periods.

Part 5 Chapter 4 Last verified 2026-06-02 Fresh

Large-Codebase Context: Compaction, Scratchpads, Delegation

A large codebase has more relevant files than any window holds, and reading them all is the trap, not the solution. The three levers that extend the horizon are compaction, scratchpads that externalize state to disk, and subagent delegation that pays exploration cost in a separate context.

Volatility: architectural-pattern

Tools compared: claude-code

A large codebase has more relevant code than any context window can hold, so the question is never “how do I fit it all in” but “how do I keep the right slice in and the rest out.” This chapter is the exam-angle inventory of the levers; the design book owns the at-scale mechanics. It is an architectural pattern — the levers are stable, the commands and hooks that drive them are the surface.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Why is “read every possibly-relevant file into the main session” a failure mode rather than caution?
Name the three knobs that customize or trigger compaction, and which one steers what content survives.
/compact and /clear both free context — what is the difference, and when do you reach for each?
A subagent explores twenty files; what does the main agent receive back, and where was the context cost paid?
You wrote a PLAN.md scratchpad, then ran /compact, then later /clear. Is the plan still available? Why?

Check your answers

Because context is cumulative and finite, reading everything “to be safe” fills the window with material that crowds out the work and degrades attention — the goal is to load the task’s slice, not the codebase.
A CLAUDE.md “Summary instructions” section, the PreCompact hook, and manual /compact — the CLAUDE.md “Summary instructions” section steers what content survives the summary.
/compact condenses the conversation in place so you continue the same task; /clear starts a fresh conversation to switch to an unrelated one — the decision rule is continuity (the previous session stays available via /resume).
The main agent receives only a summary — the three-line answer, not the twenty files — because the exploration cost was paid in the subagent’s separate context window and discarded with it.
Yes — a scratchpad on disk survives both: compaction only summarizes the window and /clear only wipes the window, so the agent re-reads PLAN.md exactly as it left it.

The codebase does not fit, and reading it all is the trap

Context is cumulative and finite (D5.1): conversation history and every tool input and output accumulate in the window over a session. [Official] How the agent loop works · AnthropicT1-official original A large codebase has far more relevant files than that window holds, and the naive response — have the agent read everything “to be safe” — is itself the failure: it fills the window with material that crowds out the work and degrades the model’s attention to what matters.

Compaction extends a long session

When a working session approaches the limit, compaction reclaims room by summarizing older turns, and it is both automatic and steerable. Automatic compaction triggers near the context limit, [Official] How the agent loop works · AnthropicT1-official original and three knobs customize it: a “Summary instructions” section in CLAUDE.md that tells the compactor what to preserve, the PreCompact hook that runs before compaction, and a manual /compact sent on demand. [Official] How the agent loop works · AnthropicT1-official original

Compaction is lossy (D5.1), so the same caution applies: durable rules belong in re-injected CLAUDE.md, not in turns the summary may discard.

`/compact` vs `/clear`, and the scratchpad beneath both

Two commands free context, and confusing them wastes work. /compact [instructions] “frees context by summarizing” — it condenses the conversation in place, so you keep going on the same task with a shorter history, and the optional instructions focus what the summary keeps. [Official] Commands · AnthropicT1-official original /clear instead starts a fresh conversation — it discards the working context entirely, with the previous one still available via /resume (aliases /reset, /new). [Official] Commands · AnthropicT1-official original

The decision rule is continuity. Reach for /compact to continue a task whose history has grown long but is still relevant. Reach for /clear to switch to an unrelated task — or when a session is cluttered with failed approaches, the D3.5 rule that after more than two corrections on the same issue, “a clean session with a better prompt almost always outperforms a long session with accumulated corrections.” [Official] Best practices for Claude Code · AnthropicT1-official original Compaction keeps a lossy summary; clearing keeps nothing in the window at all.

Delegation pays exploration cost in another window

The lever that matters most for breadth is delegation. “Since context is your fundamental constraint, subagents are one of the most powerful tools available. When Claude researches a codebase it reads lots of files, all of which consume your context. Subagents run in separate context windows and report back summaries.” [Official] Best practices for Claude Code · AnthropicT1-official original A subagent can read the twenty files that answer “where is auth enforced,” and the main agent receives the three-line answer rather than the twenty files — the exploration cost is paid in the child’s window and discarded with it. Scratchpads are the complementary move: state written to a file (D1.7) lives on disk, not in the window, and the main agent reads it back only when needed.

Four levers across one long task Worked example

A refactor spanning a 5,000-file service, in one sitting:

Delegate the exploration. Instead of reading auth’s twenty files into the main window, dispatch a subagent — “find every place auth is enforced; report the files and the enforcement pattern.” It reads twenty files in its window and returns a three-line summary; the main context never absorbed the bulk.
Externalize the plan. Write the refactor plan to PLAN.md — a scratchpad on disk, not in the window. State now lives somewhere compaction and clearing cannot reach.
Condense, don’t reset, mid-task. Halfway through, the window nears the limit. Run /compact (optionally focused: “keep the auth findings and the plan”). The history condenses, you continue the same refactor, and PLAN.md is untouched on disk.
Reset for the unrelated bug. A production bug interrupts — unrelated to the refactor. Run /clear for a fresh window (the refactor session stays in /resume), fix the bug, then return and re-read PLAN.md to resume the refactor exactly where you left it.

The lesson is matching the lever to the move: delegation for breadth, a scratchpad for state that must outlive the window, /compact to continue with less, /clear to switch cleanly. Confusing the two commands is the common error — compacting when you meant to switch drags stale context forward; clearing when you meant to continue throws away the thread (and only /resume can recover it).

Where the depth lives

This chapter is the exam-angle inventory; the design book owns the at-scale mechanics. The engineering of context at codebase scale — retrieval, the discipline of what to assemble into a window, and the cost trade-offs of fan-out — lives in the Agentic Systems Design book’s chapters on the environment at scale and context assembly. The exam-relevant skill is selecting the lever: compaction for a long single session, delegation for breadth of exploration, scratchpads for state that must outlive a compaction.

Practice

Exercise solutions

Solution ↑ Exercise

B. Delegation pays the exploration cost in the subagents’ separate context windows and returns only summaries, so the main session learns where authentication flows without absorbing dozens of files — exactly the “subagents run in separate context windows and report back summaries” pattern. A is the trap this chapter names: reading everything into the main window fills it with material that crowds out the actual change and dilutes attention. C buys a bigger budget but still spends it on noise, and a larger window degrades on irrelevant bulk just the same. D fights symptoms — compacting mid-exploration repeatedly summarizes away the very findings you are gathering, and is no substitute for never loading the bulk into the main window in the first place.

Solution ↑ Exercise

The three are (1) a CLAUDE.md “Summary instructions” section, (2) the PreCompact hook, and (3) manual /compact. The one that steers what content survives the summary is the CLAUDE.md “Summary instructions” section — the compactor reads CLAUDE.md like any other context, so a section describing what to preserve directs what the summary keeps. The PreCompact hook runs before compaction (e.g. to archive the full transcript) and manual /compact controls when it happens, not what survives — though /compact’s optional focus instructions also nudge the summary’s content.

Solution ↑ Exercise

Dispatching a subagent protects the main agent’s context because the subagent runs in its own separate context window: it reads the files needed to answer the question there, spending that exploration cost against its own budget, and that window is discarded when it returns. The main agent receives back only a summary — the three-line answer (“auth is enforced in middleware X via pattern Y”), not the twenty files the subagent read — so the main context learns the conclusion without absorbing the bulk that produced it. Exploration cost is paid in the child window and thrown away with it.

Exam essentials

Reading it all is the trap — a large codebase exceeds any window; loading everything “to be safe” fills the context with noise and degrades attention. Load the task’s slice, keep the rest out.
Compaction — triggers automatically near the limit and is steerable three ways: a CLAUDE.md “Summary instructions” section (steers what survives), the PreCompact hook, and manual /compact.
/compact vs /clear — /compact condenses the conversation to continue the same task (keeps a lossy summary); /clear starts a fresh conversation for an unrelated task or after >2 failed corrections (previous in /resume; aliases /reset, /new). A disk scratchpad (D1.7) survives both — neither command touches a file.
Delegation — subagents read files in their own context windows and report back summaries, so exploration cost is paid in the child window, not the main one; scratchpads (D1.7) park state on disk.
Pick the lever — compaction for a long single session, delegation for breadth of exploration, scratchpads for state that must outlive a compaction; depth lives in the design book’s at-scale and context-assembly chapters.

Part 5 Chapter 5 Last verified 2026-06-02 Fresh

Human Review and Confidence Calibration

Not every output earns automatic trust. The architect calibrates which results proceed and which route to a human, using checkable confidence signals and a tiered funnel — cheap auto-checks, then an isolated judge, then a person — so the human sees only the decisions where their judgment changes the outcome.

Volatility: architectural-pattern

Tools compared: claude-code

D4.4 closed the validation loop with the model — detect a semantic error, feed it back, retry. This chapter closes the other loop: when automation is not enough, route to a human. The architect’s job is calibration — deciding, per output, whether to trust it, verify it automatically, or escalate it to a person. The routing-and-funnel pattern is durable; the specific fields are illustration, so this is an architectural pattern.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

What makes the decision to auto-accept vs route-to-human an economic one rather than a quality slogan?
Why is a model’s self-reported “high confidence” not a signal you can route on directly?
What does it mean for a confidence signal to be calibrated in the measurement sense, and how would you check it?
Describe the three tiers of the review funnel and what each one escalates.
What two factors set where the human-review threshold falls?

Check your answers

Because for each output you weigh the cost of a wrong auto-accept against the cost of a human glance — cheap and reversible automates, expensive or irreversible routes to a person; “review everything” and “trust everything” are both failures to calibrate.
Self-reported confidence is a claim, not a measurement — a confidently-wrong output reports high confidence too, so route on checkable signals (e.g. calculated_total ≠ stated_total, conflict_detected) instead.
Calibrated means the stated value tracks actual accuracy — “90% confident” outputs are right about 90% of the time; check it by measuring, over real labeled data, the accuracy at each stated-confidence level.
Auto-check → isolated judge → human: cheap automated checks handle the obvious cases, the isolated judge (fresh context, no authorship bias) catches the wrong-but-plausible ones, and each tier escalates only what it cannot resolve, so the human sees only what survives both.
Stakes × uncertainty — low-stakes, high-confidence proceeds automatically; high-stakes, low-confidence goes to a person; the middle is where the judge tier earns its keep.

Not every output earns automatic trust

The cheapest reliability move is to give the model a way to check itself: “Include tests, screenshots, or expected outputs so Claude can check itself. This is the single highest-leverage thing you can do.” [Official] Best practices for Claude Code · AnthropicT1-official original But some judgments cannot be auto-verified — a wrong-but-plausible extraction, a borderline classification, a high-stakes decision with no ground truth to diff against. Those are where a human belongs. Confidence calibration is the discipline of deciding, output by output, which path each one takes.

Confidence as a routing signal

To route, you need a signal you can act on — and the reliable signals are checkable, not self-reported. The schema hooks from D4.4 double as confidence signals. (These are a design pattern this book recommends, not a built-in platform field — you add them to your schema.) When a model’s calculated_total disagrees with the document’s stated_total, both demand human review; when a conflict_detected flag is true, the record routes to a person; and a structured confidence field (high / medium / low) on an extraction gives the caller an explicit value to threshold on. Each is a place where the system can say “I am not sure” in a form a router can read.

Two senses of “calibration”

The word is doing double duty in this chapter, and the distinction is worth making sharp. There is the routing calibration the chapter is built on — which output goes to which tier — and there is measurement calibration: whether a confidence value actually tracks accuracy. A model is well-calibrated, in the measurement sense, only if its “90% confident” outputs are right about 90% of the time. Self-reported confidence usually fails this test — models tend to be over-confident, reporting high certainty on answers that are wrong — which is precisely why a raw “high” cannot gate the human queue on its own.

Calibrating a confidence signal against reality Worked example

An extraction pipeline emits a confidence field (high / medium / low). Before trusting it to route, you measure it on a labeled sample of 1,000 past extractions:

Stated confidence	Count	Actually correct
high	700	94%
medium	220	71%
low	80	38%

(Illustrative numbers — the method is the point.) Two readings follow. First, the signal is informative: accuracy falls monotonically from high to low, so it does carry real information about correctness. Second, it is not perfectly calibrated: “high” is 94%, not ~100%, so roughly six in a hundred high-confidence extractions are wrong — and on a high-stakes clinical field that residual is unacceptable. The routing decision now follows from the numbers, not the label: auto-accept high only if a ~6% error rate is tolerable for this field; otherwise send even high through the isolated judge, and route medium/low to a human. Had you trusted the word “high” as if it meant “certain,” you would have shipped that 6% silently.

The discipline: a confidence signal earns its routing role by measurement, not by its name — and you re-measure when the model, the prompt, or the input distribution shifts, because calibration is not permanent.

Independent validation before the human

Between auto-accept and the human sits an automated reviewer tier: an isolated judge. Independent validation — “deploy isolated judge agents” — is one of the practitioner-recommended defenses, and the gains are real: “PwC achieved a 7x accuracy improvement (10% to 70%) through structured validation loops.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original The judge works for the same reason the D4.6 reviewer does: a fresh, isolated context has no authorship bias toward the output it is checking. Its job is to filter — resolve what it can, escalate only what it cannot — so the human queue stays small and high-value.

Calibrating the threshold

Where the human-review line falls is the calibration, and it is set by stakes times uncertainty. A low-stakes, high-confidence output proceeds automatically; a high-stakes, low-confidence one goes to a person; the middle is where the judge tier earns its keep. This makes human review the downstream half of human-in-the-loop, with escalation (D5.2) as the upstream half — the agent asks before acting when intent is unclear, and a human checks after producing when confidence is low. Calibrate the thresholds so a reviewer sees the few outputs where their judgment changes the outcome, and nothing it does not.

Practice

Exercise solutions

Solution ↑ Exercise

B. The design calibrates: checkable signals (confidence, conflict_detected) select which records are uncertain, an isolated judge clears the merely-plausible ones, and only what the judge cannot resolve reaches a human — so high-stakes errors are caught without reviewing everything. A retries with the same model, and a confidently-wrong extraction reproduces on the second run, so agreement is not evidence of correctness. C trusts self-reported confidence, which a confidently-wrong output also reports as “high” — the exact trap. D is safe but uncalibrated: it spends the most expensive reviewer on every record, most of which need no human, and does not scale.

Solution ↑ Exercise

Three routable signals: a cross-check mismatch (calculated_total ≠ stated_total), a self-flagged conflict_detected: true, and a low stated confidence (a failed provenance check — a cited span absent from the source — is a fourth). A model’s self-reported “high confidence” is not reliable on its own because it is a claim, not a measurement: a confidently-wrong output reports high confidence too, and models tend to be over-confident, so stated confidence often fails to track actual accuracy. The signals worth routing on are the checkable ones — a cross-check either matches or it does not, independent of what the model believes — whereas self-reported confidence must first be empirically calibrated against observed accuracy before it can gate anything.

Solution ↑ Exercise

The funnel is auto-check → isolated judge → human. Tier 1 (cheap automated checks) handles the obvious cases — a cross-check mismatch or a thresholded signal — and escalates anything it cannot clear. Tier 2 (an isolated judge agent) reviews the merely-plausible cases in a fresh, independent context, resolving what it can and escalating only what it cannot. Tier 3 (the human) sees only what survived both. The isolated judge catches errors the cheap auto-checks miss because those errors are wrong-but-plausible — they pass the mechanical checks (valid shape, no flagged conflict) yet are semantically wrong, and a fresh-context reviewer with no authorship bias can judge correctness where a regex or equality test cannot. Each tier escalates only what it cannot resolve, so the most expensive reviewer spends attention only on the decisions that truly need human judgment.

Exam essentials

Calibrate, don’t blanket-trust or blanket-review — decide per output by weighing the cost of a wrong auto-accept against a human glance; verification is the highest-leverage habit where it is possible.
Route on checkable signals — cross-check mismatches (calculated_total ≠ stated_total), self-flagged conflict_detected, low confidence, and failed provenance are routable signals; self-reported confidence alone is a claim, not a measurement. Two senses of calibration: routing (which output to which tier) and measurement (does “90% confident” mean 90% correct?). Empirically calibrate a confidence field — accuracy per stated level — or prefer checkable signals that need no calibration.
Tiered funnel — cheap auto-checks → isolated judge (fresh context, no authorship bias) → human; each tier escalates only what it cannot resolve, keeping the human queue small (structured validation loops drove a documented 7× accuracy gain).
Threshold by stakes × uncertainty — high-stakes + low-confidence routes to a human; human review is the downstream half of human-in-the-loop, escalation (D5.2) the upstream half.

Part 5 Chapter 6 Last verified 2026-06-08 Fresh

Information Provenance: Citations and Temporal Validity

An architect tracks where each claim came from and when its data is valid. The native Citations API ties quoted text to real document spans so a source cannot be fabricated; the provenance triple is the schema-friendly fallback; and a model's knowledge cutoff bounds what it can be trusted to know without a dated source.

Volatility: feature-surface

Tools compared: claude-code

The book closes on the question that underlies trust in any agent output: where did this claim come from, and is it still true? Provenance answers the first — a claim mapped to its source is auditable, one without a source is a trust-me. Temporal validity answers the second. This chapter is the exam-angle treatment; the named features — the Citations API surface, the location modes — are the moving parts, so it is a feature surface.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

What does the Citations API actually guarantee about a cited quote — and what does it not guarantee (e.g. a JSON grammar)?
Name the three citation location modes by document type.
You need structured JSON and per-claim attribution. Why does combining Citations with structured outputs fail, and what is the fallback?
Is a model’s reliable knowledge cutoff earlier or later than its training-data cutoff — and why?
Within one request, on how many of the documents must citations be enabled?

Check your answers

It guarantees the cited text is tied to an actual span in the source document — the model cannot fabricate a citation to text that is not there; it is span-bound, not grammar-constrained, so it guarantees nothing about output shape.
Plain text → char_location, PDF → page_location, custom content → content_block_location.
Cited text must interleave with the response prose, which a strict JSON schema forbids, so the API returns a 400; the fallback is the provenance triple (document_id, span_quote, confidence) verified caller-side.
Earlier (or equal) — data near the training cutoff is sparse, so the model’s reliable knowledge stops before its training does.
All or none — you cannot mix cited and uncited documents within a request.

Provenance maps every claim to its source

The point of provenance is verifiability. “Claude is capable of providing detailed citations when answering questions about documents, helping you track and verify information sources in responses. All active models support citations, with the exception of Haiku 3.” [Official] Citations · AnthropicT1-official original You enable it per document with citations: {"enabled": true} on the document block, and each cited claim in the response carries a sibling citations array pointing back to the exact span of the source it came from. [Official] Citations · AnthropicT1-official original One enablement rule to memorize: citations must be enabled on all or none of the documents within a request — you cannot mix cited and uncited documents. [Official] Citations · AnthropicT1-official original

The Citations API and its location modes

How a citation points at its source depends on the document type, and there are three modes. Plain text is chunked to sentences and cited by char_location; a PDF is cited by page_location; custom content, where you supply the chunks, is cited by content_block_location. [Official] Citations · AnthropicT1-official original The feature is also output-cheap: “the cited_text field is provided for convenience and does not count towards output tokens.” [Official] Citations · AnthropicT1-official original

The provenance triple: schema-friendly fallback

When the output must be structured JSON, the native Citations API is off the table, so you encode provenance into the schema yourself. This is the D4.4 hook applied to attribution — a design pattern this book recommends, not a platform feature: each extracted claim carries a source object with a document_id, a span_quote, and a confidence, and the caller verifies that span_quote actually appears in document_id. If it does not, the model fabricated the citation. It is a manual, checkable provenance you can drop inside any schema.

Temporal provenance: knowing when data is valid

Provenance is not only where a claim came from but when it can be trusted. Each model has a reliable knowledge cutoff — Opus 4.8 at January 2026, Sonnet 4.6 at August 2025, Haiku 4.5 at February 2025 — and that reliable cutoff is earlier than (or equal to) the model’s training-data cutoff, not later: Sonnet 4.6 trained on data through January 2026 but is reliable only to August 2025, and Haiku 4.5 trained through July 2025 but is reliable to February 2025. [Official] Models overview · AnthropicT1-official original Data near the training cutoff is sparse, so the model’s reliable knowledge stops before its training does. Past the reliable cutoff the model has no dependable knowledge, so a time-sensitive fact must come from a dated source supplied at request time (retrieval with a citation), not from the model’s memory. The use-side workflow for recording claim sources and decision dates is the handbook’s territory (its provenance and ADR material is forthcoming).

Provenance on both axes: where and when Worked example

A RAG assistant on Sonnet 4.6 (reliable knowledge cutoff August 2025) is asked: “What did the Q4 2025 earnings report say about revenue?” Both provenance axes are in play:

When — temporal validity first. Q4 2025 is after the model’s reliable cutoff (August 2025), so the model has no dependable knowledge of it — answering from memory risks a confident fabrication. The fact must come from a dated source supplied at request time, not the model’s weights. (Note it is irrelevant that Sonnet 4.6 trained through January 2026: the reliable cutoff is the earlier date, and it is what bounds trust.)
Where — bind the answer to the source. Supply the earnings PDF as a document with citations: {"enabled": true}. Because it is a PDF, the response cites by page_location, and each revenue claim carries the exact page span — auditable, not “trust me.” The cited_text echoes the source span back without costing output tokens.
If the pipeline also needs structured JSON, you hit the wall: Citations + output_config.format returns a 400. Fall to the provenance triple — emit {claim, source: {document_id, span_quote, confidence}} per fact and verify each span_quote appears in the cited PDF caller-side.

The closing synthesis of the book: a trustworthy claim needs where (a source span, via Citations or the triple) and when (a dated source, because the reliable cutoff — earlier than training — bounds what the model knows). Retrieval supplies the dated source; a citation binds the answer to it; verification proves the binding is real.

Practice

Exercise solutions

Solution ↑ Exercise

B. Citations and Structured Outputs are mutually exclusive — the 400 is the API telling you so — so when the structured shape is required, you encode provenance into the schema with a triple (document_id, span_quote, confidence) and verify each span against its source caller-side. A abandons the structured output the pipeline requires, trading one requirement for the other. C is the fabrication trap: an unverified span_quote may quote text that is not in the document, which is exactly the failure provenance exists to catch. D doubles cost and leaves you reconciling two responses with no guarantee the cited run and the schema run extracted the same facts.

Solution ↑ Exercise

The three modes are plain text → char_location (sentence-chunked; start/end character indices), PDF → page_location (start/end page numbers; scanned images without extractable text are not citable), and custom content → content_block_location (you supply the chunks; start/end block indices). cited_text is attractive on output cost because the field “is provided for convenience and does not count towards output tokens” — you get the quoted source span echoed back for verification without paying output tokens for it.

Solution ↑ Exercise

A question about an event after the model’s reliable knowledge cutoff should be answered from a supplied dated source because past that cutoff the model has no dependable knowledge — it may produce a plausible but fabricated answer. Crucially, the reliable cutoff is earlier than the training-data cutoff, so even data the model technically trained on near the boundary is unreliable; the earlier date is the one that bounds trust. Supplying the fact as a dated source at request time (retrieval plus a citation) makes the answer both correct and auditable. That connects to provenance broadly: provenance answers two questions — where a claim came from (a source span, via Citations or the triple) and when it is valid (a dated source past the cutoff). A time-sensitive claim needs both: an external dated source, bound to the answer by a verifiable citation.

Exam essentials

Provenance is verifiability — the Citations API (enable per document with citations: {"enabled": true}) ties each claim to a source span so a citation cannot be fabricated (span-bound, not grammar-constrained); cited_text does not count toward output tokens. Citations must be enabled on all or none of a request’s documents.
Three location modes — plain text → char_location, PDF → page_location, custom content → content_block_location; image citations are not yet supported.
Mutual exclusion — Citations + Structured Outputs return 400; when you need both a schema and provenance, use the provenance triple (document_id + span_quote + confidence) and verify the span caller-side.
Temporal provenance — each model has a reliable knowledge cutoff (Opus 4.8 Jan 2026, Sonnet 4.6 Aug 2025, Haiku 4.5 Feb 2025), which is earlier than (or equal to) the training-data cutoff — the model trains on later data but is only reliable to the earlier date (Sonnet 4.6 trained to Jan 2026, reliable to Aug 2025). Past the reliable cutoff, answer time-sensitive questions from a dated source, not the model’s memory.

Part 5 · D5 Review

6 exercises across 6 chapters — interleaved review.

d5-01-long-conversation-context

d5-01-ex-durable-instruction Early in a long session you tell Claude "always cite a file:line for any behavioral claim." Forty turns later it stops doing so, right around when a compaction summary appeared. What is the most reliable fix? - **A.** Re-paste the instruction manually every few turns to keep it fresh in context. - **B.** Move the instruction into CLAUDE.md, which is re-injected on every request and so survives compaction. - **C.** Switch to a model with a larger context window so compaction never triggers. - **D.** Raise `max_tokens` so the model has more room to comply.

d5-02-escalation-ambiguity

d5-02-ex-ambiguous-db An agent is scaffolding a new service and reaches a step that needs a database, but the request never said which one. What is the most reliable design? - **A.** Pick the most common default (say, PostgreSQL) and proceed, noting the choice in a comment. - **B.** Call `AskUserQuestion` with a short set of bounded options (Postgres / MySQL / SQLite) and continue once the user chooses. - **C.** Infer the database from whatever is already installed on the build machine. - **D.** Fail the task with an error explaining that the requirement was underspecified.

d5-03-error-propagation

d5-03-ex-silent-propagation In a planner → coder → reviewer pipeline, the planner occasionally emits an ambiguous spec, the coder silently picks a wrong interpretation, and the reviewer — reading only the code — rates it fine. Bad output ships. What is the most effective fix? - **A.** Upgrade the coder to a more capable model so it interprets ambiguous specs correctly. - **B.** Make the planner emit a typed, schema-validated spec and add an independent validation step that checks the coder's output against that spec before the reviewer runs. - **C.** Lengthen the reviewer's prompt to tell it to look harder for problems. - **D.** Retry the whole pipeline a second time and compare the two outputs.

d5-04-large-codebase-context

d5-04-ex-explore-large An agent must understand how authentication flows through a 5,000-file codebase before making a change, and the relevant code is spread across dozens of files. What keeps the main session's context usable? - **A.** Read every file that might be relevant into the main session so nothing is missed. - **B.** Dispatch subagents to explore subsystems and report back summaries, keeping the main context scoped to the change itself. - **C.** Raise `max_tokens` so the main session can hold more of the codebase. - **D.** Run `/compact` repeatedly while reading files so the window never fills.

d5-05-human-review-confidence

d5-05-ex-route-to-human A clinical-data extraction pipeline auto-accepts every result. Most are fine, but occasionally a high-stakes field is extracted wrong with no flag, and it reaches a patient record. You want to catch these without manually reviewing all output. What is the best design? - **A.** Retry every extraction with the model a second time and accept it if the two runs agree. - **B.** Have the model emit a `confidence` field and cross-check signals (e.g. `conflict_detected`), route low-confidence and flagged records through an isolated judge, and send what the judge cannot clear to a human. - **C.** Trust the model's self-reported confidence and auto-accept anything it marks "high." - **D.** Send every extracted record to a human reviewer to be safe.

d5-06-information-provenance

d5-06-ex-provenance-with-schema You are building an extraction pipeline that must return structured JSON *and* attribute each extracted fact to a source span. You try the Citations API with `output_config.format` and get a 400. What is the right design? - **A.** Drop the JSON schema and use the Citations API alone, parsing the prose response downstream. - **B.** Keep the schema and add a provenance triple per claim — `document_id`, `span_quote`, `confidence` — then have the caller verify each `span_quote` appears in its source. - **C.** Trust the model's cited text without verification, since it was instructed to quote the source. - **D.** Send the request twice, once with citations and once with the schema, and merge the two responses.

What an agent is

The loop is a control structure

A turn is one tool-use round-trip

Handling tool results: errors and parallel calls

stop_reason is the branch

Termination is the architect’s safety contract

Where Claude Code’s loop sits

Practice

Exercise solutions

Exam essentials

Further reading

Why run more than one agent

The orchestrator and its workers

Isolated context is the whole point

When the pattern earns its cost

Decompose by context, not by role

The verification subagent

Practice

Exercise solutions

Exam essentials

Further reading

The Agent tool is the invocation surface

AgentDefinition is the subagent’s contract

Enabling invocation: Agent in allowedTools

The prompt string is the only channel in

What crosses back: the return channel

Invocation paths and the one-level limit

Practice

Exercise solutions

Exam essentials

Two places to enforce a workflow

Every step boundary is a handoff

The Writer/Reviewer handoff that works

The handoff contract and its artifact

The validation gate: reject before propagating

Choosing where the control flow lives

Practice

Exercise solutions

Exam essentials

Hooks intercept the loop at named events

The events you must recognize

PreToolUse gates; PostToolUse normalizes

Precedence: deny beats defer beats ask beats allow

Hooks and subagents

Practice

Exercise solutions

Exam essentials

Two shapes of a decomposed task

When the path can’t be hardcoded → adaptive

When the path is predictable → pipeline

The cost you are choosing

The failure mode at each extreme

Choosing the structure

Practice

Exercise solutions

Exam essentials

A session is the persisted conversation

continue, resume, and fork

Fork branches the conversation, not the filesystem

Resume to recover — and the encoded-cwd trap

Scratchpads: durable state beyond the session

Practice

Exercise solutions

Exam essentials

Further reading

The description is the highest-leverage surface

Show correct usage with input_examples

Consolidate operations to reduce selection ambiguity

Namespace tool names by service

Return only high-signal information

The structural floor: an object input schema

Practice

Exercise solutions

Exam essentials

Two regimes, two spellings

is_error: true is the canonical failure signal (Messages API)

Write instructive error messages

Execution vs protocol errors: the MCP channel split

Retryability is documented behavior, not a parameter

Prevent the error: strict: true

Enabling invocation: `Agent` in `allowedTools`

`PreToolUse` gates; `PostToolUse` normalizes

Precedence: `deny` beats `defer` beats `ask` beats `allow`

`is_error: true` is the canonical failure signal (Messages API)

Prevent the error: `strict: true`

The four `tool_choice` modes

Guarantee a schema-valid call: `any` + `strict`

Distribution: scope the surface with `allowedTools`

Three scopes — and `claude mcp add --scope`