Agentic Loops: stop_reason and Tool-Result Handling
The first chapter of Domain 1 and the substrate the rest of the book assumes — the agent loop as a control structure whose branch condition is stop_reason. Teaches the tool-use round-trip from first principles, the turn model, error and parallel tool-result handling, termination budgets, and every stop_reason value an architect must recognize.
Domain 1 is the largest slice of the exam (27%), and this chapter is its floor: every later topic — subagents, workflows, hooks, session state — is a variation on the loop defined here. The exam tests whether you can read the loop’s control flow, not whether you can recite an SDK signature. We build the loop from the definition of an agent, trace it end to end with a worked example, name the stop_reason values that branch it, handle the error and parallel result cases, and fix the termination contract an architect owns.
What an agent is
Start with the definition, because the loop falls out of it. Anthropic’s Building Effective Agents draws the line that organizes this entire domain: “Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original A workflow runs on rails you laid down in code; an agent decides its own next step at runtime.
Operationally, that autonomy has a simple shape: “Agents are typically just LLMs using tools based on environmental feedback in a loop.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original The model proposes an action, your code runs it, the result of running it becomes the model’s next input, and the model decides again. That feedback cycle — act, observe, decide — is the agent. Everything else in Domain 1 is a way of shaping, bounding, or distributing it.
The loop is a control structure
Because the agent is the loop, the loop is the thing you reason about — and it has exactly one branch condition. At the Messages API level, Claude returns stop_reason: "tool_use" together with one or more tool_use blocks; your application executes each call and returns tool_result blocks on the next user turn.
[Official]
Tool use with Claude · AnthropicT1-official original The loop repeats that exchange until Claude responds without a tool call.
The architectural point is that the model decides what happens next, but your code decides whether it gets to. Tool access is “one of the highest-leverage primitives you can give an agent,” [Official] Tool use with Claude · AnthropicT1-official original and the loop is where that leverage is either contained or left unbounded. Owning the loop — not authoring any single tool — is the orchestration discipline the rest of Domain 1 elaborates.
A turn is one tool-use round-trip
The word turn has a precise meaning, and the exam leans on it. A turn is “one round trip inside the loop: Claude produces output that includes tool calls, the SDK executes those tools, and the results feed back to Claude automatically … Turns continue until Claude produces output with no tool calls.” [Official] How the agent loop works · AnthropicT1-official original The Agent SDK embeds this loop for you — it ships “the same tools, agent loop, and context management that power Claude Code” [Official] Agent SDK overview · AnthropicT1-official original — so at the SDK level you observe messages, not the raw branch.
The consequence that trips candidates: a text-only final response is not a turn. A four-message session is three tool-use turns plus one final text answer, so max_turns=2 would stop before that final step.
[Official]
How the agent loop works · AnthropicT1-official original Size a turn budget to the tool calls a task needs, not to the messages you expect to see.
Handling tool results: errors and parallel calls
Step 4 of the round-trip hides two cases the exam specifically tests, because both are where a hand-written loop goes wrong.
A failed tool does not raise — it reports. When your handler hits an error (the file is missing, the command exits non-zero, the API times out), you do not throw out of the loop. You return a normal tool_result block with is_error: true and the error text as its content.
[Official]
Tool use with Claude · AnthropicT1-official original The model sees the failure and adapts — retries with a corrected argument, tries a different tool, or explains the blocker — exactly as it would read any other result. Raising an exception instead severs the loop and throws away the model’s ability to recover.
Parallel calls must be answered together. When a single response contains more than one tool_use block, the API requires every corresponding tool_result in the next user message — you cannot answer one tool now and defer the other to a later turn. Execute them (concurrently when they are read-only and independent; serially when one mutates state another reads), collect all results, and send them as one batch. The mechanics of when to parallelize live in D2.3 — for the loop, the rule to hold is: all of a turn’s results return together, keyed by tool_use_id.
stop_reason is the branch
Because stop_reason is the loop’s branch condition, recognizing each value — and its loop consequence — is core exam material. On a ResultMessage, stop_reason carries the value from the model’s last response.
[Official]
How the agent loop works · AnthropicT1-official original
stop_reason | What it means | Loop consequence |
|---|---|---|
tool_use | The response contains tool calls | Continue — execute tools, return tool_result, request again |
end_turn | The model finished naturally, no tool calls | Stop — deliver the final result |
max_tokens | Output hit the token budget mid-response | Stop, but the answer is truncated — you may need to continue |
refusal | The model declined to generate | Stop — handle as a non-answer, not a result |
Termination is the architect’s safety contract
A model-driven loop that can run forever is a production incident waiting to happen, so the architect supplies the stop conditions the model cannot. The SDK exposes two budgets: max_turns (a turn count) and max_budget_usd (a client-side cost estimate). Hitting either ends the loop and sets ResultMessage.subtype to error_max_turns or error_max_budget_usd.
[Official]
How the agent loop works · AnthropicT1-official original
The subtype is the termination indicator, and it gates whether .result is even populated:
subtype | Meaning | .result? |
|---|---|---|
success | Normal finish | yes |
error_max_turns | Hit max_turns | no |
error_max_budget_usd | Hit max_budget_usd | no |
error_during_execution | API / cancellation error | no |
error_max_structured_output_retries | JSON-Schema validation failed past the retry limit | no |
Where Claude Code’s loop sits
Claude Code is one concrete harness around this loop. Its documentation states the architecture plainly: “The agentic loop is powered by two components: models that reason and tools that act.” [Official] How Claude Code works · AnthropicT1-official original The same source makes the dependency explicit — “Tools are what make Claude Code agentic. Without tools, Claude can only respond with text” [Official] How Claude Code works · AnthropicT1-official original — and the SDK’s account agrees that “Tools are the primary building blocks of execution for your agent.” [Official] Building agents with the Claude Agent SDK · Thariq Shihipar et al. (2025)T1-official original
For the exam, treat the loop in this chapter as the inner cycle. A long-running harness wraps it in an outer cycle that carries state across many context windows — but the branch condition, the turn model, and the termination contract are identical at every scale. Master the inner loop and the rest of Domain 1 is composition.
Practice
Exercise solutions
Three turns. Turns count tool-use round-trips: Read, Grep, Edit are turns 1–3; the final text summary is not a turn. The final response’s stop_reason is end_turn (no tool call). The smallest budget that still finishes is max_turns = 3 — it permits the three tool-use turns, and the free final text answer follows. max_turns = 2 would stop before the Edit.
B — end_turn. It means the model finished naturally with no tool calls, so the text response is the deliverable. A (tool_use) is the continue branch — there is more loop to run, not a final answer. C (max_tokens) stops the response but the answer is truncated mid-output, so it usually needs a continuation before it is usable. D (refusal) is a non-answer to be handled as a declined request, not delivered as a result. The discriminating idea: only end_turn is both terminal and complete.
A defensible policy: “Set max_turns to the tool calls a normal fix needs plus headroom (say 15), and max_budget_usd as a hard cost ceiling, because the re-reading loop fails by count and by spend and either bound should stop it. My handler checks ResultMessage.subtype first: on success I read .result; on any error_* subtype I surface the exhaustion (and the partial transcript) rather than printing an empty .result.” The exact field checked first is subtype — never .result before it.
Exam essentials
- An agent is a model in a loop: “LLMs using tools based on environmental feedback in a loop.” Workflow = predefined code path; agent = model directs its own steps.
- The loop has one branch:
stop_reason: "tool_use"→ run tools, returntool_result, request again;end_turn→ stop. - A turn = one tool-use round-trip. The final text-only response does not count against
max_turns. - Tool-result handling: a failed call returns a
tool_resultwithis_error: true(the model adapts; you do not throw); paralleltool_useblocks need all theirtool_results in the next user message, keyed bytool_use_id. stop_reasonvalues:tool_use(continue),end_turn(stop, usable),max_tokens(truncated — maybe continue),refusal(non-answer).- Termination is yours to set:
max_turns+max_budget_usd; on exhaustion theResultMessage.subtypebecomes anerror_*value and.resultis empty. - Check
subtypebefore reading.result— every error subtype leaves.resultunpopulated.
Further reading
The design rationale for the inner/outer split — why owning the loop boundary is what turns a model into a harness — is developed at length in the Agentic Systems Design book, Chapter 1, Agent = Model + Harness. It is optional depth, not required for the exam; this chapter is self-contained.
Coordinator–Subagent Patterns: Hub-and-Spoke and Isolated Context
The coordinator–subagent (orchestrator-worker) pattern — a lead agent that decomposes a task and spawns isolated-context subagents. Teaches why a second agent ever helps, when the pattern earns its 3–10x token cost, the full single-vs-multi trade-off (including reliability and maintainability), why decomposition must split by context and not by role, and the one variant that works across domains.
Once the agent loop of D1.1 can run tools, the next architectural question is whether to run more than one agent. This chapter develops the canonical multi-agent shape — a coordinator that spawns isolated subagents — and, just as importantly, the discipline of not reaching for it. The exam tests judgment here: when the pattern wins, what it costs on every axis, how to cut the work, and which single variant is reliable.
Why run more than one agent
A single agent (D1.1) is one model reasoning over one finite context window. That window is the bottleneck: everything the agent reads, every tool result, every intermediate thought accumulates in it, and a model attends less reliably as it fills. So the motivation for a second agent is not “more brains” — it is more windows. When a subtask would flood the main window with data the final answer doesn’t need, or when independent paths could be explored at once, splitting the work across separate context windows relieves the constraint a single loop cannot.
That is the line Building Effective Agents draws between a workflow and an agent: an agent is a “system where LLMs dynamically direct their own processes and tool usage.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original A coordinator of subagents is one such system — an agent whose tool is “spawn another agent.”
The orchestrator and its workers
The canonical multi-agent shape is hub-and-spoke: a lead agent analyzes the task, plans a strategy, and spawns subagents that explore parts of it independently. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The lead synthesizes their results and decides whether more work is needed — it is an orchestrator, and the subagents are workers.
This is a real architecture with measured stakes, not a toy. In Anthropic’s research system, “a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The number is real but specific to that model pairing and that eval — read it as evidence the pattern can pay off, not as a portable benchmark.
Isolated context is the whole point
The property that defines the pattern is that each subagent runs in its own context window and does not see the parent’s state. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original A subagent is given a task and returns a result; the intermediate tokens it generates never touch the coordinator’s window. That isolation is the feature: it keeps a subtask’s noise out of the agent that has to reason over the whole problem.
Because results must cross a context boundary, large outputs use the artifacts pattern — a subagent writes its full output to the filesystem or external storage and passes a lightweight reference back, rather than streaming everything through the coordinator. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The coordinator stays lean; the high-fidelity output lives outside its window until needed.
When the pattern earns its cost
The capability is bought with tokens. “In our data, agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats,” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original and against a single agent on an equivalent task, “multi-agent implementations typically use 3-10x more tokens than single-agent approaches.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original So the official guidance leads with restraint: “Start with the simplest approach that works, and add complexity only when evidence supports it.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original Try improved prompting, context compaction, and the Tool Search Tool on one agent first.
Reach for coordinator–subagent only when one of three conditions holds: [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original
| Win condition | The signal | What it buys |
|---|---|---|
| Context protection | A subtask generates large, mostly-irrelevant intermediate data (>1000 tokens) that would pollute the main agent’s reasoning | A clean main-agent window |
| Parallelization | Genuinely independent paths to explore concurrently | Thoroughness, not speed (coordination often makes wall-clock slower) |
| Specialization | Tool-set overload (avoid 20+ tools on one agent), conflicting personas, or deep domain expertise | Focused agents that outperform an overloaded generalist |
Cost is only the first axis. The full trade-off — the one a scenario question makes you weigh — is worse for multi-agent on most rows, and the architect must be able to name them:
Multi-agent is not “more advanced and therefore better”; it trades cost, latency, reliability, and maintainability for capability on tasks that genuinely need separate windows. Three of those five rows are downsides — which is why the exam frames the decision as restraint first.
Decompose by context, not by role
When you do split, how you cut the work is the most-tested judgment in this domain — and the most common way to get it wrong. The anti-pattern is role-based / problem-centric decomposition: planner → implementer → tester → reviewer. It feels organized but “creates constant coordination overhead and context loss at handoffs — the telephone game,” spending more tokens coordinating than executing. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original
The reliable alternative is context-centric decomposition: split only at true context boundaries.
The verification subagent
One multi-agent shape “consistently succeeds across domains”: the verification subagent. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original The main agent does the work; a separate agent blackbox-tests the result with clear success criteria and minimal context transfer. The isolation is the strength — the verifier has no stake in, and no memory of, how the work was produced.
Its failure mode is early victory: verifiers tend to declare success after one or two checks. The documented mitigation is an explicit instruction — “You MUST run the complete test suite before marking as passed.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original
Practice
Exercise solutions
Multi-agent is plausibly warranted — for specialization — but the proposed split is the role-based anti-pattern. The real signal is tool-set overload (40 tools on one agent; the guidance says avoid 20+ and prefer focused agents). So the justified cut is by tool/domain context — e.g. a CRM-and-orders agent vs a messaging-and-analytics agent — each carrying a focused tool set. The proposed intake → diagnosis → resolution → follow-up split is problem-centric: those are sequential phases of one tightly-coupled ticket, so they would lose fidelity at every handoff (the telephone game) and add coordination cost. Decompose by what context is independent (tool domains), not by what step comes next. And first confirm the Tool Search Tool alone can’t relieve the tool overload on a single agent.
The failure mode is context loss / information-fidelity degradation at handoffs, plus constant coordination overhead — the guidance calls it “the telephone game.” It happens because planner/implementer/tester/reviewer are sequential phases of one tightly-coupled task, not independent contexts. The rule: place a split only at a true context boundary — independent paths, clean-interface components, or blackbox verification — never by “what step comes next.”
Any two of: Reliability — single (one point of failure) → multi (multiple failure points). Maintainability — single (one prompt set) → multi (multiple prompt sets to keep in sync). Latency — single (fast sequential) → multi (often slower despite parallelism). Context coherence — single (unified) → multi (fragmented at handoffs). “More scalable” is not free: three of the five trade-off rows move the wrong way when you add agents.
Exam essentials
- Why multi-agent at all: a single agent has one finite context window; extra agents buy more windows, not more intelligence.
- Coordinator–subagent = hub-and-spoke: a lead decomposes, spawns subagents, and synthesizes; subagents run in isolated context windows and do not see parent state.
- Isolation is the feature (context protection); large outputs use the artifacts pattern — write to storage, pass a reference back.
- Cost is 3–10× tokens (and ~15× vs a chat). The 90.2% gain was Opus 4 lead + Sonnet 4 subagents vs single Opus 4 — not a portable number.
- The full trade-off is mostly worse for multi-agent: higher tokens, often higher latency, multiple failure points, multiple prompt sets, fragmented coherence. Start single-agent; split only for context protection, parallelization, or specialization.
- Decompose by context, not role. planner/implementer/tester/reviewer is the telephone-game anti-pattern; split at true context boundaries.
- The verification subagent is the reliable variant — blackbox-test the result; mitigate early victory with “run the complete test suite before marking as passed.”
Further reading
The environment angle on isolation — how bounding what each agent loads is the same discipline that makes a large codebase legible — is developed in the Agentic Systems Design book, Chapter 7, Environments at Scale. Optional depth; this chapter stands on its own.
Subagent Invocation: AgentDefinition, the Agent Tool, and allowedTools
The mechanics beneath the coordinator–subagent pattern — how a subagent is actually invoked. The Agent tool and its three creation paths, the AgentDefinition contract, why "Agent" must be in allowedTools, the single prompt-string channel into a fresh context, what crosses back out, and what a subagent does not inherit (including parent permissions).
Chapter D1.2 established when to reach for a subagent and how to cut the work. This chapter drops one level — to the mechanics of actually defining and invoking one, and reading what crosses each way across the context boundary. The exam tests whether you can read an AgentDefinition, predict whether it will be invoked at all, and say what crosses into the subagent’s fresh context and what crosses back — and what silently does not.
The Agent tool is the invocation surface
A subagent is invoked through exactly one tool — the Agent tool — and there are three ways to give that tool something to invoke. [Official] Subagents in the SDK · AnthropicT1-official original Everything in this chapter hangs off that one tool: how you define what it runs, how you allow it to run, and what crosses the boundary in each direction.
One naming wrinkle is load-bearing on the exam and in tool-name filters. The tool was renamed from Task to Agent in Claude Code v2.1.63; current SDK releases emit Agent in tool_use blocks but still report Task in the system:init tools list and in result.permission_denials[].tool_name.
[Official]
Subagents in the SDK · AnthropicT1-official original Code that filters on the tool name must check both values.
AgentDefinition is the subagent’s contract
When you create a subagent programmatically, its AgentDefinition is a contract with two required halves: a description that says when to use it and a prompt that says how it behaves.
[Official]
Subagents in the SDK · AnthropicT1-official original Everything else is optional refinement.
| Field | Required | Purpose |
|---|---|---|
description | yes | Natural-language when to use this agent — drives automatic matching |
prompt | yes | The agent’s system prompt: its role and behavior |
tools | no | Allowed tool names; omit to inherit all of the parent’s tools |
model | no | Model override (sonnet / opus / haiku / inherit / full ID) |
maxTurns | no | Cap the subagent’s agentic turns (its own budget) |
The description does double duty — it is also how Claude decides to invoke the agent automatically (below), so write it specific and keyword-rich rather than generic.
[Official]
Subagents in the SDK · AnthropicT1-official original
Enabling invocation: Agent in allowedTools
A defined subagent will not run unless the Agent tool itself is approved. Always include "Agent" in allowedTools to auto-approve subagent invocations; without it, the call falls through to your canUseTool callback or — in dontAsk mode — is denied outright.
[Official]
Subagents in the SDK · AnthropicT1-official original
The prompt string is the only channel in
A subagent starts in a fresh context window, and the only thing that crosses from parent to child is the Agent tool’s prompt string. [Official] Subagents in the SDK · AnthropicT1-official original
“A subagent’s context window starts fresh (no parent conversation) but isn’t empty. The only channel from parent to subagent is the Agent tool’s prompt string, so include any file paths, error messages, or decisions the subagent needs directly in that prompt.” [Official] Subagents in the SDK · AnthropicT1-official original
What that means concretely — the subagent receives its own system prompt (AgentDefinition.prompt), the Agent-tool prompt, the project CLAUDE.md (loaded via settingSources), and its tool definitions. It does not receive the parent’s conversation or tool results, the parent’s system prompt, or any preloaded skill content.
[Official]
Subagents in the SDK · AnthropicT1-official original Permissions are part of what does not cross: a subagent does not inherit the parent’s permissions — each runs its own evaluation chain — so a tool the parent could use is not automatically usable by the child.
[Official]
Configure permissions · AnthropicT1-official original
What crosses back: the return channel
The boundary is asymmetric, and the exam probes the outbound side too. When the subagent finishes, the parent receives the subagent’s final message as the Agent-tool result — but the parent may summarize it rather than carry it through verbatim. If a downstream step depends on the subagent’s exact output (a precise list, a diff, a structured payload), instruct the main agent to preserve the subagent’s result verbatim.
[Official]
Subagents in the SDK · AnthropicT1-official original Two more facts ride the return path: every message generated inside the subagent carries a parent_tool_use_id linking it to the invoking Agent call,
[Official]
How the agent loop works · AnthropicT1-official original and the subagent’s transcript persists independently of the main conversation (it survives main-session compaction).
[Official]
Subagents in the SDK · AnthropicT1-official original
Invocation paths and the one-level limit
Once Agent is allowed, a subagent is invoked one of two ways.
[Official]
Subagents in the SDK · AnthropicT1-official original
- Automatic — Claude matches the subagent’s
descriptionto the task. This is why the description must be specific and keyword-rich. - Explicit — name the agent in the prompt (“Use the code-reviewer agent to check the auth module”), bypassing automatic matching.
There is a hard structural limit: subagents cannot spawn subagents. Don’t include Agent in a subagent’s tools array — delegation is one level deep.
[Official]
Subagents in the SDK · AnthropicT1-official original
Practice
Exercise solutions
Two faults, neither in the tools array. (1) "Agent" is not in the parent’s allowedTools, so the Agent-tool call is never auto-approved — it falls through to canUseTool (or is denied in dontAsk). Fix: add "Agent" to allowedTools. (2) description: "Reviews things" is too vague for automatic matching — descriptions must be specific and keyword-rich. Fix: rewrite it (e.g. “Review Markdown/docs for accuracy, broken links, and stale references; read-only”), or invoke the agent explicitly by name to bypass matching. The tools: ["Read", "Grep", "Glob"] set is exactly right for read-only review — leave it. The two faults map to the two gates: allowedTools is the run gate, the description is the match gate.
Programmatic (an AgentDefinition in the agents option) — the recommended path for SDK apps; filesystem (.claude/agents/*.md, loaded at startup); and the built-in general-purpose agent. The built-in needs no AgentDefinition because it ships with a default description and prompt — Claude can invoke it through the Agent tool with nothing defined, which is why it is the zero-config fallback.
The return channel is lossy by default: only the subagent’s final message returns to the parent, and the parent may summarize it. So the coordinator is acting on a paraphrase of the reviewer’s findings, not the exact file:line list the subagent produced. The fix is to instruct the main agent to preserve the subagent’s result verbatim (and have the reviewer return a structured, easily-quoted format). The subagent’s own transcript being correct is the tell that the loss happened on the way back, not inside the subagent.
Exam essentials
- One tool, three creation paths: subagents are invoked via the Agent tool (renamed from
Task— filters must match both), created programmatically (agentsoption, recommended), via filesystem (.claude/agents/*.md), or as the built-ingeneral-purposeagent. AgentDefinition=description+prompt(both required);tools/model/maxTurnsoptional. Thedescriptiondrives automatic matching, so make it keyword-rich.Agentmust be inallowedToolsor the subagent never runs. Two gates: description matches,allowedToolsruns.- The prompt string is the only inbound channel. A subagent gets a fresh context — no parent conversation, no parent system prompt, no preloaded skills — but does get project CLAUDE.md + its tools. Permissions do not inherit; each subagent has its own evaluation chain.
- The return channel is asymmetric and lossy: the parent receives the subagent’s final message but may summarize it; instruct the main agent to preserve it verbatim when fidelity matters.
parent_tool_use_idattributes a message to its subagent. - Delegation is one level deep: subagents cannot spawn subagents (no
Agentin a child’stools).
Multi-Step Workflows: Programmatic vs Prompt-Based Handoff
A multi-step task's control flow is enforced either in your code (programmatic) or in the model (prompt-based). When to choose each, why every step boundary is a handoff that leaks fidelity, the Writer/Reviewer pattern as the handoff that works, how a written artifact makes a handoff survivable, and how a programmatic validation gate rejects-and-retries a bad step before it propagates.
Most real agent work is several steps, not one. The question this chapter answers is not what the steps are but who enforces the order — your application code, or the model itself — and how you keep a bad step from poisoning the next one. That choice is an Evaluate-level judgment: the exam gives you a workflow and asks where the control flow belongs, how the handoff is specified, and where the gate goes.
Two places to enforce a workflow
A multi-step workflow’s control flow lives in exactly one of two places. Either your code drives the sequence — run a step, take its output, decide the next call — or the model drives it, having been told the steps in a prompt. The Agent SDK frames the split directly: “With the Client SDK, you implement a tool loop. With the Agent SDK, Claude handles it.” [Official] Agent SDK overview · AnthropicT1-official original
These are not rival products — Anthropic notes the same workflow “translate[s] directly” between the CLI and the SDK. [Official] Agent SDK overview · AnthropicT1-official original The architect’s decision is which layer holds the control flow, and it turns on how much the workflow needs determinism versus flexibility.
Every step boundary is a handoff
Whichever layer enforces the steps, each transition between them is a handoff — the output of step N becomes the input of step N+1 — and a handoff is where information is lost. Chapter D1.2 named the worst case: dividing a tightly-coupled task by role (planner → implementer → tester → reviewer) “creates constant coordination overhead and context loss at handoffs — the telephone game,” spending more tokens coordinating than executing. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original Prompt-based handoffs across a long sequential chain are the most fidelity-fragile arrangement: each step re-narrates the last, and detail erodes at every retelling.
The Writer/Reviewer handoff that works
Not every handoff leaks — the canonical multi-step quality workflow depends on one. In the Writer/Reviewer pattern, one session implements and a second reviews: “A fresh context improves code review since Claude won’t be biased toward code it just wrote.” [Official] Best practices for Claude Code · AnthropicT1-official original Session A writes the rate limiter; Session B reviews the file for edge cases, race conditions, and consistency; Session A then addresses the feedback. The same shape works for tests — “have one Claude write tests, then another write code to pass them.” [Official] Best practices for Claude Code · AnthropicT1-official original
The handoff contract and its artifact
When work does cross a boundary, what crosses must be specified, not assumed. Anthropic’s research system makes each handoff an explicit contract: “Each subagent needs an objective, an output format, guidance on the tools and sources to use, and clear task boundaries.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original That is the same discipline as D1.3’s rule that everything the subagent needs goes in the prompt — applied to every step of a multi-step flow.
The most robust way to carry that contract across a boundary is as a written artifact — a file the next step reads, not prose it re-narrates. Two concrete forms appear in the best-practices guidance:
- A spec file. After an interview/planning phase, “start a fresh session to execute it … and you have a written spec to reference.” [Official] Best practices for Claude Code · AnthropicT1-official original The spec, not the conversation, is what crosses to the implementation step.
- A test file. In the test/code split, the tests are the contract: one Claude writes them, another writes code to pass them. The implementer’s target is the file, not a description of it.
The validation gate: reject before propagating
A precise contract is also what lets a programmatic pipeline put a gate between steps — a check the step-N output must pass before step N+1 is allowed to consume it. The gate does two kinds of check, and the distinction is the one Domain 4 builds on (D4.4):
On failure, the gate does not pass the bad output downstream — it rejects and retries: re-prompt the failing step with the specific errors, and only advance when the output passes. That is the difference between a programmatic pipeline and a prompt-based one: the gate is enforced in your code, where a malformed step cannot quietly become the next step’s input.
Choosing where the control flow lives
The Evaluate-level call: enforce programmatically when the workflow needs determinism, an audit trail, validation gates between steps, or a fixed and repeatable sequence — the steps are known in advance and you want them to run the same way every time. Stay prompt-based when the path is flexible, the model can sensibly self-direct, and the orchestration code would cost more than it saves.
This pairs with two neighboring decisions: whether to split into multiple agents at all (D1.2) and whether the decomposition is a fixed pipeline or an adaptive one (D1.6). Enforcement locus is how the steps are driven; those chapters cover whether and into what shape. The mechanics of the validation/retry loop itself — schema vs semantic errors, bounded retries — are developed in D4.4.
Practice
Exercise solutions
Choose (b), but split at one boundary only. The fact-check is failing for the exact reason the Writer/Reviewer pattern addresses — a context biased toward the draft it just produced rationalizes its own claims. Hand the fact-check to a fresh context (a second session or a verification subagent), passing an explicit handoff contract: the draft, the claims to verify, the success criteria, the output format. That is a programmatic handoff — your code routes the draft to the reviewer and the verdict back. Keep research → draft coupled in one context: they are tightly coupled and share state, so a handoff there would only leak fidelity. And do not split all four steps into role-agents — that is the telephone-game pipeline, four lossy handoffs where you needed one. The skill is placing the single split where fresh context buys independence.
Programmatic enforcement puts the control flow in your code — you sequence the steps, pass each output to the next, and can gate between them; choose it when you need determinism, an audit trail, validation gates, or a fixed repeatable sequence. Prompt-based enforcement puts the control flow in the model — it is told the procedure and self-directs; choose it when the path is flexible, the model can sensibly adapt, and orchestration code would cost more than it saves.
The schema is doing a structural check — right shape, fields present, types valid — and is missing the semantic check: whether the content is actually correct (valid JSON can still carry fabricated or contradictory data). The gate should not pass a failing output along; it should reject and retry — re-prompt the failing step with the specific error and only advance once the output passes both the structural and semantic checks.
Exam essentials
- Two enforcement loci: a multi-step workflow’s control flow lives in your code (programmatic — you sequence steps and gate between them; deterministic/auditable) or the model (prompt-based — told the steps, self-directs; flexible).
- Every step boundary is a handoff, and handoffs leak. A sequential chain of role-agents is the most fidelity-fragile arrangement — the telephone game.
- Writer/Reviewer is the handoff that works because the reviewer has fresh context — it can’t defend code it never wrote. Don’t let an author review its own work.
- Carry the contract in a written artifact — a spec file or a test file the next step reads from disk — so the handoff doesn’t depend on re-narration.
- A programmatic validation gate runs a structural check and a semantic check between steps, and on failure rejects and retries rather than propagating the bad output. (The loop in depth: D4.4.)
- Choose programmatic for determinism / audit / gates / fixed sequence; prompt-based for flexible self-direction. (Whether to split = D1.2; pipeline vs adaptive = D1.6.)
Agent SDK Hooks: Intercepting, Gating, and Normalizing the Loop
A hook is a typed callback the SDK fires at a named lifecycle event — the architect's control plane to intercept, gate, and normalize the agent loop without touching the model. The events to know, the two interception modes (PreToolUse gates, PostToolUse normalizes), the four PreToolUse decisions including what defer does, and the deny-beats-defer-beats-ask-beats-allow precedence.
The agent loop of D1.1 runs the model’s tool calls automatically. Hooks are how the architect gets between the model and those calls — to block a dangerous one, rewrite its input, or clean up its output — without editing the model’s prompt. The exam tests three things: which events exist, the two modes of intervention, and who wins when hooks disagree.
Hooks intercept the loop at named events
A hook is a callback that runs your code in response to an agent event: “Hooks are callback functions that run your code in response to agent events, like a tool being called, a session starting, or execution stopping.”
[Official]
Intercept and control agent behavior with hooks · AnthropicT1-official original They arrive through two channels that share one lifecycle — programmatic hooks (callbacks in your query() options) and filesystem hooks (shell commands in settings.json, loaded via settingSources).
[Official]
Intercept and control agent behavior with hooks · AnthropicT1-official original
The events you must recognize
Hooks fire at named lifecycle points. The Python SDK exposes ten; the TypeScript SDK extends the same set to twenty. [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original The ones an architect must recognize:
| Event | Fires when | Typical use |
|---|---|---|
PreToolUse | a tool call is requested | block or rewrite the call before it runs |
PostToolUse | a tool returns a result | normalize or replace the result before the model sees it |
PostToolUseFailure | a tool execution fails | log or handle the error |
UserPromptSubmit | a prompt is submitted | inject extra context |
Stop | the agent stops | persist state before exit |
SubagentStart / SubagentStop | a subagent begins / ends | track spawned parallel work |
PreCompact | compaction is about to run | archive the full transcript first |
PermissionRequest | a permission dialog would appear | custom permission handling |
Notification | an agent status message | forward to Slack / PagerDuty |
The TypeScript-only additions (PostToolBatch, SessionStart, SessionEnd, Setup, and others) are why SessionStart / SessionEnd are not available as Python SDK callbacks — Python apps needing them load filesystem hooks from settings instead.
[Official]
Intercept and control agent behavior with hooks · AnthropicT1-official original
PreToolUse gates; PostToolUse normalizes
The two most-tested events define the two interception modes, and they sit on opposite sides of tool execution. [Official] Intercept and control agent behavior with hooks · AnthropicT1-official original
PreToolUsegates — it runs before the tool and returns apermissionDecisionofallow,deny,ask, ordefer, optionally withupdatedInputto rewrite the call. Blocking a write to a.envfile is aPreToolUsehook matchingWrite|Editthat returnsdenywhen the target path ends in.env.PostToolUsenormalizes — it runs after the tool and returns eitheradditionalContext(appended to the result) orupdatedToolOutput(which replaces the output before Claude sees it). Stripping noise, redacting secrets, or reshaping a tool’s response into a clean form isPostToolUsework.
Matchers select which calls a hook sees: they are regex strings tested against the tool name — "Write|Edit", "^mcp__" for all MCP tools, or omitted to match everything.
[Official]
Intercept and control agent behavior with hooks · AnthropicT1-official original Matchers do not filter by argument; filter on the file path or command inside the callback.
Precedence: deny beats defer beats ask beats allow
When several hooks (or permission rules) act on one event, the outcome is decided by a fixed precedence, not by who ran first: “When multiple hooks or permission rules apply, deny takes priority over defer, which takes priority over ask, which takes priority over allow. If any hook returns deny, the operation is blocked regardless of other hooks.”
[Official]
Intercept and control agent behavior with hooks · AnthropicT1-official original The four decisions are not symmetric, and defer is the one candidates miss:
Hooks and subagents
Two subagent-aware events — SubagentStart and SubagentStop — let you track spawned work, but the operational gotcha is about permissions. As D1.3 established, subagents do not inherit the parent’s permissions; each runs its own evaluation chain.
[Official]
Configure permissions · AnthropicT1-official original The clean way to pre-approve a subagent’s tools is a PreToolUse hook rather than re-prompting inside every child.
[Official]
Intercept and control agent behavior with hooks · AnthropicT1-official original
Practice
Exercise solutions
(a) is a gate, (b) is a normalization — opposite sides of tool execution, so they need different events. (a) Use a PreToolUse hook with a matcher of "Write|Edit"; in the callback, inspect the target path and return hookSpecificOutput.permissionDecision: "deny" when it ends in .env. The decision must come before the write runs. (b) Use a PostToolUse hook with a matcher of "Bash"; return hookSpecificOutput.updatedToolOutput containing the result with ANSI codes stripped — updatedToolOutput replaces the output before Claude sees it. (b) can’t reuse PreToolUse because the output does not exist yet when the call is requested; you can only normalize a result after the tool has produced it.
The call is blocked. All matching hooks run in parallel and the most restrictive result wins, so the deny overrides the allow and the ask — the precedence is deny > defer > ask > allow. The one-word rule is restrictive (equivalently, “deny wins”): one hook saying deny is enough to block; permitting requires every hook to agree.
Both expectations are wrong. defer does not run the call — it ends the query so the host can resume it later from the persisted session (a pause-and-hand-back, not an allow). And updatedInput is ignored with defer — that field applies only to allow (or ask). To run a rewritten command, the hook must return allow with updatedInput, not defer.
Exam essentials
- A hook is a callback at a named lifecycle event, delivered programmatically (query options) or via filesystem settings. It runs in your process and does not consume agent context.
- Two interception modes:
PreToolUsegates (returnsallow/deny/ask/defer+ optionalupdatedInput);PostToolUsenormalizes (updatedToolOutputreplaces the result,additionalContextappends). - The four decisions:
allow/ask/deny, anddefer— ends the query so the host can resume it later from the persisted session (andupdatedInputis ignored withdefer). - Precedence is
deny > defer > ask > allow. Matching hooks run in parallel; the most restrictive wins; onedenyblocks. - Don’t rely on hook order (non-deterministic) — write each hook independently.
- Subagents don’t inherit permissions; pre-approve their tools with a
PreToolUsehook. Watch the silent-failure traps: case-sensitive names,max_turnscut-offs, recursive subagent loops.
Task Decomposition: Sequential Pipelines vs Adaptive
Once you've decided to decompose, the structural choice is fixed-in-advance (a sequential pipeline — predictable, cheap, auditable) versus decided-at-runtime (adaptive — the orchestrator scales effort to the task). Why open-ended work can't be hardcoded, why predictable work shouldn't be adaptive, the quantified token cost of choosing adaptive, and the failure modes at both extremes.
Chapter D1.2 settled whether to split a task and where the context boundaries are. This chapter asks a different question about the same task: once it is cut into pieces, is the set of pieces fixed in advance, or decided at runtime? That is the pipeline-versus-adaptive choice, and getting it wrong is expensive in opposite directions — an Evaluate-level judgment the exam probes with concrete tasks.
Two shapes of a decomposed task
A decomposed task takes one of two structural shapes, distinguished by when the structure is determined.
The difference is not how many agents run but who decides the shape and when: the author, in advance, or the orchestrator, on the fly. Each is right for a different kind of task.
When the path can’t be hardcoded → adaptive
Some work resists a fixed pipeline by its nature. Anthropic’s research system is explicit about why: “Research work involves open-ended problems where it’s very difficult to predict the required steps in advance. You can’t hardcode a fixed path for exploring complex topics, as the process is inherently dynamic and path-dependent.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original When step N+1 depends on what step N discovered, no design-time sequence can capture it.
Adaptive systems handle this by scaling effort to the input. The research system embeds the heuristic directly in its lead-agent prompt: “Simple fact-finding requires just 1 agent with 3-10 tool calls, direct comparisons might need 2-4 subagents with 10-15 calls each, and complex research might use more than 10 subagents with clearly divided responsibilities.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original
When the path is predictable → pipeline
The opposite case is just as common and far cheaper to run. When a task’s steps are known, repeatable, and the same every time, a fixed sequential pipeline is the right structure: it is deterministic, auditable, and predictable in cost, and it asks the orchestrator to make no runtime judgment at all. A pipeline is typically programmatically enforced (D1.4) — your code drives the fixed sequence — precisely because nothing about the structure needs to be decided live. Reaching for adaptivity here is wasted capability: you pay for an orchestrator’s deliberation to re-derive a structure you already knew at design time.
The cost you are choosing
“Cheaper” and “more expensive” are not hand-waving — the choice has a price tag, and the exam expects the number. An adaptive multi-agent flow “typically use[s] 3-10x more tokens than single-agent approaches for equivalent tasks,” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original and on the absolute scale, “multi-agent systems use about 15× more tokens than chats.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original What that spend buys is thoroughness, not speed — parallel subagents explore a larger space, but coordination plus the slowest subagent often make the wall-clock slower, not faster. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original
The failure mode at each extreme
Both shapes fail, in opposite ways, when matched to the wrong task.
Over-decomposition is the adaptive failure: an orchestrator that misjudges effort produces absurd structures. Among the research system’s documented early failures was “spawning 50 subagents for simple queries” — capability with no judgment behind it, multiplying token cost for nothing. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The effort-scaling heuristic exists to mitigate exactly this.
Rigid pipelining is the inverse: forcing a fixed sequence onto path-dependent work, which then cannot adapt when a step surfaces something the design didn’t anticipate. And a pipeline cut by role rather than context is the telephone-game anti-pattern of D1.2 — sequential phases of one coupled task, losing fidelity at every handoff. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original
Choosing the structure
The Evaluate-level call comes down to predictability. Choose a sequential pipeline when the steps are knowable in advance and you value determinism, auditability, and bounded cost. Choose adaptive decomposition when the task is open-ended and path-dependent and that capability is worth the orchestrator overhead and the 3–10× variable cost. Either way, the orchestrator (or the author) must size effort to the task — adaptivity does not excuse you from judgment; it relocates it to runtime.
Practice
Exercise solutions
(a) Sequential pipeline. (b) Adaptive. (a) The steps are known and identical for every ticket, and the output is a fixed shape — nothing to decide at runtime, so a deterministic pipeline wins on cost, auditability, and predictability. (You fan the same fixed path across 200 tickets — parallel throughput, not adaptive structure.) (b) The depth is unknowable up front and each finding changes what to look at next — “you can’t hardcode a fixed path… inherently dynamic and path-dependent” — so the orchestrator must scale the decomposition to what it discovers. On (b) guard against over-decomposition (don’t spawn ten subagents to confirm an obvious “no” — size effort via the 1 / 2–4 / 10+ ladder), and accept that you are paying roughly 3–10× the tokens of a single agent for the thoroughness.
Hardcode a sequential pipeline when the steps are knowable and identical in advance (predictable, repeatable); decompose adaptively only when the task is open-ended and path-dependent — when step N+1 depends on what step N discovers, so no design-time sequence can capture it.
They are paying roughly 3–10× the tokens of a single-agent pipeline (about 15× a chat). That spend buys thoroughness — a larger explored space — not speed (coordination plus the slowest subagent often make it slower in wall-clock). It is wasted here because a fixed, predictable nightly job has nothing to decide at runtime: the orchestrator’s judgment is re-deriving a structure already known at design time, so you pay the multiplier for capability the task never needed. A deterministic pipeline is the correct, bounded-cost shape.
Exam essentials
- Two shapes, distinguished by when the structure is set: a sequential pipeline is fixed at design time; adaptive decomposition is decided at runtime by the orchestrator.
- Adaptive when the path can’t be hardcoded — open-ended, path-dependent work where step N+1 depends on step N. Scale effort to the input (the 1 / 2–4 / 10+ heuristic).
- Pipeline when steps are predictable — deterministic, auditable, bounded cost, usually programmatically enforced (D1.4). Adaptivity here is wasted.
- Adaptive costs 3–10× the tokens of a single agent (≈15× a chat), and buys thoroughness, not speed. A pipeline’s cost is bounded; an adaptive flow’s is variable and can spike.
- Two opposite failure modes: over-decomposition (“50 subagents for a simple query”) for adaptive; rigid/role-based pipelining (the telephone game) for fixed.
- Choosing turns on predictability — and either way the orchestrator must size effort to the task. (Whether to split = D1.2; how to enforce = D1.4.)
Session State: resume, fork, and Scratchpads
A session is the persisted conversation, not the filesystem. The architect's tools for carrying or branching state across context windows — continue, resume, and fork, with their literal Python/TS spellings — plus the encoded-cwd resume trap and the discipline of capturing durable artifacts as application state rather than shipping transcripts.
The loop of D1.1 runs inside a single context window. Real agents outlive one window — they pause, resume on another host, or branch to try an alternative. The state that survives is the session, and this chapter fixes what a session is, the three controls that carry or branch it, and the one discipline that outlasts the session itself. This closes Domain 1: the loop, scaled across many windows.
A session is the persisted conversation
A session is the conversation history — the prompt and every tool call, tool result, and response — persisted as JSONL on disk at ~/.claude/projects/<encoded-cwd>/<session-id>.jsonl.
[Official]
Work with sessions · AnthropicT1-official original The boundary that matters most is what a session does not include:
“Sessions persist the conversation, not the filesystem. To snapshot and revert file changes the agent made, use file checkpointing.” [Official] Work with sessions · AnthropicT1-official original
continue, resume, and fork
Three controls carry or branch a session — and the exam expects you to recognize their literal spellings, which differ between Python and TypeScript. [Official] Work with sessions · AnthropicT1-official original
| Control | Literal call (Python / TS) | What it does | When to reach for it |
|---|---|---|---|
continue | continue_conversation=True / continue: true | Picks up the most recent session in the current cwd — no ID needed | Resume after a process restart in the same directory |
resume | resume=sessionId / resume: sessionId | Picks up a specific session by ID | Multi-user / multi-conversation apps where “most recent” is ambiguous |
fork | fork_session=True / forkSession: true | Starts a new session ID from a copy of the original’s history; the original is untouched | Try an alternative direction without losing the original thread |
continue and resume extend one thread; fork splits one into two. Capture the ID you’ll need later from ResultMessage.session_id — it is present even on errors.
[Official]
Work with sessions · AnthropicT1-official original
Fork branches the conversation, not the filesystem
The most-missed property of forking is the same boundary from section 1, sharpened:
“Forking branches the conversation history, not the filesystem. If a forked agent edits files, those changes are real and visible to any session working in the same directory.” [Official] Work with sessions · AnthropicT1-official original
Resume to recover — and the encoded-cwd trap
resume is the recovery tool for a loop that ended on a budget. When a session stops with error_max_turns (D1.1), you resume it with a higher limit and let it finish rather than restarting from scratch.
[Official]
Work with sessions · AnthropicT1-official original Because the work hit a budget, not a wall, the transcript is intact and resumable.
[Official]
How the agent loop works · AnthropicT1-official original
The single most common resume bug is the encoded-cwd mismatch:
“If a
resumecall returns a fresh session instead of the expected history, the most common cause is a mismatchedcwd. Sessions are stored under~/.claude/projects/<encoded-cwd>/*.jsonl, where<encoded-cwd>is the absolute working directory with every non-alphanumeric character replaced by-.” [Official] Work with sessions · AnthropicT1-official original
Scratchpads: durable state beyond the session
Sometimes the session itself is the wrong unit to carry — especially across hosts, where a CI worker or ephemeral container won’t have yesterday’s transcript file. The robust move is to lift the state you care about out of the conversation: “capture the artifacts you care about (analysis output, decisions, file diffs) as application state and pass into a fresh session’s prompt,” which the docs call “often more robust than shipping transcript files around.” [Official] Work with sessions · AnthropicT1-official original A scratchpad — a working file the agent writes to and reads from — is that same discipline applied within a run: the durable artifact, not the transcript, is the thing that survives. A fresh session then starts from that artifact, the way the best-practices guidance recommends executing a written spec in a clean session. [Official] Best practices for Claude Code · AnthropicT1-official original
This is where session state shades into memory — the design rationale for persisting durable context across many sessions is optional depth, in the Further reading.
Practice
Exercise solutions
(b) is more robust. (a) To resume by ID, the original session JSONL must be restored to the same path — ~/.claude/projects/<encoded-cwd>/<session-id>.jsonl — and the fresh worker must run from the same cwd, because the encoded-cwd is derived from the working directory; only then does resume=sessionId (with a bumped max_turns) find the transcript. That means shipping the transcript file and reproducing the exact directory on every worker. (b) Instead, capture the artifacts that matter — the decisions made, the diff so far, the remaining plan — as application state and seed a fresh session’s prompt with them. No transcript to ship, no cwd to reproduce; the docs call this “often more robust than shipping transcript files around.” Resume is for recovering in place; cross-host work favors captured artifacts. (And note: neither option restores the files the agent edited — that is file checkpointing, separate from session state.)
C — fork. Forking starts a new session ID from a copy of the original’s history, and the original session is left untouched — exactly “branch to try an alternative without losing the original thread.” A (continue) and B (resume) both extend the same thread, so the alternative attempt would pollute the original conversation, not branch from it. D (file checkpointing) snapshots files, not the conversation — useful if the risky refactor must be revertible on disk, but it does not give you a second conversation. (Reminder: fork branches the conversation, not the filesystem, so pair it with checkpointing if the files must branch too.)
The most common cause is a mismatched cwd. Sessions are stored under ~/.claude/projects/<encoded-cwd>/*.jsonl, where <encoded-cwd> is the absolute working directory with every non-alphanumeric character replaced by -; if you resume from a different directory, the SDK derives a different encoded path, finds nothing, and starts a fresh session. The fix: run the resume from the same working directory as the original session (or otherwise ensure the encoded-cwd path matches).
Exam essentials
- A session is the persisted conversation, not the filesystem — JSONL under
~/.claude/projects/<encoded-cwd>/<session-id>.jsonl. File state is separate (file checkpointing). - Three controls, with literal spellings:
continue(continue_conversation=True/continue: true),resume(resume=sessionId/resume: sessionId),fork(fork_session=True/forkSession: true). resume/continue carry; fork branches (new ID from a copy; original untouched). - Fork branches the conversation, not the disk — forked file edits are real and shared. Pair with file checkpointing to branch files.
resumerecovers anerror_max_turnssession with a bumped budget; the #1 bug is a mismatchedcwd→ wrong encoded path → a fresh, empty session.- For cross-host work, capture artifacts as application state and seed a fresh session’s prompt — more robust than shipping transcripts.
Further reading
The design rationale for persisting durable context across many sessions — scratchpads, memory files, retrieval as a discipline — is developed in the Agentic Systems Design book, Chapter 10, Memory: Persisting Context Across Sessions. Optional depth; this chapter stands on its own.
Effective Tool Interfaces: Descriptions, Boundaries, and Naming
A tool's caller-facing contract — description, input examples, operation boundary, name, and response shape — is what a non-deterministic model reads to select and use it. Why the description is the highest-leverage surface, how input_examples show correct usage, when to consolidate, how to namespace, and the object schemas (input and output) every interface stands on.
Part I built the agent and its orchestration; Part II turns to the tools that agent reaches for. A tool is a contract between a deterministic system and a non-deterministic caller — and the architect’s leverage is not the implementation behind it but the surfaces the model actually reads: the description, the input examples, the operation boundary, the name, and the response. Get those right and a capable model selects the tool correctly; get them wrong and no amount of model quality rescues it.
The description is the highest-leverage surface
Of every field on a tool definition, the description moves performance the most: detailed descriptions are “by far the most important factor in tool performance.” [Official] Define tools · AnthropicT1-official original A description is not documentation for a human reader — it is the surface the model selects from, so it must spell out what the tool does, when it should be used (and when it should not), what each parameter means, and any caveats. [Official] Define tools · AnthropicT1-official original The guidance even sets a floor: aim for “at least 3-4 sentences per tool description, more if the tool is complex.” [Official] Define tools · AnthropicT1-official original
The gap is concrete. A get_stock_price described as “Retrieves the current stock price for a given ticker symbol… returns the latest trade price in USD… It will not provide any other information” tells the model exactly when to reach for it and what it gets back; the same tool described as “Gets the stock price for a ticker” leaves it guessing about inputs, outputs, and boundaries.
[Official]
Define tools · AnthropicT1-official original
Show correct usage with input_examples
The description tells the model how to use a tool; input_examples show it. This optional field carries an array of example argument objects that demonstrate correct calls — the documented “Tool Use Examples” feature.
[Official]
Define tools · AnthropicT1-official original A weather tool can ship three: a full call, a call with a different unit, and a call that omits the optional field — teaching the model the shape by demonstration rather than prose.
The one hard rule: each example must validate against the tool’s input_schema, or the request returns a 400.
[Official]
Define tools · AnthropicT1-official original Two more facts for the exam: input_examples are for client (user-defined) tools, not server-side tools, and they cost roughly 20–50 tokens for simple examples, 100–200 for complex nested ones — a context cost you pay deliberately where ambiguity is high.
[Official]
Define tools · AnthropicT1-official original
Consolidate operations to reduce selection ambiguity
The next surface is the operation boundary — how much each tool does. The documented default is to consolidate: “Consolidate related operations into fewer tools. Rather than creating a separate tool for every action (create_pr, review_pr, merge_pr), group them into a single tool with an action parameter. Fewer, more capable tools reduce selection ambiguity.” [Official] Define tools · AnthropicT1-official original Every extra near-equivalent tool is one more line the model can pick wrong.
The deeper principle is to design for the agent’s affordances, not mirror your API’s endpoints: rather than make the model chain list_users + list_events + create_event, give it one schedule_event; rather than get_customer_by_id + list_transactions + list_notes, give it get_customer_context.
[Official]
Writing tools for agents · AnthropicT1-official original A tool that returns exactly the workflow the agent needs beats three tools it must orchestrate.
Namespace tool names by service
A name is the model’s fastest disambiguator, and the documented convention is to namespace by service: “Use meaningful namespacing in tool names… prefix names with the service (e.g., github_list_prs, slack_send_message). This makes tool selection unambiguous as your library grows.”
[Official]
Define tools · AnthropicT1-official original Bare search becomes a liability the moment a second search exists; github_search and jira_search never collide.
Names also carry hard constraints that differ by regime. A Claude API tool name must match ^[a-zA-Z0-9_-]{1,64}$.
[Official]
Define tools · AnthropicT1-official original An MCP tool name should be 1–128 characters of ASCII letters, digits, underscore, hyphen, or dot — no spaces — and unique within its server.
[Official]
Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original Those MCP tools then reach the agent through a fixed pattern, mcp__<server>__<tool>: a list_issues tool on a server keyed github becomes mcp__github__list_issues.
[Official]
Connect to external tools with MCP · AnthropicT1-official original
Return only high-signal information
The response is the half of the contract authors forget. The model reads every token a tool returns, so a tool should “return only high-signal information… semantic, stable identifiers (e.g., slugs or UUIDs) rather than opaque internal references, and include only the fields Claude needs to reason about its next step.” [Official] Define tools · AnthropicT1-official original Bloated responses waste the context window and bury the fields that matter. The shape of the response also shapes the next call: a semantic identifier the model can pass straight into the following tool keeps a multi-step task cheap; an opaque internal handle forces a re-lookup. [Official] Writing tools for agents · AnthropicT1-official original
When the response should be machine-shaped, MCP lets a tool declare an optional outputSchema — and when it does, the server MUST return structuredContent conforming to that schema (mirroring it in a text block for compatibility).
[Official]
Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original That is the output-side analogue of the required input schema; the structured-output machinery that drives it is Domain 4’s subject (D4.3).
The structural floor: an object input schema
Beneath the design judgments sits a requirement no interface can skip. Every tool’s input schema is a JSON Schema object: in the Claude API a tool definition’s three required fields are name, description, and an input_schema object;
[Official]
Define tools · AnthropicT1-official original in MCP the inputSchema is required and must be a valid JSON Schema object, not null.
[Official]
Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original A tool that takes no arguments still declares an empty object schema — the object is the floor every interface stands on.
Practice
Exercise solutions
Consolidate the three into one get_customer_context tool (namespace it — e.g. crm_get_customer_context — if the agent spans services). Its description should state what it returns and when to use it: “Returns a customer’s profile, recent transactions, and notes for a given customer ID; use it whenever you need context about a customer before acting.” The redesign applies consolidation (fewer, more capable tools reduce selection ambiguity) and design-for-affordances (one call returns the context the agent needs instead of three CRUD calls it must chain). The agent stalled because three thin tools forced multi-step chaining the descriptions never made obvious; a single high-signal response also lets any follow-up call reuse the returned identifiers cheaply.
The most likely cause is that one of the examples does not validate against the tool’s input_schema — an invalid input_examples entry returns a 400. Every example must conform to the same input_schema the real calls do (right types, required fields present, enum values legal); a single bad example (a typo’d enum, a missing required field) fails the whole request. The examples have to agree with input_schema — which is also why they double as a check on the schema itself.
A good description must add, at minimum: (1) what the tool does concretely (not “gets data” but which data, in what form); (2) when to use it and when not to — the boundary that prevents misrouting; (3) what each parameter means (and what the response returns). Aim for 3–4 sentences. The audience is the model, which selects tools by description alone and never reads the implementation — so an opaque description is a performance bug the model cannot route around, making the description the single highest-leverage fix (“by far the most important factor in tool performance”). Adding input_examples compounds the gain by showing correct argument shape.
Exam essentials
- The description is the highest-leverage surface — “by far the most important factor in tool performance.” Say what the tool does, when (and when not) to use it, and what each parameter means; 3–4 sentences minimum.
input_examplesshow correct usage — an array of example argument objects; each must validate againstinput_schema(invalid → 400). Client tools only, not server tools; ~20–50 / ~100–200 tokens.- Consolidate to reduce selection ambiguity — fewer, more capable tools (an
actionparameter overcreate_pr/review_pr/merge_pr); design for the agent’s affordances, not your API’s endpoints. - Namespace names by service —
github_list_prs, not a baresearch. API names match^[a-zA-Z0-9_-]{1,64}$; MCP names are 1–128 ASCII chars and surface asmcp__server__tool. - Return only high-signal information — semantic, stable identifiers and only the fields the model needs. MCP’s optional
outputSchemagoverns the machine-shaped output (server must then return conformingstructuredContent). - The input schema must be an object — the structural floor of every tool;
strict: true(D2.2/D2.3) then makes inputs conform to it.
Structured Error Responses: isError, Retryability, and the Protocol-Error Split
A tool's failure contract — which channel a failure travels down, what its text says, and whether the schema could have prevented it. The two regimes (Messages-API is_error vs MCP isError + JSON-RPC), why is_error turns a failure into a recoverable signal, the normative execution-vs-protocol error split, and the difference between steering a retry and preventing the error.
D2.1 designed a tool’s happy path; this chapter designs its failure path. When a call goes wrong, the architect decides three things: which channel the failure travels down, what the failure text says, and whether the schema could have prevented it at all. But “which channel” depends on which regime you are in — and conflating the Claude Messages API with MCP is the most common mistake here, so we separate them first.
Two regimes, two spellings
“Structured error” means two related-but-distinct things depending on the surface you are on, and the exam (and real code) punish conflating them.
The casing is the tell: is_error is the Claude Messages API; isError is MCP. The two-channel (isError vs JSON-RPC) split below is an MCP model — the direct Messages API has only the single is_error signal for tool failures.
is_error: true is the canonical failure signal (Messages API)
On the Claude Messages API, a failed tool still returns a tool_result — but flagged. is_error: true is the canonical signal that a tool call failed: Claude folds the error into its next-turn reasoning and may retry.
[Official]
Handle tool calls · AnthropicT1-official original The flag is what turns a failure into a message to the model rather than a dead end — a result whose content reads ConnectionError: the weather service API is not available (HTTP 500) with is_error: true lets the next turn reason about what to do.
[Official]
Handle tool calls · AnthropicT1-official original The design principle is two lines: set is_error: true on the tool_result block, and make the content text actionable.
[Official]
Writing tools for agents · AnthropicT1-official original
Write instructive error messages
The flag says that it failed; the content must say what to do next. The documented principle is explicit: “Write instructive error messages. Instead of generic errors like ‘failed’, include what went wrong and what Claude should try next, e.g., ‘Rate limit exceeded. Retry after 60 seconds.’ This gives Claude the context it needs to recover or adapt without guessing.” [Official] Handle tool calls · AnthropicT1-official original
Execution vs protocol errors: the MCP channel split
Within MCP, a failure travels down one of two channels, and the choice is normative, not stylistic. The specification draws the line: isError: true inside a successful result is for execution errors the model should self-correct on — input validation, API failures, business-logic errors — while a JSON-RPC error response is for protocol errors the model cannot fix, such as an unknown tool or a malformed request.
[Official]
Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original
The most-tested trap lives on this line: a validation failure belongs in the isError channel, not in a JSON-RPC -32602. The 2025-11-25 spec (per SEP-1303) is explicit that input-validation errors return as isError: true content — for example, Invalid departure date: must be in the future. Current date is 08/08/2025. — so the model can correct and retry.
[Official]
Tools — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original
Retryability is documented behavior, not a parameter
The model already retries failed calls on its own: “If a tool request is invalid or missing parameters, Claude will retry 2-3 times with corrections before apologizing to the user.” [Official] Handle tool calls · AnthropicT1-official original That loop is why the channel choice and the content quality matter so much — a failure returned as legible error content feeds each retry something to correct against, whereas a protocol error the model cannot read gives it nothing to adjust and burns the budget toward an apology.
Prevent the error: strict: true
The cheapest error to handle is the one that never happens. For the largest class of failures — malformed inputs — there is a prevention switch: “To eliminate invalid tool calls entirely, use strict tool use with strict: true on your tool definitions. This guarantees that tool inputs will always match your schema exactly, preventing missing parameters and type mismatches.”
[Official]
Handle tool calls · AnthropicT1-official original Define tools · AnthropicT1-official original With strict: true the schema-violation error class disappears before it can reach your handler.
Practice
Exercise solutions
(a) Return isError: true content, not a JSON-RPC error. A past departure date is an input-validation / business-logic failure the model can self-correct, and the MCP spec (SEP-1303) routes validation errors to the isError channel, reserving JSON-RPC errors for protocol problems the model cannot fix. (b) Make it actionable, e.g. “Invalid departure date: must be in the future. Current date is 2026-06-02.” (c) No. strict: true guarantees the input matches the schema (a correctly-typed date string), but “must be in the future” is a semantic constraint a JSON Schema type cannot express, so this failure must be caught at runtime and returned as legible error content. (d) Over the Claude Messages API the flag is is_error (snake_case) on the tool_result block — same meaning, different spelling — and there is no JSON-RPC channel at all (protocol problems would be HTTP 400s).
Two things are wrong. (1) Wrong regime: the Claude Messages API has no JSON-RPC error channel — JSON-RPC -32602 is an MCP protocol error. On the direct API a tool failure is a tool_result with is_error: true; protocol-level problems are HTTP errors (e.g. 400), not JSON-RPC. (2) Wrong channel even in MCP: an internal API timeout is an execution error the model could retry against, so even under MCP it belongs in isError: true content, not a protocol -32602. The candidate conflated the two regimes and mis-routed a recoverable error.
Claude retries the call 2–3 times with corrections before apologizing to the user — that is documented default behavior, not something you configure. Making the error content actionable matters because each retry reads that content to decide its correction; there is no retry-count knob, so the legibility of the error text is the only lever you have over whether those 2–3 attempts converge on a fix or burn down to an apology.
Exam essentials
- Two regimes, two spellings: Claude Messages API uses
is_error(snake_case) on atool_result, with one tool-failure signal (protocol problems are HTTP errors). MCP usesisError(camelCase) on aCallToolResult, plus a separate JSON-RPC channel for protocol errors. Don’t conflate them. is_error/isErroris the canonical failure signal — it turns a failed result into a message the model reasons over and may retry against. Flag it and make the content actionable (“Rate limit exceeded. Retry after 60 seconds.” beats"failed").- MCP’s two channels, two audiences —
isError: truecontent = execution errors the model self-corrects (validation, API, business logic); a JSON-RPC error = protocol errors it cannot fix (unknown tool, malformed request). Validation errors go in theisErrorchannel (SEP-1303), never JSON-RPC-32602. - Retry is default behavior, not a parameter — Claude retries 2–3 times with corrections; there is no retry-count knob, so error-content quality is your only lever.
- Prevent with
strict: true— eliminates schema-violation errors entirely; reserve error content for runtime, business-logic, and semantic failures you cannot prevent.
Tool Distribution and tool_choice: auto, any, Forced, and none
Controlling whether and which tool the model may call — the four tool_choice modes, the extended-thinking constraint, the any+strict guarantee of a schema-valid call, and the prompt-cache invalidation cost — plus allowedTools scoping versus bypassPermissions, with parallel execution as its own orthogonal axis.
Defining a good tool (D2.1) and a good failure (D2.2) is only half the architect’s job; the other half is controlling whether and which tool the model may reach for. Two knobs do this at two different scopes: tool_choice steers a single request — force a tool, free the model, or forbid all tools — while the SDK’s allowedTools defines which tools even exist for an agent. The exam tests the four tool_choice modes (especially the one constraint that trips people up), the any+strict guarantee, and the difference between steering a call and distributing a surface.
The four tool_choice modes
tool_choice is the per-request control over tool calling, and it has four documented modes: auto (Claude decides — the default when tools are provided), any (Claude must use some tool but picks which), {"type": "tool", "name": …} (forces one specific tool), and none (no tools this turn — the default when none are provided).
[Official]
Define tools · AnthropicT1-official original Tool use with Claude · AnthropicT1-official original
Forced modes are incompatible with extended thinking
The constraint the exam loves: only auto and none are compatible with extended thinking; any and forced tool return an error, and adaptive thinking carries the same limitation.
[Official]
Tool use with Claude · AnthropicT1-official original Define tools · AnthropicT1-official original If you need the model to reason before acting, you cannot also force it to call a tool — the two are mutually exclusive.
Forced modes prefill the assistant message
Forcing has a second, subtler effect. “When you have tool_choice as any or tool, the API prefills the assistant message to force a tool to be used. This means that the models will not emit a natural language response or explanation before tool_use content blocks, even if explicitly asked to do so.”
[Official]
Define tools · AnthropicT1-official original A forced call therefore cannot also produce a spoken preamble — there is no room before the tool_use block for one.
Guarantee a schema-valid call: any + strict
any guarantees that a tool fires, but not that its inputs are valid — and forcing one specific tool isn’t always what you want. Compose two switches to get both guarantees at once: “Combine tool_choice: {'type': 'any'} with strict tool use to guarantee both that one of your tools will be called AND that the tool inputs strictly follow your schema. Set strict: true on your tool definitions to enable schema validation.”
[Official]
Define tools · AnthropicT1-official original any covers that a tool is called; strict: true (D2.2) covers that its arguments match the schema. Together they make “some tool, well-formed” a hard guarantee — the right shape for a classifier or extractor that must always emit structured output through a tool.
Distribution: scope the surface with allowedTools
tool_choice steers one request; distribution decides which tools an agent has at all, and in the SDK that knob is allowedTools / disallowedTools. The two behave differently: allowed_tools=["Read", "Grep"] pre-approves the listed tools (others still exist and fall through to the permission mode), while disallowed_tools=["Bash"] removes the tool from the request entirely, so the model never sees it.
[Official]
Configure permissions · AnthropicT1-official original
For MCP access the documented guidance is to scope with allowedTools rather than open the gates with a permission mode: a mcp__github__* wildcard “grants exactly the MCP server you want and nothing more,” whereas permissionMode: "bypassPermissions" auto-approves MCP tools but disables every other safety prompt — broader than necessary.
[Official]
Connect to external tools with MCP · AnthropicT1-official original
Parallel execution is a separate request-level control
A third knob is easy to confuse with tool_choice but is orthogonal to it: disable_parallel_tool_use. Claude 4 models may emit several tool_use blocks in one turn by default; setting disable_parallel_tool_use=true caps that — with tool_choice: auto Claude then uses at most one tool, and with any or forced tool it uses exactly one.
[Official]
Parallel tool use · AnthropicT1-official original
Practice
Exercise solutions
Forced tool mode is incompatible with extended thinking, so the request errors — you cannot both force a specific tool and let the model reason with extended (or adaptive) thinking. To get a schema-valid guaranteed tool call, combine tool_choice: {"type": "any"} (guarantees some tool fires — here only record_decision exists) with strict: true on the tool (guarantees the inputs match the schema). But any is also incompatible with thinking, so to keep the hard guarantee you must give up extended thinking on this turn (or move the reasoning to a prior tool-free turn and force/any the decision on the next). No single configuration gives you a forced, schema-valid call and visible reasoning at once; forcing also prefills the assistant turn and suppresses the preamble.
Use allowedTools: ["mcp__linear__*"] — the wildcard pre-approves exactly the linear server’s tools and nothing else. It is preferable to permissionMode: "bypassPermissions" because bypass auto-approves the MCP tools and disables every other safety prompt across the whole agent (far broader than you need), whereas the scoped wildcard grants exactly the one server and leaves all other gates intact.
Every turn that changes tool_choice invalidates the cached message blocks — tool definitions and the system prompt stay cached, but the message content has to be reprocessed — so alternating auto/forced means roughly every other turn pays full message-processing cost, which is why caching “barely helps.” The fix: keep tool_choice stable across the cached turns (don’t toggle it per turn); if some steps genuinely need a forced tool, group them so the value changes as rarely as possible rather than every turn.
Exam essentials
- Four
tool_choicemodes —auto(default; Claude decides),any(must use some tool), forced{"type":"tool","name":…}(this tool),none(no tools). A spectrum from free to coerced. - Forced modes break extended thinking — only
autoandnonework with extended (or adaptive) thinking;anyand forcedtoolerror. The single most-testedtool_choiceconstraint. any+strict: true= a schema-valid guaranteed call —anyguarantees a tool fires,strict: trueguarantees its inputs match the schema; compose them for classifiers/extractors. (strictis a per-tool property.)- Forced modes prefill the assistant turn — no natural-language preamble before the
tool_useblock; for preamble plus a specific tool, useautoand ask in the user message. tool_choicechanges invalidate the prompt cache — message blocks must be reprocessed (tool defs + system prompt stay cached); keeptool_choicestable across cached turns.- Distribution ≠ steering —
allowedToolsdefines which tools exist (disallowed_toolsremoves one entirely);tool_choicesteers a request. For MCP, amcp__server__*wildcard beatsbypassPermissions(narrower). Parallelism is its own axis (disable_parallel_tool_use).
MCP Server Configuration: .mcp.json, Scopes, and Env-Var Expansion
Wiring an MCP server so it resolves predictably across personal, team, and machine contexts. The two config paths and strictMcpConfig, claude mcp add --scope, the three scopes and their precedence, the local-scope-versus-local-settings trap, env-var expansion for secrets, verifying the connection via system:init, and the transports atop a mid-revision wire protocol.
D2.1 through D2.3 designed tools, their failures, and their distribution; this chapter is where an external MCP server actually gets connected. Almost every trap here is about location — which file holds the config, which scope it lives in, which directory it resolves against — plus one notorious naming collision and one silent-failure mode: a server that never connected. Get the location right and check the connection, and a server resolves predictably across personal, team, and machine contexts; get it wrong and the agent silently never sees the tools.
Two ways to configure a server
An MCP server reaches an agent through one of two configuration paths. You can register it programmatically — mcp_servers in Python, mcpServers in TypeScript — or declare it in a .mcp.json file at the project root.
[Official]
Connect to external tools with MCP · AnthropicT1-official original The file-based path is not automatic, though: .mcp.json loads only when the SDK’s settingSources includes "project".
[Official]
Use Claude Code features in the SDK · AnthropicT1-official original Connect to external tools with MCP · AnthropicT1-official original
For a reproducible, clean-room config there is a third lever: strictMcpConfig: true uses only the servers you pass in mcpServers, ignoring .mcp.json, user settings, and plugins.
[Official]
Connect to external tools with MCP · AnthropicT1-official original It is how you guarantee an SDK run sees exactly the servers you declared and nothing the machine happens to carry.
Three scopes — and claude mcp add --scope
Claude Code stores MCP servers in three scopes, each a different file with a different audience: Local (~/.claude.json, under a per-project key — current project only, not shared), Project (.mcp.json at the repo root — shared via version control, with a one-time approval prompt on first use), and User (~/.claude.json — available across all your projects).
[Official]
Connect Claude Code to tools via MCP · AnthropicT1-official original
The CLI is how you actually install one and pick its scope: claude mcp add <name> --scope <local|project|user> --transport <http|stdio|sse> … — for example, claude mcp add --transport http --scope project notion https://mcp.notion.com/mcp registers a project-scoped HTTP server.
[Official]
Connect Claude Code to tools via MCP · AnthropicT1-official original --scope is the flag that decides who sees the server; omit it and you get Local (the default).
When the same server name appears in more than one scope, Claude Code connects once, using the highest-precedence source: Local → Project → User → plugin-provided → claude.ai connectors (the first three match duplicates by name). [Official] Connect Claude Code to tools via MCP · AnthropicT1-official original
"Local scope” is not “local settings”
The single most confusing collision in MCP configuration is the word local. “MCP local-scoped servers are stored in ~/.claude.json (your home directory), while general local settings use .claude/settings.local.json (in the project directory).”
[Official]
Connect Claude Code to tools via MCP · AnthropicT1-official original They are different files in different directories — one in your home, one in the project — and they hold different things.
Env-var expansion keeps secrets out of the file
Because a Project-scoped .mcp.json is committed to version control, secrets must never be written into it literally — they are referenced through env-var expansion instead. .mcp.json supports ${VAR} (expands, or fails the parse if unset) and ${VAR:-default} (expands, or uses the default), and the expansion works inside command, args, env, url, and headers.
[Official]
Connect Claude Code to tools via MCP · AnthropicT1-official original So a committed config carries "Authorization": "Bearer ${API_KEY}", and the key itself lives only in the environment.
Verify the server connected
A wired server that never connected is the silent failure of this chapter — the agent simply runs without the tools and you find out from a confusing answer. Don’t assume; check. Detect connection failures via the system:init message: each server’s status is one of connected | failed | needs-auth | pending | disabled — read it before letting the agent run.
[Official]
Connect to external tools with MCP · AnthropicT1-official original The default connection timeout is 60 seconds for server initialization, so a slow-starting server may need pre-warming or a lighter-weight package.
[Official]
Connect to external tools with MCP · AnthropicT1-official original
Transports and the snapshot-dated wire protocol
A server’s type selects its transport: stdio for local processes, sse (Server-Sent Events, now deprecated — use HTTP), and http (Streamable HTTP; JSON configs accept streamable-http as an alias for http). A fourth type, ws (WebSocket), is configurable only through .mcp.json or claude mcp add-json, not the --transport flag (whose values are http/stdio/sse).
[Official]
Connect Claude Code to tools via MCP · AnthropicT1-official original Separately — and this is not a .mcp.json type — the SDK lets you run an MCP server in-process inside your application (an SDK deployment mode, e.g. a built-in tool server), rather than as an external process or endpoint.
[Official]
Connect to external tools with MCP · AnthropicT1-official original
Beneath the config sits the MCP wire protocol — and it is mid-revision, so cite it with a date. Under the 2025-11-25 specification, an initialize handshake “MUST be the first interaction between client and server,” negotiating protocol version and capabilities before any tool call.
[Official]
Lifecycle — Model Context Protocol Specification 2025-11-25 · AnthropicT1-official original The 2026-07-28 release candidate (locked May 2026; the final spec ships 2026-07-28) removes that handshake for a stateless model, so the wire details here are a dated snapshot, not a permanent contract.
[Official]
The 2026-07-28 MCP Specification Release Candidate · Model Context ProtocolT2-release-notes original
Practice
Exercise solutions
(a) Project scope — a .mcp.json at the repo root (committed so every clone gets the server), installed with claude mcp add … --scope project. (b) Reference the secret through env-var expansion rather than inlining it, e.g. "env": { "DATABASE_URL": "${DATABASE_URL}" } (or "${DATABASE_URL:-…}" if a safe default exists) — expansion works in env, url, and headers, so no literal credential is committed. (c) The teammate’s definition wins on their machine. Their ~/.claude.json entry is Local scope, and precedence runs Local → Project → User, matched by name — so the local definition overrides the shared project one for them. That is the intended override path for personal credentials, not a conflict.
The mistake is conflating “MCP local scope” with “general local settings.” A local-scoped MCP server is stored in ~/.claude.json (the home directory, under a per-project key), not in .claude/settings.local.json (the project’s machine-local settings). Editing the latter to change an MCP server is a silent no-op — point them at ~/.claude.json.
Inspect the system:init message and its mcp_servers field — each server reports a status. A status other than connected explains the missing tools: failed (missing env var, uninstalled package, bad connection string, unreachable host), needs-auth (OAuth not completed), pending, or disabled. The 60-second default initialization timeout is a common cause of failed/pending for slow-starting servers — pre-warm or use a lighter package. Always read status before letting the agent run rather than discovering the gap from a wrong answer.
Exam essentials
- Two config paths — programmatic (
mcp_servers/mcpServers) or a.mcp.jsonat the project root (loads only whensettingSourcesincludes"project").strictMcpConfig: trueuses onlymcpServers, ignoring.mcp.json/user/plugins. claude mcp add … --scope <local|project|user> --transport <http|stdio|sse>installs a server and picks its scope;--scopedefaults to Local.- Three scopes, three audiences — Local (
~/.claude.json, per-project, private), Project (.mcp.json, committed/shared, approval-prompted), User (~/.claude.json, all projects). Precedence: Local → Project → User → plugin → claude.ai, matched by name. - “Local scope” ≠ “local settings” — a local-scoped MCP server lives in
~/.claude.json(home), not.claude/settings.local.json(project). - Env-var expansion —
${VAR}/${VAR:-default}incommand/args/env/url/headers;${CLAUDE_PROJECT_DIR}needs the:-form in hand-written configs. - Verify the connection — read the
system:initstatus(connected/failed/needs-auth/pending/disabled) before running; the default init timeout is 60s. - Config transports —
stdio/sse(deprecated) /http(streamable-httpalias), plusws(WebSocket,.mcp.json-only); the SDK in-process server is a separate deployment mode, not atype. The 2025-11-25initializehandshake is a dated snapshot the 2026-07-28 RC removes.
Built-in Tools: The Roster, Execution Order, and Permission Gating
The fixed roster of built-in tools every agent ships with — Read, Write, Edit, Bash, Grep, Glob — their exact, case-sensitive names, the read-only-versus-state-modifying line that decides which run in parallel, the six permission modes, the five-step evaluation order, and the allow/deny rules that gate them. Closes on the allowlist-is-not-a-sandbox trap.
D2.1 through D2.4 designed tools, their failure contracts, their distribution, and the wiring of external MCP servers. This chapter steps back to the tools an agent already has on the first turn — the fixed built-in roster — and the permission machinery that decides whether any given one actually fires. The exam angle is recognition: the exact roster, the read-versus-write execution split, the six modes, the evaluation order, and one high-value trap where a developer thinks they have locked an agent down and have not.
The built-in tool roster
Every agent starts with a fixed roster of built-in tools — the SDK ships roughly fourteen of them, in six categories, identical to those that power Claude Code.
[Official]
Agent SDK overview · AnthropicT1-official original How the agent loop works · AnthropicT1-official original The six that do the everyday work of reading and changing a codebase are Read, Write, Edit (file operations), Grep, Glob (search), and Bash (execution). The parity with Claude Code is explicit: “The SDK includes the same tools that power Claude Code,”
[Official]
How the agent loop works · AnthropicT1-official original and “Everything that makes Claude Code powerful is available in the SDK.”
[Official]
Agent SDK overview · AnthropicT1-official original Beyond this built-in set sit MCP server tools (Chapter D2.4) and your own custom tools; this chapter is about the built-ins every agent has from the start.
These names are exact — they appear verbatim in allowed_tools / allowedTools rules and as the tool_use.name block in messages, so Read is the tool and read is not.
[Official]
How the agent loop works · AnthropicT1-official original
Read-only and state-modifying tools run differently
The roster splits along a line that the runtime cares about: whether a tool reads state or changes it. Read-only tools — Read, Glob, Grep, and MCP tools marked read-only — can run concurrently; tools that modify state — Edit, Write, and Bash — run sequentially to avoid conflicts. Custom tools default to sequential execution and opt into parallelism by setting readOnlyHint in their annotations.
[Official]
How the agent loop works · AnthropicT1-official original
Gating the tools: six permission modes
Having a tool in the roster does not mean it fires. When the model requests a tool, the active permission mode is consulted, and there are six: default, acceptEdits, plan, dontAsk, bypassPermissions, and auto (TypeScript-only).
[Official]
Configure permissions · AnthropicT1-official original Two are worth memorizing for the exam because they change the tool surface directly: plan restricts the agent to read-only tools, so it explores and proposes a plan without editing source files; acceptEdits auto-approves file edits and filesystem commands (mkdir, touch, rm, rmdir, mv, cp, sed) — but only inside cwd plus additionalDirectories, and paths outside that scope, or protected paths, still prompt.
[Official]
Configure permissions · AnthropicT1-official original
Allow and deny rules — and the five-step order
Within a mode, allow and deny rules pre-approve or block specific tools and calls. A bare name and a scoped pattern behave differently: allowed_tools=["Read", "Grep"] auto-approves those tools; disallowed_tools=["Bash"] removes Bash from the request entirely, so the model never sees it; and disallowed_tools=["Bash(rm *)"] keeps Bash available but denies any rm * call — in every mode, including bypassPermissions.
[Official]
Configure permissions · AnthropicT1-official original All of this resolves through a fixed sequence: “When Claude requests a tool, the SDK checks permissions in this order: 1. Hooks. 2. Deny rules. 3. Permission mode. 4. Allow rules. 5. canUseTool callback.”
[Official]
Configure permissions · AnthropicT1-official original
The high-value trap lives in the gap between pre-approving and restricting. allowed_tools only pre-approves the tools you list; it does not filter everything else out. Set allowed_tools=["Read"] alongside permission_mode="bypassPermissions" and the agent “still approves every tool, including Bash, Write, and Edit.”
[Official]
Configure permissions · AnthropicT1-official original The allowlist was never a sandbox.
The day-to-day use of these tools — the muscle memory of Read, Edit, and Bash inside a working session — is the handbook’s territory (the Use book’s chapter on Claude Code’s toolset); this chapter is the architect’s exam angle on the roster and the permission surface that gates it.
Practice
Exercise solutions
C. allowed_tools pre-approves the tools you list; it never restricts the ones you omit. Paired with bypassPermissions, the configuration “still approves every tool, including Bash, Write, and Edit” — the allowlist is silently irrelevant (it sits at step 4 of the evaluation order, below the mode at step 3). A is the core misconception (treating the allowlist as a filter). B confuses this with acceptEdits, which is the mode that auto-approves filesystem commands — bypassPermissions approves everything, not just filesystem ops. D invents a conflict; the two settings combine without error, which is exactly why the trap is dangerous. The fix: drop bypassPermissions and use permission_mode="plan" (read-only tools only), or keep a stricter mode and add a deny rule such as disallowed_tools=["Write", "Edit", "Bash"] — deny rules block even under bypassPermissions. Reach for the allowlist to permit, and for the mode or a deny rule to forbid.
The three Read calls may run concurrently; the Edit must run on its own (sequentially). The deciding property is whether a tool is read-only or state-modifying: read-only tools (Read, Glob, Grep) can run in parallel because they cannot conflict, while state-modifying tools (Edit, Write, Bash) run sequentially to avoid clobbering each other. So the runtime can fan out the three reads at once, then run the single edit after.
(a) plan mode — it restricts the agent to read-only tools, so it can explore and propose changes but cannot edit any source file, with no allow/deny list to maintain. (b) acceptEdits — it auto-approves file edits and filesystem commands (mkdir/rmdir/mv/…) inside cwd plus additionalDirectories, while paths outside that scope (and protected paths) still prompt. plan forbids edits entirely; acceptEdits permits them but only within the working scope.
Exam essentials
- The roster is fixed and the names are exact —
Read,Write,Edit,Grep,Glob,Bashare the six core built-ins (of ~14), identical to Claude Code’s; they appear verbatim in allow/deny rules and astool_use.name, and a mis-cased name matches nothing. - Read vs write decides parallelism — read-only tools (
Read/Glob/Grep) run concurrently; state-modifying tools (Edit/Write/Bash) run sequentially; custom tools default to sequential and opt in viareadOnlyHint. Orthogonal to D2.3’sdisable_parallel_tool_use. - Six permission modes —
default,acceptEdits,plan,dontAsk,bypassPermissions,auto(TS).plan= read-only;acceptEdits= auto-approve edits + filesystem ops (mkdir/touch/rm/rmdir/mv/cp/sed) insidecwd/additionalDirectories, prompt outside. - Five-step evaluation order — Hooks → Deny rules → Permission mode → Allow rules →
canUseTool. Deny rules and hooks fire before the mode, so they bind even underbypassPermissions; allow rules fire after it, so the mode can override them. - The allowlist trap —
allowed_toolspre-approves, it does not restrict; withbypassPermissionsit approves everything regardless. Confine withplanmode or a deny rule, never with the allowlist alone.
CLAUDE.md Hierarchy & @import: Four Scopes That Concatenate
How Claude Code assembles persistent instructions from four CLAUDE.md scopes that concatenate without precedence — the opposite of the strict five-level settings hierarchy (Managed > CLI > Local > Project > User) — plus the @import mechanism (depth-5, first-use approval), the AGENTS.md bridge, and the managed claudeMd / claudeMdExcludes controls.
D2.4 resolved MCP servers and settings across a strict precedence where the highest scope wins. The instruction layer looks like the same machinery — files at managed, user, project, and local scopes — but it behaves in the opposite way: the files do not compete, they concatenate. This chapter is the exam angle on that distinction and on the @import mechanism that stitches files together. The design rationale for treating the file as a context budget lives in the Further reading.
Four scopes that load broadest-first
Claude Code assembles its persistent instructions from up to four CLAUDE.md scopes, loaded broadest to most specific: Managed policy, then User (~/.claude/CLAUDE.md), then Project (./CLAUDE.md or ./.claude/CLAUDE.md), then Local (./CLAUDE.local.md, which you add to .gitignore).
[Official]
How Claude remembers your project · AnthropicT1-official original Discovery walks up the directory tree from your working directory; a CLAUDE.md nested below cwd is not loaded at launch but on demand, when Claude first reads a file in that subdirectory.
[Official]
How Claude remembers your project · AnthropicT1-official original The managed-policy file is the one scope that cannot be excluded by any individual setting — which is exactly what makes it the instrument for org-enforced instructions.
[Official]
How Claude remembers your project · AnthropicT1-official original
Concatenation, not precedence
Here is the property that separates the instruction layer from every settings file: the discovered CLAUDE.md files do not override one another. “All discovered files are concatenated into context rather than overriding each other. Across the directory tree, content is ordered from the filesystem root down to your working directory.”
[Official]
How Claude remembers your project · AnthropicT1-official original Within a single directory, CLAUDE.local.md is appended after CLAUDE.md.
[Official]
How Claude remembers your project · AnthropicT1-official original
Contrast settings, which resolve by a strict five-level precedence where the highest scope wins. Named in full, highest to lowest: [Official] Claude Code Settings · AnthropicT1-official original
So CLAUDE.md and settings sit at opposite ends: one accumulates, the other overrides — and the override ladder is five rungs, not three, with CLI and Local the two most often forgotten.
@import: stitching files together
A CLAUDE.md can pull in other files with @path/to/import. The imported files expand and load at launch alongside the referencing file; relative paths resolve relative to the file containing the import; and the import chain has a maximum recursion depth of 5.
[Official]
How Claude remembers your project · AnthropicT1-official original The first time a session encounters an import, Claude shows an approval dialog — and declining it disables imports permanently (the dialog does not reappear).
[Official]
How Claude remembers your project · AnthropicT1-official original
AGENTS.md, managed policy, and the budget you don’t develop here
Three controls round out the layer. First, the cross-tool bridge: Claude Code reads CLAUDE.md, not AGENTS.md — to share one instruction set with other agents, create a CLAUDE.md that imports @AGENTS.md.
[Official]
How Claude remembers your project · AnthropicT1-official original Second, managed settings can deploy instructions with no file at all: the claudeMd key carries inline CLAUDE.md content, honored only in managed/policy settings, and it loads before the user and project files.
[Official]
How Claude remembers your project · AnthropicT1-official original Claude Code Settings · AnthropicT1-official original Third, claudeMdExcludes — glob patterns matched against absolute paths, merged across layers — skips ancestor CLAUDE.md files, with the single exception that the managed-policy file can never be excluded.
[Official]
How Claude remembers your project · AnthropicT1-official original
Practice
Exercise solutions
C. CLAUDE.md files do not compete: “All discovered files are concatenated into context rather than overriding each other,” ordered “from the filesystem root down to your working directory.” So both “use tabs” and “use 2-space indent” are in context simultaneously — which is itself a smell, because contradictory ancestor instructions are not resolved by proximity. A and B both assume a precedence the instruction layer does not have. D is the high-value trap: it imports the settings model (where settings.local.json overrides user settings) onto memory, where files merge instead. To actually suppress the root file you would use claudeMdExcludes, not a closer CLAUDE.md.
haiku runs. The five-level settings precedence, highest to lowest, is Managed → CLI arguments → Local → Project → User; the --model haiku CLI argument (level 2) beats both the project opus (level 4) and the user sonnet (level 5). The same two-scope setup behaves differently for CLAUDE.md because the instruction layer concatenates instead of overriding — two CLAUDE.md files setting contradictory guidance would both load into context at once, with no “winner,” whereas settings resolve to exactly one value down the ladder.
(a) The maximum @import recursion depth is 5. (b) On first encountering an import, Claude Code shows an approval dialog; declining it disables imports permanently — the dialog does not reappear, so a future import in that environment silently will not expand until the choice is reset. Design import chains shallow (≤5) and approve deliberately.
Exam essentials
- Four scopes, broadest-first — Managed → User → Project → Local; discovery walks up from cwd, nested files load on demand, and the managed file cannot be excluded.
- Concatenate, don’t override — there is no precedence between CLAUDE.md files (root → cwd order;
CLAUDE.local.mdappended afterCLAUDE.md). This is the opposite of the strict five-level settings precedence: Managed → CLI → Local → Project → User (CLI and Local are the forgotten two). Conflating the two is the single most common instruction-layer error. @import—@pathexpands at launch, resolves relative to the importing file, caps at recursion depth 5, and prompts an approval dialog on first use (declining disables imports permanently).AGENTS.md— Claude Code readsCLAUDE.mdonly; it picks upAGENTS.mdsolely through a@AGENTS.mdimport.- Managed controls —
claudeMddeploys inline policy content (loads before user/project);claudeMdExcludesglobs away ancestor files by absolute path — but never the managed-policy CLAUDE.md.
Further reading
The discipline behind what belongs in these files — treating the always-loaded CLAUDE.md as a permanent slice of the context budget rather than documentation, and the controlled study measuring the cost of overstuffing it — is developed in the Agentic Systems Design book, Chapter 4, The Instruction Layer: CLAUDE.md & AGENTS.md. Optional depth; this chapter stands on its own.
Slash Commands & Skills: Stored Prompts, Lazy-Loaded Capabilities
Two ways to extend the workflow — a slash command (a stored prompt recognized at message start) and a skill (a lazy-loaded, auto-invocable, directory-bundled capability). The merged model, the full lazy-load lifecycle (description budget, $ARGUMENTS substitution, compaction carry-forward, live change detection), the SKILL.md frontmatter, the four scopes, and what disable-model-invocation does to the description.
D3.1’s CLAUDE.md is always-on context. This chapter is the other half of the instruction layer: capabilities the agent loads only when it needs them. A slash command is a stored prompt you trigger by typing /name; a skill is a richer, directory-bundled capability that Claude can also reach for on its own. The exam-relevant facts are that the two have converged, and that a skill’s lazy-loading — and its budget — are what keep it cheap.
Commands and skills have merged
A slash command controls Claude Code from inside a session, and “a command is only recognized at the start of your message. Text that follows the command name is passed to it as arguments.” [Official] Commands · AnthropicT1-official original Alongside the many built-in commands, some entries are bundled skills: “they use the same mechanism as skills you write yourself: a prompt handed to Claude, which Claude can also invoke automatically when relevant. Everything else is a built-in command whose behavior is coded into the CLI.” [Official] Commands · AnthropicT1-official original
That same mechanism is why the two authoring formats have converged: “Custom commands have been merged into skills. A file at .claude/commands/deploy.md and a skill at .claude/skills/deploy/SKILL.md both create /deploy and work the same way. Your existing .claude/commands/ files keep working.”
[Official]
Extend Claude with skills · AnthropicT1-official original Old flat-file commands still run; skills are the recommended form for new work because they add directory bundling, frontmatter, and auto-invocation.
Skills are lazy-loaded directories
A skill is “a markdown directory: .claude/skills/<name>/SKILL.md,” with optional supporting files (reference.md, scripts/, and the like) alongside.
[Official]
Extend Claude with skills · AnthropicT1-official original The reason a skill is cheap is its loading model — unlike CLAUDE.md, which loads every session: “skills load on demand. The agent receives skill descriptions at startup and loads the full content when relevant.”
[Official]
Extend Claude with skills · AnthropicT1-official original Each description is roughly 100 tokens; the full body materializes only when the skill is invoked, “enters the conversation as a single message and stays there for the rest of the session,” and is not re-read on later turns.
[Official]
Extend Claude with skills · AnthropicT1-official original
The lazy-load lifecycle in full
The “cheap” story has a budget and a lifecycle the exam can probe at each stage:
One operational nicety: live change detection — adding, editing, or removing a skill under ~/.claude/skills/, the project’s .claude/skills/, or an --add-dir directory takes effect within the current session, no restart — the one exception being a brand-new top-level skills/ directory that did not exist at launch.
[Official]
Extend Claude with skills · AnthropicT1-official original
SKILL.md frontmatter and the four scopes
The SKILL.md frontmatter is where a skill declares its behavior. Among the fields: name (display name in skill listings, defaults to the directory name — the directory name, not this field, sets the /command you type, except for a plugin-root SKILL.md), description (drives auto-invocation; description + when_to_use capped at 1,536 chars by default), argument-hint, disable-model-invocation, user-invocable, allowed-tools (CLI-only), model, effort, context, and paths (glob patterns that limit when the skill activates).
[Official]
Extend Claude with skills · AnthropicT1-official original
Skills resolve across four scopes by precedence — “enterprise > personal > project; plugin skills use a plugin-name:skill-name namespace and never conflict”: Enterprise (managed settings, all users), Personal (~/.claude/skills/, all your projects), Project (.claude/skills/, auto-discovered from cwd up to repo root), and Plugin (namespaced).
[Official]
Extend Claude with skills · AnthropicT1-official original
Who can invoke it
Two frontmatter switches decide who may call a skill. user-invocable: false hides it from the / menu so only Claude can invoke it; disable-model-invocation: true does the inverse — only the user can trigger it via /, Claude cannot auto-invoke, its description is kept out of context, and it also blocks subagent preloading.
[Official]
Extend Claude with skills · AnthropicT1-official original Between them you can build a skill that is purely automatic, purely manual, or both.
Practice
Exercise solutions
B. A skill lazy-loads: only its ~100-token description sits in context at startup, and the full 2,000-token body loads on invocation — and Claude can auto-invoke it when a deploy is relevant. A is the D3.1 budget mistake: a CLAUDE.md is loaded every session, so the whole 2,000 tokens would be spent on every unrelated turn. C is wrong on the “only way” claim — commands have merged into skills, so .claude/commands/deploy.md and .claude/skills/deploy/SKILL.md both create /deploy; the skill form is recommended. D defeats the point of a reusable capability. The skill is the form that is both cheap (lazy) and discoverable (auto-invocable).
disable-model-invocation: true keeps the skill’s description out of startup context entirely — so unlike an ordinary skill (whose ~100-token description Claude sees and can match against), the model is never told this skill exists. That is exactly the point for a risky release action: it forces a human to type /force-release, because Claude cannot auto-invoke something it cannot see. The teammate got the intended safety behavior; “Claude doesn’t know it exists” is the feature, not a bug. (If they wanted Claude to know about it but still gate execution, that is a permission/ask concern, not this flag.)
(a) Only the ~100-token description enters context at session start, counted against the skill-listing budget (default 1% of the model’s context window); the ~3,000-token body does not load yet. (b) The body loads on invocation, enters as a single message, and stays for the rest of the session (it is not re-read each turn). If a compaction fires, the body is carried forward within a budget — up to the first 5,000 tokens of each most-recently-invoked skill, capped at a combined 25,000 tokens post-compaction.
Exam essentials
- Commands merged into skills —
.claude/commands/x.mdand.claude/skills/x/SKILL.mdboth create/x; old commands keep working, skills are recommended (directory bundling, frontmatter, auto-invocation). A command is recognized only at message start; text after it is arguments. - Skills lazy-load on a budget — a ~100-token description at session start (budget defaults to 1% of the context window; on overflow least-invoked descriptions drop first;
/doctorshows it), the full body on invocation; the body persists and compaction carries it forward (≤5,000 tokens each, 25,000 combined). - Arguments —
$ARGUMENTS(all args),$ARGUMENTS[N]/$N(0-indexed) substitute into the body; absent references appendARGUMENTS: <value>. - The description is the retrieval interface — but not unconditional —
disable-model-invocation: truekeeps the description out of context (user-only via/), and budget overflow can drop one.user-invocable: false= Claude-only. - Four scopes — enterprise > personal > project; plugin skills are namespaced (
plugin-name:skill-name) and never conflict. Live change detection: editing a skill takes effect mid-session without restart (except a brand-new top-levelskills/dir).
Further reading
The deeper craft — how to write a description that gets discovered, when to extract a skill out of CLAUDE.md, and the progressive-disclosure design behind all of this — is developed in the Agentic Systems Design book, Chapter 5, Skills & Progressive Disclosure. Optional depth; this chapter stands on its own.
Path-Scoped Rules: Modular, Glob-Triggered Instructions
The .claude/rules/ system — a modular, eager-loaded instruction layer parallel to CLAUDE.md, with optional glob path-scoping so a rule loads only when Claude reads matching files. Covers unconditional vs path-scoped rules, user-level vs project rules and their load order, the glob format, and the directory and symlink mechanics.
D3.1 covered the always-on CLAUDE.md; D3.2 covered the lazy, invocable skill. Rules are the third shape of the instruction layer: modular markdown files that load eagerly like CLAUDE.md, but can be glob-scoped so a rule only enters context when Claude touches the files it governs. The architect’s job here is to know that rules are a separate, equal-priority system, that user and project rules have a load order, and to use path-scoping to keep file-specific guidance out of unrelated work.
A system parallel to CLAUDE.md
.claude/rules/*.md is a modular rules system, loaded into context every session, with recursive discovery across subdirectories.
[Official]
How Claude remembers your project · AnthropicT1-official original The relationship to CLAUDE.md is the fact to fix first: “Rules without paths frontmatter are loaded at launch with the same priority as .claude/CLAUDE.md.”
[Official]
How Claude remembers your project · AnthropicT1-official original A rule is not nested under CLAUDE.md or overridden by it — the two are parallel instruction sources discovered separately and loaded at equal priority.
User-level vs project rules — there is a load order
The “no precedence” story needs one refinement the exam can probe: rules come in scopes, and the scopes load in order. User-level rules live in ~/.claude/rules/ and apply to every project on your machine — use them for preferences that aren’t project-specific (your personal style, your workflows).
[Official]
How Claude remembers your project · AnthropicT1-official original Project rules live in the repo’s .claude/rules/. And the order between them is documented: “user-level rules are loaded before project rules, giving project rules higher priority.”
[Official]
How Claude remembers your project · AnthropicT1-official original
That is the same recency model as the CLAUDE.md hierarchy (D3.1): files concatenate, but the more-specific scope is read last and so effectively dominates when two instructions tension. So “concatenate, not override” is true within a scope; across scopes there is a load order — user → project, project higher — exactly mirroring CLAUDE.md’s broad-to-specific assembly.
Path-scoping with the paths frontmatter
The lever that makes rules more than “another CLAUDE.md” is the optional paths field. Give a rule a paths glob and it stops loading unconditionally: “path-scoped rules trigger when Claude reads files matching the pattern, not on every tool use.”
[Official]
How Claude remembers your project · AnthropicT1-official original So a rule scoped to src/api/**/*.ts costs nothing until Claude actually reads an API file — and then applies while that work is in scope.
---
paths:
- "src/api/**/*.ts"
---
# API Development Rules
- All API endpoints must include input validation
The glob format is the same one skills use for their paths field:
[Official]
How Claude remembers your project · AnthropicT1-official original
Layout and symlinks
Rules can mix unconditional and path-scoped files in one tree, and discovery recurses into subdirectories:
.claude/
└── rules/
├── code-style.md # no paths: loaded unconditionally
├── security.md # no paths: loaded unconditionally
├── frontend/
│ └── react.md # paths: src/frontend/**/*.tsx
└── backend/
└── api.md # paths: src/api/**/*.ts
The unconditional files (code-style.md, security.md) are always in context; the nested path-scoped files wait for a matching file-read.
[Official]
How Claude remembers your project · AnthropicT1-official original Rules can also be shared from a central directory by symlink: “symlinks work in .claude/rules/ — link shared rules from a central dir; circular symlinks are detected gracefully.”
[Official]
How Claude remembers your project · AnthropicT1-official original
Choosing the shape: rule vs CLAUDE.md vs skill
The three instruction shapes now in view divide cleanly by when they load and what they carry:
The practical rule of thumb: reach for a paths-scoped rule when guidance is real but only relevant to part of the tree — it is the one shape that lets you write file-specific instructions without paying for them on every unrelated turn.
Practice
Exercise solutions
C. A path-scoped rule is exactly this case: with paths: ["src/api/**/*.ts"] the standard loads only when Claude reads a matching API file, and stays out of context otherwise. A and B both load the line unconditionally — CLAUDE.md and an un-scoped rule sit in context at the same priority every session, which is the clutter you wanted to avoid. D misuses a skill: skills are invocable capabilities/workflows, not standing rules that should bind automatically while editing a file. The lever that matters is paths, and only a rule (or a skill) offers it — so the rule is the right shape for a standing, file-scoped instruction.
src/**/*.{ts,tsx} — or, to cover the whole repo regardless of directory, **/*.{ts,tsx}. A single paths entry with brace expansion {ts,tsx} matches both extensions; ** matches any directory depth. (You can also list two patterns, but the brace form does it in one.)
Claude favors 2-space indent — the project rule wins. Your preferences.md is a user-level rule (~/.claude/rules/, applies to every project); the repo’s code-style.md is a project rule. The documented order is that user-level rules load before project rules, giving project rules higher priority — so when both are in context and they tension, the project rule (read last) dominates. Your personal preference acts as a default that any project can override for its own repo.
Exam essentials
- Parallel to CLAUDE.md —
.claude/rules/*.mdloads every session with recursive subdir discovery; rules withoutpathsload unconditionally at the same priority as.claude/CLAUDE.md(not a subsystem of it). - Scopes have a load order — user-level (
~/.claude/rules/, all projects) loads before project (.claude/rules/), giving project rules higher priority. Rules concatenate, but the more-specific scope is read last and wins (the CLAUDE.md model). - Path-scoping — a
pathsglob makes a rule trigger only when Claude reads matching files, not at launch and not on every tool use. Glob format same as skillpaths(**/*.ts,src/**/*,{ts,tsx}). - Mechanics — unconditional and path-scoped rules can mix in one tree; symlinks work, with circular symlinks detected gracefully.
- Choosing the shape — CLAUDE.md (always-on), rules (modular, optionally glob-gated, user/project order), skills (lazy capability). Reach for a
paths-scoped rule for guidance that is real but only relevant to part of the tree.
Plan Mode vs Direct Execution: Research Before You Edit
Plan mode restricts Claude to read-only research and a written proposal — no edits — and approving the plan exits the mode into a write mode. Choosing plan versus going direct is a risk-containment decision, not a named-mode toggle; "direct execution" is simply working in a write mode. The opusplan alias pairs the mode with a model-per-phase split — Opus plans, Sonnet executes.
D2.5 enumerated the six permission modes as a tool-gating mechanism. This chapter zooms in on the one an architect chooses on purpose: plan. Plan mode is read-only research with a written proposal at the end — and the real exam question is not “what is plan mode” but “when do you plan first versus edit directly.” That is a risk-containment decision, and there is no named “direct execution” mode — going direct just means working in a mode that lets edits through. The chapter closes on opusplan, the alias that makes the strong-model-plans/fast-model-executes split automatic.
What plan mode restricts
Plan mode is the one mode whose purpose is to change nothing: “Plan mode tells Claude to research and propose changes without making them. Claude reads files, runs shell commands to explore, and writes a plan, but does not edit your source. Permission prompts still apply the same as default mode.” [Official] Choose a permission mode · AnthropicT1-official original It is read-only — best for exploring a codebase before changing it. [Official] Choose a permission mode · AnthropicT1-official original
Its contrast is not a single “direct” mode but the write modes from D2.5 — default (reads auto-approved, edits prompt) and acceptEdits (edits and common filesystem commands auto-approved) — the modes that let changes through.
[Official]
Configure permissions · AnthropicT1-official original “Going direct” is shorthand for working in one of those, not a toggle of its own.
Entering and exiting plan mode
You can enter plan mode four ways: cycle with Shift+Tab (the CLI cycle runs default → acceptEdits → plan), prefix a single prompt with /plan (optionally with a task, /plan fix the auth bug), start the session with claude --permission-mode plan, or set permissions.defaultMode: "plan" in settings.
[Official]
Choose a permission mode · AnthropicT1-official original
Exit is the part that trips people up: “Approving a plan exits plan mode and switches the session to the permission mode each approve option describes, so Claude starts editing. To plan again, cycle back to plan mode with Shift+Tab, or prefix your next prompt with /plan.”
[Official]
Choose a permission mode · AnthropicT1-official original When the plan is ready Claude presents it, and each approve option (auto, accept-edits, review-each-edit) names the write mode the session lands in.
The decision: plan first, or go direct?
Because plan mode adds a research-and-review round before any edit, the choice between it and going direct is a bet on reversal cost and uncertainty:
Model per phase: the opusplan alias
Plan mode separates thinking from doing in time — research first, edits after. That split is exactly where a model-per-phase pays, and Claude Code ships an alias for it. opusplan is one of the eight model aliases, and it “uses Opus in plan mode, switches to Sonnet for execution.”
[Official]
Model configuration · AnthropicT1-official original Spelled out: “The opusplan model alias provides an automated hybrid approach: In plan mode — Uses opus for complex reasoning and architecture decisions. In execution mode — Automatically switches to sonnet for code generation and implementation.”
[Official]
Model configuration · AnthropicT1-official original
Set it like any alias — /model opusplan during a session, or claude --model opusplan at startup.
[Official]
Model configuration · AnthropicT1-official original The reasoning-heavy plan runs on the strong model; the moment execution begins, the fast model takes over. You spend the expensive tokens where the leverage is — on the design — and the cheaper model on the mechanical edits.
Where plan mode fits the workflow
Plan mode is the front of the iterative loop — the “explore and plan” phase before “implement and commit” (the rhythm developed in D3.5). The hands-on mechanics — how the approval screen looks, how to drive the loop turn by turn — are the handbook’s territory rather than this book’s: see the Use book, Chapter 3, Your First Working Session.
Practice
Exercise solutions
B. The change is unfamiliar, multi-file, and expensive to get wrong — exactly the profile where planning first pays. In plan mode Claude maps the full set of call sites and proposes the complete edit before touching anything, so a missed reference shows up in a proposal you can reject, not as breakage you have to walk back. A and D both start editing before the scope is known: you discover missed call sites as broken edits (A) or as a long stream of one-at-a-time approvals with no view of the whole (D). C removes the safety entirely — and approving plans is the opposite move from skipping permission checks. Plan mode here converts an unbounded reversal cost into a bounded one.
In plan mode Claude can read files and run shell commands to explore, and it writes a plan — but it does not edit your source. The thing people wrongly assume it suppresses is permission prompts: they “still apply the same as default mode,” so plan mode is not a quiet, prompt-free sandbox — it is a no-edit mode that still gates the actions it does allow. Read-only is the guarantee; silence is not.
(a) The session is now in acceptEdits — approving a plan exits plan mode and switches the session to the write mode the chosen approve option names, and Claude starts editing. The read-only guarantee is gone. (b) To plan again you must re-enter plan mode: cycle back with Shift+Tab, or prefix your next prompt with /plan. Approval is one-way; getting back to research is a deliberate re-entry, not an undo.
(a) Under opusplan, the planning phase runs on Opus (complex reasoning and architecture) and execution runs on Sonnet (code generation) — the switch is automatic at the plan→execute boundary. (b) No — opusplan does not give you a 1M planning window. The automatic 1M-context upgrade applies to the opus alias only, not opusplan, whose Opus plan phase runs at the standard 200K. For a ~400K planning context you would use opus[1m] (or pin a 1M model) for that phase instead.
Exam essentials
- Plan mode = read-only research — Claude reads, explores via shell, and writes a plan, but does not edit source; permission prompts still apply the same as default mode (it is no-edit, not prompt-free).
- “Direct execution” is not a mode — it is working in a write mode (
default/acceptEdits); plan’s contrast is the modes that let edits through. - Enter —
Shift+Tabcycle (default→acceptEdits→plan),/plan [task],--permission-mode plan, ordefaultModein settings. - Exit — approving a plan exits plan mode into the chosen write mode and Claude starts editing; to plan again, cycle back with
Shift+Tabor prefix/plan. opusplan— alias that runs Opus in plan mode, Sonnet in execution; set via/model opusplanor--model opusplan. Approval flips both the permission mode and the model. Its plan phase runs at 200K — the 1M upgrade isopus-only, notopusplan.- The decision — plan when reversal cost is high or the design space is uncertain (unfamiliar / multi-file / risky); go direct for small diffs in known code. Plan contains misunderstandings before edits.
Iterative Refinement: The Loop, the Interview, and Test-Driven Prompting
Agentic work is iterative. The explore-plan-implement-commit rhythm, the interview pattern (Claude interviews you, writes a spec, a fresh session implements), and test-driven prompting are the durable disciplines — methodology that survives any tool rename, hence a stable principle.
D3.4’s plan mode is one phase of a larger loop. This chapter is the loop itself — how an architect drives a task from understanding to a committed change, refining as they go. These are disciplines, not features: the explore-plan-implement-commit rhythm, the interview pattern, and test-driven prompting outlast any particular keybinding or tool name, which is why this chapter is a stable principle rather than a feature surface.
The four-phase rhythm
Letting Claude jump straight to coding produces code that solves the wrong problem; the antidote is a phased loop. “The recommended workflow has four phases: Explore [plan mode, read files] → Plan [create detailed implementation plan] → Implement [switch out of plan mode, verify against plan] → Commit [descriptive message + PR].” [Official] Best practices for Claude Code · AnthropicT1-official original Explore and Plan are the read-only front half (D3.4’s plan mode is exactly this); Implement and Commit are where edits land. Skip the plan phase only when the diff is one-sentence-describable. [Official] Best practices for Claude Code · AnthropicT1-official original
The interview pattern
For a large feature with an unclear design space, the most effective opening is to invert the usual flow and let Claude drive the questions: “For larger features, have Claude interview you first. Start with a minimal prompt and ask Claude to interview you using the AskUserQuestion tool. Claude asks about things you might not have considered yet, including technical implementation, UI/UX, edge cases, and tradeoffs.”
[Official]
Best practices for Claude Code · AnthropicT1-official original The interview ends by writing a complete spec to a file, and then a fresh session implements from that spec — the interview session is full of question-answer thrashing, while the implementation session starts with clean context whose only input is the written spec.
The spec file is the bridge between the two sessions: a deliverable you can review and the context bootstrap the implementation session reads.
Give Claude a way to verify its work
The highest-return habit in the whole loop is supplying a success criterion: “Include tests, screenshots, or expected outputs so Claude can check itself. This is the single highest-leverage thing you can do. … Without clear success criteria, it might produce something that looks right but actually doesn’t work.” [Official] Best practices for Claude Code · AnthropicT1-official original This is also what turns a vague prompt concrete — “the build is failing” becomes “the build fails with this error: [paste]; fix it, verify the build succeeds, and address the root cause.” Test-driven prompting is the same instinct formalized: for a bug, ask for “a failing test that reproduces the issue, then fix it”; for longer work, have Claude write the tests first and keep them as the persistent contract. [Official] Best practices for Claude Code · AnthropicT1-official original
Course-correct early — and know when to restart
Iteration only pays if the loop stays clean. “The best results come from tight feedback loops” — correct Claude quickly rather than letting a wrong direction run. But there is a threshold: “If you’ve corrected Claude more than twice on the same issue in one session, the context is cluttered with failed approaches. Run /clear and start fresh with a more specific prompt that incorporates what you learned. A clean session with a better prompt almost always outperforms a long session with accumulated corrections.”
[Official]
Best practices for Claude Code · AnthropicT1-official original
The hands-on mechanics of this loop — the session rhythm turn by turn, and the dedicated treatments of the interview pattern and the testing discipline — are the handbook’s territory: see the Use book, Chapter 3, Your First Working Session, with the interview-pattern and testing-and-verification chapters forthcoming there.
Practice
Exercise solutions
B. A large feature with an unsettled design space is the documented home of the interview pattern: a minimal prompt asks Claude to interview you via AskUserQuestion, surfacing edge cases and tradeoffs you have not considered, and the interview ends in a SPEC.md that a fresh session then implements from clean context. A assumes you already hold a complete spec — but the premise is that you do not, so you would be encoding gaps as requirements. C burns turns thrashing and pollutes context with half-formed direction. D is the closest miss: plan mode (D3.4) makes Claude explore the code, but the interview pattern makes Claude interrogate you about intent and tradeoffs — and here the unknowns live in your requirements, not in the codebase. The interview elicits the spec; plan mode would plan against a spec you have not written yet.
The four phases in order are Explore → Plan → Implement → Commit — explore (read-only / plan mode) builds understanding, plan writes a detailed implementation plan, implement switches out of plan mode and verifies against that plan, and commit writes a descriptive message and PR. You may skip the plan phase only when the diff is one-sentence-describable — a change small and clear enough that there is nothing for a plan to de-risk.
The pattern is: “Write a failing test that reproduces the issue, then fix it” — make Claude first encode the bug as a test that fails, then make that test pass. It outperforms “fix the login bug” for two reasons. First, the failing test is an unambiguous success criterion: Claude can iterate against ground truth on its own turns instead of waiting for you to judge each attempt — the highest-leverage move in the loop. Second, the test persists as a regression contract: it stays green afterward, so the same bug cannot silently return. “Fix the login bug” gives Claude no way to check itself and leaves nothing behind to prove the fix held.
Exam essentials
- Four-phase rhythm — Explore (read-only / plan mode) → Plan → Implement (verify against plan) → Commit; skip the plan only when the diff is one-sentence-describable.
- Interview pattern — for large features, Claude interviews you via
AskUserQuestion, writes aSPEC.md, and a fresh session implements from it (clean context). - Verification is the single highest-leverage move — include tests, screenshots, or expected outputs so Claude checks itself; without criteria, the human is the only feedback loop.
- Concrete beats vague — replace “the build is failing” with the error plus “address the root cause”; the delta is verb + concrete example/test/file + a verification step.
- Course-correct early, then restart — tight loops beat drift, but after two corrections on the same issue,
/clearand rewrite the prompt incorporating what you learned.
CI/CD Integration: Headless Runs, Output Formats, and GitHub Actions
Running Claude Code in CI — the headless `claude -p` entry point and `--bare` for reproducibility, the three output formats, schema-validated structured output via `--json-schema`, the permission flags that lock down a run with no human to prompt, and the GitHub Actions wrapper with its credential model.
D2.5 and D3.4 governed permission inside an interactive session. This chapter takes the same agent out of the terminal and into a pipeline. The mechanics change — there is no one at the keyboard to approve a tool or answer a question — so a headless run has to decide its output shape and its permission surface up front. The payoff is that Claude Code becomes a scriptable, gated CI citizen.
The headless invocation
The entry point for everything in this chapter is one flag: “claude -p "<query>" is the canonical non-interactive invocation; the CLI exits after responding. All standard CLI options work with -p.”
[Official]
Run Claude Code programmatically · AnthropicT1-official original That single command runs the full agent loop and returns — no prompt, no session UI.
For CI you almost always pair it with --bare: “Add --bare to reduce startup time by skipping auto-discovery of hooks, skills, plugins, MCP servers, auto memory, and CLAUDE.md. Without it, claude -p loads the same context an interactive session would, including anything configured in the working directory or ~/.claude. Bare mode is useful for CI and scripts where you need the same result on every machine.”
[Official]
Run Claude Code programmatically · AnthropicT1-official original It is the recommended mode for scripted calls and is slated to become the default for -p in a future release.
[Official]
Run Claude Code programmatically · AnthropicT1-official original
Output formats
A headless run can emit one of three shapes, selected with --output-format: text (default), json (a single payload with result, session_id, and total_cost_usd), and stream-json (newline-delimited events).
[Official]
Run Claude Code programmatically · AnthropicT1-official original The json form is what makes a run scriptable: “With --output-format json, the response payload includes total_cost_usd and a per-model cost breakdown, so scripted callers can track spend per invocation without consulting the usage dashboard.”
[Official]
Run Claude Code programmatically · AnthropicT1-official original
Structured output with --json-schema
When a downstream step needs a specific shape rather than prose, constrain the output to a schema: “To get output conforming to a specific schema, use --output-format json with --json-schema and a JSON Schema definition. The response includes metadata about the request (session ID, usage, etc.) with the structured output in the structured_output field.”
[Official]
Run Claude Code programmatically · AnthropicT1-official original The flag is --json-schema '<schema>'; the CLI reference describes it as producing “validated JSON output matching a JSON Schema after agent completes its workflow (print mode only).”
[Official]
CLI reference · AnthropicT1-official original
A minimal schema looks like this (illustration only — the shape is yours to define):
claude -p "Classify this PR's risk" \
--output-format json \
--json-schema '{"type":"object","properties":{"severity":{"type":"string"}},"required":["severity"]}'
The schema-conforming result then arrives in structured_output, alongside the usual session_id and usage metadata.
Permission gates for a run with no human
The defining constraint of CI is that no one is there to approve a tool call, so the permission surface must be settled before the run starts. The locked-down mode from D3.4 is built for exactly this: --permission-mode dontAsk “denies anything not in permissions.allow or the read-only command set,” which the docs call out as useful for locked-down CI runs.
[Official]
Run Claude Code programmatically · AnthropicT1-official original Pair it with an allowlist: --allowedTools "Bash(git diff *),Read" auto-approves specific tools and supports prefix matching.
[Official]
Run Claude Code programmatically · AnthropicT1-official original
Two more knobs bound a run: --max-turns N “limits agentic turns and exits with error when reached,” and --max-budget-usd <N> caps dollar spend — both print-mode-only.
[Official]
CLI reference · AnthropicT1-official original At the far end, --permission-mode bypassPermissions (alias --dangerously-skip-permissions) skips prompts entirely
[Official]
Run Claude Code programmatically · AnthropicT1-official original — appropriate only inside an isolated container, never as a convenience.
Exit codes: what CI actually gates on
Output format decides what a run prints; the exit code decides whether the pipeline step passes. A CI step’s pass/fail is the exit status of the process it ran — so a headless claude -p that runs the full loop and “exits after responding”
[Official]
Run Claude Code programmatically · AnthropicT1-official original hands its exit code straight to the runner, and 0 means the step succeeds while non-zero fails it. The docs name concrete non-zero triggers you can rely on: --max-turns N “limits agentic turns and exits with error when reached,”
[Official]
CLI reference · AnthropicT1-official original and an over-cap stdin (10 MB) “returns a clear error and non-zero exit.”
[Official]
Run Claude Code programmatically · AnthropicT1-official original
The same mechanism gives you a clean pre-flight gate: claude auth status “exits 0 if logged in, 1 if not — useful as a CI gate before the agent step.”
[Official]
Run Claude Code programmatically · AnthropicT1-official original Run it first and the job fails fast with a clear cause instead of burning a turn on an unauthenticated agent call.
GitHub Actions and the credential model
The managed CI surface wraps all of the above. Claude Code GitHub Actions is “built on top of the Claude Agent SDK”
[Official]
Claude Code GitHub Actions · AnthropicT1-official original and wraps claude -p in a GitHub Action runner. Beyond the direct Anthropic API it supports two cloud providers — Amazon Bedrock (use_bedrock) and Google Vertex AI (use_vertex) — each authenticated through GitHub OIDC / Workload Identity Federation, so no static cloud keys are stored.
[Official]
Claude Code GitHub Actions · AnthropicT1-official original The v1.0 interface is deliberately small: “mode is auto-detected; use prompt for all instructions and claude_args for any CLI passthrough”
[Official]
Claude Code GitHub Actions · AnthropicT1-official original — so everything from earlier sections (--bare, --output-format, --allowedTools) reaches the runner through claude_args.
Practice
Exercise solutions
B. The job has no human to approve anything, so the permission surface must be fixed up front: dontAsk denies anything not pre-approved, --allowedTools "Read,Bash(npm test)" grants exactly the read and test-run capability (prefix matching scopes Bash to the test command), and --bare makes the run reproducible across machines. A does the opposite of locking down — --dangerously-skip-permissions approves everything, including edits and pushes. C auto-approves file edits, but the job is supposed to be read-only. D is the classic headless trap: in CI there is no one to answer the approval prompt, so a tool that falls through to default mode stalls or is denied rather than helpfully pausing. The locked-down combination is dontAsk + a tight allowlist + --bare.
Use --output-format json and read the session_id and total_cost_usd fields. The json form returns a single payload with result, session_id, and total_cost_usd (plus a per-model cost breakdown), so the later step can resume with the captured session_id and log spend from total_cost_usd — no usage-dashboard round-trip. The default text format returns only the final response (nothing parseable), and stream-json would make you reassemble the fields from an event stream.
The flag is --bare, and the cause it addresses is non-reproducible context discovery. Without --bare, claude -p auto-discovers and loads whatever the machine has — the repo’s CLAUDE.md, a developer’s personal ~/.claude config, locally-configured MCP servers, hooks, skills, plugins, auto memory — so the same command becomes a function of the host rather than of its inputs, and two runners diverge. --bare skips that discovery, making the run reproducible across machines (and it is slated to become the -p default).
(a) The job gates on the run’s process exit code: claude -p exits after responding and hands its exit status to the runner — 0 passes the step, non-zero fails it. (b) Two documented non-zero conditions: hitting --max-turns (“exits with error when reached”) and an over-cap stdin (piped input above the 10 MB limit “returns a clear error and non-zero exit”); claude auth status exiting 1 when not logged in is a third, useful as a pre-flight gate. (c) Swallowing the status — e.g. ending the command with || true or masking it behind a pipe — so the shell reports success even though the agent run failed; let the exit code propagate.
Exam essentials
- Headless entry —
claude -p "<query>"runs non-interactively and exits; add--bareto skip discovery (hooks/skills/MCP/CLAUDE.md) for reproducible CI.--bareis slated to become the-pdefault. - Output formats —
text(default),json(result,session_id,total_cost_usd, per-model cost),stream-json(newline-delimited events;system/initcarriesplugin_errorsto fail CI). - Structured output —
--output-format jsonwith--json-schemaadds a validatedstructured_outputfield;--json-schemais print-mode only (as are--max-turns,--max-budget-usd). - Permission gates — CI has no human to prompt, so decide the surface up front:
dontAsk(deny anything not pre-approved) +--allowedTools(prefix matching, e.g.Bash(git diff *));--max-turns/--max-budget-usdcap a run;bypassPermissions/--dangerously-skip-permissionsskips all checks — containers only. - Exit codes are the CI contract — a step passes/fails on the run’s exit status (
0success, non-zero failure).--max-turnsreached and over-cap stdin (10 MB) exit non-zero;claude auth statusexits0/1as a pre-flight gate. Don’t mask the status (|| true) — a failed run would report green. - GitHub Actions — wraps
claude -p; v1.0 auto-detects mode (prompt+claude_args); the Anthropic API plus Bedrock + Vertex (the two cloud providers via OIDC, no static keys); supply credentials as secrets, never hardcoded.
Explicit Criteria over Vague Instructions
The controllable lever for output quality is the specification, not the model. Name the success criteria and the output shape explicitly; positive instruction beats negative; the model will not infer a requirement you did not state. This is durable methodology — a stable principle, not a feature surface.
Part IV opens the domain the exam weights at 20% — getting a model to produce what you actually need. The first lever is the cheapest and the most overlooked: the prompt’s own precision. Everything later in this Part (few-shot, structured outputs, validation loops) is an escalation from this baseline. The principle here outlasts every model version — newer models make it more true, not less — which is why this chapter is a stable principle, not a feature surface.
Specify the output format; do not leave it to inference
The single most reliable quality lever is to state the output contract explicitly. “Precisely define your desired output format using JSON, XML, or custom templates so that Claude understands every output formatting element you require.”
[Official]
Increase output consistency · AnthropicT1-official original A vague instruction (“summarize this”) leaves the shape — length, fields, ordering, what to do with missing data — for the model to guess, and a guess varies from run to run. An explicit instruction (“return JSON with keys sentiment, key_issues (list), and action_items (list of objects with team and task)”) removes the variance at its source.
The model will not infer what you did not ask for
Modern models follow instructions more literally, which makes implicit expectations a liability. “Claude Opus 4.8 interprets prompts literally and explicitly, particularly at lower effort levels. It does not silently generalize an instruction from one item to another, and it does not infer requests you didn’t make.” [Official] Prompting best practices · AnthropicT1-official original The upside is precision and less thrash; the cost is that a requirement you held in your head but never wrote down will simply not be honored. The fix is not a cleverer prompt that the model “figures out” — it is stating the requirement.
Tell the model what to do, not what to avoid
When you are steering format or tone, a positive instruction outperforms a prohibition. The docs are explicit that demonstrating the wanted behavior beats forbidding the unwanted one: “Positive examples showing how Claude can communicate with the appropriate level of concision tend to be more effective than negative examples or instructions that tell the model what not to do.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original “Respond in smoothly flowing prose” steers better than “do not use markdown lists,” because a prohibition names a forbidden region without locating the target inside the (still vast) permitted one. A positive instruction points straight at the destination. The same logic applies to eliminating preambles: state “respond directly without preamble” rather than enumerating the openings you dislike. [Official] Prompting best practices · AnthropicT1-official original
The escalation ladder: instruction, then examples, then a hard schema
Explicit instruction is the first rung, not the only one. The documented hierarchy is to ask plainly first and escalate only when you need a stronger guarantee: “Try simply asking the model to conform to your output structure first, as newer models can reliably match complex schemas when told to, especially if implemented with retries. For classification tasks, use either tools with an enum field containing your valid labels or structured outputs.” [Official] Prompting best practices · AnthropicT1-official original Plain instruction handles most cases; a few-shot example (the next chapter) disambiguates edge cases; a hard schema (D4.3) makes a shape unrepresentable to violate. Each rung costs more — context, latency, setup — so you climb only as far as the stakes require.
The hands-on craft of writing these prompts — the iteration rhythm, the worked before-and-after examples — is the Use book’s territory; its prompt-engineering chapter is the use-side companion to this exam-angle treatment (forthcoming in the handbook).
Practice
Exercise solutions
C. The inconsistency is latent in the prompt: “summarize” leaves length, fields, and sentiment-handling unstated, so the model resolves them differently each run. Specifying the exact contract — fixed fields, each with a type and a length bound — removes those degrees of freedom and is what a downstream parser needs. A (temperature) reduces token-level randomness but does nothing about an underspecified shape; a deterministic model still has to invent a structure you never gave it. B is a negative, vague instruction — “too much” is undefined and “be consistent” names the goal without specifying the target. D is the exact assumption modern literal-following models invalidate: a bigger model will not infer a contract you didn’t write, and may follow your vague prompt more faithfully, not less.
A positive rewrite: “Respond directly with the answer in flowing prose.” That single instruction covers both prohibitions — “directly” eliminates the preamble, “flowing prose” rules out headers — by naming the target instead of the forbidden regions. It steers more reliably because a prohibition (“don’t use headers,” “no preamble”) shrinks the output space without locating the destination inside the still-vast permitted region, whereas the positive form aims straight at the one shape you meant.
The three rungs, cheapest first: (1) explicit instruction — name the four labels in the prompt and ask the model to return exactly one; (2) few-shot examples — demonstrate the labeling on ambiguous inputs; (3) structured outputs / strict tools — constrain decoding so only an in-set label can be emitted. Here you should climb to rung 3: the premise is that a malformed or out-of-set label crashes the pipeline, so you need the shape to be unviolatable, not merely likely-correct. An enum-constrained tool or structured output makes an out-of-set value unrepresentable — the only rung that turns “should be valid” into “cannot be invalid,” which is what a crash-on-violation contract demands.
Exam essentials
- Consistency lives in the specification — if two runs disagree, a degree of freedom was left unstated; pin the format (fields, types, lengths, missing-data handling) explicitly.
- Modern models follow instructions literally — Opus 4.8 “does not infer requests you didn’t make,” so an unstated requirement is an unmet one; state it rather than expecting the model to read intent.
- Positive beats negative — “respond in flowing prose” / “answer in one sentence” steers better than “don’t use lists” / “don’t be verbose”; aim at the target instead of ruling out one failure.
- Escalation ladder — explicit instruction → few-shot examples (D4.2) → structured outputs / strict tools (D4.3); climb only as far as the stakes require, since each rung costs more.
- Why stable-principle — “be explicit about what you want” survives every model version; newer, more-literal models make it more load-bearing, not less.
Few-Shot Prompting for Ambiguous Cases
Examples are the most reliable way to steer format, tone, and structure — and the only clean way to pin down an ambiguous case. The pattern is 3-5 relevant, diverse, structured examples, with at least one placed on the edge case showing the desired handling.
D4.1’s escalation ladder put examples on the second rung: when a plain instruction can’t fully pin a behavior — especially on the messy, ambiguous inputs — a demonstration does what a description cannot. This chapter is that rung. It is an architectural pattern: the 3-5-example construction and its quality criteria are stable across model versions, with the example-tag syntax as the illustration that may shift.
Examples are the most reliable steering mechanism
When a behavior is hard to describe, demonstrate it. “Examples are one of the most reliable ways to steer Claude’s output format, tone, and structure. A few well-crafted examples (known as few-shot or multishot prompting) can dramatically improve accuracy and consistency.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original An example carries information a sentence struggles to: the exact field ordering, the precise tone, how a borderline input should resolve. The model does not memorize the examples — it extracts the implicit pattern across them and applies it to the new input.
The 3-5 sweet spot
The documented count is small and specific: “Include 3-5 examples for best results. You can also ask Claude to evaluate your examples for relevance and diversity, or to generate additional ones based on your initial set.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original The range is not arbitrary — too few examples and the model latches onto an incidental trait; too many and you burn context for no gain and risk contradictory examples confusing it.
Relevant, diverse, structured
Quality matters more than count. The three criteria are explicit: “When adding examples, make them: Relevant: Mirror your actual use case closely. Diverse: Cover edge cases and vary enough that Claude doesn’t pick up unintended patterns. Structured: Wrap examples in <example> tags (multiple examples in <examples> tags) so Claude can distinguish them from instructions.”
[Official]
Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original Diversity is the criterion most often neglected and the most consequential: three examples that all happen to share an irrelevant trait teach that trait as if it were the rule.
Target the ambiguous case directly
This is the heart of the cert task area. When an extraction or classification has a messy case — a missing field, an “other” bucket, an unusual variant — do not write a separate rule for it; show an example on that case with the handling you want. The model generalizes from the example treatment, not from a prose rule beside it. Put the ambiguous input in the middle of the set with its desired output:
<examples>
<example>
<input>Order #4815 shipped on Apr 3 via UPS tracking 1Z999AA10123456784.</input>
<output>{"order_id": "4815", "carrier": "UPS", "tracking": "1Z999AA10123456784"}</output>
</example>
<example>
<input>Customer asked about order status yesterday but gave no order number.</input>
<output>{"order_id": null, "carrier": null, "tracking": null}</output>
</example>
<example>
<input>Shipped today via 'FedEx Express Saver' - see ref 7712-4488-9933.</input>
<output>{"order_id": null, "carrier": "FedEx", "tracking": "7712-4488-9933"}</output>
</example>
</examples>
The middle example is the ambiguous case: it teaches that “no order number” resolves to null — not an empty string, not "unknown", not "n/a".
[Official]
Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original Few-shot also composes with the next rung of the ladder: with structured outputs (D4.3), the schema locks the shape while examples still teach content and edge-case handling — the two are complementary, not redundant.
[Official]
Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original A schema can require order_id to be a string-or-null; only an example teaches that this kind of input is the null one.
The use-side craft of building and iterating on an example set lives in the Use book’s prompt-engineering chapter (forthcoming), alongside D4.1’s hands-on companion material.
Practice
Exercise solutions
B. The ambiguous case — an input with no due date — is exactly what an example should demonstrate: place one in the set whose output shows "due_date": null, and the model generalizes from that treatment. A is the D4.1 instinct (be explicit), and it helps, but a prose rule beside the examples is weaker than a demonstration on the case itself: the model generalizes from the example treatment, not from a separate written rule sitting next to it (the principle established in “Target the ambiguous case” above). C adds volume without diversity: ten clean invoices never show the missing-date case, so the model still has nothing to copy for it. D works mechanically but is a band-aid that hard-codes one symptom (“unknown”); the model may emit “none” or "" next, and you are now maintaining a translation table instead of fixing the prompt. The few-shot fix targets the root cause.
The three criteria are relevant (mirror your actual use case closely), diverse (cover edge cases and vary enough that the model doesn’t latch onto an unintended pattern), and structured (wrap each example in <example> tags, the set in <examples>, so the model separates demonstrations from instructions). Diversity is the silent corrupter: when three examples happen to share an incidental trait — every input ends in a period, every output capitalizes the first field — the model generalizes from whatever is common across the set, so it learns that trait as if it were the rule. The prompt still “looks right,” but it has quietly taught the wrong invariant, and the failure only surfaces on inputs that don’t share the accidental trait.
With a single example, the model cannot tell which of the example’s traits are the pattern and which are incidental — quoting the first field is one concrete trait of that one sample, and with nothing to contrast it against, the model copies it as if it were required. This is the documented failure of 1-2 examples: high risk of picking up an incidental pattern instead of the intended one. Raising the count into the 3-5 range fixes it by adding contrast: across several examples that vary the incidental traits (some quote nothing, different field orders) while holding the intended pattern constant, the first-field-quoting habit no longer appears in every example, so the model stops treating it as the rule. The cure is diversity-via-count, not volume for its own sake.
Exam essentials
- Few-shot is the most reliable steering mechanism for format, tone, and structure — the model extracts the implicit pattern across examples, so it disambiguates where instructions can’t.
- 3-5 examples is the documented sweet spot: 1-2 risks learning an incidental trait, 6+ burns context and risks contradictions; budget the 3-5 as one canonical + variants + one edge case.
- Relevant, diverse, structured — mirror the real use case, vary enough to avoid spurious patterns, and wrap each in
<example>(group in<examples>) so the model separates examples from instructions and input. - Target the ambiguous case — put an example on the edge case (null field, “other” bucket) showing the desired handling; the example teaches it, a prose rule beside it teaches it less reliably.
- Composes with structured outputs — the schema locks shape, examples teach content and edge-case handling; complementary, not redundant.
Structured Output via Tool Use and JSON Schema
Forcing a known-shape JSON result has two generations. The classic pattern borrows the tool-call channel as a typed output slot; the modern features (strict tool use and output_config.format) use grammar-constrained decoding to make a non-conforming shape unrepresentable. The JSON-Schema subset, additionalProperties false, and the per-request limits are the surfaces to know.
D4.1 ended with a top rung: when a shape must not be violated, make it unrepresentable. This chapter is that rung’s machinery. It has two generations — the older tool-use pattern that is still the right tool for open-ended schemas, and the newer grammar-constrained features that eliminate schema-violation retries entirely — and because the substance here is named API fields, schema rules, and numeric limits, it is a feature surface.
The classic mechanism: a tool whose input is your output
The oldest reliable way to get JSON is to borrow the tool-call channel. Define a tool whose input_schema is exactly the shape you want back, force Claude to call it, and read the call’s input — that object is your extracted JSON; you discard the tool’s “result” entirely. The convention is to name the tool print_X (print_summary, print_entities) so the model treats it as committing data rather than taking an action.
[Official]
Extracting Structured JSON using Claude and Tool Use · AnthropicT1-official original Forcing the call is what guarantees the extraction happens: tool_choice: {type: "tool", name: "print_summary"}.
[Official]
Extracting Structured JSON using Claude and Tool Use · AnthropicT1-official original
tools = [{
"name": "print_summary",
"description": "Prints a summary of the article.",
"input_schema": {
"type": "object",
"properties": {
"author": {"type": "string"},
"topics": {"type": "array", "items": {"type": "string"}},
"summary": {"type": "string"},
},
"required": ["author", "topics", "summary"],
},
}]
resp = client.messages.create(model="claude-opus-4-8", max_tokens=1024, tools=tools,
tool_choice={"type": "tool", "name": "print_summary"}, messages=[...])
json_summary = next(b.input for b in resp.content if b.type == "tool_use")
Strict tool use: from shape to guarantee
The classic pattern controls which fields appear, but not their types — Claude could still emit "2" where you need 2. Setting strict: true on the tool definition closes that gap: “Setting strict: true on a tool definition guarantees Claude’s tool inputs match your JSON Schema by constraining the model’s token sampling to schema-valid outputs (a technique called grammar-constrained sampling).”
[Official]
Strict tool use · AnthropicT1-official original The motivation is operational: “Without strict mode, Claude might return incompatible types (‘2’ instead of 2) or missing required fields, breaking your functions and causing runtime errors.”
[Official]
Strict tool use · AnthropicT1-official original For “call one of N candidate tools and validate its inputs,” combine tool_choice: {type: "any"} with strict: true on each tool.
[Official]
Strict tool use · AnthropicT1-official original
Structured outputs: constrain the response itself
Strict tool use constrains a tool call. Its sibling constrains Claude’s response directly: output_config.format coerces the final assistant text to a JSON schema using the same pipeline. The two are “two complementary features: JSON outputs (output_config.format) … Strict tool use (strict: true),” and the payoff is the elimination of retry loops: “Structured outputs guarantee schema-compliant responses through constrained decoding … No retries needed for schema violations.”
[Official]
Structured outputs · AnthropicT1-official original The request carries output_config: {format: {type: "json_schema", schema: {...}}}, and the conforming JSON arrives in the response text.
Note the migration: “The output_format parameter has moved to output_config.format, and beta headers are no longer required. The old beta header (structured-outputs-2025-11-13) and output_format parameter will continue working for a transition period.”
[Official]
Structured outputs · AnthropicT1-official original
The JSON-Schema subset and its one mandatory rule
Both features accept a subset of JSON Schema, not the full draft. Objects, arrays, the scalar types, enum, const, anyOf, and internal $ref are supported; external $ref, recursive schemas, numerical bounds (minimum/maximum), and string-length bounds are not — unsupported features return a 400.
[Official]
Structured outputs · AnthropicT1-official original The one rule that catches everyone: additionalProperties: false is required on every object node — it is the most common 400 for hand-authored schemas. When you need a numeric or length bound, the SDK helpers strip it from the schema, encode it as description text, and validate it client-side after the call instead.
[Official]
Structured outputs · AnthropicT1-official original
Limits, caching, and the failure modes that still get through
Three operational facts complete the picture. Caching: the compiled grammar carries a first-request latency, then is “cached for 24 hours from last use,” and the cache “invalidates if you change the JSON schema structure or set of tools. Changing only name or description fields does NOT invalidate cache.”
[Official]
Structured outputs · AnthropicT1-official original Limits: a request allows at most 20 strict tools, 24 cumulative optional parameters across strict schemas, and 16 union-typed parameters; beyond that (or an internal grammar-size cap) you get a 400 “Schema is too complex for compilation.”
[Official]
Structured outputs · AnthropicT1-official original The failures constrained decoding cannot prevent. Grammar constraints guarantee every emitted token is schema-valid — but not that Claude emits a complete result, so two gaps survive and a caller must check stop_reason for both. First, a refusal: if Claude refuses, the response is stop_reason: "refusal" with a 200 status, you are billed, and the output may not match the schema.
[Official]
Structured outputs · AnthropicT1-official original Second, truncation: max_tokens is a hard cap on output, and stop_reason: "max_tokens" is its output-budget signal.
[Official]
How the agent loop works · AnthropicT1-official original If generation hits that cap mid-structure, every token emitted was schema-valid but the object never closed — the JSON is cut off, so a parser rejects it just the same. The fix is not a retry on the same budget but a larger max_tokens (or a smaller schema); the constrained decoder cannot finish a structure it ran out of room to write.
This is also why the classic cookbook pattern has a permanent niche: open-ended extraction with additionalProperties: true — “I don’t know which fields will be present” — is something structured outputs cannot do, since it requires additionalProperties: false. For open-ended schemas, plain tool use stays the right tool.
[Official]
Extracting Structured JSON using Claude and Tool Use · AnthropicT1-official original
Practice
Exercise solutions
A. A single forced tool plus strict: true gives both guarantees you need: the forced tool_choice ensures exactly this extraction runs, and strict constrains the inputs by grammar so seats is a real integer, not "3" — exactly the “incompatible types breaking your functions” case strict mode exists to prevent. B is the open-ended pattern; additionalProperties: true is for when you don’t know the fields, and it forgoes the strict guarantee — wrong for a fixed, type-critical record. C silently drops strict through the OpenAI-compatibility layer, so you lose the type guarantee precisely where you needed it. D is the unconstrained baseline D4.1 warned about — it can parse fine and still hand you "3", the failure you are trying to design out.
The most likely cause is an object node missing additionalProperties: false, and the one-line fix is to add it to every object in the schema — nested ones included. The decoder requires it because an open object (one allowing arbitrary extra keys) has no closed set of valid continuations to compile into a grammar: at each decoding step the model must know exactly which keys are permitted, and additionalProperties: false is what closes that set. Without it there is no finite grammar to constrain sampling against, so the API rejects the schema with a 400 rather than allow unconstrained keys. It is the top 400 for hand-authored schemas precisely because standard JSON Schema defaults additionalProperties to true, so a schema that “validates fine in your editor” still fails compilation here.
The niche is open-ended extraction — “I don’t know which fields will be present” — and the schema feature that defines it is additionalProperties: true (an object that may carry arbitrary, unknown keys). Constrained decoding cannot serve it because it requires additionalProperties: false on every object: the grammar must enumerate the permitted keys ahead of time, and an object that allows any key has no closed grammar to compile. So when the set of fields is genuinely unknown in advance, the classic print_X tool-use pattern — which imposes no such closure — remains the right tool; structured outputs is for shapes you can pin down completely.
What is happening is truncation, not a grammar failure. max_tokens is a hard cap on total output, and the generation ran into it partway through writing the object; the grammar did its job — every token emitted was schema-valid — but it cannot guarantee the structure finishes within the budget, so the JSON is cut off before its closing braces and the parser rejects it. The confirming signal is stop_reason: "max_tokens" (the output-budget value), as opposed to end_turn. The fix is to raise max_tokens (or shrink the schema / split the extraction). A plain retry of the identical request is not the fix because it re-runs against the same budget and truncates at the same place — you must enlarge the room before the structure can complete.
Exam essentials
- Classic tool-use pattern — define a
print_Xtool whoseinput_schemais your output shape, force it withtool_choice: {type: "tool", name: ...}, readtool_use.input; the tool result is discarded. strict: true— grammar-constrains tool inputs to the schema (no wrong types, no missing required fields); pair withtool_choice: {type: "any"}for “one-of-N and valid.” Ignored on the OpenAI-compat layer.output_config.format— grammar-constrains Claude’s response to a JSON schema; “no retries needed for schema violations.” Migrated from theoutput_formatparam /structured-outputs-2025-11-13beta header.- Schema subset — a subset of Draft 2020-12;
additionalProperties: falseis mandatory on every object (top 400 cause); no numeric/length bounds (SDKs strip them to descriptions + post-validate); no external$refor recursion. - Limits + failure modes — 20 strict tools / 24 optional params / 16 union types per request; grammar cached 24h from last use (invalidated by schema/tool-set change, not name/description); two failures slip past constrained decoding — a refusal (
stop_reason: "refusal"; 200, billed, may not match) and truncation (stop_reason: "max_tokens"; JSON cut off mid-object). Always checkstop_reasonfor both; raisemax_tokensfor the second rather than retrying.
Validation, Retry, and Feedback Loops
Constrained decoding eliminates schema errors, never semantic ones — valid JSON can still hold wrong data. The architect's job is the layer above the schema, discriminating the two error kinds, encoding semantic checks into the schema itself, and closing a bounded validate-feed-back-retry loop that escalates to a human on exhaustion.
D4.3 closed with a hard guarantee — a schema a response cannot violate — and one caveat: a refusal still gets through. This chapter is about a deeper caveat. Constrained decoding guarantees the shape, never the truth. A perfectly schema-valid record can name the wrong customer or fabricate a total. The pattern for catching that lives above the API, in your validation loop — and the loop, not its current field names, is what this chapter is about, which makes it an architectural pattern.
Two kinds of error: schema and semantic
Structured outputs (D4.3) eliminates a whole class of failure: “Always valid: No more JSON.parse() errors. Type safe: Guaranteed field types and required fields. Reliable: No retries needed for schema violations.”
[Official]
Structured outputs · AnthropicT1-official original What it cannot touch is the other class — the semantic errors: responses that are valid JSON matching your schema but containing incorrect data, the very failures the SDK’s validate-and-feed-back machinery exists to catch.
[Official]
Get structured output from agents · AnthropicT1-official original A schema can require customer_name to be a non-empty string; it cannot know the source said “Jane” while the model wrote “John.”
The SDK retry loop handles the schema layer
For the residual schema mismatches in a multi-tool agentic run, the Agent SDK adds a retry loop: “the SDK validates the output against it, re-prompting on mismatch. If validation does not succeed within the retry limit, the result is an error instead of structured data.”
[Official]
Get structured output from agents · AnthropicT1-official original Crucially, exhaustion is a result you inspect, not an exception that throws — you discriminate on subtype: success carries the typed payload in message.structured_output; error_max_structured_output_retries means the budget ran out and you must fall back.
[Official]
Get structured output from agents · AnthropicT1-official original
Encode the semantic check into the schema
You cannot retry your way out of a semantic error the SDK never sees — so make the model commit to signals a caller can check. The pattern is to add fields whose only job is verification:
Each pattern converts an un-checkable judgment (“is this right?”) into a mechanical test (“does calculated_total equal the sum?”).
[Official]
Get structured output from agents · AnthropicT1-official original The model is doing the same extraction either way; you are just asking it to show enough of its work that a downstream check can catch a lie.
Close the loop: validate, feed back, retry, escalate
The full pattern stacks three independent layers, and skipping any one surfaces a different failure.
[Official]
Get structured output from agents · AnthropicT1-official original Layer 1 is constrained decoding (schema errors gone). Layer 2 is your application code running the semantic cross-checks above. Layer 3 re-prompts with the specific failures (“calculated_total does not equal the sum of line items; re-extract correcting this”) for a bounded number of attempts, then falls back.
The schema-design heuristics fold back into D4.3: keep schemas focused, and mark fields optional when the source might not contain them — an over-required schema turns a missing field into a retry and then an exhausted-budget error.
[Official]
Get structured output from agents · AnthropicT1-official original
Practice
Exercise solutions
B. A wrong-but-well-typed total is a semantic error — valid JSON, incorrect data — so no schema or type guarantee touches it. The fix is to make the error checkable: have the model emit both the document’s stated_total and its own calculated_total, then let application code compare them and route mismatches to review. A (strict) guarantees the total is a number, which it already was; it does nothing about a number being wrong. C (more tokens) addresses truncation, not arithmetic fabrication. D (minimum) is both unsupported by the structured-outputs subset and irrelevant — a bound on magnitude can’t detect a total that’s internally inconsistent with the line items.
The two subtypes are success — the run validated, and the typed payload is on message.structured_output — and error_max_structured_output_retries — validation failed within the retry budget, so there is no payload and you must fall back (simpler schema, simpler prompt, or human review). You must branch on subtype before reading the payload because exhaustion returns a result, not an exception: code that reads message.structured_output on the error path reads undefined and silently processes garbage downstream. The subtype is the success/failure contract; the payload is present and trustworthy only on success.
The three layers are (1) API constrained decoding (the schema layer — eliminates syntax/type/required/enum errors), (2) application-code semantic checks (the domain layer — cross-checks what the data means), and (3) a bounded feedback loop (re-prompts with the specific failures, then escalates on exhaustion). A fabricated customer_name — a valid string naming the wrong person — is a semantic error: layer 1 cannot catch it (the shape is perfect) and layer 2 is the one that must. The hook that makes the catch possible is a provenance field: have the model emit the source span it drew the name from (claim + source.span_quote + confidence), so application code can verify the quoted span actually appears in the document — turning “is this the right person?” into a mechanical string-containment check (D5.6).
(a) The failure is truncation, not a schema or semantic error: the response hit the max_tokens output cap partway through writing the object, confirmed by stop_reason: "max_tokens" (the output-budget value, versus end_turn). (b) The retry loop makes it worse because each re-prompt runs against the same max_tokens budget, so it truncates at the same place — every attempt fails validation identically, and the loop burns its entire allowance reaching error_max_structured_output_retries without ever being able to succeed. (c) The fix is to detect stop_reason: "max_tokens" and raise the cap (or shrink / split the schema) before retrying — retries cannot manufacture room the budget does not allow.
Exam essentials
- Schema vs semantic — constrained decoding eliminates syntax/type/required/enum errors; semantic errors (valid JSON, wrong data) are invisible to the API and need domain logic.
- SDK retry loop — validates and re-prompts on mismatch; the result is
success(payload inmessage.structured_output) orerror_max_structured_output_retries(fall back). It’s a result you check, not an exception; the retry count is undocumented. - Schema-level semantic hooks —
detected_pattern,stated_totalvscalculated_total,conflict_detected, nullable “other” + detail, and the provenance triple turn “is it correct?” into a mechanical cross-check. - Three layers — API constrained decoding (schema) + application-code semantic checks + a bounded feedback loop that re-prompts with the specific errors; escalate to human review on exhaustion.
- Loop economics — bound the attempts (each retry is a full inference, ~4× for three retries) and keep the schema stable across retries so you don’t re-pay grammar compilation. A truncation (
stop_reason: "max_tokens") is the trap retries can’t fix — re-prompting on the same budget re-truncates; detect it and raise the cap instead.
Batch Processing: The Message Batches API
When nothing is waiting on the answer, batch trades latency for half the price. The Message Batches API processes up to 100,000 async requests at a 50 percent discount with a 24-hour SLA, and its one non-negotiable contract is custom_id matching, because results come back in any order.
The first three chapters of Part IV controlled a single response. This one scales to a hundred thousand of them. A batch is the right tool whenever the work is large and nothing is waiting on the answer — and its surface (the endpoint, the size limits, the custom_id rule, the beta header) is exactly the kind of named detail that shifts between releases, so this is a feature-surface chapter.
The cost-latency trade
The Message Batches API exists for one trade: give up immediacy, get half off. “The Message Batches API is a powerful, cost-effective way to asynchronously process large volumes of Messages requests. This approach is well-suited to tasks that do not require immediate responses, with most batches finishing in less than 1 hour while reducing costs by 50% and increasing throughput.” [Official] Batch processing (Message Batches API) · AnthropicT1-official original The discount is a flat 50% on both input and output across all tiers, and it stacks with prompt-caching discounts. The cost of the discount is a service-level agreement measured in hours, not milliseconds: most batches finish within an hour, but the guarantee is 24, and a batch that does not complete in 24 hours expires. [Official] Batch processing (Message Batches API) · AnthropicT1-official original Results stay retrievable for 29 days after creation.
The custom_id contract
A batch is a set, not a sequence, and that has one non-negotiable consequence: “Batch results can be returned in any order, and may not match the ordering of requests when the batch was created. … To correctly match results with their corresponding requests, always use the custom_id field.”
[Official]
Batch processing (Message Batches API) · AnthropicT1-official original Every request carries a unique custom_id (1–64 characters, alphanumeric plus - and _), and that id is the only thread connecting an output back to the input that produced it.
The batch envelope: what fits and what it can’t do
A batch is bounded by size and by shape. Size: “A Message Batch is limited to either 100,000 Message requests or 256 MB in size, whichever is reached first.”
[Official]
Batch processing (Message Batches API) · AnthropicT1-official original Exceed the payload and the create call returns HTTP 413 — break huge datasets into multiple batches. Shape: a batch supports all Messages API features including beta features, “however, streaming is not supported for batch requests,”
[Official]
Batch processing (Message Batches API) · AnthropicT1-official original and each request is single-shot — there is no follow-up turn inside a batch, so multi-turn tool round-trips do not work. Structured outputs (D4.3), by contrast, compose cleanly: a batched request can carry output_config.format and you get schema-valid JSON at 50% off.
[Official]
Structured outputs · AnthropicT1-official original
Billing, result types, and the lifecycle
You pay only for what works: a result is succeeded, errored, canceled, or expired, and “you are not billed for errored, canceled, or expired requests.”
[Official]
Batch processing (Message Batches API) · AnthropicT1-official original For unusually long generations there is an opt-in: the output-300k-2026-03-24 beta header “raises the max_tokens cap to 300,000 for batch requests using Claude Opus 4.8, Claude Opus 4.7, Claude Opus 4.6, or Claude Sonnet 4.6” — batch-only, and a single 300k generation can itself take over an hour, so submit it with the 24-hour window in mind.
[Official]
Batch processing (Message Batches API) · AnthropicT1-official original
Practice
Exercise solutions
B. The job is large, offline, and cost-sensitive with no one waiting — the exact profile batch is built for: 50% off, and 80,000 requests sits within the 100,000-request limit. Matching by custom_id after the batch ends is the required pattern because results return unordered. A works but forfeits the 50% discount and adds rate-limit and orchestration overhead for latency nobody needs. C is impossible — streaming is not supported for batch requests. D collapses 80,000 independent classifications into one prompt, which blows past context limits and produces a single entangled response with no per-ticket structure.
custom_id is mandatory because batch results “can be returned in any order” — a batch is a set, not a sequence, so there is no positional correspondence between the request list and the result stream to fall back on. The unique custom_id is the only thread joining an output back to the input that produced it. If a caller instead assumes submission order, the specific failure is a silent mis-join: result n is attributed to request n when it actually answers some other request, so records carry the wrong data and nothing in the response flags it. That is the most dangerous failure class — one that corrupts data without surfacing an error.
The two limits are 100,000 requests or 256 MB in size, whichever is reached first; the 256 MB payload limit is the one an HTTP 413 reports on creation (the fix is to split the dataset into multiple batches). A Messages API capability that does not work inside a batch: streaming (explicitly unsupported), or equally a multi-turn tool loop — each batched request is single-shot, with no tool_result round-trip, because a batch processes each request as one independent user→assistant turn with no follow-up.
Exam essentials
- The trade — batch is async and 50% off (input and output, all tiers, stacks with caching); most finish under an hour, the SLA is 24 hours, then the batch expires; results retained 29 days. Choose it by latency tolerance alone.
custom_idcontract — results return in any order; match by the uniquecustom_id(1–64 chars, alphanumeric +-+_). Never rely on positional order; never reuse an id.- Envelope — 100,000 requests or 256 MB per batch (HTTP 413 over payload); streaming unsupported; each request is single-shot (no multi-turn tool loop). Structured outputs compose (schema-valid at 50% off).
- Billing + beta — billed only for
succeeded;errored/canceled/expiredare free; theoutput-300k-2026-03-24beta raisesmax_tokensto 300,000 on batch for Opus 4.8/4.7/4.6 and Sonnet 4.6. - Succeeded ≠ usable — a
succeededresult still carries a per-messagestop_reason; a refusal ("refusal", 200, billed, may not match schema) or a truncation ("max_tokens", incomplete) reaches you as succeeded. Check each succeeded message’sstop_reason, not just the result type. - Lifecycle —
POST /v1/messages/batches→ poll untilended→ streamresults_urlJSONL → match bycustom_id→ optionalDELETEbefore the 29-day window.
Multi-Pass Review: Independent Reviewers and Attention Dilution
A fresh context catches what a self-review cannot, because attention dilutes as the window fills and an implementer is biased toward its own code. The same independent-reviewer pattern scales from a two-session Writer/Reviewer pair to a fleet of specialists guarded by a verification pass.
Part IV’s final chapter is about checking the work — at scale. The instinct to “ask the model to double-check itself” is exactly the wrong one, for a structural reason: a model reviewing in the same window that wrote the code is both dilated by a full context and biased toward what it just produced. The fix is independence, and it scales from two sessions to a fleet. The principle is stable; the product that embodies it is the illustration — so this is an architectural pattern.
Why a fresh context beats self-review
Two independent forces make same-session self-review weak. The first is attention dilution: “LLM performance degrades as context fills. When the context window is getting full, Claude may start ‘forgetting’ earlier instructions or making more mistakes. The context window is the most important resource to manage.” [Official] Best practices for Claude Code · AnthropicT1-official original The second is implementer bias: “A fresh context improves code review since Claude won’t be biased toward code it just wrote.” [Official] Best practices for Claude Code · AnthropicT1-official original A reviewer that never watched the code get written carries neither the polluted context nor the sunk-cost instinct to defend it.
The Writer/Reviewer pattern and its lightweight form
The canonical realization is two sessions. One writes; a second, with no inherited context, reviews; the first then addresses the feedback. The docs give a worked example: Session A implements a rate limiter, Session B reviews @src/middleware/rateLimiter.ts “looking for edge cases, race conditions, and consistency with existing middleware patterns,” and Session A applies the result.
[Official]
Best practices for Claude Code · AnthropicT1-official original The same shape works for tests: “have one Claude write tests, then another write code to pass them.”
[Official]
Best practices for Claude Code · AnthropicT1-official original When spinning up a second session is too heavy, the single-session analog is a verification subagent — “use a subagent to review this code for edge cases” — which runs in its own context window and so inherits none of the parent conversation’s assumptions.
[Official]
Best practices for Claude Code · AnthropicT1-official original
The fleet: parallelism plus a verification pass
At the top of the scale, the pattern becomes a fleet. In Anthropic’s Code Review product, “when a review runs, multiple agents analyze the diff and surrounding code in parallel on Anthropic infrastructure. Each agent looks for a different class of issue, then a verification step checks candidates against actual code behavior to filter out false positives.” [Official] Code Review · AnthropicT1-official original The surviving findings “are deduplicated, ranked by severity, and posted as inline comments.” [Official] Code Review · AnthropicT1-official original Fanning out to specialists is the direct architectural answer to attention dilution: rather than ask one reviewer to hold every bug class in one window at once, each agent owns a single class — and the isolated-context, lead-plus-specialists shape is the same one Anthropic’s multi-agent research system uses. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original
Code Review in practice: availability, triggers, severity
The productized form of the pattern has a concrete surface the exam can probe. Availability is gated: “Code Review is in research preview, available for Team and Enterprise subscriptions. It is not available for organizations with Zero Data Retention enabled.” [Official] Code Review · AnthropicT1-official original
Triggers come in three modes per repo: once after PR creation, after every push, or manual — invoked by commenting @claude review (which then subscribes to subsequent pushes) or @claude review once (a single one-off pass).
[Official]
Code Review · AnthropicT1-official original The one-off is the lever for “review this PR now, but don’t enroll it in re-review on every push.”
Severity is a fixed three-tag taxonomy on every finding:
And one subtlety that catches people: the check run “always completes with a neutral conclusion so it never blocks merging through branch protection rules.” [Official] Code Review · AnthropicT1-official original Code Review advises; it does not gate. To actually block a merge on findings, read the severity breakdown from the check-run output in your own CI and fail the step yourself. [Official] Code Review · AnthropicT1-official original
Convergence: keep multi-pass from spamming
More passes is not strictly better — multi-pass review needs convergence rules, and attention dilution applies to the instructions as much as the code. The docs are explicit that “a long REVIEW.md dilutes the rules that matter most,”
[Official]
Code Review · AnthropicT1-official original so the broadcast instruction block stays short. And re-review on every push needs a damping rule so trivial diffs don’t draw endless commentary: an instruction like “after the first review, suppress new nits and post Important findings only” stops “a one-line fix from reaching round seven on style alone.”
[Official]
Code Review · AnthropicT1-official original Production-quality fleet review is real work, not free — Code Review averages roughly $15–25 and 20 minutes per review at current figures
[Official]
Code Review · AnthropicT1-official original — so the convergence rules are also cost control.
Practice
Exercise solutions
B. A fresh session or a verification subagent reviews with no inherited context, which removes both weaknesses of self-review at once — the reviewer is at peak performance in a clean window and has no authorship bias toward the code, exactly the conditions the docs credit for catching what self-review misses. A is self-review: most dilated (the implementation already fills the context) and most biased (the model defends what it wrote). C generates a second implementation, not a review, and gives you two artifacts to reconcile rather than a found bug. D re-pastes code into an already-polluted context; freshening the code does nothing about the accumulated context or the authorship bias — only a fresh, independent context fixes those.
Attention dilution says performance degrades as the context window fills — so a single agent asked to find every class of bug must hold the diff, the surrounding code, and a long checklist of bug categories in one window, and its attention to any one class thins as the others crowd in. Fanning out gives each specialist its own fresh context with a single mandate — race conditions, or injection, or edge cases — so none of them is operating dilated, and each brings full attention to its one class. The fleet trades one over-loaded reviewer for many focused ones; that is the direct architectural answer to dilution (paired, crucially, with a verification pass so the extra candidates don’t become noise).
The failure mode is false-positive amplification: parallel reviewers each independently flag plausible-but-wrong issues, and with no filter those candidates accumulate, so adding reviewers adds noise, not just signal — five agents surface five streams of unverified guesses. The verification pass re-checks each candidate against actual code behavior to filter out false positives before anything is posted, so only findings that survive a behavioral check reach the human. It is what makes fan-out a net gain rather than a faster false-positive generator; the surviving findings are then deduplicated and ranked by severity.
Exam essentials
- Fresh context beats self-review for two independent reasons: attention dilution (performance degrades as the window fills) and implementer bias (a model defends code it just wrote). Independence removes both.
- Writer/Reviewer — one session writes, a second independent session reviews, the writer addresses feedback; the test/code split is a variant; the verification subagent is the single-session, isolated-context form.
- Fleet + verification pass — parallel specialists each own one issue class (the answer to attention dilution); a verification step filters false positives, then dedupe + severity ranking. A fleet without verification amplifies false positives.
- Convergence rules — keep the broadcast instruction block short (“a long REVIEW.md dilutes the rules that matter most”) and damp re-review (“suppress new nits, post Important findings only”) so trivial diffs don’t draw endless passes; this is also cost control.
- Code Review surfaces — research preview, Team & Enterprise only, not under Zero Data Retention; three trigger modes (once after PR creation / after every push / manual via
@claude reviewor one-off@claude review once); severity taxonomy 🔴 Important / 🟡 Nit / 🟣 Pre-existing; the check run is neutral and never blocks a merge — gate by reading the severity breakdown in your own CI.
Long-Conversation Context: Accumulation, Degradation, Compaction
Context is a finite, accumulating resource, and a long conversation degrades before it overflows. This chapter frames the exam angle — cumulative context, the lost-in-the-middle and summarization failure modes, and lossy automatic compaction — and points to the design book where the degradation mechanics are proven in depth.
Part V is the reliability domain — context management, escalation, error propagation, provenance. It opens with the most basic constraint behind all of them: the context window is finite, and a long conversation gets worse before it gets full. This chapter is the cert-exam angle; the mechanics of why long context degrades are proven in depth in the design book, to which it points. It is an architectural pattern — the accumulation-and-compaction shape is stable, while the window sizes and message types are the moving surface.
Context is a finite, accumulating resource
Everything in a session shares one budget. “Context window is cumulative within a session. System prompt, tool definitions, CLAUDE.md, conversation history, tool inputs/outputs all accumulate.” [Official] How the agent loop works · AnthropicT1-official original And the budget is concrete: current windows are 1M tokens on Opus 4.8 and Sonnet 4.6 and 200k on Haiku 4.5 — though Opus 4.8’s tokenizer can consume up to 35% more tokens for the same text, so the same conversation costs more of the budget on one model than another. [Official] Models overview · AnthropicT1-official original
Degradation comes before overflow
The failure that matters is not hitting the limit — it is the quiet decline well before it. As a window fills, a model attends less reliably to material buried in the middle of a long context (the “lost-in-the-middle” effect), and any progressive summarization of earlier turns discards detail that may later turn out to matter. These degradation mechanisms — context rot, lost-in-the-middle, summarization loss — are the subject of the Agentic Systems Design book’s chapter on context rot, where they are established against the research; this chapter’s job is to make you recognize them on the exam.
Compaction: the automatic defense, and its cost
When a session approaches the limit, the loop defends itself: “Automatic compaction triggers near the context limit.” [Official] How the agent loop works · AnthropicT1-official original The defense is lossy by construction: “Compaction replaces older messages with a summary, so specific instructions from early in the conversation may not be preserved. Persistent rules belong in CLAUDE.md (loaded via settingSources) rather than in the initial prompt, because CLAUDE.md content is re-injected on every request.” [Official] How the agent loop works · AnthropicT1-official original Compaction buys room by trading away fidelity to the early conversation — exactly the region most at risk from lost-in-the-middle in the first place.
Where the depth lives
This chapter is the exam-angle surface; the design book owns the mechanism. The degradation research, the measurement of context rot, and the assembly strategies that fight it live in the Agentic Systems Design book — its chapter on context rot for the failure modes and its chapter on context assembly for the deliberate construction of what goes in the window. The exam-relevant skill is diagnostic: given a long-session scenario, name whether it is accumulation pressure, lost-in-the-middle, or post-compaction loss, and reach for the matching mitigation.
Practice
Exercise solutions
B. CLAUDE.md content is re-injected on every request, so a rule placed there is present in the context after compaction just as before it — exactly what a session-long constraint needs. The timing (failure right after a compaction summary) is the tell that the original instruction was summarized away. A works for a few turns but fights the symptom by hand and will fail again at the next compaction. C delays the limit but does not address degradation — a larger window still loses the middle, and a long enough session compacts anyway. D governs output length, not whether an early instruction is retained, so it is unrelated to the failure.
What accumulates: the system prompt, tool definitions, CLAUDE.md, the full conversation history (every user and assistant turn), and all tool inputs and outputs — everything shares one cumulative budget within the session. “Still fits” is not “well-attended” because the token limit is a capacity bound while attention is a quality that declines as the window fills: a model attends less reliably to material buried in the middle of a long context, so a conversation comfortably under the limit can still have effectively lost an instruction given fifty turns ago. Fitting is necessary but not sufficient for the model to be using all of it well.
When a session nears the context limit, automatic compaction triggers and replaces older messages with a summary to buy room. An instruction placed only in the opening prompt is at risk because compaction summarizes the oldest turns first, and a one-line rule from turn one rarely survives the summary intact — so the constraint silently stops being honored (the failure often shows up right after a summary appears). A durable rule belongs where it is re-injected on every request: CLAUDE.md (loaded via settingSources), whose content is re-added to context each request and so is present after compaction exactly as before it — its survival no longer depends on a summarizer’s discretion.
Exam essentials
- Cumulative budget — system prompt, tool defs, CLAUDE.md, conversation history, and tool I/O all accumulate in one finite window (1M tokens on Opus 4.8 / Sonnet 4.6, 200k on Haiku 4.5; tokenizer density varies by model).
- Degradation precedes overflow — lost-in-the-middle and summarization loss erode a long context before it hits the limit; “fits” is not “well-attended.” Depth lives in the design book’s context-rot chapter.
- Compaction is lossy — it triggers near the limit and replaces older messages with a summary, so early-conversation specifics may not be preserved.
- Durable rules belong in re-injected context — put session-long constraints in CLAUDE.md (re-injected every request), not the opening prompt, so compaction cannot strand them.
Escalation and Ambiguity Resolution
When an agent is uncertain or blocked, the reliable architecture makes it surface the decision rather than guess at intent. AskUserQuestion is the structured mechanism, the interview pattern is its proactive form, and the check-in is a control point where a human resolves what the model cannot.
Reliability is not only about catching errors after the fact; it is about not committing to a wrong interpretation in the first place. When intent is ambiguous, a well-built agent asks rather than assumes. This chapter is the exam angle on that discipline — the mechanism and its limits — and points to the handbook for the hands-on use-side workflow. The principle is durable; the tool that carries it is the illustration, so this is an architectural pattern.
Escalate, don’t guess
The foundational move is to surface a decision the model cannot make on its own. “While working on a task, Claude sometimes needs to check in with users. It might need permission before deleting files, or need to ask which database to use for a new project. Your application needs to surface these requests to users so Claude can continue with their input.” [Official] Handle approvals and user input · AnthropicT1-official original An architect’s job is to make those check-ins possible and routine — to design the agent so that hitting an ambiguity raises a question instead of silently resolving it.
AskUserQuestion: structured clarification
The mechanism is a built-in tool with a deliberately bounded shape. Each AskUserQuestion call carries 1–4 questions, each question a header (≤12 characters) and 2–4 options with a label and a description; the response maps each question to the chosen label, and free-text is handled by offering an “Other” choice and passing the typed text rather than the literal "Other".
[Official]
Handle approvals and user input · AnthropicT1-official original The structure is the point: bounded multiple-choice makes the human’s answer fast to give and unambiguous to route back into the agent’s flow.
The application side: the canUseTool callback
AskUserQuestion is the tool Claude calls; canUseTool is the callback your application implements to receive these interruptions and answer them. It fires on two triggers: when Claude wants to use a tool (a permission check) and when Claude calls AskUserQuestion (a clarification).
[Official]
Handle approvals and user input · AnthropicT1-official original So one callback is the single surface through which your app both gates tools and answers questions; it returns PermissionResultAllow / PermissionResultDeny (Python) or { behavior: "allow" | "deny" } (TS).
[Official]
Handle approvals and user input · AnthropicT1-official original
And its response is far richer than yes/no — the docs document six patterns:
Two interactions are worth memorizing. The callback is skipped in dontAsk mode — anything not pre-approved is denied without ever calling it (it is the last step of the permission chain, and dontAsk short-circuits before reaching it).
[Official]
Handle approvals and user input · AnthropicT1-official original And for a human who is not watching the terminal, the PermissionRequest hook can fire an external notification (Slack, email, push) when Claude is waiting on approval.
[Official]
Handle approvals and user input · AnthropicT1-official original
The interview pattern: escalate before you start
Escalation is strongest when it is proactive — ask the questions before any work depends on the answers. That is the interview pattern from D3.5: Claude interviews you with AskUserQuestion, writes a spec, and a fresh session implements from it.
[Official]
Best practices for Claude Code · AnthropicT1-official original The natural home for this is plan mode, since “clarifying questions are especially common in plan mode, where Claude explores the codebase and asks questions before proposing a plan.”
[Official]
Handle approvals and user input · AnthropicT1-official original Front-loading the questions resolves ambiguity while it is still cheap — before a single edit is built on a guessed interpretation.
The check-in as a control point
A clarifying question is also a deliberate pause, and the SDK treats it as one: the canUseTool callback “can stay pending indefinitely” while a human decides, and for long delays the agent can return a "defer" decision that ends the query and resumes later from the persisted session.
[Official]
Handle approvals and user input · AnthropicT1-official original That makes escalation the upstream half of human-in-the-loop — the agent yields control at the moment of uncertainty, and the human resolves what the model could not (the downstream half, routing low-confidence output to review, is D5.5). The hands-on, use-side treatment of when and how to prompt for clarification is the handbook’s territory (its escalation-patterns chapter is forthcoming).
Practice
Exercise solutions
B. The database choice is a genuine intent decision the model cannot infer, so the reliable move is to surface it: AskUserQuestion with a few bounded options gets a fast, unambiguous answer and lets the agent continue on the right branch. A guesses — a reasonable default is still a coin flip on a decision with downstream lock-in (migrations, drivers, hosting). C infers intent from an accident of the environment; what is installed on a build machine is not a statement of what the project should use. D is the over-correction: failing throws away a recoverable situation that one bounded question would resolve. Escalation, not guessing and not giving up, is the pattern.
One AskUserQuestion call carries 1–4 questions; each question offers 2–4 options (each an option a label + description), plus a short header (≤12 characters) and a multiSelect flag. The response associates an answer with its question by mapping the question to the chosen option’s label — { "answers": { "<question text>": "<label>" } } — so the agent routes each selection back to the specific question it answers (a multiSelect answer returns as an array or a comma-joined string). Free-text is handled by offering an “Other” option and passing the typed text rather than the literal "Other".
A subagent cannot call AskUserQuestion — it is an explicit SDK limitation, and a subagent runs in an isolated context with no channel back to the user. So a subagent that discovers an ambiguous spec has no way to escalate: it must guess or fail, the precise outcome escalation exists to prevent. The fix is to restructure the decomposition so the part that needs a human stays with the agent that can reach one: have the coordinator resolve the open questions first (via AskUserQuestion, ideally during a plan-mode interview), then hand the subagent a fully-specified, unambiguous task. Delegate only after intent is pinned — never push an unresolved decision down to an agent that cannot ask about it.
Exam essentials
- Escalate, don’t guess — design the agent to surface an ambiguous or blocked decision rather than silently pick an interpretation; the cost of asking is one round trip, the cost of guessing wrong is the whole task.
AskUserQuestion— 1–4 questions per call, 2–4 bounded options each (label+description, shortheader); the answer maps question to chosen label; free-text via an “Other” option. The bounded shape makes answers fast and routable.canUseTool— the app-side callback — fires on two triggers (Claude wants a tool / Claude callsAskUserQuestion); six response patterns: approve, approve-with-changes (updatedInput), approve-and-remember (PermissionUpdate), reject, suggest-alternative, redirect-entirely. Skipped indontAskmode; thePermissionRequesthook sends external notifications while waiting.- Proactive beats reactive — the interview pattern and plan mode front-load clarifying questions, resolving ambiguity while it is still cheap, before work is built on a guess.
- Subagents cannot escalate —
AskUserQuestionis unavailable in subagents, so resolve open questions in the coordinator before delegating a fully-specified task. - Check-in as control point — a clarifying question is a deliberate pause; the callback can stay pending or defer-and-resume, making escalation the upstream half of human-in-the-loop (D5.5 is the downstream half).
Error Propagation Across Multi-Agent Systems
In a chain of agents an error does not stay local — an upstream ambiguity becomes a downstream wrong decision, and concurrent faults compound into degradation no single component test reproduces. The defenses are structured error context across boundaries, independent validation, and circuit breakers.
A single agent fails visibly; a pipeline of agents fails by degrees. An error introduced at one stage rarely announces itself — it rides the handoff to the next stage as if it were sound input, and concurrent faults aggregate into a degradation that looks like nothing in particular. This chapter is about that propagation and the architecture that contains it. The pattern is durable; the percentages that illustrate it are evidence, so this is an architectural pattern.
Failures compound across agent boundaries
Multi-agent systems fail far more often than their individual agents do. One practitioner analysis, drawing on the MAST taxonomy of 1,600-plus execution traces, reports that “multi-agent LLM systems fail at rates between 41-86.7% in production because specification ambiguity and unstructured coordination protocols cause agents to misinterpret roles, duplicate work, and skip verification.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original That taxonomy sorts the breakdowns into three categories covering most of them: specification problems (41.77%), coordination failures (36.94%), and verification gaps (21.30%). [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original
An upstream ambiguity becomes a downstream wrong decision
The propagation mechanism is specific. “Agents cannot read between lines, infer context, or ask clarifying questions during execution. Every ambiguity becomes a decision point where agents explore all possible interpretations, selecting suboptimal ones.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original This is the multi-agent counterpart to D5.2’s escalation problem: an interactive agent can pause and ask, but a mid-pipeline agent usually has no one to ask, so it resolves the ambiguity — and a quietly-wrong resolution is handed downstream as if it were settled fact. The next agent has no signal that the input it received was a guess.
Compounding failures evade per-component testing
When multiple faults run at once, their aggregate is not the sum of their symptoms — and that is what makes them hard to catch. Anthropic’s April-23 postmortem describes three production bugs with distinct, partly-overlapping windows: a reasoning-effort default change (Mar 4–Apr 7), a caching bug that broke thinking blocks (Mar 26–Apr 10), and a verbosity-reduction prompt (Apr 16–Apr 20). [Official] An update on recent Claude Code quality reports · AnthropicT1-official original Those windows union to a roughly seven-week span (Mar 4–Apr 20) — but that is the aggregate reach, not a stretch in which all three ran at once: the first two overlapped, while the third began only after both had been fixed. Even so, the combined effect “looked like broad, inconsistent degradation” that no single bug’s symptom resembled, and the most stubborn one — the caching bug — crossed context management, the API layer, and the extended-thinking system. [Official] An update on recent Claude Code quality reports · AnthropicT1-official original Detection was the central lesson: the bugs hit different traffic slices, and neither internal usage nor the existing eval suite reproduced them. [Official] An update on recent Claude Code quality reports · AnthropicT1-official original
Defenses: structured context, independent validation, circuit breakers
The countermeasures all push against under-specified, unchecked boundaries. The practitioner remedies are to convert prose specs into machine-validatable schemas, to enforce typed and schema-validated messages between agents (with MCP named as the schema-enforced substrate), to deploy isolated judge agents for independent validation, and to add circuit breakers that isolate a misbehaving agent before it cascades. [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original The payoff is concrete: “PwC achieved a 7x accuracy improvement (10% to 70%) through structured validation loops.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original
The single-agent equivalent is the postmortem’s own remedy — broad per-model evaluations on every prompt change, ablation testing, and soak periods before rollout [Official] An update on recent Claude Code quality reports · AnthropicT1-official original — the same instinct as independent validation, applied to a pipeline of one.
Practice
Exercise solutions
B. The failure is a boundary failure: an ambiguous handoff that no stage is positioned to catch. Fixing the boundary — a typed, schema-validated spec so there is less to misinterpret, plus an independent validation step that checks the coder’s output against that spec — intercepts the wrong interpretation before it reaches the reviewer. A makes one agent smarter but leaves the ambiguous interface intact; a better coder still has to guess at an under-specified spec. C asks the reviewer to work harder while still reading only the code, blind to the spec it was meant to satisfy. D doubles cost and gives two artifacts to compare with no oracle for which is right — a wrong-but-consistent interpretation reproduces on the retry.
The three MAST categories are specification problems (41.77%), coordination failures (36.94%), and verification gaps (21.30%); specification problems are the largest. Specification problems propagate especially badly in a pipeline because a mid-pipeline agent cannot pause to ask — an under-specified handoff becomes a guess the agent resolves and passes downstream as settled fact, so a single ambiguity at the top seeds a wrong decision that every later stage treats as valid input.
A compounding cross-system failure can pass every per-component evaluation because it lives between components, not inside any one: each part passes its own eval in isolation, and the degradation emerges only from their interaction, on traffic slices no single test exercises. In the April-23 case three bugs with overlapping windows (the union running Mar 4 – Apr 20) produced “broad, inconsistent degradation” that no single bug’s symptom resembled, and neither internal usage nor the existing eval suite reproduced it. The postmortem concluded that integration-level testing was needed instead — broad per-model evaluations on every prompt change, ablation testing, and soak periods before rollout — testing the system as it actually runs rather than each unit alone.
Exam essentials
- Failures compound across boundaries — multi-agent systems fail at much higher rates than their parts (the MAST taxonomy: specification 41.77%, coordination 36.94%, verification 21.30%); a chain’s reliability is the product of its handoffs, not its best agent.
- Ambiguity propagates silently — a mid-pipeline agent cannot pause to ask (unlike D5.2’s interactive escalation), so it resolves an ambiguity and passes the guess downstream as settled fact.
- Compounding failures evade unit tests — they live between components, on traffic slices no single test exercises; all-green component evals do not prove a multi-stage system healthy.
- Defenses — structured error context across boundaries (D2.2), independent validation / isolated judges (D4.6), circuit breakers to stop cascades, and keeping the escalable decision at the coordinator (D5.2); the single-agent analog is broad evals + ablation + soak periods.
Large-Codebase Context: Compaction, Scratchpads, Delegation
A large codebase has more relevant files than any window holds, and reading them all is the trap, not the solution. The three levers that extend the horizon are compaction, scratchpads that externalize state to disk, and subagent delegation that pays exploration cost in a separate context.
A large codebase has more relevant code than any context window can hold, so the question is never “how do I fit it all in” but “how do I keep the right slice in and the rest out.” This chapter is the exam-angle inventory of the levers; the design book owns the at-scale mechanics. It is an architectural pattern — the levers are stable, the commands and hooks that drive them are the surface.
The codebase does not fit, and reading it all is the trap
Context is cumulative and finite (D5.1): conversation history and every tool input and output accumulate in the window over a session. [Official] How the agent loop works · AnthropicT1-official original A large codebase has far more relevant files than that window holds, and the naive response — have the agent read everything “to be safe” — is itself the failure: it fills the window with material that crowds out the work and degrades the model’s attention to what matters.
Compaction extends a long session
When a working session approaches the limit, compaction reclaims room by summarizing older turns, and it is both automatic and steerable. Automatic compaction triggers near the context limit,
[Official]
How the agent loop works · AnthropicT1-official original and three knobs customize it: a “Summary instructions” section in CLAUDE.md that tells the compactor what to preserve, the PreCompact hook that runs before compaction, and a manual /compact sent on demand.
[Official]
How the agent loop works · AnthropicT1-official original
Compaction is lossy (D5.1), so the same caution applies: durable rules belong in re-injected CLAUDE.md, not in turns the summary may discard.
/compact vs /clear, and the scratchpad beneath both
Two commands free context, and confusing them wastes work. /compact [instructions] “frees context by summarizing” — it condenses the conversation in place, so you keep going on the same task with a shorter history, and the optional instructions focus what the summary keeps.
[Official]
Commands · AnthropicT1-official original /clear instead starts a fresh conversation — it discards the working context entirely, with the previous one still available via /resume (aliases /reset, /new).
[Official]
Commands · AnthropicT1-official original
The decision rule is continuity. Reach for /compact to continue a task whose history has grown long but is still relevant. Reach for /clear to switch to an unrelated task — or when a session is cluttered with failed approaches, the D3.5 rule that after more than two corrections on the same issue, “a clean session with a better prompt almost always outperforms a long session with accumulated corrections.”
[Official]
Best practices for Claude Code · AnthropicT1-official original Compaction keeps a lossy summary; clearing keeps nothing in the window at all.
Delegation pays exploration cost in another window
The lever that matters most for breadth is delegation. “Since context is your fundamental constraint, subagents are one of the most powerful tools available. When Claude researches a codebase it reads lots of files, all of which consume your context. Subagents run in separate context windows and report back summaries.” [Official] Best practices for Claude Code · AnthropicT1-official original A subagent can read the twenty files that answer “where is auth enforced,” and the main agent receives the three-line answer rather than the twenty files — the exploration cost is paid in the child’s window and discarded with it. Scratchpads are the complementary move: state written to a file (D1.7) lives on disk, not in the window, and the main agent reads it back only when needed.
Where the depth lives
This chapter is the exam-angle inventory; the design book owns the at-scale mechanics. The engineering of context at codebase scale — retrieval, the discipline of what to assemble into a window, and the cost trade-offs of fan-out — lives in the Agentic Systems Design book’s chapters on the environment at scale and context assembly. The exam-relevant skill is selecting the lever: compaction for a long single session, delegation for breadth of exploration, scratchpads for state that must outlive a compaction.
Practice
Exercise solutions
B. Delegation pays the exploration cost in the subagents’ separate context windows and returns only summaries, so the main session learns where authentication flows without absorbing dozens of files — exactly the “subagents run in separate context windows and report back summaries” pattern. A is the trap this chapter names: reading everything into the main window fills it with material that crowds out the actual change and dilutes attention. C buys a bigger budget but still spends it on noise, and a larger window degrades on irrelevant bulk just the same. D fights symptoms — compacting mid-exploration repeatedly summarizes away the very findings you are gathering, and is no substitute for never loading the bulk into the main window in the first place.
The three are (1) a CLAUDE.md “Summary instructions” section, (2) the PreCompact hook, and (3) manual /compact. The one that steers what content survives the summary is the CLAUDE.md “Summary instructions” section — the compactor reads CLAUDE.md like any other context, so a section describing what to preserve directs what the summary keeps. The PreCompact hook runs before compaction (e.g. to archive the full transcript) and manual /compact controls when it happens, not what survives — though /compact’s optional focus instructions also nudge the summary’s content.
Dispatching a subagent protects the main agent’s context because the subagent runs in its own separate context window: it reads the files needed to answer the question there, spending that exploration cost against its own budget, and that window is discarded when it returns. The main agent receives back only a summary — the three-line answer (“auth is enforced in middleware X via pattern Y”), not the twenty files the subagent read — so the main context learns the conclusion without absorbing the bulk that produced it. Exploration cost is paid in the child window and thrown away with it.
Exam essentials
- Reading it all is the trap — a large codebase exceeds any window; loading everything “to be safe” fills the context with noise and degrades attention. Load the task’s slice, keep the rest out.
- Compaction — triggers automatically near the limit and is steerable three ways: a CLAUDE.md “Summary instructions” section (steers what survives), the
PreCompacthook, and manual/compact. /compactvs/clear—/compactcondenses the conversation to continue the same task (keeps a lossy summary);/clearstarts a fresh conversation for an unrelated task or after >2 failed corrections (previous in/resume; aliases/reset,/new). A disk scratchpad (D1.7) survives both — neither command touches a file.- Delegation — subagents read files in their own context windows and report back summaries, so exploration cost is paid in the child window, not the main one; scratchpads (D1.7) park state on disk.
- Pick the lever — compaction for a long single session, delegation for breadth of exploration, scratchpads for state that must outlive a compaction; depth lives in the design book’s at-scale and context-assembly chapters.
Human Review and Confidence Calibration
Not every output earns automatic trust. The architect calibrates which results proceed and which route to a human, using checkable confidence signals and a tiered funnel — cheap auto-checks, then an isolated judge, then a person — so the human sees only the decisions where their judgment changes the outcome.
D4.4 closed the validation loop with the model — detect a semantic error, feed it back, retry. This chapter closes the other loop: when automation is not enough, route to a human. The architect’s job is calibration — deciding, per output, whether to trust it, verify it automatically, or escalate it to a person. The routing-and-funnel pattern is durable; the specific fields are illustration, so this is an architectural pattern.
Not every output earns automatic trust
The cheapest reliability move is to give the model a way to check itself: “Include tests, screenshots, or expected outputs so Claude can check itself. This is the single highest-leverage thing you can do.” [Official] Best practices for Claude Code · AnthropicT1-official original But some judgments cannot be auto-verified — a wrong-but-plausible extraction, a borderline classification, a high-stakes decision with no ground truth to diff against. Those are where a human belongs. Confidence calibration is the discipline of deciding, output by output, which path each one takes.
Confidence as a routing signal
To route, you need a signal you can act on — and the reliable signals are checkable, not self-reported. The schema hooks from D4.4 double as confidence signals. (These are a design pattern this book recommends, not a built-in platform field — you add them to your schema.) When a model’s calculated_total disagrees with the document’s stated_total, both demand human review; when a conflict_detected flag is true, the record routes to a person; and a structured confidence field (high / medium / low) on an extraction gives the caller an explicit value to threshold on. Each is a place where the system can say “I am not sure” in a form a router can read.
Two senses of “calibration”
The word is doing double duty in this chapter, and the distinction is worth making sharp. There is the routing calibration the chapter is built on — which output goes to which tier — and there is measurement calibration: whether a confidence value actually tracks accuracy. A model is well-calibrated, in the measurement sense, only if its “90% confident” outputs are right about 90% of the time. Self-reported confidence usually fails this test — models tend to be over-confident, reporting high certainty on answers that are wrong — which is precisely why a raw “high” cannot gate the human queue on its own.
Independent validation before the human
Between auto-accept and the human sits an automated reviewer tier: an isolated judge. Independent validation — “deploy isolated judge agents” — is one of the practitioner-recommended defenses, and the gains are real: “PwC achieved a 7x accuracy improvement (10% to 70%) through structured validation loops.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment CodeT3-practitioner original The judge works for the same reason the D4.6 reviewer does: a fresh, isolated context has no authorship bias toward the output it is checking. Its job is to filter — resolve what it can, escalate only what it cannot — so the human queue stays small and high-value.
Calibrating the threshold
Where the human-review line falls is the calibration, and it is set by stakes times uncertainty. A low-stakes, high-confidence output proceeds automatically; a high-stakes, low-confidence one goes to a person; the middle is where the judge tier earns its keep. This makes human review the downstream half of human-in-the-loop, with escalation (D5.2) as the upstream half — the agent asks before acting when intent is unclear, and a human checks after producing when confidence is low. Calibrate the thresholds so a reviewer sees the few outputs where their judgment changes the outcome, and nothing it does not.
Practice
Exercise solutions
B. The design calibrates: checkable signals (confidence, conflict_detected) select which records are uncertain, an isolated judge clears the merely-plausible ones, and only what the judge cannot resolve reaches a human — so high-stakes errors are caught without reviewing everything. A retries with the same model, and a confidently-wrong extraction reproduces on the second run, so agreement is not evidence of correctness. C trusts self-reported confidence, which a confidently-wrong output also reports as “high” — the exact trap. D is safe but uncalibrated: it spends the most expensive reviewer on every record, most of which need no human, and does not scale.
Three routable signals: a cross-check mismatch (calculated_total ≠ stated_total), a self-flagged conflict_detected: true, and a low stated confidence (a failed provenance check — a cited span absent from the source — is a fourth). A model’s self-reported “high confidence” is not reliable on its own because it is a claim, not a measurement: a confidently-wrong output reports high confidence too, and models tend to be over-confident, so stated confidence often fails to track actual accuracy. The signals worth routing on are the checkable ones — a cross-check either matches or it does not, independent of what the model believes — whereas self-reported confidence must first be empirically calibrated against observed accuracy before it can gate anything.
The funnel is auto-check → isolated judge → human. Tier 1 (cheap automated checks) handles the obvious cases — a cross-check mismatch or a thresholded signal — and escalates anything it cannot clear. Tier 2 (an isolated judge agent) reviews the merely-plausible cases in a fresh, independent context, resolving what it can and escalating only what it cannot. Tier 3 (the human) sees only what survived both. The isolated judge catches errors the cheap auto-checks miss because those errors are wrong-but-plausible — they pass the mechanical checks (valid shape, no flagged conflict) yet are semantically wrong, and a fresh-context reviewer with no authorship bias can judge correctness where a regex or equality test cannot. Each tier escalates only what it cannot resolve, so the most expensive reviewer spends attention only on the decisions that truly need human judgment.
Exam essentials
- Calibrate, don’t blanket-trust or blanket-review — decide per output by weighing the cost of a wrong auto-accept against a human glance; verification is the highest-leverage habit where it is possible.
- Route on checkable signals — cross-check mismatches (
calculated_total≠stated_total), self-flaggedconflict_detected, lowconfidence, and failed provenance are routable signals; self-reported confidence alone is a claim, not a measurement. Two senses of calibration: routing (which output to which tier) and measurement (does “90% confident” mean 90% correct?). Empirically calibrate aconfidencefield — accuracy per stated level — or prefer checkable signals that need no calibration. - Tiered funnel — cheap auto-checks → isolated judge (fresh context, no authorship bias) → human; each tier escalates only what it cannot resolve, keeping the human queue small (structured validation loops drove a documented 7× accuracy gain).
- Threshold by stakes × uncertainty — high-stakes + low-confidence routes to a human; human review is the downstream half of human-in-the-loop, escalation (D5.2) the upstream half.
Information Provenance: Citations and Temporal Validity
An architect tracks where each claim came from and when its data is valid. The native Citations API ties quoted text to real document spans so a source cannot be fabricated; the provenance triple is the schema-friendly fallback; and a model's knowledge cutoff bounds what it can be trusted to know without a dated source.
The book closes on the question that underlies trust in any agent output: where did this claim come from, and is it still true? Provenance answers the first — a claim mapped to its source is auditable, one without a source is a trust-me. Temporal validity answers the second. This chapter is the exam-angle treatment; the named features — the Citations API surface, the location modes — are the moving parts, so it is a feature surface.
Provenance maps every claim to its source
The point of provenance is verifiability. “Claude is capable of providing detailed citations when answering questions about documents, helping you track and verify information sources in responses. All active models support citations, with the exception of Haiku 3.”
[Official]
Citations · AnthropicT1-official original You enable it per document with citations: {"enabled": true} on the document block, and each cited claim in the response carries a sibling citations array pointing back to the exact span of the source it came from.
[Official]
Citations · AnthropicT1-official original One enablement rule to memorize: citations must be enabled on all or none of the documents within a request — you cannot mix cited and uncited documents.
[Official]
Citations · AnthropicT1-official original
The Citations API and its location modes
How a citation points at its source depends on the document type, and there are three modes. Plain text is chunked to sentences and cited by char_location; a PDF is cited by page_location; custom content, where you supply the chunks, is cited by content_block_location.
[Official]
Citations · AnthropicT1-official original The feature is also output-cheap: “the cited_text field is provided for convenience and does not count towards output tokens.”
[Official]
Citations · AnthropicT1-official original
The provenance triple: schema-friendly fallback
When the output must be structured JSON, the native Citations API is off the table, so you encode provenance into the schema yourself. This is the D4.4 hook applied to attribution — a design pattern this book recommends, not a platform feature: each extracted claim carries a source object with a document_id, a span_quote, and a confidence, and the caller verifies that span_quote actually appears in document_id. If it does not, the model fabricated the citation. It is a manual, checkable provenance you can drop inside any schema.
Temporal provenance: knowing when data is valid
Provenance is not only where a claim came from but when it can be trusted. Each model has a reliable knowledge cutoff — Opus 4.8 at January 2026, Sonnet 4.6 at August 2025, Haiku 4.5 at February 2025 — and that reliable cutoff is earlier than (or equal to) the model’s training-data cutoff, not later: Sonnet 4.6 trained on data through January 2026 but is reliable only to August 2025, and Haiku 4.5 trained through July 2025 but is reliable to February 2025. [Official] Models overview · AnthropicT1-official original Data near the training cutoff is sparse, so the model’s reliable knowledge stops before its training does. Past the reliable cutoff the model has no dependable knowledge, so a time-sensitive fact must come from a dated source supplied at request time (retrieval with a citation), not from the model’s memory. The use-side workflow for recording claim sources and decision dates is the handbook’s territory (its provenance and ADR material is forthcoming).
Practice
Exercise solutions
B. Citations and Structured Outputs are mutually exclusive — the 400 is the API telling you so — so when the structured shape is required, you encode provenance into the schema with a triple (document_id, span_quote, confidence) and verify each span against its source caller-side. A abandons the structured output the pipeline requires, trading one requirement for the other. C is the fabrication trap: an unverified span_quote may quote text that is not in the document, which is exactly the failure provenance exists to catch. D doubles cost and leaves you reconciling two responses with no guarantee the cited run and the schema run extracted the same facts.
The three modes are plain text → char_location (sentence-chunked; start/end character indices), PDF → page_location (start/end page numbers; scanned images without extractable text are not citable), and custom content → content_block_location (you supply the chunks; start/end block indices). cited_text is attractive on output cost because the field “is provided for convenience and does not count towards output tokens” — you get the quoted source span echoed back for verification without paying output tokens for it.
A question about an event after the model’s reliable knowledge cutoff should be answered from a supplied dated source because past that cutoff the model has no dependable knowledge — it may produce a plausible but fabricated answer. Crucially, the reliable cutoff is earlier than the training-data cutoff, so even data the model technically trained on near the boundary is unreliable; the earlier date is the one that bounds trust. Supplying the fact as a dated source at request time (retrieval plus a citation) makes the answer both correct and auditable. That connects to provenance broadly: provenance answers two questions — where a claim came from (a source span, via Citations or the triple) and when it is valid (a dated source past the cutoff). A time-sensitive claim needs both: an external dated source, bound to the answer by a verifiable citation.
Exam essentials
- Provenance is verifiability — the Citations API (enable per document with
citations: {"enabled": true}) ties each claim to a source span so a citation cannot be fabricated (span-bound, not grammar-constrained);cited_textdoes not count toward output tokens. Citations must be enabled on all or none of a request’s documents. - Three location modes — plain text →
char_location, PDF →page_location, custom content →content_block_location; image citations are not yet supported. - Mutual exclusion — Citations + Structured Outputs return 400; when you need both a schema and provenance, use the provenance triple (
document_id+span_quote+confidence) and verify the span caller-side. - Temporal provenance — each model has a reliable knowledge cutoff (Opus 4.8 Jan 2026, Sonnet 4.6 Aug 2025, Haiku 4.5 Feb 2025), which is earlier than (or equal to) the training-data cutoff — the model trains on later data but is only reliable to the earlier date (Sonnet 4.6 trained to Jan 2026, reliable to Aug 2025). Past the reliable cutoff, answer time-sensitive questions from a dated source, not the model’s memory.