Answers & rationales

75 questions, grouped by chapter, with answers and rationales expanded. Test yourself first in the practice question bank.

Chapter d1-01

d1-01-loop-end-turn Agentic Architecture & Orchestration ↑ question in bank

The agent loop repeats the tool-use round-trip until the model replies without a tool call. Which stop_reason marks that final reply?

Answer & rationale

The loop branches on stop_reason: tool_use means “execute the call and continue the loop,” and the loop ends when Claude responds with stop_reason: "end_turn" — a normal completion carrying no tool call. max_tokens (truncation) and stop_sequence (a stop string was hit) are other ways a response can end, but neither is the “task finished” signal the loop waits for.

`end_turn` ✓
`tool_use`
`max_tokens`
`stop_sequence`

Correct: `end_turn`

d1-01-parallel-results-batch Agentic Architecture & Orchestration ↑ question in bank

A handler runs two independent read-only queries concurrently in response to one assistant turn that contained two tool_use blocks. The slower query finishes a few seconds after the faster one. To keep the next request well-formed, how should the handler send the results back?

Answer & rationale

When a single response contains more than one tool_use block, the API requires every corresponding tool_result in the next user message — you collect all results and send them as one batch, each matched to its call by tool_use_id. Waiting for the slower query before sending is correct; (a) describes exactly that. (b) defers one result to a later turn, which the chapter rules out: you cannot answer one tool now and the other later. (c) collapses two results into one block, breaking the per-call tool_use_id join key that the next request needs. (d) drops one call’s result entirely and assumes the model will re-request it — but an unanswered tool_use block leaves the next request malformed, not merely incomplete.

Collect both results and return them in the next user message, each keyed by its tool_use_id ✓
Return the first result now and send the second in a later user message once it resolves
Merge both results into a single tool_result block so the next request stays well-formed
Return whichever result finished first and let the model re-request the other on the next turn

Correct: Collect both results and return them in the next user message, each keyed by its tool_use_id

d1-01-tool-error-return Agentic Architecture & Orchestration ↑ question in bank

Inside your tool handler the call throws an exception. To let the model see the failure and recover, what do you send back on the next turn?

Answer & rationale

A failed call still returns a tool_result — but flagged is_error: true, with actionable content (e.g. ConnectionError: weather API unavailable (HTTP 500)), so the next turn can reason about what to do. Omitting the result breaks the round-trip (every tool_use needs a matching tool_result); HTTP statuses are not how a handler signals the model; and tool_use blocks come from the model, not from your code.

A `tool_result` block with `is_error: true` ✓
An HTTP 500 status to the caller
Nothing — omit the `tool_result` so the model retries
A fresh `tool_use` block describing the failure

Correct: A `tool_result` block with `is_error: true`

Chapter d1-02

d1-02-artifacts-pattern-return Agentic Architecture & Orchestration ↑ question in bank

A subagent gathers 4,000 tokens of raw source material, but the coordinator only needs its short conclusion to keep going. Per the chapter, how should the subagent return its work so the coordinator’s window stays lean?

Answer & rationale

This is the artifacts pattern: because results must cross a context boundary, large outputs are written to the filesystem or external storage and a lightweight reference is passed back — the high-fidelity output lives outside the coordinator’s window until it is actually needed. (b) defeats the whole reason the pattern exists: streaming everything back repollutes the lead’s window with the noise isolation was meant to keep out. (c) is worse still — the system prompt is the most expensive, always-resident slot. (d) is conversational compaction, not the multi-agent mechanism; it still spends tokens every turn, whereas a reference is read only on demand.

Write the full output to the filesystem (or external storage) and return a lightweight reference to it ✓
Stream all 4,000 tokens back to the coordinator so nothing is lost
Append the raw material to the coordinator's system prompt for the rest of the run
Summarize the 4,000 tokens to ~200 and re-inject that summary on every later turn

Correct: Write the full output to the filesystem (or external storage) and return a lightweight reference to it

d1-02-decompose-context-not-role Agentic Architecture & Orchestration ↑ question in bank

A team must research three unrelated vendors and produce one comparison. An engineer proposes four subagents — researcher → analyst → writer → fact-checker — passing the task down the line. Applying the chapter’s decomposition rule, what is the better way to cut this work?

Answer & rationale

Decompose by context, not by role. The three vendors are independent research paths — a true context boundary — so the win is one isolated subagent per vendor, with the coordinator synthesizing once it holds all three (synthesis stays in one window because comparing needs every result at once). The researcher → analyst → writer → fact-checker split (b) is problem-centric: sequential phases of one tightly-coupled task that lose fidelity at every handoff — “the telephone game.” (c) nests that same anti-pattern inside the good split, multiplying handoff loss. (d) is wrong because there is a clean boundary to exploit — and parallelization to gain.

Spawn one subagent per vendor (independent research paths), then have the coordinator synthesize the comparison in a single window ✓
Keep the researcher → analyst → writer → fact-checker pipeline — it cleanly separates concerns
Run the four-role pipeline inside one subagent per vendor, for maximum specialization
Don't split at all — the comparison is too coupled for subagents to help

Correct: Spawn one subagent per vendor (independent research paths), then have the coordinator synthesize the comparison in a single window

d1-02-verifier-early-victory Agentic Architecture & Orchestration ↑ question in bank

You add a verification subagent to blackbox-test your agent’s output, but it keeps declaring success after a single check. What is the chapter’s documented mitigation for this failure mode?

Answer & rationale

The verification subagent is the one multi-agent shape that succeeds across domains, and its characteristic failure is early victory — declaring success after one or two checks. The documented fix is an explicit completeness instruction: “You MUST run the complete test suite before marking as passed.” (b) destroys the property that makes a verifier reliable — its isolation; a blackbox verifier should get minimal context and hold no stake in how the work was produced. (c) adds cost and more failure points without addressing the premature-stop behavior. (d) misdiagnoses the cause: the verifier stops early by choice, not for want of budget.

Instruct it explicitly — e.g. "You MUST run the complete test suite before marking as passed" ✓
Give the verifier the full transcript of how the work was produced, so it can judge in context
Replace the single verifier with three verifiers and take a majority vote
Raise the verifier's token budget so it has room to keep checking

Correct: Instruct it explicitly — e.g. "You MUST run the complete test suite before marking as passed"

Chapter d1-03

d1-03-depth-one-limit Agentic Architecture & Orchestration ↑ question in bank

A subagent that analyzes a module wants to hand a slice of its work to a further helper agent, giving you a coordinator-to-subagent-to-sub-subagent chain. How should the architecture deliver that third layer?

Answer & rationale

Delegation is one level deep: subagents cannot spawn subagents. If a task seems to need a third layer, orchestrate that layer from the parent, not from inside a child. (b) is the documented anti-pattern — putting Agent in a child’s tools attempts a three-level hierarchy, but the depth-1 limit means the nested call produces no grandchild. (c) misdiagnoses the cause as a budget problem; maxTurns caps a subagent’s own turns and cannot create a missing level of delegation. (d) addresses the match gate, which is irrelevant here — even a perfect description cannot let a child invoke a grandchild past the structural limit.

Orchestrate the extra layer from the parent; subagents cannot spawn subagents ✓
Add "Agent" to the analyzer's tools array so it can invoke its own helper
Raise the analyzer's maxTurns so it has budget to run the nested delegation
Give the analyzer a keyword-rich description so the helper is matched automatically

Correct: Orchestrate the extra layer from the parent; subagents cannot spawn subagents

d1-03-fresh-context-channel Agentic Architecture & Orchestration ↑ question in bank

You delegate to a subagent with the prompt “fix the bug we discussed,” and it flounders with no idea what bug you mean. To make the next delegation succeed, where must the specifics go?

Answer & rationale

A subagent starts in a fresh context window with no parent conversation, so “the bug we discussed” never crossed the boundary; the only inbound channel is the Agent tool’s prompt string. File paths, error messages, and prior decisions must be written into that prompt or they do not exist for the child. (b) is the exact misconception the chapter corrects — the parent conversation and tool results are among the things that do not cross. (c) misreads model: it only selects the engine and carries no conversation; even inherit transmits no prior turns. (d) confuses the run gate with a content channel — allowedTools decides whether the call is approved, not what context the child can see.

Write the file path and error text directly into the Agent-tool prompt string ✓
Rely on the parent conversation, which the subagent inherits as shared history
Set the subagent's model to inherit so it picks up the parent's prior turns
Add the subagent's name to allowedTools so the discussed context is approved

Correct: Write the file path and error text directly into the Agent-tool prompt string

d1-03-never-delegates-gate Agentic Architecture & Orchestration ↑ question in bank

You define a focused doc-reviewer subagent with a specific, keyword-rich description, yet it never triggers — the main agent just reviews the file inline. The parent’s allowedTools is ["Read", "Edit", "Bash"]. Which fault best explains why nothing is delegated?

Answer & rationale

A subagent passes through two independent gates: the description is the match gate and allowedTools is the run gate — you need both. Here the description is already specific, so matching is fine; the listed allowedTools has no "Agent" entry, so the Agent-tool call is never auto-approved and falls through to canUseTool (or is denied in dontAsk). The fix is to add "Agent". (b) inverts the depth rule: you should not put Agent in a child’s tools — that array scopes what the child may use, not whether it can be invoked. (c) is wrong because model is optional and defaults; an unset model does not block invocation. (d) names the match gate, but the stem states the description is specific and keyword-rich, so matching is not the failing gate here.

"Agent" is missing from the parent's allowedTools, so the call is never approved ✓
The subagent's tools array omits "Agent", so it cannot accept the delegated call
The subagent's model field is unset, so Claude has no engine to route the task to
The subagent's prompt is too generic, so Claude cannot match the task to it

Correct: "Agent" is missing from the parent's allowedTools, so the call is never approved

Chapter d1-04

d1-04-enforcement-locus Agentic Architecture & Orchestration ↑ question in bank

A compliance pipeline runs the same four steps in the same order every time, must produce an audit trail, and needs a check between steps that a malformed output cannot slip past. Where should this workflow’s control flow live?

Answer & rationale

The chapter’s Evaluate-level rule says to enforce programmatically when the workflow needs determinism, an audit trail, validation gates between steps, or a fixed and repeatable sequence — exactly this scenario, where the steps are known in advance and you want them to run the same way every time. (b) inverts the rule: prompt-based is the choice when the path is flexible and orchestration code would cost more than it saves, neither of which holds here. (c) invents a constraint the chapter denies — Anthropic notes the same workflow “translate[s] directly” between the CLI and the SDK, so translatability is not the deciding factor. (d) misstates when determinism matters: a fixed, known-in-advance sequence is precisely the case that wants deterministic enforcement, not the case that excuses it.

Enforce it programmatically, because a fixed repeatable sequence that needs an audit trail and gates is what your code is for ✓
Stay prompt-based, because the model can self-direct any sequence more cheaply than orchestration code
Enforce it programmatically, because prompt-based workflows cannot be translated to the SDK
Stay prompt-based, because determinism only matters when the steps are not known in advance

Correct: Enforce it programmatically, because a fixed repeatable sequence that needs an audit trail and gates is what your code is for

d1-04-fresh-context-reviewer Agentic Architecture & Orchestration ↑ question in bank

Your quality workflow has Session A write a rate limiter and Session B review it, and a teammate proposes saving effort by handing Session B all of Session A’s prior context. In the Writer/Reviewer pattern, why must the reviewer not inherit the writer’s context?

Answer & rationale

In the Writer/Reviewer pattern the absence of inheritance is the feature: a fresh context “won’t be biased toward code it just wrote” and “cannot rationalize choices it never made,” which is what makes the review independent. (b) invents a token-budget reason the chapter never gives — the point is bias, not context size. (c) describes early victory, a different failure mode from D1.2; here the issue is that an inheriting reviewer defends its own choices, not that it stops too soon. (d) is the biased case the pattern exists to avoid — handing the reviewer the writer’s transcript is exactly what reintroduces the bias the fresh context removes.

A fresh context cannot rationalize choices it never made, so it judges the work independently ✓
The writer's context is too large to pass to the reviewer without exceeding the token budget
Inheriting the context would let the reviewer finish faster and declare success too early
The reviewer needs the writer's full transcript to judge the code accurately in context

Correct: A fresh context cannot rationalize choices it never made, so it judges the work independently

d1-04-schema-only-gate Agentic Architecture & Orchestration ↑ question in bank

A pipeline validates every inter-step output against a JSON schema, yet it still ships data with a total that does not add up. The output parses cleanly and has every required field. What does the gate need to catch this?

Answer & rationale

A schema check confirms the shape and nothing else — the chapter warns that a structurally perfect output can be semantically wrong (valid JSON, fabricated data), so the gate must pair the structural check with a semantic one wherever the content can be wrong, not just the form. A total that doesn’t add up is the chapter’s own example of a semantic failure. (b) cannot work: no amount of field or type strictness detects that correctly-typed numbers fail to sum — that is a content question, not a shape question. (c) is the silent-failure path the gate exists to prevent; on failure the gate should reject and retry, not wave the output through to publish. (d) over-corrects — the structural check is cheap and still catches malformed shapes; the chapter says to pair the two checks, not swap one for the other.

Add a semantic check, because a structurally valid output can still carry fabricated or contradictory content ✓
Tighten the schema with more required fields and stricter types until the bad data fails validation
Pass the output downstream and let the publish step flag any content that looks wrong
Replace the schema check with a semantic check, since structural validation adds no value

Correct: Add a semantic check, because a structurally valid output can still carry fabricated or contradictory content

Chapter d1-05

d1-05-defer-pauses-query Agentic Architecture & Orchestration ↑ question in bank

A teammate writes a PreToolUse hook that returns defer carrying an updatedInput, expecting the rewritten command to then run. Will the command run, and will the rewrite take effect?

Answer & rationale

defer is the decision candidates miss: it ends the query so the host can resume it later from the persisted session — a pause-and-hand-back, not an allow — so the command does not run now. And updatedInput is ignored with defer; that field applies only to allow (or ask). To run a rewritten command the hook must return allow with updatedInput. (b) is the teammate’s mistaken model — defer neither permits nor applies the rewrite. (c) is closer but still wrong: defer hands the whole query back to the host rather than queuing the call for the next turn, and the rewrite is dropped either way. (d) confuses two distinct decisions — defer pauses and outranks ask, but a deny is what blocks a call outright, and deny outranks defer.

Neither: defer ends the query for the host to resume later, and updatedInput is ignored with defer ✓
Both: defer permits the call and the updatedInput rewrites it before it runs
The rewrite holds but the call waits: defer just queues the rewritten command for the next turn
The call is blocked outright, since defer is simply another name for the deny decision

Correct: Neither: defer ends the query for the host to resume later, and updatedInput is ignored with defer

d1-05-posttooluse-normalize Agentic Architecture & Orchestration ↑ question in bank

Every Bash result reaches the model carrying ANSI color codes, and you need them stripped before Claude reads the output — the call itself must still run normally. Which hook event and return field do that work?

Answer & rationale

Stripping noise from a result is normalization, which happens on the side of tool execution after the tool runs. That is PostToolUse, and the field that does the work is updatedToolOutput, which replaces the output before Claude sees it. (b) names the gate side: PreToolUse with updatedInput rewrites the call’s input before the tool runs, but the noisy output does not exist yet at that point, so there is nothing to strip. (c) picks the right event but the wrong field: additionalContext appends to the result rather than replacing it, so the original ANSI-laden output still reaches the model. (d) blocks the call entirely, but the requirement is that the call still run — deny defeats the purpose.

A PostToolUse hook returning updatedToolOutput, which replaces the result before the model reads it ✓
A PreToolUse hook returning updatedInput, which rewrites the call before the tool runs
A PostToolUse hook returning additionalContext, which strips the codes as it appends to the result
A PreToolUse hook returning a deny decision so the noisy Bash call never runs

Correct: A PostToolUse hook returning updatedToolOutput, which replaces the result before the model reads it

d1-05-precedence-deny-wins Agentic Architecture & Orchestration ↑ question in bank

Three PreToolUse hooks fire on one tool call and return, respectively, allow, ask, and deny. What happens to the call, and what rule decides it?

Answer & rationale

All matching hooks run in parallel and the most restrictive result wins, following the fixed precedence deny > defer > ask > allow — so one hook returning deny blocks the call regardless of what the others return. (b) inverts the safety model: permitting requires every hook to agree, not a majority, so two non-deny votes cannot override a single deny. (c) gets the precedence backwards — deny outranks ask, not the other way around. (d) invokes execution order, but the outcome is decided by precedence, not by who finishes first; completion order is non-deterministic and a hook must never assume it.

The call is blocked: matching hooks run in parallel and the most restrictive result wins ✓
The call is allowed, because two of the three hooks did not return deny
The call runs after a permission prompt, because ask outranks a lone deny
The outcome depends on which hook finishes first, since order breaks the tie

Correct: The call is blocked: matching hooks run in parallel and the most restrictive result wins

Chapter d1-06

d1-06-adaptive-cost-tradeoff Agentic Architecture & Orchestration ↑ question in bank

A lead picks an adaptive multi-agent decomposition over a single-agent approach for an open-ended research task and is asked to justify the spend to a skeptical stakeholder. Compared to a single agent on an equivalent task, what is the rough token cost, and what does that spend actually buy?

Answer & rationale

The chapter attaches a price tag: an adaptive multi-agent flow “typically use[s] 3-10x more tokens than single-agent approaches for equivalent tasks,” and that spend buys thoroughness, not speed — parallel subagents explore a larger space, but coordination plus the slowest subagent often make the wall-clock slower. (b) keeps the right number but inverts the benefit: the chapter is explicit that parallelism does not buy speed. (c) is doubly wrong — it both denies the multiplier (adaptive work is not “the same tokens”) and claims a speed gain the chapter rules out. (d) borrows the language of the opposite structure: deterministic, auditable, bounded cost describe the sequential pipeline, which is precisely what an adaptive flow is not.

Roughly 3-10x more tokens, and what it buys is thoroughness, not speed ✓
Roughly 3-10x more tokens, and what it buys is a faster wall-clock from parallel subagents
About the same tokens, since the work is identical, but a faster wall-clock from parallelism
Roughly 3-10x more tokens, with the gain being deterministic, auditable, bounded cost

Correct: Roughly 3-10x more tokens, and what it buys is thoroughness, not speed

d1-06-over-decomposition-guard Agentic Architecture & Orchestration ↑ question in bank

An adaptive system keeps spawning dozens of subagents to answer questions that turn out to be simple — the “50 subagents for a simple query” failure — multiplying token cost for nothing. The task genuinely is open-ended, so abandoning adaptivity isn’t an option. What is the chapter’s mitigation for this failure mode?

Answer & rationale

This is over-decomposition, the adaptive failure mode, and the chapter’s named guard is the effort-scaling heuristic: the orchestrator must be told how to size effort to complexity — “1 agent with 3-10 tool calls” for simple fact-finding, “2-4 subagents” for a comparison, “more than 10” for complex research. (b) throws away the adaptivity the open-ended task actually needs — forcing a fixed pipeline onto path-dependent work is the opposite failure mode (rigid pipelining). (c) misdiagnoses the cause: a budget cap may stop the bleeding but it doesn’t make the orchestrator size effort correctly — it just truncates, leaving the misjudgment intact. (d) is the wrong lever and inflates context: the fix is teaching the orchestrator to scale effort, not feeding every subagent the whole transcript.

Tell the orchestrator to scale effort to complexity — 1 agent for simple fact-finding, 2-4 for a comparison, 10+ for complex research ✓
Switch the work to a hardcoded sequential pipeline so the subagent count can never vary
Cap the orchestrator's total token budget so it runs out before it can spawn too many subagents
Give every spawned subagent the full transcript so each one can decide whether it is actually needed

Correct: Tell the orchestrator to scale effort to complexity — 1 agent for simple fact-finding, 2-4 for a comparison, 10+ for complex research

d1-06-path-dependent-structure Agentic Architecture & Orchestration ↑ question in bank

You are scoping “find out whether any competitor shipped feature X — go as deep as the question needs,” where each thing you learn changes what you look at next and the depth is unknowable up front. Which decomposition structure does this task require, and on what property of the task does that choice turn?

Answer & rationale

The chapter’s deciding property is predictability, and this task fails it: the work is “inherently dynamic and path-dependent” — step N+1 depends on what step N discovered — so “you can’t hardcode a fixed path,” which is exactly what makes adaptive decomposition the right shape. (b) and (c) both reach for a pipeline, but a fixed sequence applied to path-dependent work is the rigid-pipelining failure: it is brittle and cannot branch on what it finds, and you cannot map findings to steps you couldn’t predict. (d) lands on adaptive for the wrong reason — “spawn many subagents up front” is over-decomposition; adaptive means the orchestrator sizes effort at runtime (the 1 / 2-4 / 10+ ladder), not that more agents is always better.

Adaptive decomposition, because the work is path-dependent — step N+1 depends on what step N discovered, so no design-time sequence can capture it ✓
A sequential pipeline, because mapping each finding to the next step in advance is what makes the cost auditable
A sequential pipeline, because a clean fixed sequence is the structure least likely to be brittle under change
Adaptive decomposition, because spawning many subagents up front guarantees the broadest possible coverage of the topic

Correct: Adaptive decomposition, because the work is path-dependent — step N+1 depends on what step N discovered, so no design-time sequence can capture it

Chapter d1-07

d1-07-crosshost-artifacts Agentic Architecture & Orchestration ↑ question in bank

A CI agent stops on error_max_turns partway through a refactor, and its container is torn down. You must finish the work tomorrow on a fresh worker that won’t have today’s transcript file. What is the more robust way to carry the work across hosts?

Answer & rationale

The chapter’s cross-host discipline is to lift the state you care about out of the conversation: capture the artifacts (analysis output, decisions, file diffs) as application state and pass them into a fresh session’s prompt — which the docs call “often more robust than shipping transcript files around.” (b) is the brittle option the chapter argues against: resuming by ID requires both restoring the transcript to the exact ~/.claude/projects/<encoded-cwd>/ path and reproducing the same cwd on every worker. (c) cannot help across hosts — forking still copies a transcript that lives under an encoded-cwd path tied to a directory, the very thing the fresh worker lacks. (d) restores files, not the conversation; checkpointing is a separate concern and does not carry the agent’s reasoning forward.

Capture the artifacts that matter — decisions, the diff so far, the remaining plan — as application state and seed a fresh session's prompt ✓
Ship the original session's transcript file to the new worker and resume it by ID with a bigger budget
Fork the original session so the new worker gets an independent copy of the conversation to continue from
Use file checkpointing to snapshot the working directory and restore it on the new worker before resuming

Correct: Capture the artifacts that matter — decisions, the diff so far, the remaining plan — as application state and seed a fresh session's prompt

d1-07-fork-shared-disk Agentic Architecture & Orchestration ↑ question in bank

You fork your session to try a destructive refactor, expecting the original session’s files to be safe. The forked agent deletes a source file. Is that deletion isolated from the original session?

Answer & rationale

Forking branches the conversation history, not the filesystem — both forks share one working directory, so a file edit (or deletion) in either is real and visible to the other. The chapter’s pitfall names this exactly: treating a fork as an isolated sandbox corrupts the shared directory. (b) misreads what the “copy” is: a fork copies the conversation history, never a snapshot of the disk. (c) attaches sandboxing to the new session ID, but the ID identifies a second conversation, not a second world; to branch the files too you need file checkpointing. (d) invents an open/closed rule the chapter never states — isolation does not turn on whether the original session is still open; the disk is shared regardless.

No — forking branches the conversation, not the filesystem, so both forks share one disk and the deletion is real ✓
Yes — a fork starts from a copy of the original's history, and that copy includes a snapshot of the working directory
Yes — the new session ID gives the fork its own sandboxed filesystem, so edits cannot reach the original
No — but only because the original session is still open; closing it first would have isolated the files

Correct: No — forking branches the conversation, not the filesystem, so both forks share one disk and the deletion is real

d1-07-resume-fresh-cwd Agentic Architecture & Orchestration ↑ question in bank

A script captured a session_id from the first run’s ResultMessage and later calls resume with that exact ID, but the resumed session comes back empty and fresh instead of the expected history. What is the single most likely cause?

Answer & rationale

The #1 resume bug is a mismatched cwd: sessions are stored under ~/.claude/projects/<encoded-cwd>/*.jsonl, where the encoded-cwd is the absolute working directory with every non-alphanumeric character replaced by -. Resume from a different directory and the SDK derives a different path, finds nothing, and silently starts fresh — even with the correct ID. (b) is the opposite of what the chapter teaches: an error_max_turns session stops on a budget, not a wall, so its transcript is intact and resumable. (c) misfires because the stem says the specific ID was passed; and continue reopening the most-recent session would still return history, not an empty one. (d) invents session-ID expiry, a mechanism the chapter never mentions — the lookup is by encoded path, not a time-limited token.

You launched the resume from a different working directory, so the encoded-cwd lookup path no longer matches ✓
The original session hit error_max_turns, which clears the transcript and leaves nothing to resume
You passed continue_conversation=True instead of the specific session ID, so the wrong session was reopened
The session ID expired between runs, so the SDK had to mint a fresh empty session in its place

Correct: You launched the resume from a different working directory, so the encoded-cwd lookup path no longer matches

Chapter d2-01

d2-01-consolidate-selection-ambiguity Tool Design & MCP Integration ↑ question in bank

Your toolset exposes create_pr, review_pr, and merge_pr as three separate tools, and the agent frequently invokes the wrong one. Which redesign does the chapter document for exactly this set, and on what stated rationale?

Answer & rationale

The chapter uses this exact trio as its example of the documented default — consolidate related operations into a single tool with an action parameter, because “fewer, more capable tools reduce selection ambiguity” — so (a) is the documented redesign and rationale. (b) addresses name collisions, but these three names already differ; the problem is too many near-equivalent tools, which namespacing does not reduce. (c) shows argument shape per tool yet leaves three selectable tools in place, so the model can still misroute among them. (d) improves each response’s signal but again preserves the three-way choice the consolidation is meant to remove.

Fold them into one tool with an action parameter, because fewer, more capable tools reduce selection ambiguity ✓
Keep three tools but namespace each by service so their names no longer collide
Keep three tools and add input_examples to each so the model picks the right one
Keep three tools but make each response return only high-signal identifiers

Correct: Fold them into one tool with an action parameter, because fewer, more capable tools reduce selection ambiguity

d2-01-description-highest-leverage Tool Design & MCP Integration ↑ question in bank

A tool you inherited carries the entire description “Gets data for a record,” and the agent keeps reaching for it at the wrong moments. Of the changes available, which one targets the single highest-leverage surface the chapter identifies?

Answer & rationale

The chapter names the description as “by far the most important factor in tool performance” — the surface the model selects from — and prescribes saying what the tool does, when (and when not) to use it, and what each parameter means (3-4 sentences). Since the model chooses tools by description alone and never reads the implementation, an opaque description is a performance bug it cannot route around, so (a) is the highest-leverage fix. (b) helps disambiguate a growing library but does not tell the model what the tool does or its boundary. (c) compounds the gain by showing argument shape, yet the chapter frames examples as a complement to a good description, not a substitute for one. (d) improves the response half of the contract, which shapes the next call rather than fixing why the model misroutes to this tool.

Rewrite the description to say what it does, when (and when not) to use it, and what each parameter means ✓
Rename it from a bare verb to a service-namespaced name like crm_get_record
Add input_examples so the model can infer the intent from sample calls
Trim the response to high-signal fields so the model wastes fewer tokens reading it

Correct: Rewrite the description to say what it does, when (and when not) to use it, and what each parameter means

d2-01-input-examples-400 Tool Design & MCP Integration ↑ question in bank

You add three input_examples to a working client tool, and the API now rejects the whole request with a 400 — your description and your code are untouched and fine. Which explanation matches the one hard rule the chapter states for input_examples?

Answer & rationale

The chapter’s one hard rule is that each example must validate against the tool’s input_schema, or the request returns a 400 — so a single example with an illegal enum value (the chapter’s own typo’d “kelvin” case) fails the whole request, making (a) correct. (b) invents a numeric ceiling the chapter never states; the cost of examples is a token cost paid deliberately, not a hard cap. (c) misreads the client/server distinction: input_examples are precisely for client (user-defined) tools, so they are permitted here. (d) describes a soundness concern but not a 400 — a description that disagrees with the examples is a quality problem, while the schema-validation failure is what actually returns the error.

One example sets unit to a value the enum does not allow, so it fails to validate against input_schema ✓
Three examples exceed the per-tool limit, so the request is rejected for too many examples
input_examples are not permitted on this client tool and must be removed entirely
The description disagrees with the examples, so the model cannot reconcile the two surfaces

Correct: One example sets unit to a value the enum does not allow, so it fails to validate against input_schema

Chapter d2-02

d2-02-error-flag-casing Tool Design & MCP Integration ↑ question in bank

A tool fails. In the Claude Messages API (direct, not MCP), which field on the tool_result block flags the failure — and in which casing?

Answer & rationale

The casing is the regime tell: is_error (snake_case) is the Claude Messages API’s single tool-failure signal, while isError (camelCase) is MCP’s flag — conflating the two spellings is the most common mistake here. The direct API has no in-band JSON-RPC channel (a protocol-level problem surfaces as an HTTP error such as 400), and error_code is not a tool_result field.

`is_error` (snake_case) ✓
`isError` (camelCase)
`error_code` (snake_case)
a JSON-RPC `-32602` error

Correct: `is_error` (snake_case)

d2-02-validation-channel-routing Tool Design & MCP Integration ↑ question in bank

Your booking tool, exposed over MCP, receives a departure date that lies in the past — a business-logic failure the model could fix by asking for a new date. The MCP spec gives you two channels for reporting failures. Which channel should this failure travel down, and why?

Answer & rationale

A past departure date is an input-validation / business-logic failure the model can self-correct, so per SEP-1303 it belongs in the isError: true channel — an execution error returned inside a successful result, addressed to the model so it can read the text and retry. The routing rule is who can act on the error, not how severe it feels. (b) and (c) both mis-route a recoverable failure to JSON-RPC: that channel is reserved for protocol errors the model cannot fix (unknown tool, malformed request), and sending a validation error there silently denies the model its chance to correct and retry. (c) adds the right-sounding rationale of “let the host decide,” but the host audience is exactly wrong here — the failure should be addressed to the model. (d) hides the failure behind a success result, leaving the model nothing to flag or recover against.

Return it as an execution error inside a successful result, so the model can read the message and self-correct ✓
Return a JSON-RPC -32602 protocol error, since invalid parameters are a protocol-level problem
Return a JSON-RPC error so the host can decide whether to surface the failure to the user
Return a successful empty result and rely on the model to notice the missing booking

Correct: Return it as an execution error inside a successful result, so the model can read the message and self-correct

Chapter d2-03

d2-03-mcp-scope-vs-bypass Tool Design & MCP Integration ↑ question in bank

An agent should be able to call every tool on one MCP server but gain no other broad privileges. Which configuration matches the chapter’s documented guidance for scoping that MCP access?

Answer & rationale

The documented guidance is to scope with allowedTools: a mcp__server__* wildcard “grants exactly the MCP server you want and nothing more,” leaving every other safety gate intact. (b) inverts the recommendation — bypassPermissions auto-approves the MCP tools but disables every other safety prompt across the whole agent, broader than necessary. (c) is the wrong axis: tool_choice steers which of the available tools fires on one request; it does not define which tools the agent has at all. (d) misuses disallowedTools, which removes a tool from the request entirely (the model never sees it) — the opposite of granting the server’s tools.

Scope with an `allowedTools` wildcard like `mcp__github__*`, which grants exactly that server and nothing more ✓
Set `permissionMode: "bypassPermissions"`, which auto-approves the server's tools and is the documented MCP path
Set `tool_choice` to `any` so the model is steered toward the MCP server's tools each request
Add the server to `disallowedTools` so its tools are removed from the request and then re-approved on demand

Correct: Scope with an `allowedTools` wildcard like `mcp__github__*`, which grants exactly that server and nothing more

d2-03-thinking-forces-auto Tool Design & MCP Integration ↑ question in bank

A teammate wants the model to reason before acting, so they enable extended thinking, and they also want a specific tool always called, so they set tool_choice to a forced {"type": "tool", "name": …}. What happens when this request runs?

Answer & rationale

Only auto and none are compatible with extended (or adaptive) thinking; any and a forced tool return an error — the two requirements are mutually exclusive, so you cannot both force a tool and let the model reason with thinking on. (b) is the wrong resolution: the conflict surfaces as an error, not as a silent demotion of thinking. (c) misreads the prefill behavior — forcing prefills the assistant message and suppresses the natural-language preamble; it does not produce a schema-violating one, and with thinking on the request never gets that far. (d) confuses axes: parallel execution is governed by disable_parallel_tool_use, a separate knob from tool_choice, and it is not what breaks here.

The forced mode is incompatible with extended thinking, so the request errors before any tool fires ✓
Forcing one named tool silently disables thinking, so the model answers but never reasons first
The request runs, but the model emits a reasoning preamble that violates the strict schema
Parallel tool use kicks in, so the model calls several tools at once instead of the one named tool

Correct: The forced mode is incompatible with extended thinking, so the request errors before any tool fires

Chapter d2-04

d2-04-local-scope-not-local-settings Tool Design & MCP Integration ↑ question in bank

A teammate added an MCP server at local scope, but it never loads. They have been editing .claude/settings.local.json to fix it, with no effect. Which file should you open instead to find and repair the server’s definition?

Answer & rationale

The chapter’s most-confusing collision is the word local: a local-scoped MCP server is stored in ~/.claude.json (home directory, under a per-project key), whereas general local settings live in .claude/settings.local.json (the project directory). They are different files in different directories, so (a) is right — that is where the definition actually lives. (b) edits .claude/settings.local.json, the exact file the chapter calls a silent no-op for a local-scoped server. (c) is the Project scope’s committed file; local-scoped servers are never written there, and scopes are not all resolved through one file. (d) confuses Local with User scope — a local-scoped server is current-project-only and not shared across all projects.

Open the home-directory "~/.claude.json", under that project's key, where local-scoped servers actually live ✓
Open ".claude/settings.local.json" and add the missing server block there yourself
Open the committed ".mcp.json" at the repo root, since every scope resolves through that file
Open the user settings, because a local-scoped server is shared across all of your projects

Correct: Open the home-directory "~/.claude.json", under that project's key, where local-scoped servers actually live

d2-04-mcp-project-scope Tool Design & MCP Integration ↑ question in bank

You want an MCP server to be shared with everyone who checks out a repository — scoped to that project, not to your user account. Where do you declare it?

Answer & rationale

A project-scoped MCP server is declared in .mcp.json at the repository root, which is committed to version control so every collaborator inherits it. ~/.claude.json is user-global — it applies across all of your projects but is not shared through the repo. .claude/settings.local.json holds personal, git-ignored overrides. There is no mcpServers field in package.json.

`.mcp.json` at the repository root ✓
`~/.claude.json`
`.claude/settings.local.json`
the `mcpServers` field in `package.json`

Correct: `.mcp.json` at the repository root

d2-04-scope-precedence-wins Tool Design & MCP Integration ↑ question in bank

A repo ships a notion server in its committed .mcp.json (Project scope). A teammate also defines a server named notion at Local scope in their own ~/.claude.json. On the teammate’s machine, which definition actually connects?

Answer & rationale

When the same server name appears in more than one scope, Claude Code connects once, using the highest-precedence source, and the order is Local -> Project -> User -> plugin -> claude.ai connectors (the first three match duplicates by name). Local outranks Project, so the teammate’s ~/.claude.json definition wins on their machine, making (a) correct. (b) inverts the order — the shared Project config does not override a personal Local one; this is the intended path for letting personal credentials take over a team default. (c) is wrong: there is no field-by-field merge — only one definition connects. (d) is wrong: a name appearing in two scopes is resolved by precedence, not flagged as a conflict that blocks both.

The Local-scope definition in their `~/.claude.json`, because precedence runs Local before Project ✓
The committed Project `.mcp.json`, because a shared, version-controlled config always overrides a personal one
Both connect, and Claude Code merges the two definitions field by field
Neither connects until the duplicate name is resolved, because matching names is treated as a conflict

Correct: The Local-scope definition in their `~/.claude.json`, because precedence runs Local before Project

d2-04-verify-system-init-status Tool Design & MCP Integration ↑ question in bank

Your agent “doesn’t seem to have” its MCP tools, yet the config looks correct and no error is thrown. Before letting the agent act, what should the program do to turn this silent gap into an explicit failure?

Answer & rationale

A wired server that never connected is this chapter’s silent failure: the agent just runs without the tools. The documented fix is to read the system:init message — each server’s status is connected | failed | needs-auth | pending | disabled — and gate the run on it, so (a) is right. (b) assumes an exception is raised, but the chapter’s whole point is that no error is thrown; the failure surfaces only in the status field. (c) misreads the 60-second timeout: shortening it does not reveal the problem and would make slow-but-healthy servers fail — the chapter says pre-warm or use a lighter package instead. (d) strictMcpConfig controls which servers are considered (clean-room config), not whether a configured server actually connected.

Read the "system:init" message and refuse to run if any server's status is not "connected" ✓
Catch the exception the server raises on connection failure and retry the query
Lower the 60-second initialization timeout so a failing server reports sooner
Set "strictMcpConfig: true" so only your declared servers can connect

Correct: Read the "system:init" message and refuse to run if any server's status is not "connected"

Chapter d2-05

d2-05-deny-beats-bypass Tool Design & MCP Integration ↑ question in bank

A headless agent runs under permission_mode="bypassPermissions" to suppress prompts, with one guardrail: disallowed_tools=["Bash(rm -rf *)"]. The model requests Bash(rm -rf /data). Will the deletion be blocked?

Answer & rationale

The SDK checks permissions in a fixed order: Hooks, then Deny rules, then Permission mode, then Allow rules, then canUseTool. Because deny rules (step 2) sit above the permission mode (step 3), the Bash(rm -rf *) rule matches and denies the call before the mode is ever consulted — so the deletion is blocked even though the mode is bypassPermissions. (b) inverts the order: bypassPermissions is step 3, below deny rules, so it cannot approve first. (c) confuses the two instruments — allowed_tools only pre-approves and lives at step 4 (below the mode), so under bypassPermissions it is ignored entirely; the allowlist would not block anything. (d) invents a restriction: deny rules apply in every mode, which is exactly why they are the tool for forbidding.

Yes — deny rules are checked before the permission mode, so the rule binds even under bypassPermissions ✓
No — bypassPermissions sits above deny rules in the order, so it approves the call first
No — the rule must instead go in allowed_tools, which is the list bypassPermissions honors
Only if you also drop the mode to default, since deny rules apply in default mode alone

Correct: Yes — deny rules are checked before the permission mode, so the rule binds even under bypassPermissions

d2-05-parallel-read-only Tool Design & MCP Integration ↑ question in bank

In one turn an agent issues two Grep calls, one Glob, and one Write. The chapter says the deciding property is whether a tool reads state or changes it. Which of these calls may the SDK run concurrently?

Answer & rationale

The roster splits on whether a tool is read-only or state-modifying: read-only tools — Read, Glob, Grep — may run concurrently because they cannot conflict, while state-modifying tools (Edit, Write, Bash) run sequentially to avoid clobbering each other. So the two Grep calls and the Glob can fan out together; the Write must run on its own. (b) overgeneralizes: the runtime does not parallelize every call — state-modifying tools are held sequential. (c) misidentifies the property as “file-system tool” rather than read-versus-write — Write is a state-modifying file tool and is excluded, while Grep/Glob qualify by being read-only. (d) confuses this with a request-level cap: parallelism here is a property of the tool’s nature, not a switch the developer flips per request.

The two Grep calls and the Glob, because read-only tools may run concurrently while Write modifies state ✓
All four, because the SDK fans out every tool call within a single turn by default
The Glob and the Write, because file-system tools are the ones cleared to run in parallel
None of the three reads, because parallelism is a per-request switch the developer must enable first

Correct: The two Grep calls and the Glob, because read-only tools may run concurrently while Write modifies state

Chapter d3-01

d3-01-claude-md-concatenate Claude Code Configuration & Workflows ↑ question in bank

Two CLAUDE.md files in the load chain give contradictory instructions. Which one takes precedence?

Answer & rationale

Unlike the five-level settings precedence (where the highest scope wins), discovered CLAUDE.md files concatenate root-down without overriding — both contradictory lines load into context at once (a smell to fix at the source, not something proximity resolves). The rule of thumb: settings override; instructions accumulate. Reasoning about CLAUDE.md the way you reason about settings.json predicts the wrong behavior every time.

Neither — they concatenate; both stay in context ✓
The Project file overrides the User file
The Local file overrides every other scope
The Managed-policy file overrides the rest

Correct: Neither — they concatenate; both stay in context

d3-01-import-decline-permanent Claude Code Configuration & Workflows ↑ question in bank

Your CLAUDE.md pulls in shared standards with @./standards/api.md. The first time the import runs, Claude Code shows an approval dialog and the developer clicks decline. According to the chapter, what is the lasting consequence for that environment?

Answer & rationale

The first time a session encounters an @import, Claude Code shows an approval dialog, and declining it disables imports permanently — the dialog does not reappear, so the referenced file silently will not expand until the choice is reset. (b) contradicts the documented behavior: the consequence is permanent, not re-prompted each session. (c) treats the decline as a one-time skip, but the chapter is explicit that the dialog does not come back. (d) confuses a declined approval with a hard failure — declining suppresses imports, it does not block the session from starting, and the chapter notes the overflow behavior past depth 5 (silent truncation versus error) is itself undocumented.

Imports are disabled permanently — the approval dialog does not reappear, so the file never expands until the choice is reset ✓
Claude re-prompts on the next session start, since approval is tracked per session
The import is skipped only this once, and the referenced file expands automatically next launch
Claude raises a build error and refuses to start until the import line is removed

Correct: Imports are disabled permanently — the approval dialog does not reappear, so the file never expands until the choice is reset

d3-01-settings-precedence-cli-wins Claude Code Configuration & Workflows ↑ question in bank

A developer sets “model”: “sonnet” in their user settings and “model”: “opus” in the project’s committed settings, then launches the session with —model haiku. Given how Claude Code resolves the settings hierarchy, which model actually runs, and why?

Answer & rationale

Settings resolve by a strict five-level precedence where the highest scope wins: Managed → CLI arguments → Local → Project → User. The --model haiku flag is a CLI argument (level 2), which beats both the project opus (level 4) and the user sonnet (level 5), so haiku runs. (b) inverts the ladder — being committed gives Project no power over a CLI argument, which sits a full two levels higher. (c) imports the CLAUDE.md mental model (broadest-first concatenation) onto settings; for settings the broadest scope, User, is the lowest, not the winner. (d) reaches the right model but by the wrong mechanism: settings override to a single value, and only permission allow/ask/deny rules are the merge exception — model is not.

haiku — the CLI argument sits above both file scopes on the precedence ladder ✓
opus — the project settings file is committed, so it outranks a session-only flag
sonnet — the user file is the broadest scope, and broadest wins as it does for CLAUDE.md
haiku — but only because all three values happen to merge, like permission rules do

Correct: haiku — the CLI argument sits above both file scopes on the precedence ladder

Chapter d3-02

d3-02-commands-merged-into-skills Claude Code Configuration & Workflows ↑ question in bank

A colleague insists the only way to register a /deploy trigger is a flat file at .claude/commands/deploy.md. Per this chapter, which placement(s) will actually create a working /deploy?

Answer & rationale

Custom commands have merged into skills: the chapter states that .claude/commands/deploy.md and .claude/skills/deploy/SKILL.md both create /deploy and work the same way, so (a) is right. (b) repeats the colleague’s misconception — the flat file is not the sole way; it is the legacy shape. (c) inverts the truth: existing .claude/commands/ files keep working, so the skill form is not the only one that runs. (d) confuses built-in commands (behavior coded into the CLI) with the user-authored prompt mechanism that both files use; a custom /deploy need not be built-in.

Either form works — a file at .claude/commands/deploy.md and a skill at .claude/skills/deploy/SKILL.md both create /deploy ✓
Only .claude/commands/deploy.md, because a slash command is the sole way to register a /deploy trigger
Only .claude/skills/deploy/SKILL.md, because the legacy flat-file commands no longer run
Neither — /deploy must be a built-in command whose behavior is coded into the CLI

Correct: Either form works — a file at .claude/commands/deploy.md and a skill at .claude/skills/deploy/SKILL.md both create /deploy

d3-02-disable-model-invocation-description Claude Code Configuration & Workflows ↑ question in bank

You mark a risky /force-release skill disable-model-invocation: true, then ask Claude to “just release it” — and it acts as if the skill does not exist. According to this chapter, what did that flag do to the skill’s description?

Answer & rationale

disable-model-invocation: true keeps the skill’s description out of context entirely — it loads only when the user types /name — so the model is never told the skill exists and cannot auto-invoke; (a) matches the chapter exactly. (b) is the wrong mechanism: the chapter does not gate a seen description, it removes the description so there is nothing to match. (c) invents a load-on-first-use rule the chapter never states. (d) describes the separate budget-overflow path, where least-invoked descriptions drop first; that is a different cause from this flag, which always withholds the description.

Its description is kept out of startup context entirely, so Claude is never told the skill exists and cannot auto-invoke it ✓
Its ~100-token description still loads at startup, but Claude is forbidden from acting on the match
Its description loads only after the user runs it once, so the first auto-invoke is what is blocked
Its description is dropped only when the skill-listing budget overflows, otherwise Claude can still auto-invoke

Correct: Its description is kept out of startup context entirely, so Claude is never told the skill exists and cannot auto-invoke it

Chapter d3-03

d3-03-scope-load-order Claude Code Configuration & Workflows ↑ question in bank

Your machine has ~/.claude/rules/style.md (“prefer tabs”), and a repo you are working in ships .claude/rules/style.md (“prefer spaces”). Neither rule uses a paths glob, so both are in context. When the two instructions tension, which one does Claude favor and why?

Answer & rationale

The documented order is that “user-level rules are loaded before project rules, giving project rules higher priority” — rules concatenate, but the more-specific scope is read last and dominates, the same recency model as the CLAUDE.md hierarchy. So the project rule (“spaces”) wins. (b) inverts the order: a user-level preference is a default that a project rule can override by being read after it, not the other way around. (c) imports a priority mechanism that does not exist — paths controls whether a rule is present, not its precedence once loaded, and here neither rule is scoped anyway. (d) invents a cancellation behavior the chapter never describes; the conflict is resolved by load order, not by mutual nullification.

The project rule wins, because user-level rules load first and the project rule is read last at higher priority ✓
The user-level rule wins, because a machine-wide preference outranks anything a single repository ships
Whichever rule carries a paths glob wins, since scoping a rule raises its priority over an unscoped one
Neither applies — two rules giving conflicting instructions cancel out, so Claude falls back to its default

Correct: The project rule wins, because user-level rules load first and the project rule is read last at higher priority

d3-03-scoped-rule-silent Claude Code Configuration & Workflows ↑ question in bank

A rule at .claude/rules/backend/api.md carries paths: ["src/api/**/*.ts"]. During a session, Claude edits files under src/frontend/ and never reads anything under src/api/. Regarding that API rule, what is the effect of this session on its presence in context?

Answer & rationale

Path-scoped rules “trigger when Claude reads files matching the pattern, not on every tool use” — so a rule scoped to src/api/** has no effect during work that never touches src/api/, which is exactly the value of scoping: file-specific guidance that costs nothing on unrelated turns. (b) describes an unconditional rule (no paths), which loads at launch at the same priority as .claude/CLAUDE.md; the whole point of paths is that it does not load that way. (c) confuses any-tool-use with file-read: the chapter is explicit that the trigger is reading a matching file, “not on every tool use.” (d) invents a one-shot launch check the chapter never describes — the rule is dormant, not disabled, and would activate the moment Claude does read a matching API file.

Nothing happens — the rule stays out of context, because it activates only when Claude reads a file matching its glob ✓
The rule loads at session start, since every rule under .claude/rules/ is read into context when Claude launches
The rule loads as soon as Claude runs any tool during the session, even before touching a matching file
The rule is ignored permanently this session, because a path-scoped rule that misses at launch never gets a second chance

Correct: Nothing happens — the rule stays out of context, because it activates only when Claude reads a file matching its glob

Chapter d3-04

d3-04-approve-flips-write-mode Claude Code Configuration & Workflows ↑ question in bank

In plan mode, Claude presents a plan and you select the “accept edits” approve option. Considering only what the chapter says happens at approval, what permission mode is the session now in, and what does that mean for editing?

Answer & rationale

Approving a plan exits plan mode and switches the session to the write mode the chosen approve option names — here “accept edits” lands the session in acceptEdits, where edits auto-approve and Claude starts editing. To plan again you must re-enter plan mode (cycle with Shift+Tab or prefix /plan); approval is one-way. (b) misses the whole point: the approval ends the read-only guarantee rather than staying in plan mode. (c) invents a second confirmation step — approval itself is the decision point, with no further gate. (d) confuses approving a plan with skipping permission checks; the approve options choose a write mode, not bypassPermissions.

It has exited plan mode into acceptEdits, so Claude can now apply edits; re-enter plan mode to research again ✓
It remains in plan mode, with the approval recorded as a note for the next edit
It is paused awaiting a second confirmation before any write mode takes effect
It has switched to bypassPermissions, since approving a plan waives later checks

Correct: It has exited plan mode into acceptEdits, so Claude can now apply edits; re-enter plan mode to research again

d3-04-opusplan-200k Claude Code Configuration & Workflows ↑ question in bank

You run claude --model opusplan for a refactor whose planning step must hold roughly 400K tokens of context at once. Before picking the alias, decide whether opusplan’s plan phase can provide that much context.

Answer & rationale

The automatic 1M-context upgrade applies to the opus alias only, not opusplan — its Opus plan phase runs with the standard 200K window. So a ~400K planning context is out of reach under opusplan; you would pin a 1M model such as opus[1m] for that phase instead. (b) is the trap the chapter says to memorize against: 1M is opus-only. (c) invents a size-triggered window that does not exist; the cap is fixed at 200K for this alias. (d) misattributes context to the model switch — Sonnet runs execution, not planning, and switching models does not grant the planning step a larger window.

No — opusplan's plan phase runs at the standard 200K; reach for opus[1m] for that phase instead ✓
Yes — opusplan's Opus plan phase receives the automatic 1M-context upgrade
Yes — the 1M window activates once the planning context exceeds 200K
No — but switching to Sonnet for execution would give the planning step the larger window

Correct: No — opusplan's plan phase runs at the standard 200K; reach for opus[1m] for that phase instead

d3-04-prompts-still-apply Claude Code Configuration & Workflows ↑ question in bank

A teammate switches into plan mode expecting a silent run with no interruptions, then is surprised when Claude still pauses to ask before a shell command. What does plan mode actually do to the permission prompts?

Answer & rationale

Plan mode is a no-edit mode, not a prompt-free one: Claude can read files, run shell commands to explore, and write a plan, but it does not edit your source — and “permission prompts still apply the same as default mode.” So the prompts behave exactly as in default; only the edits to source are withheld. (b) is the precise misconception the chapter corrects — read-only is the guarantee, silence is not. (c) invents a selective-suppression rule the chapter never states; prompts apply uniformly, as in default. (d) misreads writing the plan as a state that quiets prompts, but plan mode still gates the actions it allows throughout.

Tool-permission prompts still appear the same as in default mode; only source edits are withheld ✓
All prompts are suppressed, since plan mode runs as a quiet read-only sandbox
Prompts are suppressed for shell commands but still appear for file reads
Prompts disappear once Claude begins writing the plan, signalling research is underway

Correct: Tool-permission prompts still appear the same as in default mode; only source edits are withheld

Chapter d3-05

d3-05-clear-after-third-correction Claude Code Configuration & Workflows ↑ question in bank

Three times this session you have corrected Claude on the same issue, and each fix drifts back toward the wrong approach. According to the chapter’s threshold for course-correction, what should you do next?

Answer & rationale

The chapter sets a hard threshold: corrected more than twice on the same issue in one session, “the context is cluttered with failed approaches. Run /clear and start fresh with a more specific prompt that incorporates what you learned” — a clean session with a better prompt “almost always outperforms a long session with accumulated corrections.” (b) is the move the chapter explicitly warns against: past the second correction each nudge fights a context polluted with the very approaches you are steering away from. (c) keeps the polluted context in place and still ends in another correction, so it does not reset the loop. (d) misdiagnoses the problem — the failed approaches in the window are pulling the model back, so giving them more room makes the drift worse, not better; the fix is to clear them, not accommodate them.

Run /clear and restart with a more specific prompt that incorporates what the failed rounds taught you ✓
Issue a third, more forceful correction since the feedback loop is finally getting tight
Open plan mode and have Claude re-explore the codebase before you correct it once more
Raise the context budget so the failed approaches and the new instruction both fit in the window

Correct: Run /clear and restart with a more specific prompt that incorporates what the failed rounds taught you

d3-05-fresh-session-after-interview Claude Code Configuration & Workflows ↑ question in bank

You ran the interview pattern for a rate limiter: Claude questioned you with the AskUserQuestion tool and wrote a complete SPEC.md, which you reviewed and corrected. To begin implementation, which move does the chapter prescribe?

Answer & rationale

The chapter is explicit that a fresh session implements from the written spec: the interview session is “full of question-answer thrashing,” while the implementation session should “start with clean context whose only input is the written spec.” So the correct move is to open a new session (or /clear) and prompt it to implement per SPEC.md. (b) is the exact mistake the split exists to prevent — the interview context is full of half-formed options and back-and-forth, and the implementation session should not inherit that noise. (c) re-injects precisely the thrash the fresh session is meant to escape, defeating the purpose; the spec file is the clean hand-off, so the transcript is not needed. (d) discards the reviewable artifact that serves as both review gate and context bootstrap, and leans on a cluttered session’s memory instead of the corrected spec.

Open a fresh session and prompt it to implement from the spec, so the work starts on clean context ✓
Keep going in the same session, since it already holds the full design discussion in context
Start a fresh session but re-paste the interview transcript so no decisions are lost
Delete the spec file and have the interview session implement from its own memory of the answers

Correct: Open a fresh session and prompt it to implement from the spec, so the work starts on clean context

Chapter d3-06

d3-06-exit-code-masked Claude Code Configuration & Workflows ↑ question in bank

A CI step runs claude -p "review the diff" --max-turns 8 --output-format json > out.json || true. The agent thrashed past eight turns, out.json is truncated, yet the step shows green and the next step consumes the partial output. What went wrong?

Answer & rationale

In headless mode the process exit code is the CI success contract — 0 passes the step, non-zero fails it. Hitting --max-turns “limits agentic turns and exits with error when reached,” i.e. a non-zero exit; but the trailing || true swallows that status, so the shell reports success and the next step consumes the truncated output as if it were real — exactly the pitfall the chapter warns against (“don’t mask the status”). (b) is wrong: --output-format json decides what the run prints, not its exit code; the two are independent. (c) inverts the documented behavior — --max-turns exits with error, it does not exit 0 with a warning. (d) is false: claude -p hands its real exit code straight to the runner (0 on success, non-zero on failure); total_cost_usd is a spend field, not a failure signal.

The run hit `--max-turns` and exited non-zero, but the trailing `|| true` swallowed the status, so the shell reported success and the step stayed green ✓
`--output-format json` overrides the process exit code with the `result` field, so a parseable payload always reports the step as passing
`--max-turns` only prints a warning and exits `0`, so the step was genuinely successful and the empty result is a separate stdout bug
Headless `claude -p` always exits `0`; CI must inspect `total_cost_usd` to infer failure, which this pipeline failed to do

Correct: The run hit `--max-turns` and exited non-zero, but the trailing `|| true` swallowed the status, so the shell reported success and the step stayed green

d3-06-headless-flag Claude Code Configuration & Workflows ↑ question in bank

In a CI pipeline you need Claude Code to run a single prompt non-interactively, print the result to stdout, and exit. Which invocation does this?

Answer & rationale

claude -p "<prompt>" (alias --print) is headless / print mode: it runs the prompt, writes the final result to stdout, and exits without opening the interactive REPL — the basis for CI/CD use. Pair it with --output-format json (or stream-json) for machine-parseable output. The other three flags are not real Claude Code modes.

`claude -p '…'` ✓
`claude --interactive`
`claude --watch`
`claude --serve`

Correct: `claude -p '…'`

d3-06-json-schema-print-mode Claude Code Configuration & Workflows ↑ question in bank

An engineer wants a downstream step to receive a schema-conforming payload, so inside a normal interactive terminal session they reach for --json-schema with --output-format json — and it doesn’t behave as documented. Which statement correctly diagnoses the constraint and says where a valid result would arrive?

Answer & rationale

The chapter is explicit on both points: --json-schema is “documented as print-mode-only — it works under -p, not in an interactive session,” and when used correctly with --output-format json “the schema-conforming result then arrives in structured_output.” So the engineer’s error is reaching for a print-mode flag in a terminal session; the fix is to run it under -p. (b) inverts the cause — the flag itself is print-mode-only, not merely the field it populates, so the chapter does not describe an interactive run that validates-but-hides the result. (c) misnames the required format: structured output uses --output-format json with --json-schema, not stream-json (which carries newline-delimited events, not a schema-validated payload). (d) invents a per-turn enforcement mechanism the chapter never states; --json-schema produces “validated JSON output matching a JSON Schema after agent completes its workflow,” not a per-turn check, and the reason it fails interactively is simply that it is print-mode-only.

`--json-schema` is print-mode only, so it works under `-p` but not in an interactive session; pairing it with `--output-format json` lands the conforming result in `structured_output` ✓
`--json-schema` validates output in any session, but only `-p` exposes the `structured_output` field, so the interactive run silently dropped it
`--json-schema` requires `--output-format stream-json` to emit a schema-validated payload; `json` alone returns only prose
`--json-schema` enforces the shape on every model turn, so an interactive session rejects it because it has no fixed turn count

Correct: `--json-schema` is print-mode only, so it works under `-p` but not in an interactive session; pairing it with `--output-format json` lands the conforming result in `structured_output`

Chapter d4-01

d4-01-positive-over-negative Prompt Engineering & Structured Output ↑ question in bank

A teammate’s prompt keeps producing bulleted answers when the team wants paragraphs, so they ask you whether to add “do not use markdown lists” or “respond in smoothly flowing prose.” Which instruction steers more reliably toward paragraphs, and on what reasoning?

Answer & rationale

The chapter is explicit that a positive instruction outperforms a prohibition when steering format or tone: “Respond in smoothly flowing prose” points straight at the destination, while a prohibition only names a forbidden region without locating the target inside the still-vast permitted one. (b) is the negative form the chapter contrasts against — banning lists rules out one failure but leaves a thousand other acceptable-but-unwanted shapes. (c) is still negative and adds a second vague term (“too structured”) with no defined target. (d) treats firmness as the fix, but absoluteness does not address the real defect: a prohibition shrinks the output space without aiming within it, so even an emphatic “never” fails to specify the paragraph form you meant.

"Respond in smoothly flowing prose" — it names the target shape instead of a forbidden region ✓
"Do not use markdown lists" — it removes the specific failure the reviewers keep flagging
"Avoid bullets and don't be too structured" — it rules out more than one unwanted form at once
"Never use lists or headers under any circumstances" — its firmness leaves no room to drift

Correct: "Respond in smoothly flowing prose" — it names the target shape instead of a forbidden region

d4-01-stop-at-the-rung Prompt Engineering & Structured Output ↑ question in bank

A summarization job returns a JSON object whose fields are all handled well by plain instruction except sentiment, which must be one of three labels and will crash the downstream pipeline if a fourth value ever appears. Following the chapter’s escalation discipline, how should you harden the prompt?

Answer & rationale

The chapter’s discipline is to “stop at the rung the stakes require”: most fields never leave Rung 1 (explicit instruction), and only the crash-on-violation field earns Rung 3 (structured outputs / strict tools), which makes a non-conforming value unrepresentable rather than merely discouraged. (b) over-climbs — promoting every field to the strongest rung just buys setup cost and latency the lower-stakes fields don’t need. (c) under-climbs: the premise is that a fourth value crashes the pipeline, and instruction-plus-retries only makes the shape likely-correct, not unviolatable. (d) escalates to the wrong rung — few-shot (Rung 2) disambiguates ambiguous judgments but still cannot prevent an out-of-set label from being emitted.

Move only sentiment to an enum-constrained tool or structured output, leaving the other fields at explicit instruction ✓
Move every field to a structured output, so the whole contract is enforced uniformly at the strongest rung
Keep all fields on explicit instruction and add retries, since a capable model usually matches the schema when told
Add few-shot examples for every field so each one is demonstrated before the parser ever runs

Correct: Move only sentiment to an enum-constrained tool or structured output, leaving the other fields at explicit instruction

Chapter d4-02

d4-02-edge-case-in-the-set Prompt Engineering & Structured Output ↑ question in bank

Your few-shot extraction returns the empty string for order_id when an input omits the order number, but the downstream consumer needs null. Following the chapter’s “target the ambiguous case” guidance, what is the most reliable fix?

Answer & rationale

The ambiguous case — an input with no order number — is exactly what an example should demonstrate: place one in the set whose output shows the field as null, in the middle, and the model generalizes from that treatment. (b) is the D4.1 instinct and it helps, but the chapter is explicit that a prose rule beside the examples teaches the case less reliably than a demonstration on it — the model generalizes from the example treatment, not the written rule next to it. (c) adds volume without diversity: more clean inputs never show the missing-field case, so the model still has nothing to copy. (d) is a band-aid that hard-codes one symptom — the model may emit "none" or "n/a" next, leaving you maintaining a translation table instead of fixing the prompt at its root.

Add an example whose input has no order number and whose output shows the field as null, placed in the middle of the set ✓
Add a sentence beside the examples instructing the model to use null, not an empty string, when no order number appears
Add several more clean examples that all include an order number, giving the model more data to generalize from
Leave the prompt alone and post-process every empty string in the output into null after the model returns

Correct: Add an example whose input has no order number and whose output shows the field as null, placed in the middle of the set

d4-02-single-example-quirk Prompt Engineering & Structured Output ↑ question in bank

A teammate steers an extraction task with a single carefully chosen example, and the model starts wrapping the first field in quotes on every output even when the input has no quotes. Which change does the chapter prescribe to stop the model copying that incidental trait?

Answer & rationale

With one example the model cannot separate the pattern from incidental traits, so it copies first-field-quoting as if it were required — the documented failure of 1-2 examples. The fix is diversity-via-count: move into the 3-5 range with examples that vary the incidental traits while holding the intended pattern constant, so the quirk no longer appears in every example and stops reading as the rule. (b) is the D4.1 instinct (write a rule), but the chapter’s point is that a prose rule beside the examples is weaker than letting contrasting demonstrations break the spurious pattern. (c) removes the demonstration entirely; zero examples breaks on exactly the kind of case the teammate is trying to pin. (d) misdiagnoses the cause — the model isn’t short on budget, it’s short on contrast.

Move into the 3-5 range with examples that vary the incidental traits, so the wrapping is no longer common across the set ✓
Add a prose instruction telling the model not to quote the first field unless the input quotes it
Cut back to zero examples and let the model infer the shape from the instruction alone
Keep the one example but raise the model's token budget so it has room to reconsider the quirk

Correct: Move into the 3-5 range with examples that vary the incidental traits, so the wrapping is no longer common across the set

Chapter d4-03

d4-03-classic-output-slot Prompt Engineering & Structured Output ↑ question in bank

In the classic structured-output pattern you force Claude to call a print_summary tool. Where is your extracted JSON, and what do you do with the tool’s “result”?

Answer & rationale

You rent the typed tool-call slot as an output channel: the forced call’s input object is your shaped JSON, and the “tool” never does anything — you discard its result. Parsing JSON out of a free-text block with a regex is the brittle approach this pattern replaces, and output_config.format is the modern grammar-constrained feature — a different generation, not the classic pattern.

In the tool call's `input` — you discard the tool's result ✓
In the tool's result — you return it to the model
In the assistant text block — you parse it with a regex
In `output_config.format` — you read it from the response

Correct: In the tool call's `input` — you discard the tool's result

d4-03-strict-dropped-openai Prompt Engineering & Structured Output ↑ question in bank

You set strict: true on a tool to guarantee schema-valid inputs. Through which integration path is that guarantee silently lost?

Answer & rationale

strict: true (grammar-constrained sampling) is honored only on the native Claude API. Calling through the OpenAI-SDK compatibility layer silently drops it — the request still succeeds, so a type mismatch you thought impossible can reappear. The native SDK honors it; batching is orthogonal; and combining strict: true with tool_choice: {type: "any"} is the documented “call one of N tools and validate its inputs” pattern, not a conflict.

The OpenAI SDK compatibility layer ✓
The native Anthropic SDK
The Message Batches API
Any call that also sets `tool_choice` to `any`

Correct: The OpenAI SDK compatibility layer

d4-03-truncation-not-retry Prompt Engineering & Structured Output ↑ question in bank

Your strict extraction normally works, but on unusually long inputs the returned JSON is occasionally cut off mid-object and your parser throws — and stop_reason comes back as max_tokens. What is the correct fix?

Answer & rationale

This is truncation, not a grammar failure: max_tokens is a hard cap on output, and generation hit it partway through writing the object. The grammar did its job — every emitted token was schema-valid — but it cannot guarantee the structure finishes within the budget, so the JSON is cut off before its closing braces. The fix is to enlarge the room: raise max_tokens (or shrink the schema). (b) re-runs against the same budget and truncates at the same place — the chapter is explicit that a plain retry is not the fix. (c) addresses a different failure: a missing additionalProperties: false is the top schema-compilation 400, not a mid-write cutoff. (d) misreads the cause — the grammar was working per-token; abandoning it forfeits the type guarantee without giving the generation any more room.

Raise `max_tokens` (or shrink the schema) — generation ran out of room to close the object ✓
Retry the identical request — a fresh sample will usually complete the object
Add `additionalProperties: false` to every object node so the grammar can compile
Switch off `strict` and parse the response text instead, since the grammar is failing

Correct: Raise `max_tokens` (or shrink the schema) — generation ran out of room to close the object

Chapter d4-04

d4-04-stated-vs-calculated-total Prompt Engineering & Structured Output ↑ question in bank

An invoice extractor returns schema-valid JSON, but once in a while the total it reports does not match the sum of the line items, and those bad totals slip through to billing. Following the chapter’s semantic-hook approach, what schema-level design makes such a total mechanically checkable?

Answer & rationale

A wrong-but-well-typed total is a semantic error — valid JSON, incorrect data — that no schema or type guarantee can touch. The chapter’s fix is to make the model show enough work for a mechanical test: emit both the document’s stated_total and its own re-summed calculated_total, then let application code compare them and route any mismatch to human review. (b) bounds magnitude, not internal consistency, and the chapter notes minimum is unsupported by the structured-outputs subset anyway. (c) only guarantees the total is present and a number — which it already was; a guaranteed type does nothing about a number being wrong. (d) addresses arithmetic effort, not the verification gap; the catch exists only because the verification field pair was added to the schema.

Have the model emit both a stated_total and a calculated_total, then let application code compare them and route mismatches to review ✓
Add a minimum constraint on the total field so a value below the line-item floor is rejected by the schema
Mark the total field required and strict so its type and presence are guaranteed on every record
Raise the model's reasoning budget so it has room to add up the line items more carefully before answering

Correct: Have the model emit both a stated_total and a calculated_total, then let application code compare them and route mismatches to review

d4-04-subtype-before-payload Prompt Engineering & Structured Output ↑ question in bank

A structured-output run sometimes succeeds and sometimes exhausts its retry budget, but downstream code occasionally processes garbage on the failure path instead of falling back. Per the chapter, how should the caller consume an Agent SDK structured-output result so the failure path is handled correctly?

Answer & rationale

Exhaustion is a result you inspect, not an exception that throws, so the caller must branch on subtype: success carries the typed payload in message.structured_output, while error_max_structured_output_retries means the budget ran out and you fall back (simpler schema, simpler prompt, or human review). (b) assumes a throw — but the SDK returns a result on exhaustion, so a try/catch never fires and the failure path is still unhandled. (c) reads the payload before discriminating; on the error path message.structured_output is undefined, which is exactly how garbage reaches downstream code. (d) relies on a retry count the chapter says is undocumented, so it must never be hard-coded.

Branch on the result's subtype first, reading message.structured_output only on success and falling back otherwise ✓
Wrap the call in a try/catch so the exhausted-retries case is caught as an exception and handled there
Read message.structured_output and treat an empty object as the signal that the retries were exhausted
Hard-code the documented retry count so the caller knows exactly how many attempts preceded the result

Correct: Branch on the result's subtype first, reading message.structured_output only on success and falling back otherwise

d4-04-truncation-not-retries Prompt Engineering & Structured Output ↑ question in bank

Your validate-retry loop keeps reaching error_max_structured_output_retries on your longest documents, which come back as cut-off JSON, while shorter documents extract cleanly. Per the chapter, what is the actual failure and the fix?

Answer & rationale

A cut-off response on the longest inputs is truncation, confirmed by stop_reason: "max_tokens" — the one failure retries cannot fix on their own, because re-prompting on the same output budget truncates at the same place and silently burns the whole retry allowance. The fix is to detect that stop reason and raise the cap (or shrink the schema). (b) adds attempts against the same budget, so every one truncates identically — more retries make it worse, not better. (c) loosening the checks lets a partial object pass and ships incomplete data; it treats truncation as a validation problem, which it is not. (d) mutating the schema between attempts invalidates the grammar cache and re-pays compilation — the opposite of the chapter’s advice to keep the schema fixed and vary only the feedback message.

The responses hit the max_tokens cap, so detect that stop_reason and raise the cap or shrink the schema before retrying ✓
The retry budget is too small, so increase the number of attempts until one of them completes the object
The semantic cross-checks are too strict, so loosen them so a partial object can pass validation
The grammar cache is stale, so vary the schema between attempts to force a fresh compilation each retry

Correct: The responses hit the max_tokens cap, so detect that stop_reason and raise the cap or shrink the schema before retrying

Chapter d4-05

d4-05-streaming-dashboard Prompt Engineering & Structured Output ↑ question in bank

A teammate proposes a streaming batch so a dashboard can update ticket-by-ticket as each of 80,000 classifications arrives. Setting aside whether streaming is the right tool, which statement about this proposal does the chapter support?

Answer & rationale

The chapter states plainly that “streaming is not supported for batch requests,” and the exercise solution rejects the live-updating dashboard for exactly that reason — so the proposal cannot be built as described. (b) is the distractor’s trap: custom_id is the mandatory join key for unordered results pulled after the batch ends, but it does nothing to enable live per-ticket streaming, which the batch surface does not offer at all. (c) misattributes the limit — the output-300k-2026-03-24 beta raises the max_tokens cap, not anything about streaming. (d) invents a false size constraint: a single batch is limited to 100,000 requests, so 80,000 fits comfortably; the real blocker is the streaming feature, not the count.

It cannot be built — streaming is not supported for batch requests ✓
It works if every request carries a unique custom_id to order the live updates
It works only after raising max_tokens with the extended-output beta header
It cannot be built — a single batch is capped at far fewer than that many requests

Correct: It cannot be built — streaming is not supported for batch requests

d4-05-succeeded-not-usable Prompt Engineering & Structured Output ↑ question in bank

Your overnight batch finishes and you iterate the results, keeping every entry whose result type is succeeded and treating it as a finished classification. Some downstream records turn out to hold refused or truncated text. Per the chapter, what must the per-result handler do that this design omits?

Answer & rationale

The four result types (succeeded, errored, canceled, expired) are batch-level outcomes — they say the request ran, not that its answer is good. A succeeded result carries a message with its own stop_reason, and two values still bite: a refusal (stop_reason: "refusal") returns a 200, is billed, and may not match your schema; a truncation ("max_tokens") is incomplete output. So handling must inspect each succeeded message’s stop_reason, not stop at the result type. (b) inverts the chapter’s central warning: unusable refusals and truncations arrive precisely as succeeded, not as errored. (c) misreads the lifecycle — re-polling cannot reclassify a finished result, and stop_reason lives on the message, not the batch status. (d) is backwards on billing and content: succeeded is the type that includes the message result, while canceled/expired are the unbilled failures.

Inspect each succeeded message's own stop_reason, because a refusal or truncation arrives as succeeded ✓
Treat every succeeded result as a valid classification, since unusable outputs come back as errored
Skip stop_reason and instead re-poll processing_status until the batch reports usable results
Filter out succeeded results, since only canceled and expired results carry a real message

Correct: Inspect each succeeded message's own stop_reason, because a refusal or truncation arrives as succeeded

Chapter d4-06

d4-06-fleet-false-positives Prompt Engineering & Structured Output ↑ question in bank

A team runs several parallel specialist reviewers with no filtering step and finds the output noisy — each agent independently flags plausible-but-wrong issues that accumulate into untrustworthy review comments. Which change directly addresses this failure mode while keeping the fan-out?

Answer & rationale

The failure mode is false-positive amplification: parallel reviewers each flag plausible-but-wrong issues, and with no filter those candidates accumulate, so more reviewers means more noise, not just more signal. The fix is the verification step, which re-checks each candidate against actual code behavior to filter out false positives before anything is posted — it is what makes fan-out a net gain rather than a faster false-positive generator. (b) discards the fan-out itself, which the chapter calls the direct architectural answer to attention dilution; shrinking to one reviewer surrenders the benefit instead of fixing the noise. (c) collapses the specialists’ single-class mandates back into one over-loaded checklist, reintroducing the dilution the fleet was built to avoid, and overlapping wrong findings still survive unverified. (d) misdiagnoses the cause as a budget shortfall; the problem is the absent behavioral filter, not how much room each reviewer has to second-guess itself.

Add a verification pass that re-checks each candidate finding against actual code behavior before any are posted ✓
Cut the fleet back to a single reviewer so there are fewer independent sources of flagged issues
Give every specialist the same broad mandate so their overlapping findings confirm each other
Raise each reviewer's token budget so it has room to recheck its own findings before posting

Correct: Add a verification pass that re-checks each candidate finding against actual code behavior before any are posted

d4-06-gate-on-neutral-check Prompt Engineering & Structured Output ↑ question in bank

A reviewer asks why a merge still went through even though Code Review posted an Important finding, and wants real bugs to block the merge queue going forward. How should the team make Code Review’s advisory output actually gate a merge?

Answer & rationale

Code Review’s check run “always completes with a neutral conclusion so it never blocks merging through branch protection rules” — it advises, it does not gate. The documented way to enforce its signal is to read the severity breakdown from the check-run output in your own CI and fail the step yourself, exiting non-zero when the Important count is positive so your own required check goes red. (b) misunderstands the neutral conclusion: branch protection can require the check, but a neutral result never counts as a failing status, so the merge is never blocked. (c) confuses trigger mode with gating — after-every-push only controls when reviews run; every run still completes neutral and blocks nothing. (d) treats the symptom as a finding-volume problem; more reviewers produce more findings (and more false positives without the verification pass), but none of them can flip a check that is neutral by design.

Read the severity breakdown from the check-run output in your own CI and fail the step when the Important count is positive ✓
Tighten the branch protection rule so Code Review's neutral check is treated as a required passing status
Switch the trigger from manual to after-every-push so a fresh review runs and blocks each new commit
Add more specialist reviewers so the fleet produces enough Important findings to halt the merge

Correct: Read the severity breakdown from the check-run output in your own CI and fail the step when the Important count is positive

Chapter d5-01

d5-01-accumulation-not-degradation Context Management & Reliability ↑ question in bank

Turns in a session keep growing until the window is nearly full and the loop is about to compact, but nothing the model has produced is wrong or misremembered yet. Which response matches the failure mode actually in play?

Answer & rationale

This is accumulation pressure: nothing is degraded — the budget is simply spent — so the fix is to reduce what accumulates (tighter tool outputs, or /clear and restart with a focused prompt). (b) is the lost-in-the-middle remedy, but here no buried fact has been misremembered; nothing is degraded to re-surface. (c) addresses post-compaction loss of an early rule, which hasn’t happened — the session has not compacted and no instruction has been dropped. (d) misreads spent budget as a quality bug; the chapter’s discipline is to name the mode first, and accumulation is not a degradation to instruct away.

Reduce what accumulates — tighter tool outputs, or /clear and restart with a focused prompt ✓
Re-surface the key facts near the end of the context to counter lost-in-the-middle
Move the standing instructions into CLAUDE.md so they survive the next summary
Treat it as a quality bug and add stricter instructions to make the model attend better

Correct: Reduce what accumulates — tighter tool outputs, or /clear and restart with a focused prompt

d5-01-buried-fact-no-compaction Context Management & Reliability ↑ question in bank

A session is comfortably under the token limit and has never compacted, yet Claude misremembers a detail established about sixty turns ago in a long stretch of context. Which fix most directly addresses the failure mode at work here?

Answer & rationale

Under the limit with no compaction, the failure is lost-in-the-middle: a model attends least reliably to material buried in the middle of a long window, so the fix is to re-surface the fact near the end of the context where it is attended to most reliably. (b) targets post-compaction loss, but the scenario never compacted — re-injection solves a problem that isn’t present. (c) is the bigger-window instinct, which does nothing for this mode: a larger context still loses its middle. (d) governs output length, not which earlier material the model attends to, so it is unrelated to the failure.

Restate the fact near the end of the context so it sits where the model attends most reliably ✓
Move the fact into CLAUDE.md so it is re-injected and survives the next compaction
Switch to a model with a larger context window so the conversation has more room
Raise the output token budget so the model has room to recall the detail

Correct: Restate the fact near the end of the context so it sits where the model attends most reliably

Chapter d5-02

d5-02-subagent-cannot-escalate Context Management & Reliability ↑ question in bank

A coordinator dispatches a subagent to implement a feature, but the subagent discovers the spec is ambiguous and has no way to ask the user. How should the work be restructured so the ambiguity is handled correctly?

Answer & rationale

Subagents cannot call AskUserQuestion, so a subagent that hits an ambiguous requirement can only guess or fail — the exact failure escalation exists to prevent. The fix is to restructure the decomposition so the part that needs a human stays with the agent that can reach one: the coordinator resolves the open questions first, then delegates a fully-specified task. (b) assumes the limit is a missing callback, but the constraint is on the subagent’s side — it cannot raise AskUserQuestion at all, so wiring a callback gives it nothing to surface. (c) is exactly the guess-on-intent the chapter warns against; a recorded assumption is still a coin flip on a decision the model cannot make. (d) misdiagnoses the cause: the subagent is blocked by an absent escalation channel, not by a shortage of reasoning budget.

Have the coordinator resolve the open questions first, then hand the subagent a fully-specified task ✓
Give the subagent its own canUseTool callback so it can surface the question itself
Let the subagent pick a reasonable default and record the assumption in its final report
Raise the subagent's token budget so it has room to reason the ambiguity away

Correct: Have the coordinator resolve the open questions first, then hand the subagent a fully-specified task

d5-02-suggest-alternative-pattern Context Management & Reliability ↑ question in bank

Your canUseTool callback intercepts a Bash(curl … | sh) call. You want to block this specific command but also tell Claude to download, verify the checksum, and then run, so it can correct course on its own. Which response pattern does this?

Answer & rationale

Suggest-alternative is a deny whose message carries guidance; Claude reads the guidance and adjusts its next step, so the command is blocked and the agent is steered toward the safe path in one move. (b) approve-with-changes would run the call after silently editing updatedInput — but the goal is to block this command and let Claude re-derive the safe approach itself, not to execute a rewritten version. (c) a plain reject blocks the command but supplies no guidance, so Claude is stopped without being redirected. (d) approve-and-remember allows the call and suppresses future prompts — the opposite of blocking a pipe-to-shell you consider unsafe.

Suggest-alternative — deny with guidance in the message, which Claude reads and uses to adjust its next step ✓
Approve-with-changes — rewrite the command in updatedInput so the pipe-to-shell becomes a safe download
Reject — return a plain deny so the blocked command simply does not run
Approve-and-remember — echo a PermissionUpdate so this command is auto-allowed next time

Correct: Suggest-alternative — deny with guidance in the message, which Claude reads and uses to adjust its next step

Chapter d5-03

d5-03-green-evals-not-healthy Context Management & Reliability ↑ question in bank

A multi-stage system shows broad, inconsistent quality degradation in production, yet every per-component evaluation stays green and neither internal usage nor the existing eval suite reproduces the problem. Following the April-23 postmortem’s conclusion, what kind of testing does the chapter say is actually needed?

Answer & rationale

The chapter states a compounding failure “is invisible to component tests precisely because it lives between components” — each part passes its own eval and the degradation appears only in the interaction, on traffic slices no single test exercises. The postmortem’s remedy is integration-level: “per-model evaluations on every prompt change, ablation testing, and soak periods before rollout — exercising the system as it actually runs.” (b) misses the point: the failure is not inside any one component, so more exhaustive component suites still can’t see a between-components effect. (c) is a real defense the chapter lists for error propagation across boundaries, but structured error context routes faults that surface — it is not the detection method the postmortem concluded was needed for a degradation no eval reproduced. (d) repeats the weakest-link fallacy the chapter rejects: the degradation lived in the interaction, not in a single least-reliable component, so swapping one agent does not address it.

Run integration-level testing — per-model evals on every prompt change, ablation testing, and soak periods that exercise the system as it actually runs ✓
Raise the coverage threshold on the existing component evals until each agent's suite is exhaustive
Add a structured-error-context schema to every agent boundary so faults are reported in machine-readable form
Replace the slowest agent in the chain, since the degradation must originate in whichever component is least reliable

Correct: Run integration-level testing — per-model evals on every prompt change, ablation testing, and soak periods that exercise the system as it actually runs

d5-03-silent-ambiguity-resolution Context Management & Reliability ↑ question in bank

In a planner-to-coder-to-reviewer pipeline, the planner emits a spec with one ambiguous requirement. The coder produces clean, working code, the reviewer rates it fine, and the wrong output ships with no warning anywhere. According to the chapter, what does the coder do at the moment it meets the ambiguity that makes the error go silent?

Answer & rationale

The chapter’s propagation mechanism is specific: a mid-pipeline agent “usually has no one to ask, so it resolves the ambiguity — and a quietly-wrong resolution is handed downstream as if it were settled fact,” and “the next agent has no signal that the input it received was a guess.” That is why the reviewer rates plausible code fine: it never sees that a guess was made. (b) is wrong because the chapter’s failure is not forwarding unclear wording — the coder commits to one interpretation and emits clean code, so the reviewer sees a confident artifact, not ambiguity. (c) inverts the chapter: the whole point is that a mid-pipeline agent cannot escalate the way D5.2’s interactive agent can — it has no one to ask, so it does not wait on a human reply. (d) invents a retry-until-clear loop the chapter never describes; the coder does not re-query the planner, it silently picks an interpretation and moves on.

The coder has no one to ask, so it resolves the ambiguity itself and hands the guess downstream as settled fact ✓
The coder copies the planner's ambiguous wording forward verbatim, so the reviewer inherits the same unclear spec
The coder escalates the ambiguity to a human, but the reply arrives too late to change the committed output
The coder retries the planner step until the spec comes back unambiguous, masking the original defect

Correct: The coder has no one to ask, so it resolves the ambiguity itself and hands the guess downstream as settled fact

Chapter d5-04

d5-04-compact-vs-clear Context Management & Reliability ↑ question in bank

A long task’s history is cluttered but still relevant; your next task is unrelated. Which command fits which?

Answer & rationale

The decision rule is continuity. /compact condenses the conversation in place, so you keep going on the same task with a shorter (lossy) history; /clear starts a fresh conversation — the previous one still reachable via /resume — for switching to an unrelated task. Both free context, and neither changes the model.

`/compact` to continue this task; `/clear` to switch to the new one ✓
`/clear` to continue this task; `/compact` to switch
`/compact` for both — it's the only one that frees context
`/clear` for both — `/compact` only changes the model

Correct: `/compact` to continue this task; `/clear` to switch to the new one

d5-04-scratchpad-survives Context Management & Reliability ↑ question in bank

You write a PLAN.md scratchpad, then run /compact, then later /clear. Is the plan still available to the agent?

Answer & rationale

Compaction summarizes the window and /clear wipes the window — neither touches the filesystem. State written to a file lives on disk, so the agent re-reads PLAN.md after a compaction, or in a freshly-cleared session, exactly as it left it. The durable layer of a long task doesn’t live in the conversation at all; /clear does not delete files.

Yes — a disk scratchpad survives both ✓
No — `/compact` summarizes it away
No — `/clear` wipes the working directory
Only until `/clear`, which deletes the file

Correct: Yes — a disk scratchpad survives both

Chapter d5-05

d5-05-funnel-judge-tier Context Management & Reliability ↑ question in bank

Cheap automated checks pass an extraction — valid shape, no flagged conflict — yet it is wrong-but-plausible, and shipping it as-is is costly. The chapter’s funnel puts one tier between the auto-checks and the human to catch exactly this case. Which tier is it, and why does it catch what the auto-checks miss?

Answer & rationale

Review is a funnel, not a gate: cheap auto-checks, then an isolated judge, then the human, with each tier escalating only what it cannot resolve. The judge catches wrong-but-plausible errors precisely because a fresh, independent context has no authorship bias toward the output it is checking — it can judge correctness where a regex or equality test cannot. (b) retries with the same model, but a confidently-wrong extraction reproduces on the second run, so agreement is not evidence of correctness. (c) trusts self-reported confidence, which a confidently-wrong output also reports as “high” — a claim, not a measurement. (d) is safe but uncalibrated: it spends the most expensive reviewer on every record and does not scale.

An isolated judge in a fresh context, because it has no authorship bias toward the output it is checking ✓
A second run of the same model, accepting the record when the two runs agree
The model's own self-reported confidence score, gating the human queue directly
Sending every record straight to the human reviewer to be safe

Correct: An isolated judge in a fresh context, because it has no authorship bias toward the output it is checking

d5-05-measure-before-routing Context Management & Reliability ↑ question in bank

An extraction pipeline emits a confidence field of high, medium, or low, and you want to use it to decide which records auto-accept and which route onward for a high-stakes clinical field. Before you trust the label to route, what does the chapter say you must do?

Answer & rationale

A confidence signal earns its routing role by measurement, not by its name: you measure, over real labeled data, the accuracy at each stated-confidence level and set the bar where the signal reliably predicts correctness. (b) trusts the word “high” as if it meant “certain” — but a measured “high” can be 94%, not ~100%, so roughly six in a hundred high-confidence extractions are wrong, an unacceptable residual on a high-stakes clinical field. (c) discards a signal the measurement shows is informative (accuracy falls monotonically high to low) and routes on frequency, which carries no information about correctness. (d) ignores the chapter’s warning that calibration is not permanent — you re-measure when the model, prompt, or input distribution changes.

Measure, on labeled past extractions, how often each stated level is actually correct, then route by that observed accuracy ✓
Auto-accept every record marked high, since high monotonically outranks medium and low
Drop the confidence field and route purely on whichever label the model emits most often
Treat the field as calibrated and reuse the same thresholds after the model is upgraded

Correct: Measure, on labeled past extractions, how often each stated level is actually correct, then route by that observed accuracy

Chapter d5-06

d5-06-citations-schema-conflict Context Management & Reliability ↑ question in bank

An extraction service enables the Citations API on a document and also sets a structured-output format on the same request, expecting JSON with per-claim citations. The API returns a 400 instead. According to the chapter, what is the root cause of this rejection?

Answer & rationale

The chapter is explicit that “Citations cannot be used together with Structured Outputs … the API will return a 400 error,” because cited text must interleave with the response prose, which a strict JSON schema forbids. You cannot get span-bound attribution and a grammar-constrained JSON shape in the same call; the fallback is the provenance triple. (b) describes a real limitation — scanned images without extractable text are not citable — but a scan failure is not what produces this 400; the conflict is structural, not a document-quality problem. (c) names the all-or-none enablement rule, a separate constraint that governs mixing cited and uncited documents, not the Citations-plus-schema collision. (d) misreads the cost model: cited_text “does not count towards output tokens,” so it cannot overflow the output budget.

The two cannot coexist, because cited text must interleave with the response prose that a strict JSON schema forbids ✓
The PDF was scanned without extractable text, so its pages could not be cited under page_location
Citations were enabled on only some of the request's documents rather than all of them
The cited_text field overflowed the output token budget once the schema was attached

Correct: The two cannot coexist, because cited text must interleave with the response prose that a strict JSON schema forbids

d5-06-reliable-cutoff-bounds-trust Context Management & Reliability ↑ question in bank

A teammate notes that Sonnet 4.6 trained on data through January 2026 but is reliable only to August 2025, and asks which date should decide whether a time-sensitive fact may be answered from the model’s own memory. Per the chapter, which date governs that decision, and why?

Answer & rationale

The chapter states the reliable knowledge cutoff is earlier than (or equal to) the training-data cutoff because data near the training cutoff is sparse, so the model’s reliable knowledge stops before its training does — Sonnet 4.6 trained through January 2026 but is reliable only to August 2025, and the earlier date bounds trust. (b) picks the training cutoff, the exact trap the chapter warns against: training reach is not reliability, so January 2026 overstates what the model dependably knows. (c) denies the gap the chapter builds the whole section on — the two dates are distinct, not the same boundary. (d) is true only after a dated source is supplied; the question is precisely whether memory is trustworthy first, which the reliable cutoff decides.

August 2025, because the reliable knowledge cutoff is earlier than the training cutoff and is the date that bounds trust ✓
January 2026, because the model trained on data through that date and can be trusted to that point
Either date is fine, since both cutoffs describe the same boundary of dependable knowledge
Neither date applies, because supplying a dated source removes any cutoff from consideration

Correct: August 2025, because the reliable knowledge cutoff is earlier than the training cutoff and is the date that bounds trust