Part IV’s final chapter is about checking the work — at scale. The instinct to “ask the model to double-check itself” is exactly the wrong one, for a structural reason: a model reviewing in the same window that wrote the code is both dilated by a full context and biased toward what it just produced. The fix is independence, and it scales from two sessions to a fleet. The principle is stable; the product that embodies it is the illustration — so this is an architectural pattern.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Why is asking a model to review its own code in the same session the weakest review — for which two independent reasons?
Name the three scales of the independent-reviewer pattern, cheapest first.
What does the verification pass do in a fleet review, and what fails without it?
Name Code Review’s three severity tags, and which one you would block a merge on.
Code Review’s check run never blocks a merge on its own — so how do you actually gate on its findings?

Check your answers

Attention dilution (performance degrades as the context window fills) and implementer bias (a fresh context won’t be biased toward code it just wrote) — self-review is the most dilated and most biased reviewer you could pick.
Verification subagent (one session, child in its own context window), Writer/Reviewer (two independent sessions), fleet (many parallel specialists, one issue class each).
It re-checks each candidate finding against actual code behavior to filter out false positives; without it, parallel reviewers’ plausible-but-wrong findings accumulate — more reviewers means more noise.
🔴 Important / 🟡 Nit / 🟣 Pre-existing — block the merge on Important (a bug that should be fixed before merging).
Read the severity breakdown from the check-run output in your own CI and fail the step yourself — exit non-zero when the 🔴 Important count is positive, so your own required check blocks the merge.

Why a fresh context beats self-review

Two independent forces make same-session self-review weak. The first is attention dilution: “LLM performance degrades as context fills. When the context window is getting full, Claude may start ‘forgetting’ earlier instructions or making more mistakes. The context window is the most important resource to manage.” [Official] Best practices for Claude Code · AnthropicT1-official original The second is implementer bias: “A fresh context improves code review since Claude won’t be biased toward code it just wrote.” [Official] Best practices for Claude Code · AnthropicT1-official original A reviewer that never watched the code get written carries neither the polluted context nor the sunk-cost instinct to defend it.

The Writer/Reviewer pattern and its lightweight form

The canonical realization is two sessions. One writes; a second, with no inherited context, reviews; the first then addresses the feedback. The docs give a worked example: Session A implements a rate limiter, Session B reviews @src/middleware/rateLimiter.ts “looking for edge cases, race conditions, and consistency with existing middleware patterns,” and Session A applies the result. [Official] Best practices for Claude Code · AnthropicT1-official original The same shape works for tests: “have one Claude write tests, then another write code to pass them.” [Official] Best practices for Claude Code · AnthropicT1-official original When spinning up a second session is too heavy, the single-session analog is a verification subagent — “use a subagent to review this code for edge cases” — which runs in its own context window and so inherits none of the parent conversation’s assumptions. [Official] Best practices for Claude Code · AnthropicT1-official original

The fleet: parallelism plus a verification pass

At the top of the scale, the pattern becomes a fleet. In Anthropic’s Code Review product, “when a review runs, multiple agents analyze the diff and surrounding code in parallel on Anthropic infrastructure. Each agent looks for a different class of issue, then a verification step checks candidates against actual code behavior to filter out false positives.” [Official] Code Review · AnthropicT1-official original The surviving findings “are deduplicated, ranked by severity, and posted as inline comments.” [Official] Code Review · AnthropicT1-official original Fanning out to specialists is the direct architectural answer to attention dilution: rather than ask one reviewer to hold every bug class in one window at once, each agent owns a single class — and the isolated-context, lead-plus-specialists shape is the same one Anthropic’s multi-agent research system uses. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original

Code Review in practice: availability, triggers, severity

The productized form of the pattern has a concrete surface the exam can probe. Availability is gated: “Code Review is in research preview, available for Team and Enterprise subscriptions. It is not available for organizations with Zero Data Retention enabled.” [Official] Code Review · AnthropicT1-official original

Triggers come in three modes per repo: once after PR creation, after every push, or manual — invoked by commenting @claude review (which then subscribes to subsequent pushes) or @claude review once (a single one-off pass). [Official] Code Review · AnthropicT1-official original The one-off is the lever for “review this PR now, but don’t enroll it in re-review on every push.”

Severity is a fixed three-tag taxonomy on every finding:

And one subtlety that catches people: the check run “always completes with a neutral conclusion so it never blocks merging through branch protection rules.” [Official] Code Review · AnthropicT1-official original Code Review advises; it does not gate. To actually block a merge on findings, read the severity breakdown from the check-run output in your own CI and fail the step yourself. [Official] Code Review · AnthropicT1-official original

Convergence: keep multi-pass from spamming

More passes is not strictly better — multi-pass review needs convergence rules, and attention dilution applies to the instructions as much as the code. The docs are explicit that “a long REVIEW.md dilutes the rules that matter most,” [Official] Code Review · AnthropicT1-official original so the broadcast instruction block stays short. And re-review on every push needs a damping rule so trivial diffs don’t draw endless commentary: an instruction like “after the first review, suppress new nits and post Important findings only” stops “a one-line fix from reaching round seven on style alone.” [Official] Code Review · AnthropicT1-official original Production-quality fleet review is real work, not free — Code Review averages roughly $15–25 and 20 minutes per review at current figures [Official] Code Review · AnthropicT1-official original — so the convergence rules are also cost control.

Making an advisory review actually block a merge Worked example

Your team wants a merge blocked when Code Review finds a real bug — but its check run is neutral by design and never blocks through branch protection. So you build the gate yourself, on top of the surfaces above:

Trigger once. On a PR you want checked but not re-reviewed on every push, comment @claude review once. The fleet runs, the verification pass filters false positives, and findings post as inline comments tagged 🔴 Important / 🟡 Nit / 🟣 Pre-existing.
Read the severity breakdown. Code Review writes a machine-readable severity count into the check-run output; a CI step parses it (e.g. gh + jq) to pull the 🔴 Important count.
Fail on Important. Exit non-zero when it is positive, so your own required check (not Code Review’s) goes red and branch protection blocks the merge:

important=$(gh ... --jq '... | .important')
[ "$important" -gt 0 ] && { echo "Code Review: $important Important finding(s)"; exit 1; }

The reasoning ties three threads together. Code Review deliberately advises rather than gates, so a research-preview tool can never wedge a team’s merge queue. The 🔴/🟡/🟣 taxonomy is what makes a self-built gate precise — block on Important, let Nits and Pre-existing through. And the exit-code discipline from D3.6 is what converts an advisory signal into an enforced one. Note also the cost lever: @claude review once runs a single pass, where “after every push” multiplies the ~$15–25 per-review cost by your push count.

Practice

Exercise solutions

Solution ↑ Exercise

B. A fresh session or a verification subagent reviews with no inherited context, which removes both weaknesses of self-review at once — the reviewer is at peak performance in a clean window and has no authorship bias toward the code, exactly the conditions the docs credit for catching what self-review misses. A is self-review: most dilated (the implementation already fills the context) and most biased (the model defends what it wrote). C generates a second implementation, not a review, and gives you two artifacts to reconcile rather than a found bug. D re-pastes code into an already-polluted context; freshening the code does nothing about the accumulated context or the authorship bias — only a fresh, independent context fixes those.

Solution ↑ Exercise

Attention dilution says performance degrades as the context window fills — so a single agent asked to find every class of bug must hold the diff, the surrounding code, and a long checklist of bug categories in one window, and its attention to any one class thins as the others crowd in. Fanning out gives each specialist its own fresh context with a single mandate — race conditions, or injection, or edge cases — so none of them is operating dilated, and each brings full attention to its one class. The fleet trades one over-loaded reviewer for many focused ones; that is the direct architectural answer to dilution (paired, crucially, with a verification pass so the extra candidates don’t become noise).

Solution ↑ Exercise

The failure mode is false-positive amplification: parallel reviewers each independently flag plausible-but-wrong issues, and with no filter those candidates accumulate, so adding reviewers adds noise, not just signal — five agents surface five streams of unverified guesses. The verification pass re-checks each candidate against actual code behavior to filter out false positives before anything is posted, so only findings that survive a behavioral check reach the human. It is what makes fan-out a net gain rather than a faster false-positive generator; the surviving findings are then deduplicated and ranked by severity.

Exam essentials

Fresh context beats self-review for two independent reasons: attention dilution (performance degrades as the window fills) and implementer bias (a model defends code it just wrote). Independence removes both.
Writer/Reviewer — one session writes, a second independent session reviews, the writer addresses feedback; the test/code split is a variant; the verification subagent is the single-session, isolated-context form.
Fleet + verification pass — parallel specialists each own one issue class (the answer to attention dilution); a verification step filters false positives, then dedupe + severity ranking. A fleet without verification amplifies false positives.
Convergence rules — keep the broadcast instruction block short (“a long REVIEW.md dilutes the rules that matter most”) and damp re-review (“suppress new nits, post Important findings only”) so trivial diffs don’t draw endless passes; this is also cost control.
Code Review surfaces — research preview, Team & Enterprise only, not under Zero Data Retention; three trigger modes (once after PR creation / after every push / manual via @claude review or one-off @claude review once); severity taxonomy 🔴 Important / 🟡 Nit / 🟣 Pre-existing; the check run is neutral and never blocks a merge — gate by reading the severity breakdown in your own CI.

Part 4 · D4 Review

6 exercises across 6 chapters — interleaved review.

d4-01-explicit-criteria

d4-01-ex-vague-vs-explicit A nightly job asks Claude to "summarize each support ticket." The summaries come back wildly inconsistent — some one line, some three paragraphs, some with a sentiment label and some without — and a downstream parser keeps breaking. What is the most direct fix? - **A.** Lower the sampling temperature so the model decodes more deterministically and repeats one shape. - **B.** Add "be consistent and don't write too much" so the prompt tells it to self-regulate length. - **C.** Specify the exact output contract — a fixed set of typed, length-bounded fields the parser can rely on. - **D.** Switch to a larger model that infers the intended format more reliably from the same prompt.

d4-02-few-shot-prompting

d4-02-ex-ambiguous-edge You are extracting `{invoice_id, amount, due_date}` from emails. Most are clean, but some have no due date at all, and the pipeline keeps emitting `"due_date": "unknown"` for those — which the downstream date parser rejects. You want missing dates to come back as `null`. What is the most reliable fix? - **A.** Add a sentence to the instruction: "if there is no due date, use null, not 'unknown'." - **B.** Include 3-5 examples, one of which is an email with no due date whose output shows `"due_date": null`. - **C.** Raise the example count to 10 clean invoices so the model has more data to generalize from. - **D.** Post-process every `"unknown"` string into `null` after the model returns.

d4-03-structured-output-tool-use

d4-03-ex-pick-the-primitive You extract a fixed record — `{customer, plan, seats}` with `seats` an integer — and a downstream billing function will crash on a string where it expects a number. You want exactly this one extraction, type-guaranteed, on the native Claude API. Which approach fits best? - **A.** Force a single `print_record` tool with `tool_choice: {type: "tool", name: "print_record"}` and set `strict: true` on it. - **B.** Use the classic cookbook pattern with `additionalProperties: true` so the model can include whatever it finds. - **C.** Call through the OpenAI SDK compatibility layer with `strict: true` for portability. - **D.** Ask in the prompt for "valid JSON with an integer seats field" and parse the response text.

d4-04-validation-retry-feedback

d4-04-ex-semantic-catch An invoice extractor uses structured outputs and never returns malformed JSON, but once in a while it reports a `total` that doesn't match the line items — and those slip through to billing. You want to catch them automatically. What is the most effective design? - **A.** Add `strict: true` to the extraction tool so the totals are type-guaranteed. - **B.** Add both `stated_total` and `calculated_total` to the schema and have application code compare them, routing any mismatch to human review. - **C.** Raise `max_tokens` so the model has room to compute the total more carefully. - **D.** Tighten the JSON schema with a `minimum` constraint on `total`.

d4-05-batch-processing

d4-05-ex-batch-or-realtime You must classify 80,000 archived support tickets overnight to populate an analytics dashboard read the next morning. Cost matters; latency does not. Which design fits best? - **A.** A real-time loop calling the Messages API once per ticket, parallelized for throughput. - **B.** One Message Batch of 80,000 requests, each with a unique `custom_id`, results matched by `custom_id` after `processing_status` reaches `ended`. - **C.** A streaming batch so the dashboard can update ticket-by-ticket as results arrive. - **D.** A single Messages API request containing all 80,000 tickets in one prompt.

d4-06-multi-pass-review

d4-06-ex-independent-review Claude has just implemented a subtle concurrent cache in one long session, and you want the strongest possible bug-catch before merge. Which approach is most likely to surface a race condition? - **A.** Ask the same session, in the next turn, to carefully review its own implementation for bugs. - **B.** Open a fresh session (or dispatch a verification subagent) that reviews the cache with no context from how it was written. - **C.** Re-run the same prompt at a higher temperature to get a different implementation to compare. - **D.** Keep reviewing in the same session but paste the code back in so it is fresh in context.