Part IV’s final chapter is about checking the work at scale. The instinct to “ask the model to double-check itself” is exactly the wrong one, for a structural reason: a model reviewing in the same window that wrote the code is both diluted by a full context and biased toward what it just produced. The fix is independence, and it scales from two sessions to a fleet. The principle is stable and the product that embodies it is only the illustration, so treat what follows as an architectural pattern rather than a product feature.
Why a fresh context beats self-review
Two independent forces make same-session self-review weak. The first is attention dilution: “LLM performance degrades as context fills. When the context window is getting full, Claude may start ‘forgetting’ earlier instructions or making more mistakes. The context window is the most important resource to manage.” [Official] Best practices for Claude Code · Anthropic The second is implementer bias: “A fresh context improves code review since Claude won’t be biased toward code it just wrote.” [Official] Best practices for Claude Code · Anthropic A reviewer that never watched the code get written carries neither the polluted context nor the sunk-cost instinct to defend it.
The Writer/Reviewer pattern and its lightweight form
The canonical realization is two sessions. One writes; a second, with no inherited context, reviews; the first then addresses the feedback. The docs give a worked example: Session A implements a rate limiter, Session B reviews @src/middleware/rateLimiter.ts “looking for edge cases, race conditions, and consistency with existing middleware patterns,” and Session A applies the result. [Official] Best practices for Claude Code · Anthropic The same shape works for tests: “have one Claude write tests, then another write code to pass them.” [Official] Best practices for Claude Code · Anthropic When spinning up a second session is too heavy, the single-session analog is a verification subagent (“use a subagent to review this code for edge cases”), which runs in its own context window and so inherits none of the parent conversation’s assumptions. [Official] Best practices for Claude Code · Anthropic
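What “no inherited context” means can be made concrete with a toy model. The sketch below is only an illustration under stated assumptions: `Session` and `writer_reviewer` are hypothetical names, not part of any Claude Code API, and a session's context is modeled as a plain list of strings.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Toy stand-in for one model session; `context` is everything it has seen."""
    name: str
    context: list[str] = field(default_factory=list)

    def see(self, item: str) -> None:
        self.context.append(item)


def writer_reviewer(task: str, artifact_path: str) -> tuple[Session, Session]:
    # Session A accumulates the whole implementation history.
    writer = Session("A")
    writer.see(f"task: {task}")
    writer.see("exploration, false starts, design debate")
    writer.see(f"wrote {artifact_path}")

    # Session B starts empty and is handed only the artifact and a review brief.
    reviewer = Session("B")
    reviewer.see(f"review {artifact_path} for edge cases and race conditions")

    # The only thing that crosses back to the writer is the feedback.
    writer.see(f"feedback from {reviewer.name} on {artifact_path}")
    return writer, reviewer


writer, reviewer = writer_reviewer(
    "implement a rate limiter", "src/middleware/rateLimiter.ts"
)
print(len(writer.context), len(reviewer.context))
```

The reviewer's context holds a single item, so neither the writer's false starts nor its design rationale can colour the review.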
The fleet: parallelism plus a verification pass
At the top of the scale, the pattern becomes a fleet. In Anthropic’s Code Review product, “when a review runs, multiple agents analyze the diff and surrounding code in parallel on Anthropic infrastructure. Each agent looks for a different class of issue, then a verification step checks candidates against actual code behavior to filter out false positives.” [Official] Code Review · Anthropic The surviving findings “are deduplicated, ranked by severity, and posted as inline comments.” [Official] Code Review · Anthropic Fanning out to specialists is the direct architectural answer to attention dilution: rather than ask one reviewer to hold every bug class in one window at once, each agent owns a single class. The isolated-context, lead-plus-specialists shape is the same one Anthropic’s multi-agent research system uses. [Official] How we built our multi-agent research system · Anthropic (2025)
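The fan-out, verify, deduplicate, rank pipeline can be sketched in a few lines. This is a toy model, not Anthropic's implementation: the specialists are plain functions rather than model agents, the `verify` rule is a hand-written stand-in for checking against actual code behavior, and every name here (`Finding`, `review`, the agent functions) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

SEVERITY_ORDER = {"important": 0, "nit": 1, "pre-existing": 2}


@dataclass(frozen=True)
class Finding:
    line: int
    issue: str
    severity: str


def deref_agent(diff):
    # Specialist 1: flags every `.value` access as a possible None dereference.
    return [Finding(i, "possible None dereference", "important")
            for i, text in enumerate(diff, 1) if ".value" in text]


def overlap_agent(diff):
    # Specialist 2: lands on the same issue class, producing duplicates.
    return deref_agent(diff)


def style_agent(diff):
    # Specialist 3: flags long lines as nits (threshold is arbitrary).
    return [Finding(i, "line too long", "nit")
            for i, text in enumerate(diff, 1) if len(text) > 25]


def verify(diff, finding):
    # Stand-in verification: a dereference is only real if the line has no guard.
    if finding.issue == "possible None dereference":
        return " if " not in diff[finding.line - 1]
    return True


def review(diff, specialists):
    # Fan out: each specialist analyzes the diff independently, in parallel.
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(lambda agent: list(agent(diff)), specialists))
    candidates = [f for batch in batches for f in batch]
    # Verification pass filters false positives before anything is posted.
    verified = [f for f in candidates if verify(diff, f)]
    # Deduplicate (frozen dataclasses are hashable), then rank by severity.
    return sorted(set(verified), key=lambda f: (SEVERITY_ORDER[f.severity], f.line))


diff = [
    "x = lookup(key)",
    "print(x.value)",
    "y = lookup(key2)",
    "print(y.value if y else None)",
]
results = review(diff, [deref_agent, overlap_agent, style_agent])
for finding in results:
    print(finding)
```

Five candidates enter the pipeline and two findings leave it: the guarded dereference on line 4 is filtered as a false positive, and the duplicate report on line 2 collapses into one.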
Code Review in practice: availability, triggers, severity
The productized form of the pattern has a concrete surface the exam can probe. Availability is gated: “Code Review is in research preview, available for Team and Enterprise subscriptions. It is not available for organizations with Zero Data Retention enabled.” [Official] Code Review · Anthropic
Triggers come in three modes per repo: once after PR creation, after every push, or manual. A manual review is invoked by commenting @claude review (which then subscribes to subsequent pushes) or @claude review once (a single one-off pass). [Official] Code Review · Anthropic The one-off is the lever for “review this PR now, but don’t enroll it in re-review on every push.”
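The difference between the two manual commands is easiest to see as state. The sketch below models only what the text above describes; `PullRequest`, `on_comment`, and `on_push` are hypothetical names, and matching the comment body exactly is an assumption made for illustration, not a claim about how the product parses comments.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    subscribed: bool = False  # whether later pushes trigger a re-review
    reviews_run: int = 0


def on_comment(pr: PullRequest, body: str) -> None:
    command = body.strip()
    if command == "@claude review once":
        # One-off pass: review now, do not enroll in re-review.
        pr.reviews_run += 1
    elif command == "@claude review":
        # Review now and subscribe the PR to subsequent pushes.
        pr.reviews_run += 1
        pr.subscribed = True


def on_push(pr: PullRequest) -> None:
    # In manual mode, a push is only reviewed if the PR was subscribed.
    if pr.subscribed:
        pr.reviews_run += 1


one_off = PullRequest()
on_comment(one_off, "@claude review once")
on_push(one_off)

enrolled = PullRequest()
on_comment(enrolled, "@claude review")
on_push(enrolled)
on_push(enrolled)

print(one_off.reviews_run, enrolled.reviews_run)
```

With one comment and the pushes shown, the one-off PR is reviewed once while the enrolled PR is reviewed three times.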
Severity is a fixed three-tag taxonomy on every finding: 🔴 Important, 🟡 Nit, and 🟣 Pre-existing.
And one subtlety that catches people: the check run “always completes with a neutral conclusion so it never blocks merging through branch protection rules.” [Official] Code Review · Anthropic Code Review advises; it does not gate. To actually block a merge on findings, read the severity breakdown from the check-run output in your own CI and fail the step yourself. [Official] Code Review · Anthropic
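A gate of that kind reduces to a small decision function. The sketch below assumes the severity breakdown has already been parsed into a dictionary of counts; this chapter does not give the exact format of the check-run output, so the key names and the `gate` function are hypothetical.

```python
import sys


def gate(severity_counts: dict[str, int], blocking=("important",)) -> int:
    """Return an exit code: 1 if any blocking severity has findings, else 0."""
    blocked = {
        tag: count
        for tag, count in severity_counts.items()
        if tag in blocking and count > 0
    }
    for tag, count in blocked.items():
        print(f"blocking: {count} {tag} finding(s)", file=sys.stderr)
    return 1 if blocked else 0


# Hypothetical counts; a real CI step would parse them from the check-run output.
counts = {"important": 2, "nit": 5, "pre-existing": 1}
exit_code = gate(counts)
print(exit_code)  # a CI step would pass this value to sys.exit()
```

Keeping the blocking set a parameter makes the policy explicit: the default fails only on Important findings, so nits and pre-existing issues never hold up a merge.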
Convergence: keep multi-pass from spamming
More passes are not strictly better: multi-pass review needs convergence rules, and attention dilution applies to the instructions as much as the code. The docs are explicit that “a long REVIEW.md dilutes the rules that matter most,” [Official] Code Review · Anthropic so the broadcast instruction block stays short. Re-review on every push also needs a damping rule so trivial diffs don’t draw endless commentary: an instruction like “after the first review, suppress new nits and post Important findings only” stops “a one-line fix from reaching round seven on style alone.” [Official] Code Review · Anthropic Production-quality fleet review is real work, not free: Code Review averages roughly $15–25 and 20 minutes per review at current figures, [Official] Code Review · Anthropic so the convergence rules are also cost control.
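The damping rule is a natural-language instruction, not code, but its intended effect and the cost arithmetic behind it can be modeled. In this sketch `damp` and `cost_range` are hypothetical helpers, findings are plain dictionaries, and the per-review price uses the rough $15–25 average quoted above.

```python
def damp(findings: list[dict], review_round: int) -> list[dict]:
    """Model the rule: after the first review, post Important findings only."""
    if review_round <= 1:
        return list(findings)
    return [f for f in findings if f["severity"] == "important"]


def cost_range(reviews: int, low: int = 15, high: int = 25) -> tuple[int, int]:
    """Rough dollar range for a number of reviews at the quoted average."""
    return reviews * low, reviews * high


findings = [
    {"severity": "important", "issue": "race condition in counter reset"},
    {"severity": "nit", "issue": "prefer const over let"},
    {"severity": "nit", "issue": "comment typo"},
]

first_round = damp(findings, review_round=1)
later_round = damp(findings, review_round=2)
print(len(first_round), len(later_round))
print(cost_range(7))
```

A PR that reaches round seven would cost roughly $105–175 at the quoted average, which is why damping and short instructions double as cost control.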
Practice
Exercise solutions
B. A fresh session or a verification subagent reviews with no inherited context, which removes both weaknesses of self-review at once: the reviewer is at peak performance in a clean window and has no authorship bias toward the code, exactly the conditions the docs credit for catching what self-review misses. A is self-review: most diluted (the implementation already fills the context) and most biased (the model defends what it wrote). C generates a second implementation, not a review, and gives you two artifacts to reconcile rather than a found bug. D re-pastes code into an already-polluted context; freshening the code does nothing about the accumulated context or the authorship bias. Only a fresh, independent context fixes those.
Attention dilution says performance degrades as the context window fills, so a single agent asked to find every class of bug must hold the diff, the surrounding code, and a long checklist of bug categories in one window, and its attention to any one class thins as the others crowd in. Fanning out gives each specialist its own fresh context with a single mandate (race conditions, or injection, or edge cases), so none of them is operating with a diluted context, and each brings full attention to its one class. The fleet trades one overloaded reviewer for many focused ones; that is the direct architectural answer to dilution (paired, crucially, with a verification pass so the extra candidates don’t become noise).
The failure mode is false-positive amplification: parallel reviewers each independently flag plausible-but-wrong issues, and with no filter those candidates accumulate, so adding reviewers adds noise, not just signal — five agents surface five streams of unverified guesses. The verification pass re-checks each candidate against actual code behavior to filter out false positives before anything is posted, so only findings that survive a behavioral check reach the human. It is what makes fan-out a net gain rather than a faster false-positive generator; the surviving findings are then deduplicated and ranked by severity.
Exam essentials
- Fresh context beats self-review for two independent reasons: attention dilution (performance degrades as the window fills) and implementer bias (a model defends code it just wrote). Independence removes both.
- Writer/Reviewer — one session writes, a second independent session reviews, the writer addresses feedback; the test/code split is a variant; the verification subagent is the single-session, isolated-context form.
- Fleet + verification pass — parallel specialists each own one issue class (the answer to attention dilution); a verification step filters false positives, then dedupe + severity ranking. A fleet without verification amplifies false positives.
- Convergence rules — keep the broadcast instruction block short (“a long REVIEW.md dilutes the rules that matter most”) and damp re-review (“suppress new nits, post Important findings only”) so trivial diffs don’t draw endless passes; this is also cost control.
- Code Review surfaces: research preview, Team & Enterprise only, not under Zero Data Retention; three trigger modes (once after PR creation / after every push / manual via @claude review or one-off @claude review once); severity taxonomy 🔴 Important / 🟡 Nit / 🟣 Pre-existing; the check run is neutral and never blocks a merge; gate by reading the severity breakdown in your own CI.