Once the agent loop of D1.1 can run tools, the next architectural question is whether to run more than one agent. This chapter develops the canonical multi-agent shape — a coordinator that spawns isolated subagents — and, just as importantly, the discipline of not reaching for it. The exam tests judgment here: when the pattern wins, what it costs on every axis, how to cut the work, and which single variant is reliable.

Diagnostic: Do I know this already?

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

What single property defines a subagent and separates it from a plain tool call?
Name the three conditions under which coordinator–subagent earns its token cost.
Roughly what token multiplier does multi-agent cost versus a single agent, and does it buy speed or thoroughness?
Besides cost, name two axes on which a multi-agent system is worse than a single agent.
A teammate proposes splitting a task into planner → implementer → tester → reviewer. What is the name of this anti-pattern and the metaphor for its failure?

Check your answers

Isolated context — a subagent runs in its own context window and does not see the parent’s state, where a tool call returns directly into the calling agent’s context.
Context protection (large, mostly-irrelevant intermediate data stays out of the main window), parallelization (genuinely independent paths), and specialization (tool-set overload, conflicting personas, or deep domain expertise).
3–10× more tokens than a single agent — and it buys thoroughness, not speed (coordination often makes wall-clock slower).
Any two of: latency (often slower despite parallelism), reliability (multiple failure points), maintainability (multiple prompt sets to keep in sync), context coherence (fragmented at handoffs).
Role-based (problem-centric) decomposition — its failure metaphor is “the telephone game”: context loss at every handoff.

Why run more than one agent

A single agent (D1.1) is one model reasoning over one finite context window. That window is the bottleneck: everything the agent reads, every tool result, every intermediate thought accumulates in it, and a model attends less reliably as it fills. So the motivation for a second agent is not “more brains” — it is more windows. When a subtask would flood the main window with data the final answer doesn’t need, or when independent paths could be explored at once, splitting the work across separate context windows relieves the constraint a single loop cannot.

Building Effective Agents draws the line between a workflow and an agent at who directs the process: an agent is a “system where LLMs dynamically direct their own processes and tool usage.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024) A coordinator of subagents is one such system: an agent whose tool is “spawn another agent.”

The orchestrator and its workers

The canonical multi-agent shape is hub-and-spoke: a lead agent analyzes the task, plans a strategy, and spawns subagents that explore parts of it independently. [Official] How we built our multi-agent research system · Anthropic (2025) The lead synthesizes their results and decides whether more work is needed. It is the orchestrator, and the subagents are its workers.

This is a real architecture with measured stakes, not a toy. In Anthropic’s research system, “a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval.” [Official] How we built our multi-agent research system · Anthropic (2025) The number is real but specific to that model pairing and that eval, so read it as evidence the pattern can pay off, not as a portable benchmark.

Worked example: hub-and-spoke comparison of three regions' climate-disclosure rules

A lead agent receives: “Compare the climate-disclosure rules of the EU, the US, and Japan.” Decomposed hub-and-spoke:

Lead plans: the three regions are independent research paths with no shared state, so it spawns one subagent per region.
Subagents run in isolation: the EU subagent searches EU sources in its own context window; it never sees the US subagent’s intermediate pages, and neither pollutes the lead’s window.
Artifacts, not transcripts: each subagent writes its full findings to a file and returns a compact reference, so 2,000 tokens of raw sources per region don’t stream back through the lead. [Official] How we built our multi-agent research system · Anthropic (2025)
Lead synthesizes: it reads the three references and writes the comparison.

The win conditions stack here: context protection (raw sources stay out of the lead) and parallelization (three regions at once). Note what is not split — the final synthesis stays in one agent, because comparing the three regions needs all three in one window.
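The worked example above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `Subagent`, `lead_agent`, and the placeholder strings are hypothetical names, and `run` is a stub standing in for a real model loop. What it shows is the shape: one worker per region, each with a private context, and only compact summaries returned to the lead.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Subagent:
    """Hypothetical worker with its own private context (a stand-in for a context window)."""
    region: str
    context: list[str] = field(default_factory=list)

    def run(self, task: str) -> str:
        # Everything the worker reads stays in its own context list.
        self.context.append(f"task: {task}")
        self.context.append(f"raw sources for {self.region} (placeholder)")
        # Only a compact summary crosses back to the lead.
        return f"{self.region}: summary of disclosure rules (placeholder)"


def lead_agent(question: str, regions: list[str]) -> str:
    # 1. Plan: one independent research path per region.
    workers = [Subagent(region) for region in regions]
    # 2. Spawn: run the workers concurrently; they share no state.
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        summaries = list(pool.map(lambda worker: worker.run(question), workers))
    # 3. Synthesize: the comparison happens in one place, over all summaries.
    return "\n".join(summaries)


print(lead_agent("Compare climate-disclosure rules", ["EU", "US", "Japan"]))
```

The lead never touches a worker's `context` list; that separation is what the sketch is meant to make visible.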

Isolated context is the whole point

The property that defines the pattern is that each subagent runs in its own context window and does not see the parent’s state. [Official] How we built our multi-agent research system · Anthropic (2025) A subagent is given a task and returns a result; the intermediate tokens it generates never touch the coordinator’s window. That isolation is the feature: it keeps a subtask’s noise out of the agent that has to reason over the whole problem.

Because results must cross a context boundary, large outputs use the artifacts pattern: a subagent writes its full output to the filesystem or external storage and passes a lightweight reference back, rather than streaming everything through the coordinator. [Official] How we built our multi-agent research system · Anthropic (2025) The coordinator stays lean; the high-fidelity output lives outside its window until needed.
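A minimal sketch of the artifacts pattern, assuming a local temporary directory as the "external storage". The function names and the shape of the reference dictionary (`path`, `chars`, `preview`) are illustrative assumptions, not a documented format.

```python
import tempfile
from pathlib import Path


def subagent_write_artifact(findings: str, workdir: Path, name: str) -> dict:
    """Write full findings to disk and return only a lightweight reference."""
    path = workdir / f"{name}.md"
    path.write_text(findings, encoding="utf-8")
    # The reference carries a pointer and a short preview, not the full text.
    return {"path": str(path), "chars": len(findings), "preview": findings[:80]}


def coordinator_load(reference: dict) -> str:
    """Read the full artifact only when the coordinator actually needs it."""
    return Path(reference["path"]).read_text(encoding="utf-8")


workdir = Path(tempfile.mkdtemp())
full_output = "EU disclosure rules, detailed notes. " * 200
ref = subagent_write_artifact(full_output, workdir, "eu_findings")
print(ref["chars"], "characters stored;", len(ref["preview"]), "characters returned as preview")
```

The coordinator holds `ref` in its window and calls `coordinator_load` only if synthesis requires the full text.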

When the pattern earns its cost

The capability is bought with tokens. “In our data, agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats,” [Official] How we built our multi-agent research system · Anthropic (2025) and against a single agent on an equivalent task, “multi-agent implementations typically use 3-10x more tokens than single-agent approaches.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026) So the official guidance leads with restraint: “Start with the simplest approach that works, and add complexity only when evidence supports it.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026) Try improved prompting, context compaction, and the Tool Search Tool on one agent first.
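To make the quoted multiplier concrete, here is a small arithmetic sketch. The 50,000-token baseline is a made-up figure for illustration; only the 3 and 10 come from the quoted guidance, read here as plain multipliers on the single-agent total.

```python
def multi_agent_token_range(single_agent_tokens: int) -> tuple[int, int]:
    """Apply the quoted 3-10x multiplier to a single-agent token baseline."""
    low_multiplier, high_multiplier = 3, 10
    return single_agent_tokens * low_multiplier, single_agent_tokens * high_multiplier


baseline = 50_000  # hypothetical single-agent run, not a measured value
low, high = multi_agent_token_range(baseline)
print(f"single agent: {baseline:,} tokens; multi-agent: {low:,} to {high:,} tokens")
```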

Reach for coordinator–subagent only when one of three conditions holds: [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)

Win condition	The signal	What it buys
Context protection	A subtask generates large, mostly-irrelevant intermediate data (>1000 tokens) that would pollute the main agent’s reasoning	A clean main-agent window
Parallelization	Genuinely independent paths to explore concurrently	Thoroughness, not speed (coordination often makes wall-clock slower)
Specialization	Tool-set overload (avoid 20+ tools on one agent), conflicting personas, or deep domain expertise	Focused agents that outperform an overloaded generalist
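The table can be read as a gate. The sketch below encodes it as one, under assumptions: `TaskProfile` and `win_conditions` are hypothetical names, the 1,000-token and 20-tool thresholds are taken from the table, and treating "two or more independent paths" as the parallelization signal is my own simplification.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    intermediate_tokens: int  # size of mostly-irrelevant intermediate data
    independent_paths: int    # paths that share no state
    tool_count: int           # tools a single agent would have to carry
    conflicting_personas: bool = False
    needs_deep_domain_expertise: bool = False


def win_conditions(task: TaskProfile) -> list[str]:
    """Return the win conditions a task meets; an empty list means stay single-agent."""
    reasons = []
    if task.intermediate_tokens > 1000:
        reasons.append("context protection")
    if task.independent_paths >= 2:  # assumption: two or more paths counts
        reasons.append("parallelization")
    if task.tool_count >= 20 or task.conflicting_personas or task.needs_deep_domain_expertise:
        reasons.append("specialization")
    return reasons


print(win_conditions(TaskProfile(intermediate_tokens=200, independent_paths=1, tool_count=5)))
print(win_conditions(TaskProfile(intermediate_tokens=5000, independent_paths=3, tool_count=8)))
```

A real decision still weighs the cost side; meeting a condition makes the split eligible, not mandatory.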

Cost is only the first axis. The full trade-off — the one a scenario question makes you weigh — is worse for multi-agent on most rows, and the architect must be able to name them:

Dimension	Single agent	Multi-agent
Token usage	Baseline	3–10× higher
Latency	Fast, sequential	Often slower despite parallelism (coordination + slowest-subagent)
Reliability	One point of failure	Multiple failure points — more places an error can enter
Maintainability	One prompt set	Multiple prompt sets to keep in sync
Context coherence	Unified	Fragmented at handoffs

Multi-agent is not “more advanced and therefore better”; it trades cost, latency, reliability, and maintainability for capability on tasks that genuinely need separate windows. Most rows in the table above are downsides, which is why the exam frames the decision as restraint first.

Decompose by context, not by role

When you do split, how you cut the work is the most-tested judgment in this domain, and the most common way to get it wrong. The anti-pattern is role-based / problem-centric decomposition: planner → implementer → tester → reviewer. It feels organized but “creates constant coordination overhead and context loss at handoffs — the telephone game,” spending more tokens coordinating than executing. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)

The reliable alternative is context-centric decomposition: split only at true context boundaries, such as independent paths, components with clean interfaces, or blackbox verification.
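One way to make "context boundary" tangible is to treat subtasks as a dependency graph and look for groups that share no edges. This is an illustrative heuristic of my own, not a method from the cited guidance, and `independent_groups` is a hypothetical helper. It is deliberately crude: it cannot recognize a blackbox verifier, which depends on the output but not on the context that produced it.

```python
def independent_groups(tasks: list[str], depends_on: dict[str, set[str]]) -> list[set[str]]:
    """Group tasks linked by any dependency; each group is one candidate context."""
    # Build an undirected adjacency map: a dependency couples both tasks.
    neighbours = {task: set() for task in tasks}
    for task, deps in depends_on.items():
        for dep in deps:
            neighbours[task].add(dep)
            neighbours[dep].add(task)

    groups, seen = [], set()
    for start in tasks:
        if start in seen:
            continue
        # Depth-first walk collects everything coupled to this task.
        group, stack = set(), [start]
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbours[current] - group)
        seen |= group
        groups.append(group)
    return groups


pipeline = independent_groups(
    ["plan", "implement", "test", "review"],
    {"implement": {"plan"}, "test": {"implement"}, "review": {"test"}},
)
regions = independent_groups(["EU", "US", "Japan"], {})
print(len(pipeline), "group for the role pipeline;", len(regions), "groups for the regions")
```

The role pipeline collapses into a single group because every phase feeds the next, while the three regions fall into three groups that can be split safely.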

The verification subagent

One multi-agent shape “consistently succeeds across domains”: the verification subagent. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026) The main agent does the work; a separate agent blackbox-tests the result with clear success criteria and minimal context transfer. The isolation is the strength: the verifier has no stake in, and no memory of, how the work was produced.

Its failure mode is early victory: verifiers tend to declare success after one or two checks. The documented mitigation is an explicit instruction: “You MUST run the complete test suite before marking as passed.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)
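The prompt instruction can be paired with a harness-side guard. The sketch below is an assumption about how that might look, not part of the cited guidance: `verifier_prompt` and `verify` are hypothetical names, and the checks are toy lambdas. The guard has no early exit, so a "passed" verdict requires that every check ran and succeeded.

```python
from typing import Callable

VERIFIER_INSTRUCTION = "You MUST run the complete test suite before marking as passed."


def verifier_prompt(success_criteria: str) -> str:
    """Build a verifier prompt: success criteria plus the anti-early-victory instruction."""
    return f"{success_criteria}\n{VERIFIER_INSTRUCTION}"


def verify(checks: dict[str, Callable[[], bool]]) -> dict:
    """Run every check; report 'passed' only if all ran and all succeeded."""
    results = {}
    for name, check in checks.items():
        # No early exit: a failure is recorded and the remaining checks still run.
        results[name] = check()
    passed = len(results) == len(checks) and all(results.values())
    return {"passed": passed, "ran": len(results), "results": results}


report = verify({
    "returns_sorted": lambda: sorted([3, 1, 2]) == [1, 2, 3],
    "handles_empty": lambda: sorted([]) == [],
    "keeps_duplicates": lambda: sorted([2, 1, 2]) == [1, 2, 2],
})
print(report["passed"], report["ran"])
```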

Practice

Exercise solutions

Solution

Multi-agent is plausibly warranted — for specialization — but the proposed split is the role-based anti-pattern. The real signal is tool-set overload (40 tools on one agent; the guidance says avoid 20+ and prefer focused agents). So the justified cut is by tool/domain context — e.g. a CRM-and-orders agent vs a messaging-and-analytics agent — each carrying a focused tool set. The proposed intake → diagnosis → resolution → follow-up split is problem-centric: those are sequential phases of one tightly-coupled ticket, so they would lose fidelity at every handoff (the telephone game) and add coordination cost. Decompose by what context is independent (tool domains), not by what step comes next. And first confirm the Tool Search Tool alone can’t relieve the tool overload on a single agent.

Solution

The failure mode is context loss / information-fidelity degradation at handoffs, plus constant coordination overhead — the guidance calls it “the telephone game.” It happens because planner/implementer/tester/reviewer are sequential phases of one tightly-coupled task, not independent contexts. The rule: place a split only at a true context boundary — independent paths, clean-interface components, or blackbox verification — never by “what step comes next.”

Solution

Any two of the following. Reliability: single (one point of failure) → multi (multiple failure points). Maintainability: single (one prompt set) → multi (multiple prompt sets to keep in sync). Latency: single (fast, sequential) → multi (often slower despite parallelism). Context coherence: single (unified) → multi (fragmented at handoffs). “More scalable” is not free: most of the trade-off rows move the wrong way when you add agents.

Exam essentials

Why multi-agent at all: a single agent has one finite context window; extra agents buy more windows, not more intelligence.
Coordinator–subagent = hub-and-spoke: a lead decomposes, spawns subagents, and synthesizes; subagents run in isolated context windows and do not see parent state.
Isolation is the feature (context protection); large outputs use the artifacts pattern — write to storage, pass a reference back.
Cost is 3–10× tokens (and ~15× vs a chat). The 90.2% gain was Opus 4 lead + Sonnet 4 subagents vs single Opus 4 — not a portable number.
The full trade-off is mostly worse for multi-agent: higher tokens, often higher latency, multiple failure points, multiple prompt sets, fragmented coherence. Start single-agent; split only for context protection, parallelization, or specialization.
Decompose by context, not role. planner/implementer/tester/reviewer is the telephone-game anti-pattern; split at true context boundaries.
The verification subagent is the reliable variant — blackbox-test the result; mitigate early victory with “run the complete test suite before marking as passed.”
