Once the agent loop of D1.1 can run tools, the next architectural question is whether to run more than one agent. This chapter develops the canonical multi-agent shape — a coordinator that spawns isolated subagents — and, just as importantly, the discipline of not reaching for it. The exam tests judgment here: when the pattern wins, what it costs on every axis, how to cut the work, and which single variant is reliable.

Why run more than one agent

A single agent (D1.1) is one model reasoning over one finite context window. That window is the bottleneck: everything the agent reads, every tool result, every intermediate thought accumulates in it, and a model attends less reliably as it fills. So the motivation for a second agent is not “more brains” — it is more windows. When a subtask would flood the main window with data the final answer doesn’t need, or when independent paths could be explored at once, splitting the work across separate context windows relieves the constraint a single loop cannot.

That is the line Building Effective Agents draws between a workflow and an agent: an agent is a “system where LLMs dynamically direct their own processes and tool usage.” [Official] Building effective agents · Erik Schluntz and Barry Zhang (2024)T1-official original A coordinator of subagents is one such system — an agent whose tool is “spawn another agent.”

The orchestrator and its workers

The canonical multi-agent shape is hub-and-spoke: a lead agent analyzes the task, plans a strategy, and spawns subagents that explore parts of it independently. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The lead synthesizes their results and decides whether more work is needed — it is an orchestrator, and the subagents are workers.

This is a real architecture with measured stakes, not a toy. In Anthropic’s research system, “a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The number is real but specific to that model pairing and that eval — read it as evidence the pattern can pay off, not as a portable benchmark.

Isolated context is the whole point

The property that defines the pattern is that each subagent runs in its own context window and does not see the parent’s state. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original A subagent is given a task and returns a result; the intermediate tokens it generates never touch the coordinator’s window. That isolation is the feature: it keeps a subtask’s noise out of the agent that has to reason over the whole problem.

Because results must cross a context boundary, large outputs use the artifacts pattern — a subagent writes its full output to the filesystem or external storage and passes a lightweight reference back, rather than streaming everything through the coordinator. [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original The coordinator stays lean; the high-fidelity output lives outside its window until needed.

When the pattern earns its cost

The capability is bought with tokens. “In our data, agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats,” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original and against a single agent on an equivalent task, “multi-agent implementations typically use 3-10x more tokens than single-agent approaches.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original So the official guidance leads with restraint: “Start with the simplest approach that works, and add complexity only when evidence supports it.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original Try improved prompting, context compaction, and the Tool Search Tool on one agent first.

Reach for coordinator–subagent only when one of three conditions holds: [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original

Win conditionThe signalWhat it buys
Context protectionA subtask generates large, mostly-irrelevant intermediate data (>1000 tokens) that would pollute the main agent’s reasoningA clean main-agent window
ParallelizationGenuinely independent paths to explore concurrentlyThoroughness, not speed (coordination often makes wall-clock slower)
SpecializationTool-set overload (avoid 20+ tools on one agent), conflicting personas, or deep domain expertiseFocused agents that outperform an overloaded generalist

Cost is only the first axis. The full trade-off — the one a scenario question makes you weigh — is worse for multi-agent on most rows, and the architect must be able to name them:

Multi-agent is not “more advanced and therefore better”; it trades cost, latency, reliability, and maintainability for capability on tasks that genuinely need separate windows. Three of those five rows are downsides — which is why the exam frames the decision as restraint first.

Decompose by context, not by role

When you do split, how you cut the work is the most-tested judgment in this domain — and the most common way to get it wrong. The anti-pattern is role-based / problem-centric decomposition: planner → implementer → tester → reviewer. It feels organized but “creates constant coordination overhead and context loss at handoffs — the telephone game,” spending more tokens coordinating than executing. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original

The reliable alternative is context-centric decomposition: split only at true context boundaries.

The verification subagent

One multi-agent shape “consistently succeeds across domains”: the verification subagent. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original The main agent does the work; a separate agent blackbox-tests the result with clear success criteria and minimal context transfer. The isolation is the strength — the verifier has no stake in, and no memory of, how the work was produced.

Its failure mode is early victory: verifiers tend to declare success after one or two checks. The documented mitigation is an explicit instruction — “You MUST run the complete test suite before marking as passed.” [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original

Practice

Exercise solutions

Solution ↑ Exercise

Multi-agent is plausibly warranted — for specialization — but the proposed split is the role-based anti-pattern. The real signal is tool-set overload (40 tools on one agent; the guidance says avoid 20+ and prefer focused agents). So the justified cut is by tool/domain context — e.g. a CRM-and-orders agent vs a messaging-and-analytics agent — each carrying a focused tool set. The proposed intake → diagnosis → resolution → follow-up split is problem-centric: those are sequential phases of one tightly-coupled ticket, so they would lose fidelity at every handoff (the telephone game) and add coordination cost. Decompose by what context is independent (tool domains), not by what step comes next. And first confirm the Tool Search Tool alone can’t relieve the tool overload on a single agent.

Solution ↑ Exercise

The failure mode is context loss / information-fidelity degradation at handoffs, plus constant coordination overhead — the guidance calls it “the telephone game.” It happens because planner/implementer/tester/reviewer are sequential phases of one tightly-coupled task, not independent contexts. The rule: place a split only at a true context boundary — independent paths, clean-interface components, or blackbox verification — never by “what step comes next.”

Solution ↑ Exercise

Any two of: Reliability — single (one point of failure) → multi (multiple failure points). Maintainability — single (one prompt set) → multi (multiple prompt sets to keep in sync). Latency — single (fast sequential) → multi (often slower despite parallelism). Context coherence — single (unified) → multi (fragmented at handoffs). “More scalable” is not free: three of the five trade-off rows move the wrong way when you add agents.

Exam essentials

Further reading

The environment angle on isolation — how bounding what each agent loads is the same discipline that makes a large codebase legible — is developed in the Agentic Systems Design book, Chapter 7, Environments at Scale. Optional depth; this chapter stands on its own.