Most real agent work is several steps, not one. The question this chapter answers is not what the steps are but who enforces the order — your application code, or the model itself — and how you keep a bad step from poisoning the next one. That choice is an Evaluate-level judgment: the exam gives you a workflow and asks where the control flow belongs, how the handoff is specified, and where the gate goes.
Two places to enforce a workflow
A multi-step workflow’s control flow lives in exactly one of two places. Either your code drives the sequence — run a step, take its output, decide the next call — or the model drives it, having been told the steps in a prompt. The Agent SDK frames the split directly: “With the Client SDK, you implement a tool loop. With the Agent SDK, Claude handles it.” [Official] Agent SDK overview · AnthropicT1-official original
These are not rival products — Anthropic notes the same workflow “translate[s] directly” between the CLI and the SDK. [Official] Agent SDK overview · AnthropicT1-official original The architect’s decision is which layer holds the control flow, and it turns on how much the workflow needs determinism versus flexibility.
Every step boundary is a handoff
Whichever layer enforces the steps, each transition between them is a handoff — the output of step N becomes the input of step N+1 — and a handoff is where information is lost. Chapter D1.2 named the worst case: dividing a tightly-coupled task by role (planner → implementer → tester → reviewer) “creates constant coordination overhead and context loss at handoffs — the telephone game,” spending more tokens coordinating than executing. [Official] Building multi-agent systems: When and how to use them · Anthropic (2026)T1-official original Prompt-based handoffs across a long sequential chain are the most fidelity-fragile arrangement: each step re-narrates the last, and detail erodes at every retelling.
The Writer/Reviewer handoff that works
Not every handoff leaks — the canonical multi-step quality workflow depends on one. In the Writer/Reviewer pattern, one session implements and a second reviews: “A fresh context improves code review since Claude won’t be biased toward code it just wrote.” [Official] Best practices for Claude Code · AnthropicT1-official original Session A writes the rate limiter; Session B reviews the file for edge cases, race conditions, and consistency; Session A then addresses the feedback. The same shape works for tests — “have one Claude write tests, then another write code to pass them.” [Official] Best practices for Claude Code · AnthropicT1-official original
The handoff contract and its artifact
When work does cross a boundary, what crosses must be specified, not assumed. Anthropic’s research system makes each handoff an explicit contract: “Each subagent needs an objective, an output format, guidance on the tools and sources to use, and clear task boundaries.” [Official] How we built our multi-agent research system · Anthropic (2025)T1-official original That is the same discipline as D1.3’s rule that everything the subagent needs goes in the prompt — applied to every step of a multi-step flow.
The most robust way to carry that contract across a boundary is as a written artifact — a file the next step reads, not prose it re-narrates. Two concrete forms appear in the best-practices guidance:
- A spec file. After an interview/planning phase, “start a fresh session to execute it … and you have a written spec to reference.” [Official] Best practices for Claude Code · AnthropicT1-official original The spec, not the conversation, is what crosses to the implementation step.
- A test file. In the test/code split, the tests are the contract: one Claude writes them, another writes code to pass them. The implementer’s target is the file, not a description of it.
The validation gate: reject before propagating
A precise contract is also what lets a programmatic pipeline put a gate between steps — a check the step-N output must pass before step N+1 is allowed to consume it. The gate does two kinds of check, and the distinction is the one Domain 4 builds on (D4.4):
On failure, the gate does not pass the bad output downstream — it rejects and retries: re-prompt the failing step with the specific errors, and only advance when the output passes. That is the difference between a programmatic pipeline and a prompt-based one: the gate is enforced in your code, where a malformed step cannot quietly become the next step’s input.
Choosing where the control flow lives
The Evaluate-level call: enforce programmatically when the workflow needs determinism, an audit trail, validation gates between steps, or a fixed and repeatable sequence — the steps are known in advance and you want them to run the same way every time. Stay prompt-based when the path is flexible, the model can sensibly self-direct, and the orchestration code would cost more than it saves.
This pairs with two neighboring decisions: whether to split into multiple agents at all (D1.2) and whether the decomposition is a fixed pipeline or an adaptive one (D1.6). Enforcement locus is how the steps are driven; those chapters cover whether and into what shape. The mechanics of the validation/retry loop itself — schema vs semantic errors, bounded retries — are developed in D4.4.
Practice
Exercise solutions
Choose (b), but split at one boundary only. The fact-check is failing for the exact reason the Writer/Reviewer pattern addresses — a context biased toward the draft it just produced rationalizes its own claims. Hand the fact-check to a fresh context (a second session or a verification subagent), passing an explicit handoff contract: the draft, the claims to verify, the success criteria, the output format. That is a programmatic handoff — your code routes the draft to the reviewer and the verdict back. Keep research → draft coupled in one context: they are tightly coupled and share state, so a handoff there would only leak fidelity. And do not split all four steps into role-agents — that is the telephone-game pipeline, four lossy handoffs where you needed one. The skill is placing the single split where fresh context buys independence.
Programmatic enforcement puts the control flow in your code — you sequence the steps, pass each output to the next, and can gate between them; choose it when you need determinism, an audit trail, validation gates, or a fixed repeatable sequence. Prompt-based enforcement puts the control flow in the model — it is told the procedure and self-directs; choose it when the path is flexible, the model can sensibly adapt, and orchestration code would cost more than it saves.
The schema is doing a structural check — right shape, fields present, types valid — and is missing the semantic check: whether the content is actually correct (valid JSON can still carry fabricated or contradictory data). The gate should not pass a failing output along; it should reject and retry — re-prompt the failing step with the specific error and only advance once the output passes both the structural and semantic checks.
Exam essentials
- Two enforcement loci: a multi-step workflow’s control flow lives in your code (programmatic — you sequence steps and gate between them; deterministic/auditable) or the model (prompt-based — told the steps, self-directs; flexible).
- Every step boundary is a handoff, and handoffs leak. A sequential chain of role-agents is the most fidelity-fragile arrangement — the telephone game.
- Writer/Reviewer is the handoff that works because the reviewer has fresh context — it can’t defend code it never wrote. Don’t let an author review its own work.
- Carry the contract in a written artifact — a spec file or a test file the next step reads from disk — so the handoff doesn’t depend on re-narration.
- A programmatic validation gate runs a structural check and a semantic check between steps, and on failure rejects and retries rather than propagating the bad output. (The loop in depth: D4.4.)
- Choose programmatic for determinism / audit / gates / fixed sequence; prompt-based for flexible self-direction. (Whether to split = D1.2; pipeline vs adaptive = D1.6.)