A single agent fails visibly; a pipeline of agents fails by degrees. An error introduced at one stage rarely announces itself: it rides the handoff to the next stage as if it were sound input, and concurrent faults aggregate into a degradation that looks like nothing in particular. This chapter is about that propagation and the architecture that contains it. The pattern is durable; the percentages that illustrate it are evidence, not constants, so treat this as an architectural pattern rather than a fixed set of numbers.
Failures compound across agent boundaries
Multi-agent systems fail far more often than their individual agents do. One practitioner analysis, drawing on the MAST taxonomy of 1,600-plus execution traces, reports that “multi-agent LLM systems fail at rates between 41-86.7% in production because specification ambiguity and unstructured coordination protocols cause agents to misinterpret roles, duplicate work, and skip verification.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment Code (T3-practitioner, original) That taxonomy sorts the breakdowns into three categories: specification problems (41.77%), coordination failures (36.94%), and verification gaps (21.30%). [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment Code (T3-practitioner, original)
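A back-of-the-envelope way to see why a pipeline fails more often than its parts: if each handoff succeeds independently with some probability, the chain succeeds only when every handoff does, so the per-stage rates multiply. The independence assumption, the function name, and the 95% figure below are illustrative and do not come from the cited analysis.

```python
from math import prod

def chain_reliability(stage_success_rates):
    """Probability that every stage succeeds, assuming independent stages."""
    return prod(stage_success_rates)

# Hypothetical numbers: five handoffs, each 95% reliable on its own.
rates = [0.95] * 5
print(f"{chain_reliability(rates):.3f}")  # 0.774
```

Under these assumed numbers, five individually reliable stages still lose more than one run in five end to end.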
An upstream ambiguity becomes a downstream wrong decision
The propagation mechanism is specific. “Agents cannot read between lines, infer context, or ask clarifying questions during execution. Every ambiguity becomes a decision point where agents explore all possible interpretations, selecting suboptimal ones.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment Code (T3-practitioner, original) This is the multi-agent counterpart to D5.2’s escalation problem: an interactive agent can pause and ask, but a mid-pipeline agent usually has no one to ask, so it resolves the ambiguity itself, and a quietly wrong resolution is handed downstream as if it were settled fact. The next agent has no signal that the input it received was a guess.
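One way to give the next agent that missing signal is to carry the upstream guesses in the handoff itself. The sketch below is a minimal illustration, assuming a hypothetical `Handoff` shape and `next_stage` function; neither comes from the cited sources.

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """Payload passed between stages (hypothetical shape)."""
    payload: str
    # Interpretations the producing agent chose without confirmation.
    assumptions: list[str] = field(default_factory=list)

    @property
    def is_guess(self) -> bool:
        return bool(self.assumptions)

def next_stage(handoff: Handoff) -> str:
    """Downstream stage that refuses to treat a guess as settled fact."""
    if handoff.is_guess:
        return "needs_review: " + "; ".join(handoff.assumptions)
    return "accepted"

h = Handoff("export users as CSV", assumptions=["'users' means active users only"])
print(next_stage(h))
```

The design choice is that an unconfirmed interpretation travels with the data, so the receiving stage can route it for review instead of building on it.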
Compounding failures evade per-component testing
When multiple faults run at once, their aggregate is not the sum of their symptoms, and that is what makes them hard to catch. Anthropic’s April 23 postmortem describes three production bugs with distinct, partly overlapping windows: a reasoning-effort default change (Mar 4–Apr 7), a caching bug that broke thinking blocks (Mar 26–Apr 10), and a verbosity-reduction prompt (Apr 16–Apr 20). [Official] An update on recent Claude Code quality reports · Anthropic (T1-official, original) Those windows union to a roughly seven-week span (Mar 4–Apr 20), but that is the aggregate reach, not a stretch in which all three ran at once: the first two overlapped, while the third began only after both had been fixed. Even so, the combined effect “looked like broad, inconsistent degradation” that no single bug’s symptom resembled, and the most stubborn one, the caching bug, crossed context management, the API layer, and the extended-thinking system. [Official] An update on recent Claude Code quality reports · Anthropic (T1-official, original) Detection was the central lesson: the bugs hit different traffic slices, and neither internal usage nor the existing eval suite reproduced them. [Official] An update on recent Claude Code quality reports · Anthropic (T1-official, original)
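The window arithmetic above can be checked directly. This is a small sketch: the `YEAR` constant is a placeholder because the text gives no year (the day counts are the same for any year, since no window crosses February), and `overlap_days` is a hypothetical helper.

```python
from datetime import date

YEAR = 2025  # placeholder only; the results do not depend on it

windows = {
    "reasoning-effort default": (date(YEAR, 3, 4), date(YEAR, 4, 7)),
    "caching bug": (date(YEAR, 3, 26), date(YEAR, 4, 10)),
    "verbosity prompt": (date(YEAR, 4, 16), date(YEAR, 4, 20)),
}

def overlap_days(a, b):
    """Days shared by two (start, end) windows; 0 if they do not overlap."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return max((end - start).days, 0)

span_start = min(s for s, _ in windows.values())
span_end = max(e for _, e in windows.values())
print((span_end - span_start).days)  # 47 days, a little under seven weeks

first, second, third = windows.values()
print(overlap_days(first, second))  # 12
print(overlap_days(second, third))  # 0
```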
Defenses: structured context, independent validation, circuit breakers
The countermeasures all push against under-specified, unchecked boundaries. The practitioner remedies are to convert prose specs into machine-validatable schemas, to enforce typed and schema-validated messages between agents (with MCP named as the schema-enforced substrate), to deploy isolated judge agents for independent validation, and to add circuit breakers that isolate a misbehaving agent before it cascades. [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment Code (T3-practitioner, original) The payoff is concrete: “PwC achieved a 7x accuracy improvement (10% to 70%) through structured validation loops.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment Code (T3-practitioner, original)
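Two of these remedies, schema-validated messages and a circuit breaker, can be sketched together at a single boundary. This is a minimal illustration in plain Python, not an MCP implementation; the message fields, the allowed actions, and the failure threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskMessage:
    """Typed inter-agent message (hypothetical fields)."""
    task_id: str
    action: str
    target: str

ALLOWED_ACTIONS = {"create", "update", "delete"}

def validate(raw: dict) -> TaskMessage:
    """Reject a message that does not match the schema instead of guessing."""
    missing = {"task_id", "action", "target"} - set(raw)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if raw["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {raw['action']!r}")
    return TaskMessage(str(raw["task_id"]), raw["action"], str(raw["target"]))

class CircuitBreaker:
    """Opens after `threshold` consecutive invalid messages from one agent."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def receive(self, raw: dict) -> TaskMessage:
        if self.open:
            raise RuntimeError("circuit open: agent isolated")
        try:
            msg = validate(raw)
        except ValueError:
            self.failures += 1
            raise
        self.failures = 0  # a valid message resets the count
        return msg
```

Once the breaker is open, even a well-formed message from that agent is refused until something outside this sketch (an operator or a reset policy) closes it again, which is what stops a misbehaving stage from cascading.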
The single-agent equivalent is the postmortem’s own remedy: broad per-model evaluations on every prompt change, ablation testing, and soak periods before rollout. [Official] An update on recent Claude Code quality reports · Anthropic (T1-official, original) It is the same instinct as independent validation, applied to a pipeline of one.
Practice
Exercise solutions
B. The failure is a boundary failure: an ambiguous handoff that no stage is positioned to catch. Fixing the boundary — a typed, schema-validated spec so there is less to misinterpret, plus an independent validation step that checks the coder’s output against that spec — intercepts the wrong interpretation before it reaches the reviewer. A makes one agent smarter but leaves the ambiguous interface intact; a better coder still has to guess at an under-specified spec. C asks the reviewer to work harder while still reading only the code, blind to the spec it was meant to satisfy. D doubles cost and gives two artifacts to compare with no oracle for which is right — a wrong-but-consistent interpretation reproduces on the retry.
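The independent validation step in answer B can be sketched as a check that reads both the spec and the coder’s output, which is exactly what a reviewer reading only the code cannot do. The `FunctionSpec` format and the example function are hypothetical; the check uses Python’s standard `ast` module.

```python
import ast
from dataclasses import dataclass

@dataclass(frozen=True)
class FunctionSpec:
    """Machine-checkable slice of a spec (hypothetical format)."""
    name: str
    params: tuple[str, ...]

def check_against_spec(source: str, spec: FunctionSpec) -> list[str]:
    """Return the spec violations found in the coder's output."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == spec.name:
            found = tuple(a.arg for a in node.args.args)
            if found != spec.params:
                return [f"{spec.name}: expected params {spec.params}, got {found}"]
            return []
    return [f"{spec.name}: function not defined"]

spec = FunctionSpec("export_users", ("fmt", "active_only"))
coder_output = "def export_users(fmt):\n    return []\n"
print(check_against_spec(coder_output, spec))
```

Here the coder silently dropped a parameter; the code is valid on its own, and only a check against the spec reports the mismatch.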
The three MAST categories are specification problems (41.77%), coordination failures (36.94%), and verification gaps (21.30%); specification problems are the largest. Specification problems propagate especially badly in a pipeline because a mid-pipeline agent cannot pause to ask — an under-specified handoff becomes a guess the agent resolves and passes downstream as settled fact, so a single ambiguity at the top seeds a wrong decision that every later stage treats as valid input.
A compounding cross-system failure can pass every per-component evaluation because it lives between components, not inside any one: each part passes its own eval in isolation, and the degradation emerges only from their interaction, on traffic slices no single test exercises. In the April 23 case, three bugs with partly overlapping windows (the union running Mar 4–Apr 20) produced “broad, inconsistent degradation” that no single bug’s symptom resembled, and neither internal usage nor the existing eval suite reproduced it. The postmortem concluded that integration-level testing was needed instead: broad per-model evaluations on every prompt change, ablation testing, and soak periods before rollout, which test the system as it actually runs rather than each unit alone.
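A toy illustration of a failure that lives between components: both stages below pass their own checks, yet the composed pipeline breaks on a slice of input that neither check exercises. The stages, the field format, and the 40-character limit are invented for the example and are unrelated to the bugs in the postmortem.

```python
def truncate(text: str, limit: int = 40) -> str:
    """Stage 1: keep a message under the downstream size limit."""
    return text[:limit]

def parse_total(text: str) -> int:
    """Stage 2: read the trailing 'total=<n>' field."""
    return int(text.rsplit("total=", 1)[1])

# Per-component checks: each stage passes in isolation.
assert truncate("x" * 100) == "x" * 40
assert parse_total("items=3 total=120") == 120

def pipeline(text: str) -> int:
    return parse_total(truncate(text))

# Integration check on a long input: truncation cuts the field stage 2 needs.
long_input = "note=" + "y" * 50 + " total=120"
try:
    pipeline(long_input)
    integration_ok = True
except (IndexError, ValueError):
    integration_ok = False
print(integration_ok)  # False
```

Short inputs still pass through the pipeline correctly, which is why a fault like this only surfaces when the full system is tested on the traffic slice that triggers it.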
Exam essentials
- Failures compound across boundaries — multi-agent systems fail at much higher rates than their parts (the MAST taxonomy: specification 41.77%, coordination 36.94%, verification 21.30%); a chain’s reliability is the product of its handoffs, not its best agent.
- Ambiguity propagates silently — a mid-pipeline agent cannot pause to ask (unlike D5.2’s interactive escalation), so it resolves an ambiguity and passes the guess downstream as settled fact.
- Compounding failures evade unit tests — they live between components, on traffic slices no single test exercises; all-green component evals do not prove a multi-stage system healthy.
- Defenses — structured error context across boundaries (D2.2), independent validation / isolated judges (D4.6), circuit breakers to stop cascades, and keeping the escalable decision at the coordinator (D5.2); the single-agent analog is broad evals + ablation + soak periods.