A single agent fails visibly; a pipeline of agents fails by degrees. An error introduced at one stage rarely announces itself: it rides the handoff to the next stage as if it were sound input, and concurrent faults aggregate into a degradation that looks like nothing in particular. This chapter is about that propagation and the architecture that contains it. The pattern is durable; the percentages that illustrate it are evidence for the pattern, not the pattern itself, so read this chapter as an architectural pattern with the figures as supporting evidence.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Why is a pipeline of agents less reliable than its single most reliable agent?
Why does a mid-pipeline agent turn an upstream ambiguity into a silent wrong decision, where an interactive one would not?
The MAST taxonomy sorts multi-agent failures into which three categories — and which is the largest?
Overlapping production bugs degraded quality for weeks, yet every component eval stayed green. Why couldn’t the unit tests see it?
Name two boundary defenses that contain error propagation across agents.

Check your answers

Because a chain’s reliability is the product of its handoffs, not its best agent’s — each boundary is both a place an error can enter and a place an existing error passes through unexamined.
A mid-pipeline agent has no one to ask, so where an interactive agent would pause and clarify, it resolves the ambiguity itself and hands the guess downstream as settled fact.
Specification problems (41.77%) — the largest — then coordination failures (36.94%) and verification gaps (21.30%).
The degradation lived between components, not inside any one — each part passed its own eval, and the combined effect appeared only in the interaction, on traffic slices no single test exercises.
Any two of: structured error context across boundaries, independent validation by an isolated judge, and circuit breakers that isolate a misbehaving agent before it cascades.

Failures compound across agent boundaries

Multi-agent systems fail far more often than their individual agents do. One practitioner analysis, drawing on the MAST taxonomy of 1,600-plus execution traces, reports that “multi-agent LLM systems fail at rates between 41-86.7% in production because specification ambiguity and unstructured coordination protocols cause agents to misinterpret roles, duplicate work, and skip verification.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment Code That taxonomy sorts the breakdowns into three categories: specification problems (41.77%), coordination failures (36.94%), and verification gaps (21.30%). [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment Code
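The claim that a chain's reliability is the product of its handoffs can be made concrete with a little arithmetic. The sketch below assumes each stage succeeds independently with a fixed probability, which real pipelines only approximate. The per-stage rates are illustrative numbers, not measurements from MAST or any cited source, and `chain_reliability` is a hypothetical helper.

```python
from math import prod


def chain_reliability(stage_success_rates):
    """Probability that every stage succeeds, assuming independent stages."""
    return prod(stage_success_rates)


# Illustrative rates only (an assumption, not data from the cited sources).
rates = [0.95, 0.90, 0.92, 0.97]
end_to_end = chain_reliability(rates)

print(f"best single stage: {max(rates):.2f}")
print(f"weakest stage:     {min(rates):.2f}")
print(f"whole chain:       {end_to_end:.3f}")
```

Under these assumptions the chain comes out less reliable than even its weakest stage, because every factor in the product is below 1.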

An upstream ambiguity becomes a downstream wrong decision

The propagation mechanism is specific. “Agents cannot read between lines, infer context, or ask clarifying questions during execution. Every ambiguity becomes a decision point where agents explore all possible interpretations, selecting suboptimal ones.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment Code This is the multi-agent counterpart to D5.2’s escalation problem: an interactive agent can pause and ask, but a mid-pipeline agent usually has no one to ask, so it resolves the ambiguity itself, and a quietly wrong resolution is handed downstream as if it were settled fact. The next agent has no signal that the input it received was a guess.
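One way to give the next agent that missing signal is to make guesses part of the handoff itself. The following is a minimal sketch, not a method from the cited sources: the `Handoff` shape, the two toy stages, and the date-format scenario are all hypothetical, chosen only to show a guess travelling with its label.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Output of one stage plus the guesses made to produce it (hypothetical shape)."""
    payload: dict
    assumptions: list[str] = field(default_factory=list)

    @property
    def is_settled(self) -> bool:
        # Settled means the stage recorded no guesses.
        return not self.assumptions


def plan_stage(request: str) -> Handoff:
    """Toy upstream stage: records a guess when the request names no format."""
    handoff = Handoff(payload={"task": request, "date_format": "ISO-8601"})
    if "format" not in request.lower():
        handoff.assumptions.append("date_format was not specified; guessed ISO-8601")
    return handoff


def code_stage(handoff: Handoff) -> str:
    """Toy downstream stage: refuses to treat a labelled guess as settled fact."""
    if not handoff.is_settled:
        return "ESCALATE: " + "; ".join(handoff.assumptions)
    return f"implementing {handoff.payload['task']}"


result = code_stage(plan_stage("parse the order dates"))
print(result)
```

The design choice here is that the upstream stage still resolves the ambiguity so the pipeline can proceed, but the resolution arrives marked as a guess, which lets a downstream stage or coordinator decide whether to escalate.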

Compounding failures evade per-component testing

When multiple faults run at once, their aggregate is not the sum of their symptoms, and that is what makes them hard to catch. Anthropic’s April-23 postmortem describes three production bugs with distinct, partly overlapping windows: a reasoning-effort default change (Mar 4–Apr 7), a caching bug that broke thinking blocks (Mar 26–Apr 10), and a verbosity-reduction prompt (Apr 16–Apr 20). [Official] An update on recent Claude Code quality reports · Anthropic The union of those windows is a roughly seven-week span (Mar 4–Apr 20), but that is the aggregate reach, not a stretch in which all three ran at once: the first two overlapped, while the third began only after both had been fixed. Even so, the combined effect “looked like broad, inconsistent degradation” that no single bug’s symptom resembled, and the most stubborn one, the caching bug, crossed context management, the API layer, and the extended-thinking system. [Official] An update on recent Claude Code quality reports · Anthropic Detection was the central lesson: the bugs hit different traffic slices, and neither internal usage nor the existing eval suite reproduced them. [Official] An update on recent Claude Code quality reports · Anthropic

Why three partly overlapping bugs passed every unit test Worked example

The incident described in the April-23 postmortem is the canonical compounding failure. Three independent bugs, three windows:

Bug	Window	Layer(s) it touched
Reasoning-effort default change	Mar 4 – Apr 7	inference defaults
Caching bug (broke thinking blocks)	Mar 26 – Apr 10	context mgmt × API × extended thinking
Verbosity-reduction prompt	Apr 16 – Apr 20	system prompt

Read the dates carefully: the first two overlapped (Mar 26 – Apr 7), but the third started after both were already fixed. So “seven weeks” is the union of the windows (Mar 4 – Apr 20), not a span of three-way concurrency — a distinction worth getting right, because it changes what a responder is actually hunting for (one persistent fault versus a shifting set).
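The union-versus-concurrency distinction can be checked mechanically. The sketch below counts, day by day, how many of the three windows are live. Two assumptions are built in: the source gives only month and day, so `YEAR` is an arbitrary placeholder (no window crosses February, so it does not change the spans), and each end date is treated as inclusive.

```python
from datetime import date, timedelta

YEAR = 2025  # placeholder (assumption): the source gives only month and day

# (start, end) per bug; end dates are treated as inclusive (assumption).
windows = {
    "reasoning-effort default change": (date(YEAR, 3, 4), date(YEAR, 4, 7)),
    "caching bug": (date(YEAR, 3, 26), date(YEAR, 4, 10)),
    "verbosity-reduction prompt": (date(YEAR, 4, 16), date(YEAR, 4, 20)),
}

first = min(start for start, _ in windows.values())
last = max(end for _, end in windows.values())


def active_on(day):
    """Names of the bugs whose window contains the given day."""
    return [name for name, (start, end) in windows.items() if start <= day <= end]


days = [first + timedelta(days=i) for i in range((last - first).days + 1)]
concurrency = [len(active_on(day)) for day in days]

union_days = len(days)                 # length of the union span
peak = max(concurrency)                # most bugs live on any single day
two_at_once = concurrency.count(2)     # days with exactly two bugs live
quiet_days = concurrency.count(0)      # days inside the span with none live

print(f"union span: {union_days} days (~{union_days / 7:.1f} weeks)")
print(f"peak concurrency: {peak}")
print(f"days with two bugs live: {two_at_once}")
print(f"days inside the span with no bug live: {quiet_days}")
```

With these inputs the peak concurrency is two, never three, and the span contains a gap between the second fix and the third rollout, which is the "shifting set" a responder would actually be facing.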

Now the reason it hid: each bug, tested in isolation, passes. The reasoning-effort change is a valid config; the cache works on most paths; the verbosity prompt is well-formed. The degradation lived in the interaction and in which traffic slices each bug touched — so neither internal usage nor the existing eval suite reproduced it, and “broad, inconsistent degradation” was all the aggregate looked like. The postmortem’s remedy is integration-level: per-model evaluations on every prompt change, ablation testing, and soak periods before rollout — exercising the system as it actually runs, because the failure was never inside any one unit.
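A toy ablation run shows the shape of that remedy. This is a sketch under invented assumptions, not a reconstruction of the incident or of Anthropic's tooling: the three changes, the two traffic slices, and every penalty in `end_to_end_score` are hypothetical, chosen so that each change passes its own check while the combined system degrades on one slice.

```python
CHANGES = ("change_a", "change_b", "change_c")   # hypothetical rollouts
SLICES = ("short_sessions", "long_sessions")     # hypothetical traffic slices


def component_check(change: str) -> bool:
    """Stand-in for a per-component eval: each change is valid on its own terms."""
    return change in CHANGES


def end_to_end_score(enabled: frozenset, traffic_slice: str) -> float:
    """Toy quality score; every penalty below is invented for illustration."""
    score = 1.0
    long_running = traffic_slice == "long_sessions"
    if long_running and "change_a" in enabled:
        score -= 0.10
    if long_running and "change_b" in enabled:
        score -= 0.15
    if long_running and {"change_a", "change_b"} <= enabled:
        score -= 0.20  # extra loss that appears only when both are live
    if "change_c" in enabled:
        score -= 0.05
    return round(score, 2)


def ablate(changes, slices):
    """Per slice: how much the score recovers when one change is switched off."""
    everything = frozenset(changes)
    report = {}
    for traffic_slice in slices:
        baseline = end_to_end_score(everything, traffic_slice)
        report[traffic_slice] = {
            change: round(end_to_end_score(everything - {change}, traffic_slice) - baseline, 2)
            for change in changes
        }
    return report


all_green = all(component_check(change) for change in CHANGES)
report = ablate(CHANGES, SLICES)
print("component evals all pass:", all_green)
for traffic_slice, recoveries in report.items():
    print(traffic_slice, recoveries)
```

In this toy, the component checks are all green, the short-session slice barely moves, and only the per-slice end-to-end comparison attributes the long-session loss to the first two changes and their interaction.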

Defenses: structured context, independent validation, circuit breakers

The countermeasures all push against under-specified, unchecked boundaries. The practitioner remedies are to convert prose specs into machine-validatable schemas, to enforce typed and schema-validated messages between agents (with MCP named as the schema-enforced substrate), to deploy isolated judge agents for independent validation, and to add circuit breakers that isolate a misbehaving agent before it cascades. [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment Code The payoff is concrete: “PwC achieved a 7x accuracy improvement (10% to 70%) through structured validation loops.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment Code
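Two of these defenses, schema-validated messages and a circuit breaker, can be sketched together at a single boundary. The example uses only the Python standard library and does not use MCP or any real agent framework. The field names in `REQUIRED_FIELDS`, the `forward` gate, and the failure threshold are hypothetical, and the breaker has no reset or half-open state, which a production breaker would likely need.

```python
from dataclasses import dataclass

# Hypothetical message schema: field name -> required Python type.
REQUIRED_FIELDS = {"task_id": str, "spec": str, "acceptance_criteria": list}


def validate_message(message: dict) -> list[str]:
    """Return schema violations; an empty list means the message is valid."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in message:
            errors.append(f"missing field: {name}")
        elif not isinstance(message[name], expected_type):
            errors.append(f"{name} should be {expected_type.__name__}")
    return errors


@dataclass
class CircuitBreaker:
    """Opens after `threshold` consecutive invalid messages from one agent."""
    threshold: int = 3
    consecutive_failures: int = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.threshold

    def record(self, ok: bool) -> None:
        # A valid message resets the count; an invalid one extends the streak.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1


def forward(message: dict, breaker: CircuitBreaker) -> str:
    """Gate one handoff: block it if the sender is isolated or the message is invalid."""
    if breaker.is_open:
        return "isolated"
    errors = validate_message(message)
    breaker.record(ok=not errors)
    if errors:
        return "rejected: " + ", ".join(errors)
    return "forwarded"


breaker = CircuitBreaker(threshold=2)
good = {"task_id": "T-1", "spec": "parse dates", "acceptance_criteria": ["ISO-8601 output"]}
bad = {"task_id": "T-2", "spec": 42}

outcomes = [forward(message, breaker) for message in (good, bad, bad, good)]
print(outcomes)
```

In this run the valid message is forwarded, the two malformed ones are rejected with the reasons attached, and the breaker then isolates the sender, so even a later well-formed message is held back until someone intervenes.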

The single-agent equivalent is the postmortem’s own remedy: broad per-model evaluations on every prompt change, ablation testing, and soak periods before rollout. [Official] An update on recent Claude Code quality reports · Anthropic It is the same instinct as independent validation, applied to a pipeline of one.

Practice

Exercise solutions

Solution

B. The failure is a boundary failure: an ambiguous handoff that no stage is positioned to catch. Fixing the boundary — a typed, schema-validated spec so there is less to misinterpret, plus an independent validation step that checks the coder’s output against that spec — intercepts the wrong interpretation before it reaches the reviewer. A makes one agent smarter but leaves the ambiguous interface intact; a better coder still has to guess at an under-specified spec. C asks the reviewer to work harder while still reading only the code, blind to the spec it was meant to satisfy. D doubles cost and gives two artifacts to compare with no oracle for which is right — a wrong-but-consistent interpretation reproduces on the retry.

Solution

The three MAST categories are specification problems (41.77%), coordination failures (36.94%), and verification gaps (21.30%); specification problems are the largest. Specification problems propagate especially badly in a pipeline because a mid-pipeline agent cannot pause to ask — an under-specified handoff becomes a guess the agent resolves and passes downstream as settled fact, so a single ambiguity at the top seeds a wrong decision that every later stage treats as valid input.

Solution

A compounding cross-system failure can pass every per-component evaluation because it lives between components, not inside any one: each part passes its own eval in isolation, and the degradation emerges only from their interaction, on traffic slices no single test exercises. In the case described in the April-23 postmortem, three bugs with partly overlapping windows (only the first two overlapped; the union ran Mar 4 – Apr 20) produced “broad, inconsistent degradation” that no single bug’s symptom resembled, and neither internal usage nor the existing eval suite reproduced it. The postmortem concluded that integration-level testing was needed instead: broad per-model evaluations on every prompt change, ablation testing, and soak periods before rollout, which test the system as it actually runs rather than each unit alone.

Exam essentials

Failures compound across boundaries — multi-agent systems fail at much higher rates than their parts (the MAST taxonomy: specification 41.77%, coordination 36.94%, verification 21.30%); a chain’s reliability is the product of its handoffs, not its best agent.
Ambiguity propagates silently — a mid-pipeline agent cannot pause to ask (unlike D5.2’s interactive escalation), so it resolves an ambiguity and passes the guess downstream as settled fact.
Compounding failures evade unit tests — they live between components, on traffic slices no single test exercises; all-green component evals do not prove a multi-stage system healthy.
Defenses — structured error context across boundaries (D2.2), independent validation / isolated judges (D4.6), circuit breakers to stop cascades, and keeping the escalable decision at the coordinator (D5.2); the single-agent analog is broad evals + ablation + soak periods.