D4.4 closed the validation loop with the model: detect a semantic error, feed it back, retry. This chapter closes the other loop: when automation is not enough, route to a human. The architect’s job is calibration, deciding per output whether to trust it, verify it automatically, or escalate it to a person. The routing-and-funnel pattern is durable; the specific field names are illustrative, so treat this chapter as an architectural pattern rather than a fixed schema.
Not every output earns automatic trust
The cheapest reliability move is to give the model a way to check itself: “Include tests, screenshots, or expected outputs so Claude can check itself. This is the single highest-leverage thing you can do.” [Official] Best practices for Claude Code · Anthropic. But some judgments cannot be auto-verified: a wrong-but-plausible extraction, a borderline classification, a high-stakes decision with no ground truth to diff against. Those are where a human belongs. Confidence calibration is the discipline of deciding, output by output, which path each one takes.
Confidence as a routing signal
To route, you need a signal you can act on, and the reliable signals are checkable, not self-reported. The schema hooks from D4.4 double as confidence signals. (These are a design pattern this book recommends, not a built-in platform field; you add them to your schema.) When a model’s calculated_total disagrees with the document’s stated_total, neither figure can be trusted and the record needs human review; when a conflict_detected flag is true, the record routes to a person; and a structured confidence field (high / medium / low) on an extraction gives the caller an explicit value to threshold on. Each is a place where the system can say “I am not sure” in a form a router can read.
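The three signals above can be sketched as a small check that reports which of them fired. This is a minimal illustration, not a platform API: the Extraction class and the review_reasons helper are hypothetical names, the field names follow the examples in the text, and treating only "low" as the routing threshold is an assumption.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List


@dataclass
class Extraction:
    # Hypothetical record shape; you would add these fields to your own schema.
    calculated_total: Decimal  # total the model computed from line items
    stated_total: Decimal      # total printed on the source document
    conflict_detected: bool    # model's flag for contradictory source data
    confidence: str            # "high", "medium", or "low"


def review_reasons(rec: Extraction) -> List[str]:
    """Return the routable signals that fired; an empty list means none did."""
    reasons = []
    if rec.calculated_total != rec.stated_total:
        reasons.append("total_mismatch")      # checkable cross-check
    if rec.conflict_detected:
        reasons.append("conflict_detected")   # self-flagged conflict
    if rec.confidence == "low":
        reasons.append("low_confidence")      # assumed threshold: only "low" routes
    return reasons
```

Returning the reasons rather than a bare boolean lets a router or a reviewer see why a record was held back.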
Two senses of “calibration”
The word is doing double duty in this chapter, and the distinction is worth making sharp. There is the routing calibration the chapter is built on — which output goes to which tier — and there is measurement calibration: whether a confidence value actually tracks accuracy. A model is well-calibrated, in the measurement sense, only if its “90% confident” outputs are right about 90% of the time. Self-reported confidence usually fails this test — models tend to be over-confident, reporting high certainty on answers that are wrong — which is precisely why a raw “high” cannot gate the human queue on its own.
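Measurement calibration can be checked empirically by comparing each stated confidence level with how often outputs at that level were actually right. The sketch below assumes you already hold a labeled sample of (stated level, was correct) pairs; the function name and the sample data are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_level(samples: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each stated confidence level to the observed fraction correct."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for level, was_correct in samples:
        totals[level] += 1
        if was_correct:
            correct[level] += 1
    return {level: correct[level] / totals[level] for level in totals}


# Illustrative, invented sample: (stated confidence, matched ground truth?)
labeled = [
    ("high", True), ("high", True), ("high", False), ("high", True),
    ("low", False), ("low", True),
]
observed = accuracy_by_level(labeled)  # {"high": 0.75, "low": 0.5}
```

If "high" turns out to be right only three times in four on your own data, that figure, not the label, is what a threshold should be set against.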
Independent validation before the human
Between auto-accept and the human sits an automated reviewer tier: an isolated judge. Independent validation (“deploy isolated judge agents”) is one of the practitioner-recommended defenses, and the gains are real: “PwC achieved a 7x accuracy improvement (10% to 70%) through structured validation loops.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment Code. The judge works for the same reason the D4.6 reviewer does: a fresh, isolated context has no authorship bias toward the output it is checking. Its job is to filter, resolving what it can and escalating only what it cannot, so the human queue stays small and high-value.
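The filtering role of the judge can be sketched as a three-step function. Everything here is an assumption for illustration: auto_check and judge are hypothetical callables you would supply, the judge returns True or False when it can resolve a record and None when it cannot, and what happens to a rejected record (for example, a retry as in D4.4) is left to the caller.

```python
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    HUMAN = "human"


def run_funnel(
    record: dict,
    auto_check: Callable[[dict], bool],
    judge: Callable[[dict], Optional[bool]],
) -> Outcome:
    # Tier 1: cheap automated checks clear the obvious cases.
    if auto_check(record):
        return Outcome.ACCEPT
    # Tier 2: the judge receives only the record, never the authoring
    # conversation, so it has no authorship bias toward the output.
    verdict = judge(record)
    if verdict is True:
        return Outcome.ACCEPT
    if verdict is False:
        return Outcome.REJECT
    # Tier 3: the judge could not resolve it, so a person decides.
    return Outcome.HUMAN
```

Passing the judge in as a callable keeps the sketch independent of any particular model API; isolation itself is the responsibility of whatever implements that callable.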
Calibrating the threshold
Where the human-review line falls is the calibration, and it is set by stakes times uncertainty. A low-stakes, high-confidence output proceeds automatically; a high-stakes, low-confidence one goes to a person; the middle is where the judge tier earns its keep. This makes human review the downstream half of human-in-the-loop, with escalation (D5.2) as the upstream half: the agent asks before acting when intent is unclear, and a human checks after the output is produced when confidence is low. Calibrate the thresholds so a reviewer sees the few outputs where their judgment changes the outcome, and none where it does not.
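One way to make "stakes times uncertainty" concrete is to score both on a 0 to 1 scale and compare the product with two cut-offs. The scale, the function name, and the default thresholds below are all placeholder assumptions; real values would come from tuning against your own review data.

```python
def route_by_risk(
    stakes: float,
    uncertainty: float,
    judge_at: float = 0.1,   # placeholder threshold, not a recommended value
    human_at: float = 0.5,   # placeholder threshold, not a recommended value
) -> str:
    """Return "auto", "judge", or "human" for a stakes x uncertainty score."""
    if not (0.0 <= stakes <= 1.0 and 0.0 <= uncertainty <= 1.0):
        raise ValueError("stakes and uncertainty must both be in [0, 1]")
    risk = stakes * uncertainty
    if risk >= human_at:
        return "human"   # high stakes and low confidence
    if risk >= judge_at:
        return "judge"   # the middle band, where the judge tier works
    return "auto"        # low stakes or high confidence
```

Raising human_at shrinks the human queue at the cost of more errors slipping through the judge; lowering it does the opposite.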
Practice
Exercise solutions
B. The design calibrates: checkable signals (confidence, conflict_detected) select which records are uncertain, an isolated judge clears the merely-plausible ones, and only what the judge cannot resolve reaches a human — so high-stakes errors are caught without reviewing everything. A retries with the same model, and a confidently-wrong extraction reproduces on the second run, so agreement is not evidence of correctness. C trusts self-reported confidence, which a confidently-wrong output also reports as “high” — the exact trap. D is safe but uncalibrated: it spends the most expensive reviewer on every record, most of which need no human, and does not scale.
Three routable signals: a cross-check mismatch (calculated_total ≠ stated_total), a self-flagged conflict_detected: true, and a low stated confidence (a failed provenance check — a cited span absent from the source — is a fourth). A model’s self-reported “high confidence” is not reliable on its own because it is a claim, not a measurement: a confidently-wrong output reports high confidence too, and models tend to be over-confident, so stated confidence often fails to track actual accuracy. The signals worth routing on are the checkable ones — a cross-check either matches or it does not, independent of what the model believes — whereas self-reported confidence must first be empirically calibrated against observed accuracy before it can gate anything.
The funnel is auto-check → isolated judge → human. Tier 1 (cheap automated checks) handles the obvious cases — a cross-check mismatch or a thresholded signal — and escalates anything it cannot clear. Tier 2 (an isolated judge agent) reviews the merely-plausible cases in a fresh, independent context, resolving what it can and escalating only what it cannot. Tier 3 (the human) sees only what survived both. The isolated judge catches errors the cheap auto-checks miss because those errors are wrong-but-plausible — they pass the mechanical checks (valid shape, no flagged conflict) yet are semantically wrong, and a fresh-context reviewer with no authorship bias can judge correctness where a regex or equality test cannot. Each tier escalates only what it cannot resolve, so the most expensive reviewer spends attention only on the decisions that truly need human judgment.
Exam essentials
- Calibrate, don’t blanket-trust or blanket-review: decide per output by weighing the cost of a wrong auto-accept against the cost of a human glance; verification is the highest-leverage habit where it is possible.
- Route on checkable signals: cross-check mismatches (calculated_total ≠ stated_total), self-flagged conflict_detected, low confidence, and failed provenance are routable signals; self-reported confidence alone is a claim, not a measurement. Two senses of calibration: routing (which output to which tier) and measurement (does “90% confident” mean 90% correct?). Empirically calibrate a confidence field (accuracy per stated level) or prefer checkable signals that need no calibration.
- Tiered funnel: cheap auto-checks → isolated judge (fresh context, no authorship bias) → human; each tier escalates only what it cannot resolve, keeping the human queue small (structured validation loops drove a documented 7× accuracy gain).
- Threshold by stakes × uncertainty: high-stakes + low-confidence routes to a human; human review is the downstream half of human-in-the-loop, escalation (D5.2) the upstream half.