D4.4 closed the validation loop with the model: detect a semantic error, feed it back, retry. This chapter closes the other loop: when automation is not enough, route to a human. The architect’s job is calibration, deciding per output whether to trust it, verify it automatically, or escalate it to a person. The routing-and-funnel pattern is durable; the specific fields are illustrative, so read this as an architectural pattern rather than a fixed schema.

Do I know this already? (Diagnostic)

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

What makes the decision to auto-accept vs route-to-human an economic one rather than a quality slogan?
Why is a model’s self-reported “high confidence” not a signal you can route on directly?
What does it mean for a confidence signal to be calibrated in the measurement sense, and how would you check it?
Describe the three tiers of the review funnel and what each one escalates.
What two factors set where the human-review threshold falls?

Check your answers

Because for each output you weigh the cost of a wrong auto-accept against the cost of a human glance — cheap and reversible automates, expensive or irreversible routes to a person; “review everything” and “trust everything” are both failures to calibrate.
Self-reported confidence is a claim, not a measurement — a confidently-wrong output reports high confidence too, so route on checkable signals (e.g. calculated_total ≠ stated_total, conflict_detected) instead.
Calibrated means the stated value tracks actual accuracy — “90% confident” outputs are right about 90% of the time; check it by measuring, over real labeled data, the accuracy at each stated-confidence level.
Auto-check → isolated judge → human: cheap automated checks handle the obvious cases, the isolated judge (fresh context, no authorship bias) catches the wrong-but-plausible ones, and each tier escalates only what it cannot resolve, so the human sees only what survives both.
Stakes × uncertainty — low-stakes, high-confidence proceeds automatically; high-stakes, low-confidence goes to a person; the middle is where the judge tier earns its keep.

Not every output earns automatic trust

The cheapest reliability move is to give the model a way to check itself: “Include tests, screenshots, or expected outputs so Claude can check itself. This is the single highest-leverage thing you can do.” [Official] Best practices for Claude Code · Anthropic. But some judgments cannot be auto-verified: a wrong-but-plausible extraction, a borderline classification, a high-stakes decision with no ground truth to diff against. Those are where a human belongs. Confidence calibration is the discipline of deciding, output by output, which path each one takes.

Confidence as a routing signal

To route, you need a signal you can act on, and the reliable signals are checkable, not self-reported. The schema hooks from D4.4 double as confidence signals. (These are a design pattern this book recommends, not a built-in platform field; you add them to your schema.) When a model’s calculated_total disagrees with the document’s stated_total, the record demands human review, because at least one of the two values is wrong; when a conflict_detected flag is true, the record routes to a person; and a structured confidence field (high / medium / low) on an extraction gives the caller an explicit value to threshold on. Each is a place where the system can say “I am not sure” in a form a router can read.
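As a minimal sketch of how those hooks could be read by a router: the field names come from the text above, while the `Extraction` class, the `routing_reasons` helper, the reason strings, and the rounding tolerance are hypothetical choices made for illustration, not a platform API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Extraction:
    """Hypothetical record shape; the field names follow the text."""
    calculated_total: float   # total the model recomputed from line items
    stated_total: float       # total printed on the source document
    conflict_detected: bool   # model's flag that the source disagrees with itself
    confidence: str           # "high", "medium", or "low"


def routing_reasons(rec: Extraction, tolerance: float = 0.005) -> List[str]:
    """List the reasons this record should leave the auto-accept path."""
    reasons = []
    # Checkable signal: the two totals either agree or they do not.
    if abs(rec.calculated_total - rec.stated_total) > tolerance:
        reasons.append("total_mismatch")
    if rec.conflict_detected:
        reasons.append("conflict_detected")
    # Assumption: anything below "high" is treated as a reason to escalate.
    if rec.confidence != "high":
        reasons.append(f"confidence_{rec.confidence}")
    return reasons


clean = Extraction(120.00, 120.00, False, "high")
suspect = Extraction(118.50, 120.00, False, "high")
print(routing_reasons(clean))    # []
print(routing_reasons(suspect))  # ['total_mismatch']
```

Note that `suspect` reports high confidence and is still routed, because the mismatch is checked independently of what the model claims.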

Two senses of “calibration”

The word is doing double duty in this chapter, and the distinction is worth making sharp. There is the routing calibration the chapter is built on — which output goes to which tier — and there is measurement calibration: whether a confidence value actually tracks accuracy. A model is well-calibrated, in the measurement sense, only if its “90% confident” outputs are right about 90% of the time. Self-reported confidence usually fails this test — models tend to be over-confident, reporting high certainty on answers that are wrong — which is precisely why a raw “high” cannot gate the human queue on its own.
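Stated compactly, the measurement sense is the standard definition of a calibrated probability. The notation below is ours, not the chapter's, and it assumes the confidence is a numeric probability p rather than a label:

```latex
\[
\Pr\bigl(\text{output is correct} \,\bigm|\, \text{stated confidence} = p\bigr) = p
\qquad \text{for every stated level } p .
\]
```

For a categorical field such as high / medium / low there is no stated p to compare against, so the practical check is the observed accuracy within each label.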

Worked example: Calibrating a confidence signal against reality

An extraction pipeline emits a confidence field (high / medium / low). Before trusting it to route, you measure it on a labeled sample of 1,000 past extractions:

Stated confidence	Count	Actually correct
high	700	94%
medium	220	71%
low	80	38%

(Illustrative numbers — the method is the point.) Two readings follow. First, the signal is informative: accuracy falls monotonically from high to low, so it does carry real information about correctness. Second, it is not perfectly calibrated: “high” is 94%, not ~100%, so roughly six in a hundred high-confidence extractions are wrong — and on a high-stakes clinical field that residual is unacceptable. The routing decision now follows from the numbers, not the label: auto-accept high only if a ~6% error rate is tolerable for this field; otherwise send even high through the isolated judge, and route medium/low to a human. Had you trusted the word “high” as if it meant “certain,” you would have shipped that 6% silently.

The discipline: a confidence signal earns its routing role by measurement, not by its name — and you re-measure when the model, the prompt, or the input distribution shifts, because calibration is not permanent.
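The measurement itself is a small computation. The sketch below assumes each labeled sample is a `(stated_confidence, was_correct)` pair; the helper names and the tiny sample are made up for illustration and are not the 1,000-record table above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_level(samples: Iterable[Tuple[str, bool]]) -> Dict[str, Tuple[int, float]]:
    """Map each stated confidence level to (count, observed accuracy)."""
    counts: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for level, is_correct in samples:
        counts[level] += 1
        correct[level] += int(is_correct)
    return {lvl: (counts[lvl], correct[lvl] / counts[lvl]) for lvl in counts}


def is_monotonic(table: Dict[str, Tuple[int, float]],
                 order: Tuple[str, ...] = ("high", "medium", "low")) -> bool:
    """True if observed accuracy never rises as stated confidence falls."""
    accs = [table[lvl][1] for lvl in order if lvl in table]
    return all(a >= b for a, b in zip(accs, accs[1:]))


# Made-up labeled sample: (stated confidence, was the extraction correct?)
labeled = (
    [("high", True)] * 9 + [("high", False)]
    + [("medium", True)] * 3 + [("medium", False)]
    + [("low", True)] + [("low", False)] * 2
)

table = accuracy_by_level(labeled)
for level in ("high", "medium", "low"):
    n, acc = table[level]
    print(f"{level:<6} n={n:<3} accuracy={acc:.0%}")
print("informative (monotonic):", is_monotonic(table))
```

Re-running this on a fresh labeled sample after a model, prompt, or input change is one way to act on the re-measure rule above.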

Independent validation before the human

Between auto-accept and the human sits an automated reviewer tier: an isolated judge. Independent validation (“deploy isolated judge agents”) is one of the practitioner-recommended defenses, and the gains are real: “PwC achieved a 7x accuracy improvement (10% to 70%) through structured validation loops.” [Practitioner] Why multi-agent LLM systems fail and how to fix them · Augment Code. The judge works for the same reason the D4.6 reviewer does: a fresh, isolated context has no authorship bias toward the output it is checking. Its job is to filter, resolving what it can and escalating only what it cannot, so the human queue stays small and high-value.
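The escalate-only-what-you-cannot-resolve rule can be sketched as a chain of tiers. Everything named here is hypothetical: `run_funnel`, the verdict strings, and in particular `stub_judge`, which stands in for what would really be a separate model call in a fresh context.

```python
from typing import Callable, Optional, Tuple

# A tier returns "accept" or "reject" when it can resolve a record,
# or None when it cannot and the record must move up a tier.
Verdict = Optional[str]
Tier = Callable[[dict], Verdict]


def run_funnel(record: dict, auto_check: Tier, judge: Tier) -> Tuple[str, str]:
    """Return (deciding_tier, verdict); the human gets only what both tiers pass up."""
    for name, tier in (("auto_check", auto_check), ("judge", judge)):
        verdict = tier(record)
        if verdict is not None:
            return (name, verdict)
    return ("human", "needs_review")


def auto_check(record: dict) -> Verdict:
    # Hypothetical tier 1: a cheap mechanical check on the signals from the text.
    clean = (record["calculated_total"] == record["stated_total"]
             and not record["conflict_detected"]
             and record["confidence"] == "high")
    return "accept" if clean else None


def stub_judge(record: dict) -> Verdict:
    # Stand-in only: reads a verdict from a hypothetical test field.
    # A missing field means the judge could not resolve the record.
    return record.get("judge_verdict")


base = {"calculated_total": 100, "stated_total": 100,
        "conflict_detected": False, "confidence": "high"}
flagged = {**base, "conflict_detected": True, "judge_verdict": "accept"}
unresolved = {**base, "confidence": "low"}

print(run_funnel(base, auto_check, stub_judge))        # ('auto_check', 'accept')
print(run_funnel(flagged, auto_check, stub_judge))     # ('judge', 'accept')
print(run_funnel(unresolved, auto_check, stub_judge))  # ('human', 'needs_review')
```

This sketch lets tier 1 auto-accept clean high-confidence records, which only suits fields where the measured error rate of "high" is tolerable; on a high-stakes field the same record would go to the judge instead.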

Calibrating the threshold

Where the human-review line falls is the calibration, and it is set by stakes times uncertainty. A low-stakes, high-confidence output proceeds automatically; a high-stakes, low-confidence one goes to a person; the middle is where the judge tier earns its keep. This makes human review the downstream half of human-in-the-loop, with escalation (D5.2) as the upstream half: the agent asks before acting when intent is unclear, and a human checks after the output is produced when confidence is low. Calibrate the thresholds so a reviewer sees the few outputs where their judgment changes the outcome, and none where it does not.
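One way to make "stakes times uncertainty" concrete is a risk score with two cut-offs. This is a sketch under stated assumptions: stakes are scored from 0 to 1, uncertainty is the measured error rate for the stated confidence level (taken here from the illustrative worked-example table as 1 minus observed accuracy), and the two thresholds are placeholders you would tune, not recommended values.

```python
def route(stakes: float, error_rate: float,
          auto_below: float = 0.05, human_at: float = 0.5) -> str:
    """Route on risk = stakes * measured error rate (both scored in [0, 1])."""
    if not (0.0 <= stakes <= 1.0 and 0.0 <= error_rate <= 1.0):
        raise ValueError("stakes and error_rate must both be in [0, 1]")
    risk = stakes * error_rate
    if risk < auto_below:
        return "auto_accept"
    if risk >= human_at:
        return "human"
    return "judge"  # the middle band, where the judge tier earns its keep


# Illustrative error rates: 1 - observed accuracy from the worked-example table.
ERROR_RATE = {"high": 0.06, "medium": 0.29, "low": 0.62}

print(route(0.1, ERROR_RATE["high"]))  # auto_accept
print(route(1.0, ERROR_RATE["low"]))   # human
print(route(1.0, ERROR_RATE["high"]))  # judge
```

The last call shows the point of the worked example: on a maximum-stakes field, even a "high" label with a 6% measured error rate lands in the judge band rather than being auto-accepted.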

Practice

Exercise solutions

Solution

B. The design calibrates: checkable signals (confidence, conflict_detected) select which records are uncertain, an isolated judge clears the merely-plausible ones, and only what the judge cannot resolve reaches a human — so high-stakes errors are caught without reviewing everything. A retries with the same model, and a confidently-wrong extraction reproduces on the second run, so agreement is not evidence of correctness. C trusts self-reported confidence, which a confidently-wrong output also reports as “high” — the exact trap. D is safe but uncalibrated: it spends the most expensive reviewer on every record, most of which need no human, and does not scale.

Solution

Three routable signals: a cross-check mismatch (calculated_total ≠ stated_total), a self-flagged conflict_detected: true, and a low stated confidence (a failed provenance check — a cited span absent from the source — is a fourth). A model’s self-reported “high confidence” is not reliable on its own because it is a claim, not a measurement: a confidently-wrong output reports high confidence too, and models tend to be over-confident, so stated confidence often fails to track actual accuracy. The signals worth routing on are the checkable ones — a cross-check either matches or it does not, independent of what the model believes — whereas self-reported confidence must first be empirically calibrated against observed accuracy before it can gate anything.

Solution

The funnel is auto-check → isolated judge → human. Tier 1 (cheap automated checks) handles the obvious cases — a cross-check mismatch or a thresholded signal — and escalates anything it cannot clear. Tier 2 (an isolated judge agent) reviews the merely-plausible cases in a fresh, independent context, resolving what it can and escalating only what it cannot. Tier 3 (the human) sees only what survived both. The isolated judge catches errors the cheap auto-checks miss because those errors are wrong-but-plausible — they pass the mechanical checks (valid shape, no flagged conflict) yet are semantically wrong, and a fresh-context reviewer with no authorship bias can judge correctness where a regex or equality test cannot. Each tier escalates only what it cannot resolve, so the most expensive reviewer spends attention only on the decisions that truly need human judgment.

Exam essentials

Calibrate, don’t blanket-trust or blanket-review — decide per output by weighing the cost of a wrong auto-accept against a human glance; verification is the highest-leverage habit where it is possible.
Route on checkable signals — cross-check mismatches (calculated_total ≠ stated_total), self-flagged conflict_detected, low confidence, and failed provenance are routable signals; self-reported confidence alone is a claim, not a measurement. Two senses of calibration: routing (which output to which tier) and measurement (does “90% confident” mean 90% correct?). Empirically calibrate a confidence field — accuracy per stated level — or prefer checkable signals that need no calibration.
Tiered funnel — cheap auto-checks → isolated judge (fresh context, no authorship bias) → human; each tier escalates only what it cannot resolve, keeping the human queue small (structured validation loops drove a documented 7× accuracy gain).
Threshold by stakes × uncertainty — high-stakes + low-confidence routes to a human; human review is the downstream half of human-in-the-loop, escalation (D5.2) the upstream half.