The agent just produced 200 lines of code that looks correct. The syntax is clean, the variable names are reasonable, the logic reads well. How do you know it actually works? AI-generated code has a specific failure mode human-written code does not: it looks correct. The appearance of correctness is precisely why verification is essential — the bugs are subtle, not obvious.

Representation

The edit-test-commit loop is the quality-preservation layer around the session loop. Where the session loop handles what does the agent do next, this loop handles how do we know the agent’s output is correct. The answer, overwhelmingly, is: the agent verifies its own work against criteria you specified, and you verify the criteria were adequate.

The six-layer validation architecture

Not all verification is the same. A robust project layers defenses so that no single failure mode goes undetected:

Type safety — static checking at compile time (mypy, TypeScript, Go’s type checker, Rust’s type system and borrow checker). The cheapest layer.
Input validation — preconditions at function entry. Fail fast with explicit errors (the fail-fast principle from Ch 1).
Unit tests — each function in isolation; happy path + error cases + edge cases.
Integration tests — multi-function workflows with realistic data.
End-to-end tests — complete user workflows from input to output.
Property-based tests — invariants that should always hold; generated inputs catch edge cases you didn’t think of.

Layers 1–2 are cheap enough to add to any project today. Layers 3–4 are the production baseline. Layers 5–6 are for systems where correctness is load-bearing.

The missing seventh layer: domain correctness

The six-layer model catches structural errors. It does not catch domain-correctness failures — code that is syntactically valid, passes all tests, and produces the wrong answer.

Examples from practice that every working practitioner has seen:

Data leakage. A feature-engineering function uses future values during training. Tests pass because the function is deterministic. The model achieves unrealistic accuracy in validation and fails in production.
Wrong aggregation. A revenue calculation sums instead of averaging across time. Tests pass because the function produces a number. The number is wrong by an order of magnitude.
Survivorship bias. A cohort analysis excludes deleted records. Tests pass because the query runs. The results quietly mislead every downstream decision.
Silent unit mismatch. A function mixes daily and monthly rates. Tests pass because both are floats. The financial model is off by ~30×.

These failures share a pattern: structural tests verify that the code runs; they do not verify that it means what you intended. No amount of AI-generated unit testing catches them because the agent doesn’t know what the answer should look like — you do.

Phase-appropriate standards

Not all code needs the same rigor. Applying production standards to a prototype kills velocity; applying prototype standards to production kills reliability. The fix is to be explicit about which standard applies now, and when to transition.

Phase	Testing	Code quality	Transition criteria
Exploration	Manual OK	Long functions OK	Hypothesis validated
Development	Unit + integration	Style enforced, type hints	Coverage >60%, code review
Production	Full 6-layer + domain invariants	Strict lint, immutability	Coverage >80%, zero critical warnings

The briefing doc is where phase membership lives. When a project graduates from exploration to development, the briefing doc changes and the agent’s behavior changes with it.

Operation

The test-first workflow with an agent follows a consistent four-step pattern across all three CLIs:

Describe the interface — inputs, outputs, error cases, invariants.
Agent writes tests from that interface description.
Agent writes implementation that passes the tests.
Tests run automatically via hooks / guards / CI.

The test-first framing works because agents excel at test generation when given clear specifications. The tests then constrain the implementation, preventing the “looks correct but is subtly wrong” failure mode that plagues code-first generation.

Prompt: "Create a FeatureValidator class.
  Interface:
    - Takes a DataFrame and a schema dict
    - Validates column types, value ranges, null counts
    - Returns ValidationResult with errors list
    - Raises ValueError if required columns are missing
  Write tests first, then implementation.
  Include domain invariants:
    - Empty DataFrame raises ValueError
    - NaN-heavy inputs (>50% nulls) emit a warning but don't fail
  Verify: pytest tests/ passes before you return."

Tri-tool automation surface

Verification primitive	Claude Code	Gemini CLI	Codex CLI
Briefing-doc verification rules	`CLAUDE.md` `## Verification` section	`GEMINI.md` section	`AGENTS.md` section
Run tests after edit (hooked)	`PostToolUse` matching `Edit\|Write`	tool-level allowlist + pre/post hooks	command-approval config
Block commits on failure	`PreToolUse` matching `Bash` → gate on `git commit`	pre-run hook with exit code	commit-approval config
Per-test output filtering	hooks can summarize / filter	hooks + prompt filtering	prompt filtering
Property-based test generation	prompt-driven (Hypothesis / fast-check)	same	same

Property-based tests as a force multiplier

Property-based testing is underused in agent-assisted workflows, and it should not be. A single Hypothesis (Python) or fast-check (JavaScript) test can replace dozens of hand-written edge-case tests and catch entire classes of bugs the agent would never have generated by enumeration.

from hypothesis import given, settings
from hypothesis import strategies as st

@given(st.lists(
    st.floats(allow_nan=True, allow_infinity=False),
    min_size=1, max_size=100
))
@settings(max_examples=200)
def test_feature_builder_invariants(values):
    df = pd.DataFrame({"amount": values})
    result = build_features(df)
    # Schema invariant: output columns never change
    assert set(result.columns) == {"amount", "log_amount", "is_missing"}
    # Null invariant: NaN inputs produce is_missing=True
    assert (df["amount"].isna() == result["is_missing"]).all()
    # Range invariant: log_amount is never negative
    assert (result["log_amount"].dropna() >= 0).all()

Evolution

Verification-first is the single most convergent practice in agentic coding. The principle is universal across tools; what differs substantially is the enforcement surface — how a tool lets you make verification non-skippable at the harness level rather than just recommended at the prompt level.

Convergence: the six-layer model predates agentic coding. The validation hierarchy (types → input → unit → integration → E2E → property) is pre-AI software engineering best practice. AI changes the enforcement economics — hooks make the layers automatic rather than aspirational — but the layers themselves are stable. Practices written to the six-layer shape will hold for the foreseeable future.

Emerging: auto-repair loops. When tests fail, some recent tool builds (Claude’s Stop hook, Gemini’s agent chaining) auto-loop the failure back into the agent for repair. The pattern is promising but unreliable — the failure diagnostic often doesn’t surface the root cause, and the agent ends up repeatedly trying cosmetic fixes. Practice for 2026: if the auto-repair loop runs more than twice on the same failure, it’s telling you something structural is wrong — intervene.

Quick reference

AI-generated code looks correct. Verification is the core quality-preservation mechanism, not an optional add-on.
Six structural layers: types, input validation, unit, integration, E2E, property-based. Seventh semantic layer: domain invariants.
Domain invariants catch what the six-layer model cannot — the agent can generate structure, but only you know the invariants.
Phase-appropriate rigor: exploration / development / production each earn different standards. Make the current phase explicit in the briefing doc.
Test-first workflow outperforms code-first: interface → tests → implementation → run.
Briefing-doc verification rules improve every session; hooks enforce what can’t be forgotten.
Property-based tests deserve wider use — single tests replace dozens of enumerated cases.
Hook-system depth varies across tools; the practice of making verification non-skippable is universal.