The agent just produced 200 lines of code that looks correct. The syntax is clean, the variable names are reasonable, the logic reads well. How do you know it actually works? AI-generated code has a failure mode that human-written code rarely shares: it looks correct even when it is wrong. That appearance of correctness is precisely why verification is essential: the bugs are subtle, not obvious.

Representation

The edit-test-commit loop is the quality-preservation layer around the session loop. Where the session loop handles “what does the agent do next?”, this loop handles “how do we know the agent’s output is correct?” The answer, overwhelmingly, is: the agent verifies its own work against criteria you specified, and you verify that the criteria were adequate.

The six-layer validation architecture

Not all verification is the same. A robust project layers defenses so that no single failure mode goes undetected:

  1. Type safety — static checking before the code runs (mypy, TypeScript, Go’s type checker, Rust’s type system and borrow checker). The cheapest layer.
  2. Input validation — preconditions at function entry. Fail fast with explicit errors (the fail-fast principle from Ch 1).
  3. Unit tests — each function in isolation; happy path + error cases + edge cases.
  4. Integration tests — multi-function workflows with realistic data.
  5. End-to-end tests — complete user workflows from input to output.
  6. Property-based tests — invariants that should always hold; generated inputs catch edge cases you didn’t think of.

Layers 1–2 are cheap enough to add to any project today. Layers 3–4 are the production baseline. Layers 5–6 are for systems where correctness is load-bearing.
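
As a minimal sketch of what layers 1 and 2 can look like in Python: the function name `normalize_amounts` and its rules are hypothetical, chosen only for illustration. The type hints give a static checker such as mypy something to verify, and the precondition checks fail fast with explicit errors.

```python
import math
from typing import Sequence


def normalize_amounts(amounts: Sequence[float], total: float) -> list[float]:
    """Scale each amount by a positive total (hypothetical example function)."""
    # Layer 2: preconditions at function entry, failing fast with explicit errors.
    if not amounts:
        raise ValueError("amounts must not be empty")
    if not math.isfinite(total) or total <= 0:
        raise ValueError(f"total must be a positive finite number, got {total!r}")
    # Layer 1: the annotations above let a static checker reject a call such as
    # normalize_amounts("abc", None) before the code ever runs.
    return [a / total for a in amounts]
```

Neither layer needs a test suite to exist, which is why they are the cheapest place to start.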

The missing seventh layer: domain correctness

The six-layer model catches structural errors. It does not catch domain-correctness failures — code that is syntactically valid, passes all tests, and produces the wrong answer.

Examples from practice that every working practitioner has seen:

These failures share a pattern: structural tests verify that the code runs; they do not verify that it means what you intended. No amount of AI-generated unit testing catches them because the agent doesn’t know what the answer should look like — you do.
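
To make the pattern concrete, here is a hypothetical illustration (constructed for this sketch, not drawn from a real project): two functions that both run, type-check, and pass a structural test on clean data, while only one of them gives the answer an analyst would mean by "mean amount". The function names and the choice of invariant are assumptions.

```python
from typing import Optional


def mean_amount_structural(amounts: list[Optional[float]]) -> float:
    """Runs and passes a naive test, but silently treats missing values as zero."""
    filled = [a if a is not None else 0.0 for a in amounts]
    return sum(filled) / len(filled)


def mean_amount_domain(amounts: list[Optional[float]]) -> float:
    """Averages only the observed values."""
    observed = [a for a in amounts if a is not None]
    if not observed:
        raise ValueError("no observed amounts to average")
    return sum(observed) / len(observed)


# A structural test on clean data passes for both versions.
assert mean_amount_structural([10.0, 20.0]) == 15.0
assert mean_amount_domain([10.0, 20.0]) == 15.0

# A domain invariant separates them: the mean of the observed values must lie
# between the smallest and largest observed value.
data = [10.0, None, None, 20.0]
assert 10.0 <= mean_amount_domain(data) <= 20.0
assert not (10.0 <= mean_amount_structural(data) <= 20.0)
```

The invariant in the last block is the kind of knowledge the agent does not have unless you state it.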

Phase-appropriate standards

Not all code needs the same rigor. Applying production standards to a prototype kills velocity; applying prototype standards to production kills reliability. The fix is to be explicit about which standard applies now, and when to transition.

| Phase | Testing | Code quality | Transition criteria |
| --- | --- | --- | --- |
| Exploration | Manual OK | Long functions OK | Hypothesis validated |
| Development | Unit + integration | Style enforced, type hints | Coverage >60%, code review |
| Production | Full 6-layer + domain invariants | Strict lint, immutability | Coverage >80%, zero critical warnings |

The briefing doc is where phase membership lives. When a project graduates from exploration to development, the briefing doc changes and the agent’s behavior changes with it.
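
One way to keep the phase explicit is to encode the standards as data that a hook or CI job can check. The sketch below is an assumption-laden illustration: the class and field names are hypothetical, and it reads the coverage figures in the table as the floor a project must clear in that phase, which is only one possible interpretation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhaseStandard:
    testing: str
    min_coverage: Optional[float]  # fraction of lines covered; None means no floor


# Coverage floors follow the phase table; everything else here is hypothetical.
STANDARDS = {
    "exploration": PhaseStandard("manual", None),
    "development": PhaseStandard("unit + integration", 0.60),
    "production": PhaseStandard("six-layer + domain invariants", 0.80),
}


def meets_standard(phase: str, coverage: float) -> bool:
    """Return True if measured coverage exceeds the floor for the given phase."""
    floor = STANDARDS[phase].min_coverage
    return floor is None or coverage > floor
```

Changing a project's phase then becomes a one-line change that tightens every automated check at once.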

Operation

The test-first workflow with an agent follows a consistent four-step pattern across all three CLIs:

  1. Describe the interface — inputs, outputs, error cases, invariants.
  2. Agent writes tests from that interface description.
  3. Agent writes implementation that passes the tests.
  4. Tests run automatically via hooks / guards / CI.

The test-first framing works because agents excel at test generation when given clear specifications. The tests then constrain the implementation, preventing the “looks correct but is subtly wrong” failure mode that plagues code-first generation.

Prompt: "Create a FeatureValidator class.
  Interface:
    - Takes a DataFrame and a schema dict
    - Validates column types, value ranges, null counts
    - Returns ValidationResult with errors list
    - Raises ValueError if required columns are missing
  Write tests first, then implementation.
  Include domain invariants:
    - Empty DataFrame raises ValueError
    - NaN-heavy inputs (>50% nulls) emit a warning but don't fail
  Verify: pytest tests/ passes before you return."
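
The same four steps can be sketched on a much smaller function using only the standard library. The function `clamp_percentage` and its interface are hypothetical. The point is the ordering: the tests are written from the interface description alone, and the implementation comes afterwards.

```python
import math

# Step 1: the interface, as it would be described to the agent.
#   clamp_percentage(value: float) -> float
#   - returns value limited to the range [0, 100]
#   - raises ValueError for NaN
#   - invariant: the result is always within [0, 100]


# Step 2: tests written from the interface, before any implementation exists.
def test_in_range_value_is_unchanged():
    assert clamp_percentage(42.5) == 42.5


def test_out_of_range_values_are_clamped():
    assert clamp_percentage(-5.0) == 0.0
    assert clamp_percentage(250.0) == 100.0


def test_nan_is_rejected():
    try:
        clamp_percentage(math.nan)
    except ValueError:
        return
    raise AssertionError("expected ValueError for NaN")


# Step 3: the implementation, written to make the tests above pass.
def clamp_percentage(value: float) -> float:
    if math.isnan(value):
        raise ValueError("value must not be NaN")
    return min(100.0, max(0.0, value))


# Step 4: in a real project a hook or CI job would run the tests;
# here they are simply called directly.
for test in (
    test_in_range_value_is_unchanged,
    test_out_of_range_values_are_clamped,
    test_nan_is_rejected,
):
    test()
```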

Tri-tool automation surface

| Verification primitive | Claude Code | Gemini CLI | Codex CLI |
| --- | --- | --- | --- |
| Briefing-doc verification rules | CLAUDE.md ## Verification section | GEMINI.md section | AGENTS.md section |
| Run tests after edit (hooked) | PostToolUse matching Edit\|Write | tool-level allowlist + pre/post hooks | command-approval config |
| Block commits on failure | PreToolUse matching Bash → gate on git commit | pre-run hook with exit code | commit-approval config |
| Per-test output filtering | hooks can summarize / filter | hooks + prompt filtering | prompt filtering |
| Property-based test generation | prompt-driven (Hypothesis / fast-check) | same | same |
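
The "block commits on failure" row can be sketched as a small hook script. This is a sketch under stated assumptions: that the harness passes the pending tool call as JSON on stdin with `tool_name` and `tool_input.command` fields, that exit code 2 signals a blocking failure, and that `pytest -q` is the project's test command. Check your tool's hook documentation before relying on any of these.

```python
import json
import subprocess
import sys
from typing import Callable


def tests_pass() -> bool:
    """Run the project's test suite; `pytest -q` is an assumed command."""
    return subprocess.run(["pytest", "-q"]).returncode == 0


def should_block(event: dict, run_tests: Callable[[], bool] = tests_pass) -> bool:
    """Block only shell calls that invoke `git commit` while tests are failing."""
    if event.get("tool_name") != "Bash":
        return False
    command = event.get("tool_input", {}).get("command", "")
    if "git commit" not in command:
        return False
    return not run_tests()


def main() -> int:
    event = json.load(sys.stdin)  # assumed: the harness sends the tool call as JSON
    if should_block(event):
        print("Tests are failing; commit blocked.", file=sys.stderr)
        return 2  # assumed blocking exit code; confirm against your tool's docs
    return 0

# When installed as a hook script, the entry point would be: sys.exit(main())
```

Keeping the decision in `should_block` and injecting the test runner makes the gate itself testable without running a real test suite.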

Property-based tests as a force multiplier

Property-based testing is underused in agent-assisted workflows, and it should not be. A single Hypothesis (Python) or fast-check (JavaScript) test can replace dozens of hand-written edge-case tests and catch entire classes of bugs the agent would never have generated by enumeration.

import pandas as pd
from hypothesis import given, settings
from hypothesis import strategies as st

# build_features is the project function under test; the module path is hypothetical.
from features import build_features

@given(st.lists(
    st.floats(allow_nan=True, allow_infinity=False),
    min_size=1, max_size=100
))
@settings(max_examples=200)
def test_feature_builder_invariants(values):
    # Generated inputs include NaN and negative amounts, not just clean positives.
    df = pd.DataFrame({"amount": values})
    result = build_features(df)
    # Schema invariant: output columns never change
    assert set(result.columns) == {"amount", "log_amount", "is_missing"}
    # Null invariant: NaN inputs produce is_missing=True
    assert (df["amount"].isna() == result["is_missing"]).all()
    # Range invariant: log_amount is never negative
    assert (result["log_amount"].dropna() >= 0).all()
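
For reference, here is one hypothetical `build_features` that would satisfy the three invariants in the test. It is a sketch, not the implementation the test was written against. Clipping negative amounts to zero before taking the logarithm is an assumption made here so that `log_amount` stays non-negative; a real project would have to decide whether negative amounts are valid at all.

```python
import numpy as np
import pandas as pd


def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive log_amount and is_missing from an 'amount' column (illustrative)."""
    amount = df["amount"].astype(float)
    out = pd.DataFrame({"amount": amount})
    # Assumption: negative amounts are clipped to zero, so log1p is defined and
    # never negative. NaN stays NaN through both clip and log1p.
    out["log_amount"] = np.log1p(amount.clip(lower=0))
    out["is_missing"] = amount.isna()
    return out
```

The value of the property test is that it would catch a version that skipped the clip: any generated amount between -1 and 0 would produce a negative `log_amount`.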

Evolution

Verification-first is the single most convergent practice in agentic coding. The principle is universal across tools; what differs substantially is the enforcement surface — how a tool lets you make verification non-skippable at the harness level rather than just recommended at the prompt level.

Convergence: the six-layer model predates agentic coding. The validation hierarchy (types → input → unit → integration → E2E → property) is pre-AI software engineering best practice. AI changes the enforcement economics — hooks make the layers automatic rather than aspirational — but the layers themselves are stable. Practices written to the six-layer shape will hold for the foreseeable future.

Emerging: auto-repair loops. When tests fail, some recent tool builds (Claude’s Stop hook, Gemini’s agent chaining) auto-loop the failure back into the agent for repair. The pattern is promising but unreliable — the failure diagnostic often doesn’t surface the root cause, and the agent ends up repeatedly trying cosmetic fixes. Practice for 2026: if the auto-repair loop runs more than twice on the same failure, it’s telling you something structural is wrong — intervene.
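
The "more than twice on the same failure" rule can be sketched as a loop that tracks how often each failure recurs. The function and parameter names are hypothetical, and the failure "signature" is assumed to be something stable such as the failing test's name.

```python
from typing import Callable, Optional


def repair_loop(
    run_tests: Callable[[], Optional[str]],
    attempt_fix: Callable[[str], None],
    max_repeats: int = 2,
    max_total: int = 10,
) -> str:
    """Feed failures back to the agent, but stop when one failure keeps recurring.

    run_tests returns None on success, or a failure signature on failure.
    attempt_fix asks the agent to repair the given failure.
    """
    repeats: dict[str, int] = {}
    for _ in range(max_total):
        failure = run_tests()
        if failure is None:
            return "passed"
        if repeats.get(failure, 0) >= max_repeats:
            # Two repair attempts did not clear this failure: hand it to a human.
            return f"escalate: {failure}"
        repeats[failure] = repeats.get(failure, 0) + 1
        attempt_fix(failure)
    return "escalate: attempt budget exhausted"
```

The overall budget in `max_total` is an extra safeguard added in this sketch for the case where each repair produces a new, different failure.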

Quick reference