The edit-test-commit loop
AI-generated code's defining failure mode — it *looks* correct. The edit-test-commit loop exists to catch the subtle bugs the agent cannot catch on its own. Verification is not a quality gate; it is the single highest-leverage practice in agentic coding.
On this page
The agent just produced 200 lines of code that looks correct. The syntax is clean, the variable names are reasonable, the logic reads well. How do you know it actually works? AI-generated code has a specific failure mode human-written code does not: it looks correct. The appearance of correctness is precisely why verification is essential — the bugs are subtle, not obvious.
Representation
The edit-test-commit loop is the quality-preservation layer around the session loop. Where the session loop handles what does the agent do next, this loop handles how do we know the agent’s output is correct. The answer, overwhelmingly, is: the agent verifies its own work against criteria you specified, and you verify the criteria were adequate.
The six-layer validation architecture
Not all verification is the same. A robust project layers defenses so that no single failure mode goes undetected:
- Type safety — static checking at compile time (mypy, TypeScript, Go’s type checker, Rust’s type system and borrow checker). The cheapest layer.
- Input validation — preconditions at function entry. Fail fast with explicit errors (the fail-fast principle from Ch 1).
- Unit tests — each function in isolation; happy path + error cases + edge cases.
- Integration tests — multi-function workflows with realistic data.
- End-to-end tests — complete user workflows from input to output.
- Property-based tests — invariants that should always hold; generated inputs catch edge cases you didn’t think of.
Layers 1–2 are cheap enough to add to any project today. Layers 3–4 are the production baseline. Layers 5–6 are for systems where correctness is load-bearing.
The missing seventh layer: domain correctness
The six-layer model catches structural errors. It does not catch domain-correctness failures — code that is syntactically valid, passes all tests, and produces the wrong answer.
Examples from practice that every working practitioner has seen:
- Data leakage. A feature-engineering function uses future values during training. Tests pass because the function is deterministic. The model achieves unrealistic accuracy in validation and fails in production.
- Wrong aggregation. A revenue calculation sums instead of averaging across time. Tests pass because the function produces a number. The number is wrong by an order of magnitude.
- Survivorship bias. A cohort analysis excludes deleted records. Tests pass because the query runs. The results quietly mislead every downstream decision.
- Silent unit mismatch. A function mixes daily and monthly rates. Tests pass because both are floats. The financial model is off by ~30×.
These failures share a pattern: structural tests verify that the code runs; they do not verify that it means what you intended. No amount of AI-generated unit testing catches them because the agent doesn’t know what the answer should look like — you do.
Phase-appropriate standards
Not all code needs the same rigor. Applying production standards to a prototype kills velocity; applying prototype standards to production kills reliability. The fix is to be explicit about which standard applies now, and when to transition.
| Phase | Testing | Code quality | Transition criteria |
|---|---|---|---|
| Exploration | Manual OK | Long functions OK | Hypothesis validated |
| Development | Unit + integration | Style enforced, type hints | Coverage >60%, code review |
| Production | Full 6-layer + domain invariants | Strict lint, immutability | Coverage >80%, zero critical warnings |
The briefing doc is where phase membership lives. When a project graduates from exploration to development, the briefing doc changes and the agent’s behavior changes with it.
Operation
The test-first workflow with an agent follows a consistent four-step pattern across all three CLIs:
- Describe the interface — inputs, outputs, error cases, invariants.
- Agent writes tests from that interface description.
- Agent writes implementation that passes the tests.
- Tests run automatically via hooks / guards / CI.
The test-first framing works because agents excel at test generation when given clear specifications. The tests then constrain the implementation, preventing the “looks correct but is subtly wrong” failure mode that plagues code-first generation.
Prompt: "Create a FeatureValidator class.
Interface:
- Takes a DataFrame and a schema dict
- Validates column types, value ranges, null counts
- Returns ValidationResult with errors list
- Raises ValueError if required columns are missing
Write tests first, then implementation.
Include domain invariants:
- Empty DataFrame raises ValueError
- NaN-heavy inputs (>50% nulls) emit a warning but don't fail
Verify: pytest tests/ passes before you return."
Tri-tool automation surface
| Verification primitive | Claude Code | Gemini CLI | Codex CLI |
|---|---|---|---|
| Briefing-doc verification rules | CLAUDE.md ## Verification section | GEMINI.md section | AGENTS.md section |
| Run tests after edit (hooked) | PostToolUse matching Edit|Write | tool-level allowlist + pre/post hooks | command-approval config |
| Block commits on failure | PreToolUse matching Bash → gate on git commit | pre-run hook with exit code | commit-approval config |
| Per-test output filtering | hooks can summarize / filter | hooks + prompt filtering | prompt filtering |
| Property-based test generation | prompt-driven (Hypothesis / fast-check) | same | same |
Property-based tests as a force multiplier
Property-based testing is underused in agent-assisted workflows, and it should not be. A single Hypothesis (Python) or fast-check (JavaScript) test can replace dozens of hand-written edge-case tests and catch entire classes of bugs the agent would never have generated by enumeration.
from hypothesis import given, settings
from hypothesis import strategies as st
@given(st.lists(
st.floats(allow_nan=True, allow_infinity=False),
min_size=1, max_size=100
))
@settings(max_examples=200)
def test_feature_builder_invariants(values):
df = pd.DataFrame({"amount": values})
result = build_features(df)
# Schema invariant: output columns never change
assert set(result.columns) == {"amount", "log_amount", "is_missing"}
# Null invariant: NaN inputs produce is_missing=True
assert (df["amount"].isna() == result["is_missing"]).all()
# Range invariant: log_amount is never negative
assert (result["log_amount"].dropna() >= 0).all()
Evolution
Verification-first is the single most convergent practice in agentic coding. The principle is universal across tools; what differs substantially is the enforcement surface — how a tool lets you make verification non-skippable at the harness level rather than just recommended at the prompt level.
Convergence: the six-layer model predates agentic coding. The validation hierarchy (types → input → unit → integration → E2E → property) is pre-AI software engineering best practice. AI changes the enforcement economics — hooks make the layers automatic rather than aspirational — but the layers themselves are stable. Practices written to the six-layer shape will hold for the foreseeable future.
Emerging: auto-repair loops. When tests fail, some recent tool builds (Claude’s Stop hook, Gemini’s agent chaining) auto-loop the failure back into the agent for repair. The pattern is promising but unreliable — the failure diagnostic often doesn’t surface the root cause, and the agent ends up repeatedly trying cosmetic fixes. Practice for 2026: if the auto-repair loop runs more than twice on the same failure, it’s telling you something structural is wrong — intervene.
Quick reference
- AI-generated code looks correct. Verification is the core quality-preservation mechanism, not an optional add-on.
- Six structural layers: types, input validation, unit, integration, E2E, property-based. Seventh semantic layer: domain invariants.
- Domain invariants catch what the six-layer model cannot — the agent can generate structure, but only you know the invariants.
- Phase-appropriate rigor: exploration / development / production each earn different standards. Make the current phase explicit in the briefing doc.
- Test-first workflow outperforms code-first: interface → tests → implementation → run.
- Briefing-doc verification rules improve every session; hooks enforce what can’t be forgotten.
- Property-based tests deserve wider use — single tests replace dozens of enumerated cases.
- Hook-system depth varies across tools; the practice of making verification non-skippable is universal.