How this book is designed
A meta-chapter used during Stage 0 to exercise every component. Also a reader's guide to the book's pedagogical structure, sidenote mechanics, and freshness discipline.
This chapter is a reader’s guide to the book’s structure — and, during Stage 0 of the scaffold build, a working example that exercises every component through the routing pipeline. If something on this page looks wrong on your device, it’s a bug in the scaffold, not in the chapter.
The book uses a uniform three-section decomposition for every chapter, a bounded vocabulary of typed callouts, always-visible sidenotes with a mobile reflow, and visible freshness stamps. The rest of this chapter explains each of those.
Representation
Every chapter has the same skeleton: Representation, Operation, Evolution. The idea is borrowed from Koller and Friedman’s Probabilistic Graphical Models — uniform chapter shape defends against drift. When every chapter looks the same structurally, a reader learns the skeleton once and reads fluently. Drift into feature documentation becomes visible because it won’t fit the shape.
The Evolution cornerstone is load-bearing: it embeds the book’s meta-goal (tracking how practices change) into every chapter, not just a methodology appendix. A convergence across tools signals a stable principle; a divergence signals open design space.
Operation
Each chapter uses a small vocabulary of typed callouts. Six are visual pedagogy blocks:
Two are comparative blocks, used in the Evolution section to surface cross-tool patterns:
Sidenotes and citations
Sidenotes live in the right margin on desktop and reflow as inline asides on mobile. Gwern’s principle applies: sidenotes should display by default, and any reader effort defeats the point. So there is no tap-to-reveal on mobile, just a visually distinct inline block. Citations use the same mechanism but resolve a slug from the source manifest, rendering a tier badge and links to the original and its Perma.cc archive. For example: Tufte CSS · Dave Liepmann (2014) · T3-practitioner · original
Code blocks use Shiki in CSS-variables mode. The colors map to the same Warm Tol palette as the callouts, so code and prose share one visual language:
// Estimate the warm-cache cost of a CLAUDE.md pass.
// estimateTokens is assumed to be defined elsewhere in the project.
export function budgetFor(path: string, maxTokens = 40_000): number {
  const text = Deno.readTextFileSync(path);
  const tokens = estimateTokens(text);
  if (tokens > maxTokens) {
    throw new Error(`${path} exceeds budget: ${tokens}/${maxTokens}`);
  }
  return tokens;
}
Tables are used for tool comparisons in the Operation section, as shorthand for “here’s how each tool does this”:
| Pattern | Claude Code | Gemini CLI | Codex CLI |
|---|---|---|---|
| Briefing document | CLAUDE.md | GEMINI.md | AGENTS.md |
| Compaction | /compact | /compress | /compact |
| Planning | Plan mode | Plan mode (v0.8+) | Dry-run |
Evolution
The book itself evolves over time. Three mechanisms make that visible:
Freshness stamps. The chapter header shows Last verified 2026-04-17. When a reader sees an older date on a volatile claim, they know to double-check against current tool docs before relying on it.
Version branches. Every published version stays live at its own URL (/v1.0/, /v1.1/). Readers can browse history; a version selector in the header lets them switch. Practices documented in v1.0 don’t disappear when v1.1 ships — they become historical reference.
The convergence dashboard. A dedicated page reads from changelog/tools/*.yaml + changelog/patterns.yaml and renders a timeline: when did each tool adopt each pattern? Patterns that all three have landed become Convergence boxes in chapters. See the Divergence box above for an open example: context window size has diverged sharply. When or if it converges is itself evolving data.
How to read stale chapters
The volatility badge in the header signals how likely a chapter is to date:
- stable-principle: rarely changes. Read anytime.
- architectural-pattern: revises on major tool versions. Glance at the last-verified date.
- feature-surface: changes with minor versions. Check the date, and cross-reference against current tool release notes.
Meta-status
This chapter exists as the Stage 0 demo — a working instance of every component through the full routing pipeline. It will be replaced (or retitled and demoted to an appendix) when Stage 1 ports the first real chapter, Context as Currency.
If you’re reading this in a deployed version of the scaffold, the scaffold is working. The scaffold is separable from any one book — see the plan file at ~/.claude/plans/i-believe-this-project-generic-sphinx.md for the staged roadmap.
The agent mental model
What every CLI-agent actually is — an agent loop with three durable properties and four engineering principles that apply regardless of which tool you use. The foundation the rest of the book builds on.
You have just installed a CLI agent: Claude Code, Gemini CLI, or Codex CLI. Before your first prompt, take five minutes to understand what you are working with. It is not a chatbot and not an autocomplete engine, but an agent with a specific architecture, specific constraints, and a predictable failure profile. The mental model you build in this chapter will inform every decision in the rest of the book.
Representation
Every modern CLI-agent — Claude Code, Gemini CLI, Codex CLI — implements the same underlying structure. Naming varies; the shape does not.
Three system properties are universal across the category:
Context window. Everything the agent knows about your task — your prompt, files it has read, tool results, conversation history — occupies a finite token window. When the window fills, older information fades. This is not infinite memory; it is a working budget. The implications run deep enough to warrant their own chapter (Ch 2 Context as Currency).
Tool use. The agent does not type code into a file. It calls tools: Read, Edit, Write, Bash, Glob, Grep, and a varying set of specialized helpers. Each tool call costs context (the request + the result) and produces observable effects. Understanding this matters because every operation has a token price, and the agent’s strategy is shaped by that price.
Configuration layers. The agent’s behavior is not a single rule set — it is a stack. A briefing doc at the project root (CLAUDE.md / GEMINI.md / AGENTS.md), optional scoped rules, tool-level permissions, and in some tools an output-style layer. These layers are not suggestions — they are the operating system through which the agent interprets your project.
Four engineering principles that apply regardless
The principles below are older than agentic coding. They apply to code written by humans, code written by AI, and code written by both together. They matter more with AI in the loop because AI amplifies whatever patterns it finds in your codebase — good patterns propagate at the same rate as bad ones.
Never fail silently. Every error must be explicitly reported with recovery options. Silent failures are the most expensive kind of bug: they produce incorrect results that look correct, propagate through downstream systems undetected, and surface only when the damage is difficult to reverse. In AI-assisted development, vague instructions like “handle errors gracefully” invite the agent to catch and suppress. Specific instructions like “report errors with the message, your analysis of the root cause, and 2–3 options for resolution” make every failure visible and actionable.
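As a minimal sketch of the difference, the two functions below parse the same config value. The names `load_ratio_silent` and `load_ratio_loud` and the wording of the error message are illustrative assumptions, not taken from any particular codebase.

```python
def load_ratio_silent(raw: str) -> float:
    """Anti-pattern: suppress the error and return a plausible-looking value."""
    try:
        return float(raw)
    except ValueError:
        return 0.0  # Looks like a valid ratio; the failure is now invisible.


def load_ratio_loud(raw: str) -> float:
    """Report the failure with a message, a likely cause, and options."""
    try:
        return float(raw)
    except ValueError as exc:
        raise ValueError(
            f"Could not parse ratio from {raw!r}. "
            "Likely cause: a non-numeric value in the config. "
            "Options: (1) fix the value, (2) remove the key to use the default, "
            "(3) pass the ratio explicitly."
        ) from exc
```

The silent version hands downstream code a `0.0` that cannot be told apart from a real zero. The loud version stops the caller and names the next steps.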
Simplicity over complexity. Short functions, flat structure, self-documenting names. A 20-line function with a clear name is superior to a 5-line function with three levels of abstraction. Simple code is easier for the agent to reason about, test, and modify correctly. Each layer of indirection is another opportunity for misunderstanding. When the agent reads your codebase to learn conventions, simple patterns propagate accurately; complex patterns propagate errors.
Immutability by default. Return new data structures; mark mutations explicitly. Pure functions are easier to test, easier to parallelize, and easier for the agent to reason about. When mutation is necessary (performance, I/O, state), make it explicit — name the function with a verb that signals the mutation, use mutable types deliberately, and document the side effects. The reader (human or agent) should never be surprised by what a function modifies.
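One way to read this principle in code is a pair of functions: a pure default and an explicitly named mutating variant. This is a sketch, and the function names and the cents-and-percent representation are assumptions chosen for illustration.

```python
def with_discount(prices_cents: list[int], percent: int) -> list[int]:
    """Pure: builds and returns a new list; the input is left untouched."""
    return [p * (100 - percent) // 100 for p in prices_cents]


def apply_discount_in_place(prices_cents: list[int], percent: int) -> None:
    """Mutates `prices_cents`; the name and the None return signal the side effect."""
    for i, p in enumerate(prices_cents):
        prices_cents[i] = p * (100 - percent) // 100


original = [10_000, 5_000]
discounted = with_discount(original, 10)  # `original` is unchanged here
```

The pure version can be tested with a single equality check. The mutating version needs the caller to know that its argument changes, which is why the name says so.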
Fail fast with diagnostics. Stop immediately on problems with full context: what failed, what was expected, what was received, and what the caller can do about it.
import pandas as pd


def process_data(df: pd.DataFrame, min_rows: int) -> pd.DataFrame:
    # Fail fast: stop before any processing if the input is too small.
    if len(df) < min_rows:
        raise ValueError(
            f"Need {min_rows} rows, got {len(df)}. "
            "Check data source or reduce min_rows parameter."
        )
    # ...
The error message includes what went wrong, where to look, and what to do about it. This is the difference between a ten-second fix and a thirty-minute investigation.
Operation
The three CLI-agents this book covers all run the agent loop. The surface area differs. The table maps each core primitive to its tool-specific form.
| Primitive | Claude Code | Gemini CLI | Codex CLI |
|---|---|---|---|
| Project briefing doc | CLAUDE.md | GEMINI.md | AGENTS.md |
| File read | Read tool | read_file | native read |
| File edit | Edit / Write | edit_file / write_file | native edit |
| Shell execution | Bash tool | run_shell_command | native exec |
| Search | Grep / Glob | search_file_content | native grep |
| Plan-mode entry | Shift+Tab | /plan | approval modes (--suggest, -a on-request) |
| Configuration scope | layered (global / project / local / enterprise / user) | global + project (GEMINI.md) | global + project (~/.codex/config.toml) |
Evolution
The agent-loop abstraction has converged across the CLI-agent category faster than almost any other pattern in agentic coding. What remains contested is the shape of the tool surface, the depth of configuration, and which workflow primitives graduate into first-class commands.
Convergence: engineering principles are pre-agentic. The four principles above — never fail silently, simplicity, immutability, fail fast — are not changing. They were best practice before AI-generated code existed; they remain best practice after. What changed is the amplification factor. A silent-failure pattern in a 10,000-line codebase now propagates into every new file the agent writes. The principles themselves are stable.
Emerging: plan-mode as a primitive. All three tools have shipped (or committed to) an explicit planning phase where the agent reads and proposes without writing. Claude Code was the first to make it a first-class mode; Gemini shipped an explicit /plan command later. Codex approximates the same behavior through its approval-mode flow (--suggest / -a on-request): the agent proposes each action and waits for operator approval before executing. The framing differs, but it is functionally equivalent for review-before-edit workflows. Expect full first-class convergence within a year.
Emerging: delegation / subagents. Spawning a child agent with its own context for a bounded sub-task exists in Claude Code today and is signalled-but-not-shipped in Gemini. Codex has not yet signalled it. The pattern is a natural fit once the agent-loop model is internalized, but the engineering to do it safely (context isolation, result summarization, permission propagation) is non-trivial. Expect partial convergence in 2026, full convergence in 2027.
Quick reference
- The agent loop is the foundational primitive — prompt, reason, tool, observe, decide, repeat.
- Context window is finite. Tool use has a token cost. Configuration layers shape interpretation.
- Four engineering principles (fail loudly, simplicity, immutability, fail fast) predate agentic coding and matter more with AI in the loop.
- The codebase is the curriculum — AI amplifies whatever patterns are already there.
- Practices written to the loop-shape port across tools; practices written to a specific tool’s command names do not.
- Plan-mode and subagents are emerging convergences; expect full parity within 12–18 months.
Context as Currency
Context is a finite, decaying resource. This chapter explains why context degrades non-linearly, gives you a vocabulary for managing it across three CLI agents, and tracks where practices have converged and where they diverge.
Three hours into a session, the agent starts repeating itself. It forgets a rule you stated twenty minutes ago. It suggests an approach you already rejected. The code quality has noticeably dropped since the session started.
Nothing about the model has changed. It is the same model, with the same weights, now failing on the same codebase. The only thing that changed is the shape of the conversation. This chapter is about why.
Representation
Context — the sequence of tokens an agent has in view when it answers — is the single most consequential variable in an agentic coding session. It is also the most commonly mismanaged. Practitioners think of context as a workspace, something that holds everything needed for the job. The better mental model is a budget: finite, decaying, costly, and competitive. Every token in context is competing for the model’s attention; every additional token of noise dilutes signal on the tokens that actually matter.
The mechanism is attention. Transformer-based models distribute a fixed attention budget across all tokens in the window. Content near the start and end of context is recalled more reliably than details buried in the middle — this is not a cliff at the ceiling but a gradient that starts early. Critical instructions (your project brief, the current task spec) compete with accumulated tool outputs, failed attempts, and earlier sub-tasks. When noise dominates signal, the model responds from the noise.
A useful quantitative signal: practitioners observe a ~60–70% window-fill threshold where quality starts to drop noticeably. (Sidenote: this is a practitioner heuristic calibrated on 200K windows; it is not a hard system cutoff. On 1M-token windows, the same percentage represents five times more tokens. Absolute token count, not fill percentage, is what drives quality loss.) The threshold is softer on larger windows in percentage terms but firmer in absolute terms: 600K tokens of accumulated noise in a 1M window is qualitatively worse than 120K of noise in a 200K window, regardless of the percentage.
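The arithmetic behind the percentage-versus-absolute comparison can be sketched in a few lines. The helper name `threshold_tokens` is hypothetical; the window sizes and the 60–70% range are the figures quoted above.

```python
def threshold_tokens(window_tokens: int, fill_percent: int) -> int:
    """Absolute token count at a given window-fill percentage."""
    return window_tokens * fill_percent // 100


for window in (200_000, 1_000_000):
    low = threshold_tokens(window, 60)
    high = threshold_tokens(window, 70)
    print(f"{window:,} window: {low:,} to {high:,} tokens")
```

The same 60% mark is 120,000 tokens on a 200K window and 600,000 on a 1M window, which is why a fill percentage on its own says little about how much noise the model is carrying.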
Three forces make context decay non-linear:
Noise accumulates faster than signal. A single file read adds ~2,000 tokens. A failed command + retry + correction adds ~2,500 tokens of failure patterns. A debugging session that loops a few times can fill 30% of a window with content that actively misleads the model. Noise grows polynomially against progress.
Some context is compaction-resistant. Extended thinking blocks (the internal reasoning traces some models emit) are immutable after generation, so summarizers cannot touch them. A long stretch of deep reasoning may stay trapped in the window until you clear it entirely.
Attention decay is non-uniform. Tokens in the “middle third” of a long context are forgotten first. A rule stated early in session, then buried under file reads and tool outputs, is functionally absent even though it is technically still there.
These three forces produce the late-session degradation everyone notices but few proactively prevent.
Operation
Every CLI agent ships tools to manage context explicitly. The vocabulary differs; the primitives are the same: observe, compact, clear, persist.
The common primitives
Across Claude Code, Gemini CLI, and Codex CLI, four primitives recur:
- Observation — show the current window fill so you can decide.
- Compaction — summarize the conversation in place, preserving decisions, discarding noise.
- Clear / reset — discard the conversation entirely and start fresh.
- Persistence — a top-level briefing doc (CLAUDE.md / GEMINI.md / AGENTS.md) that survives compaction and is re-injected on every turn.
The table maps each tool’s surface to these primitives:
| Primitive | Claude Code | Gemini CLI | Codex CLI |
|---|---|---|---|
| Observe fill | /context | settings show; /context (proposed) | status in prompt |
| Compact | /compact (+ <focus> arg) | /compress | /compact |
| Clear | /clear | /chat clear | new session |
| Persist | CLAUDE.md | GEMINI.md | AGENTS.md |
| Reload persistence | auto on compact | /memory refresh | auto on launch |
Core protocols (tool-agnostic)
Some patterns apply regardless of which agent you use. They are consequences of how context works, not features of any tool.
The two-failure rule. After two failed corrections in a row, clear the context and try again with a better initial prompt. The second failure means the context now contains ~2,500 tokens of failure patterns: original error, first correction, apology, retry, second correction, second apology. Persisting into a third round teaches the model to repeat the failures. A better opening prompt informed by what went wrong costs ~200 tokens and produces a better result.
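The rule is simple enough to express as a counter. The sketch below is one hypothetical way to track it; the class name, the returned strings, and the constants are assumptions, with the token figures being the rough estimates from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class CorrectionTracker:
    """Counts consecutive failed corrections on the same issue."""

    consecutive_failures: int = 0

    def record(self, succeeded: bool) -> str:
        """Record one correction attempt and return the suggested next step."""
        if succeeded:
            self.consecutive_failures = 0
            return "continue"
        self.consecutive_failures += 1
        if self.consecutive_failures >= 2:
            return "clear and re-prompt"
        return "refine"


# Rough figures from the text: failure residue vs. a fresh opening prompt.
FAILURE_CONTEXT_TOKENS = 2_500
FRESH_PROMPT_TOKENS = 200
ratio = FAILURE_CONTEXT_TOKENS / FRESH_PROMPT_TOKENS  # 12.5
```

On these estimates the failure residue is about 12.5 times the size of a fresh prompt, which is the arithmetic that favors restarting.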
The compaction protocol. Compaction is not a button you press when the window feels full — it is a three-step procedure.
Before: write down what matters in two or three sentences — the task, the decisions made, what remains. This becomes your compaction focus. Commit or save anything that is “done enough” to disk; compaction cannot lose what is already persisted.
During: be specific about what to preserve and what to discard. A bare compact-without-focus lets the model decide, and its priorities may not match yours.
After: verify by asking a recall question about a decision from earlier in the session. If the model cannot answer, inject the missing context from your notes or briefing doc. Do not trust compaction silently.
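The "before" step can be made concrete by writing the notes in a fixed shape. The helper below is a hypothetical sketch: the function name and the output format are assumptions, and how the resulting text is passed to a tool's compaction command depends on that tool.

```python
def build_compaction_focus(
    task: str, decisions: list[str], remaining: list[str]
) -> str:
    """Render pre-compaction notes as a short focus string."""
    lines = [f"Task: {task}", "Preserve these decisions:"]
    lines += [f"- {d}" for d in decisions]
    lines.append("Still to do:")
    lines += [f"- {r}" for r in remaining]
    lines.append("Discard: failed attempts and raw tool output.")
    return "\n".join(lines)


focus = build_compaction_focus(
    task="Fix session expiry",
    decisions=["Use UTC timestamps"],
    remaining=["Add regression test"],
)
```

The same notes double as the source for the recall question in the "after" step: ask about one of the listed decisions and compare the answer to what was written down.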
Durable artifacts: the primary persistence strategy
Compaction is a supplement to persistence, not a replacement. Anything you want to survive a session boundary must be on disk. Three artifacts carry most of the weight:
The briefing doc (CLAUDE.md / GEMINI.md / AGENTS.md) holds project rules, architecture, conventions. It is re-injected on every turn, so every token in it has leverage. Anthropic recommends individual files under 200 lines, combined under ~500 lines; the same bound applies across the three tools by analogy — the file is a context tax you pay on every single prompt, so every line must earn its place.
CURRENT_WORK.md holds transient state: what you are working on right now, what changed, what’s next. Written at the start and end of every session. A two-minute investment that saves ten minutes of re-discovery on return.
Git commits are the most durable artifact. Commit early, commit often. Compaction cannot summarize away what is already in the history.
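The line bounds mentioned for the briefing doc can be checked mechanically. This is a minimal sketch that takes file contents as strings; the function name, the default limits (taken from the figures above and treated as inclusive), and the warning format are assumptions.

```python
def check_briefing_budget(
    files: dict[str, str],
    max_lines_per_file: int = 200,
    max_lines_total: int = 500,
) -> list[str]:
    """Return human-readable warnings; an empty list means within budget."""
    warnings = []
    total = 0
    for name, text in files.items():
        count = len(text.splitlines())
        total += count
        if count > max_lines_per_file:
            warnings.append(f"{name}: {count} lines exceeds {max_lines_per_file}")
    if total > max_lines_total:
        warnings.append(f"combined: {total} lines exceeds {max_lines_total}")
    return warnings
```

A check like this could run in a pre-commit hook or CI step so that the briefing doc does not grow unnoticed.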
Evolution
Context management is the most convergent area of agentic-coding practice — and simultaneously the area with the starkest active divergence. Tracking where those lines fall is most of what this section does.
Convergence: the briefing-document pattern. Claude Code established CLAUDE.md as a project-root convention; Gemini CLI followed with GEMINI.md, and Codex CLI adopted AGENTS.md. All three work the same way: markdown in the project root, re-injected on every turn, hierarchical loading from global → project → sub-directory. This is no longer a contested design choice — it is the standard shape.
Convergence: compaction as a primitive. Claude’s /compact, Gemini’s /compress, and Codex’s /compact all solve the same problem with broadly the same technique (summarize the history, replace it with the summary, keep the briefing doc intact). Auto-compaction thresholds vary but the mechanism converges.
Divergence: compaction implementation strategy. Three distinct approaches are in play. Claude performs two-phase compaction: it clears stale tool outputs first, only summarizing conversation if the first pass is insufficient. Gemini has shipped a union-find clustering alternative that resolves summaries asynchronously off the blocking path. Codex does a more direct summarize-and-replace. All three land at “shorter history, same briefing doc,” but the fidelity curves differ. Quality comparisons across tools are sensitive to which strategy applies.
Emerging: horizontal scaling. The pattern of running many parallel short sessions — rather than one long session — originated in Claude Code’s community and is spreading. The principle is general (conversations degrade over time, so keep them short and bridge with artifacts), but the tooling is still Claude-first: claude --worktree for git isolation, --continue / --resume / --from-pr for session management. Gemini and Codex have the primitives in pieces but not yet as a coherent workflow. Expect convergence here within 12–18 months; in the meantime, the pattern is portable if you’re willing to manage the plumbing yourself.
Emerging: subagent delegation. Spawning a child agent with its own context to handle a bounded sub-task is a Claude Code feature today. It is a natural next step for agent design — the research context does not pollute the parent session. Gemini has announced direction on this; Codex has not. Pattern not yet convergent.
Quick reference
- Context degrades non-linearly — manage it actively; do not wait for a hard limit.
- Observe before acting. Every CLI has a fill indicator; use it.
- Two-failure rule: after two corrections, clear and restart with a better prompt.
- Compaction is a three-step protocol (write down what matters → compact with focus → verify recall), not a reflex.
- Briefing document is the only context that survives every boundary. Budget it aggressively.
- CURRENT_WORK.md + frequent commits give you cheap session continuity without relying on compaction.
- Horizontal scaling (many short sessions + artifacts) beats deep-context for most work.
- Window size is diverging across tools; window effectiveness is more convergent. Write for effectiveness, not ceiling.
Prompting as specification
Prompts are specifications — the input side of a stateful loop. Five levers shape how the agent interprets a prompt — precision, scope, structure, depth, and cost. This chapter treats prompting as an engineering activity, not a conversational art.
Your prompts work, but they’re inconsistent. Sometimes the agent nails it on the first try; sometimes you spend three rounds correcting misunderstandings. The gap is not randomness — it is precision. A prompt is the specification half of a contract; the agent’s output is the implementation half. Vague specs produce vague implementations. The rest follows.
Representation
A prompt is a specification, not a request. It declares the task, the constraints, the acceptable outputs, and the verification criteria — the same shape a well-written function signature has. When this framing clicks, most prompting problems dissolve: the fix is never “try a different phrasing” but “say what you actually want.”
Precision: the vocabulary problem
Natural language is ambiguous on purpose. “Clean this data” could mean drop nulls, impute, validate schema, deduplicate, standardize types, or all five. The agent picks one interpretation and runs with it; three rounds of correction later, you converge on what you actually meant.
The fix is a shared precision vocabulary — a small, stable set of verbs that mean exactly one thing in your project.
| Natural language | Precise specification |
|---|---|
| “Clean this data” | validate schema → impute nulls with median → drop rows where target is NaN |
| “Train a model” | fit XGBoost on train split → evaluate AUC on val → log params to MLflow |
| “Check if this works” | run pytest → check no data leakage → verify feature distributions match prod |
| “Add error handling” | add validation to build_features() → raise ValueError on schema mismatch → log and skip rows with >50% NaN |
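A precision vocabulary like the one in the table can also be kept as data, so that a verb always expands to the same steps. The sketch below is illustrative: the dictionary name, the `expand` helper, and the choice of keys are assumptions, while the step wording comes from the table.

```python
PRECISION_VOCABULARY: dict[str, list[str]] = {
    "clean": [
        "validate schema",
        "impute nulls with median",
        "drop rows where target is NaN",
    ],
    "train": [
        "fit XGBoost on train split",
        "evaluate AUC on val",
        "log params to MLflow",
    ],
}


def expand(verb: str) -> str:
    """Expand a project verb into its precise steps, or fail loudly if undefined."""
    try:
        steps = PRECISION_VOCABULARY[verb]
    except KeyError:
        known = ", ".join(sorted(PRECISION_VOCABULARY))
        raise KeyError(f"No definition for {verb!r}. Known verbs: {known}") from None
    return " → ".join(steps)
```

An undefined verb raises instead of guessing, which mirrors what the vocabulary is for: the agent should never have to pick an interpretation.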
Scope: the boundary problem
Agents are helpful by default — they read broadly, notice related issues, and fix them. This is useful until it’s scope creep: the agent modifies a file you weren’t working on, refactors a function you didn’t mention, or “improves” code that was intentionally written that way.
Scope is a spectrum:
- Too restrictive: “Only modify line 42 of auth.py.” The agent can’t fix related issues the change requires.
- Too permissive: “Fix the auth system.” The agent may rewrite half the codebase.
- Right scope: “Fix the session expiry bug in src/auth/session.py. You may also modify src/auth/middleware.py if the fix requires it. Don’t touch other files without asking.”
Scope creep is a prompting gap, not a model deficiency. The fix is always in the prompt — name the files, separate discovery from action, or use plan mode. Never hope the agent will guess your boundaries.
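Scope can also be verified after the fact. The sketch below compares the files an agent changed against the files the prompt allowed; the function name is hypothetical, and the example paths reuse the ones from the scope examples above plus one invented out-of-scope file.

```python
def out_of_scope(changed: list[str], allowed: set[str]) -> list[str]:
    """Return the changed paths that the prompt did not authorize, sorted."""
    return sorted(path for path in changed if path not in allowed)


allowed = {"src/auth/session.py", "src/auth/middleware.py"}
changed = ["src/auth/session.py", "src/utils/dates.py"]  # second path is invented
violations = out_of_scope(changed, allowed)
```

In practice the `changed` list could come from `git diff --name-only`; a non-empty result is a signal to review before committing.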
Structure: the organization problem
For complex prompts, plain prose mixes concerns. Task, context, constraints, and verification run together and the agent has to infer structure. A lightweight markup — XML tags, numbered sections, or just clear paragraph separators — gives the agent that structure for free.
<context>
The churn model uses features from src/features/customer.py.
Current AUC is 0.72 on validation.
</context>
<task>
Add recency features: days since last purchase,
days since last login. Compute from the events table.
</task>
<constraints>
- No data leakage: features must use only pre-churn data.
- Must work with the existing FeatureStore interface.
- Include unit tests for the new features.
</constraints>
<verification>
Run: pytest tests/features/ -v
All existing tests must still pass.
Verify: no future-dated features in test set.
</verification>
The XML tags are not magic — they’re a convention the agent’s training reinforced. Any consistent structure works: numbered sections, Markdown headings, a table. The point is to make the shape of the specification explicit so the agent can parse it mechanically rather than guess.
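If prompts are assembled programmatically, the same structure can be generated rather than typed. This is a minimal sketch: `render_prompt` is a hypothetical helper, and the section bodies are shortened from the example above.

```python
def render_prompt(sections: dict[str, str]) -> str:
    """Wrap each section body in a matching pair of XML-style tags."""
    blocks = []
    for tag, body in sections.items():
        blocks.append(f"<{tag}>\n{body.strip()}\n</{tag}>")
    return "\n".join(blocks)


prompt = render_prompt({
    "context": "Current AUC is 0.72 on validation.",
    "task": "Add recency features: days since last purchase, days since last login.",
    "constraints": "- No data leakage.\n- Include unit tests for the new features.",
    "verification": "Run: pytest tests/features/ -v",
})
```

Because dictionaries preserve insertion order, the sections come out in the order they were declared, so the template fixes the shape of every prompt built from it.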
Depth: the reasoning problem
Complex tasks benefit from the agent thinking before producing output. Trivial tasks do not; extra thinking burns context and latency for no gain. Each agent exposes this differently:
- Claude Code: low/medium/high/max effort levels; the ultrathink keyword escalates one turn.
- Gemini CLI: thinking is adaptive by default with less granular user override.
- Codex CLI: reasoning depth tied to model selection; less in-session control.
The principle is stable: match reasoning depth to problem complexity. Debugging a subtle concurrency issue needs deep thinking; renaming a variable does not. The controls vary; the judgment doesn’t.
Cost: the economics problem
Every prompt has a token price. The levers that reduce it:
- Prompt caching: content repeated across calls hits a cache at a substantial discount.
- Batch APIs: bulk operations run asynchronously at a discount.
- Model selection: use a smaller model for simpler tasks.
The biggest wins come from workflow design, not from squeezing individual prompts: a stable briefing doc that hits the cache on every turn saves more than any single-prompt optimization.
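A back-of-the-envelope calculation shows why a stable briefing doc dominates. Every number in this sketch is an assumption for illustration (a 4,000-token briefing doc, a 200-token prompt, a 90% cache discount), and it ignores any extra cost of writing to the cache; real prices and discounts vary by provider.

```python
def billed_input_tokens(
    cached_tokens: int, fresh_tokens: int, discount_percent: int
) -> int:
    """Input tokens in full-price equivalents when the cached part is discounted."""
    discounted = cached_tokens * (100 - discount_percent) // 100
    return discounted + fresh_tokens


# Assumed figures; see the lead-in above.
without_cache = billed_input_tokens(0, 4_000 + 200, 0)
with_cache = billed_input_tokens(4_000, 200, 90)
```

Under these assumptions each turn is billed as 600 token-equivalents instead of 4,200, and the saving repeats on every turn where the briefing doc is unchanged.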
Operation
The three CLI-agents expose the prompting levers with different surface area. The table maps what’s broadly available; check your tool’s current docs for the exact command.
| Lever | Claude Code | Gemini CLI | Codex CLI |
|---|---|---|---|
| File reference | @path/to/file | @path/to/file | @path/to/file |
| Directory reference | @dir/ | @dir/ | @dir/ |
| Stdin pipe | cat x \| claude -p '...' | gemini -p "..." < x | codex exec "..." < x |
| Image input | drag, paste, or @screenshot.png | drag or @screenshot.png | @screenshot.png |
| Plan / dry-run | Shift+Tab plan mode | /plan (v0.8+) | --dry-run |
| Reasoning depth | /effort {low\|medium\|high\|max} | adaptive; limited override | via model choice |
| Prompt caching | automatic + explicit breakpoints | automatic | automatic |
| Batch API | Anthropic Batch API (50% off) | Vertex AI Batch | OpenAI Batch API (50% off) |
Evolution
Prompting is the most convergent surface of agentic coding — the core vocabulary (precision, scope, verification, structure, depth) has been stable across tools since 2025. Active divergence lives in the reasoning-depth controls and the cost-optimization surface.
Convergence: the five-part prompt structure. Context → task → constraints → files in scope → verification. This emerged as a practitioner pattern in 2024, was reinforced by Anthropic’s official prompting guides in 2025, and has since been adopted by Gemini and OpenAI documentation. It’s no longer a contested choice.
Convergence: prompt caching as implicit optimization. All three tools cache repeated content automatically. The discounts vary (~90% for Claude, ~75% for Gemini, ~50% for Codex as of 2026), but the pattern — stable briefing doc, sticky tool definitions, repeated references — is the same.
Emerging: structured output modes. All three tools are shipping JSON-mode or schema-constrained output. Claude Code’s --output-format json landed in 2025; Gemini’s responseSchema parameter has been in the SDK since late 2025 and is reaching the CLI; Codex’s structured-output support is partial. Practices that pipe agent output into downstream automation should expect this to be fully converged within a year.
Quick reference
- A prompt is a specification, not a conversation. Treat it like a function signature — declare inputs, constraints, and verification.
- Five levers: precision, scope, structure, depth, cost. Debug a failed prompt by naming which lever was wrong.
- Precision vocabulary compounds. Stabilize it in your briefing doc; every future session benefits.
- Scope is always your responsibility. “Only modify files I explicitly name” is a reasonable default.
- Structured prompts (five-part shape) reduce correction rounds substantially. The boredom is the point.
- Plan mode (Shift+Tab / /plan / --dry-run) separates discovery from action for unfamiliar code: cheap, universally supported.
- Reasoning depth should match problem complexity. The controls vary; the judgment doesn’t.
- Caching wins come from workflow stability, not from single-prompt optimization. A stable briefing doc beats a clever phrasing.
The session loop
The atomic unit of agentic work is the session loop — prompt, observe, refine, commit. Each phase has a purpose; skipping any produces a specific failure mode. This chapter makes the rhythm explicit.
Your briefing doc is in place, your precision vocabulary is building, your scope instincts are right. You open the agent and… where do you start? How do you know when a session is going well versus slowly going off the rails? When do you course-correct, and when do you start fresh? This chapter is about the rhythm of a working session — the four-phase loop that turns a single prompt into durable output.
Representation
An agentic coding session is not a conversation. It is a repeating four-phase loop. Each phase does specific work; skipping any produces a specific and predictable failure mode. The loop is the same across all three CLI-agents this book covers.
The phases are not ceremonial — each has a role:
Prompt is where you spend your precision budget. The specification levers from Ch 3 — precision vocabulary, scope, structure, depth, verification — all live here. A sloppy prompt guarantees sloppy output; a well-structured prompt usually gets what you wanted on the first pass.
Observe is where you resist the temptation to skim. The agent produced a diff; your job is to look at it, not nod at it. Run the tests the prompt promised to run. Check that only the files you allowed got modified. If something feels off even slightly, name it. Skimping on the observe phase, by trusting output you have not actually checked, is the single most common failure mode in the loop.
Refine is conditional. If the result is close but wrong in a specific way, targeted feedback converges fast. If it’s wrong in ways that suggest the agent didn’t understand the task, stop: the context is now polluted with a failed attempt, and a third round makes it worse. Start over with a better prompt.
Commit is the durability layer. A verified change belongs on disk before the context shifts. Compaction cannot erase what’s already committed. “One logical change per commit” matters because your future self (and the agent reading your history) needs to be able to bisect.
How the loop relates to context
Each phase has a context cost, and the phases compound. A prompt is ~200 tokens; a file read is ~1,000–3,000; a tool-and-observe cycle is ~500–1,000 per iteration. A debugging session that loops a few times can fill 30% of a 200K window with accumulated noise — most of it encoding failure patterns rather than progress. The session loop is the mechanism that keeps context expenditure proportional to real work.
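The compounding is easier to see with numbers. The following is a back-of-the-envelope sketch built only from the per-phase ranges quoted above; the function names and the example counts (5 prompts, 20 file reads, 30 tool cycles) are illustrative assumptions, not measurements from any tool.

```python
# Rough context-cost estimator using the per-phase figures quoted in the text.
WINDOW_TOKENS = 200_000
PROMPT_TOKENS = 200                 # ~200 tokens per prompt
FILE_READ_TOKENS = (1_000, 3_000)   # (low, high) per file read
CYCLE_TOKENS = (500, 1_000)         # (low, high) per tool-and-observe cycle


def session_tokens(prompts: int, file_reads: int, cycles: int) -> tuple[int, int]:
    """Return a (low, high) estimate of tokens consumed by one session."""
    fixed = prompts * PROMPT_TOKENS
    low = fixed + file_reads * FILE_READ_TOKENS[0] + cycles * CYCLE_TOKENS[0]
    high = fixed + file_reads * FILE_READ_TOKENS[1] + cycles * CYCLE_TOKENS[1]
    return low, high


def window_share(tokens: int, window: int = WINDOW_TOKENS) -> float:
    """Fraction of the context window that `tokens` occupies."""
    return tokens / window


# Hypothetical debugging session: 5 prompts, 20 file reads, 30 tool cycles.
low, high = session_tokens(prompts=5, file_reads=20, cycles=30)
print(f"estimated usage: {low:,} to {high:,} tokens")
```

Under these assumed counts the estimate runs from 36,000 to 91,000 tokens, so the 30% mark of a 200K window (60,000 tokens) sits inside the range. File reads and tool cycles dominate; the prompts themselves are almost free.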
The corollary is the two-failure rule — after two failed corrections on the same issue, clear context and restart with a better prompt. The arithmetic (covered in Ch 2) favors restart over a third correction by roughly an order of magnitude.
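If you want the rule to be mechanical rather than a judgment call made mid-frustration, it reduces to a counter per issue. A minimal sketch, assuming you label issues yourself; `CorrectionTracker` is a hypothetical helper, not a feature of any CLI.

```python
from collections import Counter

MAX_FAILED_CORRECTIONS = 2  # the two-failure rule


class CorrectionTracker:
    """Counts failed corrections per issue and says when to restart."""

    def __init__(self, limit: int = MAX_FAILED_CORRECTIONS) -> None:
        self.limit = limit
        self.failures = Counter()

    def record_failure(self, issue: str) -> None:
        """Note one more correction that did not fix `issue`."""
        self.failures[issue] += 1

    def should_restart(self, issue: str) -> bool:
        """True once `issue` has absorbed `limit` failed corrections."""
        return self.failures[issue] >= self.limit


tracker = CorrectionTracker()
tracker.record_failure("mode fit on test batch")
tracker.record_failure("mode fit on test batch")
if tracker.should_restart("mode fit on test batch"):
    print("Clear context and restart with a better prompt.")
```

The counter is keyed by issue on purpose: two failures on different problems are normal refinement, while two failures on the same problem are the signal.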
The plan-mode extension
For unfamiliar codebases or complex tasks, the canonical four-phase loop extends to five phases: plan → prompt → observe → refine → commit. Plan mode is a read-only phase where the agent analyzes and proposes without writing. You evaluate the plan, adjust scope, then authorize implementation. All three CLIs support this (names vary — see Operation below) and it is one of the highest-leverage practices in the agentic toolkit.
Operation
The session loop is tool-agnostic, but each CLI-agent exposes the phase transitions differently. The table maps the important verbs:
| Action | Claude Code | Gemini CLI | Codex CLI |
|---|---|---|---|
| Stop agent mid-action | Esc | Esc | Ctrl+C |
| Open rewind menu | Esc+Esc or /rewind | /undo (partial) | manual via git |
| Discard conversation | /clear | /chat clear | new session |
| Compact in place | /compact (+ focus) | /compress | /compact |
| Enter plan mode | Shift+Tab → plan | /plan | approval modes (--suggest, -a on-request) |
| Accept-edits auto mode | Shift+Tab → acceptEdits | prompt per-tool | per-command approval |
| Resume last session | claude --continue | gemini --continue | codex resume |
| Pick from prior | claude --resume | (history in ~/.gemini/) | codex history |
| Start with PR context | claude --from-pr 123 | manual @ to PR file | manual |
A complete mini-session
A realistic loop in practice:
You:   Add missing-value imputation to the preprocessing pipeline.
       Strategy: median for numeric columns, mode for categorical.
       Write tests first. Verify: pytest tests/preprocessing/ passes.
Agent: [writes tests, then implementation, runs tests: 3 pass, 1 fails]
       "Test test_categorical_imputation is failing because the mode
       computation uses the test-batch distribution rather than the
       training-set distribution."
You:   Mode should fit on train only. Fix the imputer to fit once on
       train and reuse for all subsequent calls.
Agent: [fixes implementation, reruns: all 4 pass]
       "All 4 tests pass. Ready to commit?"
You:   Commit: "feat: add missing-value imputation to preprocessing"
The structure: the initial prompt specified what, how (strategy), and verification criteria. The refinement was specific (which behavior, what to change). One logical change, one commit. Total elapsed: two loop iterations, ~5 minutes.
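For readers who want to see what the refinement asked for, here is one way the fit-once-on-train behavior could look. This is a sketch, not the code from the session: `TrainFittedImputer` is a hypothetical name, columns are plain dicts of lists with `None` marking a missing value, and only the standard library is used so the example runs anywhere.

```python
from statistics import median, mode


class TrainFittedImputer:
    """Learns fill values from the training set and reuses them everywhere."""

    def __init__(self) -> None:
        self.fill_values = None

    def fit(self, train: dict) -> "TrainFittedImputer":
        fills = {}
        for name, values in train.items():
            present = [v for v in values if v is not None]
            if not present:
                raise ValueError(f"Column {name!r} has no observed values to fit on.")
            numeric = all(
                isinstance(v, (int, float)) and not isinstance(v, bool)
                for v in present
            )
            # Median for numeric columns, mode for categorical ones.
            fills[name] = median(present) if numeric else mode(present)
        self.fill_values = fills
        return self

    def transform(self, batch: dict) -> dict:
        if self.fill_values is None:
            raise RuntimeError("Call fit() on the training set before transform().")
        # Fill values come from fit(); the batch's own distribution is never used.
        return {
            name: [self.fill_values[name] if v is None else v for v in values]
            for name, values in batch.items()
        }


train = {"age": [20, 30, None, 40], "city": ["a", "b", "a", None]}
batch = {"age": [None, 99], "city": [None, "b", "b", "b"]}
filled = TrainFittedImputer().fit(train).transform(batch)
```

In the batch above, the most common city is "b", but the missing value is filled with "a" because that is the training-set mode. That gap is exactly what the failing test in the session caught.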
Evolution
The session-loop shape has converged faster than almost any other pattern in agentic coding — the four-phase rhythm was already present in pair-programming and TDD literature before AI entered the loop. What’s still diverging is the course-correction toolkit and the multi-session orchestration surface.
Convergence: plan-first workflows. Plan mode was a Claude-first feature in 2025; Gemini shipped explicit plan mode later. Codex’s approval-mode flow (--suggest, -a on-request) achieves functional equivalence — the agent proposes each action and waits for approval before executing. Recommending “start in plan mode for unfamiliar code” is now tool-independent advice; the specific command differs.
Convergence: auto-accept modes. All three tools expose some form of “let the agent run a sequence of tool calls without per-step approval.” Claude’s acceptEdits / bypassPermissions, Gemini’s tool-level allowlists, Codex’s command-approval config all serve the same need: once trust is established for a specific kind of operation, stop gating it. The safety envelope differs; the mechanism is the same.
Emerging: horizontal scaling of sessions. Running 10–15 parallel short sessions instead of one long one is a Claude-community-first practice enabled by claude --worktree (git worktree isolation per session). Gemini and Codex have the primitives in pieces but not as a polished workflow. Expect full convergence within 12–18 months; in the meantime the pattern is portable (covered in Ch 2) even if the tooling isn’t.
When to skip phases
Not every task needs the full loop. A one-word variable rename doesn’t need a plan phase; a typo fix doesn’t need explicit verification criteria (git diff is the verification). The loop is a maximum, not a minimum. The judgment call: does this task have a plausible failure mode I’d want to catch? If yes, run the full loop. If no — just do it and move on. What you never skip is commit; unverified work left in a running session is work-at-risk.
Quick reference
- The session loop is four phases: prompt, observe, refine, commit. Plan mode extends it to five.
- Prompt spending: invest your specificity budget here; it pays back across all other phases.
- Observe is the highest-failure phase because it’s the easiest to skim. Slow down.
- Refine: max two rounds before starting fresh. The third round costs more than it saves.
- Commit verified work promptly. Context boundaries cannot erase what’s on disk.
- Course-correction primitives vary across tools; the phase structure doesn’t. Bet on the structure.
- Plan mode is tool-agnostic advice for unfamiliar or complex work.
- Session resume is mature in Claude, improving in Gemini, minimal in Codex. Adjust multi-session workflows accordingly.
The edit-test-commit loop
The defining failure mode of AI-generated code is that it *looks* correct. The edit-test-commit loop exists to catch the subtle bugs the agent cannot catch on its own. Verification is not a quality gate; it is the single highest-leverage practice in agentic coding.
The agent just produced 200 lines of code that looks correct. The syntax is clean, the variable names are reasonable, the logic reads well. How do you know it actually works? AI-generated code has a specific failure mode human-written code does not: it looks correct. The appearance of correctness is precisely why verification is essential — the bugs are subtle, not obvious.
Representation
The edit-test-commit loop is the quality-preservation layer around the session loop. Where the session loop handles “what does the agent do next?”, this loop handles “how do we know the agent’s output is correct?” The answer, overwhelmingly, is: the agent verifies its own work against criteria you specified, and you verify the criteria were adequate.
The six-layer validation architecture
Not all verification is the same. A robust project layers defenses so that no single failure mode goes undetected:
1. Type safety: static checking at compile time (mypy, TypeScript, Go’s type checker, Rust’s type system and borrow checker). The cheapest layer.
2. Input validation: preconditions at function entry. Fail fast with explicit errors (the fail-fast principle from Ch 1).
3. Unit tests: each function in isolation; happy path + error cases + edge cases.
4. Integration tests: multi-function workflows with realistic data.
5. End-to-end tests: complete user workflows from input to output.
6. Property-based tests: invariants that should always hold; generated inputs catch edge cases you didn’t think of.
Layers 1–2 are cheap enough to add to any project today. Layers 3–4 are the production baseline. Layers 5–6 are for systems where correctness is load-bearing.
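To show how little the first two layers cost, here is a small function carrying both. It is a sketch under assumptions: `rolling_mean` is a hypothetical example function, the type hints stand in for layer 1 (checked by a tool such as mypy, not at runtime), and the explicit checks at the top are layer 2.

```python
def rolling_mean(values: list[float], window: int = 30) -> list[float]:
    """Mean of each full run of `window` consecutive values."""
    # Layer 2: input validation. Fail fast with explicit, actionable errors.
    if not isinstance(window, int) or isinstance(window, bool):
        raise TypeError(f"window must be an int, got {type(window).__name__}.")
    if window < 1:
        raise ValueError(f"window must be >= 1, got {window}.")
    if len(values) < window:
        raise ValueError(
            f"Need at least {window} values, got {len(values)}. "
            "Pass more data or reduce window."
        )
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


means = rolling_mean([1.0, 2.0, 3.0, 4.0], window=2)
```

The annotations let a static checker reject `rolling_mean(data, window="7")` before anything runs; the runtime checks cover callers that bypass the checker.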
The missing seventh layer: domain correctness
The six-layer model catches structural errors. It does not catch domain-correctness failures — code that is syntactically valid, passes all tests, and produces the wrong answer.
Examples from practice that every working practitioner has seen:
- Data leakage. A feature-engineering function uses future values during training. Tests pass because the function is deterministic. The model achieves unrealistic accuracy in validation and fails in production.
- Wrong aggregation. A revenue calculation sums instead of averaging across time. Tests pass because the function produces a number. The number is wrong by an order of magnitude.
- Survivorship bias. A cohort analysis excludes deleted records. Tests pass because the query runs. The results quietly mislead every downstream decision.
- Silent unit mismatch. A function mixes daily and monthly rates. Tests pass because both are floats. The financial model is off by ~30×.
These failures share a pattern: structural tests verify that the code runs; they do not verify that it means what you intended. No amount of AI-generated unit testing catches them because the agent doesn’t know what the answer should look like — you do.
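The unit-mismatch case makes the gap concrete. In this sketch both functions are hypothetical: one computes a month of interest correctly, the other treats the monthly rate as a daily rate. A structural check accepts both; only a check that encodes the known answer separates them.

```python
DAYS_PER_MONTH = 30


def monthly_interest(principal: float, monthly_rate: float) -> float:
    """Interest for one month, given a rate already expressed per month."""
    return principal * monthly_rate


def monthly_interest_buggy(principal: float, monthly_rate: float) -> float:
    """Same intent, but wrongly treats the monthly rate as a daily rate."""
    return principal * monthly_rate * DAYS_PER_MONTH


def structural_check(fn) -> bool:
    # What a generated unit test typically asserts: it runs and returns a float.
    return isinstance(fn(1_000.0, 0.01), float)


def domain_check(fn) -> bool:
    # What only the domain owner knows: 1% per month on 1,000 is 10.
    return abs(fn(1_000.0, 0.01) - 10.0) < 1e-9


for fn in (monthly_interest, monthly_interest_buggy):
    print(fn.__name__, structural_check(fn), domain_check(fn))
```

The `domain_check` line is the part the agent cannot write for you: the expected value of 10 comes from your knowledge of the domain, not from the code.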
Phase-appropriate standards
Not all code needs the same rigor. Applying production standards to a prototype kills velocity; applying prototype standards to production kills reliability. The fix is to be explicit about which standard applies now, and when to transition.
| Phase | Testing | Code quality | Transition criteria |
|---|---|---|---|
| Exploration | Manual OK | Long functions OK | Hypothesis validated |
| Development | Unit + integration | Style enforced, type hints | Coverage >60%, code review |
| Production | Full 6-layer + domain invariants | Strict lint, immutability | Coverage >80%, zero critical warnings |
The briefing doc is where phase membership lives. When a project graduates from exploration to development, the briefing doc changes and the agent’s behavior changes with it.
Operation
The test-first workflow with an agent follows a consistent four-step pattern across all three CLIs:
- Describe the interface — inputs, outputs, error cases, invariants.
- Agent writes tests from that interface description.
- Agent writes implementation that passes the tests.
- Tests run automatically via hooks / guards / CI.
The test-first framing works because agents excel at test generation when given clear specifications. The tests then constrain the implementation, preventing the “looks correct but is subtly wrong” failure mode that plagues code-first generation.
Prompt: "Create a FeatureValidator class.
Interface:
- Takes a DataFrame and a schema dict
- Validates column types, value ranges, null counts
- Returns ValidationResult with errors list
- Raises ValueError if required columns are missing
Write tests first, then implementation.
Include domain invariants:
- Empty DataFrame raises ValueError
- NaN-heavy inputs (>50% nulls) emit a warning but don't fail
Verify: pytest tests/ passes before you return."
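A prompt like this tends to produce tests shaped like the ones below. This is a stand-in sketch, not the agent’s output: to stay dependency-free it models the DataFrame as a dict of column lists, uses `None` for nulls, covers type checks and the two domain invariants only (value ranges are omitted), and every name beyond `FeatureValidator` and `ValidationResult` is an assumption.

```python
import warnings
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    errors: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.errors


class FeatureValidator:
    """Stand-in validator: `columns` is a dict of lists, not a real DataFrame."""

    def __init__(self, schema: dict[str, type]) -> None:
        self.schema = schema  # column name -> expected Python type

    def validate(self, columns: dict[str, list]) -> ValidationResult:
        missing = [name for name in self.schema if name not in columns]
        if missing:
            raise ValueError(f"Missing required columns: {missing}")
        if any(len(values) == 0 for values in columns.values()):
            raise ValueError("Empty input: every column needs at least one row.")
        result = ValidationResult()
        for name, expected in self.schema.items():
            values = columns[name]
            nulls = sum(v is None for v in values)
            if nulls / len(values) > 0.5:
                warnings.warn(f"Column {name!r} is more than 50% null.")
            bad = [v for v in values if v is not None and not isinstance(v, expected)]
            if bad:
                result.errors.append(
                    f"{name}: expected {expected.__name__}, got {bad[0]!r}"
                )
        return result


def test_missing_required_column_raises() -> None:
    try:
        FeatureValidator({"amount": float}).validate({"other": [1.0]})
    except ValueError:
        return
    raise AssertionError("expected ValueError for missing column")


def test_empty_input_raises() -> None:
    try:
        FeatureValidator({"amount": float}).validate({"amount": []})
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")


def test_null_heavy_column_warns_but_passes() -> None:
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = FeatureValidator({"amount": float}).validate(
            {"amount": [1.0, None, None]}
        )
    assert result.ok
    assert len(caught) == 1


def test_wrong_type_is_reported() -> None:
    result = FeatureValidator({"amount": float}).validate({"amount": [1.0, "x"]})
    assert not result.ok
```

Each test maps to one line of the interface description, which is what makes the description reviewable: a missing test points at a missing sentence in the prompt.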
Tri-tool automation surface
| Verification primitive | Claude Code | Gemini CLI | Codex CLI |
|---|---|---|---|
| Briefing-doc verification rules | CLAUDE.md ## Verification section | GEMINI.md section | AGENTS.md section |
| Run tests after edit (hooked) | PostToolUse matching Edit\|Write | tool-level allowlist + pre/post hooks | command-approval config |
| Block commits on failure | PreToolUse matching Bash → gate on git commit | pre-run hook with exit code | commit-approval config |
| Per-test output filtering | hooks can summarize / filter | hooks + prompt filtering | prompt filtering |
| Property-based test generation | prompt-driven (Hypothesis / fast-check) | same | same |
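The “block commits on failure” row can be sketched as a small gate script. Treat the contract here as an assumption to verify against your tool’s current hook documentation: the sketch assumes the hook receives a JSON event on stdin with `tool_name` and `tool_input.command` fields, and that exit code 2 tells the harness to block the call. The function names are hypothetical.

```python
import json
import subprocess
import sys

BLOCK_EXIT_CODE = 2  # assumed "block this tool call" signal; check your tool's docs


def is_git_commit(event: dict) -> bool:
    """True when the pending tool call is a Bash command containing `git commit`."""
    if event.get("tool_name") != "Bash":
        return False
    command = event.get("tool_input", {}).get("command", "")
    return "git commit" in command


def tests_pass() -> bool:
    """Run the test suite; a zero exit code means it passed."""
    return subprocess.run(["pytest", "-q"]).returncode == 0


def decide(event: dict, run_tests=tests_pass) -> int:
    """Return the exit code the hook should use for this event."""
    if is_git_commit(event) and not run_tests():
        print("Tests are failing; commit blocked.", file=sys.stderr)
        return BLOCK_EXIT_CODE
    return 0


def main(stdin=sys.stdin) -> int:
    """Entry point when installed as a hook: sys.exit(main())."""
    return decide(json.load(stdin))
```

Keeping `decide` separate from `main` means the gate logic can be unit-tested with a fake test runner, without spawning pytest or feeding stdin.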
Property-based tests as a force multiplier
Property-based testing is underused in agent-assisted workflows, and it should not be. A single Hypothesis (Python) or fast-check (JavaScript) test can replace dozens of hand-written edge-case tests and catch entire classes of bugs the agent would never have generated by enumeration.
import pandas as pd
from hypothesis import given, settings
from hypothesis import strategies as st

@given(st.lists(
    st.floats(allow_nan=True, allow_infinity=False),
    min_size=1, max_size=100
))
@settings(max_examples=200)
def test_feature_builder_invariants(values):
    df = pd.DataFrame({"amount": values})
    result = build_features(df)  # project function under test; import it from your package
    # Schema invariant: output columns never change
    assert set(result.columns) == {"amount", "log_amount", "is_missing"}
    # Null invariant: NaN inputs produce is_missing=True
    assert (df["amount"].isna() == result["is_missing"]).all()
    # Range invariant: log_amount is never negative
    assert (result["log_amount"].dropna() >= 0).all()
Evolution
Verification-first is the single most convergent practice in agentic coding. The principle is universal across tools; what differs substantially is the enforcement surface — how a tool lets you make verification non-skippable at the harness level rather than just recommended at the prompt level.
Convergence: the six-layer model predates agentic coding. The validation hierarchy (types → input → unit → integration → E2E → property) is pre-AI software engineering best practice. AI changes the enforcement economics — hooks make the layers automatic rather than aspirational — but the layers themselves are stable. Practices written to the six-layer shape will hold for the foreseeable future.
Emerging: auto-repair loops. When tests fail, some recent tool builds (Claude’s Stop hook, Gemini’s agent chaining) auto-loop the failure back into the agent for repair. The pattern is promising but unreliable — the failure diagnostic often doesn’t surface the root cause, and the agent ends up repeatedly trying cosmetic fixes. Practice for 2026: if the auto-repair loop runs more than twice on the same failure, it’s telling you something structural is wrong — intervene.
Quick reference
- AI-generated code looks correct. Verification is the core quality-preservation mechanism, not an optional add-on.
- Six structural layers: types, input validation, unit, integration, E2E, property-based. Seventh semantic layer: domain invariants.
- Domain invariants catch what the six-layer model cannot — the agent can generate structure, but only you know the invariants.
- Phase-appropriate rigor: exploration / development / production each earn different standards. Make the current phase explicit in the briefing doc.
- Test-first workflow outperforms code-first: interface → tests → implementation → run.
- Briefing-doc verification rules improve every session; hooks enforce what can’t be forgotten.
- Property-based tests deserve wider use — single tests replace dozens of enumerated cases.
- Hook-system depth varies across tools; the practice of making verification non-skippable is universal.
Thinking together
The shift from configure-delegate-verify to think-together-discover-build-better. How to use an agent as a thinking partner rather than a configurable tool — structuring collaboration to counteract sycophancy, surface hidden assumptions, and produce better decisions than either party alone.
Your prompts are precise, your briefing doc is tuned, your tests pass. But every interaction follows the same pattern — you delegate, the agent executes, you verify. Something is missing. You are using the agent as a tool you configure, not a collaborator you think with. The techniques in this chapter are about the shift: from configure-delegate-verify to think-together-discover-build-better.
Representation
Every chapter so far has answered how do I get the agent to do what I want? This one answers a different question: how do I use the agent to think more clearly?
The shift from delegation to collaboration is subtle but consequential. A delegated task ends when the agent produces output. A collaborative task produces output and produces insight — about your code, your assumptions, your design. The insight is often the more valuable product.
Three realities shape how collaboration actually works:
The agent has no sunk cost in your approach. When a human colleague reviews your architecture decision, they inherit your preferences, your constraints, and usually some politeness. The agent inherits none of those. It will suggest the alternative you didn’t consider — not because it’s smarter, but because it has no investment in being polite about your first draft.
The agent agrees by default. This is the most dangerous property to navigate. Present an approach with a preference attached (“I’m thinking Redis for caching”) and the agent will explain why Redis is good. Present the same problem with a different preference and it will explain why that choice is good. Sycophancy is structural, not a quirk — the fix is not “ask for honesty” but “structure the prompt so honest comparison is the path of least resistance.”
The agent has no memory between sessions (and in long sessions, degraded memory within). Collaboration requires you to be explicit about context the agent cannot carry. The briefing doc is the always-loaded frame; handoff files carry session-to-session state; ADRs capture the reasoning so future sessions can re-derive decisions.
The honest caveat
The agent is a thinking partner with no memory, a tendency to agree, and occasional confident wrongness. The techniques in this chapter work because they structure the collaboration to counteract those weaknesses — not despite them. Read every recommendation below as “do this to counter that failure mode,” not “the agent is a brilliant collaborator and these are the etiquette rules.”
Operation
Five collaboration modes. Each is a prompt pattern, not a feature of any specific tool — they work across Claude Code, Gemini CLI, and Codex CLI because they operate on the shape of the conversation, not the tool’s surface.
Mode 1: Hypothesis-driven debugging
When a bug appears, the default instinct is to paste the traceback and say “fix this.” This often works for shallow bugs. For deeper bugs, it produces patches that address symptoms rather than root causes.
Structure the debugging conversation as hypothesis testing. Three hypotheses, one minimal test each, isolate which cause is real.
I see a ValueError in feature_pipeline.py:47 — negative values in a
feature that should be non-negative. Three hypotheses:
1. Log transform applied before clipping negative deltas.
2. Currency conversion introduces negatives for returns/refunds.
3. Timezone mismatch causes date subtraction overflow.
Design a minimal test for each hypothesis. Run hypothesis 1 first —
it's most likely given the stack trace.
The agent then runs the tests in order, and the first confirmation points to the root cause — not a symptom patch.
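A minimal test for hypothesis 1 might look like the sketch below. The step functions are hypothetical stand-ins for the pipeline stages, and the feature is assumed to be `log1p(delta)`; the point is the shape of the test, which runs one probe input through both orderings.

```python
import math


def log_feature(row: dict) -> dict:
    """Add log_delta = log1p(delta) to a copy of the row."""
    return {**row, "log_delta": math.log1p(row["delta"])}


def clip_delta(row: dict) -> dict:
    """Clip a negative delta to zero on a copy of the row."""
    return {**row, "delta": max(row["delta"], 0.0)}


def run(steps, row: dict) -> dict:
    """Apply pipeline steps in order."""
    for step in steps:
        row = step(row)
    return row


SUSPECT_ORDER = [log_feature, clip_delta]    # hypothesis 1: log before clip
EXPECTED_ORDER = [clip_delta, log_feature]

probe = {"delta": -0.5}  # a small negative delta separates the two orderings
suspect = run(SUSPECT_ORDER, probe)["log_delta"]
expected = run(EXPECTED_ORDER, probe)["log_delta"]
print(f"log before clip: {suspect:.3f}, clip before log: {expected:.3f}")
```

With log applied first the feature comes out negative; with the clip first it is zero. If the real pipeline matches the suspect ordering, hypothesis 1 is confirmed and hypotheses 2 and 3 never need to run.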
Mode 2: Tests as thinking tools
In Ch 5, tests served verification: does the code do what it claims? Here, tests serve exploration: what should the code do at all?
"Write tests for this function." ← verification framing
"I'm not sure what should happen at the boundary. Write
5 test cases exploring: empty input, single element,
duplicates, negatives, overflow. Which behaviors
surprise me?" ← exploration framing
The second framing forces you to articulate expectations you hadn’t stated. When a test case surprises you — the function does something you didn’t expect — you’ve discovered a requirement that was implicit. The test didn’t verify the code; it interrogated your assumptions.
Property-based tests are particularly powerful here. “What invariants should always hold, regardless of input?” surfaces design decisions hiding as implementation details.
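The exploration framing translates into a tiny harness that records what a function does at each boundary instead of asserting what it should do. A sketch, assuming the standard library’s `statistics.mean` as the function under study; `explore` and the case names are hypothetical. The overflow case from the prompt is left out because Python integers do not overflow.

```python
from statistics import mean


def explore(fn, cases: dict) -> dict:
    """Run `fn` on each named input; record the result or the exception raised."""
    outcomes = {}
    for name, value in cases.items():
        try:
            outcomes[name] = f"returns {fn(value)!r}"
        except Exception as exc:
            outcomes[name] = f"raises {type(exc).__name__}"
    return outcomes


BOUNDARY_CASES = {
    "empty input": [],
    "single element": [7],
    "duplicates": [2, 2, 2],
    "negatives": [-3, -1],
}

outcomes = explore(mean, BOUNDARY_CASES)
for name, outcome in outcomes.items():
    print(f"{name:15} {outcome}")
```

Read the output before deciding what is correct. If you expected the empty input to return zero or NaN and it raises instead, you have just found an unstated requirement.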
Mode 3: Surfacing hidden assumptions
Every system rests on assumptions — about scale, usage patterns, what will never change. Most are invisible until they break.
"Here is my feature store schema. I designed it assuming:
(1) features are computed daily in batch, not real-time,
(2) training and serving use the same feature computation,
(3) feature drift is monitored externally.
Which assumption is most likely wrong in 6 months,
and what breaks when it does?"
Two specific prompts that compound across projects:
The pre-mortem. “Imagine this feature has failed in production six months from now. What are the three most likely causes? Work backwards from failure to the design flaw that enabled it.” Pre-mortems are more effective than post-mortems because they cost nothing and can change the design before commitment.
The Feynman test. “Explain my auth flow as if I just joined the team and need to modify it. Where did you have to guess because the code doesn’t make intent clear?” Gaps in the agent’s explanation are gaps in your documentation. What the agent cannot explain, a new hire cannot understand.
Mode 4: Anti-sycophancy structures
The most important collaboration skill. Three techniques, in increasing rigor:
Present options without a preference.
"We need a caching layer. The options are Redis, Memcached, and an
in-process LRU cache. For each option, list: (1) what it handles
well, (2) what it handles poorly, and (3) one scenario where it
would be the wrong choice. Then recommend one, with the specific
tradeoff that makes it better for our use case."
This prompt has no obvious “right” answer for the agent to pattern-match to, so it must reason about tradeoffs. A preference-first formulation (“I think we should use Redis. What do you think?”) has a correct answer (agree), which is the one you’ll get.
Argue the other side. After the agent recommends an approach, explicitly ask it to argue against:
"Good analysis. Now argue against your recommendation. What's the
strongest case for NOT using Redis here? What would have to be
true about our workload for Memcached to be the better choice?"
This forces the agent to find real weaknesses in its own recommendation. If the counterargument is weak, the recommendation is probably sound. If it’s strong, you’ve discovered a genuine tradeoff worth investigating before committing.
The devil’s-advocate session. For critical decisions, open a separate session with an explicit adversarial role:
"You are a senior engineer who believes our current architecture
decision (Redis caching layer) is wrong. Make the strongest possible
case against it. Don't hold back — I need to hear the real risks
before we commit."
The separate session matters. The original session has accumulated context that biases toward the decision; a fresh session with an adversarial frame produces genuinely different analysis.
Mode 5: The interview pattern
For larger features, have the agent interview you before implementation.
"I want to build a feature-drift monitoring system. Interview me
in detail. Ask about:
- Technical implementation
- Data sources and schemas
- Edge cases and failure modes
- Tradeoffs I might not have considered
Keep interviewing until we've covered everything, then write a
complete spec to SPEC.md."
Once the spec is complete, start a fresh session to implement it. The new session has clean context focused on implementation; you have a written spec to reference; the ADR-style artifact captures what was decided and why.
Quick wins: making the agent a better reader
Five investments that take minutes and compound across every future session:
Type hints as contracts. Five seconds to write, five minutes of debugging saved. In window: int = 30, the annotation costs five characters (: int) and tells the agent the parameter’s type alongside its name and default. Without it, the agent may pass a string, a float, or a timedelta.
Code archaeology for brownfield. Instead of assuming legacy code is wrong, assume it’s explained by something you don’t yet see:
"Why might the original author have written this as a nested loop
instead of a join? What constraint explains this design choice?"
The agent often finds the constraint — database limitation, legacy API, performance requirement — that made the original design rational.
README-driven development. Write the README first. Then: “Read this README. What questions does a new developer still have after reading it?” Gaps in the answer are gaps in your documentation.
ADRs: capturing alternatives considered
When you make an architecture decision with the agent, the conversation captures not just what you decided but why, and what alternatives were considered. An Architecture Decision Record preserves this reasoning for your future self.
# ADR-007: Offline Feature Computation
## Context
Feature computation runs in nightly batch. Some features
are stale by 12 hours at serving time.
## Decision
Keep batch for training features. Add streaming for 3
real-time features (last-login recency, cart value,
session count).
## Alternatives Considered
1. All streaming (rejected: ~10× infrastructure cost).
2. Faster batch, hourly (rejected: still stale).
3. Feature caching with TTL (rejected: cache-invalidation
complexity).
## Consequences
- Two feature computation paths to maintain.
- Real-time features need drift monitoring.
- Training/serving skew possible for 3 features.
## Assumptions to Revisit
- 3 real-time features sufficient for next 6 months.
- Streaming infra handles peak load (Black Friday).
- Drift alerts catch training/serving skew.
The Alternatives Considered section is the most valuable. The agent suggests alternatives you wouldn’t — not because it’s smarter, but because it has no investment in your preferred approach. A human colleague might hesitate to challenge your solution; the agent doesn’t hesitate.
Evolution
Collaboration patterns are more stable than tool surfaces. The modes in this chapter — hypothesis debugging, assumption surfacing, anti-sycophancy — predate agentic coding (they come from code review culture, scientific method, devil’s-advocate traditions). What agents changed is the friction of applying them.
Convergence: the sycophancy default is universal. All three models default to agreement in under-specified prompts. The anti-sycophancy techniques — present-options-without-preference, argue-the-other-side, devil’s-advocate-session — are equally needed across tools. This is a property of instruction-following LLMs, not a tool-specific quirk; don’t expect it to be “fixed” by any single release.
Convergence: ADR-style capture is universal good practice. All three tools produce markdown naturally; all three can be asked to write an ADR; the value of the artifact is independent of which agent wrote it. ADRs are a 1990s pattern that agentic coding has quietly revived by making the marginal cost of writing them near zero.
Emerging: multi-agent critique. Instead of running a single agent in devil’s-advocate mode, some practitioners run the recommendation and critique in different models (Claude recommends, Codex critiques, or vice versa). The cross-model version produces genuinely different signal because the models have different training and biases. This is still a hand-rolled workflow in 2026 — expect tooling support (explicit “second opinion” integrations) within 12–18 months.
Quick reference
- The agent is a thinking mirror — distortions in what it understands reveal gaps in what you’ve documented.
- Five collaboration modes: hypothesis debugging, tests as thinking tools, assumption surfacing, anti-sycophancy, interview-driven spec.
- Anti-sycophancy is structural, not attitudinal. Present options without preference; ask it to argue against; run recommendation and critique in separate sessions.
- The divergence between a recommendation session and a critique session is the measure of decision quality.
- Quick wins: docstrings (the Note section), type hints, self-debugging error messages, code archaeology, README-first.
- ADRs capture the Alternatives Considered — the highest-value section, usually skipped without an agent in the loop.
- Collaboration patterns are tool-agnostic because they operate on conversation shape, not command surface.
- When both recommendation and critique sessions agree, ship with confidence. When they diverge sharply, that’s where the decision actually lives.
Briefing documents
CLAUDE.md / GEMINI.md / AGENTS.md — the industry has converged on a project-root briefing doc the agent re-reads on every turn. This chapter is about what goes in it, what doesn't, and how to structure it so every token has leverage.
A stateless agent needs a frame. It has no memory of your project, no training on your conventions, no sense of what the current phase is or why this repo is shaped the way it is. The briefing doc is how you give it one — on every single turn, at the top of context, where attention is strongest. This chapter is about what goes in it, what shouldn’t, and how to structure it so every line earns its place.
Representation
The briefing doc is the single highest-leverage artifact in your project. Not because it’s magic — because of where it sits in the conversation. Every modern CLI-agent re-reads the briefing doc at the start of every turn and injects it into context before the current prompt. This means every token in the file is paid-for on every single interaction of every session, forever. Waste is not free; density is compounding.
What belongs in the briefing doc
A briefing doc plays three roles. Confusing them produces bloated, vague, or ineffective files.
Role 1: Briefing. What the project is. Architecture in broad strokes (stack, layers, repo structure), conventions that are non-obvious, the current development phase and its graduation criteria. This is the section that prevents the agent from inventing patterns the codebase doesn’t have.
Role 2: Rules. What the agent must or must not do. Non-negotiable constraints: never log secrets, always run tests before commit, never modify generated files by hand. Rules should be specific enough that compliance is observable. Vague rules (“write clean code”) don’t rule anything out.
Role 3: Vocabulary. Precise terms that map to specific actions. “Validate” means pytest tests/ && ruff check src/. “Ship” means commit, push, open PR. This is the precision vocabulary from Ch 3 (Prompting), made stable across sessions.
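One way to think about the vocabulary role is as a lookup table: each term has exactly one expansion, and an undefined term is an error, not an invitation to guess. The sketch below is only a model of that idea; `VOCABULARY` and `expand` are hypothetical names, and the “ship” steps are listed as descriptions because the text does not give exact commands.

```python
# Term -> ordered steps, taken from the two examples in the text.
VOCABULARY = {
    "validate": ["pytest tests/", "ruff check src/"],
    "ship": ["commit", "push", "open PR"],
}


def expand(term: str) -> list[str]:
    """Return the steps a vocabulary term stands for; reject undefined terms."""
    key = term.strip().lower()
    if key not in VOCABULARY:
        known = ", ".join(sorted(VOCABULARY))
        raise KeyError(f"Unknown term {term!r}. Defined terms: {known}.")
    return VOCABULARY[key]


steps = expand("Validate")
```

If a term you use in prompts would raise here, it belongs in the briefing doc before you rely on it.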
What doesn’t belong
The briefing doc is not a kitchen sink. These go elsewhere:
- Historical decisions. ADRs live in docs/adrs/, not in the briefing doc. The briefing doc states what the decision is, not the alternatives considered twelve months ago.
- Aspirational goals. “We want to move to microservices” is not a rule the agent can follow. If it’s true today, write it as a rule today; if it’s tomorrow, it’s not in scope.
- Personal preferences that aren’t enforced. If a convention isn’t worth enforcing in code review, it isn’t worth a briefing-doc line either.
- Anything better expressed as code. A schema is better than a paragraph describing a schema. Link to the code file or the schema definition.
- Anything the agent can figure out by reading the code once. Don’t narrate what tsconfig.json says; point to it.
Size discipline
Anthropic’s published guidance on Claude Code suggests individual briefing docs stay under ~200 lines, with the combined total across all briefing-doc layers (global + project + enterprise + user) under ~500 lines. Those numbers are Claude-specific but the principle generalizes: every briefing doc has a budget, because it’s a fixed tax on every turn. Over-budget briefing docs trigger the Context overload anti-pattern from Ch 11 — rules get diluted across too much text and the agent starts picking and choosing which to follow.
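A budget is easier to keep when something checks it. The sketch below counts lines against the ~200 per file and ~500 combined figures quoted above; `check_budget` is a hypothetical helper, and which files count toward the combined total in your setup is an assumption you supply through `paths`.

```python
from pathlib import Path

PER_FILE_LIMIT = 200   # ~200 lines per briefing doc
TOTAL_LIMIT = 500      # ~500 lines across all layers


def check_budget(
    paths: list[Path],
    per_file: int = PER_FILE_LIMIT,
    total: int = TOTAL_LIMIT,
) -> list[str]:
    """Return budget violations; an empty list means within budget."""
    problems = []
    combined = 0
    for path in paths:
        lines = len(path.read_text(encoding="utf-8").splitlines())
        combined += lines
        if lines > per_file:
            problems.append(
                f"{path}: {lines} lines exceeds the per-file budget of {per_file}."
            )
    if combined > total:
        problems.append(
            f"Combined {combined} lines exceeds the total budget of {total}."
        )
    return problems
```

Run it in CI or a pre-commit check with the briefing-doc paths for your tool, and fail the build when the returned list is non-empty.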
The hub-and-spoke pattern
At scale, a single file is not enough. A large project naturally accumulates specialized rules: the backend uses Django conventions that don’t apply to the frontend; the mobile client has Swift conventions irrelevant to both; the ML pipeline has verification requirements specific to that domain. Stuffing all of this into one briefing doc violates the every-line-earns-its-tax rule.
The solution is hub-and-spoke: a lean core briefing doc (the hub) plus path-scoped rule files (spokes) that load only when the agent is working in the relevant directory. Claude Code implements this natively via .claude/rules/*.md with paths: frontmatter; Gemini CLI supports hierarchical GEMINI.md files (nested per directory); Codex CLI’s model is flatter but improving.
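The loading logic behind hub-and-spoke is simple to model: the hub always loads, and a spoke loads only when the file being worked on matches one of its path patterns. This sketch illustrates the idea and is not how any of the three CLIs implements it; the rule file names and patterns are hypothetical.

```python
from fnmatch import fnmatchcase

HUB = "CLAUDE.md"

# Spoke rule file -> path patterns it applies to (hypothetical layout).
SPOKES = {
    "rules/backend.md": ["backend/*"],
    "rules/frontend.md": ["frontend/*"],
    "rules/ml.md": ["ml/*", "notebooks/*"],
}


def rules_for(file_path: str) -> list[str]:
    """Return the briefing files in force when editing `file_path`."""
    loaded = [HUB]  # the hub is always loaded
    for rule_file, patterns in SPOKES.items():
        # fnmatchcase's "*" also matches "/" so "backend/*" covers nested paths.
        if any(fnmatchcase(file_path, pattern) for pattern in patterns):
            loaded.append(rule_file)
    return loaded


backend_rules = rules_for("backend/api/views.py")
root_rules = rules_for("README.md")
```

Working in `backend/api/views.py` loads the hub plus the backend spoke; working on `README.md` loads the hub alone. The per-turn cost tracks the directory you are in, not the size of the whole rule set.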
Operation
The three CLIs share the briefing-doc pattern and differ in loading mechanics.
| Property | Claude Code | Gemini CLI | Codex CLI |
|---|---|---|---|
| Project-root filename | CLAUDE.md | GEMINI.md | AGENTS.md |
| Global/user-level file | ~/.claude/CLAUDE.md | ~/.gemini/GEMINI.md | ~/.codex/AGENTS.md |
| Hierarchical nesting | 5-layer stack (global → project → local → enterprise → user) | nested GEMINI.md per directory | flat (global + project) |
| Path-scoped rules | .claude/rules/*.md with paths: frontmatter | directory-nested GEMINI.md files | no first-class mechanism |
| Imports / includes | @path/to/shared.md | @path/to/file.md | inline only |
| Reload command | auto on compact | /memory refresh | restart |
| Line budget | ~200/file, ~500 total (Anthropic guidance) | no official guidance; same heuristic applies | no official guidance; same heuristic applies |
Writing content that earns its line
A briefing-doc line should pass three tests:
- Is it actionable? The agent can check, at any given point, whether it’s currently following the rule. “Handle errors gracefully” fails this test (what’s “gracefully”?). “Every function that can fail raises an explicit exception with the failing condition in the message” passes.
- Is it universal within the scope it claims? If a rule really only applies to backend code, don’t write it at the hub level. Universal rules stay in the hub; scoped rules go in path-scoped files.
- Does it give an example? An explicit “do it like this” is worth ten abstract rules. Show the pattern you want, then state the rule.
## Error handling
All functions that can fail raise explicit exceptions. Error messages
include what was expected, what was received, and what the caller
can do to recover.
Example:
    raise ValueError(
        f"Need {min_rows} rows, got {len(df)}. "
        f"Check data source or reduce min_rows parameter."
    )
Testing the briefing doc
A briefing-doc rule is an assumption until tested. Every new rule earns a two-step verification:
Smoke test. Clear the session, ask the agent to do something the rule governs, do not mention the rule, watch whether it’s followed. If yes, the rule is in force; if no, it’s too vague, too buried, or contradicted by existing code.
Adversarial test. For security or compliance rules: clear the session, ask the agent to do something the rule prohibits, watch whether it refuses. If it complies without hesitation, the rule is advisory (the agent can choose to ignore it). Move it to a hook or to harness-level deny rules — anywhere the agent cannot choose.
Evolution
The briefing-doc pattern is the most complete convergence in the agentic-coding toolchain. The filenames differ; the shape is identical.
Convergence: the four-section skeleton. Architecture / Conventions / Constraints / Verification is now standard guidance across vendor docs and community content. Teams that adopt this structure for one tool can port the same briefing doc to another tool by renaming the file.
Convergence: path-scoped rules are becoming standard. Claude Code’s .claude/rules/*.md with paths: frontmatter was first; Gemini’s hierarchical nested GEMINI.md landed later with equivalent semantics. Codex is the outlier today with no path-scoping mechanism, but community pressure and a published roadmap suggest this closes within 12 months.
Emerging: briefing-doc linting. A handful of teams have started shipping linters that check briefing docs against the heuristics in this chapter — line count, section presence, example coverage, rule specificity. These are hand-rolled in 2026 and mostly live in internal tooling. Expect first-class product support (a claude lint-rules or equivalent subcommand) by 2027.
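A hand-rolled check of this kind can be very small. The sketch below covers only the section-presence heuristic, and it assumes the four core sections appear as markdown headings; the function name is hypothetical.

```python
import re
from typing import List

REQUIRED_SECTIONS = ("Architecture", "Conventions", "Constraints", "Verification")


def missing_sections(briefing_doc: str) -> List[str]:
    """List the core sections that have no markdown heading in the doc."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#{1,6}\s+(.+?)\s*$", briefing_doc, re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```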
Emerging: team-shared briefing-doc patterns. The pattern of maintaining a team-conventions.md imported by every project briefing doc is becoming standard at larger orgs. The team file encodes the org-wide precision vocabulary and non-negotiables; each project’s briefing doc imports it and adds project-specific content. This scales better than copy-paste and keeps org-wide changes one-file-to-update.
Quick reference
- The briefing doc is the highest-leverage artifact in your project because every line is paid for on every turn of every session.
- Three roles: briefing (what the project is), rules (what must or must not happen), vocabulary (precise terms that map to actions).
- Four core sections: Architecture, Conventions, Constraints, Verification.
- Size discipline: ~200 lines per file, ~500 combined. Over budget means you’re paying rent on content that isn’t carrying its weight.
- Hub-and-spoke at scale: lean core briefing doc + path-scoped rule files that load only when relevant.
- Every rule earns: actionable, universal within scope, example-anchored.
- Test every new rule: smoke test (does the agent follow it unprompted?) + adversarial test (does it refuse violations?).
- Advisory rules can be ignored; hook/deny-rule enforcement cannot. Match the mechanism to the stakes.
- The pattern is convergent; the filenames and loading depth diverge. Write to the pattern; the filename is the easy migration.
Extending agents
Commands, skills, hooks, MCP — the four axes by which an agent becomes more than the defaults it ships with. This is the most divergent surface in the category; get the mental model right and the command names become secondary.
Out of the box, every CLI-agent is just a general-purpose shell for a model. That’s where it stops being useful. The agent that ships with defaults is the agent that loses to the agent with your repo’s conventions wired in, your team’s verification gates running automatically, your private knowledge base reachable as a tool. This chapter is about the four axes on which agents become extended — and why extensibility is also the surface where tools diverge most sharply.
Representation
Extension is how a CLI-agent stops being a general-purpose coder and becomes your agent. Every tool this book covers supports extension, but they divide the work differently. The right mental model is not “which tool has the most features” — it’s “which extension axis does my need actually live on?”
When to use which axis
The decision tree is short:
Use a command / skill when a workflow is repeatable and the user should decide when to invoke it. Deploy checks, code review templates, chapter-porting workflows. Cost: minutes to write a markdown file. Guarantee: the agent runs the workflow when asked.
Use a hook when a standard is non-negotiable and the user should not have to remember it. Tests before commit, lint on write, secrets-scan on edit. Cost: a shell script and some config. Guarantee: fires on the matching event regardless of what the agent or user wants.
Use a permission / guardrail when a prohibition is load-bearing. The agent must never read .env; commits to main require approval; writes outside src/ are forbidden. Cost: one config entry. Guarantee: the agent cannot violate the rule without operator override.
Use an MCP server when an external system needs to become part of the agent’s working surface. Your company’s ticket tracker, internal knowledge base, custom deployment platform. Cost: stand up a server implementing the protocol. Guarantee: the external system’s tools appear alongside the built-in ones and are usable via the same prompt shape.
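The four rules above can be read as a small decision function. This is only a sketch: the parameter names are invented for illustration, and the order in which the checks run is an assumption rather than something the text prescribes.

```python
from enum import Enum


class Axis(Enum):
    COMMAND = "command / skill"
    HOOK = "hook"
    PERMISSION = "permission / guardrail"
    MCP = "MCP server"


def choose_axis(*, external_system: bool, prohibition: bool,
                must_always_run: bool) -> Axis:
    """Map a need onto one of the four extension axes.

    The ordering of the checks is an assumption of this sketch.
    """
    if external_system:
        return Axis.MCP
    if prohibition:
        return Axis.PERMISSION
    if must_always_run:
        return Axis.HOOK
    # What remains: a repeatable workflow the user invokes on demand.
    return Axis.COMMAND
```

For example, "run tests before every commit" sets only `must_always_run`, so it lands on the hook axis.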
Why this chapter is volatile
The four-axis model is stable. The specific names, file formats, and configuration surfaces are not. Expect churn quarterly — this is feature-surface in a book otherwise dominated by architectural-pattern and stable-principle. Verify commands and file paths against current docs before relying on them.
Operation
Extension-surface comparison
The tri-tool map of what’s supported where:
| Axis | Claude Code | Gemini CLI | Codex CLI |
|---|---|---|---|
| User commands | .claude/skills/<name>/SKILL.md (or a single .md) → /name | .gemini/commands/<name>.toml → /name | slash commands via config / plugins |
| Reusable skills | merged with commands — a skill-with-a-directory is the richer form of a command | via extensions | via MCP server + registered tools |
| Hooks | 9+ events: UserPromptSubmit, PreToolUse, PostToolUse, Notification, Stop, SubagentStop, PreCompact, SessionStart, SessionEnd | lighter; pre/post tool callbacks | command-approval config in ~/.codex/config.toml |
| Permissions | settings.json with allow-list / deny-rules; enforced by harness | tool-level allow-lists | approval modes (--suggest / -a on-request / --full-auto) |
| MCP support | first-class; shipped with Claude Code from early | first-class (2025+); 200+ extension ecosystem | first-class; codex mcp CLI + ~/.codex/config.toml |
| MCP transports | stdio + HTTP/SSE | stdio + HTTP | stdio + streamable HTTP |
Two observations before the details:
-
MCP is convergent. All three tools support the protocol with interoperable server implementations. An MCP server written for one client works for all three, modulo tiny transport-config differences. This is the single biggest extensibility story of 2025–2026.
-
Everything non-MCP diverges. Skills, command authoring, hooks — each tool’s surface is incompatible with the others. Practices that lean on specific command-file formats or hook-event names do not port; practices that lean on “what the extension does” do.
Commands and skills: the invoked layer
A command is the simplest extension. You write a markdown file describing a workflow; the agent reads it when invoked. Claude Code recently merged its commands/ and skills/ directories — both now produce the same /slash-command interface. A “skill” is just the richer form: a directory with a SKILL.md entry point plus supporting reference files.
# Claude Code — simple command form
.claude/skills/review.md
---
name: review
description: Run the team's PR review checklist
---
1. Run `git diff main...HEAD`
2. For each changed file, check:
   - Does it match the precision-vocabulary in the briefing doc?
   - Are error paths handled explicitly?
   - ...
# Claude Code — richer skill form
.claude/skills/deploy/
  SKILL.md        # required entry point, frontmatter + instructions
  checklist.md    # supporting reference
  examples/       # sample outputs
Gemini CLI uses TOML-based custom commands in .gemini/commands/; Codex CLI ships a built-in slash-command set and leans on MCP servers to add custom verbs. The mental model is the same everywhere: a named, invocable workflow.
Hooks: the always-fires layer
A hook is a script that runs automatically at a specific lifecycle point, regardless of what the user or agent is doing. Hooks are how you make a standard non-negotiable: “always run tests before commit” is an advisory line in the briefing doc that the agent may forget in a long session; as a PreToolUse hook matching Bash commands that contain git commit, it fires every time.
Claude Code has the richest hook surface — nine lifecycle events covering prompt submission, tool invocation (before/after), compaction, session start/end, sub-agent lifecycle, and generic notifications. Each hook can match specific tool names or patterns and run arbitrary shell commands. Gemini CLI has a lighter set focused on pre/post tool invocation. Codex CLI’s model is command-approval configuration — the agent’s operations are gated by approval policy rather than by event hooks.
{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Bash",
      "hooks": [{
        "type": "command",
        "command": ".claude/hooks/pre-commit-gate.sh",
        "description": "Block git commit if tests fail"
      }]
    }]
  }
}
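The gate script itself is not shown in the text. The sketch below is one way such a gate could look, written in Python rather than shell. It assumes the hook receives a JSON payload on stdin with `tool_name` and `tool_input.command` fields, and that exit code 2 blocks the tool call; check both against the current Claude Code hooks documentation. The test command (`pytest -q`) and the function names are placeholders.

```python
import json
import subprocess
import sys
from typing import Callable

BLOCK = 2  # assumed: exit code that tells the harness to block the call
ALLOW = 0


def tests_pass() -> bool:
    """Run the project's test command (placeholder: pytest)."""
    return subprocess.run(["pytest", "-q"]).returncode == 0


def decide(payload: dict, run_tests: Callable[[], bool] = tests_pass) -> int:
    """Return the exit code for one PreToolUse event."""
    command = payload.get("tool_input", {}).get("command", "")
    if payload.get("tool_name") != "Bash" or "git commit" not in command:
        return ALLOW  # not a commit, nothing to gate
    return ALLOW if run_tests() else BLOCK


def main() -> None:
    """Entry point: read the event from stdin and exit with the decision."""
    code = decide(json.load(sys.stdin))
    if code == BLOCK:
        print("Tests failed; commit blocked.", file=sys.stderr)
    sys.exit(code)
```

Wired up as a hook, the script would end with a call to `main()`. Keeping the decision in `decide` lets the gate be exercised without running the real test suite.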
MCP: the external-system layer
Model Context Protocol is the shared wire format for agent ↔ external-system communication. An MCP server exposes tools (functions the agent can call), resources (read-only content the agent can retrieve), and prompts (pre-structured templates the agent can invoke). Once connected, the server’s capabilities appear in the agent’s surface alongside the built-ins — a Jira MCP server adds create_ticket and search_tickets tools; a Postgres MCP server adds query and describe_schema; a company-knowledge MCP server adds search_docs.
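To give a feel for the wire format, here is a deliberately reduced sketch of the tool side of a server: a function that answers JSON-RPC `tools/list` and `tools/call` requests for one toy tool. It omits the initialize handshake, resources, prompts, and transport, and a real server would normally be built on an MCP SDK. The `search_docs` tool body and the `DOCS` data are invented for illustration.

```python
import json
from typing import Any, Dict

# One hypothetical tool, described the way a tools/list result would.
TOOLS = [{
    "name": "search_docs",
    "description": "Search the company knowledge base",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

DOCS = ["Deploy runbook", "Oncall handbook", "Deploy checklist"]  # toy data


def handle(request: Dict[str, Any]) -> Dict[str, Any]:
    """Answer a single JSON-RPC request for tools/list or tools/call."""
    req_id = request.get("id")
    method = request.get("method")
    params = request.get("params", {})
    if method == "tools/list":
        return {"jsonrpc": "2.0", "id": req_id, "result": {"tools": TOOLS}}
    if method == "tools/call" and params.get("name") == "search_docs":
        query = params.get("arguments", {}).get("query", "").lower()
        hits = [doc for doc in DOCS if query in doc.lower()]
        content = [{"type": "text", "text": json.dumps(hits)}]
        return {"jsonrpc": "2.0", "id": req_id, "result": {"content": content}}
    # Anything else is reported with the JSON-RPC "method not found" code.
    return {"jsonrpc": "2.0", "id": req_id,
            "error": {"code": -32601, "message": "Method not found"}}
```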
All three CLIs now support MCP as first-class. Configuration mechanisms differ:
- Claude Code: .claude/settings.json or global config, with stdio + HTTP/SSE transport.
- Gemini CLI: mcpServers entry in Gemini config; 200+-extension ecosystem at geminicli.com/extensions; includes official remote MCP servers for Google Workspace, Google Cloud services, etc.
- Codex CLI: ~/.codex/config.toml + the codex mcp CLI for add/list/remove operations; supports stdio and streamable HTTP transports.
# ~/.codex/config.toml — adding an MCP server
[mcp_servers.jira]
command = "jira-mcp-server"
args = ["--workspace", "acme"]
env = { JIRA_API_KEY = "..." }
Evolution
Extension is simultaneously the most convergent part of the category (on MCP as protocol) and the most divergent (on everything else). Both stories are still in motion.
Convergence: slash commands as the invocation surface. All three tools converged on /command-name as the user-invocable extension syntax. The file format for defining a command differs (Claude’s skill-merged markdown, Gemini’s TOML, Codex’s plugin system), but the user experience — type a slash, pick from autocomplete, invoke — is identical across tools.
Convergence: custom commands merging with richer skills/extensions. Claude Code’s 2026 merge of .claude/commands/ into .claude/skills/ formalized a pattern that was already implicit: a “command” is just the simple form of a “reusable skill with optional supporting files.” Gemini’s extensions and Codex’s MCP-as-command-source follow the same logic at different layers of abstraction.
Emerging: cross-tool skill registries. A few community projects in 2026 are experimenting with CLI-agnostic skill formats that translate into each tool’s native format on install. None are mature; all face the fundamental divergence of hook/permission semantics that no translation layer can paper over. The write-once-everywhere story will arrive for commands (already close) before hooks (genuinely hard).
Quick reference
- Four extension axes: commands/skills (invoked), hooks (always-fires), permissions (guardrails), MCP (external systems). Match the axis to the guarantee you need.
- Commands/skills are cheap and on-demand; hooks are medium-cost and always-fire; permissions are harness-enforced; MCP is the gateway to external systems.
- MCP is the one cross-tool extension axis: write the server once, use it from any compliant CLI. Commands and hooks do not port.
- Claude Code merged commands + skills into one system; a command is just the simple form of a skill.
- Claude Code has the richest hook surface (nine events); Gemini has lighter hooks; Codex uses command-approval as its enforcement paradigm.
- Quarterly extension audit: delete what hasn’t been invoked in a month. Hooks especially — they fail silently.
- Before writing a tool-specific extension, ask: could this be an MCP server instead? If the answer is yes and it would ever be shared, write the MCP server.
- This is the most volatile chapter in the book. Verify command names, file formats, and event lists against current docs before relying on them for anything load-bearing.
Delegation and parallelism
The fix for context rot is not a bigger window — it is more, shorter conversations. This chapter is about the two mechanics that make horizontal scaling practical: subagent delegation within a session, and parallel sessions across worktrees. Go wide, not deep.
There is a ceiling on how much work a single session can do before context rot sets in. The instinct is to push the ceiling higher — use the 1M window, compact aggressively, add more discipline. The insight is the opposite: stop trying to scale the single session and scale the number of sessions instead. This chapter is about the two mechanics that make horizontal scaling practical — delegating within a session to a subagent, and running sessions side-by-side across isolated worktrees.
Representation
Context as Currency made the case that context decays non-linearly. The consequence — understated there, fleshed out here — is that the best response to “this task is too big for one session” is almost never “make the session bigger.” It is “make the tasks smaller and run more of them.”
Two mechanics enable horizontal scaling:
Subagent delegation. A child agent spawned from inside a session with its own isolated context, tools, and sometimes its own git worktree. The child works on a bounded sub-task and returns a summary; the research or exploration context does not pollute the parent. Delegation is within-session scaling.
Parallel sessions. Multiple independent agent sessions running side-by-side, each with its own context, usually in its own git worktree to prevent file conflicts. Each session owns a logical workstream. This is cross-session scaling.
When to go wide vs when to go deep
Not every task scales horizontally. Two legitimate shapes:
Go wide (multiple short sessions) when the tasks are independent, can be serialized with handoff files, or naturally factor into parallelizable sub-problems. Examples: implementing separate features, reviewing multiple PRs, batch operations across files, porting chapters.
Go deep (one long session) when the task requires continuous reasoning across many interconnected decisions that would lose coherence if split. Examples: debugging a subtle concurrency bug where every clue informs the next, designing a complex API where each endpoint’s shape depends on the others, working through a proof or derivation.
Default to wide. Most work decomposes into independent units better than practitioners expect. The two-minute overhead of writing a handoff file is almost always less than the cost of context degradation in a three-hour session.
The delegation economics
Subagent delegation has a specific economic profile worth internalizing. Spawning a child agent costs:
- Startup overhead: the child’s briefing doc re-injection, its own system prompt, initial tool registration — usually ~8–15K tokens. Not cheap; not ruinous.
- Result summarization: whatever the child did has to compress into a return value the parent can act on. Long child-sessions that produce vague summaries waste the delegation.
- Context switching: the parent pays a small attention cost when re-engaging after the child returns.
The delegation pays back when the avoided cost — the tokens the parent would have consumed doing the work itself — exceeds the startup + summary cost. For a well-scoped research task that would have taken the parent 40K tokens of file-reading, a subagent with 15K startup and a 2K summary saves ~23K tokens net. For a trivial task that would have taken the parent 3K tokens, delegation is a net loss.
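The break-even rule is simple arithmetic. As a sketch (the function name is hypothetical, and the token figures are the ones from the paragraph above plus one assumed trivial case):

```python
def delegation_net_savings(parent_cost: int, startup: int, summary: int) -> int:
    """Tokens saved by delegating; a negative value means a net loss."""
    return parent_cost - (startup + summary)


# The research task from the text: 40K avoided, 15K startup, 2K summary.
research = delegation_net_savings(40_000, 15_000, 2_000)

# A trivial 3K task; the 8K startup and 500-token summary are assumed figures.
trivial = delegation_net_savings(3_000, 8_000, 500)
```

Here `research` comes out at 23,000 tokens saved and `trivial` at a loss of 5,500, which is the asymmetry the rule is meant to capture.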
Operation
Tri-tool delegation surface
| Primitive | Claude Code | Gemini CLI | Codex CLI |
|---|---|---|---|
| Spawn subagent within session | Task tool (with subagent_type) | agent-chaining via prompts | Agents SDK for programmatic; limited in-CLI |
| Subagent context isolation | yes, first-class | prompt-level | via Agents SDK |
| Subagent worktree isolation | isolation: worktree frontmatter | via explicit git worktree add | via explicit git worktree add |
| Built-in git worktree flag | --worktree / -w (v2.1.49+) | manual git worktree add | manual git worktree add |
| Agent-team coordination | Agent Teams feature | manual | Agents SDK |
| Resume parallel session | claude --continue / --resume | gemini --continue | codex resume |
Subagent patterns
The mental model: a subagent is a function call with bounded input, a clear contract, and a short return value.
You: "Delegate a search: use a subagent to find every file in
src/features/ that uses pandas (not polars). Report a
bullet list of paths plus the pattern used in each. Do
not edit anything."
[Agent spawns subagent with Task(subagent_type='general-purpose', ...)]
[Subagent reads src/features/, greps, builds list, returns summary]
Agent: "Found 7 pandas usages: ..."
[200-token summary in parent context;
parent's context is unchanged from before the task]
The contract matters — the prompt to the subagent specifies exactly what the deliverable looks like (“bullet list of paths plus pattern”). Without a contract, the subagent may return 2,000 tokens of exploration results, and you’ve turned delegation into pollution.
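One way to keep the contract explicit is to treat it as data and render the prompt from it. The class below is a hypothetical sketch: the field names, the word cap, and the prompt wording are assumptions, not a feature of any CLI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubagentContract:
    task: str
    deliverable: str               # the declared return shape
    read_only: bool = True
    max_summary_words: int = 150   # assumed cap; tune to taste

    def prompt(self) -> str:
        """Render the delegation prompt with the contract spelled out."""
        lines = [
            f"Task: {self.task}",
            f"Return exactly: {self.deliverable}",
            f"Keep the reply under {self.max_summary_words} words.",
        ]
        if self.read_only:
            lines.append("Do not edit anything.")
        return "\n".join(lines)

    def accepts(self, summary: str) -> bool:
        """Cheap check that a returned summary honours the size cap."""
        return len(summary.split()) <= self.max_summary_words
```

A summary that fails `accepts` is a signal to tighten the deliverable wording, not to re-summarize in the parent.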
Parallel-session patterns
Multiple independent agent sessions, each in its own git worktree, working on separate workstreams. Each session reads the same briefing doc (so conventions propagate automatically) and writes to an isolated branch.
The canonical flow:
- Identify 3–5 independent tasks (fix bug A, add feature B, refactor module C).
- For each, spin up a worktree-isolated session:
- For each, spin up a worktree-isolated session: claude --worktree fix-bug-a (or git worktree add + agent launch for the other CLIs).
- Each session completes its task and commits. No file conflicts because worktrees isolate the filesystem.
- Merge each session’s branch into main via PR or direct merge.
Claude Code’s v2.1.49 --worktree flag makes this a single command. Gemini and Codex require the manual git worktree add incantation but the outcome is the same.
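For the manual route, the git side can be scripted. The helper below is a sketch that only builds the commands; the function name and the `../worktrees` layout are assumptions, and launching the agent inside each directory is left to whichever CLI is in use.

```python
from typing import List


def worktree_commands(tasks: List[str],
                      base_dir: str = "../worktrees") -> List[List[str]]:
    """Build one `git worktree add` invocation per task slug.

    Each worktree gets its own directory and a new branch named after the task.
    """
    return [
        ["git", "worktree", "add", "-b", task, f"{base_dir}/{task}"]
        for task in tasks
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.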
Evolution
Horizontal scaling is a practice convergence ahead of a product convergence. The insight — many short sessions beat one long one — is now widely shared; the tooling to make it frictionless is still mostly Claude-first.
Convergence: git worktrees as the filesystem-isolation primitive. All three tools treat git worktrees as the right mechanism for filesystem-level parallelism. Claude Code has built-in CLI support (--worktree / -w since v2.1.49, Feb 2026). Gemini and Codex rely on git worktree add + manual session launch. The outcome is identical; the keystrokes differ.
Emerging: handoff-file automation. A few practitioners are experimenting with auto-generated handoff files — the agent produces the Right now / Why / Next step / Context sections at session-end automatically, based on the session transcript. The output is uneven today; the human-written version is usually better. But as models improve at meta-cognition over their own sessions, expect automated handoff to become a viable shortcut.
When delegation goes wrong
Three failure modes to watch for:
Over-delegation. Spawning subagents for tasks the parent should have done directly. Signal: the parent’s context keeps growing from summaries rather than work-in-progress. Fix: the decision rule above — delegate research, not implementation.
Under-contracted delegation. Subagents that return too much. Signal: the parent gets a 2,000-token summary and has to re-summarize it. Fix: specify the deliverable format in the subagent prompt; treat it like an API call with a declared return shape.
Premature parallelization. Running five sessions on tasks that turn out to be coupled. Signal: sessions start invalidating each other’s analysis mid-work. Fix: decomposition gate. If you can’t write a one-sentence summary of what each session will produce without referencing the others, they are not parallelizable — sequence with handoffs.
Quick reference
- Horizontal scaling — many short sessions — beats vertical scaling — one long session — for most work.
- Two mechanics: subagent delegation (within-session context isolation) and parallel sessions (cross-session filesystem isolation via worktrees).
- Delegate when the task has high context cost (lots of files to read) and a compressible result (a short summary). Don’t delegate when the output is itself what the parent needs to keep working with.
- Subagent prompts are tighter than main prompts — specify the deliverable format explicitly.
- Parallel sessions require worktree isolation. Without worktrees, “parallel” becomes “overlapping edits.”
- Handoff files (CURRENT_WORK.md) bridge between sessions cheaply. Two minutes to write saves ten minutes of re-discovery.
- Claude Code has the most ergonomic surface for delegation + worktree workflows today. The practice ports to Gemini and Codex; the plumbing is more manual.
- Go wide by default. Go deep only when the task genuinely needs continuous reasoning across interconnected decisions.
Starting and refactoring projects
Projects have lifecycles. Agent collaboration works differently on a week-old greenfield repo than on a five-year-old brownfield codebase. This chapter is the protocols — day-one bootstrap for new projects, characterization-first onboarding for existing ones, incremental refactoring for anything mid-life.
Projects have ages. A week-old greenfield repo, a two-year-old codebase in stabilization, a ten-year-old legacy system with accumulated patterns — each demands a different collaboration strategy with the agent. The practices that work at inception break at scale; the practices that work on legacy code suffocate a prototype. This chapter is about matching the protocol to the phase.
Representation
Every project moves through recognizable phases. Agent collaboration works differently at each. The mistake is treating the agent as phase-agnostic — applying inception practices to legacy code (over-generation, surprises) or legacy practices to inception (premature rigor that kills exploration).
The three practical modes
Eight phase-mode combinations collapse cleanly into three protocols:
Greenfield inception. You start from nothing; the agent helps you build the initial shape. Goal: establish conventions and rigor-boundaries early so later phases inherit them.
Brownfield onboarding. You inherit an existing codebase; the agent must learn the codebase before editing it. Goal: avoid the biggest failure mode — the agent edits according to patterns that don’t match the real code.
Incremental refactoring. The codebase is alive and in use; you’re improving it without breaking it. Goal: each step commit-sized, always-working, reversible.
Why this is architectural-pattern, not stable-principle
The phase concept is old — software engineering has recognized lifecycle stages for decades. What’s new is the protocols — how to set up an agent-assisted greenfield in 2026, how to onboard an agent to a brownfield codebase in 2026. Those specifics will drift as tooling evolves (auto-scaffolding agents, automated briefing-doc generation, first-class characterization-test helpers are all in flight). The phase taxonomy is durable; the protocols have a half-life of a year or two.
Operation
Three protocols, each with a specific goal and a specific failure mode.
Protocol 1: Greenfield inception
Goal: establish convention and rigor-boundaries early.
Day one is the single highest-leverage day of a project’s lifecycle. The decisions you make now — which tests are mandatory, where the briefing doc sits, which anti-patterns are pre-blocked — compound for the project’s entire future. Three concrete steps:
1. Create CLAUDE.md (or GEMINI.md / AGENTS.md) before the first
   real feature.
   - Architecture: 10 lines. Stack, layers, non-obvious shape.
   - Conventions: 15 lines. Style, error handling, testing posture.
   - Constraints: 10 lines. Hard bounds.
   - Verification: 10 lines. Exact commands.
   - Phase declaration: "Phase: INCEPTION (until <date>).
     Tests manual OK. Graduation criteria: ..."
2. Configure minimum viable hooks.
   - PreToolUse on Bash: gate `git commit` on lint passing.
     (Tests not required yet; phase is inception.)
   - PostToolUse on Edit|Write: auto-format.
3. First commit.
   - Models the shape of future commits.
   - Sets the agent’s example for tone.
What you don’t do on day one: coverage thresholds, exhaustive type hints, production-grade error handling. Inception phase is for shape-finding, not rigor. The premature-rigor anti-pattern from Ch 11 (rigor that kills exploration) warns against exactly this kind of front-loading.
Protocol 2: Brownfield onboarding
Goal: agent learns the codebase before editing it.
The single biggest brownfield failure mode: the agent edits according to conventions that don’t match the actual code, because nobody told it what the actual conventions are. The codebase-is-the-curriculum principle from Ch 1 means the agent will learn from the codebase — either from your guided tour, or from whatever files it happens to read first. A guided tour is cheaper.
Session 1 — Discovery (plan mode; read-only).
"Read the top-level structure. Report:
- What each top-level directory is for.
- Which test runner / lint config / build tool is used.
- The three most-modified files in the last 90 days.
- Any CODE_OF_CONDUCT, CONTRIBUTING, or arch-docs that
describe the project's conventions.
Do not edit anything."
Session 2 — Conventions.
"Based on session 1, write a CLAUDE.md (or equivalent) that
captures: architecture, conventions, constraints, verification.
Flag claims you're uncertain about. Do not edit anything else."
Session 3 — Characterization tests.
"Pick the 3 highest-risk functions (most callers, most recent
bugs, most business-critical). Write characterization tests
that pin current behavior. Don't refactor. The goal is a
safety net, not cleanup."
Session 4+ — Productive edits.
Now the agent has a briefing doc reflecting reality, and a
test safety net for the most important functions. Productive
edits can begin without surprise.
This three-session onboarding feels expensive. It’s not — it’s three sessions that pay back dozens. The alternative is a single productive session that ships a subtle regression because the agent didn’t know pattern B was mandatory.
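Some of the discovery questions can be answered from git history before the agent reads a single file. The sketch below ranks the most-modified files from the output of a `git log --name-only` run; the helper name is hypothetical, and running the command is left to the caller.

```python
from collections import Counter
from typing import List, Tuple

# Command whose output this sketch parses (run it yourself in the repo).
GIT_LOG_CMD = ["git", "log", "--since=90 days ago", "--name-only", "--pretty=format:"]


def most_modified(log_output: str, top: int = 3) -> List[Tuple[str, int]]:
    """Count how often each path appears in `git log --name-only` output."""
    paths = [line.strip() for line in log_output.splitlines() if line.strip()]
    return Counter(paths).most_common(top)
```

Feeding the result into the Session 1 prompt gives the agent a starting list to confirm rather than discover.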
Protocol 3: Incremental refactoring
Goal: each step commit-sized, always-working, reversible.
The big-bang refactoring anti-pattern from Ch 11 is the specific failure this protocol prevents. The counterpattern is four steps, applied repeatedly to small scopes:
Extract. Pull one function or one module out of the existing shape. Same behavior, new location. Run tests; confirm nothing broke. Commit.
Test. Add characterization tests for the extracted unit. Focus on the interface, not the internals. Commit.
Harden. Improve the extracted unit — better types, tighter error handling, clearer naming. Tests continue to pass because the interface is pinned. Commit.
Promote. If other callers should use the new pattern, migrate them one at a time. Each migration is its own commit. Old pattern stays until migration is complete.
Four steps, at least four commits (Promote adds one per migrated caller), always-shippable state. If step 3 or 4 goes wrong, step 2’s tests catch it immediately; if they go right, the improvement lands without the codebase ever being broken.
Four failure layers of a refactor
When a refactor goes sideways, diagnose by layer — same pattern as the four-layer diagnosis from Ch 11, applied to refactoring specifically:
- Scope. Is the refactor too big? If the change touches more than ~5 files in a single step, it’s almost certainly not going to land cleanly. Decompose.
- Dependencies. Do the changed files have un-characterized downstream consumers? Run grep -r for the public interface; write characterization tests for each consumer before editing.
- Test coverage. Are the tests pinning the right invariants? A refactor can pass unit tests and still change observable behavior (performance, ordering, error timing). If you’re unsure, add tests for the invariant that worries you.
- Rollback. Can you revert cleanly if something breaks after deploy? If no — if this refactor is entangled with feature work or mixed-concern commits — stop. Finish the current feature, then do the refactor in isolation.
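The Scope layer is the easiest of the four to automate. As a rough sketch, assuming the input is the text printed by `git diff --name-only` and using the ~5-file threshold from the list above (the function name is hypothetical):

```python
MAX_FILES_PER_STEP = 5  # the rough threshold from the Scope layer above


def scope_ok(diff_name_only: str, limit: int = MAX_FILES_PER_STEP) -> bool:
    """True when a refactor step touches at most `limit` distinct files."""
    files = {line.strip() for line in diff_name_only.splitlines() if line.strip()}
    return len(files) <= limit
```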
Evolution
Lifecycle phases are an old concept; agent-specific protocols are new.
Convergence: characterization tests as the brownfield-onboarding primitive. The insight that you should pin current behavior before trying to change it is older than agentic coding, but agent-assisted refactoring makes it more important: the agent’s “helpful improvements” can alter observable behavior in ways a human reviewer would miss in code review but that a test run catches immediately. All three CLIs handle characterization-test generation well when prompted for it.
Emerging: automated briefing-doc generation. The two-step “discover the codebase → write a briefing doc reflecting what you found” pattern is an obvious automation target. Early versions exist — scripts that prompt an agent to walk a codebase and emit a draft briefing doc — but the outputs are uneven. Good briefing docs encode judgments about what’s important; agents can draft the factual shell (directory purposes, test tooling, build setup) but still need human editing for the convention + constraint sections. Expect this to mature substantially in 2026-2027.
Emerging: repo-level phase enforcement. Some teams are experimenting with CI hooks that verify phase-appropriate standards are met before allowing merge — no coverage check for inception phase, strict coverage check for production. The infrastructure is ad-hoc today; first-class product support would turn “phase-appropriate rigor” from a discipline into a tooling guarantee.
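One way such a hook might decide, sketched under assumptions: the phase names follow this chapter, while the thresholds and the coverage_gate function are illustrative and not taken from any existing tool.

```python
# Hypothetical minimum line-coverage thresholds per declared phase.
# None means "no coverage check"; the numbers are illustrative only.
PHASE_MIN_COVERAGE = {
    "inception": None,
    "stabilization": 60.0,
    "maintenance": 80.0,
    "legacy": 80.0,
}


def coverage_gate(phase: str, coverage_percent: float) -> bool:
    """Return True if a merge should be allowed for this phase and coverage."""
    if phase not in PHASE_MIN_COVERAGE:
        # Fail closed: an undeclared or misspelled phase blocks the merge.
        raise ValueError(f"unknown phase: {phase!r}")
    minimum = PHASE_MIN_COVERAGE[phase]
    return minimum is None or coverage_percent >= minimum
```

A CI job would read the declared phase from the briefing doc and the coverage figure from the test run, then call coverage_gate to decide whether to block.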
When none of these protocols fits
Edge cases worth naming:
- Exploration spikes. You’re not really starting a project; you’re building a throwaway to test an idea. Skip the briefing doc; skip the hooks; just use the agent. The output is the insight, not the code. If the spike survives, promote to greenfield inception; if it doesn’t, delete.
- Fork-and-diverge. You’re branching a codebase to take it in a different direction. Neither greenfield (existing history matters) nor classic brownfield (existing conventions are being deliberately violated). The protocol is hybrid — keep the old codebase’s briefing doc as a reference; write a new briefing doc describing what’s deliberately different.
- Agent-to-agent handoffs. You’re inheriting a codebase that was largely agent-written by a previous team. The existing briefing doc might actually be better than average (recent teams write good ones), but the code may encode more agent-generated patterns than human ones — characterization tests are more important here, not less.
Quick reference
- Projects have phases (inception → stabilization → maintenance → legacy) and modes (greenfield / brownfield). Match the protocol to the phase-and-mode.
- Greenfield inception: 60-minute day-one setup — briefing doc with explicit phase declaration, minimum viable hook set, first commit.
- Brownfield onboarding: three-session protocol — discovery (read-only), briefing-doc authorship, characterization tests on highest-risk code. Don’t edit productively until all three are done.
- Incremental refactoring: four steps — extract, test, harden, promote. Each commit-sized, always-working, reversible.
- Four failure layers of a refactor: scope, dependencies, test coverage, rollback. Diagnose by layer.
- Characterization tests pin current behavior. Write before refactoring, not after. They catch the unintended behavior changes agent-helpful “improvements” introduce.
- Phase-appropriate rigor: every failure recipe in Ch 11 maps to a phase-mismatch.
- Premature rigor kills exploration; missing rigor kills production. The phase declaration in the briefing doc is how you avoid both.
Anti-patterns and recovery
Every tool has characteristic misuse patterns. This chapter catalogs eight of them with concrete recovery procedures, and introduces a four-layer diagnostic framework for when the agent keeps failing — so you fix the right layer instead of the wrong one.
Three hours in, the agent is hallucinating function names and ignoring rules that were working fine at session start. What went wrong? More importantly: how do you escape? This chapter is reactive — the catalog of characteristic failures every practitioner eventually hits, with a concrete recovery guide for each.
Representation
Every agentic-coding tool has characteristic misuse patterns. They’re not model bugs or tool deficiencies — they arise from treating the agent as infinitely capable when it is in fact a bounded system with a finite context window, degrading attention, and no memory across sessions.
The four failure layers
When the agent keeps failing on the same problem across sessions, the temptation is to blame the model. Usually wrong. Work through the four layers in order — most failures resolve at Layer 1 or Layer 2.
- The prompt. Is the request ambiguous? Does it assume context the agent doesn’t have? Test: rephrase with explicit constraints and an example of what “correct” looks like. If the agent succeeds, you had a specification problem, not a model problem.
- The briefing doc (CLAUDE.md / GEMINI.md / AGENTS.md). Is there a conflicting rule? A rule too vague to enforce? A rule the agent interprets differently than you intended? Test: temporarily move the briefing doc aside (mv .claude/CLAUDE.md .claude/CLAUDE.md.bak) and retry. If the problem disappears, a rule is interfering. Add rules back incrementally to isolate.
- The codebase. Does the existing code teach the wrong patterns? The codebase-is-the-curriculum principle applies in reverse here: if 50 existing functions use pattern A and you want pattern B, the agent will default to A regardless of instructions. Test: ask the agent why it chose its approach. If it cites existing code, the codebase is teaching patterns your instruction is supposed to override; make the instruction explicit about the exception.
- The model. Genuine limitations exist: certain reasoning patterns, mathematical computations, or domain-specific conventions the model gets wrong consistently. Test: try max reasoning depth; try a different model; if all fail the same way, accept the limitation and design a workaround (verification step, hook, or manual review).
In practice, the large majority of persistent failures resolve at Layer 1 or Layer 2. Practitioners who jump to “the model is wrong” usually haven’t verified their instructions are unambiguous and conflict-free.
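The Layer 2 test ("add rules back incrementally to isolate") can be sketched as a small loop. Here task_fails is a hypothetical callback: in practice it stands for you re-running the same prompt with only the listed rules in the briefing doc and judging the result.

```python
from typing import Callable, Optional, Sequence


def find_interfering_rule(
    rules: Sequence[str],
    task_fails: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Add rules back one at a time; return the first rule whose addition makes the task fail."""
    active = []
    if task_fails(active):
        # Fails even with no rules loaded: the briefing doc is not the cause.
        return None
    for rule in rules:
        active.append(rule)
        if task_fails(active):
            return rule
    return None


# Usage with a fake predicate standing in for a real retry of the task.
rules = ["use tabs", "never edit tests", "prefer async"]
culprit = find_interfering_rule(rules, lambda active: "never edit tests" in active)
```

This linear version finds a single interfering rule. It will not detect two rules that only conflict in combination with a third added later; for that, keep going after the first hit.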
Operation
The catalog. Eight common anti-patterns, each with a Recovery box you can apply on the spot. Prevention links back to the chapter that covers it properly.
1. Context overload
Symptom: the agent ignores important rules. Instructions from the briefing doc are followed inconsistently. Behavior degrades over long sessions.
Root cause: the briefing doc has grown to contain every rule, convention, and preference accumulated over months. Attention dilutes across all of it, and critical rules get the same weight as minor preferences.
Prevention: hub-and-spoke architecture with path-scoped rule files. Keep the core briefing doc under ~300 lines. Non-negotiable standards become hooks, not advisory lines.
2. The kitchen-sink session
Symptom: performance degrades midway through a session. The agent repeats itself, forgets earlier decisions, or produces lower-quality output.
Root cause: multiple unrelated tasks in a single session. Each task’s context remains, consuming tokens that contribute nothing to the current task. Covered in depth in Ch 2 Context as Currency.
3. Over-correcting
Symptom: three or more rounds of “no, that’s not what I meant” followed by increasingly desperate attempts at the same task.
Root cause: each correction adds noise — the original error, your correction, the agent’s acknowledgment, its retry. After three rounds, the context is dominated by failure patterns.
Prevention: the two-failure rule (covered in Ch 4 Session Loop). After two failed corrections, clear and re-prompt with precision.
4. The permanent prototype
Symptom: code has been “working” for weeks but has no tests, no type hints, no error handling. “We’ll add those later” becomes permanent.
Root cause: the exploration phase has no defined exit criteria. Without an explicit transition, the code accumulates users and dependencies while remaining at prototype quality.
Prevention: explicit phase transitions in the briefing doc, with graduation checklists and target dates. Covered in Ch 5 Edit-Test-Commit.
5. The verification gap
Symptom: agent-generated code accepted without testing. Subtle bugs surface in production weeks later.
Root cause: AI-generated code looks correct. The syntax is clean, variable names reasonable, logic reads well. This appearance of correctness is precisely why verification is essential — the bugs are subtle, not obvious.
Prevention: always provide verification criteria in the prompt, tests with code (not after), hook-enforced test requirements. Full treatment in Ch 5 Edit-Test-Commit.
6. Infinite exploration
Symptom: you asked the agent to “investigate” something without scoping. It reads 40 files, filling the context window with exploration results that crowd out implementation.
Root cause: unbounded investigation prompts give the agent no stopping criteria. Every related file gets read; context that should be reserved for the actual task is consumed.
Prevention: scope investigations narrowly with explicit deliverables. Use subagents for exploration so research context doesn’t consume your main session.
7. Big-bang refactoring
Symptom: a large rewrite fails partway through, leaving the codebase broken.
Root cause: the scope of the rewrite exceeds what can be held in context. The agent loses track of the original behavior and introduces regressions.
Prevention: incremental refactoring — extract one function, write characterization tests, refactor, verify, commit. Each step independently shippable. Characterization-test mechanics are covered in Ch 5 Edit-Test-Commit; the full four-step refactoring protocol (extract → test → harden → promote) is covered in Ch 10 Starting and Refactoring Projects.
8. Trusting the briefing doc untested
Symptom: you added a rule to your briefing doc weeks ago. You assume it’s in force. It isn’t.
Root cause: a rule in the briefing doc is advisory — the agent is told to follow it, not prevented from violating it. Rules that are too vague, buried among competing rules, or contradicted by codebase patterns silently fail.
Prevention: test every new rule immediately after adding it.
Evolution
Anti-patterns are the most tool-agnostic territory in the book. The failure modes are properties of agent-as-bounded-system, not of any specific product.
Convergence: the four-layer diagnosis. Prompt → briefing doc → codebase → model is universal. The briefing-doc filename changes (CLAUDE.md / GEMINI.md / AGENTS.md); the layer it represents doesn’t. A team that has internalized the layered diagnosis spends less time debugging the agent, because it stops jumping to Layer 4 prematurely.
Emerging: automated anti-pattern detection. Some teams have started instrumenting their agent sessions to auto-flag anti-patterns in real time — “you’ve corrected the same issue three times, consider clearing” or “this session has touched 12 files; kitchen-sink warning.” The tooling is hand-rolled in 2026; expect first-class product support by 2027.
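A hand-rolled monitor of this kind might look like the sketch below. The class name and its methods are hypothetical; the two thresholds are taken from the example messages above and would need tuning for a real team.

```python
from dataclasses import dataclass, field


@dataclass
class SessionMonitor:
    """Tracks a few session signals and reports anti-pattern warnings."""
    correction_limit: int = 3   # "corrected the same issue three times"
    file_limit: int = 12        # "this session has touched 12 files"
    corrections: int = 0
    files_touched: set = field(default_factory=set)

    def record_correction(self) -> None:
        self.corrections += 1

    def record_file(self, path: str) -> None:
        self.files_touched.add(path)  # a set, so repeat edits count once

    def warnings(self) -> list:
        found = []
        if self.corrections >= self.correction_limit:
            found.append("over-correcting: consider clearing and re-prompting")
        if len(self.files_touched) >= self.file_limit:
            found.append("kitchen-sink: this session spans many files")
        return found
```

How the monitor learns about corrections and file edits depends on the tool; a wrapper script or a hook that calls record_correction and record_file is one plausible wiring.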
How these anti-patterns scale
These are personal anti-patterns. When you scale agent-assisted work to a team, new patterns emerge — and the solutions shift from personal discipline to organizational infrastructure. Shared briefing docs, team hook libraries, enforced phase transitions, code-review guidelines that surface verification gaps. That scope is beyond this chapter, but the pattern is: individual-level anti-patterns become team-level infrastructure requirements as you scale.
Quick reference
- The agent is a powerful but bounded system. Most anti-patterns arise from forgetting the bounds.
- Four-layer diagnosis: prompt → briefing doc → codebase → model. Most failures resolve at Layer 1 or 2 — check those before blaming the model.
- Context overload: keep the hub briefing doc under 300 lines; offload specifics to path-scoped rules.
- Kitchen-sink session: one session per logical task; clear between.
- Over-correcting: two-failure rule; after two rounds, clear and re-prompt with precision.
- Permanent prototype: declare phases explicitly; set graduation checklists.
- Verification gap: every agent-generated change earns verification criteria; no exceptions.
- Infinite exploration: scope investigations with deliverables; delegate unbounded research to subagents.
- Big-bang refactoring: incremental protocol; each step independently shippable.
- Untested briefing-doc rule: smoke-test every new rule before trusting it.
- Most anti-patterns are tool-agnostic — the recovery procedures port across Claude, Gemini, and Codex with only minor command changes.
Automation and pipelines
Headless agent runs, CI integration, and scheduled pipelines take the interactive session loop and remove the human from it. That removal changes everything — permissions, observability, failure modes, cost. This chapter covers the design patterns that make unattended agents safe and the failure modes that make them dangerous.
An interactive session has a human in the loop — watching, nudging, aborting. A headless run does not. The same agent that is pleasant and corrigible in a live terminal becomes a different animal when it executes at 3am against production branches with no one watching. This chapter is about that difference and the design patterns that make the difference survivable.
Representation
Every agent invocation has three axes: who authorizes it, who observes it, and what it can touch. In an interactive session these are collapsed — you are authorizing, observing, and bounding in real time. In a pipeline run, each axis is a separate design decision that must be specified in advance.
Interactive mode is the default mental model for most practitioners, which is why the first experience of moving an agent into CI is disorienting. The agent that paused for your approval on destructive actions now either fails closed (refuses) or — if misconfigured — fails open (runs unconfirmed destructive actions). The agent that asked clarifying questions now has no one to ask. The agent that self-corrected based on your feedback has no feedback signal.
The practitioner’s instinct — this is just the agent I already use, with one less window — is wrong in the same way that treating ssh as “just a terminal that’s far away” is wrong. The distance changes the failure modes. A mistake in an interactive session is visible and cheap; a mistake in a headless run may ship to production before anyone sees it.
Two shifts follow from the interactive/headless distinction.
The first: permissions move from dynamic consent to static policy. The interactive agent asks “may I run this bash command?” and the human answers in the moment. The headless agent either has permission declared in advance or does not have it. The expressive middle ground — “yes, but with this modification” — disappears.
The second: observability moves from synchronous to asynchronous. The interactive agent’s reasoning is visible as it happens; a mistaken turn can be interrupted mid-sentence. The headless agent’s reasoning is only visible in the log it emits, read later, after the action has already landed. Retrospective logs have to be structured for this; free-text traces that are fine to skim live are nearly unreadable in a CI artifact viewer.
Operation
Three deployment shapes cover nearly all practical automation: one-shot batch runs, CI-triggered agents, and scheduled or event-driven agents. Each has a distinct permission and observability profile.
Shape 1: one-shot batch runs
The simplest case — a script or Makefile target invokes the agent non-interactively to do one bounded task, then exits. Typical uses: scripted refactors across many files, generating boilerplate from a spec, updating a dataset of docs. No CI, no schedule — a human runs it on demand and reads the output.
The three tools converge on a -p / print / prompt flag for non-interactive mode and diverge on how permissions are granted. The underlying question each tool answers in its own vocabulary: what is the agent allowed to do without asking, given no one is here to ask?
Shape 2: CI-triggered agents
The agent runs inside CI (GitHub Actions, GitLab pipelines, Jenkins) in response to a repository event: a PR opened, an issue mentioned, a label applied. It has no UI; its output is posted back as a PR comment, a review, or a new commit.
CI-triggered agents surface two failure modes that rarely appear interactively:
Context starvation. The CI runner does not know what you know. It has the repo, it has the PR diff, it has the comment thread — it does not have your memory of last week’s discussion about why that file looks weird. The briefing doc (see Ch 7) is the primary answer to this: the same file that bootstraps an interactive agent bootstraps the CI agent.
Credential leakage. The agent in CI runs with real credentials — a GitHub token with write access to the repo, possibly deploy tokens, possibly cloud keys. If the agent can be persuaded to dump those credentials into a log, a PR comment, or a generated file, the leak is permanent. The mitigation is a scoped-token discipline: CI runs use tokens with the narrowest possible permissions, logs are scrubbed, and prompts from external contributors are treated as untrusted input.
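Log scrubbing can start as simply as the sketch below. The regular expressions are assumptions about common token shapes (GitHub-style prefixes, AWS-style access key IDs), not a complete list, and the scrub function is illustrative. Passing the exact secret values your CI job holds is more reliable than pattern matching alone.

```python
import re

# Illustrative patterns only; extend them for the credentials your pipeline uses.
SECRET_PATTERNS = [
    re.compile(r"gh[pousr]_[A-Za-z0-9]{36,}"),  # GitHub-style token shape
    re.compile(r"AKIA[0-9A-Z]{16}"),            # AWS-style access key ID shape
]


def scrub(text: str, extra_secrets=()) -> str:
    """Redact token-shaped substrings and any exact secret values passed in."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    for secret in extra_secrets:
        if secret:  # skip empty strings, which would match everywhere
            text = text.replace(secret, "[REDACTED]")
    return text
```

Run the scrubber on everything the agent emits before it reaches a log, a PR comment, or a generated file. Scrubbing is a backstop; narrow tokens remain the primary control.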
Shape 3: scheduled or event-driven agents
The agent runs on a cron schedule or in response to an external event (webhook, queue message, file drop). Nothing in the repo triggered it; the agent wakes up, reads some state, decides what to do, and acts.
This is the most powerful shape and the one with the highest variance in outcomes. Examples that work well: nightly stale-branch cleanup, weekly dependency-update PRs, monitoring-alert triage. Examples that go wrong: agents that rewrite arbitrary files on every run (drift), agents that retry a failing task forever (runaway cost), agents that notice their own prior runs and modify them recursively (meta-chaos).
Structured logging for headless runs
Interactive sessions let you skim the trace in real time and abort on anything weird. Headless runs require structured logs that survive into CI artifact viewers and can be queried after the fact. The minimum viable log has four things per run: the prompt, the final output (and all intermediate actions), the exit status, and the resource usage (tokens, wall-clock, tool calls). A human can reconstruct what happened from those four.
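The four-part record could be sketched as a dataclass serialized to one JSON object per line. The field names are illustrative assumptions, not any tool's log schema.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class RunRecord:
    prompt: str
    final_output: str
    exit_status: int
    actions: list = field(default_factory=list)  # intermediate actions, in order
    tokens: int = 0
    wall_clock_s: float = 0.0
    tool_calls: int = 0

    def to_json_line(self) -> str:
        """Serialize as a single-line JSON object so runs can be filtered and queried."""
        return json.dumps(asdict(self), sort_keys=True)


record = RunRecord(
    prompt="Update the changelog for this release",
    final_output="Added 3 entries to the changelog",
    exit_status=0,
    actions=["read CHANGELOG.md", "edit CHANGELOG.md"],
    tokens=1800,
    wall_clock_s=12.4,
    tool_calls=2,
)
line = record.to_json_line()
```

One object per line keeps each run self-contained: an artifact viewer can display it, and a later query can filter on exit_status or sum tokens across runs.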
The observability stack
A production automation setup needs three layers beyond the agent itself: a trigger layer (CI event, schedule, webhook), an execution layer (the headless agent run), and an output layer (PR comment, commit, Slack message, dashboard). Each layer can fail independently. The trigger may fire twice; the execution may partially succeed; the output may be malformed. Treat the stack the way you’d treat any distributed system: idempotent handlers, structured logs, bounded retries.
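A sketch of an idempotent handler with bounded retries, under assumptions: handle_event is a hypothetical name, and the in-memory seen set stands in for whatever durable store (database row, lock file, PR label) a real pipeline would use.

```python
from typing import Callable, Set


def handle_event(
    event_id: str,
    task: Callable[[], None],
    seen: Set[str],
    max_attempts: int = 3,
) -> str:
    """Run task at most once per event_id, retrying a bounded number of times."""
    if event_id in seen:
        return "skipped"          # duplicate trigger: do nothing
    for attempt in range(1, max_attempts + 1):
        try:
            task()
        except Exception:
            if attempt == max_attempts:
                return "failed"   # give up instead of retrying forever
            continue
        seen.add(event_id)        # mark done only after success
        return "done"
    return "failed"               # reached only when max_attempts < 1
```

The sketch retries immediately; a real handler would likely add a delay between attempts and record the failure in the structured log. The task itself still needs to be safe to re-run, since a partial success followed by a retry repeats whatever the first attempt already did.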
Evolution
Automation is where the field is changing fastest. Three axes worth tracking.
Emerging: agent-as-service deployment. Several practitioners are running agents as long-lived services rather than one-shot invocations: a daemon that receives queued tasks, executes them, and posts results. This blurs the line between “automation pipeline” and “internal platform.” Expect purpose-built deployment patterns (containerized agents, IAM-scoped service accounts, per-agent telemetry) to crystallize over the next 12 months. The current DIY patterns work; a convergent shape has not yet formed.
Emerging: policy engines for agents. A recurring theme in enterprise deployments (see Ch 14) is the move from per-tool allow/deny lists to declarative policy engines — OPA-style rules that express “agents from repo X can touch paths matching Y but not Z.” None of the CLI agents ships this natively in 2026; early adopters roll their own wrappers. Vendor-first-class support is likely within a release cycle or two.
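The kind of wrapper early adopters roll themselves might reduce to a check like this sketch. The repo name, the globs, and the may_touch function are all hypothetical; a real policy engine such as OPA expresses the same idea in its own rule language.

```python
from fnmatch import fnmatch

# Hypothetical policy: repo name -> allowed and denied path globs.
# Note that fnmatch's "*" also matches "/" so "src/*" covers nested paths.
POLICY = {
    "acme/web": {
        "allow": ["src/*", "tests/*"],
        "deny": ["src/secrets/*", "*.pem"],
    },
}


def may_touch(repo: str, path: str, policy=POLICY) -> bool:
    """Deny wins over allow; unknown repos and unmatched paths are denied."""
    rules = policy.get(repo)
    if rules is None:
        return False
    if any(fnmatch(path, pattern) for pattern in rules["deny"]):
        return False
    return any(fnmatch(path, pattern) for pattern in rules["allow"])
```

The two defaults matter more than the matching: deny overrides allow, and anything not mentioned is refused.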
Emerging: differential trust for prompts. In CI-triggered pipelines, some prompts come from trusted sources (your team’s commits, your own comments) and some come from untrusted sources (external PR authors, forked repos). The tools treat both the same in 2026 — a prompt is a prompt. Expect differentiation here: tagged prompts, source-aware refusal heuristics, separate tool permission sets per trust tier. The exfiltration failure mode described in Recovery above is the forcing function.
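Until tools differentiate natively, a pipeline can approximate trust tiers itself. In this sketch the tier names, the tool names, and both functions are hypothetical; how a wrapper would pass the resulting tool set to a given CLI is left open.

```python
# Hypothetical tool names and tiers; no CLI agent is assumed to ship this mapping.
TOOLS_BY_TIER = {
    "trusted": frozenset({"read", "edit", "run_tests", "comment"}),
    "untrusted": frozenset({"read", "comment"}),
}


def trust_tier(author: str, team_members: set, from_fork: bool) -> str:
    """Classify a prompt's source; anything from a fork or a non-member is untrusted."""
    if from_fork or author not in team_members:
        return "untrusted"
    return "trusted"


def allowed_tools(author: str, team_members: set, from_fork: bool) -> frozenset:
    """Return the tool set the agent may use for a prompt from this source."""
    return TOOLS_BY_TIER[trust_tier(author, team_members, from_fork)]
```

The classification is deliberately conservative: a team member working from a fork is still treated as untrusted, because the fork's content is not under the team's review.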
Quick reference
- Headless runs are a different surface from interactive sessions. The same binary, yes — but with permissions, observability, and failure modes all shifted.
- Three deployment shapes: one-shot batch, CI-triggered, scheduled/event-driven. Each has its own authorization profile.
- For one-shot batch: dry-run first, scope permissions, commit to a scratch branch, review before merging.
- For CI-triggered: defend against context starvation (briefing doc) and credential leakage (narrow tokens, scrubbed logs, untrusted-input discipline).
- For scheduled: enforce idempotence, budgets, and scope — scheduled agents that violate any of the three eventually cause incidents.
- Headless permission configuration is convergent across the three tools in shape but divergent in vocabulary: allow/deny lists vs. approval modes.
- Structured logs are mandatory for headless runs — prompt, output, exit status, resource usage. Free-text traces that worked live are unreadable in artifact viewers.
- Untrusted PR comments are the AI-era RCE endpoint. Narrow tokens, narrow tool allowlists, refusal heuristics in the system prompt.
- Emerging: agent-as-service deployment, policy engines, differential trust for prompts. Expect substantial tooling movement in the next 12–18 months.
- Durable principle: the interactive safety net is not portable to headless mode. Rebuild the net explicitly — static policy, structured observability, bounded blast radius.
Team patterns and governance
An agent used by one person is a productivity tool. An agent used by a team is shared infrastructure — which means shared context, shared norms, shared failure modes. This chapter covers the team patterns that survive scale: shared briefing docs, skill registries, agent-assisted review, and the governance that keeps shared agents from becoming shared liabilities.
Solo practice with an agent is a skill; team practice is a system. The same briefing doc that felt ergonomic as a personal scratchpad becomes contested ground when five authors edit it. The skill one person wrote to speed up their week becomes a liability when it’s running on every teammate’s machine and no one is sure who maintains it. The move from individual to team is where agentic coding acquires the governance problems that every shared infrastructure eventually accumulates.
Representation
A team using agents shares three things, whether or not they intend to.
The first is briefing context — the project-level documents (CLAUDE.md, GEMINI.md, AGENTS.md) that prime every agent on every machine with the same baseline. If the team does not explicitly maintain these, each engineer’s personal copy drifts, and agents begin giving different answers to the same question depending on whose workstation is running them.
The second is skill and command infrastructure — the slash commands, custom skills, and prompt templates that encode team-specific workflows. One engineer writes a /deploy-preview command; three weeks later four teammates are relying on it, none of them sure who owns it.
The third is policy — what the agents are allowed to do in the team’s shared environments. Individual practitioners set permissions for themselves; teams must set permissions as a contract, written down, reviewable, auditable.
The failure modes of team agent practice are specific and repeat across teams.
Briefing-doc bloat. Everyone adds their context without removing anyone else’s. After six months the briefing doc is 8,000 words, the agent reads most of it before doing anything, and the team is paying a context tax on every session without noticing. Nobody owns it because it belongs to everyone.
Skill-registry rot. A skill is useful for a month, then becomes obsolete when the underlying workflow changes. Nobody removes it; a new teammate joins, uses the stale skill, gets a result that matched reality two quarters ago but does not now. The skill never broke; the context it encoded became wrong without breaking.
Permission erosion. The CI policy starts strict. A teammate hits a legitimate case the policy blocks, a one-off exception is carved out, then another, then another. Six months later the policy has thirty carve-outs and no one can tell which are still justified. The net effect is that permissions reset to “allow everything” through accretion.
Review-cycle arbitrage. A teammate notices that agent-generated PRs get lighter review than human-written PRs — because the reviewer assumes the agent was careful, or because the PRs are often trivial, or because there’s no convention. Real bugs start landing through this channel. The agent didn’t get worse; the review discipline slipped for an entire category of change.
Operation
Four practices — shared briefing-doc discipline, skill registries, agent-assisted review, and policy-as-code — carry most of the weight.
Practice 1: briefing-doc discipline
The top-level briefing doc (Ch 7) is the highest-leverage shared artifact in the repo. Treat it the way the team treats any other load-bearing file — a CODEOWNERS entry, PR review required, a changelog comment in the file itself when major sections are added or removed.
The mistake to avoid: treating the briefing doc as documentation for humans who happen to also be using agents. That framing leads to explanatory prose, historical context, onboarding material — all of which is useful for humans and wasteful of agent context. The briefing doc is a briefing for the agent; human-oriented material belongs in linked docs the agent can fetch on demand.
Practice 2: team skill and command registries
Personal skills live in ~/.claude/skills/, ~/.gemini/commands/, and similar per-user locations. Team skills need to live somewhere the team can see, review, and version. Two shapes work well in practice.
All three tools have converged on the same basic architecture: repo-committed project-level directory for shared skills, user-level directory for personal skills. The specific path names differ; the two-tier pattern is convergent. If you design your team registry around the two-tier model, you are betting on a structural pattern that will outlast the specific paths.
A team skill registry also needs deprecation machinery. A skill that worked six months ago may be actively misleading now; a registry with no way to mark “retired” silently rots. Minimum: a retired/ subdirectory and a retired-at: field in the skill’s frontmatter, plus a quarterly review that moves unused skills there.
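A quarterly review could begin with a scan like the one sketched here. It assumes skill files open with a simple key: value frontmatter block between --- lines; the parser handles only that shape, not full YAML, and the function names are illustrative.

```python
def parse_frontmatter(text: str) -> dict:
    """Parse simple `key: value` lines between leading '---' fences (not full YAML)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # no closing fence: treat as having no frontmatter


def split_registry(skills: dict) -> tuple:
    """Split {name: file_text} into (active, retired) lists of names."""
    active, retired = [], []
    for name, text in sorted(skills.items()):
        bucket = retired if parse_frontmatter(text).get("retired-at") else active
        bucket.append(name)
    return active, retired


skills = {
    "deploy-preview": "---\nname: deploy-preview\n---\nSteps...",
    "old-release": "---\nname: old-release\nretired-at: 2025-11-01\n---\nSteps...",
}
active, retired = split_registry(skills)
```

Anything in retired that still sits outside the retired/ subdirectory is a candidate to move during the review.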
Practice 3: agent-assisted code review
The convergent pattern across the three tools: a human opens a PR, an agent runs automatically via the tool’s GitHub Action, the agent posts a review with comments, a human reviewer uses the agent’s review as input rather than replacement.
The governance questions around agent-assisted review are where teams get themselves into trouble.
Who reviews the agent’s output? If the agent posts fifty inline suggestions on a PR, the human reviewer either reads all of them (slow) or skims (misses things). The working pattern: the agent’s top-level summary is mandatory-read; inline suggestions are advisory; a human reviewer is still required for merge approval regardless of the agent’s verdict.
Does the agent get merge authority? Some teams let agent-approved PRs auto-merge if they pass CI. This works for narrowly scoped changes (dependency bumps, generated-code updates) and fails for anything else. The boundary is clear in principle and fuzzy in practice, so err toward requiring human approval.
What’s recorded? Agent reviews leave an audit trail in PR comments, but a reviewer’s agreement with the agent often doesn’t. If a human reviewer LGTMs a PR after reading the agent’s detailed review, the reviewer’s sign-off is as binding as any other — but the agent’s review contains the substance. Teams should be explicit about this: the agent’s review is reference material; the human approval is the authorization.
Practice 4: policy-as-code
For teams operating in CI (see Ch 12) or production environments, agent permissions are shared policy. The working pattern: check the policy configuration into the repo, review it via the same PR flow as code, and apply the same test-before-merge discipline to policy changes as to code changes.
The minimum: the tool’s settings/permissions file lives in the repo (settings.json, .claude/settings.json, or equivalent), is owned by someone in CODEOWNERS, and policy changes require the same review as code changes. The maximum: a dedicated policy engine (OPA, Cedar) that wraps the agent with declarative rules (“agents from branch X can touch paths matching Y, can call tools Z”). Few teams are at the maximum in 2026; most are at some point along the spectrum.
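One small check that fits the minimum end of the spectrum: give every carve-out an expiry date and have CI list the ones that have lapsed. The record format, the exception names, and the expired_exceptions function are hypothetical; they are not part of any tool's settings schema.

```python
from datetime import date


def expired_exceptions(exceptions: list, today: date) -> list:
    """Return names of carve-outs whose 'expires' date (ISO format) is before today."""
    return [
        item["name"]
        for item in exceptions
        if date.fromisoformat(item["expires"]) < today
    ]


# Hypothetical carve-outs, as they might be recorded next to the permissions file.
EXCEPTIONS = [
    {"name": "allow-docker-push", "expires": "2026-03-01"},
    {"name": "allow-db-migrate", "expires": "2026-09-30"},
]

stale = expired_exceptions(EXCEPTIONS, date(2026, 6, 1))
```

A lapsed exception then has to be renewed through a reviewed PR or removed, which keeps carve-outs from accumulating unnoticed.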
Evolution
Team patterns move more slowly than feature-surface details. Three axes worth tracking.
Emerging: skill-and-policy version pinning. Teams are beginning to pin the versions of skills, briefing-doc conventions, and policies — the same way they pin dependency versions. A PR that changes the briefing doc increments a version; CI jobs can select a specific version. This mirrors the direction software packaging went two decades ago; agent infrastructure is likely to follow. 18–24 months before mature tooling.
Emerging: role-scoped agents. A few teams have begun running multiple agent profiles — a “junior-dev agent” with narrow write permissions and thorough explanations, a “senior-dev agent” with broader permissions and terser output — and routing tasks to the appropriate profile. This is role-based access control applied to the agent itself. No tool ships first-class support in 2026; it’s a manual configuration pattern that will likely become a product feature.
Emerging: agent-generated changelog discipline. When agents author significant PRs, the question of attribution surfaces: who is accountable for the change? The emerging convention — the human who dispatched the agent is accountable, the agent’s involvement is logged in commit trailers — is not yet universal. Expect explicit commit-trailer conventions and org-level policy to settle over the next year.
Quick reference
- Team agent practice shares three things whether the team intends to or not: briefing context, skill/command infrastructure, and permission policy. Un-owned shared artifacts drift.
- Four failure modes repeat across teams: briefing-doc bloat, skill-registry rot, permission erosion, review-cycle arbitrage.
- Briefing doc governance: CODEOWNERS ownership, append-with-justification, quarterly trim, diff audit. Treat as a load-bearing file.
- Team skills live in a two-tier registry: personal (per-user directory) and team (repo-committed, reviewed). Promotion from personal to team requires intent.
- Agent-assisted review is convergent across tools. The pattern is agent reviews alongside human, not agent replaces human. Explicit review discipline prevents two-tier quality erosion.
- Policy-as-code: check permission configs into the repo; review them as code; mark broad permissions with expiration comments.
- Durable architecture: top-level briefing doc + two-tier skill registry + agent-assisted review + policy-as-code. Paths and flag names will change; the architecture is more stable.
- Emerging: skill-and-policy version pinning, role-scoped agents, explicit agent-attribution in commits. 12–24 months of tooling movement.
- Governance debt compounds faster than individual productivity gains. A team that saves an hour per engineer per week but ships a credentials leak via an un-owned policy file did not come out ahead.
Enterprise deployment
Enterprise deployment of agentic tools adds constraints that personal and team use never surface: regulatory compliance, data residency, audit logging, air-gapped networks, procurement risk. This chapter covers the architectural patterns that make CLI agents acceptable to enterprise constraints — and the design choices that become load-bearing when they do.
An agent that works for one engineer is a tool. An agent that works for a team is infrastructure. An agent that works inside a regulated enterprise is a compliance surface — a place where data flows have to be documented, access has to be audited, procurement has to be survived. The patterns that get an agent across the enterprise threshold are not about the agent itself; they are about the envelope the agent runs inside.
Representation
Enterprise deployment adds three classes of constraint that personal and team use do not surface.
Regulatory. Financial services, healthcare, defense, critical infrastructure: each brings its own compliance regime (SOC 2, HIPAA, FedRAMP, PCI-DSS, various regional equivalents). The regimes differ in specifics; they converge on a handful of requirements: data residency (where data may be stored and processed, and whether it may leave the network), access control (who may invoke what), audit logging (what was done, by whom, when, with what authorization), change control (how modifications to the system are approved).
Operational. Enterprise environments are usually not cleanly internet-connected. Some production networks are air-gapped. Some allow egress only to specific approved endpoints. Some route all outbound through inspection proxies that are intolerant of streaming responses. An agent that assumes direct access to a vendor API over the public internet does not run in these environments without modification.
Procurement. The agent tool must pass vendor review: security questionnaires, SOC 2 reports, DPIAs (data protection impact assessments), model card reviews. The tool vendor’s security posture becomes part of the deployment’s security posture. This is a months-long process at most enterprises; the engineering work must anticipate it.
The mental model to resist: we’ll use the agent the way our engineers already do, just at corporate scale. The mental model that works: we are designing the envelope first, and the agent is a component inside it. Framing the problem as agent-first tends to produce a deployment the compliance team rejects on first review; framing it as envelope-first lets the compliance team sign off on the envelope’s guarantees and then lets engineering choose and upgrade the agent inside those guarantees over time.
Operation
Five components — model endpoint, identity, audit logging, network policy, and change control — carry nearly all the compliance weight. Each maps onto distinct choices all three tools have converged on supporting.
Component 1: the model endpoint
The single largest compliance question is where the model runs and whether the prompts and completions leave the network.
Three deployment topologies cover most enterprise cases:
- Vendor-managed, region-scoped. The tool talks to the vendor’s public API, but with data residency committed (prompts and completions stay in a specific region, not stored for training). Easiest to set up; requires the vendor’s regional residency guarantees to satisfy compliance.
- Cloud-partner managed. The tool talks to AWS Bedrock, Google Vertex AI, or Azure OpenAI, so the enterprise’s existing cloud provider hosts the inference endpoint under the contract the enterprise already has. Prompts and completions stay inside a cloud provider boundary the enterprise has already approved; billing flows through the cloud account.
- Self-hosted inference. The tool talks to a model running on enterprise-controlled infrastructure, often a locally-hosted open-weights model or a vendor-approved on-prem deployment. Maximum control; maximum operational burden.
Most enterprises start with the cloud-partner topology — it matches their existing cloud security posture and does not require standing up inference infrastructure. Self-hosted is a fallback for the most sensitive workloads (classified, deeply-regulated healthcare, certain national-security contexts).
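A minimal sketch of how the three topologies might be assigned per workload class. The workload class names and the table itself are hypothetical; a real deployment would draw them from the enterprise’s own data-classification scheme. Falling back to the cloud-partner topology is also an assumption, chosen because it is the common starting point described above.

```python
from enum import Enum


class Topology(Enum):
    VENDOR_REGION_SCOPED = "vendor-managed, region-scoped"
    CLOUD_PARTNER = "cloud-partner managed"
    SELF_HOSTED = "self-hosted inference"


# Hypothetical workload classes, for illustration only.
TOPOLOGY_BY_WORKLOAD = {
    "classified": Topology.SELF_HOSTED,
    "regulated-healthcare": Topology.SELF_HOSTED,
    "internal-code": Topology.CLOUD_PARTNER,
    "open-source": Topology.VENDOR_REGION_SCOPED,
}


def choose_topology(workload_class: str) -> Topology:
    """Return the topology for a workload class, defaulting to cloud-partner."""
    return TOPOLOGY_BY_WORKLOAD.get(workload_class, Topology.CLOUD_PARTNER)
```

Keeping the mapping as data rather than logic means the compliance team can review one table instead of reading code paths.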
Component 2: identity and access
Enterprise identity is almost never “whoever is logged into this workstation.” It is corporate SSO (SAML, OIDC) that resolves to an identity with group memberships that map to permissions.
The CLI-agent surface is not usually where this integration lives. Instead, the model endpoint (Bedrock, Vertex, Azure OpenAI) is what integrates with corporate identity — IAM roles, service accounts, federated identity — and the CLI authenticates to the endpoint using whatever credential the enterprise identity system provides. The agent tool itself inherits the identity of the process invoking it.
The practical consequence: the CLI tool does not need to know about SAML. It needs to know about environment variables or local credentials that the identity infrastructure has provisioned. The integration point between corporate identity and the agent is the cloud provider’s IAM, not the agent binary.
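One way to picture that integration point: the tool looks only at what the identity infrastructure has already placed in the environment. The sketch below checks for environment variables that the cloud SDKs commonly read. The backend labels and the choice of variables are illustrative assumptions, not a complete or authoritative list.

```python
import os
from typing import Mapping, Optional

# Variables the cloud SDKs commonly read. Which ones are actually provisioned
# depends on the enterprise identity setup; this table is an assumption.
CREDENTIAL_HINTS = {
    "aws-bedrock": ("AWS_PROFILE", "AWS_WEB_IDENTITY_TOKEN_FILE"),
    "google-vertex": ("GOOGLE_APPLICATION_CREDENTIALS",),
    "azure-openai": ("AZURE_CLIENT_ID",),
}


def detect_provisioned_backend(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first backend whose credential variables are set, else None."""
    for backend, variables in CREDENTIAL_HINTS.items():
        if any(env.get(name) for name in variables):
            return backend
    return None
```

Nothing here mentions SAML or OIDC: by the time the agent process starts, federation has already happened upstream.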
Component 3: audit logging
Every regulated environment requires an audit log of who did what when. For agent tools, this splits into two distinct log streams.
The first: agent-invocation logs. Every time the agent runs, a record exists capturing who invoked it, what the prompt was, what tools it called, what it produced, how long it took. These logs belong in the enterprise’s SIEM (security information and event management system), not just the agent tool’s local trace file.
The second: model-request logs. Every call from the agent to the model endpoint produces a record at the endpoint side. Cloud-provider endpoints (Bedrock, Vertex, Azure OpenAI) can emit these records to the enterprise’s logging infrastructure, though request-level logging usually has to be switched on explicitly. Together the two log streams let auditors reconstruct the chain: a specific engineer invoked an agent, which made specific model calls, which produced specific outputs, which resulted in specific code changes.
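The chapter’s quick reference names a correlation ID as the link between the two streams. A sketch of that join, assuming both streams carry a `correlation_id` field; the field names and record shapes are hypothetical, not any vendor’s log schema.

```python
from collections import defaultdict


def reconstruct_audit_chain(invocation_logs, model_request_logs):
    """Group model-request records under the agent invocation that caused them."""
    requests_by_id = defaultdict(list)
    for record in model_request_logs:
        requests_by_id[record["correlation_id"]].append(record)

    chain = []
    for invocation in invocation_logs:
        cid = invocation["correlation_id"]
        chain.append({
            "invoked_by": invocation["user"],
            "correlation_id": cid,
            # An empty list here is an audit gap worth investigating.
            "model_requests": requests_by_id.get(cid, []),
        })
    return chain
```

An invocation with no matching model requests, or a model request with no matching invocation, is exactly the broken chain an auditor will ask about.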
Component 4: network policy
Air-gapped and restricted-egress environments put the final constraint on agent deployment. Two common patterns:
Egress allowlist. The only endpoints the agent may reach are explicitly allowlisted — typically the enterprise’s chosen model endpoint plus internal repositories. This is straightforward to configure as long as the tool’s default behavior does not require unexpected egress (telemetry endpoints, autoupdate checks, documentation fetches). Enterprise-friendly tools explicitly document all outbound endpoints so the allowlist can be authored precisely.
Air-gapped. No internet egress at all. Self-hosted model endpoint is mandatory. All documentation, skill registries, updates must be delivered through the enterprise’s existing internal distribution channels. Air-gapped deployment is dramatically more work than restricted-egress; most enterprises run air-gapped only for specific workload classes, not as the default.
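The egress-allowlist pattern reduces to a host check at the network boundary. A minimal sketch follows; the two hosts are placeholders for an enterprise’s chosen model endpoint and internal repository, and in practice the check would normally be enforced by a proxy or firewall rather than in application code.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: one model endpoint plus an internal repository host.
EGRESS_ALLOWLIST = {
    "bedrock-runtime.us-east-1.amazonaws.com",
    "git.internal.example.com",
}


def is_egress_allowed(url: str, allowlist=EGRESS_ALLOWLIST) -> bool:
    """Allow only HTTPS requests whose host is exactly on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in allowlist
```

Exact host matching is deliberate: wildcard suffixes make the allowlist easier to write and much harder to audit.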
Component 5: change control
Updates to the agent tool — new versions, new skills, new policies — are change-control events. In regulated environments, change control is a documented process: proposed change, risk assessment, approval, staged rollout, rollback plan.
The concrete manifestation: the agent tool’s version is pinned in configuration, updates go through the same pipeline as any other tooling update, and the change-control document records the specific behavior changes between versions. Vendors’ changelog discipline becomes part of the enterprise’s change-control machinery — a vendor that ships cryptic release notes (“bug fixes and improvements”) makes change-control impossible.
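A sketch of the pin check that a rollout pipeline might run before accepting an update. It assumes purely numeric dotted versions of equal length (for example `2.9.0`); the function names and the three result labels are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Turn '2.9.1' into (2, 9, 1) for simple numeric comparison."""
    return tuple(int(part) for part in text.split("."))


def check_pin(installed: str, pinned: str) -> str:
    """Classify the installed version relative to the change-controlled pin."""
    have, want = parse_version(installed), parse_version(pinned)
    if have == want:
        return "ok"
    if have[0] != want[0]:
        return "major-drift"  # re-evaluate under full change control
    return "minor-drift"      # still needs a change record before rollout
```

Either kind of drift means the tool on the workstation no longer matches the version the change-control document describes.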
Evolution
Enterprise patterns are among the slowest-moving parts of agentic coding, because compliance regimes change on the scale of years, not quarters. That said, three axes are in active motion.
Emerging: agent identity federation. In 2026, agent invocations inherit the identity of the human or service account that invoked them. A finer-grained model — the agent itself has an identity, with its own permissions that compose with the invoker’s — is beginning to appear in research deployments. This matters for enterprise governance because it lets agents be audited as actors independent of their invokers. Expect vendor support for explicit agent identity within 18–24 months.
Emerging: regulated-industry reference architectures. Anthropic, Google, and other vendors are beginning to publish reference architectures for specific regulated industries — HIPAA-compliant deployment guides, FedRAMP-ready configurations, PCI-DSS architectures. In 2026 these are still sparse. Within 18 months, expect well-documented reference architectures for the major compliance regimes that reduce the custom-engineering burden of enterprise deployment.
Emerging: policy engines for agents. The same direction flagged in Ch 12 and Ch 13 applies here: enterprises will move from per-tool settings files to declarative policy engines (OPA, Cedar, or custom). In enterprise contexts the forcing function is stronger — compliance auditors want to see the policy expressed declaratively, reviewable, versioned, tested. Tool-specific permission files do not satisfy this; a policy layer does.
Quick reference
- Enterprise deployment adds three classes of constraint: regulatory, operational, procurement. Each shapes the deployment more than the agent tool itself does.
- The envelope — model endpoint, identity, audit logging, network policy, change control — is where most of the engineering lives. The agent tool is a replaceable component inside it.
- Three model-endpoint topologies: vendor-managed region-scoped, cloud-partner managed (most common starting point), self-hosted (most sensitive workloads). Different workload classes often warrant different topologies.
- Identity flows through the cloud provider’s IAM, not the agent binary. The agent inherits the identity of the process that invoked it.
- Audit logs split into two streams — agent-invocation and model-request. A correlation ID ties them together. Without the ID the audit chain breaks.
- Network policy: restricted-egress is manageable if the vendor documents all outbound endpoints; air-gapped is dramatically more work and is usually reserved for specific workload classes.
- Change control: pin the tool version, document behavior at pinned version, re-evaluate on major version bumps. Enterprise runs on a slower clock than the vendor’s release cadence.
- Cloud-partner backend support is now table stakes across the three tools. Air-gapped maturity, skill-distribution in air-gapped environments, and policy engines are the active divergences.
- Emerging: agent identity federation, regulated-industry reference architectures, policy engines. 18–24 months of substantial movement.
- Durable principle: design envelope-first, not agent-first. The envelope’s guarantees are what the compliance team signs off on; the agent inside is the part that can change over time.
How agentic practices evolve
Agentic-coding practice is a moving target — tools ship quarterly, conventions shift, claims that were true six months ago are now wrong. This chapter is the meta-discipline that keeps your practice current: source tiering, volatility classification, convergence tracking.
Every chapter in this book is a bet that some part of what it says will still be true in three years. Most bets will land; some will not. The difference between a book that ages well and one that becomes embarrassing fast is not the quality of the original writing — it is the methodology applied continuously after publication. This chapter is that methodology, named explicitly.
Representation
Agentic coding is a fast-moving field embedded in an even faster-moving ML infrastructure. Tool releases ship monthly, conventions crystallize and dissolve quarterly, specific commands rename without warning. A book that tries to pin down “best practices” in such a field has two failure modes: either it stays vague (no concrete advice; safe but useless) or it gets specific and dates fast (useful for six months, embarrassing after).
The resolution is not to choose — it is to stratify. Not every claim ages at the same rate. A claim about transformer attention dynamics is decades-durable. A claim about a specific --flag is quarterly-volatile. Treating them identically is the mistake. The methodology in this chapter is how to treat them differently.
Three levers, working together, make stratified practice possible.
Lever 1: source tiering
Every claim you make rests on some source. The tier of that source determines how much weight the claim should carry and how often it needs re-verification.
The tiering this book uses:
- T1-official — vendor documentation, release notes, engineering team posts. Highest trust for factual claims about a tool’s behavior. Verify on major releases.
- T2-release-notes — product-announcement blog posts, conference talks, Changelog feeds. Trustworthy for intent and feature-availability claims; less reliable for edge-case behavior. Verify quarterly.
- T3-practitioner — respected community writing (e.g. Gwern on sidenote UX, Kleppmann on data systems). Trustworthy for pattern and principle claims that the authors have tested. Verify annually.
- T4-conjecture — tweets, speculation, claims without citations. Use as pointers to investigate, not as support. Verify before relying on.
The tier is not a comment on the source’s intelligence or honesty. A brilliant Twitter thread is still T4 until someone does the work to elevate it. A bland vendor doc is still T1 because the vendor is the definitive source for their own tool’s behavior.
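The tier-to-cadence mapping above can be written down directly. In this sketch the day counts are an assumption (91 days for quarterly, 365 for annual). T1 sources return no calendar date because they are re-verified on major releases, and T4 sources are due immediately because they must be verified before each use.

```python
from datetime import date, timedelta
from typing import Optional

# Review intervals in days per tier. None means event-triggered, not calendar.
REVIEW_INTERVAL_DAYS = {
    "T1-official": None,      # verify on major releases
    "T2-release-notes": 91,   # verify quarterly
    "T3-practitioner": 365,   # verify annually
    "T4-conjecture": 0,       # verify before relying on it
}


def next_review(tier: str, last_verified: date) -> Optional[date]:
    """Return the date a source is next due, or None if event-triggered."""
    interval = REVIEW_INTERVAL_DAYS[tier]
    if interval is None:
        return None
    return last_verified + timedelta(days=interval)
```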
Lever 2: volatility classification
Every claim has a half-life. Classify it explicitly:
- stable-principle — rooted in properties of the substrate (transformer attention, information theory, software engineering fundamentals). Rarely changes. Example: “context degrades non-linearly.”
- architectural-pattern — the shape of a convention that one or more tools have crystallized. Changes on major versions. Example: “top-level briefing doc re-injected on every turn.”
- feature-surface — specific commands, flags, file paths, integration points. Changes on minor versions or even patch releases. Example: “Claude Code uses /compact with optional focus argument.”
Each class needs a different review cadence. stable-principle chapters can be left alone for years; feature-surface claims need quarterly audit. Mixing classes within a chapter without distinguishing them is how otherwise-good books rot.
Lever 3: convergence / divergence tracking
Specific tool claims date unpredictably. Patterns across tools date much more slowly. When three CLI-agents converge on a pattern (e.g. briefing doc at project root), the pattern is signaling that it’s close to an architectural minimum — practitioners can safely bet on it. When they diverge (e.g. hook system depth), the divergence is signaling an open design space — practitioners should assume continued movement.
Convergence/divergence tracking turns a disorganized shelf of tool releases into a signal-extraction problem. The signal: which primitives have settled? Which are still contested? The book’s own changelog data is a concrete instance of this methodology — per-tool release timelines joined against a shared pattern registry to surface when multiple tools land the same primitive.
Operation
The methodology is cheap to apply once internalized. Four concrete routines.
Routine 1: tier a claim when you cite it
Every external claim that goes into your writing earns a tier tag — at authoring time, not during a later audit. This forces the evaluation while the source is fresh in your mind and enables downstream filters (the book’s <Citation> component renders the tier as a badge so readers can calibrate trust at the point of use).
# sources/manifest.yaml: one entry per cited source
- id: anthropic-context-engineering-2026
  url: https://docs.anthropic.com/en/docs/context-engineering
  tier: T1-official
  captured_at: 2026-02-10T14:32:00Z
  title: "Context Engineering Best Practices"
  author: "Anthropic"
  perma_cc: https://perma.cc/XXXX-YYYY
- id: gwern-sidenote
  url: https://gwern.net/sidenote
  tier: T3-practitioner
  captured_at: 2026-04-17T13:00:00Z
  title: "Sidenotes In Web Design"
  author: "Gwern Branwen"
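A manifest is only useful if its entries stay well-formed, so a small check at authoring time helps. The sketch below validates one entry that has already been loaded into a dict. Treating a missing `perma_cc` link as a problem is an assumption of this sketch, not a stated rule of the book.

```python
from datetime import datetime

VALID_TIERS = {"T1-official", "T2-release-notes", "T3-practitioner", "T4-conjecture"}
REQUIRED_FIELDS = ("id", "url", "tier", "captured_at", "title", "author")


def validate_entry(entry: dict) -> list:
    """Return a list of problems found in one manifest entry (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if entry.get("tier") not in VALID_TIERS:
        problems.append(f"unknown tier: {entry.get('tier')!r}")
    try:
        # Older Pythons reject a trailing "Z" in fromisoformat(), so swap it.
        stamp = str(entry.get("captured_at", "")).replace("Z", "+00:00")
        datetime.fromisoformat(stamp)
    except ValueError:
        problems.append("captured_at is not an ISO 8601 timestamp")
    if "perma_cc" not in entry:
        problems.append("no perma_cc archive link")
    return problems
```

Run against the second entry above, a check like this would flag the absent archive link, which is the kind of gap that is cheap to fix at citation time and expensive to fix after the source has changed.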
Routine 2: volatility-classify every chapter / section
Every major section of your writing earns a volatility flag. In this book, the flag lives in frontmatter (volatility: stable-principle | architectural-pattern | feature-surface). Downstream effects:
- Freshness stamps are required on volatile content; optional on stable content.
- Quarterly audit queues draw from the volatile buckets first.
- Readers can see the volatility class at the top of every chapter and calibrate confidence accordingly.
The mixed-class chapter is the dangerous case — a chapter that’s mostly stable-principle but with a few feature-surface claims sprinkled in. The fix: isolate the volatile claims into their own sections (or callouts), flag them explicitly, and review them on the volatile cadence while leaving the stable material alone.
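The mixed-class case can be detected mechanically if sections carry their own flags. This sketch assumes a per-section volatility mapping, which is a hypothetical extension of the chapter-level frontmatter flag, and lists the sections that are more volatile than the chapter claims to be.

```python
# Ordered from slowest-changing to fastest-changing.
VOLATILITY_ORDER = ["stable-principle", "architectural-pattern", "feature-surface"]


def find_mixed_sections(chapter_volatility: str, section_volatility: dict) -> list:
    """List sections more volatile than the chapter-level frontmatter flag."""
    chapter_rank = VOLATILITY_ORDER.index(chapter_volatility)
    return [
        name
        for name, volatility in section_volatility.items()
        if VOLATILITY_ORDER.index(volatility) > chapter_rank
    ]
```

Any section it returns is a candidate for isolation into its own callout with its own freshness stamp.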
Routine 3: track tool-pattern adoption
Every tool’s releases feed a per-tool changelog. Every pattern feeds a shared registry. Joined, they produce the convergence timeline.
# changelog/tools/claude-code.yaml: per tool
tool: claude-code
versions:
  - version: "2.9"
    date: 2026-03-27
    changes:
      - pattern: briefing-docs
        kind: changed
        note: "CLAUDE.md hub-and-spoke documented as default architecture"

# changelog/patterns.yaml: shared across tools
- id: plan-mode
  name: "Plan mode"
  category: safety
  convergence_date: 2026-03-05  # date all three tools landed it
- id: subagents
  name: "Subagent delegation"
  category: scale
  convergence_date: null  # not yet converged
A dashboard — rendered programmatically from these YAML files — surfaces the current state: which patterns have converged, which are in flight, which are divergent. The dashboard is the meta-artifact of the entire methodology: it makes the evolution of the field visible rather than implicit.
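The join behind such a dashboard is small. The sketch below computes a pattern’s convergence date as the date the last of the tracked tools adopted it. The tool identifiers other than `claude-code` and every adoption date are illustrative placeholders, not real release data, and the input is an in-memory dict rather than parsed YAML.

```python
from datetime import date

TOOLS = ("claude-code", "gemini-cli", "codex-cli")

# Illustrative adoption dates only; not real release data.
ADOPTIONS = {
    "plan-mode": {
        "claude-code": date(2025, 6, 1),
        "codex-cli": date(2026, 1, 15),
        "gemini-cli": date(2026, 3, 5),
    },
    "subagents": {"claude-code": date(2025, 7, 1)},
}


def convergence_date(pattern_id: str):
    """Return the date the last tool adopted the pattern, or None if any has not."""
    adopted = ADOPTIONS.get(pattern_id, {})
    if not all(tool in adopted for tool in TOOLS):
        return None
    return max(adopted[tool] for tool in TOOLS)
```

A `None` result corresponds to `convergence_date: null` in the registry: the pattern is either in flight or divergent.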
Routine 4: scheduled drift audits
Three cadences, each catching a different class of drift:
Quarterly audit — highest volatility content. Walk every chapter flagged feature-surface; verify command names, flag syntax, file paths against current tool docs. This is the most expensive cadence, ~half a day per 16-chapter book.
On-release audit — when a tool ships a significant update, search for chapters that reference that tool’s specific behavior and re-verify those. Triggered by watching the tool’s release feed, not by calendar.
Annual audit — stable-principle content. Check that the principles themselves haven’t been superseded by new understanding (rare but happens). Walk the book’s claim taxonomy at a high level.
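The two calendar cadences can drive a simple audit queue. In this sketch the chapter records, field names, and day counts are assumptions. The on-release audit is event-triggered and so is not modelled, and architectural-pattern chapters are assumed to fall under it.

```python
from datetime import date, timedelta

# Calendar cadences only; on-release audits are triggered by release feeds.
AUDIT_INTERVAL = {
    "feature-surface": timedelta(days=91),    # quarterly
    "stable-principle": timedelta(days=365),  # annual
}


def chapters_due(chapters: list, today: date) -> list:
    """Return slugs of chapters whose calendar audit is due, most volatile first."""
    due = [
        c for c in chapters
        if c["volatility"] in AUDIT_INTERVAL
        and today - c["last_verified"] >= AUDIT_INTERVAL[c["volatility"]]
    ]
    due.sort(key=lambda c: AUDIT_INTERVAL[c["volatility"]])
    return [c["slug"] for c in due]
```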
Evolution
The methodology in this chapter is itself young — less than two years old as a named discipline, with its constituent parts (source tiers, volatility classification, convergence tracking) drawn from older practices in academic citation, long-form journalism, and domain-driven software engineering. What’s new is the application to AI-assisted development specifically.
Convergence: the manifest + audit pattern. Academic research, legal scholarship, serious technical writing — all three traditions have independently landed on “structured source manifest + scheduled reverification” as the pattern for durability in citation-dependent work. Web archives (Perma.cc, Wayback) are the infrastructural layer that makes the pattern tractable for web sources. The agentic-coding application is new; the pattern is not.
Emerging: automated drift detection. A handful of practitioners are experimenting with automated drift checks — a script that periodically re-fetches every T1 source in the manifest, compares against the captured version, and flags changes. The tooling is hand-rolled in 2026; expect first-class product support within 18 months. The hardest part is not detection — it’s significance: most detected changes are cosmetic (typo fix, reorganization), and distinguishing those from substantive changes that invalidate a claim requires judgment the automation doesn’t have.
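A hand-rolled drift check of the kind described might look like the sketch below. The `captured_sha256` field is hypothetical (the manifest shown earlier does not carry one), and `fetch` is passed in as a callable so the network layer stays out of the example. Whitespace is normalised before hashing so that trivial reflows do not register, but that does nothing for the harder significance problem.

```python
import hashlib
from typing import Callable


def content_fingerprint(text: str) -> str:
    """Hash whitespace-normalised text so trivial reflows do not count as drift."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def detect_drift(manifest: list, fetch: Callable[[str], str]) -> list:
    """Return ids of T1 sources whose current content differs from the captured hash."""
    drifted = []
    for entry in manifest:
        if entry["tier"] != "T1-official":
            continue
        if content_fingerprint(fetch(entry["url"])) != entry["captured_sha256"]:
            drifted.append(entry["id"])
    return drifted
```

Everything this returns still needs a human to decide whether the change invalidates a claim.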
Emerging: cross-team pattern registries. If pattern tracking is valuable within one team’s book, it’s more valuable across teams. A shared registry — “plan-mode adopted by Claude 2025-06, Gemini 2026-03, Codex 2026-01” — would save every team the tool-watching work. No mature shared registry exists in 2026; various community efforts are early. Watch for consolidation here in the next 18–24 months.
How this chapter is itself aging
The meta-methodology ages differently than content claims. The three-lever structure (source tiers, volatility classes, convergence tracking) is likely to outlast many specific tool names — it’s a structural pattern that applies to any fast-moving field with citable sources. The specific tier names (T1–T4) and volatility class names are terminological choices that might evolve. The infrastructure (YAML manifests, scheduled audits) will be augmented by tooling — automated drift detection, shared pattern registries — over the next couple of years, at which point the hand-rolled version documented here will feel quaint. But the underlying discipline will hold.
Quick reference
- Agentic coding is a fast-moving field. Every claim has a half-life. Treating all claims as equivalent is the mistake.
- Three levers: source tiering (trust axis), volatility classification (decay axis), convergence/divergence tracking (field-level signal).
- Source tiers: T1-official, T2-release-notes, T3-practitioner, T4-conjecture. Tag at citation time, not during audit.
- Volatility classes: stable-principle, architectural-pattern, feature-surface. Tag in frontmatter. Audit cadence scales with volatility.
- Convergence across tools signals durability; divergence signals open design space. Track both with structured changelog data.
- Four routines: tier claims when citing, volatility-classify every chapter, track tool-pattern adoption, run scheduled drift audits.
- When rot is discovered, fix in four layers: correct the claim, refresh the freshness stamp, diagnose the audit failure, reclassify.
- The methodology is young — named as a discipline only in the last year or two. Tooling will improve substantially. The structural discipline is stable.
- Durability in a volatile field comes from knowing what to trust at what depth, not from trying to make everything equally authoritative.
Auditing your own practice
Ch 15 covered how to keep a book or a team's playbook current. This chapter is the same discipline applied to the single highest-leverage artifact most practitioners own: their own daily practice. Your workflows quietly rot. Commands you rely on get renamed. Habits ossify into superstitions. The audit discipline that catches field drift applies, scaled down, to the practitioner's own routine.
The previous chapter was about the field’s evolution. This chapter is about yours. The same forces that make a book rot also rot your personal practice, only more quietly and without the feedback loop a reader would provide, so the rot goes on longer before it is noticed. This is the last methodological chapter because it is the one that matters most to the reader: the methodology only pays off if you apply it to yourself.
Representation
Your practice has the same three dimensions as the book this methodology was built for.
You have sources you trust — the docs, posts, colleagues, and past experiences that shape your mental model of what the agent will do. Some of those sources are current; some described the tool eighteen months ago and you never updated. Without explicit re-tiering, the weighting in your head drifts toward the sources you encountered first, not the sources that are most accurate now.
You have claims you rely on — assertions about what works, what doesn’t, what the agent is good at, what to avoid. Each claim has a volatility class, same as each claim in a book has one. Some of your operating claims are stable principles (context is expensive); some are architectural patterns (briefing doc at the repo root); some are feature-surface (the /compact command does X). You track zero of them as such unless you build the habit.
You have a repertoire of specific workflows — the exact sequences of commands, prompts, and fallbacks you reach for without thinking. This is where the most personal rot lives. A workflow that was optimal last year remains in muscle memory even when the tool has grown a better primitive; a superstition that never mattered remains encoded because no one has re-examined it.
The uncomfortable observation: practice rot is not symmetric with tool change. The tools get better; some of your habits are still calibrated to the version where the tool was worse. Those habits continue to produce correct outputs — you do not hit an error — but they waste context, waste tokens, waste time, and cover up places where the new primitive would serve better.
Operation
Four routines — a daily micro-audit, a weekly repertoire review, a quarterly belief audit, and an annual integration — handle the practice-rot problem at different time scales. None is costly on its own. The combination is what works; any single cadence in isolation either misses drift (too coarse) or creates friction (too fine).
Routine 1: daily micro-audit at session close
At the end of a substantial agent session, spend ninety seconds on three questions: What did I reach for that didn’t work well? What did I want that the agent couldn’t do? What surprised me?
These three questions surface different things.
Didn’t work well flags feature-surface rot — you reached for a command that has been superseded, a pattern the agent no longer handles cleanly, a workflow that was fine until a recent release changed something.
Wanted that the agent couldn’t do flags the boundary between what’s possible now and what you thought was possible. Sometimes the gap is real (the tool cannot yet do this); sometimes it is your gap (the tool can do this and you haven’t learned how). The question is whether you investigate before assuming the gap is the tool’s.
Surprised me flags anywhere the agent’s behavior diverged from your model. This is the highest-information signal of the three. Surprise is evidence your mental model is incomplete or wrong; tracking surprises over weeks reveals the places your practice has rotted most.
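Tracking surprises over weeks needs somewhere to put them. A minimal sketch is a JSON Lines file with one record per session, plus a helper that surfaces surprises seen more than once. The file layout and field names are assumptions, and matching on exact lower-cased text is deliberately crude.

```python
import json
from collections import Counter
from datetime import date


def log_session(path: str, didnt_work: str, couldnt_do: str, surprised: str) -> None:
    """Append one session's three answers to a JSON Lines file."""
    record = {
        "date": date.today().isoformat(),
        "didnt_work": didnt_work,
        "couldnt_do": couldnt_do,
        "surprised": surprised,
    }
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def recurring_surprises(path: str, minimum: int = 2) -> list:
    """Return surprise notes that recur at least `minimum` times."""
    with open(path, encoding="utf-8") as handle:
        notes = [json.loads(line)["surprised"] for line in handle if line.strip()]
    counts = Counter(note.strip().lower() for note in notes if note.strip())
    return [note for note, count in counts.items() if count >= minimum]
```

A surprise that recurs is no longer a surprise; it is a correction your mental model has not absorbed yet.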
Routine 2: weekly repertoire review
Weekly, walk through the commands, skills, and prompts you reached for in the past week. For each, one question: is this still the best way to do this?
Three outcomes from the question:
- Still good — no action. Skip to the next item.
- Still works but something new is better — queue a migration. Update the skill, rewrite the command, change the habit.
- Broken or obsolete — delete. A dead skill left in place is active misinformation; it will mislead you, mislead your agent if it reads your config, mislead future-you when you search for the right way to do something.
The weekly cadence catches drift fast enough that the accumulated debt never becomes overwhelming. Practitioners who audit quarterly instead of weekly face a very different problem: twelve weeks of drift, the causes tangled with each other, and a cost of unwinding far higher than the sum of twelve individual weekly reviews would have been.
Routine 3: quarterly belief audit
Quarterly, step up a level and audit your operating beliefs — the claims about what the agent is good at, what to avoid, when to use which tool.
List your top-of-mind beliefs: the agent is bad at X. Don’t use tool Y for Z. Always do A before B. For each, one question: when did I last verify this, and against what source?
The audit reliably surfaces three classes of stale belief:
- Beliefs that were never grounded. You picked them up from a blog post, a tweet, a colleague’s offhand remark, and they hardened into operating principles without ever being tested against your actual workflow. Some will hold up; some will not. Either way, the audit promotes them from assumed-truth to tested-truth or rejects them.
- Beliefs that were grounded but have expired. A limitation that was real six months ago has been fixed in a minor release and you never updated. The belief continues to bound your behavior (you avoid doing things the agent now handles perfectly) at real cost.
- Beliefs that were never even articulated. Some of your practice is encoded as habit rather than belief. The audit is the time to surface those: I always start agent sessions with a full file read. When did I decide that? Is it still helpful? Half of these pass inspection; half do not.
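Once beliefs are written down, the first two classes can be sorted mechanically by asking when each was last verified. In this sketch the `Belief` record, the bucket labels, and the 91-day threshold are all assumptions. The third class, habits never articulated, cannot be detected this way, because an unarticulated belief never makes it into the list.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Belief:
    statement: str
    last_verified: Optional[date]  # None means never tested
    source: Optional[str] = None


def classify_belief(belief: Belief, today: date, max_age_days: int = 91) -> str:
    """Sort a belief into an audit bucket based on when it was last verified."""
    if belief.last_verified is None:
        return "never-grounded"
    if (today - belief.last_verified).days > max_age_days:
        return "possibly-expired"
    return "current"
```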
Routine 4: annual integration
Annually, do the cross-tool version of the quarterly audit. The question: are the tools I use still the right tools for my work, or have I stayed on them out of inertia?
This is the rarest routine because the switching cost is high and the judgment is hard. But it is the one that catches the deepest form of practice rot: continuing to use a tool optimized for a problem shape that your work no longer has. A practitioner whose work shifted from greenfield creation to brownfield maintenance may be using a tool selected for the old shape; a practitioner whose team adopted a new language may be relying on tool primitives that predate the language.
The annual integration does not require a tool switch. It requires considering a tool switch with a clear head — evaluating what you use, what the alternatives have become in the past year, whether any single change would materially improve your work. The answer is usually no. When it is yes, catching it within one year rather than three is the difference between a month of rebuilding practice and a quarter of it.
The link between personal audit and team audit
Personal rot and team rot are coupled. Your surprises, once investigated, should feed the team’s shared artifacts — briefing docs, skill registry, policy — so the investigation benefits everyone. A common anti-pattern: the individual notices rot, updates their personal config, but never updates the team-level artifact. Six months later, every new team member is still inheriting the stale state from the shared artifact. The audit closes only when the fix has propagated to the surface other people see.
The inverse flow is also important. When the team updates a shared artifact, every individual’s personal practice has a choice: align with the new shared state or drift away from it. Drift-away happens silently; alignment requires a small deliberate action. Building alignment into your weekly repertoire review (what did the team change this week, and did I update?) closes the loop in the other direction.
Evolution
The self-audit discipline is the stable-principle core of this book, but the specific routines will need adjusting as the tooling around them matures.
Emerging: AI-assisted self-audit. A natural loop: the agent itself can help audit your practice. Read my last thirty session transcripts and list claims I seem to rely on that no longer match the tool’s current behavior. This is a first-class task for the agent, and the feedback loop, in which the agent helps the practitioner audit their own agent use, is a genuinely new possibility. Tools to support this are not mature in 2026; expect a generation of them within 18 months. The methodology in this chapter is the scaffolding; agent-assisted audit is the amplifier.
Emerging: cohort-level audit. Tools or third-party services that let a team see aggregated drift signals — everyone on the team seems surprised when X happens; that’s a candidate for a team-level update — do not exist in 2026 but are an obvious extension of the personal pattern. Privacy-preserving designs will matter; expect opt-in, aggregated-signal-only products within 18–24 months.
Emerging: practice longevity research. A small community is beginning to study practitioner skill decay in AI-assisted development: how quickly skill degrades when practitioners stop practicing, how much of it transfers to new tools, whether the four-cadence discipline actually produces measurable improvement. This is nascent. Expect publications over the next two years; expect the methodology in this chapter to be refined or partially superseded by more evidence-based versions.
Quick reference
- Your personal practice rots the same way a book rots — by the same mechanisms, along the same axes. Self-audit is the counter-discipline.
- Three dimensions of personal rot: sources you trust, claims you rely on, repertoire you reach for. Each ages at a different rate.
- Four routines at four cadences: daily surprise log, weekly repertoire review, quarterly belief audit, annual tool integration. The combination is load-bearing; any single cadence in isolation fails.
- Daily: end-of-session 90 seconds on what didn’t work, what the agent couldn’t do, what surprised you. Surprises are the highest-information signal.
- Weekly: walk your commands/skills/prompts. For each — still good, needs migration, or delete? Dead skills are active misinformation.
- Quarterly: audit your operating beliefs. When did you last verify this? Against what source? Promote assumed-truth to tested-truth or retire.
- Annually: consider tool switching with a clear head. Usually no change; when yes, catching it at one year rather than three is worth the audit cost.
- Personal audit and team audit are coupled. Individual fixes should propagate to shared artifacts; shared changes should trigger personal alignment.
- Convergent across tools: the shape of self-audit is universal. Divergent: the specific surfaces, the tooling support, and the emerging vocabulary.
- Emerging: agent-assisted self-audit, cohort-level audit signals, practitioner-skill research. 18–24 months of substantial movement.
- Durable principle: you are the most important source in your own practice. Audit your own claims at least as carefully as you’d audit someone else’s book.
Appendix A — Claude Code companion
A deep reference for Claude Code specifically. Organized around the book's concepts rather than as a feature catalogue: how the primitives the book discusses (briefing docs, plan mode, skills, hooks, subagents, MCP) actually map to Claude Code's surfaces, with their current flags and file paths. Where the main chapters kept Claude-specific detail bounded for comparative fairness, this appendix lets it flow.
The main chapters kept Claude-specific detail bounded because comparative pedagogy demands it. This appendix is where the bound comes off: concrete paths, specific flags, the exact primitives as they exist in Claude Code as of the chapter’s last-verified date. Nothing here is a principle — principles are in the body chapters. Everything here is current feature surface, classified explicitly as volatile, audited quarterly.
How to use this appendix
Three reading modes fit this appendix.
Reference lookup. You know what you want and need the specific command or path. Skim the section headers for the concept; the answer is a paragraph or table below.
Gap-check against a chapter. You’ve just read a main chapter and want the Claude-specific detail. Find the section named after the concept (e.g. “Briefing documents” for Ch 7); read the concrete surface here.
Comparative study. You know Claude Code and want to see how the book frames its primitives. The section headers match the book’s concepts, not Claude’s product surface; the translation is visible.
The appendix assumes you have Claude Code installed and a basic session-loop familiarity. It does not reteach the interactive loop.
Invocation modes
Claude Code ships one binary (claude) with several invocation modes.
| Mode | Command | Purpose |
|---|---|---|
| Interactive | claude | REPL-style session in the terminal |
| Print (headless) | claude -p "<prompt>" | One-shot non-interactive run |
| Resume | claude --resume or claude --continue | Reattach to prior session |
| Stream | claude -p "<prompt>" --output-format stream-json | JSON-stream output for piping to other tools; scriptable integrations |
The print mode (-p) is the primary surface for CI integration and scripted batches (see Ch 12). Output shape is controlled by --output-format (text, json, stream-json). Scripting against a stable machine-readable output is more durable than scripting against the interactive text format, which has drifted between releases.
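A minimal sketch of scripting against the machine-readable output rather than the interactive text. It uses only the -p and --output-format flags named above; the helper names are hypothetical, and the sketch assumes nothing about the fields inside the JSON document, so check the current docs before relying on any particular key.

```python
import json
import subprocess
from typing import Any


def build_headless_command(prompt: str, output_format: str = "json") -> list[str]:
    """Build the argv for a one-shot print-mode run."""
    allowed = {"text", "json", "stream-json"}
    if output_format not in allowed:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return ["claude", "-p", prompt, "--output-format", output_format]


def run_headless(prompt: str, timeout_s: float = 300.0) -> Any:
    """Run Claude Code once and return its parsed JSON output."""
    result = subprocess.run(
        build_headless_command(prompt, "json"),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # raise CalledProcessError on a non-zero exit
    )
    # Only assume stdout is one JSON document; its fields are not assumed here.
    return json.loads(result.stdout)
```

Building the argument list in one place keeps the CI script insulated from flag changes: when a release renames or adds a format, only build_headless_command needs to change.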
Briefing documents (Ch 7)
Claude’s briefing-doc convention is a file named CLAUDE.md at the repo root, resolved up from the working directory. Nested CLAUDE.md files closer to the working directory are concatenated; the closest wins on conflict. The file is re-injected on every turn — budget-sensitive claims from Ch 2 apply directly.
The supported locations, in precedence order:
- ./CLAUDE.md (project-local, highest precedence for its directory)
- ~/.claude/CLAUDE.md (user-level defaults applied to every session)
- Parent-directory CLAUDE.md files walked up from the current directory
A user-level CLAUDE.md is the right place for preferences that apply to all your work (use terse output, prefer Python over shell, never write new comments). Project-level CLAUDE.md is the right place for project-specific context (this repo uses Turbo and pnpm, the staging branch is develop). Do not collapse the two levels.
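To see which briefing docs would be in play for a given working directory, the lookup described above can be approximated in a few lines. This is a sketch of the lookup as this appendix describes it, not Claude Code's own implementation; the function name is hypothetical, and the result is returned in the same order as the list of supported locations above.

```python
from pathlib import Path


def find_briefing_docs(cwd: Path, home: Path, name: str = "CLAUDE.md") -> list[Path]:
    """List existing briefing docs: project-local, then user-level, then parents."""
    cwd = cwd.resolve()
    candidates = [cwd / name, home.resolve() / ".claude" / name]
    # Path.parents yields the nearest parent first, then walks up to the root.
    candidates += [parent / name for parent in cwd.parents]
    found: list[Path] = []
    for candidate in candidates:
        if candidate.is_file() and candidate not in found:
            found.append(candidate)
    return found
```

Running it from a deep subdirectory is a quick way to notice a forgotten parent-level file that is quietly adding to every turn's context budget.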
Slash commands and skills (Ch 8)
Claude Code exposes two extension surfaces.
Slash commands live in .claude/commands/ (project-level) or ~/.claude/commands/ (user-level). Each command is a Markdown file with optional frontmatter; the filename minus extension becomes the slash-command name. The body is used as a prompt template, with $ARGUMENTS substitution for the arguments typed after the command name.
Skills live in .claude/skills/ (project) or ~/.claude/skills/ (user). A skill is a directory containing a SKILL.md with frontmatter describing when Claude should invoke it; the directory can also contain scripts, templates, and reference material the skill uses. Skills are autonomously invoked when Claude judges them relevant, rather than explicitly called like slash commands.
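The filename-to-command mapping is easy to inventory, which helps with the weekly repertoire review. The sketch below lists the slash commands visible from a project; the function name is hypothetical, and letting a project-level file replace a user-level file of the same name is an assumption of this sketch rather than documented behavior.

```python
from pathlib import Path


def discover_slash_commands(project_root: Path, home: Path) -> dict[str, Path]:
    """Map slash-command names to the Markdown files that define them."""
    commands: dict[str, Path] = {}
    # User-level first, so a project-level file with the same name replaces it.
    # That override order is an assumption of this sketch.
    for directory in (home / ".claude" / "commands", project_root / ".claude" / "commands"):
        if not directory.is_dir():
            continue
        for path in sorted(directory.glob("*.md")):
            # The filename minus its extension becomes the command name.
            commands["/" + path.stem] = path
    return commands
```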
Hooks (Ch 8)
Hooks are user-defined shell commands that run in response to agent lifecycle events. Configured in .claude/settings.json (project) or ~/.claude/settings.json (user). Each hook entry binds an event name to a command string.
Event names, at time of capture:
| Event | Fires when | Common use |
|---|---|---|
| PreToolUse | Before the agent runs a tool | Block or mutate disallowed actions |
| PostToolUse | After a tool runs | Log results, post-process outputs |
| UserPromptSubmit | When user sends a prompt | Inject additional context |
| Notification | When the agent is idle and wants attention | Trigger desktop notifications |
| Stop | When the agent finishes responding | Final logging, commit hooks |
| SubagentStop | When a subagent finishes responding | Capture or check delegated results |
| PreCompact | Before the agent compacts its context | Archive full transcript before loss |
| SessionStart | At session open | Set env, run briefings |
Hooks have a significant power/risk ratio. A well-scoped hook adds strong guardrails; a misconfigured hook can silently mutate the agent’s behavior in ways that are very hard to debug. Start narrow (PreToolUse on specific dangerous commands) and only broaden after you’ve watched the narrow version run for a week.
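A narrow PreToolUse guard might look like the sketch below. It assumes the hook receives a JSON event on stdin with tool_name and tool_input.command fields, and that exiting with code 2 blocks the tool call; both are assumptions to verify against the current hook documentation. The deny-list patterns are hypothetical examples, deliberately limited to two dangerous commands.

```python
import json
import re
import sys
from typing import Optional

# Hypothetical deny-list: recursive force-deletes and force-pushes only.
DANGEROUS = [
    re.compile(r"\brm\s+-\w*r\w*f"),
    re.compile(r"\brm\s+-\w*f\w*r"),
    re.compile(r"\bgit\s+push\b.*--force\b"),
]


def block_reason(event: dict) -> Optional[str]:
    """Return a reason to block the tool call, or None to let it through."""
    if event.get("tool_name") != "Bash":  # assumed field name
        return None
    command = event.get("tool_input", {}).get("command", "")  # assumed field names
    for pattern in DANGEROUS:
        if pattern.search(command):
            return f"blocked by PreToolUse hook: matches {pattern.pattern!r}"
    return None


def main() -> int:
    """Read one event from stdin and translate the decision into an exit code."""
    reason = block_reason(json.load(sys.stdin))
    if reason is None:
        return 0
    print(reason, file=sys.stderr)
    return 2  # assumed convention: exit code 2 blocks the tool call

# When saved as a hook script, finish the file with: sys.exit(main())
```

Keeping the decision in a pure function (block_reason) means the guard can be unit-tested without running the agent, which is what makes the "watch the narrow version for a week" advice cheap to follow.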
Plan mode (Ch 6)
Plan mode is Claude Code’s read-only planning phase. Enter with a keyboard shortcut (typically Shift+Tab cycling through auto-accept / plan / normal) or via CLI flag on startup. In plan mode the agent reads files, searches, analyzes — but cannot write files or execute mutating commands. Its output is a plan artifact written to a known path (~/.claude/plans/<name>.md) for review.
The workflow: enter plan mode → describe the goal → the agent proposes a plan file → you review and iterate → exit plan mode → the agent implements against the approved plan.
Subagents (Ch 9)
Claude Code’s delegation primitive is the Task tool, which spawns a child agent with its own context window, tool access, and prompt. The child runs to completion and returns a single summary message to the parent. The parent’s context grows by the summary, not by the child’s full transcript — this is the compression mechanism that makes delegation valuable.
Subagent type definitions live in .claude/agents/ (project) or ~/.claude/agents/ (user). Each is a Markdown file with frontmatter describing the agent’s purpose, tool access, and any system prompt customizations. Predefined subagent types (e.g. general-purpose, Explore, Plan) ship with Claude Code; custom types extend them.
The decision to delegate vs. stay in the main thread comes down to Ch 9’s principle: delegate when the subtask has a well-defined input and output, and when inlining the full work would blow the parent’s context budget. The Task tool is the mechanical execution of that judgment call.
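The two conditions and the compression effect reduce to simple arithmetic. The sketch below is illustrative only: the function names and any token figures you feed it are hypothetical, and a real estimate of a subtask's size is a judgment call rather than a measured number.

```python
def parent_growth(child_transcript_tokens: int, summary_tokens: int, delegated: bool) -> int:
    """Tokens added to the parent's context by one subtask."""
    # Delegation adds only the returned summary; inlining adds the full work.
    return summary_tokens if delegated else child_transcript_tokens


def should_delegate(well_defined_io: bool, parent_tokens: int,
                    estimated_work_tokens: int, context_budget: int) -> bool:
    """Delegate only when the subtask is well-defined and inlining would overflow."""
    would_overflow = parent_tokens + estimated_work_tokens > context_budget
    return well_defined_io and would_overflow
```

For example, with a hypothetical 60,000-token investigation summarized in 800 tokens, the parent grows by 800 tokens when delegated instead of 60,000 when inlined.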
MCP integration (Ch 8)
Claude Code is an MCP client. Servers are configured in .mcp.json at the repo root (project-level) or in user settings. Each server declares a command to launch and an optional environment; Claude connects on session start and exposes the server’s tools to the agent.
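As a sketch of what a server declaration contains, the snippet below generates a project-level .mcp.json. The top-level mcpServers key and the command/args/env fields reflect the commonly documented shape but should be checked against current docs; the server name, module, and environment variable are hypothetical.

```python
import json


def mcp_config(servers: dict[str, dict]) -> str:
    """Render a .mcp.json document declaring the given servers."""
    return json.dumps({"mcpServers": servers}, indent=2) + "\n"


config_text = mcp_config({
    "internal-docs": {                        # hypothetical server name
        "command": "python",                  # command used to launch the server
        "args": ["-m", "internal_docs_mcp"],  # hypothetical module
        "env": {"DOCS_ROOT": "./docs"},       # optional environment
    }
})
```

Writing config_text to .mcp.json at the repo root would make the declaration available to everyone who works in the repo.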
Settings file reference
Claude Code’s settings split across two levels.
Project-level: .claude/settings.json (committed) and .claude/settings.local.json (gitignored, per-user). Project settings affect everyone on the team who works in the repo; local settings are personal overrides.
User-level: ~/.claude/settings.json. Applies to every session regardless of project.
Key fields, at time of capture:
| Field | Purpose |
|---|---|
| model | Default model for the session |
| permissions.allow / permissions.deny | Tool-level allow/deny lists |
| permissions.defaultMode | Auto-accept vs ask-first default |
| env | Environment variables applied to tool invocations |
| hooks | Hook bindings (see Hooks above) |
| subagents.disabledTypes | Block specific subagent types from being spawned |
Permission strings support globs (Bash(git:*) allows all git subcommands) and path scoping (Edit(src/**) allows edits only under src/). This expressive power is what makes Claude’s permission model work for both personal and enterprise deployment.
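A small generator can keep a committed settings file reviewable. The sketch below renders a project-level settings document using the permissions and env fields from the table above and the two permission strings just mentioned; the deny rule, the environment variable, and the function name are hypothetical, and the overlap check is a convenience of this sketch rather than something the tool is known to enforce.

```python
import json
from typing import Optional


def render_settings(allow: list[str], deny: list[str],
                    env: Optional[dict[str, str]] = None) -> str:
    """Render a settings.json document, rejecting contradictory rules."""
    overlap = set(allow) & set(deny)
    if overlap:
        raise ValueError(f"rules both allowed and denied: {sorted(overlap)}")
    settings: dict = {"permissions": {"allow": list(allow), "deny": list(deny)}}
    if env:
        settings["env"] = dict(env)
    return json.dumps(settings, indent=2) + "\n"


settings_text = render_settings(
    allow=["Bash(git:*)", "Edit(src/**)"],
    deny=["Bash(rm:*)"],  # hypothetical deny rule
    env={"CI": "1"},      # hypothetical variable
)
```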
Session state and memory
Session transcripts are stored locally (typically under ~/.claude/projects/<project-hash>/). The --resume flag reattaches to prior sessions; --continue picks up the last session automatically.
The memory system — distinct from session transcripts — stores persistent notes the agent has written about the user, the project, or past interactions. Memory files live in user-level config and are loaded into every session’s context. The memory system has its own audit discipline: stale memories are active misinformation and should be pruned on the same cadences Ch 16 describes.
When to use Claude Code
The honest answer: most of this book treats Claude as the primary tool because its briefing-doc / skills / hooks / subagents / MCP surface is the most mature of the three, not because it is categorically superior. Teams with existing Google Cloud investment, existing OpenAI relationships, or specific Codex workflow preferences should read the corresponding appendix (B, C) — the principle chapters apply across tools.
Situations where Claude Code is a particularly strong fit: heavy multi-file refactors that benefit from subagent delegation, teams with rich hook-driven automation needs, workflows that lean on plan mode’s explicit read-then-write separation. Situations where another tool may fit better are noted in Appendices B and C.
Quick reference
- claude (interactive), claude -p "<prompt>" (headless), --resume / --continue (session reattach)
- Briefing doc: CLAUDE.md at repo root, plus ~/.claude/CLAUDE.md user defaults
- Slash commands: .claude/commands/ (project) or ~/.claude/commands/ (user)
- Skills: .claude/skills/ (project) or ~/.claude/skills/ (user); autonomous rather than user-triggered
- Hooks: .claude/settings.json hooks field; events include PreToolUse, PostToolUse, UserPromptSubmit, SessionStart, Stop, PreCompact
- Plan mode: Shift+Tab cycles modes; plan artifacts land in ~/.claude/plans/
- Subagents: Task tool spawns children; types configured in .claude/agents/
- MCP servers: .mcp.json at repo root; server tools appear in agent’s tool list on session start
- Permissions: allow/deny in settings.json with glob and path scoping (Bash(git:*), Edit(src/**))
- Claim volatility: entire appendix is feature-surface; audit quarterly against current docs.
Appendix B — Gemini CLI companion
What's different about Gemini CLI: the parts of its surface that diverge from the Claude-centric defaults in the body chapters. Kept brief on purpose — comparative pedagogy, not exhaustive documentation. Where Gemini is genuinely the better fit, that is named.
This appendix is explicitly scoped to the parts of Gemini CLI that diverge from Claude’s defaults or that make Gemini a better fit for a particular workload. Convergent primitives (briefing doc at repo root, slash commands, headless mode, MCP integration) are covered in the body chapters and not repeated here. The intent is a reference that helps a practitioner who already read the book translate its concepts to Gemini specifically, not a standalone tutorial.
What Gemini CLI is
Gemini CLI is Google’s open-source command-line agent, integrating the Gemini model family with local tooling. Its natural home is workflows where the developer already operates inside Google Cloud — existing Vertex AI contracts, existing ADC (Application Default Credentials) setup, existing BigQuery / GCS / Cloud Run infrastructure. In those contexts the credential chain, billing, and deployment path are ergonomic rather than adversarial.
Key defaults at time of capture:
| Surface | Gemini CLI |
|---|---|
| Binary | gemini |
| Interactive entry | gemini |
| Non-interactive | gemini -p "<prompt>" or piped stdin |
| Briefing doc | GEMINI.md at repo root |
| Slash commands | .gemini/commands/ (project) or ~/.gemini/commands/ (user) |
| Extension model | Primarily MCP servers |
| Auth | Google Cloud ADC chain by default |
What’s different about context
Gemini’s context window is the largest of the three tools, spanning multi-hundred-K to multi-million tokens. The corollary is discipline. A long-context model tempts the anti-pattern Ch 2 warns about: adding more because the budget allows it. The right use of Gemini’s context is to reduce pre-processing (less forced summarization, less aggressive chunking) when the task genuinely needs many files in view at once, not to make context hygiene optional.
Extension model: MCP-primary
Gemini CLI leans heavily on MCP as its extension surface. Where Claude exposes skills, hooks, and MCP as three distinct surfaces, Gemini unifies most of that territory around MCP servers. The upside: a cleaner mental model — all tool extension is an MCP server. The downside: more setup cost for tasks where Claude’s skill-as-directory or command-as-Markdown would be lighter-weight.
Memory and session commands
Gemini CLI’s built-in session commands include /memory with subcommands (refresh, add, show) that manage the persistent memory surface — GEMINI.md-derived context plus any memory updates made during a session. /memory refresh re-reads the briefing doc when it’s been edited mid-session; /chat manages conversation state.
The --yolo flag hazard
Gemini CLI ships a flag — --yolo — that disables all approval prompts, letting the agent run destructive actions without pausing. For ephemeral experimentation in a throwaway environment it is convenient; for CI, for production repositories, for anything that matters, it is dangerous.
The failure mode: a well-intentioned engineer uses --yolo to unblock a specific workflow, commits the invocation to a repo script, and months later the script runs in a context where the safety it skipped was load-bearing. The fix — and it applies to similar broad-authorization flags in any tool — is to treat --yolo as a local-session affordance only, never commit invocations that include it, and audit scripts for its presence as part of the team’s governance practice (Ch 13).
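The audit step can be automated with a plain text scan. The sketch below walks a repository and reports every line in script-like files that mentions the flag; the set of file suffixes and the function name are hypothetical choices, and a substring match will also flag comments and documentation, which is usually acceptable for a governance check.

```python
from pathlib import Path

# Hypothetical set of file types worth scanning for committed invocations.
SCRIPT_SUFFIXES = {".sh", ".bash", ".yml", ".yaml", ".py", ".toml", ".json"}


def find_flag_invocations(root: Path, flag: str = "--yolo") -> list[tuple[Path, int, str]]:
    """Return (relative path, line number, line) for each mention of the flag."""
    hits: list[tuple[Path, int, str]] = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        if not path.is_file() or ".git" in relative.parts:
            continue
        if path.suffix not in SCRIPT_SUFFIXES and path.name != "Makefile":
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if flag in line:
                hits.append((relative, number, line.strip()))
    return hits
```

Wired into CI, a non-empty result can fail the build. If the audit script itself lives in the scanned tree it will report its own default argument, so exclude it or pass the flag in from outside.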
Authentication chains
Gemini CLI’s default auth path is Google Cloud ADC — the same credential chain that gcloud, Cloud Run deploys, and other Google Cloud tools use. Running gcloud auth application-default login once sets the credential; subsequent Gemini invocations inherit it. For Vertex AI–backed usage this is particularly smooth because the CLI, the model endpoint, and any downstream GCP services all resolve identity through the same chain.
For teams not already in Google Cloud, the auth setup cost is real (a GCP project, ADC, Vertex access). This is one of the tradeoffs that pushes some workloads toward other tools despite Gemini’s capability advantages.
GitHub Action
Google ships an official google-gemini/gemini-cli-action for GitHub Actions. The integration pattern is convergent with Claude and Codex (Ch 12): @mention-triggered agent runs on PRs and issues, agent outputs posted as comments or inline suggestions. Configuration lives in .github/workflows/; the action exposes inputs for prompt, allowed tools, credentials, output format.
When Gemini CLI is the right fit
Situations where Gemini is a particularly strong choice:
- Existing Google Cloud investment. The credential chain, billing, and compliance path are ergonomic; switching tools would add friction for no capability gain.
- Large-context workloads. Monorepo review, whole-document synthesis, long-lived architectural investigations — the context advantage is genuinely useful here.
- MCP-heavy extension ecosystem. Teams that have already standardized on MCP servers for their internal tooling get the most leverage from Gemini’s MCP-primary model.
Situations where another tool may fit better:
- Rich hook-driven automation where Claude Code’s dedicated hook events are ergonomic.
- Teams with existing OpenAI / Codex relationships whose workflows are already integrated there.
- Narrow, specific plan-mode-style workflows where Claude’s explicit read-only mode is a primitive rather than a convention.
Quick reference
- Binary: gemini; gemini -p "<prompt>" for print mode
- Briefing doc: GEMINI.md at repo root; re-read mid-session with /memory refresh
- Slash commands: .gemini/commands/ (project) or ~/.gemini/commands/ (user)
- Extension: MCP-primary; most tool extension is an MCP server
- Auth: Google Cloud ADC chain; Vertex AI is the ergonomic backend
- Hazard: --yolo disables all approvals; treat as local-only; audit scripts for its presence
- GitHub Action: google-gemini/gemini-cli-action
- Context: multi-hundred-K to multi-million tokens; use the window to reduce pre-processing, not to skip context hygiene
- Best fit: Google-Cloud-native teams, large-context workloads, MCP-standardized extension ecosystems
- Volatility: feature-surface; audit quarterly.
Appendix C — Codex CLI companion
What's different about Codex CLI: the parts of its surface that diverge from the Claude-centric defaults in the body chapters. Focus on its distinctive approval-mode permission model, sandbox defaults, and the OpenAI/Azure deployment path. Where Codex is genuinely the better fit, that is named.
This appendix is the Codex-specific counterpart to Appendix B. Same scoping discipline: convergent primitives (briefing doc, slash commands, headless mode) live in the body chapters; this reference covers what’s different and when Codex is the better fit.
What Codex CLI is
Codex CLI is OpenAI’s open-source command-line agent. Like Claude and Gemini, it runs an agent loop against a model backend, exposes tool-calling, and supports both interactive and headless modes. Its characteristic design choice is the approval mode model for permissions — a tighter, more declarative alternative to the allow/deny list patterns the other tools use.
Key defaults at time of capture:
| Surface | Codex CLI |
|---|---|
| Binary | codex |
| Interactive entry | codex |
| Non-interactive | Approval-mode flags; see below |
| Briefing doc | AGENTS.md at repo root |
| Backend | OpenAI API or Azure OpenAI; configurable via env |
| Sandbox | Built-in filesystem / command sandbox |
Approval modes: the distinctive permission surface
The approval-mode levels, at time of capture, span roughly:
- Ask on every action: maximum human oversight. The agent proposes, the human approves each step.
- Ask on request (-a on-request or --suggest): the agent works autonomously on reads and analysis, asks before any write or destructive action.
- Ask on failure: run without prompting unless a command fails; then surface for human judgment.
- Never ask: full autonomy, equivalent to Gemini’s --yolo in trust level.
For automated pipelines the approval-mode flag is the primary configuration. The discipline from Ch 12 applies: pick the most restrictive mode that still lets the workflow complete, not the most permissive one that always works.
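The "most restrictive mode that still completes" rule is a search over an ordered scale. The sketch below models the four levels listed above as an ordered enum and tries them from strictest to loosest; the enum names and the completes callback are hypothetical stand-ins for however your pipeline dry-runs a workflow, not Codex identifiers.

```python
from enum import IntEnum
from typing import Callable, Optional


class ApprovalMode(IntEnum):
    """Ordered from most to least restrictive, following the list above."""
    ASK_EVERY_ACTION = 0
    ASK_ON_REQUEST = 1
    ASK_ON_FAILURE = 2
    NEVER_ASK = 3


def most_restrictive_passing(
    completes: Callable[[ApprovalMode], bool],
) -> Optional[ApprovalMode]:
    """Return the strictest mode under which the workflow completes, if any."""
    for mode in sorted(ApprovalMode):
        if completes(mode):
            return mode
    return None
```

Starting the search from the strict end is the point: the permissive mode that "always works" is only reached after every tighter mode has been shown to fail.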
The built-in sandbox
Codex ships a filesystem and command sandbox as a first-class feature rather than as an optional hook. File writes default to a scoped working directory; shell commands run inside a constrained environment; network egress can be restricted per invocation.
This changes the default risk posture. Where Claude and Gemini rely more heavily on hook-based or permission-based guardrails that the operator configures, Codex ships stricter defaults and lets the operator loosen them. For CI integration and scheduled runs the sandbox is a meaningful safety net — misconfiguration errs toward blocking rather than toward blast radius.
Backend flexibility
Codex’s backend is configurable via environment variables, which makes it particularly easy to point at alternative inference endpoints: the default OpenAI API, Azure OpenAI (enterprise default), or any OpenAI-compatible proxy (including private inference servers hosting compatible model gateways). The environment-variable approach is less opinionated than either Claude’s Bedrock/Vertex integrations or Gemini’s Vertex-native path, which makes Codex particularly flexible for unusual deployment topologies.
For enterprise deployment (Ch 14), this manifests as an easier path to self-hosted or proxied inference. For teams using Azure as their cloud, the integration is natively ergonomic.
What’s thinner
Being honest about the comparison: some of the surface that is mature in Claude (and reasonably mature in Gemini) is thinner or differently-shaped in Codex.
- Hook-driven automation. Codex’s hook surface is less comprehensive than Claude’s eight-event model at time of capture. Lifecycle-based automation is more often implemented as wrapper scripts than as first-class hooks.
- Skills-as-autonomous-capabilities. The distinction Appendix A draws between Claude’s user-triggered commands and autonomous skills is less explicit in Codex; the conventional unit is closer to a prompt template than to an auto-reaching capability.
- Purpose-built GitHub Action. The integration path leans on general non-interactive invocation rather than a Codex-specific action. More DIY; more flexible; fewer opinionated defaults.
These gaps are narrowing over time — the convergence axis of Ch 15 applies — but as of the last-verified date, a team leaning heavily on any of them will find more ergonomic surfaces in Claude.
Authentication
Codex authenticates via API key or via Azure OpenAI’s credential chain. Key management in enterprise deployments typically routes through a secrets store (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault); the CLI reads the key from an environment variable that the secrets infrastructure populates.
For interactive personal use, the pattern is the OpenAI API key in ~/.codex/config.toml or equivalent. Keep the key out of committed files; the same discipline applies to Gemini and Claude secrets.
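A wrapper that launches the CLI in automation can fail fast when the secrets infrastructure has not populated the key, instead of letting the agent start and fail mid-run. This is a minimal sketch: the variable name OPENAI_API_KEY is an assumption, and the function name is hypothetical.

```python
import os
from typing import Mapping, Optional


def child_environment(key_var: str = "OPENAI_API_KEY",  # assumed variable name
                      environ: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return the environment for the child process, or raise if the key is missing."""
    env = dict(os.environ if environ is None else environ)
    if not env.get(key_var):
        raise RuntimeError(f"{key_var} is not set; populate it from your secrets store")
    return env
```

Because the key only ever travels through the process environment, nothing in the wrapper needs to read or write a file that could end up committed.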
When Codex CLI is the right fit
Situations where Codex is a particularly strong choice:
- OpenAI-first or Azure-first shop. Existing OpenAI API relationships, existing Azure OpenAI tenancy — the friction of adopting a different tool for the same underlying model family isn’t worth paying.
- Sandbox-as-primary-safety-net workflows. Exploratory automation where you want a structural bound on blast radius rather than a permissions-configuration bound.
- Unusual backend topologies. Self-hosted inference with an OpenAI-compatible gateway, private proxies, testing environments that swap model endpoints freely.
- Approval-mode ergonomics. If your team has converged on thinking about agent trust in depth terms rather than tool-scope terms, the approval-mode model is a closer match.
Situations where another tool may fit better:
- Rich hook-driven automation (Claude’s event model is ergonomic here).
- Large-context workloads (Gemini’s window advantage is real).
- Workflows that benefit from first-class autonomous skills (Claude’s skill model is more explicit).
Quick reference
- Binary: codex; approval-mode flags (-a on-request, --suggest, etc.) configure autonomy for non-interactive use
- Briefing doc: AGENTS.md at repo root (part of the cross-tool convergence on top-level briefing docs)
- Backend: OpenAI API or Azure OpenAI; configurable via env; OpenAI-compatible proxies work
- Permissions: approval-mode model, not allow/deny lists — controls trust depth rather than tool scope
- Sandbox: built-in filesystem and command sandbox; default posture errs toward blocking
- Thinner vs Claude: hook surface, autonomous skills, purpose-built GitHub Action
- Auth: API key via env var; integrates with enterprise secrets stores
- Best fit: OpenAI / Azure-first shops, sandbox-primary safety, unusual backend topologies
- Volatility: feature-surface; audit quarterly. Gaps vs Claude are narrowing over time.
Appendix D — Source archive index
The canonical index of every external source cited in the book. Each entry renders with title, author, publish date, capture date, trust tier, and (when available) a Perma.cc archival link. The full Playwright capture of each source lives in the repo's gitignored local cache for drift detection; only the metadata below is public.
Every claim in this book that rests on an external source is tagged with a <Citation> that resolves to an entry in this archive. The archive itself lives in version-controlled YAML; this page is the human-readable view. Ch 15’s source-tiering methodology is implemented concretely here — every source carries a tier from T1-official to T4-conjecture, and the audit cadence for each chapter is shaped by the tier mix of its cited sources.
How the archive works
The book ships a structured source manifest at sources/manifest.yaml. Each entry has a stable slug, a URL, a title, an author, a publish date, a capture date, a trust tier, and (optionally) a Perma.cc archival link and a hash of the locally-captured content.
Citations in chapter MDX reference the slug: <Citation src="gwern-sidenote" /> resolves at build time to the metadata below. The <Citation> component renders the trust tier as a badge so readers can calibrate their confidence at the point of use, and exposes the archival link when present.
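The slug lookup amounts to a keyed record fetch. The sketch below models a manifest entry with the fields listed above and resolves a slug against an in-memory manifest; it is an illustration in Python rather than the book's build code, the class and function names are hypothetical, and the example entry is a placeholder, not a real source from the archive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceEntry:
    slug: str
    url: str
    title: str
    author: str
    publish_date: str
    capture_date: str
    tier: str                           # T1-official through T4-conjecture
    perma_cc: Optional[str] = None      # optional archival link
    content_hash: Optional[str] = None  # optional hash of the local capture


def resolve_citation(manifest: dict[str, SourceEntry], slug: str) -> SourceEntry:
    """Look up a citation slug, failing loudly on a typo or missing entry."""
    try:
        return manifest[slug]
    except KeyError:
        raise KeyError(f"unknown citation slug: {slug!r}") from None


# Placeholder entry for illustration only.
example = SourceEntry(
    slug="example-source", url="https://example.com/post", title="Example post",
    author="A. Author", publish_date="2025-01-01", capture_date="2025-02-01",
    tier="T3-practitioner",
)
manifest = {example.slug: example}
```

Failing the build on an unknown slug, rather than rendering an empty badge, keeps a mistyped citation from shipping silently.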
The full captured HTML + PDF + screenshot of each source is held in the repo’s gitignored sources/cache/<hash>/ directory — used for drift detection (re-fetch periodically, compare hashes, flag significant changes) but never published. The defensive posture this creates (reader sees metadata and archival link, full cached content stays private) keeps the legal surface small while preserving the author’s ability to verify integrity.
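The hash comparison at the core of drift detection is small. The sketch below assumes SHA-256 over the raw captured bytes, which is a choice of this example; the text does not say which hash the manifest uses. An exact comparison flags any byte-level change, so deciding whether a change is significant remains a human review step.

```python
import hashlib


def content_hash(captured: bytes) -> str:
    """Hash captured content (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(captured).hexdigest()


def has_drifted(stored_hash: str, refetched: bytes) -> bool:
    """True when the re-fetched content no longer matches the stored hash."""
    return content_hash(refetched) != stored_hash
```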
The archive
The listing below is rendered live from sources/manifest.yaml at build time by the <SourceArchive /> component. Entries are grouped by tier in descending authority (T1 → T4); within a tier, newest publish dates come first. Empty tiers render an honest placeholder — the archive is intentionally sparse in the early book, and visible gaps are part of the pedagogy.
T1 · Official (no entries yet)
Vendor-official documentation or release notes. Highest trust for factual claims about the vendor’s own tool.
No sources at this tier yet.
T2 · Release notes (no entries yet)
Release blog posts, changelogs, conference talks. Trustworthy for intent and availability claims.
No sources at this tier yet.
T3 · Practitioner (2 entries)
Respected community writing with a durable argument the author has defended over time.
T4 · Conjecture (no entries yet)
Blog posts, tweets, or unverified claims. Pointers to investigate, not authorities.
No sources at this tier yet.
What’s coming
The archive is intentionally sparse at this point in the book’s development. Stage 3 of the roadmap expands it substantially:
- Tool-specific citations. Claude Code’s documentation, Gemini CLI’s release notes, Codex CLI’s reference pages — currently referenced only implicitly in the companion appendices, with explicit citations to land as Stage 3 chapters complete their research phase.
- Academic and practitioner sources. Koller-Friedman’s Probabilistic Graphical Models (the pedagogical framework), Tufte’s design books (the layout inheritance), practitioner long-form on AI-assisted development — landing as chapters citing them move out of draft.
- Faceted filtering (post-v1.0). Client-side filters by tier, tool, and publish-date band over the auto-rendered listing above. Live already at the chapter-level tool filter; the faceted source view reuses the same plumbing.
- Drift-detection workflow. A scheduled job (quarterly) re-fetches every T1 source and flags significant hash changes for human review. The manifest grows a last_verified_at field tracking when each source was last re-checked against its live URL.
Audit expectations
Per Ch 15’s methodology, each entry in the archive carries an implicit re-verification cadence based on its tier:
- T1-official sources: verify on major vendor releases (triggered by release, not by calendar).
- T2-release-notes sources: verify quarterly.
- T3-practitioner sources: verify annually — the author is unlikely to retract, but arguments do evolve.
- T4-conjecture sources: verify before relying on; otherwise do not cite.
The archive’s integrity is the book’s integrity. A source whose content has shifted beneath a citation silently demotes the chapter that cited it to an unknown state. The drift-detection workflow above is the machinery that keeps the gap small; the human audit cadence is the safety net when the machinery misses something.
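The tier-based cadence above can be expressed as a due-date check. In this sketch the day counts (91 for quarterly, 365 for annually) are approximations chosen for the example, and the two boolean inputs stand in for events the calendar cannot see: a major vendor release for T1, and an intent to cite for T4. The function name is hypothetical.

```python
from datetime import date, timedelta

# Calendar-driven tiers only; T1 and T4 are event-driven.
CALENDAR_CADENCE = {
    "T2-release-notes": timedelta(days=91),  # roughly quarterly
    "T3-practitioner": timedelta(days=365),  # roughly annually
}


def is_due(tier: str, last_verified: date, today: date,
           release_since_verified: bool = False, about_to_cite: bool = False) -> bool:
    """Decide whether a source needs re-verification under its tier's cadence."""
    if tier == "T1-official":
        return release_since_verified  # triggered by release, not by calendar
    if tier == "T4-conjecture":
        return about_to_cite           # verify before relying on it
    return today - last_verified >= CALENDAR_CADENCE[tier]
```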
Quick reference
- Canonical location: sources/manifest.yaml; citations by slug via <Citation src="slug" />
- Full captures in sources/cache/<hash>/ (gitignored; drift-detection only)
- Four tiers: T1-official, T2-release-notes, T3-practitioner, T4-conjecture
- Audit cadence scales with tier — T1 on release, T2 quarterly, T3 annually, T4 before citing
- The listing is auto-rendered from the manifest at build time; Stage 3 expands its entries and adds the drift-detection workflow
- Volatility: architectural-pattern — the archive’s structure is durable; specific entries change as the book grows.
Appendix E — Glossary
A cross-tool vocabulary. Each entry names a concept the book uses, gives a tool-agnostic definition, then maps the concept to the specific surface each of the three tools exposes it as. The glossary is the translation layer: if a body chapter talks about briefing docs and you only know GEMINI.md, the entry tells you they are the same thing.
The body chapters use deliberately tool-agnostic vocabulary — briefing doc, extension surface, headless mode — because convergent concepts deserve convergent names. Readers coming in through a single tool’s documentation know a different vocabulary. This glossary is the translation. Each entry: the book’s name for the concept, a short definition, and the specific surface each tool exposes it as.
Terms
Agent
The runtime loop that consumes a prompt, decides what to do, calls tools, observes their output, and iterates until the task is complete. The agent is distinct from the underlying model (which produces text completions) and from the tool (which executes a specific capability like Read or Bash). Across Claude Code, Gemini CLI, and Codex CLI the agent loop has converged on the same shape; differences are in the primitives exposed to the operator.
Approval mode
A permission model — Codex CLI’s distinctive primitive — that configures agent autonomy along a depth axis: “ask every action” → “ask on request” → “ask on failure” → “never ask.” Contrasts with Claude and Gemini’s allow/deny-list models, which configure permission along a tool-scope axis. The two models describe overlapping territory; they are not semantically equivalent. See Permission policy.
Briefing doc
A convention: a Markdown file at the repo root that stateless agents re-read on every session, encoding project-level context (build commands, conventions, current focus). The convergence point is cross-tool:
- Claude Code: CLAUDE.md
- Gemini CLI: GEMINI.md
- Codex CLI: AGENTS.md
All three tools resolve the briefing doc from the working directory upward. The file’s role is identical; only the filename differs.
Context window
The total token budget the model can attend to in a single turn: the briefing doc + conversation history + tool outputs + system prompt. Budget is finite and non-linear — content past certain positions effectively decays (see Context rot). Claude’s production models ship a 200K-token window; Gemini’s span multi-hundred-K to multi-million tokens; Codex-class models typically range around 272K at time of capture. The size difference matters less than it appears once context rot is accounted for.
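The budget arithmetic is simple enough to write down. The component sizes below are made-up illustrative numbers, and `context_usage` is a hypothetical helper; only the 200K window figure comes from the entry above.

```python
def context_usage(window_tokens: int, components: dict[str, int]) -> tuple[int, float]:
    """Return (tokens remaining, fraction of the window already used)."""
    used = sum(components.values())
    return window_tokens - used, used / window_tokens


# Illustrative (invented) sizes for a single turn.
turn = {
    "system_prompt": 3_000,
    "briefing_doc": 2_000,
    "conversation_history": 40_000,
    "tool_outputs": 60_000,
}
remaining, fraction_used = context_usage(200_000, turn)
```

A count like this only captures the hard limit. It says nothing about the positional decay described under Context rot, which is why remaining headroom can overstate how much of the context is still effectively usable.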
Context rot
The observation that attention quality degrades across a long context — not only at the window’s hard edge but well before it. A claim made in turn 3 of a 50-turn session may effectively be invisible by turn 40 even if it technically still fits in the token budget. Ch 2 is the main treatment. The phenomenon is substrate-level; it applies to every CLI agent and every large-window model equally.
Convergence / Divergence
Book vocabulary (Ch 15) for how tool behaviors align or differ. A pattern is convergent when all tracked tools have adopted it (e.g. briefing doc at repo root, headless print mode, MCP client support), which signals that the pattern is architecturally stable. A pattern is divergent when the tools implement it differently or only some of them offer it (e.g. hook event granularity, approval-mode-vs-allow-list permissions), which signals open design space. Convergence is a bet practitioners can safely make; divergence is a bet they cannot.
Extension surface
The parts of a tool that operators can extend without modifying the tool itself — slash commands, skills, hooks, MCP servers. Each tool’s extension surface has a different shape (Ch 8); this is the biggest divergence axis across the three tools.
Hook
A user-defined shell command that fires in response to an agent lifecycle event (pre-tool-use, post-tool-use, session-start, etc.). Primarily a Claude Code primitive; Gemini and Codex have hook-adjacent but less-developed surfaces. See Appendix A for the current Claude event list.
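The mechanism can be illustrated with a small dispatcher that maps lifecycle events to shell commands. This is a sketch under stated assumptions: the registry shape, passing the payload as JSON on stdin, and returning raw exit codes are illustrative choices here, not any tool's hook contract. Appendix A remains the reference for real event names.

```python
import json
import subprocess


def fire_hooks(hooks: dict[str, list[str]], event: str, payload: dict) -> list[int]:
    """Run every shell command registered for `event`; return their exit codes."""
    exit_codes = []
    for command in hooks.get(event, []):
        completed = subprocess.run(
            command,
            shell=True,                 # hooks are user-defined shell commands
            input=json.dumps(payload),  # assumption: payload delivered on stdin
            text=True,
            capture_output=True,
        )
        exit_codes.append(completed.returncode)
    return exit_codes


# Hypothetical registry: two trivial commands on one event.
hooks = {"pre-tool-use": ["exit 0", "exit 3"]}
codes = fire_hooks(hooks, "pre-tool-use", {"tool": "bash", "arg": "ls"})
```

A caller could treat any non-zero exit code as a veto on the pending tool call; that policy is left out of the sketch because it varies by tool.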
Headless mode
Invocation of the agent with no human present to observe, approve, or interrupt during the run. Ch 12 is the main treatment. Cross-tool naming:
- Claude Code: claude -p "<prompt>" (print mode)
- Gemini CLI: gemini -p "<prompt>" or piped stdin
- Codex CLI: approval-mode flags chosen per invocation
Convergent in shape (one-shot non-interactive run); divergent in permission configuration.
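A one-shot run can be wrapped from a script roughly as follows. The sketch uses only the `-p` flag listed above for Claude Code and Gemini CLI; Codex is omitted because this entry does not name its flags. `headless_command` and `run_headless` are hypothetical helpers, and the timeout value is an arbitrary choice.

```python
import shutil
import subprocess

# Only the flags named in this glossary entry.
HEADLESS_FLAGS = {"claude": ["-p"], "gemini": ["-p"]}


def headless_command(tool: str, prompt: str) -> list[str]:
    """Build the argv for a one-shot, non-interactive run."""
    if tool not in HEADLESS_FLAGS:
        raise ValueError(f"no headless flags recorded for {tool!r}")
    return [tool, *HEADLESS_FLAGS[tool], prompt]


def run_headless(tool: str, prompt: str, timeout_s: int = 300) -> str:
    """Run the tool once and return its stdout; raises if it fails or is missing."""
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} is not on PATH")
    completed = subprocess.run(
        headless_command(tool, prompt),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return completed.stdout
```

Passing the prompt as a single argv element, rather than interpolating it into a shell string, avoids quoting problems when prompts contain quotes or newlines.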
MCP (Model Context Protocol)
A cross-tool protocol for exposing capabilities — tools, data sources, prompts — to an agent through a standardized server interface. MCP is one of the clearest convergence points in the field: all three agents are MCP clients; the same MCP server can be consumed by any of them. For Gemini, MCP is the primary extension model; for Claude and Codex, it is one of several extension surfaces.
Permission policy
The set of rules that constrain what an agent can do without asking. In Claude Code: an allow/deny list at tool-name granularity, with glob and path-scoping support, configured in settings.json. In Gemini CLI: similar allow/deny structure with a dangerous --yolo override that bypasses everything. In Codex CLI: approval modes (see above). All three share the underlying abstraction — what may the agent do unattended? — but express it differently.
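The allow/deny style can be sketched with glob matching. The `tool:argument` action strings and the deny-wins precedence are assumptions made for illustration, not the rule syntax or semantics of any of the three tools, and `may_run_unattended` is a hypothetical name.

```python
from fnmatch import fnmatchcase


def may_run_unattended(action: str, allow: list[str], deny: list[str]) -> bool:
    """Deny patterns win; otherwise the action must match an allow pattern."""
    if any(fnmatchcase(action, pattern) for pattern in deny):
        return False
    return any(fnmatchcase(action, pattern) for pattern in allow)


# Hypothetical policy using an invented "tool:argument" notation.
allow = ["bash:git *", "read:*"]
deny = ["bash:rm *"]

ok_git = may_run_unattended("bash:git status", allow, deny)
ok_write = may_run_unattended("write:src/app.py", allow, deny)
ok_rm = may_run_unattended("bash:rm -rf build", ["bash:*"], deny)
```

Anything that matches neither list falls through to `False`, which in this sketch means "ask the operator" rather than "forbidden".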
Plan mode
A read-only agent phase where the agent analyzes, reads, and proposes a plan but cannot mutate files. A Claude Code primitive with a dedicated keyboard shortcut; on Gemini and Codex, closer to a convention operators impose via instructions rather than a first-class mode. Ch 6 treats the principle (think before acting); plan mode is the Claude-specific implementation.
Session
A single bounded conversation between user and agent. Sessions have state (prompt history, context loaded, tools registered). claude --resume, gemini /chat, and Codex’s equivalent let sessions be reattached across invocations. A session is distinct from a workflow (which may span multiple sessions) and from a project (which persists across all sessions).
Skill
A named, self-contained capability the agent can invoke when a situation matches: a directory containing a SKILL.md describing when to use it, plus any scripts or references the skill needs. Claude Code uses the term most explicitly, and its skills are invoked autonomously; Gemini and Codex have adjacent concepts (slash commands, prompt templates) that overlap but are not identical in intent: skills auto-fire, whereas slash commands must be triggered. See Ch 8.
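Because a skill is defined by its directory layout, discovering skills reduces to a directory scan. The sketch below assumes one subdirectory per skill, each with a SKILL.md at its top level; `discover_skills` is a hypothetical helper, not part of any tool.

```python
from pathlib import Path


def discover_skills(skills_dir: Path) -> list[str]:
    """Return the names of subdirectories that contain a SKILL.md file."""
    if not skills_dir.is_dir():
        return []
    return sorted(
        entry.name
        for entry in skills_dir.iterdir()
        if entry.is_dir() and (entry / "SKILL.md").is_file()
    )
```

For a Claude Code project this might be called as `discover_skills(Path(".claude/skills"))`, using the path given in the translation table.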
Slash command
A user-invoked shortcut: a slash-prefixed name the operator types to trigger a prompt template or a built-in action, such as /compact, /memory refresh, or /plan. Conventional across the three tools; storage locations differ:
- Claude Code: .claude/commands/ (project) or ~/.claude/commands/ (user)
- Gemini CLI: .gemini/commands/ (project) or ~/.gemini/commands/ (user)
- Codex CLI: project-level config at the repo root
Source tier
One of four trust levels (T1-official → T4-conjecture) assigned to any cited external source per the methodology of Ch 15. The tier determines the audit cadence for claims resting on the source. See Appendix D for the archive index.
Subagent
A child agent spawned by the main agent with its own context window, tool access, and prompt, which runs a bounded subtask and returns a summary to the parent. Claude Code implements this via the Task tool with configurable agent types (.claude/agents/). Gemini and Codex have partial parallels (MCP-based delegation, async task invocation) but not identical ergonomics. Ch 9 treats the principle.
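The key property, a separate context that returns only a summary, can be shown without any real model. Everything here is a hypothetical stand-in: `delegate` and `noisy_worker` are invented names, and contexts are modeled as plain lists of strings.

```python
from typing import Callable


def delegate(parent_context: list[str],
             task: str,
             worker: Callable[[list[str]], str]) -> str:
    """Run `worker` in a fresh context; only its summary reaches the parent."""
    child_context = [f"task: {task}"]  # fresh window, not a copy of the parent's
    summary = worker(child_context)    # the child may fill its own context freely
    parent_context.append(f"subagent summary: {summary}")
    return summary


# Hypothetical worker that generates a lot of intermediate context.
def noisy_worker(context: list[str]) -> str:
    context.extend(f"read file_{i}.py" for i in range(50))
    return f"scanned {len(context) - 1} files, found no TODOs"


parent = ["user: audit the repo for TODOs"]
delegate(parent, "scan for TODOs", noisy_worker)
```

The child accumulated fifty entries of intermediate output, while the parent's context grew by a single line. That asymmetry is the reason to delegate.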
Volatility class
Book vocabulary (Ch 15) for how fast a claim is likely to date: stable-principle, architectural-pattern, or feature-surface. Every chapter carries a volatility class in frontmatter; audit cadences scale accordingly. See the book’s src/content.config.ts for the canonical list.
Tool-to-concept translation table
A compact reference for readers who know one tool’s vocabulary and want to translate:
| Concept | Claude Code | Gemini CLI | Codex CLI |
|---|---|---|---|
| Briefing doc | CLAUDE.md | GEMINI.md | AGENTS.md |
| Project slash commands | .claude/commands/ | .gemini/commands/ | project config |
| Autonomous capability | Skills (.claude/skills/) | MCP tools | Prompt templates |
| Hook-style lifecycle | .claude/settings.json → hooks | Thin / convention-based | Wrapper scripts |
| Subagent primitive | Task tool + .claude/agents/ | Partial (MCP-based) | Partial |
| Headless invocation | claude -p (print) | gemini -p | Approval-mode flags |
| Permission model | Allow/deny with globs | Allow/deny + --yolo override | Approval modes |
| Enterprise backend | Bedrock / Vertex | Vertex (native) | OpenAI / Azure |
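The table can also be treated as data. The sketch below copies three rows verbatim and adds a hypothetical `translate` helper that maps a surface you know in one tool to its counterpart in another; the remaining rows would be added the same way.

```python
from typing import Optional

# Three rows copied from the table above.
SURFACES = {
    "Briefing doc": {
        "Claude Code": "CLAUDE.md",
        "Gemini CLI": "GEMINI.md",
        "Codex CLI": "AGENTS.md",
    },
    "Project slash commands": {
        "Claude Code": ".claude/commands/",
        "Gemini CLI": ".gemini/commands/",
        "Codex CLI": "project config",
    },
    "Permission model": {
        "Claude Code": "Allow/deny with globs",
        "Gemini CLI": "Allow/deny + --yolo override",
        "Codex CLI": "Approval modes",
    },
}


def translate(surface: str, from_tool: str, to_tool: str) -> Optional[tuple[str, str]]:
    """Return (concept, surface in to_tool) for a surface known in from_tool."""
    for concept, by_tool in SURFACES.items():
        if by_tool.get(from_tool) == surface:
            return concept, by_tool[to_tool]
    return None


hit = translate("GEMINI.md", "Gemini CLI", "Codex CLI")
```

Keying on the concept name rather than on any tool's filename keeps the lookup stable when a tool renames its surface.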
Volatility note
This glossary’s tool-to-surface mappings are feature-surface volatility: paths, flag names, and specific commands will shift over time. The concepts (briefing doc, headless mode, permission policy, skill, subagent) are architectural-pattern volatility: the shape of the abstraction is durable even when the implementation moves. If a mapping above is stale, the concept is still the right thing to look up; translating to the current tool’s surface is the quarterly audit task.
Appendix F — Maturity model
A five-level maturity model for agentic coding practice, across six dimensions. The model is diagnostic rather than prescriptive: most teams do not and should not aim for the highest level on every dimension. The right level depends on your team’s risk surface, team size, and regulatory context. The value of the model is in self-locating (where are we?) and in roadmap sequencing (what’s the next natural move?).
Maturity models get a bad reputation from their overuse in management consulting — “your Level 2 team should be aiming at Level 4” applied regardless of whether the climb actually serves anyone. This one is shaped to avoid that trap. The right level for a dimension depends on what that dimension is load-bearing for; a solo practitioner has no reason to reach the team-governance level on team dimensions; a regulated enterprise cannot operate below a certain level on compliance dimensions. The model’s job is diagnostic — showing you where you are — and sequencing — showing you what the next natural move is. It is not a finish line.
Six dimensions
An agentic-coding practice matures along six mostly-independent axes.
- Individual discipline — the practitioner’s own workflow habits: briefing hygiene, session management, audit cadence.
- Briefing and context — how shared context is authored, owned, maintained, and pruned.
- Extension surface — how deeply the team invests in skills, commands, hooks, MCP servers.
- Automation — how much agent work happens without interactive human supervision: CI, scheduled runs, event-triggered pipelines.
- Governance — team-level ownership, review discipline, permission policy.
- Audit and maintenance — explicit self-review cadences, drift detection, artifact deprecation.
The dimensions are mostly independent because moving on one does not force moving on another. A team can have sophisticated automation (L3) with primitive governance (L1); in fact, that combination is the most common source of incidents. The model’s diagnostic value is showing you dimensions that have advanced out of step with each other.
Levels
Five levels, each defined by observable practice rather than by intent.
Level 0 — Ad-hoc
The agent is used occasionally by individuals with no persistent context. Each session starts from scratch; no briefing doc exists; no commands or skills are codified; there is no audit discipline. This is where nearly every team starts.
Observable signals. Agent invocations happen without a briefing file in the repo. Team members independently discover the same prompts and patterns. There is no shared vocabulary for what the agent is good at. Recurring tasks are re-prompted from scratch each time.
When this level is appropriate. Early exploration, non-critical experimentation, single-person use of the agent for tasks that don’t repeat. Staying at L0 beyond the experimentation phase is a waste of compounding leverage.
Level 1 — Individual discipline
Individual practitioners have personalized their workflow: personal commands, personal skills, personal briefing preferences in ~/.claude/CLAUDE.md or equivalent. But nothing is shared with the team. Each engineer has an effective but private practice.
Observable signals. Individuals speak about their workflow competently; they invoke slash commands and skills reflexively. But when asked to share, the sharing is ad-hoc — a colleague watches over the shoulder, a DM with a paste of commands, no repo-committed artifact.
When this level is appropriate. Solo work, or a team where the agent is one of several competing tools and the team has not yet consolidated. This is a stable equilibrium for many practitioners; the move to L2 requires shared intent.
Level 2 — Team-shared
Shared artifacts exist in the repo: a committed briefing doc with an owner, team-tier skills and commands, basic permission policy. Agent-assisted review happens via a GitHub Action. The team has converged on some shared vocabulary.
Observable signals. CLAUDE.md or equivalent at the repo root is on the review path; changes go through PR. The .claude/commands/ or .gemini/commands/ directory has meaningful content. New team members are onboarded to the agent infrastructure as part of repo onboarding.
When this level is appropriate. Most teams doing nontrivial shared work. L2 is the most common target; it captures most of the agent’s team-scale leverage with manageable governance overhead.
Level 3 — Automated
Agent work escapes the interactive session: CI integration triggers agents on PRs or issues, scheduled agents run maintenance tasks, structured logging ties agent runs back to humans and correlates with model-endpoint access logs. The team has moved from “agent helps me write code” to “agent runs parts of the workflow.”
Observable signals. .github/workflows/ contains agent actions. Scheduled cron jobs invoke agents. Structured log pipelines capture each agent invocation. A human on-call rotation owns the automation, not just the underlying code.
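One way to make "structured logging ties agent runs back to humans" concrete is a JSON-lines record per invocation. The field names below are hypothetical, chosen for illustration; the book does not prescribe a schema.

```python
import json
from datetime import datetime, timezone


def agent_run_record(run_id: str, trigger: str, triggered_by: str,
                     tool: str, exit_code: int) -> str:
    """Serialize one agent invocation as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "trigger": trigger,            # e.g. "schedule", "pull_request", "manual"
        "triggered_by": triggered_by,  # the human or team accountable for the run
        "tool": tool,
        "exit_code": exit_code,
    }
    return json.dumps(record, sort_keys=True)


# Illustrative values only.
line = agent_run_record("run-0001", "schedule", "alice@example.com", "claude", 0)
```

Carrying a `run_id` on every line is what would let these records be joined against model-endpoint access logs later, provided the same identifier is attached to the outbound requests.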
When this level is appropriate. Teams operating at a scale where interactive-only agent use leaves compounding leverage on the table — typically 5+ engineers, frequent repetitive tasks (dependency bumps, documentation updates, triage), or growing repos. The move from L2 to L3 is where most teams encounter the failure modes of Ch 12 (unexpected egress, credential leakage, scheduled-agent drift); plan for them before shipping automation.
Level 4 — Governed
Policy-as-code, audit-log integration with enterprise SIEM, compliance-reviewed deployment envelope, formal change-control for agent tooling updates, quarterly drift-detection workflow. The agent infrastructure is treated as production-grade shared infrastructure.
Observable signals. The permission configuration is version-controlled and gates on CI tests. Policy changes require formal review. Audit logs ship to the enterprise’s central logging; an auditor can reconstruct any agent action. There is a named person accountable for the agent infrastructure, not just for the code it produces.
When this level is appropriate. Regulated environments, enterprise contexts (Ch 14), any setting where the blast radius of an unchecked agent action is incompatible with the team’s risk tolerance. L4 is not an aspirational goal for every team — the overhead is meaningful — but it is not optional for some.
The six-dimension matrix
The meaningful use of the model is populating this matrix for your own practice. For each dimension, name the level that best describes you today:
| Dimension | L0 | L1 | L2 | L3 | L4 |
|---|---|---|---|---|---|
| Individual discipline | Ad-hoc | Personal workflow | Shared vocabulary | Practice-pattern expertise | Teaches others |
| Briefing and context | None | Personal only | Committed briefing doc | Versioned + quarterly-trimmed | Policy-reviewed |
| Extension surface | None | Personal scripts | Team skill registry | MCP servers + hooks | Curated, versioned, policy-bound |
| Automation | None | One-shot scripts | Occasional CI | Scheduled + event-driven | Formal SRE-grade operations |
| Governance | None | Informal | CODEOWNERS + PR review | Policy config in repo | Policy-as-code + SIEM |
| Audit and maintenance | None | Ad-hoc reflection | Quarterly-ish review | Systematic cadences | Automated drift detection |
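Filling in the matrix amounts to assigning one level per dimension, after which the out-of-step dimensions can be found mechanically. The sketch below is one possible encoding: `diagnose` is a hypothetical helper, the sample team is invented, and the two-level gap threshold follows the rule of thumb in this appendix's quick reference.

```python
DIMENSIONS = (
    "individual discipline",
    "briefing and context",
    "extension surface",
    "automation",
    "governance",
    "audit and maintenance",
)


def diagnose(assessment: dict[str, int], gap_threshold: int = 2):
    """Return (weakest dimension, [(ahead, behind, gap), ...]) for a self-assessment."""
    missing = set(DIMENSIONS) - set(assessment)
    if missing:
        raise ValueError(f"unassessed dimensions: {sorted(missing)}")
    weakest = min(DIMENSIONS, key=lambda d: assessment[d])
    gaps = [
        (ahead, behind, assessment[ahead] - assessment[behind])
        for ahead in DIMENSIONS
        for behind in DIMENSIONS
        if assessment[ahead] - assessment[behind] >= gap_threshold
    ]
    return weakest, gaps


# Invented example: automation has run ahead of governance.
team = {
    "individual discipline": 2,
    "briefing and context": 2,
    "extension surface": 2,
    "automation": 3,
    "governance": 1,
    "audit and maintenance": 2,
}
weakest, gaps = diagnose(team)
```

For this team the weakest dimension is governance, and the only flagged gap is automation over governance by two levels, which matches the L3-automation-with-L1-governance pattern the appendix warns about.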
Progression signals
The question readers most often ask of a maturity model is: how do I know we’re ready to move to the next level? Brief signals for each transition:
L0 → L1. You find yourself retyping the same prompt for the third time. You have read at least one tool’s documentation past the quickstart. You have stopped being surprised by the agent’s basic behaviors. Move to L1 by writing your first personal skill or command.
L1 → L2. Two colleagues are doing similar work with the agent and neither knows what the other has. You have explained your setup to a teammate more than once. A new team member has been onboarded and there is nothing to show them. Move to L2 by committing a briefing doc and promoting one personal skill to team-tier.
L2 → L3. Your team does the same repetitive task weekly that the agent could handle with minimal prompting. A mechanical PR review pattern is slowing reviewers down. You’ve asked “could the agent do this on a cron?” and had no infrastructure to point at. Move to L3 by shipping one scheduled or CI-triggered agent with structured logging.
L3 → L4. Your automated agent runs have produced a near-miss incident (or a real one) that a policy engine would have prevented. Your compliance team has asked for an audit trail you cannot produce. The agent has been given a credential broader than it needed. Move to L4 by formalizing policy-as-code and wiring audit logs to your central logging.
When to stay put
The honest counterpart: progression is not always the right move.
- A solo practitioner at L1 with no team is not at a deficit. There is no L2 to reach because L2 requires sharing.
- A small team at L2 on a low-stakes codebase should usually not chase L4 governance. The overhead will slow them down without commensurate risk reduction.
- A team at L3 automation that has not yet stabilized its governance should fix governance before pushing further into automation. Advancing unevenly is riskier than staying put.
The model is a map, not a gradient to climb. The question it answers is given where we are, where is the next move most valuable? — not how do we get to the top?
Quick reference
- Six dimensions: individual discipline, briefing and context, extension surface, automation, governance, audit/maintenance.
- Five levels: L0 ad-hoc, L1 individual, L2 team-shared, L3 automated, L4 governed.
- Dimensions are mostly independent; gaps of 2+ levels between dimensions signal where the next incident will come from.
- Advance the weakest dimension, not the one already ahead.
- Progression is not universally correct. Solo practitioners at L1, small teams at L2, or regulated enterprises at L4 can all be at the right level for their context.
- The matrix is diagnostic (where are we?) and sequencing (what’s the next natural move?), not prescriptive.
- Volatility: stable-principle — the shape of maturity progression is durable; the specific tools and practices at each level change over time. Audit annually alongside the rest of Part V’s meta-discipline.