Part 1 · Chapter 3 · Last verified 2026-04-17 · Fresh

Prompting as specification

Prompts are specifications — the input side of a stateful loop. Five levers shape how the agent interprets a prompt — precision, scope, structure, depth, and cost. This chapter treats prompting as an engineering activity, not a conversational art.

Volatility: architectural-pattern

Tools compared: claude-code, gemini-cli, codex-cli

Your prompts work, but they’re inconsistent. Sometimes the agent nails it on the first try; sometimes you spend three rounds correcting misunderstandings. The gap is not randomness — it is precision. A prompt is the specification half of a contract; the agent’s output is the implementation half. Vague specs produce vague implementations. The rest follows.

Representation

A prompt is a specification, not a request. It declares the task, the constraints, the acceptable outputs, and the verification criteria — the same shape a well-written function signature has. When this framing clicks, most prompting problems dissolve: the fix is never “try a different phrasing” but “say what you actually want.”

Precision: the vocabulary problem

Natural language is ambiguous on purpose. “Clean this data” could mean drop nulls, impute, validate schema, deduplicate, standardize types, or all five. The agent picks one interpretation and runs with it; three rounds of correction later, you converge on what you actually meant.

The fix is a shared precision vocabulary — a small, stable set of verbs that mean exactly one thing in your project.

Natural language	Precise specification
“Clean this data”	`validate schema → impute nulls with median → drop rows where target is NaN`
“Train a model”	`fit XGBoost on train split → evaluate AUC on val → log params to MLflow`
“Check if this works”	`run pytest → check no data leakage → verify feature distributions match prod`
“Add error handling”	`add validation to build_features() → raise ValueError on schema mismatch → log and skip rows with >50% NaN`
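
One way to keep such a vocabulary stable is to store it as data and expand it when assembling a prompt. The sketch below is illustrative only: `VOCABULARY` and `expand` are hypothetical names, and the step lists are copied from the table above.

```python
# Hypothetical project vocabulary: each verb maps to exactly one pipeline.
VOCABULARY = {
    "clean": [
        "validate schema",
        "impute nulls with median",
        "drop rows where target is NaN",
    ],
    "train": [
        "fit XGBoost on train split",
        "evaluate AUC on val",
        "log params to MLflow",
    ],
}


def expand(verb: str) -> str:
    """Return the precise specification for a project verb.

    Raises KeyError for verbs outside the vocabulary, so an undefined
    term fails loudly instead of being left for the agent to interpret.
    """
    steps = VOCABULARY[verb.lower()]
    return " → ".join(steps)


if __name__ == "__main__":
    print(expand("clean"))
```

Failing on unknown verbs is a deliberate choice here: a term that is not in the vocabulary is exactly the kind of ambiguity this section is trying to remove.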

Scope: the boundary problem

Agents are helpful by default — they read broadly, notice related issues, and fix them. This is useful until it’s scope creep: the agent modifies a file you weren’t working on, refactors a function you didn’t mention, or “improves” code that was intentionally written that way.

Scope is a spectrum:

Too restrictive: “Only modify line 42 of auth.py.” The agent can’t fix related issues the change requires.
Too permissive: “Fix the auth system.” The agent may rewrite half the codebase.
Right scope: “Fix the session expiry bug in src/auth/session.py. You may also modify src/auth/middleware.py if the fix requires it. Don’t touch other files without asking.”

Scope creep is a prompting gap, not a model deficiency. The fix is always in the prompt — name the files, separate discovery from action, or use plan mode. Never hope the agent will guess your boundaries.
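
A prompt boundary can also be backed by a mechanical check after the agent runs. The sketch below assumes you already have a list of changed paths (for example, from `git diff --name-only`); `ALLOWED` and `out_of_scope` are hypothetical names, and the paths come from the example above.

```python
from fnmatch import fnmatch

# Files the prompt explicitly put in scope (from the example above).
ALLOWED = ["src/auth/session.py", "src/auth/middleware.py"]


def out_of_scope(changed_files, allowed_patterns):
    """Return the changed paths that match none of the allowed patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]


if __name__ == "__main__":
    changed = ["src/auth/session.py", "src/billing/invoice.py"]
    # Any path reported here was modified outside the stated boundary.
    print(out_of_scope(changed, ALLOWED))
```

The check does not replace the scope statement in the prompt; it only reports when the boundary was crossed so you can review or revert.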

Structure: the organization problem

For complex prompts, plain prose mixes concerns. Task, context, constraints, and verification run together and the agent has to infer structure. A lightweight markup — XML tags, numbered sections, or just clear paragraph separators — gives the agent that structure for free.

<context>
The churn model uses features from src/features/customer.py.
Current AUC is 0.72 on validation.
</context>

<task>
Add recency features: days since last purchase,
days since last login. Compute from the events table.
</task>

<constraints>
- No data leakage: features must use only pre-churn data.
- Must work with the existing FeatureStore interface.
- Include unit tests for the new features.
</constraints>

<verification>
Run: pytest tests/features/ -v
All existing tests must still pass.
Verify: no future-dated features in test set.
</verification>

The XML tags are not magic — they’re a convention the agent’s training reinforced. Any consistent structure works: numbered sections, Markdown headings, a table. The point is to make the shape of the specification explicit so the agent can parse it mechanically rather than guess.
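
If the same shape is reused across sessions, it can be generated rather than retyped. The following is a minimal sketch, not a standard API: `render_prompt` and `REQUIRED_SECTIONS` are hypothetical, and the four section names simply mirror the example above.

```python
# Section names taken from the example prompt above.
REQUIRED_SECTIONS = ("context", "task", "constraints", "verification")


def render_prompt(sections: dict[str, str]) -> str:
    """Wrap each section body in a matching pair of XML-style tags."""
    missing = [name for name in REQUIRED_SECTIONS if name not in sections]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    blocks = [
        f"<{name}>\n{sections[name].strip()}\n</{name}>"
        for name in REQUIRED_SECTIONS
    ]
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(render_prompt({
        "context": "Current AUC is 0.72 on validation.",
        "task": "Add recency features.",
        "constraints": "- No data leakage.",
        "verification": "Run: pytest tests/features/ -v",
    }))
```

Rejecting a prompt with a missing section makes an omitted verification step visible before the agent ever sees the prompt.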

Depth: the reasoning problem

Complex tasks benefit from the agent thinking before producing output. Trivial tasks do not; extra thinking burns context and latency for no gain. Each agent exposes this differently:

Claude Code: low / medium / high / max effort levels; ultrathink keyword escalates one turn.
Gemini CLI: thinking is adaptive by default with less granular user override.
Codex CLI: reasoning depth tied to model selection; less in-session control.

The principle is stable: match reasoning depth to problem complexity. Debugging a subtle concurrency issue needs deep thinking; renaming a variable does not. The controls vary; the judgment doesn’t.
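
Teams sometimes write this judgment down so it is applied consistently. The lookup below is a sketch under stated assumptions: the task categories are hypothetical, the effort names mirror the levels listed above, and only the two extremes (renaming versus concurrency debugging) come from the text; the middle assignments are illustrative guesses.

```python
# Hypothetical task categories mapped to an effort level.
EFFORT_BY_TASK = {
    "rename-variable": "low",          # from the text: trivial work
    "add-unit-test": "medium",         # assumption for illustration
    "cross-module-refactor": "high",   # assumption for illustration
    "concurrency-debugging": "max",    # from the text: needs deep thinking
}


def pick_effort(task_kind: str, default: str = "medium") -> str:
    """Return the effort level for a task kind, or a default if unknown."""
    return EFFORT_BY_TASK.get(task_kind, default)


if __name__ == "__main__":
    print(pick_effort("rename-variable"))
    print(pick_effort("concurrency-debugging"))
```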

Cost: the economics problem

Every prompt has a token price. The levers that reduce it:

Prompt caching: content repeated across calls hits a cache at a substantial discount.
Batch APIs: bulk operations run asynchronously at a discount.
Model selection: use a smaller model for simpler tasks.

The biggest wins come from workflow design, not from squeezing individual prompts: a stable briefing doc that hits the cache on every turn saves more than any single-prompt optimization.
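
The arithmetic behind that claim can be sketched in a few lines. Everything numeric here is made up for illustration: costs are counted in full-price input-token equivalents, and the model ignores output tokens and any cache-write surcharge a provider may apply. `session_cost` is a hypothetical helper.

```python
def session_cost(prefix_tokens, turn_tokens, turns, cache_discount):
    """Input cost of a session, in full-price token equivalents.

    The stable prefix is paid in full on the first turn and at a
    discounted rate on every later turn; per-turn content is never cached.
    """
    first_prefix = prefix_tokens
    cached_prefix = prefix_tokens * (1 - cache_discount) * (turns - 1)
    fresh = turn_tokens * turns
    return first_prefix + cached_prefix + fresh


if __name__ == "__main__":
    # Hypothetical session: 10,000-token briefing doc, 500 new tokens
    # per turn, 20 turns.
    uncached = session_cost(10_000, 500, 20, cache_discount=0.0)
    cached = session_cost(10_000, 500, 20, cache_discount=0.9)
    print(f"uncached: {uncached:,.0f}  cached: {cached:,.0f}")
```

With these made-up numbers the cached session costs about 39,000 token equivalents against 210,000 uncached. Trimming the per-turn prompt from 500 to 250 tokens would save only 5,000, which is why the stable prefix dominates.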

Operation

The three CLI agents expose the prompting levers with different surface area. The table maps what’s broadly available; check your tool’s current docs for the exact command.

Lever	Claude Code	Gemini CLI	Codex CLI
File reference	`@path/to/file`	`@path/to/file`	`@path/to/file`
Directory reference	`@dir/`	`@dir/`	`@dir/`
Stdin pipe	`cat x | claude -p '...'`	`gemini -p "..." < x`	`codex exec "..." < x`
Image input	drag, paste, or `@screenshot.png`	drag or `@screenshot.png`	`@screenshot.png`
Plan / dry-run	Shift+Tab plan mode	`/plan` (v0.8+)	`--dry-run`
Reasoning depth	`/effort {low|medium|high|max}`	adaptive; limited override	via model choice
Prompt caching	automatic + explicit breakpoints	automatic	automatic
Batch API	Anthropic Batch API (50% off)	Vertex AI Batch	OpenAI Batch API (50% off)

Evolution

Prompting is the most convergent surface of agentic coding — the core vocabulary (precision, scope, verification, structure, depth) has been stable across tools since 2025. Active divergence lives in the reasoning-depth controls and the cost-optimization surface.

Convergence: the five-part prompt structure. Context → task → constraints → files in scope → verification. This emerged as a practitioner pattern in 2024, was reinforced by Anthropic’s official prompting guides in 2025, and has since been adopted by Gemini and OpenAI documentation. It’s no longer a contested choice.

Convergence: prompt caching as implicit optimization. All three tools cache repeated content automatically. The discounts vary (~90% for Claude, ~75% for Gemini, ~50% for Codex as of 2026), but the pattern — stable briefing doc, sticky tool definitions, repeated references — is the same.

Emerging: structured output modes. All three tools are shipping JSON-mode or schema-constrained output. Claude Code’s --output-format json landed in 2025; Gemini’s responseSchema parameter has been in the SDK since late 2025 and is reaching the CLI; Codex’s structured-output support is partial. Teams that pipe agent output into downstream automation should expect this surface to converge fully within a year.
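
Until the tools converge, downstream automation is safer when it validates whatever JSON it receives. The sketch below assumes nothing about any tool's actual output schema: `REQUIRED_KEYS` and the sample payload are hypothetical, and only the standard-library `json` module is used.

```python
import json

# Hypothetical keys your own pipeline expects; not any tool's real schema.
REQUIRED_KEYS = {"status", "files_changed"}


def parse_agent_output(raw: str) -> dict:
    """Parse agent output as a JSON object and check the expected keys.

    Raises ValueError on malformed JSON (json.JSONDecodeError is a
    ValueError subclass), on non-object payloads, and on missing keys.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


if __name__ == "__main__":
    sample = '{"status": "ok", "files_changed": ["src/auth/session.py"]}'
    print(parse_agent_output(sample)["files_changed"])
```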

Quick reference

A prompt is a specification, not a conversation. Treat it like a function signature — declare inputs, constraints, and verification.
Five levers: precision, scope, structure, depth, cost. Debug a failed prompt by naming which lever was wrong.
Precision vocabulary compounds. Stabilize it in your briefing doc; every future session benefits.
Scope is always your responsibility. “Only modify files I explicitly name” is a reasonable default.
Structured prompts (five-part shape) reduce correction rounds substantially. The boredom is the point.
Plan mode (Shift+Tab / /plan / --dry-run) separates discovery from action for unfamiliar code — cheap, universally supported.
Reasoning depth should match problem complexity. The controls vary; the judgment doesn’t.
Caching wins come from workflow stability, not from single-prompt optimization. A stable briefing doc beats a clever phrasing.