Part IV opens the domain the exam weights at 20% — getting a model to produce what you actually need. The first lever is the cheapest and the most overlooked: the prompt’s own precision. Everything later in this Part (few-shot, structured outputs, validation loops) is an escalation from this baseline. The principle here outlasts every model version — newer models make it more true, not less — which is why this chapter is a stable principle, not a feature surface.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

Two runs of the same prompt return differently-shaped output. Where does the inconsistency actually live?
A more-literal model (Opus 4.8) is given a prompt with one unstated-but-intended requirement. What happens to that requirement?
Which steers more reliably — “don’t use markdown lists” or “respond in flowing prose” — and why?
Name the three rungs of the output-control escalation ladder, cheapest first.
Why does a newer, more capable model make “be explicit” more important rather than less?

Check your answers

In the specification, not the model — the disagreement was latent in the prompt: a degree of freedom you left unstated, which the model resolved differently each run.
It is simply not honored — Opus 4.8 interprets prompts literally and “does not infer requests you didn’t make,” so an unwritten requirement goes unmet.
“Respond in flowing prose” — a positive instruction points straight at the destination, while a prohibition only names a forbidden region without locating the target inside the still-vast permitted one.
Explicit instruction → few-shot examples (D4.2) → structured outputs / strict tools (D4.3) — each rung costs more, so you climb only as far as the stakes require.
Because newer models follow instructions more literally — they invalidate the assumption that the model will generously guess your intent, so an unstated requirement becomes a liability, not a gap the model fills.

Specify the output format; do not leave it to inference

The single most reliable quality lever is to state the output contract explicitly. “Precisely define your desired output format using JSON, XML, or custom templates so that Claude understands every output formatting element you require.” [Official] Increase output consistency · AnthropicT1-official original A vague instruction (“summarize this”) leaves the shape — length, fields, ordering, what to do with missing data — for the model to guess, and a guess varies from run to run. An explicit instruction (“return JSON with keys sentiment, key_issues (list), and action_items (list of objects with team and task)”) removes the variance at its source.

The model will not infer what you did not ask for

Modern models follow instructions more literally, which makes implicit expectations a liability. “Claude Opus 4.8 interprets prompts literally and explicitly, particularly at lower effort levels. It does not silently generalize an instruction from one item to another, and it does not infer requests you didn’t make.” [Official] Prompting best practices · AnthropicT1-official original The upside is precision and less thrash; the cost is that a requirement you held in your head but never wrote down will simply not be honored. The fix is not a cleverer prompt that the model “figures out” — it is stating the requirement.

Tell the model what to do, not what to avoid

When you are steering format or tone, a positive instruction outperforms a prohibition. The docs are explicit that demonstrating the wanted behavior beats forbidding the unwanted one: “Positive examples showing how Claude can communicate with the appropriate level of concision tend to be more effective than negative examples or instructions that tell the model what not to do.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original “Respond in smoothly flowing prose” steers better than “do not use markdown lists,” because a prohibition names a forbidden region without locating the target inside the (still vast) permitted one. A positive instruction points straight at the destination. The same logic applies to eliminating preambles: state “respond directly without preamble” rather than enumerating the openings you dislike. [Official] Prompting best practices · AnthropicT1-official original

The escalation ladder: instruction, then examples, then a hard schema

Explicit instruction is the first rung, not the only one. The documented hierarchy is to ask plainly first and escalate only when you need a stronger guarantee: “Try simply asking the model to conform to your output structure first, as newer models can reliably match complex schemas when told to, especially if implemented with retries. For classification tasks, use either tools with an enum field containing your valid labels or structured outputs.” [Official] Prompting best practices · AnthropicT1-official original Plain instruction handles most cases; a few-shot example (the next chapter) disambiguates edge cases; a hard schema (D4.3) makes a shape unrepresentable to violate. Each rung costs more — context, latency, setup — so you climb only as far as the stakes require.

Climbing the ladder on one prompt Worked example

A nightly job starts with a vague prompt and hardens it exactly as far as the stakes demand:

Rung 0 (vague). "Summarize this support ticket." Output drifts run to run — one line here, three paragraphs there, sentiment sometimes present — and the downstream parser breaks. The variance is latent in the prompt: length, fields, and sentiment-handling were never pinned.
Rung 1 (explicit instruction). "Return JSON: summary (≤2 sentences), sentiment (one of positive/neutral/negative), action_needed (boolean). If there is no clear action, set action_needed=false." Now every degree of freedom the parser cares about is fixed. This handles the common case and costs nothing but words.
Rung 2 (few-shot, D4.2). A batch of feature-request tickets keeps getting tagged negative because they describe a missing capability. The instruction can’t fully pin that judgment, so you add two worked examples showing a feature request labeled neutral. The example disambiguates what prose could not.
Rung 3 (structured outputs / strict tool, D4.3). The pipeline must never see a label outside the three — but the model occasionally emits "mixed". You move sentiment to an enum-constrained tool / structured output, so a fourth value is unrepresentable, not merely discouraged.

The discipline is to stop at the rung the stakes require. Most fields never leave Rung 1; only the crash-on-violation field (sentiment) earns Rung 3. Climbing further than the risk warrants just buys setup cost and latency you don’t need.

The hands-on craft of writing these prompts — the iteration rhythm, the worked before-and-after examples — is the Use book’s territory; its prompt-engineering chapter is the use-side companion to this exam-angle treatment (forthcoming in the handbook).

Practice

Exercise solutions

Solution ↑ Exercise

C. The inconsistency is latent in the prompt: “summarize” leaves length, fields, and sentiment-handling unstated, so the model resolves them differently each run. Specifying the exact contract — fixed fields, each with a type and a length bound — removes those degrees of freedom and is what a downstream parser needs. A (temperature) reduces token-level randomness but does nothing about an underspecified shape; a deterministic model still has to invent a structure you never gave it. B is a negative, vague instruction — “too much” is undefined and “be consistent” names the goal without specifying the target. D is the exact assumption modern literal-following models invalidate: a bigger model will not infer a contract you didn’t write, and may follow your vague prompt more faithfully, not less.

Solution ↑ Exercise

A positive rewrite: “Respond directly with the answer in flowing prose.” That single instruction covers both prohibitions — “directly” eliminates the preamble, “flowing prose” rules out headers — by naming the target instead of the forbidden regions. It steers more reliably because a prohibition (“don’t use headers,” “no preamble”) shrinks the output space without locating the destination inside the still-vast permitted region, whereas the positive form aims straight at the one shape you meant.

Solution ↑ Exercise

The three rungs, cheapest first: (1) explicit instruction — name the four labels in the prompt and ask the model to return exactly one; (2) few-shot examples — demonstrate the labeling on ambiguous inputs; (3) structured outputs / strict tools — constrain decoding so only an in-set label can be emitted. Here you should climb to rung 3: the premise is that a malformed or out-of-set label crashes the pipeline, so you need the shape to be unviolatable, not merely likely-correct. An enum-constrained tool or structured output makes an out-of-set value unrepresentable — the only rung that turns “should be valid” into “cannot be invalid,” which is what a crash-on-violation contract demands.

Exam essentials

Consistency lives in the specification — if two runs disagree, a degree of freedom was left unstated; pin the format (fields, types, lengths, missing-data handling) explicitly.
Modern models follow instructions literally — Opus 4.8 “does not infer requests you didn’t make,” so an unstated requirement is an unmet one; state it rather than expecting the model to read intent.
Positive beats negative — “respond in flowing prose” / “answer in one sentence” steers better than “don’t use lists” / “don’t be verbose”; aim at the target instead of ruling out one failure.
Escalation ladder — explicit instruction → few-shot examples (D4.2) → structured outputs / strict tools (D4.3); climb only as far as the stakes require, since each rung costs more.
Why stable-principle — “be explicit about what you want” survives every model version; newer, more-literal models make it more load-bearing, not less.