D4.1’s escalation ladder put examples on the second rung: when a plain instruction can’t fully pin a behavior — especially on the messy, ambiguous inputs — a demonstration does what a description cannot. This chapter is that rung. It is an architectural pattern: the 3-5-example construction and its quality criteria are stable across model versions, with the example-tag syntax as the illustration that may shift.

Examples are the most reliable steering mechanism

When a behavior is hard to describe, demonstrate it. “Examples are one of the most reliable ways to steer Claude’s output format, tone, and structure. A few well-crafted examples (known as few-shot or multishot prompting) can dramatically improve accuracy and consistency.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original An example carries information a sentence struggles to: the exact field ordering, the precise tone, how a borderline input should resolve. The model does not memorize the examples — it extracts the implicit pattern across them and applies it to the new input.

The 3-5 sweet spot

The documented count is small and specific: “Include 3-5 examples for best results. You can also ask Claude to evaluate your examples for relevance and diversity, or to generate additional ones based on your initial set.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original The range is not arbitrary — too few examples and the model latches onto an incidental trait; too many and you burn context for no gain and risk contradictory examples confusing it.

Relevant, diverse, structured

Quality matters more than count. The three criteria are explicit: “When adding examples, make them: Relevant: Mirror your actual use case closely. Diverse: Cover edge cases and vary enough that Claude doesn’t pick up unintended patterns. Structured: Wrap examples in <example> tags (multiple examples in <examples> tags) so Claude can distinguish them from instructions.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original Diversity is the criterion most often neglected and the most consequential: three examples that all happen to share an irrelevant trait teach that trait as if it were the rule.

Target the ambiguous case directly

This is the heart of the cert task area. When an extraction or classification has a messy case — a missing field, an “other” bucket, an unusual variant — do not write a separate rule for it; show an example on that case with the handling you want. The model generalizes from the example treatment, not from a prose rule beside it. Put the ambiguous input in the middle of the set with its desired output:

<examples>
  <example>
    <input>Order #4815 shipped on Apr 3 via UPS tracking 1Z999AA10123456784.</input>
    <output>{"order_id": "4815", "carrier": "UPS", "tracking": "1Z999AA10123456784"}</output>
  </example>
  <example>
    <input>Customer asked about order status yesterday but gave no order number.</input>
    <output>{"order_id": null, "carrier": null, "tracking": null}</output>
  </example>
  <example>
    <input>Shipped today via 'FedEx Express Saver' - see ref 7712-4488-9933.</input>
    <output>{"order_id": null, "carrier": "FedEx", "tracking": "7712-4488-9933"}</output>
  </example>
</examples>

The middle example is the ambiguous case: it teaches that “no order number” resolves to null — not an empty string, not "unknown", not "n/a". [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original Few-shot also composes with the next rung of the ladder: with structured outputs (D4.3), the schema locks the shape while examples still teach content and edge-case handling — the two are complementary, not redundant. [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · AnthropicT1-official original A schema can require order_id to be a string-or-null; only an example teaches that this kind of input is the null one.

The use-side craft of building and iterating on an example set lives in the Use book’s prompt-engineering chapter (forthcoming), alongside D4.1’s hands-on companion material.

Practice

Exercise solutions

Solution ↑ Exercise

B. The ambiguous case — an input with no due date — is exactly what an example should demonstrate: place one in the set whose output shows "due_date": null, and the model generalizes from that treatment. A is the D4.1 instinct (be explicit), and it helps, but a prose rule beside the examples is weaker than a demonstration on the case itself: the model generalizes from the example treatment, not from a separate written rule sitting next to it (the principle established in “Target the ambiguous case” above). C adds volume without diversity: ten clean invoices never show the missing-date case, so the model still has nothing to copy for it. D works mechanically but is a band-aid that hard-codes one symptom (“unknown”); the model may emit “none” or "" next, and you are now maintaining a translation table instead of fixing the prompt. The few-shot fix targets the root cause.

Solution ↑ Exercise

The three criteria are relevant (mirror your actual use case closely), diverse (cover edge cases and vary enough that the model doesn’t latch onto an unintended pattern), and structured (wrap each example in <example> tags, the set in <examples>, so the model separates demonstrations from instructions). Diversity is the silent corrupter: when three examples happen to share an incidental trait — every input ends in a period, every output capitalizes the first field — the model generalizes from whatever is common across the set, so it learns that trait as if it were the rule. The prompt still “looks right,” but it has quietly taught the wrong invariant, and the failure only surfaces on inputs that don’t share the accidental trait.

Solution ↑ Exercise

With a single example, the model cannot tell which of the example’s traits are the pattern and which are incidental — quoting the first field is one concrete trait of that one sample, and with nothing to contrast it against, the model copies it as if it were required. This is the documented failure of 1-2 examples: high risk of picking up an incidental pattern instead of the intended one. Raising the count into the 3-5 range fixes it by adding contrast: across several examples that vary the incidental traits (some quote nothing, different field orders) while holding the intended pattern constant, the first-field-quoting habit no longer appears in every example, so the model stops treating it as the rule. The cure is diversity-via-count, not volume for its own sake.

Exam essentials