D4.1’s escalation ladder put examples on the second rung: when a plain instruction can’t fully pin down a behavior, especially on messy, ambiguous inputs, a demonstration does what a description cannot. This chapter is that rung. Treat it as an architectural pattern: the 3-5-example construction and its quality criteria are stable across model versions, while the example-tag syntax is an illustration that may shift.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Exam essentials; if any is shaky, read closely — each is developed below.

When does a demonstration outperform a written instruction — what kind of case is few-shot’s home turf?
What is the documented example-count sweet spot, and what fails below it and above it?
Name the three example-quality criteria. Which one, when neglected, silently corrupts the prompt?
To pin that a missing field resolves to null (not "unknown"), do you add a rule or an example — and where in the set?
Few-shot and structured outputs (D4.3): competing or complementary? What does each one control?

Check your answers

A demonstration wins where description is ambiguous — few-shot’s home turf is the case an instruction can’t fully specify, like how a borderline input should resolve.
3-5 examples is the documented sweet spot: with 1-2 the model latches onto an incidental trait, and at 6+ you burn context and risk disagreeing examples teaching “either is acceptable.”
Relevant, diverse, structured — diverse is the most often neglected and most consequential: examples that all share an irrelevant trait teach that trait as if it were the rule.
An example — show an input with the missing field whose output is null, placed in the middle of the set; the model generalizes from the example treatment, not from a prose rule beside it.
Complementary — the schema locks the shape while examples teach content and edge-case handling.

Examples are the most reliable steering mechanism

When a behavior is hard to describe, demonstrate it. “Examples are one of the most reliable ways to steer Claude’s output format, tone, and structure. A few well-crafted examples (known as few-shot or multishot prompting) can dramatically improve accuracy and consistency.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · Anthropic. An example carries information a sentence struggles to convey: the exact field ordering, the precise tone, how a borderline input should resolve. The model does not memorize the examples; it extracts the implicit pattern across them and applies it to the new input.

The 3-5 sweet spot

The documented count is small and specific: “Include 3-5 examples for best results. You can also ask Claude to evaluate your examples for relevance and diversity, or to generate additional ones based on your initial set.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · Anthropic. The range is not arbitrary: with too few examples the model latches onto an incidental trait; with too many you burn context for no gain and risk contradictory examples confusing it.

Relevant, diverse, structured

Quality matters more than count. The three criteria are explicit: “When adding examples, make them: Relevant: Mirror your actual use case closely. Diverse: Cover edge cases and vary enough that Claude doesn’t pick up unintended patterns. Structured: Wrap examples in <example> tags (multiple examples in <examples> tags) so Claude can distinguish them from instructions.” [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · Anthropic. Diversity is the criterion most often neglected and the most consequential: three examples that all happen to share an irrelevant trait teach that trait as if it were the rule.
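The "structured" criterion is mechanical enough to automate. The following is a minimal Python sketch, not an official helper: `render_examples` is a hypothetical function name, and the inner `<input>` and `<output>` tags are a convention assumed here rather than part of the quoted guidance. It wraps each pair in `<example>` and the whole set in `<examples>`.

```python
import json


def render_examples(pairs):
    """Wrap (input_text, output_dict) pairs in <example> tags inside one <examples> block."""
    lines = ["<examples>"]
    for input_text, output in pairs:
        lines.append("  <example>")
        lines.append(f"    <input>{input_text}</input>")
        # json.dumps renders Python None as JSON null
        lines.append(f"    <output>{json.dumps(output)}</output>")
        lines.append("  </example>")
    lines.append("</examples>")
    return "\n".join(lines)


# Illustrative pairs; real sets should mirror your own use case.
pairs = [
    ("Order #4815 shipped on Apr 3 via UPS tracking 1Z999AA10123456784.",
     {"order_id": "4815", "carrier": "UPS", "tracking": "1Z999AA10123456784"}),
    ("Customer asked about order status yesterday but gave no order number.",
     {"order_id": None, "carrier": None, "tracking": None}),
]

block = render_examples(pairs)
print(block)
```

Generating the block from data keeps the tagging consistent as the set changes; which examples to include remains a judgment call that no renderer makes for you.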

Target the ambiguous case directly

This is the heart of the cert task area. When an extraction or classification has a messy case (a missing field, an “other” bucket, an unusual variant), do not write a separate rule for it; show an example of that case with the handling you want. The model generalizes from the example treatment, not from a prose rule beside it. Put the ambiguous input in the middle of the set with its desired output:

<examples>
  <example>
    <input>Order #4815 shipped on Apr 3 via UPS tracking 1Z999AA10123456784.</input>
    <output>{"order_id": "4815", "carrier": "UPS", "tracking": "1Z999AA10123456784"}</output>
  </example>
  <example>
    <input>Customer asked about order status yesterday but gave no order number.</input>
    <output>{"order_id": null, "carrier": null, "tracking": null}</output>
  </example>
  <example>
    <input>Shipped today via 'FedEx Express Saver' - see ref 7712-4488-9933.</input>
    <output>{"order_id": null, "carrier": "FedEx", "tracking": "7712-4488-9933"}</output>
  </example>
</examples>

The middle example is the ambiguous case: it teaches that “no order number” resolves to null, not to an empty string, "unknown", or "n/a". [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · Anthropic. Few-shot also composes with the next rung of the ladder: with structured outputs (D4.3), the schema locks the shape while examples still teach content and edge-case handling, so the two are complementary, not redundant. [Official] Prompting best practices: Use examples effectively (multishot / few-shot) · Anthropic. A schema can require order_id to be a string or null; only an example teaches that this kind of input is the null one.
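The split between shape and content can be made concrete with a small Python sketch. `matches_shape` is a hypothetical stand-in for a schema validator, written by hand for illustration; it is not the structured-outputs feature itself. It accepts any record whose three fields are each a string or null, which is all a shape constraint of that kind can express.

```python
KEYS = ("order_id", "carrier", "tracking")


def matches_shape(record):
    """Shape check only: exactly the three keys, each value a string or None."""
    return set(record) == set(KEYS) and all(
        record[key] is None or isinstance(record[key], str) for key in KEYS
    )


# Both records describe the "no order number" input.
wanted = {"order_id": None, "carrier": None, "tracking": None}
unwanted = {"order_id": "unknown", "carrier": "n/a", "tracking": ""}

print(matches_shape(wanted))    # True
print(matches_shape(unwanted))  # True: valid shape, but not the handling we want
```

Both records pass, so the shape check cannot tell the desired handling from the undesired one. That gap is the one the null example in the set is there to close.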

Constructing a 3-5 set that targets the edge case Worked example

Task: classify a support ticket as one of bug | feature_request | question | other. The hard case is a feature request phrased as a complaint (“Your export is useless without CSV”) — instructions keep mislabeling it bug. Build the set deliberately:

<examples>
  <example>  <!-- canonical bug -->
    <input>Clicking Export throws a 500 error every time.</input>
    <output>{"category": "bug"}</output>
  </example>
  <example>  <!-- the edge case, placed in the middle -->
    <input>Your export is useless without CSV — fix this.</input>
    <output>{"category": "feature_request"}</output>
  </example>
  <example>  <!-- a question, to vary the class -->
    <input>Where do I find the export button?</input>
    <output>{"category": "question"}</output>
  </example>
  <example>  <!-- the "other" bucket -->
    <input>Thanks for the great support last week!</input>
    <output>{"category": "other"}</output>
  </example>
</examples>

Read it against the three criteria. Relevant: every input is a real ticket in your actual phrasing, not a toy sentence. Diverse: four different categories, and the second example deliberately breaks the “angry tone → bug” shortcut that the instruction-only prompt kept falling into; that one example is doing the real work. Structured: each pair is wrapped in <example>, the set in <examples>, so the model separates demonstrations from the instruction and the input. Four examples, one per class, with the edge case carrying its own slot: squarely in the 3-5 sweet spot. Note what is not here: a prose sentence saying “angry feature requests are not bugs.” The demonstration replaces that rule, and does it more reliably.
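Two properties of the set above, the count and the one-per-class coverage, can be checked before the prompt ships. This is a rough Python sketch under stated assumptions: `check_example_set` is a hypothetical helper, and requiring every label to appear is this worked example's design choice, not a documented rule. The edge-case input uses the shortened form quoted in the task description.

```python
LABELS = {"bug", "feature_request", "question", "other"}

EXAMPLES = [
    ("Clicking Export throws a 500 error every time.", "bug"),
    ("Your export is useless without CSV", "feature_request"),  # the edge case
    ("Where do I find the export button?", "question"),
    ("Thanks for the great support last week!", "other"),
]


def check_example_set(examples, labels, low=3, high=5):
    """Return a list of problems; an empty list means the set passes both checks."""
    problems = []
    if not low <= len(examples) <= high:
        problems.append(f"{len(examples)} examples; expected {low}-{high}")
    missing = labels - {label for _, label in examples}
    if missing:
        problems.append(f"no example for: {sorted(missing)}")
    return problems


print(check_example_set(EXAMPLES, LABELS))      # []
print(check_example_set(EXAMPLES[:1], LABELS))  # count and coverage both flagged
```

A check like this only guards the mechanical properties. Whether the edge case is the right one to spend a slot on still comes from looking at what the prompt actually mislabels.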

The use-side craft of building and iterating on an example set lives in the Use book’s prompt-engineering chapter (forthcoming), alongside D4.1’s hands-on companion material.

Practice

Exercise solutions

Solution

B. The ambiguous case, an input with no due date, is exactly what an example should demonstrate: place one in the set whose output shows "due_date": null, and the model generalizes from that treatment. A is the D4.1 instinct (be explicit), and it helps, but a prose rule beside the examples is weaker than a demonstration on the case itself: the model generalizes from the example treatment, not from a separate written rule sitting next to it (the principle established in “Target the ambiguous case directly” above). C adds volume without diversity: ten clean invoices never show the missing-date case, so the model still has nothing to copy for it. D works mechanically but is a band-aid that hard-codes one symptom ("unknown"); the model may emit "none" or "" next, and you are now maintaining a translation table instead of fixing the prompt. The few-shot fix targets the root cause.

Solution

The three criteria are relevant (mirror your actual use case closely), diverse (cover edge cases and vary enough that the model doesn’t latch onto an unintended pattern), and structured (wrap each example in <example> tags, the set in <examples>, so the model separates demonstrations from instructions). Diversity is the silent corrupter: when three examples happen to share an incidental trait — every input ends in a period, every output capitalizes the first field — the model generalizes from whatever is common across the set, so it learns that trait as if it were the rule. The prompt still “looks right,” but it has quietly taught the wrong invariant, and the failure only surfaces on inputs that don’t share the accidental trait.
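One way to catch a shared incidental trait before it ships is to test a few candidate traits against every input in the set. The Python sketch below is illustrative only: `shared_traits`, the two trait checks, and the invoice strings are all hypothetical, and a real set would need traits chosen for its own domain.

```python
def shared_traits(inputs, traits):
    """Return the names of traits that hold for every input in the set."""
    return [name for name, test in traits.items() if all(test(text) for text in inputs)]


# Candidate incidental traits; extend with whatever could plausibly leak into outputs.
TRAITS = {
    "ends with a period": lambda s: s.rstrip().endswith("."),
    "starts with a capital letter": lambda s: s[:1].isupper(),
}

uniform = [
    "Invoice 12 is due May 1.",
    "Invoice 40 is due June 9.",
    "Invoice 7 is due July 3.",
]
varied = [
    "Invoice 12 is due May 1.",
    "inv 40, due 9 June",
    "Where is invoice 7? No due date was given",
]

print(shared_traits(uniform, TRAITS))  # ['ends with a period', 'starts with a capital letter']
print(shared_traits(varied, TRAITS))   # []
```

A trait that every example shares is not automatically a problem, since the intended pattern is shared too. The list is a prompt for human review of which common features are deliberate.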

Solution

With a single example, the model cannot tell which of the example’s traits are the pattern and which are incidental — quoting the first field is one concrete trait of that one sample, and with nothing to contrast it against, the model copies it as if it were required. This is the documented failure of 1-2 examples: high risk of picking up an incidental pattern instead of the intended one. Raising the count into the 3-5 range fixes it by adding contrast: across several examples that vary the incidental traits (some quote nothing, different field orders) while holding the intended pattern constant, the first-field-quoting habit no longer appears in every example, so the model stops treating it as the rule. The cure is diversity-via-count, not volume for its own sake.

Exam essentials

Few-shot is the most reliable steering mechanism for format, tone, and structure — the model extracts the implicit pattern across examples, so it disambiguates where instructions can’t.
3-5 examples is the documented sweet spot: 1-2 risks learning an incidental trait, 6+ burns context and risks contradictions; budget the 3-5 as one canonical + variants + one edge case.
Relevant, diverse, structured — mirror the real use case, vary enough to avoid spurious patterns, and wrap each in <example> (group in <examples>) so the model separates examples from instructions and input.
Target the ambiguous case — put an example on the edge case (null field, “other” bucket) showing the desired handling; the example teaches it, a prose rule beside it teaches it less reliably.
Composes with structured outputs — the schema locks shape, examples teach content and edge-case handling; complementary, not redundant.