Part 2 Chapter 14 Last verified 2026-06-13 Fresh

Tool Minimization: Subtract First

The governing default of the volume's capability axis — the smallest tool set that covers the workflow beats a complete one. Why an extra tool is paid twice (definition tokens at rest, selection errors at runtime), the three independent production reports that converge on subtract-first, the two highest-leverage heuristics (consolidate, return high-signal), and the dynamic complement — load tools on demand when scale forces it.

Volatility: architectural-pattern

Tools compared: claude-codecross-tool

On this page

Subtract-first: the smallest set that covers the workflow
The case studies converge — and how far that licenses you
Consolidate, and make the response high-signal
When you can’t subtract, defer
Measure, don’t guess
Patterns
Quick reference
Practice

This is the governing default of the volume’s capability axis. The spine framed every tool as a context cost; this chapter turns that into a discipline. The move is subtraction: start from the smallest tool set that covers the workflow and justify every addition, because each tool you add is paid twice — in definition tokens at rest and in selection errors at runtime. The principle is Anthropic’s; the unusually strong corroboration is three production teams who cut their tool sets and reported back.

Subtract-first: the smallest set that covers the workflow

Start from the counter-intuitive claim and let it organize the rest. Anthropic’s tool-design guidance states it flatly: “More tools don’t always lead to better outcomes.” [Official] Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original The failure mode is specific — “too many tools or overlapping tools can also distract agents from pursuing efficient strategies.” [Official] Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original The corrective is to build a few thoughtful tools that cover the workflow rather than wrap every API endpoint you happen to have. [Official] Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original

The reason this is a discipline and not a preference is that an extra tool charges you twice.

This is the same logic the volume’s spine drew as the capability axis, now stated as an action. A “complete” tool set — one tool per endpoint, every capability the platform offers — optimizes for coverage of possibilities. A minimal set optimizes for coverage of the workflow, which is the only thing the agent actually has to do. The two come apart fast: most of a complete set’s tools never fire on a given task, but every one of them is in the window distracting selection.

The case studies converge — and how far that licenses you

Subtract-first would be a plausible-but-unproven heuristic if it rested only on the design guidance. It does not. Three first-party engineering reports from 2025, on three different production agents, independently cut their tool sets and reported the same direction.

Convergence claude-codecross-tool

Vercel stripped its internal text-to-SQL agent down to “a single tool: execute arbitrary bash commands” and reported a “100% success rate instead of 80%.” [Practitioner] We removed 80% of our agent's tools · Andrew Qu (Vercel) (2025)T3-practitioner original GitHub trimmed Copilot’s default “40 built-in tools down to 13 core ones” and measured a 2–5 percentage-point success-rate gain across SWE-Lancer and SWE-bench Verified (plus a ~400 ms latency drop in A/B testing). [Practitioner] How we're making GitHub Copilot smarter with fewer tools · Anisha Agarwal and Connor Peet (GitHub) (2025)T3-practitioner original Block consolidated “over 30+ APIs and 200+ endpoints, with only 3 MCP tools” behind a layered tool pattern in its Square MCP Server. [Practitioner] Build MCP Tools Like Ogres... With Layers · Richard Moot (Block) (2025)T3-practitioner original Three independent teams, the same finding: fewer, sharper tools made the agent better. [Convergence]

That is genuine convergence of practice — and it is worth being precise about what kind. These are vendor self-reports, each team measuring its own agent on its own tasks, not independent third-party benchmarks. The figures are real (each is quoted from the primary write-up), but the convergence is in direction, not in a transferable effect size.

The honest reading is the strongest one here: three independent practitioners pointing the same way is better evidence than any single benchmark, and none of these numbers is a law you can quote for your own agent. Subtract-first is well-corroborated as a direction; the size of the win is yours to measure.

Consolidate, and make the response high-signal

Once count is under control, two heuristics carry most of the remaining leverage — and both are really the same context-management instinct applied to tools.

Consolidate overlapping functionality into fewer capable tools, and namespace them so the model can tell them apart. [Official] Define tools · AnthropicT1-official original Two tools that do almost the same thing do not add a capability; they add a coin-flip the model has to win on every relevant turn. Folding them into one tool with a clear name removes the ambiguity at its source.

Return high-signal information, not raw dumps. A tool’s response is as much a context-management decision as its existence: returning a 5,000-token raw payload when the agent needs three fields spends the window you just saved by cutting tool count. [Official] Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original

When you can’t subtract, defer

Some agents genuinely need a large capability surface — a platform with hundreds of legitimate operations. Subtract-first has a dynamic complement for exactly this case: keep the active set small by loading tools on demand rather than presenting all of them upfront. The Tool Search Tool does this — tools are discovered and loaded when relevant, with only the few most-used kept non-deferred in the window. [Official] Tool search tool · AnthropicT1-official original A working default is to keep roughly the 3–5 most-used tools resident and defer the rest. [Official] Tool search tool · AnthropicT1-official original

The order matters: subtract first, then defer what genuinely cannot be cut. On-demand loading is not a license to keep a bloated set and hide it behind search — a hundred half-redundant tools still produce wrong-tool selection once they are loaded. Cut to the workflow, consolidate the overlaps, and only then defer the irreducible remainder.

Measure, don’t guess

Subtract-first becomes empirical, not aesthetic, when you close the loop: build a small agentic eval over realistic tasks, read the tool-calling metrics for redundant or never-used tools, and prune by evidence. That evaluate-then-prune loop is what turns “fewer tools” from a slogan into a measured discipline — and it belongs to the volume’s evaluation material, developed in the Operations volume rather than here. This chapter establishes the principle and its heuristics; the measurement loop that decides which tools to cut is the eval discipline’s job.

Three independent production reports, one direction. Vercel (many tools → 1), GitHub Copilot (40 → 13), and Block (30+ APIs → 3 MCP tools) each cut their tool set; Vercel and GitHub measured success-rate gains, Block reports a consolidation count. The convergence is in direction, not in a transferable number.

Pruning a bloated integrations agent Worked example

An agent wraps a SaaS platform with 22 tools — one per API endpoint — and keeps calling list_records when it should call search_records, then truncating the result by hand. Where is the fix?

Walk subtract-first:

Count. Most of the 22 tools never fire on the real workflow (create-ticket, update-status, search). Cut to the handful the workflow uses — that alone removes most wrong-tool surface.
Overlap. list_records and search_records overlap; the model flips between them. Consolidate into one find_records(query) and the coin-flip disappears. Define tools · AnthropicT1-official original
Response. find_records should return scoped fields, not the raw record dump the agent was hand-truncating. Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original
Only then defer. If the platform genuinely needs all 22 for some workflows, keep the 3–5 most-used resident and load the rest on demand. Tool search tool · AnthropicT1-official original

The wrong move is to add a 23rd tool — a “smart record fetcher” — to paper over the selection error. That is addition where subtraction is the fix.

Patterns

Subtract to the workflow. Sketch: start from the smallest tool set that covers the actual workflow, justify every addition. When to use: always — it is the default. Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original Mechanics: list the workflow’s real operations; expose those, not every endpoint. Remember: a tool is paid twice — definition tokens at rest, selection errors at runtime.

Consolidate overlaps. Sketch: fold near-duplicate tools into one capable, namespaced tool. When to use: whenever two tools could plausibly answer the same call. Define tools · AnthropicT1-official original Mechanics: merge list_*/search_*-style pairs; give the survivor a clear name. Remember: overlap is a runtime coin-flip, not an added capability.

High-signal responses. Sketch: return scoped, relevant fields — not raw dumps. When to use: any tool whose natural output is large. Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original Mechanics: project to the fields the agent needs; paginate or summarize the rest. Remember: the response is a context-management decision as much as the tool’s existence.

Defer the irreducible remainder. Sketch: when many tools are genuinely required, load on demand and keep ~3–5 resident. When to use: large, legitimate capability surfaces you cannot cut. Tool search tool · AnthropicT1-official original Mechanics: Tool Search Tool; mark the most-used non-deferred. Remember: defer after subtracting — search does not fix a bloated set, it only hides it.

Quick reference

Default: the smallest tool set that covers the workflow beats a “complete” one. Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original
Why: every tool is paid twice — definition tokens at rest + selection errors at runtime.
Evidence: Vercel (→1 tool, 80→100%), GitHub Copilot (40→13, +2–5 pts), Block (30+ APIs → 3 tools, consolidation count) — three independent vendor self-reports converging on subtract-first; direction, not a transferable number.
Heuristics: consolidate overlapping tools (+ namespace); return high-signal responses, not raw dumps.
Scale escape hatch: load tools on demand (keep ~3–5 resident) — but only after subtracting.
Honesty: unanchored aggregate token/eval figures are deliberately omitted; measure your own win.

Practice

Exercise solutions

Solution ↑ Exercise

The two costs are definition tokens at rest (the tool’s schema occupies the context window on every turn, spent regardless of use) and selection errors at runtime (overlapping or near-duplicate tools raise the chance the model picks the wrong one or takes a longer path). The first cost — definition tokens — is incurred even on turns where the tool is never called, because the schema is in the window either way. For a fixed workflow, a complete set therefore pays for capabilities the workflow never exercises while also enlarging the selection surface, so it is strictly worse than a minimal set that covers the same workflow: more cost, more error opportunity, no added coverage of what the agent must actually do.

Solution ↑ Exercise

The convergence licenses a directional conclusion: three independent teams, on three different production agents, each cut their tool set and reported the same outcome — fewer, sharper tools made the agent better (or, for Block, materially simpler to maintain). That is stronger evidence for the direction of subtract-first than any single benchmark, precisely because the reports are independent. It does not license quoting “80→100%” or “+2–5 points” as an effect size you will see: each is a vendor self-report on its own agent and tasks, Block’s is a consolidation count with no measured quality delta, and none is an independent benchmark. The honest stance is “subtract-first is well-corroborated as a direction; the size of the win on my agent is something I have to measure” — which is exactly why the evaluate-then-prune loop exists.