Tool Minimization: Subtract First
The governing default of the volume's capability axis — the smallest tool set that covers the workflow beats a complete one. Why an extra tool is paid twice (definition tokens at rest, selection errors at runtime), the three independent production reports that converge on subtract-first, the two highest-leverage heuristics (consolidate, return high-signal), and the dynamic complement — load tools on demand when scale forces it.
On this page
This is the governing default of the volume’s capability axis. The spine framed every tool as a context cost; this chapter turns that into a discipline. The move is subtraction: start from the smallest tool set that covers the workflow and justify every addition, because each tool you add is paid twice — in definition tokens at rest and in selection errors at runtime. The principle is Anthropic’s; the unusually strong corroboration is three production teams who cut their tool sets and reported back.
Subtract-first: the smallest set that covers the workflow
Start from the counter-intuitive claim and let it organize the rest. Anthropic’s tool-design guidance states it flatly: “More tools don’t always lead to better outcomes.” [Official] Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original The failure mode is specific — “too many tools or overlapping tools can also distract agents from pursuing efficient strategies.” [Official] Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original The corrective is to build a few thoughtful tools that cover the workflow rather than wrap every API endpoint you happen to have. [Official] Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original
The reason this is a discipline and not a preference is that an extra tool charges you twice.
This is the same logic the volume’s spine drew as the capability axis, now stated as an action. A “complete” tool set — one tool per endpoint, every capability the platform offers — optimizes for coverage of possibilities. A minimal set optimizes for coverage of the workflow, which is the only thing the agent actually has to do. The two come apart fast: most of a complete set’s tools never fire on a given task, but every one of them is in the window distracting selection.
The case studies converge — and how far that licenses you
Subtract-first would be a plausible-but-unproven heuristic if it rested only on the design guidance. It does not. Three first-party engineering reports from 2025, on three different production agents, independently cut their tool sets and reported the same direction.
That is genuine convergence of practice — and it is worth being precise about what kind. These are vendor self-reports, each team measuring its own agent on its own tasks, not independent third-party benchmarks. The figures are real (each is quoted from the primary write-up), but the convergence is in direction, not in a transferable effect size.
The honest reading is the strongest one here: three independent practitioners pointing the same way is better evidence than any single benchmark, and none of these numbers is a law you can quote for your own agent. Subtract-first is well-corroborated as a direction; the size of the win is yours to measure.
Consolidate, and make the response high-signal
Once count is under control, two heuristics carry most of the remaining leverage — and both are really the same context-management instinct applied to tools.
Consolidate overlapping functionality into fewer capable tools, and namespace them so the model can tell them apart. [Official] Define tools · AnthropicT1-official original Two tools that do almost the same thing do not add a capability; they add a coin-flip the model has to win on every relevant turn. Folding them into one tool with a clear name removes the ambiguity at its source.
Return high-signal information, not raw dumps. A tool’s response is as much a context-management decision as its existence: returning a 5,000-token raw payload when the agent needs three fields spends the window you just saved by cutting tool count. [Official] Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original
When you can’t subtract, defer
Some agents genuinely need a large capability surface — a platform with hundreds of legitimate operations. Subtract-first has a dynamic complement for exactly this case: keep the active set small by loading tools on demand rather than presenting all of them upfront. The Tool Search Tool does this — tools are discovered and loaded when relevant, with only the few most-used kept non-deferred in the window. [Official] Tool search tool · AnthropicT1-official original A working default is to keep roughly the 3–5 most-used tools resident and defer the rest. [Official] Tool search tool · AnthropicT1-official original
The order matters: subtract first, then defer what genuinely cannot be cut. On-demand loading is not a license to keep a bloated set and hide it behind search — a hundred half-redundant tools still produce wrong-tool selection once they are loaded. Cut to the workflow, consolidate the overlaps, and only then defer the irreducible remainder.
Measure, don’t guess
Subtract-first becomes empirical, not aesthetic, when you close the loop: build a small agentic eval over realistic tasks, read the tool-calling metrics for redundant or never-used tools, and prune by evidence. That evaluate-then-prune loop is what turns “fewer tools” from a slogan into a measured discipline — and it belongs to the volume’s evaluation material, developed in the Operations volume rather than here. This chapter establishes the principle and its heuristics; the measurement loop that decides which tools to cut is the eval discipline’s job.
Patterns
Subtract to the workflow. Sketch: start from the smallest tool set that covers the actual workflow, justify every addition. When to use: always — it is the default. Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original Mechanics: list the workflow’s real operations; expose those, not every endpoint. Remember: a tool is paid twice — definition tokens at rest, selection errors at runtime.
Consolidate overlaps. Sketch: fold near-duplicate tools into one capable, namespaced tool. When to use: whenever two tools could plausibly answer the same call. Define tools · AnthropicT1-official original Mechanics: merge list_*/search_*-style pairs; give the survivor a clear name. Remember: overlap is a runtime coin-flip, not an added capability.
High-signal responses. Sketch: return scoped, relevant fields — not raw dumps. When to use: any tool whose natural output is large. Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original Mechanics: project to the fields the agent needs; paginate or summarize the rest. Remember: the response is a context-management decision as much as the tool’s existence.
Defer the irreducible remainder. Sketch: when many tools are genuinely required, load on demand and keep ~3–5 resident. When to use: large, legitimate capability surfaces you cannot cut. Tool search tool · AnthropicT1-official original Mechanics: Tool Search Tool; mark the most-used non-deferred. Remember: defer after subtracting — search does not fix a bloated set, it only hides it.
Quick reference
- Default: the smallest tool set that covers the workflow beats a “complete” one. Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025)T1-official original
- Why: every tool is paid twice — definition tokens at rest + selection errors at runtime.
- Evidence: Vercel (→1 tool, 80→100%), GitHub Copilot (40→13, +2–5 pts), Block (30+ APIs → 3 tools, consolidation count) — three independent vendor self-reports converging on subtract-first; direction, not a transferable number.
- Heuristics: consolidate overlapping tools (+ namespace); return high-signal responses, not raw dumps.
- Scale escape hatch: load tools on demand (keep ~3–5 resident) — but only after subtracting.
- Honesty: unanchored aggregate token/eval figures are deliberately omitted; measure your own win.
Practice
Exercise solutions
The two costs are definition tokens at rest (the tool’s schema occupies the context window on every turn, spent regardless of use) and selection errors at runtime (overlapping or near-duplicate tools raise the chance the model picks the wrong one or takes a longer path). The first cost — definition tokens — is incurred even on turns where the tool is never called, because the schema is in the window either way. For a fixed workflow, a complete set therefore pays for capabilities the workflow never exercises while also enlarging the selection surface, so it is strictly worse than a minimal set that covers the same workflow: more cost, more error opportunity, no added coverage of what the agent must actually do.
The convergence licenses a directional conclusion: three independent teams, on three different production agents, each cut their tool set and reported the same outcome — fewer, sharper tools made the agent better (or, for Block, materially simpler to maintain). That is stronger evidence for the direction of subtract-first than any single benchmark, precisely because the reports are independent. It does not license quoting “80→100%” or “+2–5 points” as an effect size you will see: each is a vendor self-report on its own agent and tasks, Block’s is a consolidation count with no measured quality delta, and none is an independent benchmark. The honest stance is “subtract-first is well-corroborated as a direction; the size of the win on my agent is something I have to measure” — which is exactly why the evaluate-then-prune loop exists.