Part 3 Chapter 27 Last verified 2026-06-14 Fresh

Security: The Adversarial-Input Layer

The adversarial-input layer — who is really issuing the instruction. Prompt injection and Willison's lethal trifecta as the necessary-conditions threat model; the incidents (EchoLeak, Comet, ShadowPrompt) as one attack shape; why detection-only fails by construction and design-by-construction is this volume's one genuine convergence; the honest residual that defenses reduce, not eliminate; and a supply chain whose trust the registry delegates to you. The authorized-but-forged counterpart to Vol 1's authorized-but-risky guardrails.

Volatility: architectural-pattern

Tools compared: claude-codecross-tool

On this page

The authorized-but-forged problem
The lethal trifecta
The incidents are one attack shape
Why detection-only fails by construction
Design-by-construction is the backbone
Defenses reduce, not eliminate
The supply chain delegates trust to you
Quick reference
Practice

Before you start: Vol 1's guardrails — the permission model that decides which actions an agent may attempt (Always / Ask / Never, the deny-list). This chapter is its counterpart: not what the agent is authorized to do, but whether an attacker can forge the instruction it follows.

You will learn

Why prompt injection is a structural problem, not a bug to be patched — the agent cannot reliably tell its principal’s instructions from untrusted content
The lethal trifecta as the necessary-conditions threat model, and why every robust defense cuts one of its three legs
Why the real incidents (EchoLeak, Comet, ShadowPrompt) are one attack shape, not three
Why detection-only fails by construction, and why design-by-construction is the one place this volume finds genuine convergence
The honest residual: defenses reduce, not eliminate — and a supply chain whose trust the registry hands to you

Vol 1’s guardrails answered “what may this agent attempt?” — the authorized-but-risky question, governed by the permission model. This chapter answers a different one: “who is really issuing the instruction the agent just followed?” That is the authorized-but-forged question. When an agent reads a web page, an email, or a tool result, it ingests text an attacker may control — and the thesis of this chapter is that the agent cannot, by construction, reliably tell that text apart from its operator’s commands. Security here is the discipline of defending a system that trusts its inputs more than it should.

The authorized-but-forged problem

Start with the definition. A prompt-injection vulnerability “occurs when user prompts alter the LLM’s behavior or output in unintended ways.” [Practitioner] LLM01:2025 Prompt Injection · OWASP Gen AI Security ProjectT3-practitioner original The community standard splits it in two: a direct injection is supplied by the user, while an indirect injection “occur[s] when an LLM accepts input from external sources, such as websites or files.” [Practitioner] LLM01:2025 Prompt Injection · OWASP Gen AI Security ProjectT3-practitioner original The indirect case is this chapter’s threat model, because it is the one the operator did not type: the dangerous content arrives through a web page the agent browses, a document it retrieves, or a tool result it reads back.

The reason this is not “just a bug to be patched” is structural. The model receives one undifferentiated stream of tokens; the operator’s instructions and the ingested content are the same kind of thing to it. There is no reliable, built-in channel that says this half is my principal and that half is data. Patching one injection string does not change that — the next phrasing slips through. This is why the rest of the chapter is about cutting the attack’s preconditions by design rather than spotting its signature after the fact.

The lethal trifecta

The sharpest statement of those conditions is Simon Willison’s lethal trifecta. [Practitioner] The lethal trifecta for AI agents: private data, untrusted content, and external communication · Simon Willison (2025)T3-practitioner original An agent becomes exfiltratable when it simultaneously has three capabilities: access to private data, exposure to untrusted content — “any mechanism by which text (or images) controlled by a malicious attacker could become available to your LLM” [Practitioner] The lethal trifecta for AI agents: private data, untrusted content, and external communication · Simon Willison (2025)T3-practitioner original — and a path for external communication. With all three present, an attacker can trick the agent “into accessing your private data and sending it to that attacker.” [Practitioner] The lethal trifecta for AI agents: private data, untrusted content, and external communication · Simon Willison (2025)T3-practitioner original

The framing is load-bearing for the whole chapter because it is a necessary-conditions model: the catastrophe needs all three legs at once. That has a direct design consequence — the cleanest defenses work by removing one leg: deny the private data, isolate the untrusted content so it cannot become instruction, or block the exfiltration path (and where no leg can be removed outright, hardening the model against the combination is the weaker fallback the next section covers). It also makes the incident landscape legible, because every real case below is the same three legs in a different costume.

The lethal trifecta (Willison, a practitioner coinage). Three legs — private-data access, exposure to untrusted content, and an external-communication path — converge on catastrophic exfiltration only when all three are present. Each robust defense cuts one leg: deny the data, isolate the untrusted content, or block the exfiltration channel. Any single leg removed defuses the combination.

The incidents are one attack shape

Indirect injection is not hypothetical, and the public incidents are best read as one attack instantiated three ways. EchoLeak is the keystone: its authoritative record describes an “Ai command injection in M365 Copilot [that] allows an unauthorized attacker to disclose information over a network,” CVE-2025-32711 · NVD (NIST National Vulnerability Database)T1-official original and the disclosing researchers reported that the chains “automatically exfiltrate sensitive and proprietary information from M365 Copilot context, without the user’s awareness or relying on any specific victim behavior” [Practitioner] Breaking down 'EchoLeak', the First Zero-Click AI Vulnerability Enabling Data Exfiltration from Microsoft 365 Copilot · Itay Ravia (Cato Networks / Aim Labs)T3-practitioner original — a zero-click realization of all three legs. Comet, an agentic browser, on a “summarize this page” action fed page content to its model “without distinguishing between the user’s instructions and untrusted content from the webpage,” [Practitioner] Agentic Browser Security: Indirect Prompt Injection in Perplexity Comet · Artem Chaikin and Shivan Kaul Sahib (Brave)T3-practitioner original the instruction/data confusion made concrete. ShadowPrompt disclosed “a vulnerability that allowed any website to silently inject prompts into [Claude’s Chrome extension] as if the user wrote them.” [Practitioner] ShadowPrompt: How Any Website Could Have Hijacked Claude's Chrome Extension · Oren Yomtov (Koi Research) (2026)T3-practitioner original And the generic exfiltration leg is old: the markdown-image channel, where “the individual controlling the data a plugin retrieves can exfiltrate chat history due to ChatGPT’s rendering of markdown images,” [Practitioner] ChatGPT Plugins: Data Exfiltration via Images and Cross Plugin Request Forgery · Johann Rehberger (wunderwuzzi) (2023)T3-practitioner original shows the “external communication” leg needs nothing more exotic than an auto-rendered image URL.

Read together they are not a zoo of exotic exploits; they are the trifecta, again and again — private context plus attacker-controllable content plus a way out.

Why detection-only fails by construction

The tempting response is to add a classifier that flags malicious input. The literature is blunt that this is the wrong primary control. A formal analysis of known-answer detection “uncover[s] a structural vulnerability that invalidates its core security premise,” How Not to Detect Prompt Injections with an LLM · Choudhary, Anshumaan, Palumbo, Jha (2025)T3-practitioner original and the authors build an adaptive attack, DataFlip, that “consistently evades KAD defenses.” How Not to Detect Prompt Injections with an LLM · Choudhary, Anshumaan, Palumbo, Jha (2025)T3-practitioner original Independent empirical work against deployed commercial detectors reaches, “in some instances[,] up to 100% evasion success.” Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems · Hackett, Birch, Trawicki, Suri, GarraghanT3-practitioner original The lesson is not that detectors are merely imperfect in practice — it is that they are evadable by construction: a classifier that can be probed can be defeated, because the attacker optimizes against exactly the signal the classifier reads.

Design-by-construction is the backbone

If you cannot reliably spot the attack, you must build the defense in by construction — so that untrusted content cannot reliably act as instruction in the first place. This is the one place in this volume where multiple independent research groups converge on the same principle, so it is the one place the book tags genuine convergence.

Convergence

Multiple independent research groups — two academic, two from a model vendor (Meta), but mutually independent — agree that prompt injection is defended by construction, not detection: CaMeL separates control flow from data flow so “the untrusted data retrieved by the LLM can never impact the program flow” Defeating Prompt Injections by Design · Debenedetti, Shumailov, Fan, Hayes, Carlini, et al.T3-practitioner original and secures the agent “even when underlying models are susceptible to attacks” Defeating Prompt Injections by Design · Debenedetti, Shumailov, Fan, Hayes, Carlini, et al.T3-practitioner original ; Beurer-Kellner et al. propose “principled design patterns for building AI agents with provable resistance to prompt injection” Design Patterns for Securing LLM Agents against Prompt Injections · Beurer-Kellner, Buesser, Creţu, Debenedetti, et al.T3-practitioner original ; and Meta SecAlign pushes the same idea to the model itself, “the first fully open-source LLM with built-in model-level defense.” Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks · Chen, Zharmagambetov, Wagner, Guo (Meta)T3-practitioner original Three independent groups, three layers — system, pattern, model — one principle: build the defense in (cut a trifecta leg architecturally, or harden the model itself) rather than classify the input after the fact.

The actionable form is a single rule, and it is genuine convergence, not one vendor’s house style: [Convergence] Defeating Prompt Injections by Design · Debenedetti, Shumailov, Fan, Hayes, Carlini, et al.T3-practitioner original Design Patterns for Securing LLM Agents against Prompt Injections · Beurer-Kellner, Buesser, Creţu, Debenedetti, et al.T3-practitioner original Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks · Chen, Zharmagambetov, Wagner, Guo (Meta)T3-practitioner original do not buy a prompt-injection “detector” as your primary control — defend by construction, whether by cutting a trifecta leg architecturally or hardening the model itself. Runtime monitors still have a place, but as one layer among several: LlamaFirewall is described by its authors as “a final layer of defense,” LlamaFirewall: An open source guardrail system for building secure AI agents · Chennabasappa, Nikolaidis, Song, et al. (Meta)T3-practitioner original not a solution. The honest counterweight even from the design side is that the patterns “discuss their trade-offs in terms of utility and security” Design Patterns for Securing LLM Agents against Prompt Injections · Beurer-Kellner, Buesser, Creţu, Debenedetti, et al.T3-practitioner original — cutting a leg constrains what the agent can do, so security here is bought with capability, not for free.

Defenses reduce, not eliminate

No control in this chapter takes the risk to zero, and the most honest reading is that today’s safety margin is partly accidental. Anthropic’s own browser red-team is the cleanest illustration: “Browser use without our safety mitigations showed a 23.6% attack success rate when deliberately targeted by malicious actors,” [Official] Piloting Claude in Chrome · Anthropic (2025)T1-official original and with mitigations “we reduced the attack success rate of 23.6% to 11.2%.” [Official] Piloting Claude in Chrome · Anthropic (2025)T1-official original Those are first-party, self-reported figures, and the load-bearing fact is the residual: 11.2% is a large reduction, but it is not zero. A benchmark of web-agent security states the point even more starkly — “attacks partially succeed in up to 86% of the case[s], even [as] state-of-the-art agents often struggle to fully complete the attacker goals,” which the authors name “security by incompetence.” WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks · Evtimov, Zharmagambetov, Grattafiori, Guo, ChaudhuriT3-practitioner original In other words, part of why agents are not catastrophically exploited today is that they are not yet good enough at completing an attacker’s goal — a margin that erodes as agents improve. Treat prompt-injection defense as risk reduction and defense-in-depth, never as solved.

The supply chain delegates trust to you

The last surface is the components an agent depends on. OWASP frames the integrity risk across “training data, models, and deployment platforms,” [Practitioner] LLM03:2025 Supply Chain · OWASP Gen AI Security ProjectT3-practitioner original and the modern instance is third-party MCP servers (with third-party skills an adjacent surface this chapter does not quantify). The load-bearing fact is first-party and candid: Anthropic reviews connectors against listing criteria “but does not security-audit or manage any MCP server,” [Official] Security · AnthropicT1-official original so the recommended posture is to write “your own MCP servers or [use] MCP servers from providers that you trust.” [Official] Security · AnthropicT1-official original Academic work corroborates the gap: malicious MCP servers are “easy to implement, difficult to detect with current tools, and capable of causing concrete damage.” When MCP Servers Attack: Taxonomy, Feasibility, and Mitigation · Zhao, Liu, Ruan, Li, Liang (2025)T3-practitioner original The conclusion is sharp — listing in a registry is not vetting. Install-time trust is the operator’s responsibility, and allowlisting by provenance is the control, not the marketplace.

Quick reference

Threat model: prompt injection is structural — the agent cannot reliably separate its principal’s instructions from ingested content; LLM01:2025 Prompt Injection · OWASP Gen AI Security ProjectT3-practitioner original the indirect (external-content) case is the danger.
The hinge — the lethal trifecta: private data + untrusted content + an exfiltration path; all three are needed, so cut one leg. The lethal trifecta for AI agents: private data, untrusted content, and external communication · Simon Willison (2025)T3-practitioner original
One attack shape: EchoLeak, CVE-2025-32711 · NVD (NIST National Vulnerability Database)T1-official original Comet, Agentic Browser Security: Indirect Prompt Injection in Perplexity Comet · Artem Chaikin and Shivan Kaul Sahib (Brave)T3-practitioner original ShadowPrompt, ShadowPrompt: How Any Website Could Have Hijacked Claude's Chrome Extension · Oren Yomtov (Koi Research) (2026)T3-practitioner original and the markdown-image channel ChatGPT Plugins: Data Exfiltration via Images and Cross Plugin Request Forgery · Johann Rehberger (wunderwuzzi) (2023)T3-practitioner original are the same three legs.
Detection fails by construction: known-answer detection is structurally evadable; How Not to Detect Prompt Injections with an LLM · Choudhary, Anshumaan, Palumbo, Jha (2025)T3-practitioner original deployed detectors hit “up to 100%” evasion in some instances. Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems · Hackett, Birch, Trawicki, Suri, GarraghanT3-practitioner original
The one convergence — design-by-construction: cut a leg architecturally (CaMeL / design patterns / model-level) Defeating Prompt Injections by Design · Debenedetti, Shumailov, Fan, Hayes, Carlini, et al.T3-practitioner original Design Patterns for Securing LLM Agents against Prompt Injections · Beurer-Kellner, Buesser, Creţu, Debenedetti, et al.T3-practitioner original Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks · Chen, Zharmagambetov, Wagner, Guo (Meta)T3-practitioner original — don’t buy a detector as your primary control. (The convergence tag sits on this claim in the body.)
Reduce, not eliminate: Anthropic’s browser ASR fell 23.6% → 11.2%, not to zero; Piloting Claude in Chrome · Anthropic (2025)T1-official original WASP’s “security by incompetence” is a margin that erodes as agents improve. WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks · Evtimov, Zharmagambetov, Grattafiori, Guo, ChaudhuriT3-practitioner original
Supply chain: the registry “does not security-audit … any MCP server” Security · AnthropicT1-official original — listing is not vetting; allowlist by provenance.

Practice

Exercise solutions

Solution ↑ Exercise

The three legs are access to private data, exposure to untrusted (attacker-controllable) content, and a path for external communication. It is a necessary-conditions model because the catastrophe — an attacker reading your secrets and shipping them out — requires all three at once: private data with no untrusted content has nothing to hijack it; untrusted content with no private-data access has nothing to steal; either of those with no exfiltration path cannot get the data out. That is exactly why the defensive move is to remove one leg rather than to harden all three. Mapping EchoLeak: the private data is the M365 Copilot context; the untrusted content is the attacker’s email/content that Copilot ingests; the exfiltration path is the network disclosure the CVE describes (“disclose information over a network”). The same trio appears in Comet (page content as untrusted input, browser tools as the exfiltration path) and ShadowPrompt (any website injecting prompts “as if the user wrote them,” with the extension’s reach as the data/exfiltration surface).

Solution ↑ Exercise

“Fails in practice” would mean a detector that is merely imperfect — catches most attacks, misses some, and improves with more training. “Fails by construction” is stronger: the design of a detection-only defense contains the vulnerability, independent of how good the classifier is. Known-answer detection has a “structural vulnerability that invalidates its core security premise,” and an adaptive attacker who can probe the classifier optimizes directly against the signal it reads — DataFlip “consistently evades” it, and deployed detectors have been evaded “up to 100%” in some instances. Because the failure is structural, throwing a better classifier at it does not close the hole; you must instead make untrusted content structurally unable to act as instruction (cut a trifecta leg) — design-by-construction. A detector is still defensible as one layer of defense-in-depth — a runtime monitor that raises the attacker’s cost and catches the unsophisticated cases — exactly as LlamaFirewall positions itself as “a final layer of defense.” What is indefensible is making that evadable layer your primary control.

Solution ↑ Exercise

A worked example. Take a research agent that browses the web and can post to an internal Slack. Trifecta audit: (1) private data — yes, it has the team’s internal context and Slack history; (2) untrusted content — yes, it reads arbitrary web pages; (3) external communication — yes, web fetches and Slack posts can both carry data out. All three legs are present, so it is exfiltratable. The most practical leg to cut by design is usually (2)→instruction: run the browsing in a mode where retrieved page text is structurally treated as data, never as instruction (e.g., a control/data-flow separation so fetched content cannot trigger tool calls), which is the CaMeL-style move. The capability cost is real and must be named: the agent can no longer act on instructions it finds on a page — including legitimate ones like “see the linked doc for the full spec” — so some autonomy is lost. If your harness cannot enforce that separation, the honest fallbacks are to cut leg (1) (scope the agent’s data access so a successful injection steals little) or leg (3) (remove the outbound channel — no Slack post, no arbitrary fetch — so exfiltration has nowhere to go). If you genuinely cannot cut any leg, the correct output of this exercise is to say so plainly: the agent is exposed, and the residual risk must be accepted, escalated to a human gate (ch26), or the deployment reconsidered — not papered over with a detector.