Context Assembly: Engineering the Window
The engineering response to context rot — the harness owns the boundary deciding what enters the window and when. Cache stability, just-in-time loading, compaction, attention placement, and assembly-as-prompt.
The previous chapter established the problem: long context does not degrade gracefully. This chapter is the engineering response. The harness owns the boundary deciding what enters the window and when — and “context assembly” is the discipline of choosing, ordering, caching, and compacting the bytes that go to the model each turn. This is the deepest chapter in the book; it carries the richest pattern catalog.
The window is assembled from regions
Each turn, the harness assembles a window from regions that differ in stability and role.
Cache stability — the prefix is an economic variable
A practitioner building Manus calls the KV-cache hit rate the “single most important metric for a production-stage AI agent” [Practitioner: Context Engineering for AI Agents: Lessons from Building Manus · Yichao Ji (Manus) (2025) · T3-practitioner] because it drives both latency and cost. The lever is a stable prefix: no mid-history edits, deterministic serialization, and an append-only history. Anthropic’s prompt cache has, “by default, a 5-minute lifetime” [Official: Prompt caching · Anthropic · T1-official], which is the interval within which a stable prefix pays off.
A controlled multi-provider evaluation quantifies the payoff: caching “reduces API costs by 41-80% and improves time to first token by 13-31% across providers” [Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks · Lumer et al. (2026) · T3-practitioner]. Placement matters too: “placing dynamic content at the end of the system prompt” beats “naive full-context caching, which can paradoxically increase latency” [Lumer et al. (2026)].
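The stable-prefix discipline can be sketched in a few lines. This is a minimal illustration, not any provider's API: `serialize_tools`, `build_prompt`, and `prefix_fingerprint` are hypothetical helper names, and the fingerprint is only a local check that the prefix stayed byte-identical between turns.

```python
import hashlib
import json


def serialize_tools(tools: list[dict]) -> str:
    """Serialize tool definitions deterministically (sorted order, sorted keys)."""
    ordered = sorted(tools, key=lambda t: t["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


def build_prompt(system_rules: str, tools: list[dict], volatile: dict) -> tuple[str, str]:
    """Return (stable_prefix, volatile_tail); only the tail may change per turn."""
    prefix = system_rules + "\n" + serialize_tools(tools)
    # Timestamps and other per-turn values go in the tail, never in the prefix.
    tail = json.dumps(volatile, sort_keys=True)
    return prefix, tail


def prefix_fingerprint(prefix: str) -> str:
    """Hash the prefix so a harness can detect accidental perturbation."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()
```

A harness could log the fingerprint every turn and alert when it changes, since a changed prefix means the cached front can no longer be reused.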
Loading — just-in-time vs preload
What enters the window during a turn is a choice, not a default. The just-in-time pattern keeps “lightweight identifiers (file paths, stored queries, web links, etc.) and use[s] these references to dynamically load data into context at runtime using tools” [Official: Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025) · T1-official]. Even tool definitions can be deferred: presenting tools as code lets models “read tool definitions on-demand, rather than reading them all up-front” [Code execution with MCP: Building more efficient agents · Jones & Kelly (Anthropic) (2025) · T1-official]. The framing generalizes: context operations are write / select / compress / isolate, where writing context means “saving it outside the context window” [Practitioner: Context Engineering · The LangChain Team (2025) · T2-release-notes].
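As a rough sketch of the just-in-time pattern, the window can hold a small reference while a tool resolves only the slice that is needed. `FileRef` and `read_slice` are hypothetical names chosen for illustration; a real harness would expose the resolver through its own tool interface.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FileRef:
    """A lightweight pointer kept in the window instead of the file body."""
    path: str

    def describe(self) -> str:
        # This short string is all that stays resident in the window.
        return f"file:{self.path}"


def read_slice(ref: FileRef, start: int, end: int) -> str:
    """Hypothetical tool: return only lines start..end (1-indexed, inclusive)."""
    lines = Path(ref.path).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[start - 1:end])
```

The design choice is that the cost of a reference is constant, while the cost of the loaded slice is paid only on the turns that actually need it.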
Compaction — the window is finite
The premise: the window “must be treated as a finite resource with diminishing marginal returns” [Official: Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025) · T1-official]. When it fills, you either summarize in place (/compact) or checkpoint and restart: the harness pattern of “finding a way for agents to quickly understand the state of work when starting with a fresh context window” [Effective harnesses for long-running agents · Justin Young (2025) · T1-official], i.e., a progress file plus checkpoints rather than lossy in-context summarization.
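Checkpoint-and-restart can be sketched as a progress file written before the window fills and read back into a fresh window. The JSON layout and the names `save_checkpoint` and `restart_context` are assumptions made for this example, not a format prescribed by the cited sources.

```python
import json
from pathlib import Path


def save_checkpoint(path: Path, done: list[str], next_step: str, notes: str) -> None:
    """Persist the state of work outside the window (hypothetical file layout)."""
    state = {"done": done, "next_step": next_step, "notes": notes}
    path.write_text(json.dumps(state, indent=2, sort_keys=True), encoding="utf-8")


def restart_context(path: Path, stable_prefix: str) -> str:
    """Build a fresh window: the unchanged prefix plus a short state summary."""
    state = json.loads(path.read_text(encoding="utf-8"))
    summary = "\n".join([
        "Completed: " + ", ".join(state["done"]),
        "Next step: " + state["next_step"],
        "Notes: " + state["notes"],
    ])
    return stable_prefix + "\n\n" + summary
```

Because the restart reuses the same prefix and adds only a few lines of state, the new window starts small instead of carrying a long summarized history.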
Attention placement — position is not neutral
Where content lands changes whether the model uses it. Recall “is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades… in the middle” [Lost in the Middle: How Language Models Use Long Contexts · Liu et al. (TACL) (2023) · T3-practitioner].
There is also an instruction budget: “performance consistently degrades as the number of instructions increases” [When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following · Harada et al. (EMNLP) (2025) · T3-practitioner], and even the best models “only achieve 68% accuracy at the max density of 500 instructions,” with “a bias towards earlier instructions” [How Many Instructions Can LLMs Follow at Once? · Jaroslawicz et al. (2025) · T3-practitioner]. The practitioner countermeasure to goal drift exploits the end-of-context peak through recitation: “reciting its objectives into the end of the context” by maintaining a todo.md [Practitioner: Context Engineering for AI Agents: Lessons from Building Manus · Yichao Ji (Manus) (2025) · T3-practitioner].
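Recitation can be illustrated with a small assembly function that keeps the prefix at the front edge and re-emits the objectives at the tail. This is a simplified sketch: `assemble_turn` is a hypothetical name, and the checkbox rendering of todo.md is an assumption about format.

```python
def assemble_turn(prefix: str, history: list[str], todo: list[tuple[str, bool]]) -> str:
    """Place the prefix first and recite the current objectives last."""
    recital_lines = ["Current objectives (todo.md):"]
    for item, done in todo:
        mark = "x" if done else " "
        recital_lines.append(f"- [{mark}] {item}")
    # History sits in the middle; the two edges carry the load-bearing content.
    return "\n".join([prefix, *history, *recital_lines])
```

For example, calling `assemble_turn("SYSTEM", ["user: fix bug"], [("write fix", False)])` yields a window whose final line is the open objective, which is where recall is reported to be strongest.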
Assembly is a prompt
The window’s contents are prose to engineer, not plumbing. Tool descriptions sit in the cache-sensitive pre-commitment region, so author them as you would “describe your tool to a new hire on your team” [Official: Writing effective tools for agents — with agents · Aizawa (Anthropic) (2025) · T1-official]. Structure disambiguates: XML tags “help Claude parse complex prompts unambiguously, especially when your prompt mixes instructions, context, examples, and variable inputs” [Prompting best practices: use XML tags · Anthropic · T1-official]. Instruction-following gains also “stem largely from parameter updates in attention modules” [A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in LLMs · Ye et al. (2025) · T3-practitioner], a hint at why structured, constraint-bearing assembly is tractable.
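A minimal sketch of delimiter-based assembly follows. The tag names and the helpers `tag` and `assemble` are illustrative choices, and escaping the body text is an assumption of this example (so that supplied text cannot close a tag early), not something the cited guidance requires.

```python
from xml.sax.saxutils import escape


def tag(name: str, body: str) -> str:
    """Wrap a body in a named tag, escaping &, < and > inside the body."""
    return f"<{name}>\n{escape(body)}\n</{name}>"


def assemble(instructions: str, context: str, examples: list[str], user_input: str) -> str:
    """Mark each part of a mixed prompt with its own tag."""
    example_block = "\n".join(tag("example", e) for e in examples)
    return "\n".join([
        tag("instructions", instructions),
        tag("context", context),
        f"<examples>\n{example_block}\n</examples>",
        tag("input", user_input),
    ])
```

Escaping trades fidelity for safety: context that legitimately contains angle brackets (source code, for instance) arrives in escaped form, so a harness might prefer a different delimiter strategy for such content.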
Diagnostic: which assembly failure are you hitting?
Mirroring the previous chapter’s router, map an observed symptom to the assembly lever that addresses it — each routes to a pattern below:
- Symptom: latency and cost climb every turn, even on small tasks. → Cache instability — something perturbs the prefix. Reach for stable prefix + dynamic content at the tail.
- Symptom: quality decays as a long session or file grows. → Unbounded loading — too much is resident at once. Reach for just-in-time loading and compact-or-checkpoint.
- Symptom: the agent ignores an instruction you know is loaded. → Misplacement — it is sagging mid-window. Reach for place at the edges; recite the goal.
- Symptom: cost spikes right after a long session compacts. → Cache invalidation at the compaction boundary. Reach for checkpoint-and-restart with a small prefix.
- Symptom: a multi-part prompt is parsed wrong. → Unstructured assembly. Reach for structure with delimiters.
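The router above is small enough to encode as a lookup table, for instance in a harness runbook or an alerting rule. The symptom keys below are hypothetical labels invented for this sketch; the causes and levers are the ones listed in the diagnostic.

```python
# Symptom keys are hypothetical labels; causes and levers come from the list above.
ROUTER: dict[str, tuple[str, str]] = {
    "cost_climbs_every_turn": (
        "cache instability", "stable prefix + dynamic content at the tail"),
    "quality_decays_with_length": (
        "unbounded loading", "just-in-time loading and compact-or-checkpoint"),
    "loaded_instruction_ignored": (
        "misplacement", "place at the edges; recite the goal"),
    "cost_spike_after_compaction": (
        "cache invalidation at the compaction boundary",
        "checkpoint-and-restart with a small prefix"),
    "multi_part_prompt_misparsed": (
        "unstructured assembly", "structure with delimiters"),
}


def route(symptom: str) -> str:
    """Map a symptom label to its likely cause and the pattern to reach for."""
    cause, lever = ROUTER[symptom]
    return f"{cause}: reach for {lever}"
```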
Patterns
Stable prefix. Sketch: keep the window front byte-identical turn to turn. When to use: always [Context Engineering for AI Agents: Lessons from Building Manus · Yichao Ji (Manus) (2025) · T3-practitioner]. Mechanics: no timestamps in the prefix, deterministic serialization, append-only history. Remember: caching is a 41–80% cost lever; perturbing the prefix silently breaks the cache.
Dynamic content at the tail. Sketch: put volatile bits at the end of the (system) prompt. When to use: any cacheable prefix with changing parts [Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks · Lumer et al. (2026) · T3-practitioner]. Mechanics: stable front, volatile tail. Remember: naive full-context caching can increase latency.
Just-in-time loading. Sketch: keep pointers; resolve to data on demand. When to use: large or optional context [Effective context engineering for AI agents · Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield (2025) · T1-official]. Mechanics: store identifiers (paths, queries); fetch via tools; defer tool definitions too [Code execution with MCP: Building more efficient agents · Jones & Kelly (Anthropic) (2025) · T1-official]. Remember: preload the minimum; the window is finite.
Compact or checkpoint. Sketch: manage the window when it fills. When to use: long sessions [Effective harnesses for long-running agents · Justin Young (2025) · T1-official]. Mechanics: prefer checkpoint-and-restart from a progress file over lossy summarization. Remember: compaction can invalidate the cache, so keep the restart prefix small.
Place at the edges; recite the goal. Sketch: exploit the U-shaped attention curve. When to use: load-bearing instructions; long tasks [Lost in the Middle: How Language Models Use Long Contexts · Liu et al. (TACL) (2023) · T3-practitioner]. Mechanics: put critical content at an edge; re-emit goals at the tail (todo.md) [Context Engineering for AI Agents: Lessons from Building Manus · Yichao Ji (Manus) (2025) · T3-practitioner]. Remember: the middle is where attention sags.
Structure with delimiters. Sketch: mark up multi-part context. When to use: mixed instructions, context, and examples [Prompting best practices: use XML tags · Anthropic · T1-official]. Mechanics: XML tags as semantic anchors; tool descriptions as clear prose. Remember: the assembled window is a prompt, and structure shapes attention.
Quick reference
- The window is assembled from pre-commitment / loaded / history / placement regions.
- Stable prefix = cost + latency lever (41–80% / 13–31%); dynamic content at the tail.
- JIT loading: keep pointers, resolve on demand; the window is finite.
- Compaction: prefer checkpoint-and-restart; compaction can invalidate the cache, so keep the restart prefix small.
- Placement: edges beat the middle (U-shaped bias); budget instructions; recite goals at the tail.
- Assembly is a prompt: engineer tool descriptions and structure.
Practice
Exercise solutions
Positional → place load-bearing content at an edge (lost-in-the-middle / U-shaped bias). Length → just-in-time loading + compaction (keep the window small; treat it as finite). Reasoning → decomposition + recitation (shorten what must be reasoned over at once; re-emit goals). Effective-window → budget instructions and load JIT so the working set stays well under the marketed limit. The throughline: every rot failure mode has an assembly lever — which is exactly why the rot chapter (problem) precedes this one (response).
Fix: (1) Just-in-time loading controls what enters the window: keep a lightweight pointer to the config (a path or identifier) and resolve only the needed slice via a tool, instead of inlining the whole file each turn. This pulls the length/finite-window lever (smaller working set, less rot). (2) A stable prefix keeps the cache valid: keep the cacheable front (system prompt, tool defs) byte-identical and put anything volatile at the tail. This pulls the cost/latency lever (41–80% cost / 13–31% TTFT). Together they turn “re-read everything every turn” into “stable cached front + minimal JIT tail,” the assembly answer to the worked complaint.