Context Rot: Why Windows Degrade
The evidence that long context does not degrade gracefully — four distinct failure modes, why the robust claim is directional not numeric, and why "architectural and unsolved" overshoots in 2026. This is the problem context assembly answers.
The environment half is done: the substrate is made legible, budgeted, loaded on demand, guarded, and bounded at scale. Now the window itself — what the harness assembles from all that available signal, and why it degrades. This is the problem chapter. If long contexts degraded gracefully, “just put everything in the window” would be sound and the next chapter would be unnecessary. The evidence says they do not — so context assembly (next) is a response to a measured failure. This chapter is evidence, not patterns: it builds the case, then hands you a diagnostic for locating which failure mode you’re hitting.
Degradation is four failure modes, not one
“Context rot” is an umbrella over mechanisms that fail for different reasons and are caught by different benchmarks.
- Positional: where the token sits. Recall “is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades… in the middle” (Liu et al., Lost in the Middle: How Language Models Use Long Contexts, TACL, 2023). This is the U-shaped curve.
- Length: how much is in the window. An “evaluation across 18 LLMs… reveal[s] nonuniform performance with increasing input length,” independent of whether the needle is found (Hong, Troynikov & Huber, Context Rot: How Increasing Input Tokens Impacts LLM Performance, Chroma Research, 2025).
- Reasoning: reasoning over the facts, not locating them. Multi-hop degradation is “primarily driven by the reduction in the length of the thinking process as the input length increases” (Wang, Reasoning on Multiple Needles In A Haystack, 2025).
- Effective vs. claimed: the marketed window is not the working one. Of models claiming 32K+, “only half… can maintain satisfactory performance at the length of 32K” (Hsieh et al., RULER: What's the Real Context Size of Your Long-Context Language Models?, NVIDIA, COLM, 2024).
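The positional mode is the easiest of the four to probe on your own stack. The sketch below builds one prompt per needle depth so you can compare recall at the edges against the middle. It is a minimal illustration, not the protocol of any cited paper: `build_depth_probes`, the filler sentences, and the needle are all hypothetical, and the model call that would score each prompt is deliberately left out.

```python
def build_depth_probes(filler, needle, question,
                       depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return {depth: prompt}, with the needle inserted at that fraction of the filler.

    `filler` is a list of distractor sentences (hypothetical data).
    depth 0.0 puts the needle first, 1.0 puts it last, 0.5 puts it mid-window.
    """
    probes = {}
    for depth in depths:
        cut = round(depth * len(filler))
        context = filler[:cut] + [needle] + filler[cut:]
        probes[depth] = "\n".join(context) + "\n\nQuestion: " + question
    return probes


# Illustrative placeholder data, far shorter than a real long-context test.
filler = [f"Filler sentence number {i}." for i in range(8)]
probes = build_depth_probes(
    filler,
    needle="The access code is 7341.",
    question="What is the access code?",
)
```

Sending each prompt to the model under test and plotting accuracy against depth would show whether your setup has the U-shape. A real probe would need filler long enough to approach the lengths you actually run at.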
The robust claim is directional, not numeric
Across an 18-model panel and a peer-reviewed synthetic suite, the robust, corroborated finding is that performance falls with length and that the effective window is materially shorter than the claimed one. RULER reports that “almost all models exhibit large performance drops as the context length increases” (Hsieh et al., RULER: What's the Real Context Size of Your Long-Context Language Models?, NVIDIA, COLM, 2024). The specific percentages are model- and benchmark-dependent: NoLiMa's “11 models drop below 50% of their strong short-length baselines” at 32K (Modarressi et al., NoLiMa: Long-Context Evaluation Beyond Literal Matching, ICML, 2025) and RULER's “only half” at 32K describe particular model panels on particular tasks, not a portable threshold.
The practitioner operationalization is directional too: “get your context into the LLM in the most token- and attention-efficient way you can” (Dex Horthy, 12-Factor Agents, Factor 3: Own your context window, HumanLayer, 2025). Fewer, denser tokens, not a threshold.
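Because the percentages do not transfer, the useful number is the one you measure on your own model and task. The sketch below shows one way to turn such measurements into a working-window estimate: take the longest tested length at which accuracy, at that length and every shorter one, stays above a fraction of the short-length baseline. The function name, the 0.85 fraction, and the scores are all assumptions for illustration; none of them come from the cited benchmarks.

```python
def effective_window(scores_by_length, baseline_fraction=0.85):
    """Estimate a working window from your own measurements.

    scores_by_length: {context_length_in_tokens: accuracy} from your own eval.
    Returns the largest tested length such that it and every shorter tested
    length score at least baseline_fraction * (score at the shortest length).
    The 0.85 default is an arbitrary assumption, not a published threshold.
    """
    lengths = sorted(scores_by_length)
    floor = baseline_fraction * scores_by_length[lengths[0]]
    effective = None
    for length in lengths:
        if scores_by_length[length] < floor:
            break  # stop at the first length that falls below the floor
        effective = length
    return effective


# Placeholder numbers, NOT benchmark results.
measured = {4_000: 0.92, 8_000: 0.90, 16_000: 0.84, 32_000: 0.71, 64_000: 0.55}
print(effective_window(measured))  # prints 16000 for these placeholder numbers
```

The result is only as good as the eval behind it, and it is specific to that model and task. That is the directional claim in practice.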
Two findings that surprise builders
“Unsolved” overshoots, and rot reaches the overseer
The strong framing, that context rot is architectural and no model solves it, overshoots the 2026 evidence. The degradation is robust and near-universal today, but decode-time work shows the attenuation is partially reversible: gold tokens are down-weighted, not erased. They still occupy high-ranking positions in the decoding space and are recoverable at decode time (Xiao et al., Mitigating Posterior Salience Attenuation in Long-Context LLMs with Positional Contrastive Decoding, 2025). Honest synthesis: degradation is near-universal now; whether it is fundamentally architectural or substantially trainable/decodable is open, and the 2026 frontier is actively eroding the “unsolvable” claim. Build for the degradation that exists; don’t bet the architecture on it being permanent.
One 2026 extension matters for design: rot reaches the monitor. An LLM acting as judge/monitor degrades on long transcripts, missing flagged actions far more often as the trace grows (Martin & Roger, Classifier Context Rot: Monitor Performance Degrades with Context Length, 2026). Long-running agentic sessions degrade both the actor and the safety layer watching it.
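If the monitor degrades as the trace grows, one design response is to bound what it sees in a single pass. The sketch below splits a transcript into overlapping segments so each review call gets a short window. It is a minimal illustration under stated assumptions: `segment_transcript` is a hypothetical helper, event count stands in for a real token budget, and the segment size and overlap are arbitrary defaults rather than recommended values.

```python
def segment_transcript(events, max_events=50, overlap=5):
    """Split a list of transcript events into overlapping review segments.

    Overlap keeps a little preceding context so an action and its trigger
    are less likely to be separated by a segment boundary.
    """
    if max_events <= overlap:
        raise ValueError("max_events must be larger than overlap")
    step = max_events - overlap
    segments = []
    for start in range(0, len(events), step):
        segments.append(events[start:start + max_events])
        if start + max_events >= len(events):
            break  # the last segment already reaches the end of the trace
    return segments


# Hypothetical usage: review each segment independently, then merge the flags.
trace = [f"event-{i}" for i in range(120)]
for segment in segment_transcript(trace):
    pass  # call your monitor on `segment` here (model call not shown)
```

Segmenting trades one failure for another: a monitor that sees only a slice can miss issues that are visible only across the whole run, so the overlap and segment size are worth tuning against your own flagged examples.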
Diagnostic: which failure mode are you hitting?
This chapter has no pattern catalog — the responses are the next chapter. Instead, a diagnostic to locate the failure before you reach for a fix (every fix routes to Context Assembly):
- Symptom: the agent ignores something you know is in context. → Positional — it’s likely buried mid-window. Look at placement (move load-bearing content to an edge).
- Symptom: quality falls as the session/file grows, even when the fact is present. → Length — look at how much is loaded (prune, compact).
- Symptom: it finds the right facts but draws the wrong conclusion. → Reasoning — look at decomposition (split the multi-hop task).
- Symptom: it works in small repros, fails at “full” context well under the limit. → Effective-vs-claimed — look at the working window, not the marketed one.
- Symptom: your LLM judge/monitor stops flagging issues on long runs. → Monitor rot — shorten/segment what the overseer reviews.
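The diagnostic above is a lookup from symptom to failure mode, and it can be kept as one in a triage script or runbook. The sketch below encodes the five rows as data. The symptom keys and the `Diagnosis` type are hypothetical names chosen for illustration; only the mapping itself comes from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    mode: str      # which failure mode the symptom points to
    look_at: str   # where to look before reaching for a fix


# Keys are hypothetical shorthand for the symptoms in the diagnostic list.
DIAGNOSTIC = {
    "ignores_known_fact": Diagnosis(
        "positional", "placement: move load-bearing content to an edge"),
    "quality_falls_as_context_grows": Diagnosis(
        "length", "how much is loaded: prune, compact"),
    "right_facts_wrong_conclusion": Diagnosis(
        "reasoning", "decomposition: split the multi-hop task"),
    "fails_well_under_limit": Diagnosis(
        "effective-vs-claimed", "the working window, not the marketed one"),
    "monitor_stops_flagging": Diagnosis(
        "monitor rot", "shorten or segment what the overseer reviews"),
}


def diagnose(symptom):
    """Return the Diagnosis for a known symptom key, or None if unrecognized."""
    return DIAGNOSTIC.get(symptom)
```

For example, `diagnose("right_facts_wrong_conclusion").mode` returns `"reasoning"`. A real session can show more than one symptom at once, so treat the table as a starting point for investigation rather than a classifier.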
Quick reference
- Four failure modes: positional · length · reasoning · effective-vs-claimed. Different causes, different fixes.
- Robust = directional: performance falls with length; the effective window is materially shorter than the claimed window. Never quote a percentage as if it transfers across models or benchmarks.
- Surprises: coherence can hurt; retrieval ≠ reasoning.
- “Unsolved” overshoots: degradation is near-universal now, but partially reversible at decode time; whether it is architectural or trainable/decodable is an open question.
- Rot reaches the overseer: long traces degrade the monitor too.
- The responses are the next chapter (Context Assembly).
Practice
Exercise solutions
- Positional: a known in-context fact is ignored → the fix is placement.
- Length: quality decays as the window fills, with the needle present → the fix is amount.
- Reasoning: right facts, wrong multi-hop conclusion → the fix is decomposition.
- Effective-vs-claimed: fine in small repros, fails well under the limit → the fix is working-window awareness.
The most-misdiagnosed is reasoning degradation: “it found the facts but got the answer wrong” looks like a capability gap, but the evidence attributes it to a shortening thinking process at length, which is a context problem with a context fix (decompose), not a reason to swap models.
“A larger marketed window isn’t a larger working window — RULER found only half of 32K-claimed models hold up at 32K, and degradation is near-universal as length grows, so a 1M model will still rot well before 1M. I’d first diagnose length and reasoning degradation: prune/compact what’s loaded and decompose the multi-hop steps before assuming we need more context — the fix is assembly, not a bigger window.” The deeper point: rot is why context engineering exists; buying window capacity treats the symptom’s label, not the mechanism.