Context Engineering
The discipline behind reliable AI agents
Serious AI builders in 2026 have shifted from prompt engineering (optimize one string) to context engineering (optimize the whole information environment). The six layers — system prompt, skills, memory, retrieval, tool results, in-context examples — each carry a token cost, and the engineer decides what enters and exits the window turn-by-turn. ICLR 2026 research (ACE framework) shows +10.6% agent accuracy when contexts are treated as evolving playbooks rather than static prompts.
Key stats:
- +10.6% agent accuracy lift from the ACE framework (Zhang et al., ICLR 2026)
- 6 context layers to compose (synthesis)
- 21-point accuracy gap by context format (McMillan 2026)
The Six Layers of Context
Every turn, the engineer composes six layers of information into the model's window. Budget is finite — every token of in-context examples is a token not available for retrieval.
System prompt
Layer 1: Base persona, rules, non-negotiable constraints. Fixed per application.
Skills (progressive disclosure)
Layer 2: Loadable capability packets (tools, expertise, examples). Loaded on demand, unloaded when done.
Memory
Layer 3: Persistent facts about the user and project across sessions. Rules ("when X, do Y") earn their spot; summaries rarely do.
Retrieval
Layer 4: Documents pulled based on the current query. Just-in-time retrieval beats static RAG in most production systems.
Tool results
Layer 5: Outputs from executed tool calls. Must be pre-summarized or paginated; a 5 MB JSON tool result breaks the budget.
In-context examples
Layer 6: Few-shot demonstrations of the desired behavior. Curated from a library, not hardcoded into the prompt.
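The layer composition above can be sketched as a budgeted assembly step. This is a minimal illustration, not any vendor's API: the layer names, priority order, and the rough 4-characters-per-token estimate are all assumptions for the sketch.

```python
# Hypothetical sketch: compose the six layers under a fixed token budget.
# Layers are added in priority order; a layer that would overflow is skipped.

def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def compose_context(layers: dict[str, str], budget: int) -> str:
    priority = ["system_prompt", "skills", "memory",
                "retrieval", "tool_results", "examples"]
    parts, used = [], 0
    for name in priority:
        text = layers.get(name, "")
        cost = estimate_tokens(text)
        if text and used + cost <= budget:
            parts.append(text)
            used += cost
    return "\n\n".join(parts)

window = compose_context(
    {"system_prompt": "You are a careful coding agent.",
     "retrieval": "doc: how to configure the linter..."},
    budget=1000,
)
```

A real system would use the model's tokenizer instead of the character heuristic, but the key idea is the same: composition is an explicit, per-turn decision, not concatenation.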
Core Patterns
Five patterns separate production-grade context engineering from prompt-engineering-at-scale. Each trades complexity for reliability.
Progressive disclosure
Pattern 1: Load skills when needed, unload when done. Claude Code's skill system is the canonical example.
Just-in-time retrieval
Pattern 2: Let the model call grep or vector search *after* it understands the user's intent. Beats large static RAG prompts.
Memory that earns its spot
Pattern 3: Every persistent memory costs tokens forever. Write rules, not summaries, to avoid context rot.
Tool output shaping
Pattern 4: Pre-summarize, paginate, or stream tool outputs. Format choice drove a 21-point accuracy gap in 2026 benchmarks.
Context budget accounting
Pattern 5: Track tokens per layer. A debugging agent whose system prompt eats 30% of the window has 30% less room for the actual bug.
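Tool output shaping (Pattern 4) is the easiest of these to sketch concretely. The example below is illustrative only: the field names (`total`, `showing`, `items`) and the page size are assumptions, not a standard format.

```python
import json

# Hypothetical sketch of Pattern 4: shape a large tool result before it
# enters the window, returning a first page plus a count instead of the blob.

def shape_tool_output(raw: str, max_items: int = 5) -> str:
    data = json.loads(raw)
    if isinstance(data, list) and len(data) > max_items:
        return json.dumps({
            "total": len(data),        # let the model know what it's missing
            "showing": max_items,      # page size actually included
            "items": data[:max_items], # first page only
        })
    return raw  # small results pass through unshaped

raw = json.dumps([{"id": i} for i in range(500)])
shaped = shape_tool_output(raw)
```

The agent can then request page 2 explicitly if it needs it, which keeps pagination a deliberate context decision rather than an accident of payload size.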
Why Long Context Is Not a Free Lunch
Claude Opus 4.7 ships with a 1M-token window; Gemini 2.5 Pro does the same. Yet the "Lost in the Middle" study (Liu et al., 2024) showed that models reliably attend to the beginning and end of the context but degrade in the middle, and the NVIDIA RULER benchmark confirmed the same pattern across every frontier model tested. Longer context changes the question from "how do we fit our information in?" to "how do we decide what *not* to include?", which is still a context engineering problem.
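One common mitigation for the Lost-in-the-Middle effect, not discussed above but consistent with it, is to order retrieved documents so the highest-scoring ones sit at the edges of the context and the weakest land in the middle. A hedged sketch, assuming documents arrive sorted by relevance (descending):

```python
# Assumed input: documents already sorted best-first by retrieval score.
# Alternate them onto the front and back so the weakest end up mid-context,
# where attention degrades the most.

def edge_order(docs_best_first: list[str]) -> list[str]:
    front, back = [], []
    for i, doc in enumerate(docs_best_first):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

ordered = edge_order(["best", "second", "third", "fourth", "fifth"])
# "best" and "second" sit at the edges; "fifth" lands in the middle.
```

This doesn't solve the attention bias, it just arranges the context to suffer least from it.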
When to Use Context Engineering
Use it when building systems that run many turns or serve many users; when reliability matters more than any single clever output; when you have tools, memory, retrieval, or skills to compose. Skip it when you're solving a one-shot task (write a single email, summarize a single doc) — just prompt engineer. Anti-pattern: treating context engineering as "prompt engineering for long context." If your entire approach is "stuff more stuff into the prompt," you're doing prompt engineering at scale, not context engineering.
Production Case Studies
Every serious agentic system in 2026 is a context engineering system wearing different clothes. Claude Code combines progressive skill loading, MCP servers, and CLAUDE.md files. Cursor tunes what goes into the context on every keystroke. Anthropic Managed Agents (2026) abstracts the runtime away entirely, leaving context design as the author's only remaining work. Fortune 500 AI Centers of Excellence turn context engineering into governance: who decides what is retrievable, loadable, persistable, callable.
Key Findings
Context engineering optimizes the whole information environment, not a single prompt — every token is a decision
ICLR 2026 ACE paper: treating contexts as evolving playbooks lifts agent accuracy +10.6% on AppWorld and +8.6% on finance benchmarks
Six layers compose the window — system prompt, skills, memory, retrieval, tool results, in-context examples — each with its own budget
Long context is not a free lunch: Lost-in-the-Middle and RULER benchmarks show degradation regardless of window size
Production systems (Claude Code, Cursor, Anthropic Managed Agents) are context engineering systems wearing different clothes
Format choice matters: McMillan 2026 measured a 21-point accuracy gap between context formats across 9,649 experiments
If your agent is unreliable, the first debug is "what's actually in the context?" — not "what should the prompt say?"
Research Transparency
Limitations
- Fast-moving field: patterns from 2025-Q4 (Anthropic skills, MCP) are less than a year old and best practices are still emerging.
- Quantified impact (ACE +10.6%) was measured on AppWorld/finance benchmarks; generalization to broader agent tasks is plausible but unproven.
- Most production case studies (Claude Code, Cursor, Managed Agents) come from the Anthropic ecosystem; patterns may differ for open-source or multi-vendor stacks.
What We Don't Know
- Whether context engineering consolidates into a formal engineering discipline (like DevOps or SRE) or remains a collection of patterns.
- How enterprise governance of context (who can retrieve what, which skills are approved) will evolve as regulated industries adopt agents.
- Whether "context rot" can be fully solved or is an intrinsic property of attention-based models at scale.
Frequently Asked Questions
How is context engineering different from RAG?
RAG is one tool inside context engineering. Context engineering is the discipline of deciding when to RAG, what else to include alongside retrieval, and how to budget the window. RAG is a technique; context engineering is the architecture.
Sources & References
10 validated sources · Last updated 2026-04-24