
Context Engineering

The discipline behind reliable AI agents

TL;DR

Serious AI builders in 2026 have shifted from prompt engineering (optimize one string) to context engineering (optimize the whole information environment). The six layers — system prompt, skills, memory, retrieval, tool results, in-context examples — each carry a token cost, and the engineer decides what enters and exits the window turn-by-turn. ICLR 2026 research (ACE framework) shows +10.6% agent accuracy when contexts are treated as evolving playbooks rather than static prompts.

Updated 2026-04-24 · 10 sources validated

+10.6%

Agent accuracy lift (ACE framework)

Zhang et al., ICLR 2026

6

Context layers to compose

Synthesis

1M

Token window (Opus 4.7)

Anthropic

21pt

Accuracy gap by context format

McMillan 2026

01

The Six Layers of Context

Every turn, the engineer composes six layers of information into the model's window. Budget is finite — every token of in-context examples is a token not available for retrieval.

System prompt

Layer 1

Base persona, rules, non-negotiable constraints. Fixed per application.

Skills (progressive disclosure)

Layer 2

Loadable capability packets — tools, expertise, examples. Loaded on demand, unloaded when done.

Memory

Layer 3

Persistent facts about user and project across sessions. Rules ("when X, do Y") earn their spot; summaries rarely do.

Retrieval

Layer 4

Documents pulled based on the current query. Just-in-time beats static RAG in most production systems.

Tool results

Layer 5

Outputs from executed tool calls. Must be pre-summarized or paginated — a 5MB JSON tool breaks context engineering.

In-context examples

Layer 6

Few-shot demonstrations of the desired behavior. Curated, not hardcoded — library over hardcode.
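The six-layer composition above can be sketched as a budgeted assembly step. Everything below is an illustrative assumption, not any real framework's API: the function names, the priority order, and the rough 4-characters-per-token estimate are all placeholders for whatever your stack actually uses.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token (illustrative)."""
    return max(1, len(text) // 4)


def compose_context(layers: dict[str, str], budget: int) -> tuple[str, dict[str, int]]:
    """Concatenate layers in priority order, dropping any layer that
    would overflow the token budget, and report per-layer spend."""
    # Fixed layers first (system prompt), flexible layers last (examples).
    order = ["system_prompt", "skills", "memory", "retrieval",
             "tool_results", "examples"]
    used, parts, spend = 0, [], {}
    for name in order:
        text = layers.get(name, "")
        cost = estimate_tokens(text) if text else 0
        if text and used + cost <= budget:
            parts.append(text)
            used += cost
            spend[name] = cost
        else:
            spend[name] = 0  # dropped or empty: no budget spent
    return "\n\n".join(parts), spend
```

In practice the flexible layers (retrieval, in-context examples) would be trimmed rather than dropped wholesale, but the accounting principle is the same: every layer's cost is known before the window is sent.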

02

Core Patterns

Five patterns separate production-grade context engineering from prompt-engineering-at-scale. Each trades complexity for reliability.

Progressive disclosure

Pattern 1

Load skills when needed, unload when done. Claude Code's skill system is the canonical example.

Just-in-time retrieval

Pattern 2

Let the model call grep or vector search *after* it understands the user's intent. Beats large static RAG prompts.

Memory that earns its spot

Pattern 3

Every persistent memory costs tokens forever. Write rules, not summaries. Avoid context rot.

Tool output shaping

Pattern 4

Pre-summarize, paginate, or stream tool outputs. Format choice drove a 21-point accuracy gap in 2026 benchmarks.

Context budget accounting

Pattern 5

Track tokens per layer. A debugging agent whose system prompt eats 30% of the window has 30% less room for the actual bug.
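Pattern 4 (tool output shaping) can be illustrated with a minimal pagination wrapper: instead of returning a full payload into the window, the tool returns one page plus a cursor the model can use to request more. The names here (`paginate_tool_output`, `next_cursor`) are hypothetical, not part of any particular tool framework.

```python
def paginate_tool_output(items: list, cursor: int = 0, page_size: int = 5) -> dict:
    """Return one page of a tool result plus a continuation cursor,
    so the full payload never enters the context window at once."""
    page = items[cursor:cursor + page_size]
    has_more = cursor + page_size < len(items)
    return {
        "items": page,
        "total": len(items),                              # model sees the true size
        "next_cursor": cursor + page_size if has_more else None,  # None = last page
    }
```

The same shaping idea applies to pre-summarization: the wrapper decides what fraction of the raw output is worth its tokens before the model ever sees it.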

03

Why Long Context Is Not a Free Lunch

Claude Opus 4.7 ships with 1M tokens. Gemini 2.5 Pro does the same. Yet the "Lost in the Middle" study (Liu et al., 2024) showed models reliably attend to the beginning and end of context but degrade in the middle. The NVIDIA RULER benchmark confirmed the same result across every frontier model tested. Longer context changes the question from "how do we fit our information in?" to "how do we decide what *not* to include?" — which is still a context engineering problem.
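One common mitigation consistent with the Lost-in-the-Middle finding is to order retrieved documents so the highest-scoring ones land at the edges of the context, where models attend most reliably. A minimal sketch, assuming documents arrive as (score, text) pairs; the interleaving scheme is one heuristic among several, not a prescribed method:

```python
def edge_order(docs: list[tuple[float, str]]) -> list[str]:
    """Place odd-ranked docs at the front and even-ranked docs at the
    back (reversed), leaving the weakest docs in the middle."""
    ranked = sorted(docs, key=lambda d: d[0], reverse=True)
    front = [text for _, text in ranked[0::2]]  # ranks 1, 3, 5, ...
    back = [text for _, text in ranked[1::2]]   # ranks 2, 4, 6, ...
    return front + back[::-1]
```

With this ordering, the two strongest documents sit first and last, exactly where attention degrades least.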

04

When to Use Context Engineering

Use it when building systems that run many turns or serve many users; when reliability matters more than any single clever output; when you have tools, memory, retrieval, or skills to compose. Skip it when you're solving a one-shot task (write a single email, summarize a single doc) — just prompt engineer. Anti-pattern: treating context engineering as "prompt engineering for long context." If your entire approach is "stuff more stuff into the prompt," you're doing prompt engineering at scale, not context engineering.

05

Production Case Studies

Every serious agentic system in 2026 is a context engineering system wearing different clothes. Claude Code uses progressive skill loading + MCP servers + CLAUDE.md files. Cursor tunes what goes into the context for every keystroke. Anthropic Managed Agents (2026) abstracts the runtime away entirely, making context design the only work left for authors. Fortune 500 AI Centers of Excellence turn context engineering into governance — who decides what is retrievable, loadable, persistable, callable.

Key Findings

1

Context engineering optimizes the whole information environment, not a single prompt — every token is a decision

2

ICLR 2026 ACE paper: treating contexts as evolving playbooks lifts agent accuracy +10.6% on AppWorld and +8.6% on finance benchmarks

3

Six layers compose the window — system prompt, skills, memory, retrieval, tool results, in-context examples — each with its own budget

4

Long context is not a free lunch: Lost-in-the-Middle and RULER benchmarks show degradation regardless of window size

5

Production systems (Claude Code, Cursor, Anthropic Managed Agents) are context engineering systems wearing different clothes

6

Format choice matters: McMillan 2026 measured a 21-point accuracy gap between context formats across 9,649 experiments

7

If your agent is unreliable, the first debug is "what's actually in the context?" — not "what should the prompt say?"

Research Transparency

Limitations

  • Fast-moving field — patterns from 2025-Q4 (Anthropic skills, MCP) are less than a year old and best practices are still emerging.
  • Quantified impact (ACE +10.6%) measured on AppWorld/finance benchmarks — generalization to broader agent tasks is plausible but unproven.
  • Most production case studies (Claude Code, Cursor, Managed Agents) are Anthropic-ecosystem; patterns may differ for open-source or multi-vendor stacks.

What We Don't Know

  • Whether context engineering consolidates into a formal engineering discipline (like DevOps or SRE) or remains a collection of patterns.
  • How enterprise governance of context (who can retrieve what, which skills are approved) will evolve as regulated industries adopt agents.
  • Whether "context rot" can be fully solved or is an intrinsic property of attention-based models at scale.
Evidence Grade: A (Peer-reviewed / meta-analyses)

Frequently Asked Questions

How is context engineering different from RAG?

RAG is one tool inside context engineering. Context engineering is the discipline of deciding when to RAG, what else to include alongside retrieval, and how to budget the window. RAG is a technique; context engineering is the architecture.