You can write the perfect prompt and still get a wrong answer. The model isn’t broken. It just didn’t have the right information in front of it when it decided what to do.
That gap is what context engineering fixes. Prompt engineering is about how you ask. Context engineering is about what the model knows, sees, and remembers at the moment it acts. One tunes a single string. The other designs the whole input the model reads on every step.
This is Part 2 of the Designing AI-Native Applications series. In Part 1 we saw that an agent decides its own path. This post is about the thing it decides from — the context window — and how to engineer it on purpose.
- Context engineering = deciding what’s in the window when the model runs: instructions, retrieved facts, tool results, and memory — not just the prompt wording.
- The context window is RAM, not storage. It’s small, it’s working memory, and you assemble it fresh on a token budget every step.
- More context is not better. Past a point, extra tokens make output worse — so the job is choosing what gets in, not stuffing everything in.
Why a Good Prompt Isn’t Enough
Prompt engineering treats the input as one block of text you word carefully. That works when the task is self-contained. It stops working the moment the model needs facts it wasn’t trained on, tools to act with, or memory of what happened earlier.
Real systems need all three. So the input stops being a prompt and becomes an assembly — pulled from many sources, every single step.
The model only knows what’s in that window when it runs. Not what’s in your database, not what it said yesterday — only what you placed in front of it this step. Context engineering is the work of deciding what that is. A 2026 industry survey found most teams now agree that prompting alone can’t power AI at scale, and this is why: the hard part moved from wording to assembly.
Your Context Window Is RAM, Not Storage
Here’s the mental model that makes everything click: treat the context window like RAM, not a hard drive. It’s small. It’s temporary. It holds only what the current step needs. Your databases, files, and past conversations are storage — they sit outside and get loaded in only when relevant.
That reframe matters because RAM is something you manage. You decide what to load, in what order, and what to drop when space runs low. Here’s what competes for that space:
| What goes in | Why it’s there | Watch out for |
|---|---|---|
| System prompt | Who the model is, the rules | Quietly bloats over time |
| Instructions & policies | The task and its guardrails | Can clash with retrieved text |
| Retrieved knowledge (RAG) | Facts for this question | Too many chunks crowd out the rest |
| Tool definitions & results | What it can do, what it found | Results pile up every call |
| Memory & history | What happened earlier | Grows on every turn |
| User input | The actual request | Easy to bury under everything else |
Notice that most of these grow on their own. History and tool results expand with every step until they crowd out the very thing you care about.
A 200,000-token window sounds roomy, but it fills fast. Once half of it is stale tool output and old turns, the model can quietly stop following your instructions. Big windows don’t remove the problem; they just hide it for longer. Managing that is the architecture. If retrieval is one of your sources, the mechanics of it live in What Is RAG and the embeddings primer. This post is about how that retrieved text shares the window with everything else.
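A quick way to see that crowding before it hurts is to measure how the window is split across sources. The sketch below is only an illustration: `estimate_tokens` uses a rough 4-characters-per-token heuristic that is an assumption for this example (swap in your model’s real tokenizer), and the source names and sample contents are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (an assumption, not a real tokenizer)."""
    return (len(text) + 3) // 4


def window_shares(sources: dict[str, list[str]]) -> dict[str, float]:
    """Return each source's share of the estimated tokens in the window."""
    totals = {
        name: sum(estimate_tokens(text) for text in texts)
        for name, texts in sources.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in totals}
    return {name: count / grand_total for name, count in totals.items()}


# Hypothetical window contents after a few agent steps.
window = {
    "system_prompt": ["You are a support assistant. Follow the refund policy."],
    "tool_results": ["{...long JSON payload...}" * 40],
    "history": ["user: hi", "assistant: hello, how can I help?"],
    "user_input": ["Can I get a refund for order 1234?"],
}

shares = window_shares(window)

# Print the sources from largest to smallest share of the window.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:>14}: {share:.0%}")
```

If one source, typically tool results or history, dominates a report like this, that is the first place to compress or prune.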
Context Engineering Is a Pipeline, Not a String
Because the window is RAM, you don’t write it — you build it, fresh, each step. A simple assembly pipeline looks like this:
```python
# Context is built fresh each step, to a token budget.
ctx = system_prompt()                       # who + the rules
ctx += retrieve(query, k=3)                 # only the top matches
ctx += summarize(history, max_tokens=500)   # compress, don't dump
ctx += [user_input]
ctx = fit_to_budget(ctx, limit=8000)        # drop least-relevant if over
answer = model(ctx)
```
Read the steps. You gather from each source, select only what’s relevant (top matches, not everything), compress what’s bulky (summarize history instead of pasting it), order it so the important parts sit where the model reads best (here, the system prompt first and the user’s request last), and fit it to a budget by dropping the least useful items when you run out of room. That five-step loop of gather, select, compress, order, and fit is context engineering in practice.
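To make those five steps concrete, here is a minimal runnable sketch. Everything in it is a stand-in: word-overlap scoring replaces a real retriever, keeping recent turns replaces real summarization, a 4-characters-per-token estimate replaces a tokenizer, and names such as `Item`, `build_context`, and the priority values are hypothetical, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    priority: int  # higher = more important to keep


def tokens(text: str) -> int:
    # Assumption: ~4 characters per token; use a real tokenizer in practice.
    return (len(text) + 3) // 4


def select(query: str, docs: list[str], k: int) -> list[str]:
    # Stand-in retriever: rank docs by how many words they share with the query.
    words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return ranked[:k]


def compress(history: list[str], max_tokens: int) -> str:
    # Stand-in summarizer: keep the most recent turns that fit under the cap.
    kept: list[str] = []
    used = 0
    for turn in reversed(history):
        if used + tokens(turn) > max_tokens:
            break
        kept.append(turn)
        used += tokens(turn)
    return "\n".join(reversed(kept))


def fit(items: list[Item], limit: int) -> list[Item]:
    # Drop the lowest-priority item (the latest one on ties) until the total fits.
    kept = list(items)
    while kept and sum(tokens(i.text) for i in kept) > limit:
        worst = min(range(len(kept)), key=lambda idx: (kept[idx].priority, -idx))
        del kept[worst]
    return kept


def build_context(system: str, query: str, docs: list[str],
                  history: list[str], limit: int) -> str:
    items = [Item(system, priority=3)]                                  # gather
    items += [Item(d, priority=1) for d in select(query, docs, k=3)]    # select
    summary = compress(history, max_tokens=500)                         # compress
    if summary:
        items.append(Item(summary, priority=2))
    items.append(Item(query, priority=4))                               # order: request last
    return "\n\n".join(i.text for i in fit(items, limit))               # fit


docs = [
    "Refunds are allowed within 30 days.",
    "Shipping takes 5 days.",
    "Our office is in Berlin.",
    "Refunds go to the original payment method.",
]
ctx = build_context(
    system="You are a support assistant.",
    query="How do refunds work?",
    docs=docs,
    history=["user: hi", "assistant: hello"],
    limit=35,
)
print(ctx)
```

With the deliberately tight `limit=35`, the fit step drops the weakest retrieved chunks first and leaves the system prompt and the user’s request in place. That is the design choice worth copying: decide in advance what is safe to lose.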
Where Context Breaks
This is the part that surprises people: a window can fail long before it’s full. The common failure modes:
- Context rot. As the input grows, output quality drops — even when you’re nowhere near the token limit. More tokens in can mean worse answers out (Context Rot, Morph).
- Lost in the middle. Models pay most attention to the start and end of the input. Facts buried in the middle get skimmed, so instructions get ignored and details get missed.
- Context pollution. Old tool results and stale turns accumulate. The window fills with noise that drags the model off course.
- Hallucination propagation. Once a wrong fact enters the context, later steps build on it. The error compounds instead of correcting itself.
Every one of these looks like “the model is dumb.” It usually isn’t. It’s a context problem wearing a model costume — which is exactly why this is an architecture concern, not a prompting tweak.
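Two of these failures have simple structural mitigations, sketched below under stated assumptions. `prune_tool_results` addresses pollution by keeping only the newest tool outputs verbatim and replacing older ones with a short placeholder. `order_for_edges` addresses lost-in-the-middle by placing the highest-scoring chunks at the start and end of the list. The `Message` shape, the role names, the placeholder text, and the relevance scores are all hypothetical; neither function comes from a library.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Message:
    role: str      # assumed roles: "system", "user", "assistant", "tool"
    content: str


def prune_tool_results(messages: list[Message], keep_last: int = 2) -> list[Message]:
    """Keep the newest `keep_last` tool results; stub out the older ones."""
    tool_positions = [i for i, m in enumerate(messages) if m.role == "tool"]
    # A slice ending at -0 would be empty, so handle keep_last == 0 explicitly.
    stale = set(tool_positions[:-keep_last]) if keep_last > 0 else set(tool_positions)
    return [
        replace(m, content="[older tool result omitted]") if i in stale else m
        for i, m in enumerate(messages)
    ]


def order_for_edges(chunks: list[tuple[str, float]]) -> list[str]:
    """Put the highest-scoring chunks at the start and end, the weakest in the middle."""
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)
    front: list[str] = []
    back: list[str] = []
    for rank, (text, _score) in enumerate(ranked):
        # Alternate: best goes to the front, second best to the back, and so on.
        (front if rank % 2 == 0 else back).append(text)
    return front + back[::-1]


messages = [
    Message("tool", "result 1"),
    Message("user", "What changed?"),
    Message("tool", "result 2"),
    Message("tool", "result 3"),
]
pruned = prune_tool_results(messages, keep_last=2)
ordered = order_for_edges([("A", 0.9), ("B", 0.8), ("C", 0.5), ("D", 0.2)])
```

In the sample, only the oldest tool result is stubbed out, and the two strongest chunks (`A` and `B`) end up first and last. Neither step fixes context rot or a wrong fact already in the window; they only reduce noise and improve placement.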
When to Keep It Simple
Context engineering is a tool, not a tax. For a small, self-contained task, a clear prompt is still the right answer. If the model already has everything it needs in the question, don’t build a retrieval pipeline around it.
Reach for real context engineering when:
- The model needs outside facts it wasn’t trained on — your docs, your data, today’s numbers.
- It uses tools whose results must feed the next step.
- It runs over many turns and has to remember what happened.
If none of those apply, a good prompt wins on cost and simplicity — the same “use the simplest thing that works” rule from Part 1.
Quick Recap
- Context engineering decides what’s in the window when the model runs — not just the prompt wording.
- The window is RAM: small, temporary, assembled on a token budget each step.
- The pipeline: gather, select, compress, order, fit.
- It breaks early: context rot, lost-in-the-middle, pollution, and compounding hallucinations.
- Keep it simple when the task is self-contained; engineer context when the model needs facts, tools, or memory.
Conclusion
Context engineering is the shift from wording a prompt to designing what the model sees. Treat the window as RAM you assemble on a budget, choose what gets in as carefully as what you leave out, and watch for the failures that hit before the window is even full. Get that right and a lot of “bad model” problems quietly disappear.
What’s the first thing that bloats your context window — tool results, history, or too many retrieved chunks? Tell me in the comments.
Read next: Memory Architecture for AI Agents — Part 3 of Designing AI-Native Applications, on what lives in storage versus the window, and how agents remember across steps.
