"RAG is dead" has been one of the louder takes in AI circles this year. It isn't dead — static RAG is the problem, and the fix isn't a new database, it's letting an agent decide when and what to retrieve. That shift has a name, agentic RAG, and it's the bridge between the retrieval you already know and the agents you've already built.
The momentum is real: Gartner expects worldwide AI spending to jump 47% in 2026, and a growing slice of that is agents that retrieve their own context. By the end of this post you'll know exactly what agentic RAG adds, see the loop in runnable code, and, just as important, know when plain RAG is still the right call.
- You know what RAG is; if not, start with what RAG actually is
- Ideally you've built the plain version; the from-scratch RAG tutorial gives you the retrieve() we reuse here
- You've seen an agent loop before; the ReAct loop from scratch is the same shape
- Static RAG retrieves once and hopes; the agentic version grades its own results and retries.
- The loop is four moves: retrieve, grade, re-query if weak, then generate.
- It wins on hard, multi-part questions where one retrieval was never going to be enough.
- It costs more — 3–5× the model calls — so you turn it on where it pays, not everywhere.
Where Static RAG Quietly Fails
Plain RAG is a fixed three-step pipeline: embed the question, retrieve the top chunks, generate an answer from whatever comes back. It works beautifully right up until the first retrieval is wrong — and then it fails silently, because the model answers confidently from the bad context it was handed.
I hit this constantly with compound questions. Ask "how do our refund and cancellation policies differ for annual plans?" and a single retrieval grabs chunks about refunds or cancellations, rarely both. Static RAG has no second chance: one query, one fetch, no way to notice the chunks don't actually answer the question. It can't tell a great retrieval from a useless one, so a vague question quietly produces a vague — or wrong — answer.
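To make the "no second chance" point concrete, here is a minimal sketch of the static pipeline with the retriever and generator passed in as plain functions. The names static_rag, retrieve_fn, and generate_fn are placeholders for illustration, not code from the earlier tutorial, and the toy retriever and its policy text are made up. Notice there is no branch anywhere: whatever the retriever returns goes straight to generation.

```python
from typing import Callable, List


def static_rag(
    question: str,
    retrieve_fn: Callable[[str], List[str]],
    generate_fn: Callable[[str, List[str]], str],
) -> str:
    """One query, one fetch, one answer. No quality check in between."""
    chunks = retrieve_fn(question)        # single retrieval, fixed top-k
    return generate_fn(question, chunks)  # answers from whatever came back


# Toy stand-ins (hypothetical) to show the silent failure mode.
def refund_only_retriever(question: str) -> List[str]:
    # Pretend the search latched onto "refund" and missed the other half.
    return ["Refunds: annual plans are refundable within 30 days."]


def echo_generator(question: str, chunks: List[str]) -> str:
    # A real generator would call a model; this one just echoes its context.
    return f"Based on {len(chunks)} chunk(s): " + " ".join(chunks)


answer = static_rag(
    "How do our refund and cancellation policies differ for annual plans?",
    refund_only_retriever,
    echo_generator,
)
print(answer)
```

The answer comes back looking complete even though half of the question was never covered by the context, and nothing in the pipeline is in a position to notice.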
"Agentic RAG" sounds like a new technology. It isn't. It's the same retrieve-augment-generate core with one addition: the system is allowed to check its own work and try again.
Plain RAG vs Agentic RAG: The Decision Table
Here's the difference laid out on the dimensions that actually decide which one you reach for.
| Dimension | Plain (static) RAG | Agentic RAG |
|---|---|---|
| Retrieval | Once, fixed top-k | Loops until good enough |
| On bad results | Answers anyway | Grades, rewrites, retries |
| Query handling | Uses your words as-is | Rewrites and can split it |
| Hard / multi-part questions | Often misses | Much stronger |
| Cost per answer | 1 model call | 3–5× the calls |
| Latency | Low | Higher (extra round trips) |
| Best for | Clear questions, clean docs | Ambiguous or compound questions |
The bottom line: that loop buys accuracy on the questions plain RAG flubs, and you pay for it in calls and latency. That trade — not novelty — is the whole decision.
The Agentic RAG Loop: Retrieve, Grade, Re-Query
The core idea is a loop, and you've seen its shape before in any agent: act, observe, decide, repeat. Here the action is retrieval and the observation is a relevance check.
You can build this on top of the retrieve() from the from-scratch RAG post. It adds two small functions around it, a grader and a rewriter, plus a generate_from helper that answers from the chunks that were just graded:
```python
# Assumes `client` (an OpenAI client) and `retrieve()` already exist,
# as set up in the from-scratch RAG post.

def grade(question, chunks):
    """Ask the model whether the chunks can actually answer the question."""
    joined = "\n\n".join(chunks)
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"Can the context below answer the question? Reply only YES or NO.\n\n"
            f"Context:\n{joined}\n\nQuestion: {question}"}],
    )
    return out.choices[0].message.content.strip().upper().startswith("YES")


def rewrite(question):
    """Rephrase the question so retrieval has a better shot."""
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"Rewrite this question to be clearer for document search:\n{question}"}],
    )
    return out.choices[0].message.content.strip()


def generate_from(question, chunks):
    """Answer from the exact chunks we just graded, with no second retrieval."""
    context = "\n\n".join(chunks)
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"}],
    )
    return out.choices[0].message.content


def agentic_answer(question, store, max_tries=3):
    for _ in range(max_tries):
        chunks = retrieve(question, store)          # reused from Part 1
        if grade(question, chunks):                 # the decision static RAG never makes
            return generate_from(question, chunks)  # answer the chunks we just graded
        question = rewrite(question)                # weak chunks? try a better query
    return "I couldn't find enough to answer that confidently."
```

That for loop is the entire difference. The model now judges its own retrieval and gets up to three attempts to find chunks that actually answer the question, and generate_from answers from the exact chunks it just approved, never a fresh fetch. This pattern has a research lineage: it's a simplified, prompt-only take on Self-RAG and corrective RAG, where retrieved context gets critiqued before it is trusted.
Here's the loop earning its keep on that refund-versus-cancellation question:
- Try 1: retrieving on the original wording pulls two refund chunks; grade returns NO because nothing about cancellation came back.
- Rewrite: the question becomes "annual plan cancellation terms and refund eligibility."
- Try 2: retrieval now returns one refund chunk and one cancellation chunk; grade returns YES, and the model answers from both.
Static RAG stops at Try 1 and answers the refund half as if it were the whole story. That second pass is the accuracy you're paying the extra calls for.
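The walk-through above can be replayed without any API calls. The sketch below is a variant of the loop, not the post's exact code: it takes the retriever, grader, and rewriter as arguments (all stubbed here with made-up data), grades against the user's original question while only the search query gets rewritten, and records what happened on each try. Every name in it is hypothetical.

```python
from typing import Callable, List, Tuple

Trace = List[Tuple[str, List[str], bool]]  # (query used, chunks, grade result)


def retrieve_with_retries(
    question: str,
    retrieve_fn: Callable[[str], List[str]],
    grade_fn: Callable[[str, List[str]], bool],
    rewrite_fn: Callable[[str], str],
    max_tries: int = 3,
) -> Tuple[List[str], Trace]:
    """Return approved chunks (empty if none passed) plus a per-try trace."""
    query = question
    trace: Trace = []
    for attempt in range(max_tries):
        chunks = retrieve_fn(query)
        ok = grade_fn(question, chunks)  # always grade against the original question
        trace.append((query, chunks, ok))
        if ok:
            return chunks, trace
        if attempt < max_tries - 1:      # skip the rewrite when no tries remain
            query = rewrite_fn(query)
    return [], trace


# Stubs that mimic the refund/cancellation walk-through (hypothetical data).
def fake_retrieve(query: str) -> List[str]:
    if "cancellation terms" in query:
        return ["refund chunk", "cancellation chunk"]
    return ["refund chunk A", "refund chunk B"]


def fake_grade(question: str, chunks: List[str]) -> bool:
    text = " ".join(chunks)
    return "refund" in text and "cancellation" in text


def fake_rewrite(query: str) -> str:
    return "annual plan cancellation terms and refund eligibility"


question = "How do our refund and cancellation policies differ for annual plans?"
chunks, trace = retrieve_with_retries(question, fake_retrieve, fake_grade, fake_rewrite)
for n, (query, got, ok) in enumerate(trace, start=1):
    print(f"Try {n}: {query!r} -> {got} -> {'YES' if ok else 'NO'}")
```

Keeping the original question separate from the search query is a design choice: the rewritten query is often a keyword string rather than a question, so grading and answering against what the user actually asked can be the safer default.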
Use a small, fast model for the grade and rewrite steps and save your stronger model for the final answer. The judgment calls are simple yes/no and rephrasing jobs — paying top-tier prices for them is where the bills get scary.
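One way to wire that up is a small lookup from step to model, so the cheap/strong split lives in one place. The model names below are only examples (gpt-4o-mini is the one used in the code above; gpt-4o stands in for "your stronger model"), and STEP_MODELS and model_for are hypothetical names for this sketch.

```python
# Which model handles which step. Names are examples; swap in whatever you use.
STEP_MODELS = {
    "grade": "gpt-4o-mini",    # simple YES/NO judgment
    "rewrite": "gpt-4o-mini",  # short rephrasing job
    "generate": "gpt-4o",      # the answer the user actually reads
}


def model_for(step: str) -> str:
    """Look up the model for a step, failing loudly on typos."""
    try:
        return STEP_MODELS[step]
    except KeyError:
        raise ValueError(f"unknown step: {step!r}") from None


print(model_for("grade"), model_for("generate"))
```

Each function would then pass model=model_for("grade") (and so on) instead of a hard-coded string, which also makes it easy to compare models per step later.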
A Minimal Agentic RAG in LangGraph
The plain loop is perfect for understanding. In production, that loop grows branches — multiple retrievers, hallucination checks, fallbacks — and hand-managing the state gets messy. This is exactly the point where LangGraph earns its place, because it makes the loop an explicit graph with a decision edge.
```python
from langgraph.graph import StateGraph, END

# State and the four *_node functions are assumed to be defined elsewhere
graph = StateGraph(State)
graph.add_node("retrieve", retrieve_node)
graph.add_node("grade", grade_node)
graph.add_node("rewrite", rewrite_node)
graph.add_node("generate", generate_node)

graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "grade")

# the conditional edge IS the agentic part
graph.add_conditional_edges(
    "grade",
    lambda s: "generate" if s["relevant"] else "rewrite",
    {"generate": "generate", "rewrite": "rewrite"},
)
graph.add_edge("rewrite", "retrieve")  # the rewritten query goes back through retrieval
graph.add_edge("generate", END)

app = graph.compile()
```

Same four moves (retrieve, grade, rewrite, generate), but now the retry is a named edge you can trace, log, and extend. The official LangChain LangGraph guide builds out the full version with document grading and query transforms. My advice: write the plain loop first so you understand what the graph is doing, then move to LangGraph when the branching demands it, not before.
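The snippet above leaves State and the node functions undefined. As a rough sketch of what they might look like: LangGraph nodes are plain functions that take the current state and return a dict of fields to update, so the state can be an ordinary TypedDict. The field names, the tries counter, and route_after_grade are my assumptions here, not something the LangGraph guide prescribes, and the node bodies are stubbed so the example runs without an API key or LangGraph installed.

```python
from typing import List, TypedDict


class State(TypedDict):
    question: str      # the user's question (possibly rewritten)
    chunks: List[str]  # what the last retrieval returned
    relevant: bool     # the grader's verdict on those chunks
    tries: int         # how many retrievals have run so far


def retrieve_node(state: State) -> dict:
    """Stub retriever: a real one would call retrieve(state["question"], store)."""
    return {"chunks": [], "tries": state["tries"] + 1}


def grade_node(state: State) -> dict:
    """Stub grader: a real one would ask a model for YES/NO."""
    return {"relevant": len(state["chunks"]) > 0}


def rewrite_node(state: State) -> dict:
    """Stub rewriter: a real one would ask a model to rephrase."""
    return {"question": state["question"] + " (rephrased)"}


def route_after_grade(state: State, max_tries: int = 3) -> str:
    """Pick the next node, and stop looping once the tries run out."""
    if state["relevant"] or state["tries"] >= max_tries:
        return "generate"
    return "rewrite"


# Walk one pass by hand: retrieve finds nothing, grade says no, router retries.
state: State = {"question": "q", "chunks": [], "relevant": False, "tries": 0}
state.update(retrieve_node(state))
state.update(grade_node(state))
print(route_after_grade(state))
```

With a router like this, the lambda passed to add_conditional_edges could be swapped for route_after_grade, and generate_node would be responsible for returning a fallback message when it arrives with relevant still set to False.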
The Cost — and When Plain RAG Is Still Right
Nothing here is free. Each grade and rewrite is another model call, so an answer that took one call now takes three to five, with latency to match. On a high-traffic endpoint that adds up fast, both in your bill and in response time.
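Where those numbers come from: counting chat calls in the agentic_answer loop shown earlier (and ignoring embedding calls), each try costs one grade call, each failed try adds one rewrite call, and a pass adds one generate call. The helper below is just that arithmetic; chat_calls is a name made up for this sketch.

```python
from typing import Optional


def chat_calls(passed_on_try: Optional[int], max_tries: int = 3) -> int:
    """Chat-completion calls used by the loop as written.

    passed_on_try: 1-based try on which grade() said YES, or None if it never did.
    """
    if passed_on_try is None:
        # every try is graded and then rewritten; nothing is generated
        return 2 * max_tries
    if not 1 <= passed_on_try <= max_tries:
        raise ValueError("passed_on_try must be between 1 and max_tries")
    grades = passed_on_try
    rewrites = passed_on_try - 1
    return grades + rewrites + 1  # +1 for the final generate


for outcome in (1, 2, 3, None):
    print(outcome, chat_calls(outcome))
```

Against plain RAG's single generation call, this loop lands between 2 and 6 chat calls at max_tries=3, which brackets the 3–5× rule of thumb. Your own mix depends on how often the first retrieval passes.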
So I don't reach for the loop by default. I start with plain RAG and add the loop only when I can name the failure it fixes. Concretely:
- Stay static when questions are direct and your documents are clean — most internal Q&A over a tidy knowledge base never needs the loop.
- Go agentic when questions are ambiguous or compound, when retrieval quality is visibly shaky, or when a wrong answer is expensive enough to justify the extra calls.
The bottom line: this is a targeted upgrade, not a default setting. Turning it on everywhere is how you 5× your cost to fix the 20% of queries that actually needed help.
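A quick back-of-the-envelope check makes the point. Assume plain RAG costs 1 unit per answer and the agentic path costs 5 (the top of the range quoted above); the 20% share is the figure from the sentence above, and blended_cost is a hypothetical helper.

```python
def blended_cost(agentic_share: float, agentic_multiplier: float = 5.0) -> float:
    """Average cost per answer, in units of one plain-RAG answer."""
    if not 0.0 <= agentic_share <= 1.0:
        raise ValueError("agentic_share must be between 0 and 1")
    plain_share = 1.0 - agentic_share
    return plain_share * 1.0 + agentic_share * agentic_multiplier


print(f"loop everywhere:      {blended_cost(1.0):.1f}x")
print(f"loop on 20% of calls: {blended_cost(0.2):.1f}x")
```

This assumes you can tell ahead of time which queries need the loop, and that routing step (a heuristic or a small classifier) may carry a cost of its own.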
Quick Recap
- Static RAG retrieves once and answers, even from bad chunks.
- Agentic RAG adds grade + retry so it can correct a weak retrieval.
- The loop is four moves: retrieve, grade, re-query, generate.
- It costs 3–5× the calls — worth it for hard questions, wasteful for easy ones.
- Start plain, add the loop only when you can name the failure it fixes.
Frequently Asked Questions
What is agentic RAG? RAG with an agent loop around it: the model grades whether the retrieved chunks answer the question and re-queries if they don't, instead of answering from a single fixed retrieval.
Is RAG dead? No. Only static, one-shot RAG is being criticised, and even that overstates it. The retrieve-augment-generate core is intact; agentic RAG just adds a correction loop around it.
How is it different from plain RAG? Plain RAG is three fixed steps. Agentic RAG adds a grade step and a retry, trading more model calls for better answers on hard questions.
Does it cost more? Yes — often 3–5× the model calls per answer, plus latency. Keep it for the queries that need it.
Do I need LangGraph? No. A plain Python for loop works; LangGraph helps once the loop grows branches and needs traceable state.
Conclusion
Agentic RAG isn't a replacement for RAG — it's RAG that's allowed to notice when it failed and try again. The mechanism is small (retrieve, grade, re-query, generate), the wins are real on the hard questions, and the cost is real too, which is why "when" matters more than "how."
Have you hit the wall with plain RAG yet — a question it kept answering wrong no matter how you chunked? Tell me in the comments. And if you haven't built the plain version yet, do that first.
Read next: Build a RAG System in Python From Scratch — the retrieve() this loop wraps.
- Need the foundation? What Is RAG? covers the retrieve-augment-generate core in plain language.
- Want the loop mindset? The ReAct agent loop from scratch is the same act-observe-decide pattern.
- Deciding agent vs pipeline? Agent vs workflow is the same "do I need the extra machinery?" question one level up.
