What Is RAG in AI? A Practical Developer's Guide (2026)

Every explainer on what RAG is shows you the same six-box diagram, and none of them tells you what to actually type. So let’s fix that. By the end of this guide you’ll have a working retrieval system running over your own files, and you’ll finally know what each box in that diagram does.

RAG stands for retrieval-augmented generation, and the idea is smaller than the acronym makes it sound: before the model answers, you go find the relevant facts in your own data and paste them into the prompt. That’s it. The clever part is how you find the relevant facts, and the payoff is an answer grounded in your documents instead of the model’s hazy memory.

🟢 Beginner · ⏱️ 9 min read · Stack: Python 3.10+, an embeddings model, any LLM API

✅ Before you start

You can read a Python list and call a function. New to that? The Python for AI agents primer covers the basics.
A loose grasp of embeddings helps; the embeddings and vector search primer is the perfect 12-minute warm-up.
An API key to call an LLM: see the LLM API key setup primer.

🎯 Key takeaways

RAG = retrieve relevant text, paste it into the prompt, then generate — three steps, nothing more.
It fixes the “the model doesn’t know my data” problem without retraining anything.
You can build a real RAG in about 25 lines of framework-free Python.
Reach for RAG when answers must come from your own or changing documents — not for tone, format, or single-fact lookups.

Why Your LLM Doesn’t Know Your Data

An LLM only knows what it saw during training. Ask it about your company’s refund policy, last week’s incident report, or a private PDF, and it will either admit it doesn’t know or — worse — confidently make something up. That second failure mode is the one that gets people burned.

There are two honest ways to fix this. You can retrain the model on your data (expensive, slow, and stale the moment your docs change), or you can hand the model the right facts at question time. RAG is the second option, and it’s why the original Lewis et al. paper framed it as combining a model’s parametric memory with a non-parametric source it can look things up in.

📌 The one-line definition: RAG doesn't make the model smarter; it makes the model better informed. You're not changing the brain, you're handing it the right page before it answers.

The bottom line: if the answer lives in your data and your data changes, you want retrieval, not retraining.

The RAG Mental Model — and a 25-Line RAG You Can Run

Here’s the whole pattern in one picture: a question triggers a retrieve step, the matching chunks get added to the prompt, and the model generates from them.

[Figure: How RAG works. A question triggers a retrieve step that searches your documents in a vector store, the matching chunks are added to the prompt, and the LLM generates a grounded answer.]

The flow is always the same three verbs — retrieve, augment, generate. Retrieve the most relevant chunks of your data, augment the prompt by pasting them in as context, then let the LLM generate an answer from that context. Once you see those three steps, every RAG framework on the planet is just a fancier wrapper around them.

And you don’t need a framework to prove it works. This is a complete RAG over a folder of .txt files — about 25 lines, no LangChain, no vector database:

python

import glob
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from your environment

def embed(text):
    r = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(r.data[0].embedding)

# 1. Load + chunk: one chunk per paragraph
chunks = []
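for path in glob.glob("docs/*.txt"):
    # Close each file after reading and don't rely on the platform's default encoding
    with open(path, encoding="utf-8") as f:
        chunks += f.read().split("\n\n")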

# 2. Embed every chunk once, keep them in memory
store = [(c, embed(c)) for c in chunks if c.strip()]

def answer(question):
    q = embed(question)
    # 3. Retrieve: top 3 chunks by similarity (dot product = cosine here)
    ranked = sorted(store, key=lambda s: np.dot(q, s[1]), reverse=True)
    context = "\n\n".join(c for c, _ in ranked[:3])
    # 4. Augment the prompt, then generate
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content

print(answer("What is our refund policy?"))

Drop a few text files in a docs/ folder, run it, and you have a question-answering bot over your own content. That’s not a toy version of RAG — it’s the real thing, just without the libraries hiding it. The first time I ran this over my own notes, the “magic” of RAG completely evaporated, in the best way.

💡 Why a plain dot product works: OpenAI's embedding vectors are already unit-length, so a dot product is the cosine similarity. With another provider's model, normalise the vectors first or you'll get odd rankings.
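If your embedding model doesn't return unit-length vectors, a small normalisation step restores the dot-product shortcut. Here is a minimal sketch, assuming the vectors are NumPy arrays; the helper names `normalise` and `cosine` are mine for illustration, not part of any library:

```python
import numpy as np

def normalise(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    # Guard against the all-zero vector, which has no direction to normalise
    return v if norm == 0 else v / norm

def cosine(a, b):
    """Cosine similarity that works whether or not the inputs are unit-length."""
    return float(np.dot(normalise(a), normalise(b)))

# Two vectors pointing the same way but with different lengths
a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])

print(np.dot(a, b))   # the raw dot product grows with vector length
print(cosine(a, b))   # cosine ignores length and only compares direction
```

In the script above, one way to apply this would be to wrap the return value of `embed()` in `normalise(...)` so every stored vector is unit-length before you compare anything.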

The Four Moving Parts

That little script has exactly four pieces, and every RAG system — from this one to a production stack — is built from the same four. Knowing them is how you debug bad answers later.

Chunking — splitting documents into bite-sized pieces. I split on blank lines (paragraphs) for simplicity, but chunk size is the single biggest lever on answer quality. Chunks too big and you bury the answer in noise; too small and you lose context.
Embeddings — turning each chunk into a vector that captures its meaning, usually with a hosted model like OpenAI’s embeddings API. The embeddings primer goes deeper, but the one idea you need is: similar meaning, nearby vectors.
Vector search — finding the chunks closest to the question’s vector. Here it’s a sorted() over a list; at scale it’s a vector database doing the same comparison far faster.
The augmented prompt — pasting the retrieved chunks into the prompt with an instruction to answer from that context. This is the line that turns retrieval into a grounded answer.
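Because chunk size matters so much, it helps to make it a parameter you can tune rather than whatever your paragraphs happen to be. The sketch below is one simple alternative to splitting on blank lines: fixed-size windows with a little overlap so a sentence cut at a boundary still appears whole in one chunk. The function name `chunk_text` and the defaults of 800 and 100 characters are illustrative assumptions, not recommendations:

```python
def chunk_text(text, max_chars=800, overlap=100):
    """Split text into windows of at most max_chars characters.

    Consecutive windows share `overlap` characters so content that
    straddles a boundary is not lost. Both defaults are placeholders.
    """
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    step = max_chars - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + max_chars].strip()
        if piece:
            chunks.append(piece)
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
    return chunks

# Tiny values make the overlap easy to see
print(chunk_text("abcdefghij", max_chars=4, overlap=1))
```

Swapping this in for the `split("\n\n")` call lets you rerun the same questions at a few chunk sizes and see which one retrieves the right passage most often.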

The bottom line: when a RAG answer is wrong, it’s almost always retrieval — chunking or search — not the model. I’ve spent hours blaming the LLM only to find the right chunk never made it into the prompt.
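The quickest way to check whether retrieval is at fault is to look at what it returned before the model ever sees it. This is a rough debugging sketch, assuming the same `(chunk, vector)` store layout as the script above; `top_matches` is a hypothetical helper, and the toy 2-D vectors stand in for real embeddings:

```python
import numpy as np

def top_matches(store, query_vec, k=3):
    """Return the k best (score, chunk) pairs, highest score first."""
    scored = [(float(np.dot(query_vec, vec)), chunk) for chunk, vec in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Toy 2-D vectors stand in for real embeddings (an assumption for this demo)
store = [
    ("Refunds are issued within 14 days.", np.array([1.0, 0.0])),
    ("Our office is closed on Sundays.", np.array([0.0, 1.0])),
    ("Refund requests need an order number.", np.array([0.8, 0.6])),
]
query_vec = np.array([1.0, 0.0])  # pretend this came from embed(question)

# Print score and chunk side by side so a missing passage is obvious
for score, chunk in top_matches(store, query_vec, k=2):
    print(f"{score:.2f}  {chunk}")
```

If the chunk that holds the answer isn't in that printout, no prompt wording will rescue the response; go back to chunking or search.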

RAG vs Fine-Tuning vs Long Context

The question I get most is “why not just fine-tune the model, or paste everything into a giant context window?” All three are valid; they solve different problems.

Approach	Best for	How you update it	Watch out for
RAG	Answering from private or changing data	Edit the documents	Retrieval quality is on you
Fine-tuning	Teaching tone, format, or a narrow skill	Retrain the model	Slow, costly, doesn’t add fresh facts
Long context	One-off Q&A over a small set of docs	Re-paste the docs	Cost and latency grow with every token

My default for question-answering over a company’s own knowledge is RAG, every time — because updating knowledge should mean editing a file, not running a training job. I reach for fine-tuning only when I need the model to behave differently, and for long context when the entire corpus comfortably fits in one prompt and I won’t be asking often.

🔑 They stack: These aren't mutually exclusive. A fine-tuned model that answers in your house style, fed retrieved context through RAG, inside a long window, is a common production combination, not a contradiction.
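To judge whether a corpus "comfortably fits" in one prompt, a back-of-the-envelope estimate is usually enough. The sketch below assumes roughly four characters per token, a common rule of thumb for English text that real tokenisers will not match exactly. The function names, the `budget_fraction` parameter, and the window sizes are all illustrative assumptions:

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; real tokenisers vary by model and language."""
    return len(text) // chars_per_token + 1

def fits_in_context(docs, context_tokens, budget_fraction=0.5):
    """True if all docs together use at most budget_fraction of the window.

    Leaving part of the window free keeps room for the question and the answer.
    """
    total = sum(estimate_tokens(d) for d in docs)
    return total <= context_tokens * budget_fraction

# Stand-ins for two small documents of 4,000 and 8,000 characters
docs = ["a" * 4000, "b" * 8000]

print(fits_in_context(docs, context_tokens=8000))  # about 3,000 tokens against a 4,000-token budget
print(fits_in_context(docs, context_tokens=4000))  # the same docs against a 2,000-token budget
```

If the check fails, or you expect to ask many questions, that is a hint retrieval will pay for itself.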

When RAG Is the Wrong Tool

RAG is a hammer, and not everything is a nail. I’d skip it when:

The lookup is exact, not fuzzy. “What’s order #4471’s status?” is a database query. Semantic search adds cost and a chance of returning the wrong row.
The whole corpus fits in the context window. If your data is five pages, just paste it. Building a retrieval pipeline for that is over-engineering.
You need new behaviour, not new facts. Wanting the model to always reply in legal-brief format is a fine-tuning job, not a retrieval one.
Freshness must be exact and live. For up-to-the-second prices or inventory, call the real API. RAG retrieves what you indexed, which is only as fresh as your last embed run.

Knowing the limits isn’t a weakness of RAG; it’s how you avoid reaching for it on the problems where something simpler wins.
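In practice this often turns into a small router in front of the RAG function: exact lookups go straight to the system of record, and only fuzzy questions go through retrieval. Here is a minimal sketch of that idea. The `ORDERS` dict is a hypothetical stand-in for a real database, and `fake_rag` is a placeholder for the `answer()` function from the script earlier:

```python
import re

# Hypothetical stand-in for a real orders database
ORDERS = {"4471": "shipped", "4472": "processing"}

# Matches phrases like "order #4471" or "order 4471"
ORDER_PATTERN = re.compile(r"order\s*#?\s*(\d+)", re.IGNORECASE)

def route(question, rag_answer):
    """Send exact order lookups to the table and everything else to RAG."""
    match = ORDER_PATTERN.search(question)
    if match:
        order_id = match.group(1)
        status = ORDERS.get(order_id)
        if status is None:
            return f"Order #{order_id} not found"
        return f"Order #{order_id}: {status}"
    return rag_answer(question)

def fake_rag(question):
    # Placeholder for the real answer() function (assumption for this demo)
    return f"[RAG answer for: {question}]"

print(route("What's order #4471's status?", fake_rag))
print(route("What is our refund policy?", fake_rag))
```

A regex is the bluntest possible router, but it makes the point: the order question never touches embeddings, so it can't return the wrong row.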

Quick Recap

The whole guide in five lines:

RAG = retrieve, augment, generate — find relevant text, paste it in, answer from it.
It solves “the model doesn’t know my data” without retraining.
Four parts: chunking, embeddings, vector search, the augmented prompt.
Bad answers are usually retrieval, not the model.
Use RAG for private or changing data; use fine-tuning for behaviour and long context for tiny, rare lookups.

Frequently Asked Questions

What is RAG in simple terms? It’s fetching relevant text from your own data, pasting it into the prompt, and letting the LLM answer from it — so the answer is grounded in your documents, not the model’s memory.

How does RAG work? Three steps: retrieve the most similar chunks by meaning, augment the prompt with them, and generate an answer from that context.

Is RAG better than fine-tuning? For answering over private or changing data, usually yes — it’s cheaper and you update knowledge by editing files. Fine-tuning is for changing behaviour, not adding facts.

Do I need a vector database? Not to start. In-memory NumPy works for a few hundred chunks; a vector store earns its place at thousands-plus or when you need persistence.
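For the in-memory route, one common approach is to stack every chunk's vector into a single NumPy matrix so one matrix-vector product scores the whole corpus. A minimal sketch, assuming unit-length vectors; `top_k` is an illustrative helper name and the 2-D rows are toy stand-ins for real embeddings:

```python
import numpy as np

def top_k(matrix, query_vec, k=3):
    """Indices of the k rows with the highest dot product, best first."""
    scores = matrix @ query_vec          # one product scores every chunk at once
    k = min(k, len(scores))              # don't ask for more rows than exist
    return np.argsort(scores)[::-1][:k]  # sort descending, keep the first k

# Each row is one chunk's (toy) embedding
matrix = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.8, 0.6],
])
query_vec = np.array([1.0, 0.0])

print(top_k(matrix, query_vec, k=2))  # row indices of the two closest chunks
```

If persistence is the only thing you're missing, `np.save` and `np.load` can write that matrix to disk and read it back between runs, which may postpone the need for a database a little longer.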

Conclusion

RAG is far less mysterious than the diagrams suggest: retrieve the right facts, paste them into the prompt, and let the model generate from them. Once you’ve run the 25-line version over your own files, the production stacks stop looking like magic and start looking like the same four parts with better tooling — faster search, smarter chunking, and a real vector database.

What’s the first set of documents you’d point a RAG system at — your team’s wiki, a folder of PDFs, your own notes? Tell me in the comments.

🧭 Where to go from here

Shaky on the retrieval half? The embeddings and vector search primer explains how “find by meaning” actually works.
See it inside an agent: the agent memory tutorial uses the same retrieve-by-meaning move to give an agent long-term memory.
Newer to all this? Start with What are AI agents? for the bigger picture before wiring RAG into one.
