Every explainer on what RAG is shows you the same six-box diagram, and none of them tells you what to actually type. So let’s fix that. By the end of this guide you’ll have a working retrieval system running over your own files, and you’ll finally know what each box in that diagram does.
RAG stands for retrieval-augmented generation, and the idea is smaller than the acronym makes it sound: before the model answers, you go find the relevant facts in your own data and paste them into the prompt. That’s it. The clever part is how you find the relevant facts, and the payoff is an answer grounded in your documents instead of the model’s hazy memory.
Before You Start
- You can read a Python list and call a function. New to that? The Python for AI agents primer covers the basics.
- A loose grasp of embeddings helps; the embeddings and vector search primer is the perfect 12-minute warm-up.
- An API key to call an LLM. See the LLM API key setup primer.
Key Takeaways
- RAG = retrieve relevant text, paste it into the prompt, then generate. Three steps, nothing more.
- It fixes the “the model doesn’t know my data” problem without retraining anything.
- You can build a real RAG in about 25 lines of framework-free Python.
- Reach for RAG when answers must come from your own or changing documents, not for tone, format, or single-fact lookups.
Why Your LLM Doesn’t Know Your Data
An LLM only knows what it saw during training. Ask it about your company’s refund policy, last week’s incident report, or a private PDF, and it will either admit it doesn’t know or — worse — confidently make something up. That second failure mode is the one that gets people burned.
There are two honest ways to fix this. You can retrain the model on your data (expensive, slow, and stale the moment your docs change), or you can hand the model the right facts at question time. RAG is the second option, and it’s why the original Lewis et al. paper framed it as combining a model’s parametric memory with a non-parametric source it can look things up in.
The bottom line: if the answer lives in your data and your data changes, you want retrieval, not retraining.
The RAG Mental Model — and a 25-Line RAG You Can Run
Here’s the whole pattern in one sentence: a question triggers a retrieve step, the matching chunks get added to the prompt, and the model generates from them.
The flow is always the same three verbs — retrieve, augment, generate. Retrieve the most relevant chunks of your data, augment the prompt by pasting them in as context, then let the LLM generate an answer from that context. Once you see those three steps, every RAG framework on the planet is just a fancier wrapper around them.
And you don’t need a framework to prove it works. This is a complete RAG over a folder of .txt files — about 25 lines, no LangChain, no vector database:
```python
import glob
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from your environment

def embed(text):
    r = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(r.data[0].embedding)

# 1. Load + chunk: one chunk per paragraph
chunks = []
for path in glob.glob("docs/*.txt"):
    chunks += open(path).read().split("\n\n")

# 2. Embed every chunk once, keep them in memory
store = [(c, embed(c)) for c in chunks if c.strip()]

def answer(question):
    q = embed(question)
    # 3. Retrieve: top 3 chunks by similarity (dot product = cosine here)
    ranked = sorted(store, key=lambda s: np.dot(q, s[1]), reverse=True)
    context = "\n\n".join(c for c, _ in ranked[:3])
    # 4. Augment the prompt, then generate
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content

print(answer("What is our refund policy?"))
```
Drop a few text files in a docs/ folder, run it, and you have a question-answering bot over your own content. That’s not a toy version of RAG — it’s the real thing, just without the libraries hiding it. The first time I ran this over my own notes, the “magic” of RAG completely evaporated, in the best way.
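One detail in step 3 is worth spelling out: a dot product only equals cosine similarity when both vectors have length 1. OpenAI documents its embeddings as normalized to unit length, but if you swap in a different embedding model that may not hold. Here is a minimal sketch of a similarity function that does not rely on it. The helper name `cosine` and the toy vectors are my own illustration, not part of the script above.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity that does not assume unit-length inputs."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings (made-up numbers, each of length 5).
u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

raw_dot = float(np.dot(u, v))               # dot product of the unnormalized vectors
unit_dot = float(np.dot(u / 5.0, v / 5.0))  # dot product after scaling both to length 1

print(raw_dot, unit_dot, cosine(u, v))
```

The raw dot product and the cosine disagree until both vectors are scaled to length 1, which is the whole reason the shortcut in the script works. Note that this sketch would divide by zero on an all-zero vector, so guard against that if your model can ever return one.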
The Four Moving Parts
That little script has exactly four pieces, and every RAG system — from this one to a production stack — is built from the same four. Knowing them is how you debug bad answers later.
- Chunking: splitting documents into bite-sized pieces. I split on blank lines (paragraphs) for simplicity, but chunk size is the single biggest lever on answer quality. Chunks too big and you bury the answer in noise; too small and you lose context.
- Embeddings: turning each chunk into a vector that captures its meaning, usually with a hosted model like OpenAI’s embeddings API. The embeddings primer goes deeper, but the one idea you need is: similar meaning, nearby vectors.
- Vector search: finding the chunks closest to the question’s vector. Here it’s a sorted() over a list; at scale it’s a vector database doing the same comparison far faster.
- The augmented prompt: pasting the retrieved chunks into the prompt with an instruction to answer from that context. This is the line that turns retrieval into a grounded answer.
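Since chunk size is the lever you will end up tuning, here is one common alternative to splitting on blank lines: fixed-size windows of words with a little overlap, so a sentence cut at a boundary still appears whole in one chunk. This is a sketch, not the script’s method. The function name `chunk_words` and the defaults of 200 words and 40 words of overlap are arbitrary assumptions you would adjust for your own documents.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into windows of `size` words; each window repeats the
    last `overlap` words of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Tiny demonstration with ten placeholder words and a small window.
sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, size=4, overlap=1)
for piece in pieces:
    print(piece)
```

To try it in the script, you could replace the split on blank lines with a call to this function for each file and compare the answers you get at a few different sizes.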
The bottom line: when a RAG answer is wrong, it’s almost always retrieval — chunking or search — not the model. I’ve spent hours blaming the LLM only to find the right chunk never made it into the prompt.
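A cheap way to check retrieval before blaming the model is to print what was actually retrieved, with scores, and read it. The sketch below assumes the same (chunk, vector) pairs as the store in the script; `top_chunks` is a hypothetical helper, and the 2-D vectors are made-up stand-ins so it runs without an API key.

```python
import numpy as np

def top_chunks(query_vec, store, k=3):
    """Return the k best (score, chunk) pairs, highest score first."""
    scored = [(float(np.dot(query_vec, vec)), chunk) for chunk, vec in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Toy vectors stand in for real embeddings (made-up numbers).
toy_store = [
    ("Refunds are issued within 14 days.", np.array([1.0, 0.0])),
    ("Our office is closed on Sundays.", np.array([0.0, 1.0])),
    ("Refund requests need a receipt.", np.array([0.8, 0.6])),
]
toy_query = np.array([1.0, 0.0])  # pretend this came from embed(question)

hits = top_chunks(toy_query, toy_store, k=2)
for score, chunk in hits:
    print(f"{score:.2f}  {chunk}")
```

If the chunk that contains the answer is not in that printout, no prompt wording will rescue the response; the fix is in chunking or search.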
RAG vs Fine-Tuning vs Long Context
The question I get most is “why not just fine-tune the model, or paste everything into a giant context window?” All three are valid; they solve different problems.
| Approach | Best for | How you update it | Watch out for |
|---|---|---|---|
| RAG | Answering from private or changing data | Edit the documents | Retrieval quality is on you |
| Fine-tuning | Teaching tone, format, or a narrow skill | Retrain the model | Slow, costly, doesn’t add fresh facts |
| Long context | One-off Q&A over a small set of docs | Re-paste the docs | Cost and latency grow with every token |
My default for question-answering over a company’s own knowledge is RAG, every time — because updating knowledge should mean editing a file, not running a training job. I reach for fine-tuning only when I need the model to behave differently, and for long context when the entire corpus comfortably fits in one prompt and I won’t be asking often.
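“Comfortably fits in one prompt” can be checked with back-of-the-envelope arithmetic. The sketch below uses the rough rule of thumb of about 4 characters per token for English text, and an assumed window of 128,000 tokens with 4,000 held back for the question and the answer. All three numbers and both function names are assumptions for illustration; check your model’s actual limit and use its real tokenizer when the result is close.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; real tokenizers will differ."""
    return len(text) // chars_per_token + 1

def fits_in_one_prompt(docs, context_tokens=128_000, reserve=4_000):
    """True if all docs, plus a reserve for question and answer,
    fit inside the assumed context window."""
    total = sum(estimate_tokens(d) for d in docs)
    return total + reserve <= context_tokens

small_corpus = ["refund policy " * 100]        # 1,400 characters
large_corpus = ["incident report " * 50_000]   # 800,000 characters

print(fits_in_one_prompt(small_corpus))
print(fits_in_one_prompt(large_corpus))
```

Even when the check passes, the cost column in the table still applies: every question pays for every token you pasted.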
When RAG Is the Wrong Tool
RAG is a hammer, and not everything is a nail. I’d skip it when:
- The lookup is exact, not fuzzy. “What’s order #4471’s status?” is a database query. Semantic search adds cost and a chance of returning the wrong row.
- The whole corpus fits in the context window. If your data is five pages, just paste it. Building a retrieval pipeline for that is over-engineering.
- You need new behaviour, not new facts. Wanting the model to always reply in legal-brief format is a fine-tuning job, not a retrieval one.
- Freshness must be exact and live. For up-to-the-second prices or inventory, call the real API. RAG retrieves what you indexed, which is only as fresh as your last embed run.
Showing the limits isn’t a weakness of RAG — it’s how you avoid reaching for it on the 30% of problems where something simpler wins.
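For the exact-lookup case, the simpler thing looks like this: a plain query with no embeddings involved. This is a sketch using Python’s built-in sqlite3 with an in-memory database; the table, the order IDs, and the statuses are invented for the example.

```python
import sqlite3

# In-memory database with made-up orders, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (id, status) VALUES (?, ?)",
    [(4470, "delivered"), (4471, "shipped"), (4472, "processing")],
)

def order_status(order_id):
    """Exact-match lookup: the status, or None if the order does not exist."""
    row = conn.execute(
        "SELECT status FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return row[0] if row else None

print(order_status(4471))
print(order_status(9999))
```

An exact match either finds the row or returns nothing. A similarity search always returns its nearest neighbours, which is how you end up with the wrong row.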
Quick Recap
The whole guide in five lines:
- RAG = retrieve, augment, generate — find relevant text, paste it in, answer from it.
- It solves “the model doesn’t know my data” without retraining.
- Four parts: chunking, embeddings, vector search, the augmented prompt.
- Bad answers are usually retrieval, not the model.
- Use RAG for private or changing data; use fine-tuning for behaviour and long context for tiny, rare lookups.
Frequently Asked Questions
What is RAG in simple terms? It’s fetching relevant text from your own data, pasting it into the prompt, and letting the LLM answer from it — so the answer is grounded in your documents, not the model’s memory.
How does RAG work? Three steps: retrieve the most similar chunks by meaning, augment the prompt with them, and generate an answer from that context.
Is RAG better than fine-tuning? For answering over private or changing data, usually yes — it’s cheaper and you update knowledge by editing files. Fine-tuning is for changing behaviour, not adding facts.
Do I need a vector database? Not to start. In-memory NumPy works for a few hundred chunks; a vector store earns its place at thousands-plus or when you need persistence.
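As a rough illustration of how far plain NumPy goes, the sketch below scores every chunk in one matrix multiplication and saves the vectors to disk so they are not re-embedded on every run. The `search` helper, the file name, and the toy vectors are assumptions for the example; with real embeddings each row would be one chunk’s vector.

```python
import os
import tempfile
import numpy as np

def search(query_vec, matrix, k=3):
    """Indices of the k rows of `matrix` with the highest dot product, best first."""
    scores = matrix @ query_vec
    return np.argsort(scores)[::-1][:k]

# Toy embeddings, one row per chunk (made-up numbers).
matrix = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query = np.array([0.0, 1.0])

# Persist the vectors, then load them back as a later run would.
path = os.path.join(tempfile.mkdtemp(), "vectors.npy")
np.save(path, matrix)
reloaded = np.load(path)

best = search(query, reloaded, k=2)
print(best.tolist())
```

You would keep the chunk texts in a parallel list, saved alongside the vectors, and use the returned indices to look them up.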
Conclusion
RAG is far less mysterious than the diagrams suggest: retrieve the right facts, paste them into the prompt, and let the model generate from them. Once you’ve run the 25-line version over your own files, the production stacks stop looking like magic and start looking like the same four parts with better tooling — faster search, smarter chunking, and a real vector database.
What’s the first set of documents you’d point a RAG system at — your team’s wiki, a folder of PDFs, your own notes? Tell me in the comments.
- Shaky on the retrieval half? The embeddings and vector search primer explains how “find by meaning” actually works.
- See it inside an agent: the agent memory tutorial uses the same retrieve-by-meaning move to give an agent long-term memory.
- Newer to all this? Start with What are AI agents? for the bigger picture before wiring RAG into one.
