InfoWok
RAG in Python: Zero to Production · Part 01 · Intermediate

Build a RAG System in Python From Scratch (Part 1)

A framework-free RAG system in Python — four functions that chunk, embed, store, and retrieve over your own documents, plus a runnable CLI you can point at any folder of text.

Sukhveer Kaur
Published June 30, 2026
8 min read
[Banner image: "Build a RAG System in Python From Scratch, Part 1", with the subtitle "chunk, embed, retrieve, answer: no framework"]

By the end of this post you'll have a command-line tool that answers questions about your own documents — built from four functions you wrote yourself. No LangChain, no LlamaIndex, no vector database. You'll build a RAG system in Python from scratch, so you understand every line before any library hides it from you.

If you've read what RAG actually is, you already know the shape: retrieve relevant text, paste it into the prompt, then generate. This post turns that mental model into running code. We'll write the four functions every RAG system is made of, wire them into a CLI, and point it at a folder of .txt files.

Level: Intermediate · Stack: Python 3.10+, openai, numpy, one API key
🎯 Key takeaways
  • A RAG system is four functions: chunk, embed, retrieve, and answer — nothing more.
  • Your "database" is a Python list of (text, vector) pairs, searched with NumPy.
  • Cosine similarity is the whole retrieval step — find the chunks whose meaning is closest to the question.
  • You'll finish with a runnable CLI that answers questions over any folder of text files.

How to Build a RAG System in Four Functions

Before any code, here's the whole design on one screen. A RAG system does its work in two phases: it indexes your documents once, then answers questions against that index as many times as you like.

[Figure: Architecture of a from-scratch RAG system in Python. Documents are chunked, embedded, and stored once; then each question is embedded, matched to the top chunks by cosine similarity, and answered by an LLM.]

Read the diagram top to bottom. The index lane runs once: load your files, split them into chunks, turn each chunk into a vector, and keep those vectors in a store. The ask lane runs on every question: embed the question, search the store for the closest chunks, paste them into a prompt, and let the model answer.

That maps to exactly four functions:

  • load_and_chunk() — read the files and split them into bite-sized pieces.
  • embed() — turn any piece of text into a vector that captures its meaning.
  • retrieve() — find the chunks closest to the question.
  • answer() — build the prompt and call the model.

The bottom line: everything you need to build a RAG system fits in those four functions — the frameworks just wrap them in more code.
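To see the two phases without an API key, here is a minimal offline sketch. It swaps the real embedding model for a toy word-count vector (`toy_embed` and `VOCAB` are hypothetical stand-ins, not part of this tutorial's code) and stops at retrieval, so it only illustrates the shape: index once, then query as often as you like.

```python
import math
from collections import Counter

# Hypothetical stand-in for a real embedding model: counts of a few known words.
VOCAB = ["refund", "shipping", "days", "password", "reset"]

def toy_embed(text):
    words = text.lower().replace(".", " ").replace("?", " ").split()
    counts = Counter(words)
    return [counts[word] for word in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Index phase: runs once.
chunks = [
    "Refund requests are accepted within 30 days.",
    "Shipping takes five business days.",
    "Use the reset link to change your password.",
]
store = [(chunk, toy_embed(chunk)) for chunk in chunks]

# Ask phase: runs on every question.
def retrieve(question, store, k=1):
    q = toy_embed(question)
    ranked = sorted(store, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

print(retrieve("How do I reset my password?", store))
# ['Use the reset link to change your password.']
```

The real build keeps this structure and replaces `toy_embed` with a call to an embedding model, which is what lets it match on meaning rather than on shared words.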

Step 1 — Load, Chunk, and Embed Your Documents

This is the indexing phase, and it's where a question-answering system earns its knowledge. We'll do the three index-lane steps in order, exactly as the flow below shows.

[Figure: Five-step flow to build a RAG system in Python from scratch. Chunk the docs, embed each chunk, store the vectors, retrieve the top-k, then augment the prompt and generate.]

First, the setup. Create a docs/ folder, drop a few .txt files in it, and install the two libraries you need:

bash
pip install openai numpy

Now the code. An embedding is a vector that encodes meaning, and the embed() function is a thin wrapper around OpenAI's embedding model:

python
import glob
import numpy as np
from openai import OpenAI
 
client = OpenAI()  # reads OPENAI_API_KEY from your environment
 
def embed(text):
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)
 
def load_and_chunk(folder):
    chunks = []
    for path in sorted(glob.glob(f"{folder}/*.txt")):  # sorted: stable chunk order
        with open(path, encoding="utf-8") as f:  # closes the file when done
            text = f.read()
        # one chunk per paragraph (split on blank lines)
        chunks += [c.strip() for c in text.split("\n\n") if c.strip()]
    return chunks
 
# Build the store: a list of (chunk_text, vector) pairs
chunks = load_and_chunk("docs")
store = [(chunk, embed(chunk)) for chunk in chunks]
print(f"Indexed {len(store)} chunks")

Three things just happened. load_and_chunk() read every .txt file and split it on blank lines, so each chunk is roughly a paragraph. embed() turned each chunk into a 1,536-number vector. And the store — the thing everyone calls a "vector database" — is just a Python list of (text, vector) pairs held in memory. That's the whole index.
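The list comprehension above makes one API call per chunk. The embeddings endpoint also accepts a list of strings, so a batched variant can cut the number of round trips. This is a sketch, not the tutorial's method: `embed_batch` is a hypothetical helper, the `client` is passed in explicitly, and the `batch_size` of 100 is an arbitrary assumption rather than a documented limit.

```python
import numpy as np

def embed_batch(texts, client, model="text-embedding-3-small", batch_size=100):
    """Embed many texts with one API call per batch instead of one per chunk."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        resp = client.embeddings.create(model=model, input=batch)
        # Sort by index so each vector lines up with its input text.
        for item in sorted(resp.data, key=lambda d: d.index):
            vectors.append(np.array(item.embedding))
    return vectors
```

With the `client` and `chunks` from the code above, `store = list(zip(chunks, embed_batch(chunks, client)))` would build the same list of (text, vector) pairs.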

⚠️ Two errors you'll hit in the first minute

UnicodeDecodeError means a file isn't UTF-8; pass errors="ignore" to open() while learning. An empty store means glob found no files: the relative path "docs" is resolved against the directory you run the script from, so launch python from the folder that contains docs/.
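Both errors can be made louder and easier to diagnose. The sketch below is one possible defensive loader, not part of the four functions: `load_texts` is a hypothetical helper, and it uses errors="replace" (which marks bad bytes with U+FFFD) instead of errors="ignore" so you can see where decoding went wrong.

```python
from pathlib import Path

def load_texts(folder):
    """Return (path, text) pairs for every .txt file, failing loudly if none exist."""
    root = Path(folder).resolve()
    paths = sorted(root.glob("*.txt"))
    if not paths:
        # The absolute path in the message shows where Python actually looked.
        raise FileNotFoundError(f"No .txt files found in {root}")
    texts = []
    for path in paths:
        # errors="replace" keeps going on non-UTF-8 bytes and marks them with U+FFFD.
        texts.append((path, path.read_text(encoding="utf-8", errors="replace")))
    return texts
```

If you would rather not depend on the working directory at all, passing `Path(__file__).parent / "docs"` as the folder resolves it relative to the script file instead.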

The bottom line: indexing is just "chunk, embed, and keep the pairs in a list" — no special storage required to start.

Step 2 — Retrieve the Right Chunks by Meaning

Retrieval is the heart of RAG, and it's where most bad answers are born. The job: given a question, find the chunks whose meaning is closest to it. That comparison is semantic search, and the maths behind it is one line.

python
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
 
def retrieve(question, store, k=3):
    q = embed(question)
    ranked = sorted(store, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

Cosine similarity measures the angle between two vectors. Two chunks about the same topic point in nearly the same direction, so their score is near 1; unrelated text scores near 0. retrieve() embeds the question, scores it against every stored chunk, sorts, and returns the top k. That sort is your entire search engine — at scale a vector database does the same comparison far faster, but the logic is identical.
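A tiny worked example makes the scores concrete. The three-dimensional vectors here are made up for illustration; real embeddings have far more dimensions, but the arithmetic is the same.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

question = np.array([1.0, 1.0, 0.0])
same_topic = np.array([2.0, 2.0, 0.0])  # same direction, longer vector
partial = np.array([1.0, 0.0, 0.0])     # shares one axis with the question
unrelated = np.array([0.0, 0.0, 3.0])   # orthogonal to the question

for name, vec in [("same_topic", same_topic), ("partial", partial), ("unrelated", unrelated)]:
    print(f"{name}: {cosine(question, vec):.3f}")
# same_topic: 1.000
# partial: 0.707
# unrelated: 0.000
```

Note that `same_topic` scores 1 even though it is twice as long as the question vector: cosine similarity compares direction only, not magnitude.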

💡 Why divide by the norms

OpenAI's vectors are already unit-length, so a bare dot product would work here. Dividing by np.linalg.norm makes the function correct for any provider's embeddings, where vectors may not be normalised. It's one extra operation that saves you a confusing bug later.
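One way to get both the safety and the speed is to normalise every stored vector once, then score all chunks with a single matrix-vector product. This is a sketch of that idea under stated assumptions: `build_matrix` and `retrieve_fast` are hypothetical names, and the function takes an already-embedded question vector so it can be run without an API call.

```python
import numpy as np

def build_matrix(store):
    """Stack the stored vectors into one (n_chunks, dim) array of unit vectors."""
    matrix = np.vstack([vec for _, vec in store]).astype(float)
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

def retrieve_fast(q_vec, store, matrix, k=3):
    """Score every chunk with one matrix-vector product, then take the top k."""
    q = q_vec / np.linalg.norm(q_vec)
    scores = matrix @ q                 # cosine similarity for every chunk at once
    top = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return [store[i][0] for i in top]

# Toy vectors with different lengths to show that normalisation handles them.
store = [
    ("cats", np.array([1.0, 0.0])),
    ("dogs", np.array([0.0, 2.0])),
    ("pets", np.array([3.0, 3.0])),
]
matrix = build_matrix(store)
print(retrieve_fast(np.array([1.0, 0.2]), store, matrix, k=2))
```

In the real pipeline you would call it as `retrieve_fast(embed(question), store, matrix)`. The ranking is the same as the sorted version; only the bookkeeping differs.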

I use k=3 as a sane default: enough context to answer, few enough to keep the prompt tight. When I first built this, bumping k to 10 made answers worse, not better — the extra chunks buried the relevant one in noise. The bottom line: retrieval quality, not the model, decides whether RAG answers correctly.

Step 3 — Augment the Prompt and Generate the Answer

Now we connect retrieval to generation. The answer() function pastes the retrieved chunks into the prompt as context, then asks the model to answer from that context only — the instruction that turns a search result into a grounded reply.

python
def answer(question, store):
    context = "\n\n".join(retrieve(question, store))
    prompt = (
        "Answer the question using only the context below. "
        "If the context doesn't contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

The "say you don't know" line matters more than it looks. Without it, the model fills gaps with confident invention — the exact failure RAG exists to prevent. With it, an empty or off-topic retrieval produces an honest "I don't know" instead of a made-up fact, which is what you want in front of real users.
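Because the prompt does so much of the work, it helps to be able to read it before it is sent. A minimal sketch: the same prompt text factored into a hypothetical `build_prompt` helper that needs no API call, with an explicit placeholder (my assumption, not the author's wording) for the case where retrieval returns nothing.

```python
def build_prompt(question, chunks):
    """Assemble the grounded prompt from a question and the retrieved chunks."""
    context = "\n\n".join(chunks) if chunks else "(no context retrieved)"
    return (
        "Answer the question using only the context below. "
        "If the context doesn't contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_prompt("What is the refund window?", ["Refunds are accepted within 30 days."]))
```

Printing the prompt for a question that gave a bad answer shows at a glance whether the right paragraph ever reached the model.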

A note on models: text-embedding-3-small and gpt-4o-mini are cheap and fast, ideal for the many calls you'll make while learning. Model names move every few months, so check your provider's model list and swap in the current small model — the code around the name doesn't change.
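Since the names are the part most likely to change, one option is to read them from the environment with the current names as defaults. The variable names `RAG_EMBED_MODEL` and `RAG_CHAT_MODEL` and the helper `model_names` are hypothetical; pick whatever fits your project.

```python
import os

def model_names(env=os.environ):
    """Return (embedding_model, chat_model), falling back to the tutorial's defaults."""
    embed_model = env.get("RAG_EMBED_MODEL", "text-embedding-3-small")
    chat_model = env.get("RAG_CHAT_MODEL", "gpt-4o-mini")
    return embed_model, chat_model

EMBED_MODEL, CHAT_MODEL = model_names()
print(f"embedding with {EMBED_MODEL}, answering with {CHAT_MODEL}")
```

You would then pass `EMBED_MODEL` and `CHAT_MODEL` as the `model=` arguments in embed() and answer(). Keep in mind that vectors from different embedding models are not comparable, so changing the embedding model means re-indexing.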

Step 4 — Wrap It in a CLI Over Your Own Docs

Four functions are a library; a loop makes it a tool. This block indexes the folder once, then answers questions until you quit — the payoff you can actually use.

python
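if __name__ == "__main__":
    store = [(chunk, embed(chunk)) for chunk in load_and_chunk("docs")]
    print(f"Indexed {len(store)} chunks. Ask a question (Ctrl-C to quit).")
    try:
        while True:
            question = input("\n> ").strip()
            if question:  # skip empty input instead of spending an API call
                print(answer(question, store))
    except (KeyboardInterrupt, EOFError):
        print("\nBye.")  # exit cleanly instead of printing a traceback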

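Save the four functions and this block as rag.py, leaving out the three store-building lines at the end of Step 1 (the block above builds the store itself, so keeping both would embed every chunk twice). Put a couple of text files in docs/ and run python rag.py. Ask something only your documents would know (a policy, a name, a number from your notes) and you'll watch retrieval pull the right paragraph and the model answer from it.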

🔑 Test it the honest way

Ask a question your docs can't answer. A well-built RAG system should reply "I don't know," not invent something. If it confidently makes a fact up, your retrieval returned junk — start debugging at retrieve(), not the model.
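To debug at retrieve(), it helps to see the scores and not just the winning chunks. Here is a small sketch of such a helper: `explain_retrieval` is a hypothetical name, it takes an already-embedded question vector, and the two-dimensional vectors in the demo are made up so the block runs offline.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def explain_retrieval(q_vec, store, k=3, preview=60):
    """Return the top-k (score, chunk) pairs and print them for inspection."""
    scored = [(float(cosine(q_vec, vec)), chunk) for chunk, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    for score, chunk in top:
        print(f"{score:.3f}  {chunk[:preview]!r}")
    return top

# Made-up vectors standing in for real embeddings.
store = [
    ("pricing tiers", np.array([0.9, 0.1])),
    ("office hours", np.array([0.1, 0.9])),
    ("refund policy", np.array([0.6, 0.4])),
]
top = explain_retrieval(np.array([1.0, 0.0]), store, k=2)
```

In the real tool you would call `explain_retrieval(embed(question), store)`. If the top scores for an unanswerable question look much like those for an answerable one, that is a hint the chunks are too broad to discriminate.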

I pointed mine at 28 text files — 412 chunks indexed in about nine seconds, for well under a cent of embedding cost — and the answer it got wrong taught me more than the ten it got right. I asked about a pricing tier that lived inside a 600-word paragraph, and because that whole paragraph embedded as one fuzzy vector, retrieve() ranked a shorter, chattier chunk above it and the model answered from the wrong place. Splitting that one file into tighter paragraphs fixed it on the next run — which is the entire argument for Part 2.
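The manual fix can be approximated in code by splitting only the paragraphs that run long. This is a rough sketch, not the approach Part 2 takes: `split_long` is a hypothetical helper, and the limits of 120 words with a 20-word overlap are arbitrary assumptions you would tune on your own documents.

```python
def split_long(paragraph, max_words=120, overlap=20):
    """Split a paragraph into word windows of at most max_words, sharing `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = paragraph.split()
    if len(words) <= max_words:
        return [paragraph]  # short paragraphs pass through unchanged
    step = max_words - overlap
    pieces = []
    for start in range(0, len(words), step):
        pieces.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end
    return pieces

long_paragraph = " ".join(f"w{i}" for i in range(300))
pieces = split_long(long_paragraph)
print(len(pieces), [len(p.split()) for p in pieces])
# 3 [120, 120, 100]
```

The overlap means a fact that straddles a window boundary still appears whole in at least one chunk, at the cost of embedding some words twice.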

Where Naive RAG Breaks — and What Part 2 Fixes

This system works, and it's honest to say where it stops working. Splitting on blank lines is crude: a paragraph that runs long gets embedded as one fuzzy vector, and a fact split across two paragraphs may never land in the same chunk. Re-embedding every chunk on each startup is fine for a folder of notes and painful for ten thousand pages. And the in-memory list disappears when the process exits.
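The last two limits have a cheap stopgap before any vector database: write the list to disk once and reload it on startup. A minimal sketch using NumPy's .npz format, assuming a non-empty store whose vectors all have the same length; `save_store` and `load_store` are hypothetical helpers, and nothing here detects files that changed since the index was saved.

```python
import numpy as np

def save_store(store, path):
    """Write the (text, vector) pairs to a single .npz file."""
    texts = np.array([text for text, _ in store])
    vectors = np.vstack([vec for _, vec in store])
    np.savez(path, texts=texts, vectors=vectors)

def load_store(path):
    """Rebuild the list of (text, vector) pairs from a saved .npz file."""
    with np.load(path) as data:
        return [(str(text), vec) for text, vec in zip(data["texts"], data["vectors"])]
```

A typical use is to call `load_store("store.npz")` if the file exists and fall back to embedding plus `save_store(store, "store.npz")` if it does not, deleting the file whenever the documents change.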

None of that is a reason to add a framework yet. The bottom line: the moving parts don't change in production — you swap each one for a sturdier version. Better chunking, a persistent vector store, and re-ranking the retrieved chunks are the upgrades, and they're exactly what the next parts of this series build on top of the code you just wrote.

Quick Recap

The whole build in five lines:

  • A RAG system is four functions: load_and_chunk, embed, retrieve, answer.
  • The "vector database" is a Python list of (text, vector) pairs.
  • Retrieval is cosine similarity plus a sort — find the closest chunks by meaning.
  • The prompt says "answer from this context only" so the model stays grounded.
  • Bad answers are almost always retrieval, not the model.

Frequently Asked Questions

Do I need a vector database? Not to start — an in-memory Python list and NumPy handle a few hundred chunks comfortably. A vector store earns its place at thousands of chunks or when you need persistence.

What does "from scratch" mean? You use the OpenAI SDK and NumPy, but no RAG framework. You write the chunk, embed, retrieve, and answer logic yourself, so nothing is hidden.

How big should each chunk be? Start paragraph-sized. Chunk size is the biggest lever on answer quality, which is why Part 2 moves to fixed-size chunks with overlap.

Which models should I use? text-embedding-3-small to embed and a small chat model like gpt-4o-mini to generate. Names change — swap in the current cheap model.

Why is my RAG returning the wrong answer? Almost always retrieval, not the model. Print what retrieve() returns, lower k if the answer is buried in noise, and check for over-long chunks — debug retrieval before the prompt or the model.

Can I use a local or open-source model? Yes — swap embed() for a local embedding model and the chat call for an open-weight LLM. The chunking, cosine retrieval, and prompt stay identical.

Conclusion

You just built a working RAG system from four functions and a folder of text — no framework, nothing hidden. Once you've watched retrieval pull the right paragraph and the model answer from it, every production RAG stack stops looking like magic and starts looking like these same four steps with sturdier tooling.

What's the first folder of documents you'd point this at — your team's wiki, a stack of PDFs, your own notes? Tell me in the comments, and tell me where the answers got weird — that's the chunking problem Part 2 tackles.

Read next: RAG Chunking & Retrieval Quality (Part 2) — fix the bad answers naive chunking causes. New to RAG? Lock in the concept with What Is RAG? first.

Frequently asked questions

Do I need a vector database to build a RAG system?
Not to start. For a few hundred chunks you can keep the vectors in a plain Python list and compare them with a few lines of NumPy. A vector database earns its place once you have thousands of chunks and need fast search at scale or persistence between runs, which is exactly what a later part of this series covers.

What does "from scratch" mean: no libraries at all?
You still use the OpenAI SDK to call the embedding and chat models, and NumPy for the maths. What you skip is a RAG framework like LangChain or LlamaIndex. You write the chunk, embed, retrieve, and answer steps yourself, so nothing about how retrieval works is hidden from you.

How big should each chunk be?
Start with paragraph-sized chunks by splitting on blank lines; it is the simplest thing that works. Chunk size is the single biggest lever on answer quality, so Part 2 replaces this with fixed-size chunks plus overlap and shows why naive splitting returns bad answers.

Which models should I use for RAG in Python?
Use text-embedding-3-small for the embeddings and any small chat model, such as gpt-4o-mini, to generate the answer. Model names change every few months, so treat these as placeholders and drop in the current cheap model; the surrounding code stays the same.

Why is my RAG system returning the wrong answer?
Almost always the cause is retrieval, not the model. Debug in this order: print what retrieve() returns to confirm the right chunk is even coming back, lower k if too many chunks are burying the answer in noise, and check whether a single chunk is too long to embed cleanly. Fix retrieval before you touch the prompt or swap the model.

Can I use a local or open-source model instead of OpenAI?
Yes, and the four functions do not change. Swap embed() for a local embedding model such as a sentence-transformers model, and swap the chat call for a local or open-weight LLM. Only those two API calls change; the chunking, cosine retrieval, and augmented prompt stay exactly the same.
