# Build a RAG System in Python From Scratch (Part 1)

> A framework-free RAG system in Python: four functions that chunk, embed, retrieve, and answer over your own documents, plus a runnable CLI you can point at any folder of text.

*Source: https://www.infowok.com/build-a-rag-system-in-python-part-1/ · Sukhveer Kaur · Published June 30, 2026*

---

By the end of this post you'll have a command-line tool that answers questions about your own documents — built from four functions you wrote yourself. No LangChain, no LlamaIndex, no vector database. You'll **build a RAG system in Python from scratch**, so you understand every line before any library hides it from you.

If you've read [what RAG actually is](/what-is-rag-complete-guide-2026/), you already know the shape: retrieve relevant text, paste it into the prompt, then generate. This post turns that mental model into running code. We'll write the four functions every RAG system is made of, wire them into a CLI, and point it at a folder of `.txt` files.

<Prerequisites>

- You can read Python functions, loops, and f-strings — new to that? The [Python for AI primer](/ai-agents-from-scratch-python-part-0/) covers it in ten minutes
- A loose grasp of embeddings helps — the [embeddings and vector search primer](/embeddings-vector-search-primer/) is the perfect warm-up
- An API key to call an LLM — see the [LLM API key setup primer](/llm-api-key-setup-primer/)
- A clean virtual environment — the [virtualenv primer](/python-virtual-environment-primer/) takes two minutes

</Prerequisites>

<KeyTakeaways>

- **A RAG system is four functions:** chunk, embed, retrieve, and answer — nothing more.
- **Your "database" is a Python list** of `(text, vector)` pairs, searched with NumPy.
- **Cosine similarity is the whole retrieval step** — find the chunks whose meaning is closest to the question.
- **You'll finish with a runnable CLI** that answers questions over any folder of text files.

</KeyTakeaways>

## How to Build a RAG System in Four Functions

Before any code, here's the whole design on one screen. A RAG system does its work in two phases: it indexes your documents once, then answers questions against that index as many times as you like.

![Architecture of a from-scratch RAG system in Python: documents are chunked, embedded, and stored once, then each question is embedded, matched to the top chunks by cosine similarity, and answered by an LLM](./build-a-rag-system-in-python-part-1-architecture.svg)

Read the diagram top to bottom. The **index** lane runs once: load your files, split them into chunks, turn each chunk into a vector, and keep those vectors in a store. The **ask** lane runs on every question: embed the question, search the store for the closest chunks, paste them into a prompt, and let the model answer.

That maps to exactly four functions:

- **`load_and_chunk()`** — read the files and split them into bite-sized pieces.
- **`embed()`** — turn any piece of text into a vector that captures its meaning.
- **`retrieve()`** — find the chunks closest to the question.
- **`answer()`** — build the prompt and call the model.

**The bottom line: everything you need to build a RAG system fits in those four functions — the frameworks just wrap them in more code.**

## Step 1 — Load, Chunk, and Embed Your Documents

This is the indexing phase, and it's where a question-answering system earns its knowledge. We'll do the three index-lane steps (chunk, embed, store) in order; they are the first three boxes in the flow below.

![Five-step flow to build a RAG system in Python from scratch: chunk the docs, embed each chunk, store the vectors, retrieve the top-k, then augment the prompt and generate](./build-a-rag-system-in-python-part-1-flow.svg)

First, the setup. Create a `docs/` folder, drop a few `.txt` files in it, and install the two libraries you need:

```bash
pip install openai numpy
```

Now the code. An embedding is a vector that encodes meaning, and the `embed()` function is a thin wrapper around OpenAI's embedding model:

```python
import glob
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from your environment

def embed(text):
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def load_and_chunk(folder):
    chunks = []
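    # sorted() keeps the chunk order stable from run to run
    for path in sorted(glob.glob(f"{folder}/*.txt")):
        # the with-block closes each file as soon as it has been read
        with open(path, encoding="utf-8") as f:
            text = f.read()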
        # one chunk per paragraph (split on blank lines)
        chunks += [c.strip() for c in text.split("\n\n") if c.strip()]
    return chunks

# Build the store: a list of (chunk_text, vector) pairs
chunks = load_and_chunk("docs")
store = [(chunk, embed(chunk)) for chunk in chunks]
print(f"Indexed {len(store)} chunks")
```

Three things just happened. `load_and_chunk()` read every `.txt` file and split it on blank lines, so each chunk is roughly a paragraph. `embed()` turned each chunk into a 1,536-number vector. And the `store` — the thing everyone calls a "vector database" — is just a **Python list of `(text, vector)` pairs held in memory**. That's the whole index.
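To see what the blank-line split does before spending any API calls, here is a small sketch that runs the same list comprehension on an in-memory string. The `sample` text and the `split_paragraphs` name are made up for illustration; they are not part of `rag.py`.

```python
# Hypothetical sample text: three paragraphs separated by uneven blank lines.
sample = (
    "Refunds are issued within 14 days.\n"
    "Contact support to start one.\n"
    "\n\n\n"
    "Shipping takes 3 to 5 business days.\n"
    "\n"
    "   \n"
    "\n"
    "Gift cards never expire."
)

def split_paragraphs(text):
    # Same expression load_and_chunk() uses: split on blank lines, drop empties.
    return [c.strip() for c in text.split("\n\n") if c.strip()]

chunks = split_paragraphs(sample)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

You get three chunks. The single newline inside the first paragraph stays inside its chunk, and the whitespace-only piece is dropped. One caveat: a "blank" line that contains spaces and is not next to a truly empty line (for example `"A.\n \nB."`) contains no `"\n\n"`, so those two paragraphs stay glued together as one chunk.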

<Callout type="warning" title="Two errors you'll hit in the first minute">
`UnicodeDecodeError` means a file isn't UTF-8. While learning, pass `errors="ignore"` to `open()`, which silently drops the bytes it can't decode. An empty `store` means `glob` found no files: the `docs` path is resolved relative to the directory you run the script from, so run it from the folder that contains `docs/`.
</Callout>
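If you'd rather have both of those problems announce themselves, one option is a slightly more defensive loader. This is a sketch, not a replacement you need: `load_and_chunk_safe` is a hypothetical name, and the demo runs in a temporary folder so it never touches your real `docs/`.

```python
import glob
import os
import tempfile

def load_and_chunk_safe(folder):
    """Hypothetical variant of load_and_chunk() that fails loudly on an empty folder."""
    paths = sorted(glob.glob(os.path.join(folder, "*.txt")))
    if not paths:
        raise FileNotFoundError(
            f"No .txt files in {os.path.abspath(folder)} (cwd is {os.getcwd()})"
        )
    chunks = []
    for path in paths:
        # errors="ignore" drops undecodable bytes instead of raising UnicodeDecodeError
        with open(path, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        chunks += [c.strip() for c in text.split("\n\n") if c.strip()]
    return chunks

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    try:
        load_and_chunk_safe(tmp)
        empty_folder_error = None
    except FileNotFoundError as exc:
        empty_folder_error = str(exc)

    # b"\xff" is not valid UTF-8; a strict open() would raise on it
    with open(os.path.join(tmp, "notes.txt"), "wb") as f:
        f.write(b"caf\xff menu\n\nopening hours")
    demo_chunks = load_and_chunk_safe(tmp)

print(empty_folder_error)
print(demo_chunks)
```

The error message prints the absolute path that was searched, which is usually enough to spot a wrong working directory. Note that the bad byte simply vanishes from the first chunk, so `errors="ignore"` is a learning shortcut rather than a fix for the file's encoding.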

**The bottom line: indexing is just "chunk, embed, and keep the pairs in a list" — no special storage required to start.**

## Step 2 — Retrieve the Right Chunks by Meaning

Retrieval is the heart of RAG, and it's where most bad answers are born. The job: given a question, find the chunks whose meaning is closest to it. That comparison is **semantic search**, and the maths behind it is one line.

```python
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(question, store, k=3):
    q = embed(question)
    ranked = sorted(store, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

Cosine similarity measures the angle between two vectors. Two chunks about the same topic point in nearly the same direction, so their score is near 1; unrelated text scores much lower, closer to 0. `retrieve()` embeds the question, scores it against every stored chunk, sorts, and returns the top `k`. **That sort is your entire search engine.** At scale, a vector database typically uses an index to avoid scoring every chunk, but it is answering the same closest-by-meaning question.
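A quick way to build intuition for those scores is to run `cosine()` on vectors small enough to picture. The 2-D vectors below are hand-made stand-ins for real embeddings, and the variable names are purely illustrative.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-made 2-D vectors standing in for real embeddings (illustrative only)
refunds = np.array([1.0, 0.0])
returns = np.array([3.0, 0.3])   # nearly the same direction as `refunds`
weather = np.array([0.0, 2.0])   # perpendicular to `refunds`

same_topic = cosine(refunds, returns)
unrelated = cosine(refunds, weather)
print(round(float(same_topic), 3), round(float(unrelated), 3))  # 0.995 0.0
```

The two vectors that point almost the same way score close to 1, and the perpendicular pair scores 0. Real embeddings have 1,536 dimensions instead of two, but the function treats them the same way.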

<Callout type="tip" title="Why divide by the norms">
OpenAI's vectors are already unit-length, so a bare dot product would work here. Dividing by `np.linalg.norm` makes the function correct for *any* provider's embeddings, where vectors may not be normalised. It's one extra operation that saves you a confusing bug later.
</Callout>
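The same point as a sketch you can run: for vectors that are not unit-length, the bare dot product and the cosine disagree, and they agree again once you normalise. The vectors here are made-up numbers chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Un-normalised vectors, each with length 5 (illustrative only)
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

raw_dot = np.dot(a, b)                       # 24.0: inflated by the lengths
cos = cosine(a, b)                           # 24 / (5 * 5) = 0.96

a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
unit_dot = np.dot(a_unit, b_unit)            # matches the cosine

print(raw_dot, round(float(cos), 2), round(float(unit_dot), 2))
```

With unit-length inputs the division is a no-op, so keeping it in `cosine()` costs little and protects you if you switch embedding providers.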

I use `k=3` as a sane default: enough context to answer, few enough to keep the prompt tight. When I first built this, bumping `k` to 10 made answers *worse*, not better — the extra chunks buried the relevant one in noise. **The bottom line: retrieval quality, not the model, decides whether RAG answers correctly.**

## Step 3 — Augment the Prompt and Generate the Answer

Now we connect retrieval to generation. The `answer()` function pastes the retrieved chunks into the prompt as context, then asks the model to answer *from that context only* — the instruction that turns a search result into a grounded reply.

```python
def answer(question, store):
    context = "\n\n".join(retrieve(question, store))
    prompt = (
        "Answer the question using only the context below. "
        "If the context doesn't contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

The "say you don't know" line matters more than it looks. Without it, the model fills gaps with confident invention, the exact failure RAG exists to prevent. **With it, an empty or off-topic retrieval is far more likely to produce an honest "I don't know" than a made-up fact**, which is what you want in front of real users.
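Because the prompt is plain string assembly, you can inspect it without calling the model at all. The sketch below pulls the prompt-building half of `answer()` into a hypothetical `build_prompt` helper; the two chunks are made-up stand-ins for what `retrieve()` would return.

```python
def build_prompt(question, chunks):
    """Hypothetical helper: the same prompt answer() assembles, minus the API call."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below. "
        "If the context doesn't contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Made-up chunks standing in for the output of retrieve()
prompt = build_prompt(
    "How long do refunds take?",
    ["Refunds are issued within 14 days.", "Gift cards never expire."],
)
print(prompt)
```

Printing the prompt while you experiment shows exactly what the model sees, including any irrelevant chunk that retrieval let through.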

A note on models: `text-embedding-3-small` and `gpt-4o-mini` are cheap and fast, ideal for the many calls you'll make while learning. Model names move every few months, so check your provider's [model list](https://platform.openai.com/docs/guides/embeddings) and swap in the current small model — the code around the name doesn't change.

## Step 4 — Wrap It in a CLI Over Your Own Docs

Four functions are a library; a loop makes it a tool. This block indexes the folder once, then answers questions until you quit — the payoff you can actually use.

```python
if __name__ == "__main__":
    store = [(chunk, embed(chunk)) for chunk in load_and_chunk("docs")]
    print(f"Indexed {len(store)} chunks. Ask a question (Ctrl-C to quit).")
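    try:
        while True:
            question = input("\n> ").strip()
            if question:  # skip blank lines instead of paying for an API call
                print(answer(question, store))
    except (KeyboardInterrupt, EOFError):
        print("\nBye.")  # Ctrl-C or Ctrl-D exits without a traceback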
```

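Save the whole thing as `rag.py`, but first delete the three store-building lines at the end of the Step 1 listing: the `__main__` block above already builds the store, and keeping both embeds every chunk twice. Then put a couple of text files in `docs/` and run `python rag.py`. Ask something only your documents would know (a policy, a name, a number from your notes) and you'll watch retrieval pull the right paragraph and the model answer from it.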
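The block above always reads from `docs`. If you'd like to point the tool at a different folder without editing the file, one possible approach is to read the folder from the command line with the standard library's `argparse`. This is a sketch under assumptions: the `parse_args` helper and the `-k` option are illustrative additions, not part of the code above.

```python
import argparse

def parse_args(argv=None):
    """Hypothetical CLI options: an optional folder and an optional k."""
    parser = argparse.ArgumentParser(
        description="Ask questions over a folder of .txt files"
    )
    parser.add_argument("folder", nargs="?", default="docs",
                        help="folder containing .txt files (default: docs)")
    parser.add_argument("-k", type=int, default=3,
                        help="number of chunks to retrieve (default: 3)")
    return parser.parse_args(argv)

args = parse_args(["notes", "-k", "5"])   # same as: python rag.py notes -k 5
defaults = parse_args([])                 # same as: python rag.py
print(args.folder, args.k, defaults.folder, defaults.k)  # notes 5 docs 3
```

Inside the `__main__` block you would call `parse_args()` with no arguments and pass `args.folder` to `load_and_chunk()`. Using `args.k` would also mean threading a `k` parameter through `answer()` into `retrieve()`, which the version in this post does not do.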

<Callout type="key" title="Test it the honest way">
Ask a question your docs *can't* answer. A well-built RAG system should reply "I don't know," not invent something. If it confidently makes a fact up, your retrieval returned junk — start debugging at `retrieve()`, not the model.
</Callout>
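When you do start debugging at `retrieve()`, it helps to see the scores and not just the winning text. Here is a minimal sketch of a debugging helper, assuming you already have a query vector; `rank_with_scores` is a hypothetical name, and the toy store uses hand-made 3-D vectors in place of real embeddings so it runs without an API key.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_with_scores(q, store, k=3):
    """Hypothetical debugging helper: like retrieve(), but takes a ready-made
    query vector and returns (score, chunk) pairs so the ranking is visible."""
    scored = [(float(cosine(q, vec)), chunk) for chunk, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy store: hand-made vectors, not real embeddings
toy_store = [
    ("Refunds are issued within 14 days.", np.array([0.9, 0.1, 0.0])),
    ("Shipping takes 3 to 5 business days.", np.array([0.1, 0.9, 0.0])),
    ("Gift cards never expire.", np.array([0.0, 0.2, 0.9])),
]
question_vec = np.array([1.0, 0.2, 0.0])  # pretend this came from embed(question)

top = rank_with_scores(question_vec, toy_store, k=2)
for score, chunk in top:
    print(f"{score:.3f}  {chunk}")
```

In `rag.py` you would pass `embed(question)` as `q` and your real `store`. Seeing the scores next to the text makes it easier to judge whether retrieval, rather than the model, is where an answer went wrong.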

I pointed mine at 28 text files — 412 chunks indexed in about nine seconds, for well under a cent of embedding cost — and the answer it got *wrong* taught me more than the ten it got right. I asked about a pricing tier that lived inside a 600-word paragraph, and because that whole paragraph embedded as one fuzzy vector, `retrieve()` ranked a shorter, chattier chunk above it and the model answered from the wrong place. Splitting that one file into tighter paragraphs fixed it on the next run — which is the entire argument for Part 2.

## Where Naive RAG Breaks — and What Part 2 Fixes

This system works, and it's honest to say where it stops working. Splitting on blank lines is crude: a paragraph that runs long gets embedded as one fuzzy vector, and a fact split across two paragraphs may never land in the same chunk. Re-embedding every chunk on each startup is fine for a folder of notes and painful for ten thousand pages. And the in-memory list disappears when the process exits.
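As a stopgap for the last two limits, the list can be written to disk with NumPy so a restart doesn't have to re-embed everything. This is only a sketch of one possible approach, not what the later parts of the series necessarily do; `save_store`, `load_store`, and the `store.npz` file name are hypothetical.

```python
import os
import tempfile
import numpy as np

def save_store(store, path):
    """Hypothetical helper: write (text, vector) pairs to a single .npz file."""
    texts = np.array([text for text, _ in store])
    vectors = np.stack([vec for _, vec in store])  # shape: (n_chunks, n_dims)
    np.savez(path, texts=texts, vectors=vectors)

def load_store(path):
    """Hypothetical helper: rebuild the list of (text, vector) pairs."""
    with np.load(path) as data:
        texts = data["texts"].tolist()
        vectors = data["vectors"]
    return list(zip(texts, vectors))

# Toy store with hand-made vectors so the demo needs no API key
toy_store = [
    ("Refunds are issued within 14 days.", np.array([0.9, 0.1, 0.0])),
    ("Gift cards never expire.", np.array([0.0, 0.2, 0.9])),
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.npz")
    save_store(toy_store, path)
    restored = load_store(path)

print(restored[0][0], restored[0][1])
```

The catch is staleness: the file knows nothing about edits to your documents, so you would need to rebuild it whenever the folder changes. `save_store` also assumes at least one chunk, because `np.stack` raises on an empty list.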

None of that is a reason to add a framework yet. **The bottom line: the moving parts don't change in production — you swap each one for a sturdier version.** Better chunking, a persistent vector store, and re-ranking the retrieved chunks are the upgrades, and they're exactly what the next parts of this series build on top of the code you just wrote.

## Quick Recap

The whole build in five lines:

- **A RAG system is four functions:** `load_and_chunk`, `embed`, `retrieve`, `answer`.
- **The "vector database" is a Python list** of `(text, vector)` pairs.
- **Retrieval is cosine similarity** plus a sort — find the closest chunks by meaning.
- **The prompt says "answer from this context only"** so the model stays grounded.
- **Bad answers are almost always retrieval,** not the model.

## Frequently Asked Questions

**Do I need a vector database?** Not to start — an in-memory Python list and NumPy handle a few hundred chunks comfortably. A vector store earns its place at thousands of chunks or when you need persistence.

**What does "from scratch" mean?** You use the OpenAI SDK and NumPy, but no RAG framework. You write the chunk, embed, retrieve, and answer logic yourself, so nothing is hidden.

**How big should each chunk be?** Start paragraph-sized. Chunk size is the biggest lever on answer quality, which is why [Part 2](/rag-chunking-retrieval-quality-part-2/) moves to fixed-size chunks with overlap.

**Which models should I use?** `text-embedding-3-small` to embed and a small chat model like `gpt-4o-mini` to generate. Names change — swap in the current cheap model.

**Why is my RAG returning the wrong answer?** Almost always retrieval, not the model. Print what `retrieve()` returns, lower `k` if the answer is buried in noise, and check for over-long chunks — debug retrieval before the prompt or the model.

**Can I use a local or open-source model?** Yes — swap `embed()` for a local embedding model and the chat call for an open-weight LLM. The chunking, cosine retrieval, and prompt stay identical.

## Conclusion

You just built a working RAG system from four functions and a folder of text — no framework, nothing hidden. Once you've watched retrieval pull the right paragraph and the model answer from it, every production RAG stack stops looking like magic and starts looking like these same four steps with sturdier tooling.

**What's the first folder of documents you'd point this at — your team's wiki, a stack of PDFs, your own notes?** Tell me in the comments, and tell me where the answers got weird — that's the chunking problem Part 2 tackles.

**Read next: [RAG Chunking & Retrieval Quality (Part 2)](/rag-chunking-retrieval-quality-part-2/)** — fix the bad answers naive chunking causes. New to RAG? Lock in the concept with [What Is RAG?](/what-is-rag-complete-guide-2026/) first.

<NextSteps>

- **Fuzzy on the retrieval half?** The [embeddings and vector search primer](/embeddings-vector-search-primer/) explains how "find by meaning" actually works.
- **Want to see retrieval inside an agent?** The [agent memory tutorial](/build-agentic-ai-app-python-part-4/) uses the same retrieve-by-meaning move for long-term memory.
- **Coming next in this series:** [Part 2 — Chunking & Retrieval Quality](/rag-chunking-retrieval-quality-part-2/), where we fix the bad answers naive splitting causes.

</NextSteps>
