# Vector Database for RAG: When to Ditch the List (Part 4)

> The Python list store from Part 1 was the right way to learn — and it re-embeds everything on every run. This part moves the index to Chroma without changing the retrieve() contract, adds BM25 + vector hybrid search fused by rank, and gives an honest Chroma vs pgvector vs Qdrant decision.

*Source: https://www.infowok.com/vector-database-for-rag-part-4/ · Sukhveer Kaur · Published July 3, 2026*

---

> **Series: RAG in Python: Zero to Production**
> This is Part 4. [Part 1](/build-a-rag-system-in-python-part-1/) built RAG from four functions; [Part 2](/rag-chunking-retrieval-quality-part-2/) added the hit-rate eval; [Part 3](/semantic-chunking-reranking-rag-part-3/) broke the retrieval ceiling with re-ranking.
> This part answers the question Part 3 deferred: when do you need a real vector database for RAG — and what does hybrid search add on top?

Every time your Part 1 script starts, it re-embeds the entire corpus. That was a feature while you were learning — the whole index was a Python list you could print. But it means every restart costs API calls, nothing persists, and there's no way to say "only search the 2026 docs."

By the end of this post your index will live on disk in Chroma, with metadata filters, and `retrieve()` will keep the exact same contract. You'll also have a **hybrid retriever** that catches the error-code and function-name queries pure embeddings fumble. Just as important: you'll know the three concrete triggers that justify the move, because below them, **the list you already have is not a prototype embarrassment — it's the right tool.**

<Prerequisites>

- You built the [Part 1 RAG system](/build-a-rag-system-in-python-part-1/) — we swap its `store`, keep its functions
- You have the [Part 2 hit-rate eval](/rag-chunking-retrieval-quality-part-2/) — every change here gets measured with it
- New to embeddings? The [embeddings & vector search primer](/embeddings-vector-search-primer/) explains "find by meaning"

</Prerequisites>

<KeyTakeaways>

- **Three triggers justify a vector database:** persistence, metadata filters, and scale — not vibes.
- **Chroma is a drop-in swap** — the four functions from Part 1 don't change, only where vectors live.
- **Hybrid search fuses BM25 keyword rank with vector rank** so exact terms stop falling through.
- **RRF fuses by rank, not score** — which is why it works without normalizing anything.

</KeyTakeaways>

## When Do You Actually Need a Vector Database for RAG?

Before installing anything, be clear on what the list can't do — because raw speed at small scale usually isn't the problem. A brute-force cosine scan over a few thousand vectors is fast enough that you won't feel it. The real triggers are different:

- **Persistence.** The list dies with the process. Re-embedding a growing corpus on every run costs money and minutes, and it gets worse linearly forever.
- **Metadata filtering.** "Search only the API docs" or "only documents newer than March" requires storing structured fields next to each vector and filtering *before* similarity ranking. Bolting that onto a list means reinventing a query engine.
- **Scale.** Past hundreds of thousands of chunks, brute force stops being cute, because every query has to touch every vector. A vector database builds an approximate nearest neighbor (ANN) index so query time grows far more slowly than the corpus does.
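
To make the filtering trigger concrete, here is a minimal sketch of what bolting a metadata filter onto a list store could look like. The store layout (a list of dicts with `text`, `vec`, and `meta` keys) and the helper name `filtered_search` are assumptions for illustration, not Part 1's actual code, and the 2-D vectors stand in for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(store, query_vec, k=3, where=None):
    """Brute-force scan: keep exact metadata matches first, then rank by cosine."""
    where = where or {}
    candidates = [
        item for item in store
        if all(item["meta"].get(key) == value for key, value in where.items())
    ]
    candidates.sort(key=lambda item: cosine(item["vec"], query_vec), reverse=True)
    return [item["text"] for item in candidates[:k]]

# Toy store: hypothetical chunks with 2-D stand-in vectors
store = [
    {"text": "rate limits", "vec": [1.0, 0.0], "meta": {"source": "api-docs"}},
    {"text": "pricing page", "vec": [0.9, 0.1], "meta": {"source": "marketing"}},
    {"text": "auth tokens", "vec": [0.0, 1.0], "meta": {"source": "api-docs"}},
]

print(filtered_search(store, [1.0, 0.0], k=2, where={"source": "api-docs"}))
```

Equality filters like this are easy. Ranges, `OR` conditions, and keeping the scan fast as the corpus grows are the parts that turn the helper into a query engine.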

![A Python list store and a vector database side by side — the list lives in RAM, re-embeds every run and has no filters; the database persists on disk, embeds once, adds metadata filters and an ANN index, while retrieve keeps the same contract in both](./vector-database-for-rag-store.svg)

Notice what's *not* on the trigger list: "everyone uses one." **The bottom line: a vector database for RAG is justified by persistence, filters, or scale — if none of the three bites you yet, keep the list and spend the effort on retrieval quality instead.**

## Step 1 — Swap the List for Chroma

The move should feel anticlimactic, and that's the point — the four functions survive intact. I default to [Chroma](https://docs.trychroma.com/docs/querying-collections/query-and-get) for this series because it runs embedded in your Python process with zero infrastructure: no server, no Docker, just a folder on disk.

```bash
pip install chromadb rank_bm25
```

Index once, persist forever:

```python
import chromadb

db = chromadb.PersistentClient(path="rag_index")
col = db.get_or_create_collection("docs")

# index once: embed only when the collection is empty, so restarts skip the API calls
if col.count() == 0:
    col.add(
        ids=[str(i) for i in range(len(chunks))],
        documents=chunks,
        embeddings=[embed(c).tolist() for c in chunks],
        metadatas=[{"source": s} for s in sources],
    )
```

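`PersistentClient` writes everything under `rag_index/`, so later runs can query the collection without re-embedding. Each chunk carries a metadata dict, and that's what unlocks filtered retrieval. One difference from the list to keep in mind: Chroma collections default to squared L2 distance rather than cosine similarity. For unit-length embeddings the two produce the same ranking; otherwise, configure the collection for cosine distance when you create it. The new `retrieve()` keeps Part 1's shape: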

```python
def retrieve(question, k=3, where=None):
    res = col.query(
        query_embeddings=[embed(question).tolist()],
        n_results=k,
        where=where,  # e.g. {"source": "api-docs"}
    )
    return res["documents"][0]
```

Same input, same output — but notice the one honest breaking change: **the `store` argument is gone, because Chroma now holds the store.** Downstream callers adapt in one mechanical move: delete `store` from the call. `answer(question, store)` becomes `answer(question)`, and Step 3 shows the two-line version of the same fix for the Part 2 eval. **This is the swap-the-store move: the architecture didn't change, only the address of the vectors — and one argument that no longer needs passing.**
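
The `where` argument accepts more than a single equality check. The filters below are a hedged sketch of a few shapes Chroma's filter syntax supports (`$in`, `$and`, `$gte`). The `guides` source and the `year` field are hypothetical: they only work if you stored those values in each chunk's metadata at index time.

```python
# Exact match on one field (the form shown in the retrieve() comment)
only_api = {"source": "api-docs"}

# Any of several sources
api_or_guides = {"source": {"$in": ["api-docs", "guides"]}}

# Combine conditions; "year" is a hypothetical numeric metadata field
recent_api = {"$and": [{"source": "api-docs"}, {"year": {"$gte": 2026}}]}

for where in (only_api, api_or_guides, recent_api):
    print(where)
```

Any of these can be passed straight through, for example `retrieve("how do rate limits work?", where=recent_api)`.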

<Callout type="warning" title="Keep embedding with YOUR embed()">
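If you call `col.add(documents=...)` without passing `embeddings`, Chroma silently embeds the documents with its own default local model, which is *not* the model your queries use if you keep Part 1's `embed()`. Index vectors and query vectors from different models don't live in the same space, so retrieval quietly degrades, or the query fails with a dimension error if the two models produce vectors of different lengths. Always pass `embeddings` when indexing and `query_embeddings` when querying, both produced by the same `embed()`.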
</Callout>
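
One more indexing detail worth a sketch: positional ids like `str(i)` shift whenever a chunk is added or removed, so id `"7"` can point at different text after a re-chunk. A common alternative is to derive the id from the content. The helper name `chunk_id` and the 16-character truncation are illustrative assumptions, not part of the series code.

```python
import hashlib

def chunk_id(source, text):
    """Deterministic id: the same source + text always maps to the same id."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:16]  # arbitrary truncation, long enough for small corpora

# Hypothetical chunks and sources, standing in for the lists used when indexing
chunks = ["Rate limits reset hourly.", "Tokens expire after 24h."]
sources = ["api-docs", "api-docs"]

ids = [chunk_id(s, c) for s, c in zip(sources, chunks)]
print(ids)
```

With stable ids, `col.upsert(...)` (which takes the same arguments as `col.add`) can refresh changed chunks without duplicating unchanged ones. Note that two identical chunks from the same source would collide on one id, so deduplicate before indexing.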

## Step 2 — Add Hybrid Search: BM25 + Vectors, Fused by Rank

Here's a failure you've probably already met. Ask your system about `ERR_QUOTA_429` or a function called `chunk_text`, and pure vector retrieval often shrugs. Embeddings compress meaning, and **rare exact identifiers are exactly what that compression throws away.** Meanwhile BM25 (boring, keyword-based, pre-LLM) nails those queries but only matches tokens that the question and the chunk share, so a paraphrase with no overlapping words scores nothing. Neither wins alone. Analyses of production RAG failures keep landing on retrieval, not generation, as the thing to fix. So run both and fuse:

![Hybrid search flow — the question fans out to a BM25 keyword ranking and a vector cosine ranking in parallel, both feed reciprocal rank fusion which fuses by rank because scores don't mix, and the fused top-k goes on to the reranker and answer](./vector-database-for-rag-hybrid-flow.svg)

The fusion trick is [reciprocal rank fusion](https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking) (RRF). BM25 scores and cosine similarities live on incompatible scales — averaging them is meaningless — so RRF ignores scores entirely and rewards *rank positions*:

```python
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi([c.lower().split() for c in chunks])

def rrf(rankings, k=60):
    # Each ranking is a list of chunks, best first
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            # rank is 0-based, so the top hit contributes 1 / (k + 1)
            scores[doc] = scores.get(doc, 0) + 1 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_retrieve(question, k=3, pool=20):
    # Take a wider pool from each retriever, fuse, then cut to k
    vec_hits = retrieve(question, k=pool)
    kw_hits = bm25.get_top_n(question.lower().split(), chunks, n=pool)
    return rrf([vec_hits, kw_hits])[:k]
```

A chunk near the top of either list earns a larger share, but agreement counts for more: with `k=60`, a chunk ranked mid-list by *both* methods outscores a chunk that only one method ranked #1. That's the whole algorithm: ten lines, no tuning, no score normalization.
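
A small worked example makes the arithmetic visible. This sketch uses the same formula as `rrf` above but returns the raw scores so they can be printed; the single-letter chunk names are placeholders.

```python
def rrf_scores(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + position) over every list a doc appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):  # rank is 0-based, so add 1
            scores[doc] = scores.get(doc, 0) + 1 / (k + rank + 1)
    return scores

# Toy rankings with placeholder chunk names
vec_hits = ["A", "x1", "x2", "x3", "C"]  # A is #1 for vectors only, C is 5th
kw_hits = ["B", "y1", "y2", "C", "y3"]   # B is #1 for BM25 only, C is 4th

scores = rrf_scores([vec_hits, kw_hits])
for doc in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(doc, round(scores[doc], 4))
```

`C` earns 1/65 + 1/64 ≈ 0.031, while `A` and `B` each earn only 1/61 ≈ 0.016, so the chunk both retrievers agreed on comes out on top even though neither ranked it first.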

One caveat to keep the persistence story honest: **the BM25 index doesn't persist — it's rebuilt from `chunks` at startup.** That's fine, because building it is a cheap local computation, not an API bill; re-embedding was the expensive part. On a fresh run, pull the chunks back out of Chroma instead of re-reading files: `chunks = col.get()["documents"]`.
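
Since the BM25 index is rebuilt on every start anyway, it is worth a look at how the text gets tokenized. Plain `.lower().split()` leaves punctuation glued to words, so `ERR_QUOTA_429.` at the end of a sentence will not match `ERR_QUOTA_429` in a question. The sketch below shows one possible tokenizer; the regex and the name `tokenize` are assumptions, and the pattern would need adjusting for identifiers containing dots or hyphens.

```python
import re

TOKEN = re.compile(r"[a-z0-9_]+")

def tokenize(text):
    """Lowercase, then keep runs of letters, digits and underscores as tokens."""
    return TOKEN.findall(text.lower())

question = "What does ERR_QUOTA_429 mean?"
chunk = "Retries stop after ERR_QUOTA_429."

print(chunk.lower().split())  # last token keeps its trailing period
print(tokenize(chunk))        # identifier survives as a clean token
print(set(tokenize(question)) & set(tokenize(chunk)))
```

If you adopt something like this, use it on both sides: `BM25Okapi([tokenize(c) for c in chunks])` when building the index and `tokenize(question)` when querying.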

<Callout type="note" title="Why k=60?">
The `k=60` constant comes from the original RRF paper and dampens how much the very top ranks dominate. Results are remarkably insensitive to its exact value, and production systems from Azure AI Search to Elasticsearch ship 60 as the default. Leave it alone.
</Callout>
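
One edge case in `hybrid_retrieve` is worth guarding against. As far as I can tell from `rank_bm25`'s implementation, `get_top_n` always returns `n` documents, even when no query term matched anything, so zero-score chunks can leak into the fusion. A hedged sketch of a replacement that ranks from raw scores and drops non-matches (the helper name `top_n_matching` is hypothetical):

```python
def top_n_matching(scores, docs, n=20):
    """Return up to n docs by descending score, skipping docs that scored 0."""
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:n] if score > 0]

# Hypothetical scores, shaped like the output of bm25.get_scores(tokens)
docs = ["chunk a", "chunk b", "chunk c"]
print(top_n_matching([0.0, 2.3, 0.0], docs))  # only the chunk that matched
print(top_n_matching([0.0, 0.0, 0.0], docs))  # no keyword match: empty ranking
```

Inside `hybrid_retrieve` this would read `kw_hits = top_n_matching(bm25.get_scores(question.lower().split()), chunks, n=pool)`. When BM25 finds nothing, the fused result then falls back to the vector ranking alone.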

**The bottom line: hybrid search isn't an alternative to your vector store — it's a second, cheap ranking that rescues the exact-match queries embeddings systematically miss.**

## Step 3 — Measure It (Same Eval, New Retriever)

You know the Part 2 discipline: **change one thing, re-run `hit_rate`, watch the number.** The eval needs the same one-line adaptation as everything else — drop `store` — plus one upgrade that makes it better: take the retriever as a parameter, so the same function scores both:

```python
def hit_rate(gold, k=3, retriever=retrieve):
    hits = 0
    for question, needle in gold:
        retrieved = " ".join(retriever(question, k=k)).lower()
        hits += needle.lower() in retrieved
    return hits / len(gold)

print(f"vector only: {hit_rate(gold):.0%}")
print(f"hybrid     : {hit_rate(gold, retriever=hybrid_retrieve):.0%}")
```

Two lines changed from Part 2: `store` is gone, and `retriever` is now an argument. Where you should expect movement: questions containing rare exact terms — error codes, config keys, product names, function names. Where you shouldn't: purely conceptual questions, which vector search already handled.

> **Common mistake:** benchmarking hybrid search on a gold set with no keyword-style questions in it. The fused retriever will score the same as plain vector search and you'll conclude hybrid "does nothing." Your gold set has to contain the failure you're trying to fix — add a few identifier-heavy questions before you judge.
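
One way to see where hybrid search earns its keep is to tag each gold question by type and report hit rate per tag. This is a sketch under assumptions: the third `tag` field is not part of the Part 2 gold format, and the gold rows and stub retriever below exist only to make the example runnable.

```python
from collections import defaultdict

def hit_rate_by_tag(gold, retriever, k=3):
    """Hit rate per question type; gold rows are (question, needle, tag)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for question, needle, tag in gold:
        retrieved = " ".join(retriever(question, k=k)).lower()
        hits[tag] += needle.lower() in retrieved
        totals[tag] += 1
    return {tag: hits[tag] / totals[tag] for tag in totals}

# Hypothetical gold rows, tagged by question style
gold = [
    ("What does ERR_QUOTA_429 mean?", "quota exceeded", "identifier"),
    ("Which function splits documents?", "chunk_text", "identifier"),
    ("How do I keep answers grounded?", "cite the retrieved", "conceptual"),
]

def stub_retriever(question, k=3):
    # Stand-in for retrieve / hybrid_retrieve: returns the same chunks every time
    return ["ERR_QUOTA_429 means quota exceeded.", "Always cite the retrieved chunk."]

print(hit_rate_by_tag(gold, stub_retriever))
```

Run it once with `retrieve` and once with `hybrid_retrieve`. If the `identifier` row moves and the `conceptual` row doesn't, the fusion is doing what it was added for.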

And the Part 3 reranker still applies afterwards: hybrid widens *what makes the pool*, the cross-encoder reorders the pool. They stack, and the eval tells you what each layer earned. As always, any specific numbers you see in posts like this — including mine — are illustrative until you've measured your own corpus.
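
The stacking order can be sketched as a small wrapper. The `reranker(question, candidates)` signature is an assumption, since your Part 3 function may take different arguments, and both stubs below are toys that only make the example runnable.

```python
def retrieve_and_rerank(question, retriever, reranker, k=3, pool=20):
    """Pull a wide pool from the retriever, then let the reranker pick the final k."""
    candidates = retriever(question, k=pool)
    return reranker(question, candidates)[:k]

# Stubs only: swap in hybrid_retrieve and your Part 3 reranker
def stub_retriever(question, k=3):
    return ["unrelated chunk", "quota docs: ERR_QUOTA_429", "more quota notes"][:k]

def stub_reranker(question, candidates):
    # Toy scoring: count the question words that appear in each candidate
    words = question.lower().split()
    return sorted(
        candidates,
        key=lambda c: sum(w in c.lower() for w in words),
        reverse=True,
    )

print(retrieve_and_rerank("quota ERR_QUOTA_429", stub_retriever, stub_reranker, k=2))
```

Passing the retriever and reranker as arguments keeps each layer swappable, so the eval can score vector-only, hybrid, and hybrid-plus-rerank with the same function.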

## Chroma vs pgvector vs Qdrant: An Honest Decision

Sooner or later someone asks why you didn't pick a different database, so here's the short version of a long argument:

| | **Chroma** | **pgvector** | **Qdrant** |
| --- | --- | --- | --- |
| **What it is** | Embedded library, runs in-process | Postgres extension | Dedicated Rust vector service |
| **Setup cost** | `pip install`, zero infra | You already run it (if you run Postgres) | A server to operate |
| **Sweet spot** | Local dev, prototypes, small apps | Vectors next to relational data, SQL joins | Millions of vectors, heavy filtering, scale |
| **Watch out for** | Not built for high-throughput production | Postgres tuning is on you | Operational overhead you may not need yet |

My take: **for this series' scale, Chroma is correct.** The moment your app already has Postgres, [pgvector](https://github.com/pgvector/pgvector) is the pragmatic answer — it adds no new system to operate. **Qdrant is what you graduate to when retrieval is a real service with real load** — not before. Picking the scale-tier database on day one is how you spend your first month on infrastructure instead of retrieval quality. If your corpus is messy PDFs rather than clean text, that's a parsing problem no database fixes — see the [RAGFlow deep-dive](/ragflow-fix-bad-rag-retrieval-2026/).

## Quick Recap

- **Three triggers** — persistence, metadata filters, scale — justify a vector database; below them, the list is fine.
- **Chroma is a drop-in:** `PersistentClient` + a collection, and `retrieve()` keeps its contract.
- **Pass your own embeddings** to Chroma, or index and query vectors won't share a space.
- **Hybrid search = BM25 + vectors fused by RRF** — rank-based, no score normalization, ten lines.
- **Measure with a gold set that contains keyword-style questions,** or hybrid will look useless.
- **Chroma → pgvector → Qdrant** is a graduation path, not a day-one menu.

## Frequently Asked Questions

**When should I move to a vector database?** When re-embedding on every run hurts, when you need metadata filters, or when the corpus outgrows a brute-force scan. Not because a tutorial told you to.

**What is hybrid search in RAG?** BM25 keyword ranking and vector ranking run in parallel, fused by reciprocal rank fusion — exact terms and semantic matches both make the pool.

**Do I still need the Part 3 reranker?** Yes — hybrid changes what gets retrieved, the reranker changes the order. They fix different failures and stack cleanly.

**Is a Python list ever the right store?** Absolutely — for learning, prototypes, and small stable corpora it's simpler, debuggable, and has no moving parts.

## Conclusion

Part 4 closes the storage question that the series deliberately opened with a naive answer. The list taught you what the database does, so you can now adopt a vector database for RAG for concrete reasons rather than fashion. The index persists, filters work, exact-term queries stop falling through, and the four functions from Part 1 are still recognizably the same system.

**Which trigger pushed you off the list — persistence, filters, or scale — or are you happily still on it?** Tell me in the comments.

**Read next: [Semantic Chunking & Re-Ranking (Part 3)](/semantic-chunking-reranking-rag-part-3/)** — the retrieval-quality layer this store now serves.

<NextSteps>

- **Just joining the series?** [Part 1](/build-a-rag-system-in-python-part-1/) builds the four-function RAG system this post upgrades.
- **Retrieval still failing on hard questions?** [Agentic RAG](/agentic-rag-vs-static-rag-2026/) adds a grade-and-retry loop above the store.
- **Corpus full of messy PDFs?** [RAGFlow](/ragflow-fix-bad-rag-retrieval-2026/) fixes parsing — the failure no vector database can.

</NextSteps>
