
Vector Database for RAG: When to Ditch the List (Part 4)

The Python list store from Part 1 was the right way to learn — and it re-embeds everything on every run. This part moves the index to Chroma without changing the retrieve() contract, adds BM25 + vector hybrid search fused by rank, and gives an honest Chroma vs pgvector vs Qdrant decision.


Sukhveer Kaur

Published July 3, 2026

7 min read



🧰 New here? Set up your environment first · ~5 min

Install Python 3.11+ — confirm with python3 --version.
Create and activate a virtual environment: python3 -m venv .venv then source .venv/bin/activate (Windows: .venv\Scripts\activate). venv, pip & uv primer →
Install the packages this tutorial lists: pip install -U pip <packages>.
Put your LLM API key in a .env file and never commit it. API key + .env primer →

Full walkthrough → Environment Setup primer

Series: RAG in Python: Zero to Production. This is Part 4. Part 1 built RAG from four functions; Part 2 added the hit-rate eval; Part 3 broke the retrieval ceiling with re-ranking. This part answers the question Part 3 deferred: when do you need a real vector database for RAG, and what does hybrid search add on top?

Every time your Part 1 script starts, it re-embeds the entire corpus. That was a feature while you were learning — the whole index was a Python list you could print. But it means every restart costs API calls, nothing persists, and there’s no way to say “only search the 2026 docs.”

By the end of this post your index will live on disk in Chroma, with metadata filters, and retrieve() will keep the exact same contract. You’ll also have a hybrid retriever that catches the error-code and function-name queries pure embeddings fumble. Just as important: you’ll know the three concrete triggers that justify the move, because below them, the list you already have is not a prototype embarrassment — it’s the right tool.

🟡 Intermediate · ⏱️ 11 min read · Stack: Python, the Part 1–3 RAG code, chromadb, rank-bm25

✅ Before you start

You built the Part 1 RAG system — we swap its store, keep its functions
You have the Part 2 hit-rate eval — every change here gets measured with it
New to embeddings? The embeddings & vector search primer explains “find by meaning”

🎯 Key takeaways

Three triggers justify a vector database: persistence, metadata filters, and scale — not vibes.
Chroma is a drop-in swap — the four functions from Part 1 don’t change, only where vectors live.
Hybrid search fuses BM25 keyword rank with vector rank so exact terms stop falling through.
RRF fuses by rank, not score — which is why it works without normalizing anything.

When Do You Actually Need a Vector Database for RAG?

Before installing anything, be clear on what the list can’t do — because raw speed at small scale usually isn’t the problem. A brute-force cosine scan over a few thousand vectors is fast enough that you won’t feel it. The real triggers are different:

Persistence. The list dies with the process. Re-embedding a growing corpus on every run costs money and minutes, and it gets worse linearly forever.
Metadata filtering. “Search only the API docs” or “only documents newer than March” requires storing structured fields next to each vector and filtering before similarity ranking. Bolting that onto a list means reinventing a query engine.
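Scale. Past a few hundred thousand chunks, a brute-force scan starts to add noticeable latency to every query, because its cost grows linearly with the corpus. A vector database builds an approximate nearest-neighbor (ANN) index, so query time grows far more slowly than the corpus does.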

Notice what’s not on the trigger list: “everyone uses one.” The bottom line: a vector database for RAG is justified by persistence, filters, or scale — if none of the three bites you yet, keep the list and spend the effort on retrieval quality instead.
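To make the "reinventing a query engine" point concrete, here is a minimal sketch of what a filtered brute-force scan over a list looks like. The row layout (a dict with text, vector, and source) and the list_retrieve name are assumptions for illustration, not the Part 1 code, and the 2-D vectors stand in for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def list_retrieve(query_vec, store, k=3, source=None):
    """Brute-force scan: filter on metadata first, then rank by similarity."""
    candidates = [row for row in store if source is None or row["source"] == source]
    ranked = sorted(candidates, key=lambda row: cosine(query_vec, row["vector"]), reverse=True)
    return [row["text"] for row in ranked[:k]]

# Hypothetical store rows; toy 2-D vectors stand in for real embeddings.
store = [
    {"text": "rate limits", "vector": [1.0, 0.0], "source": "api-docs"},
    {"text": "billing faq", "vector": [0.9, 0.1], "source": "faq"},
    {"text": "auth tokens", "vector": [0.0, 1.0], "source": "api-docs"},
]

print(list_retrieve([1.0, 0.0], store, k=2, source="api-docs"))
# → ['rate limits', 'auth tokens']
```

One equality filter is easy. Date ranges, combined conditions, and persistence are where this approach starts to turn into a small database of its own.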

Step 1 — Swap the List for Chroma

The move should feel anticlimactic, and that’s the point — the four functions survive intact. I default to Chroma for this series because it runs embedded in your Python process with zero infrastructure: no server, no Docker, just a folder on disk.

bash

pip install chromadb rank_bm25

Index once, persist forever:

python

import chromadb

db = chromadb.PersistentClient(path="rag_index")
col = db.get_or_create_collection("docs")

# Index once: skip embedding when the collection already holds vectors.
# chunks, sources, and embed() come from the Part 1 code.
if col.count() == 0:
    col.add(
        ids=[str(i) for i in range(len(chunks))],
        documents=chunks,
        embeddings=[embed(c).tolist() for c in chunks],
        metadatas=[{"source": s} for s in sources],
    )

PersistentClient writes everything under rag_index/, so the next run skips straight to querying. Each chunk carries a metadata dict — that’s what unlocks filtered retrieval. The new retrieve() keeps Part 1’s shape:

python

def retrieve(question, k=3, where=None):
    res = col.query(
        query_embeddings=[embed(question).tolist()],
        n_results=k,
        where=where,  # e.g. {"source": "api-docs"}
    )
    return res["documents"][0]

Same input, same output — but notice the one honest breaking change: the store argument is gone, because Chroma now holds the store. Downstream callers adapt in one mechanical move: delete store from the call. answer(question, store) becomes answer(question), and Step 3 shows the two-line version of the same fix for the Part 2 eval. This is the swap-the-store move: the architecture didn’t change, only the address of the vectors — and one argument that no longer needs passing.
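The where argument accepts more than a single equality. As a hedged sketch, the helper below builds a Chroma-style filter dict from optional conditions, using the $eq, $gte, and $and operators as I understand Chroma's filter syntax. The build_where name and the year metadata field are hypothetical; the indexing code above only stores source.

```python
def build_where(source=None, min_year=None):
    """Return a Chroma-style `where` dict, or None when no filter is requested."""
    clauses = []
    if source is not None:
        clauses.append({"source": {"$eq": source}})
    if min_year is not None:
        # `year` is a hypothetical metadata field; store it at index time to use it.
        clauses.append({"year": {"$gte": min_year}})
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]  # a single condition needs no $and wrapper
    return {"$and": clauses}

print(build_where())
# → None
print(build_where(source="api-docs"))
# → {'source': {'$eq': 'api-docs'}}
print(build_where(source="api-docs", min_year=2026))
# → {'$and': [{'source': {'$eq': 'api-docs'}}, {'year': {'$gte': 2026}}]}
```

The result can be passed straight through, for example retrieve(question, where=build_where(source="api-docs")).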

⚠️ Keep embedding with YOUR embed()

If you call col.add(documents=...) without passing embeddings, Chroma silently embeds with its own default local model, which is not the model your queries use if you keep Part 1’s embed(). Index vectors and query vectors from different models don’t live in the same space, and retrieval quietly degrades. Always pass both documents and embeddings to col.add(), and compute index and query vectors with the same embed() function.
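A related indexing detail: the col.add call in this step uses positional IDs (str(i)), which collide as soon as the corpus changes order or grows. One possible approach, sketched here under assumptions, is to derive each ID from the chunk's content so that only unseen chunks get embedded. The chunk_id and new_chunks helpers are hypothetical names, not part of the series code.

```python
import hashlib

def chunk_id(text):
    """Stable ID derived from the chunk's content, not its position."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def new_chunks(chunks, existing_ids):
    """Return (id, text) pairs for chunks that are not in the index yet."""
    seen = set(existing_ids)
    fresh = []
    for text in chunks:
        cid = chunk_id(text)
        if cid not in seen:
            seen.add(cid)  # also drops duplicate chunks within this batch
            fresh.append((cid, text))
    return fresh

already_indexed = [chunk_id("alpha")]
print([text for _, text in new_chunks(["alpha", "beta", "beta"], already_indexed)])
# → ['beta']
```

In the Chroma setup above, existing_ids would plausibly come from col.get()["ids"], and only the returned pairs would be embedded and added.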

Step 2 — Add Hybrid Search: BM25 + Vectors, Fused by Rank

Here’s a failure you’ve probably already met. Ask your system about ERR_QUOTA_429 or a function called chunk_text, and pure vector retrieval often shrugs. Embeddings compress meaning, and rare exact identifiers are exactly what that compression throws away. Meanwhile BM25 (boring, keyword-based, pre-LLM) nails those queries but can’t handle paraphrase at all. Neither wins alone. Analyses of production RAG failures keep landing on retrieval, not generation, as the thing to fix, so run both and fuse the results.

The fusion trick is reciprocal rank fusion (RRF). BM25 scores and cosine similarities live on incompatible scales — averaging them is meaningless — so RRF ignores scores entirely and rewards rank positions:

python

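from rank_bm25 import BM25Okapi

# Keyword index over the same chunks; rebuilt in memory at startup.
bm25 = BM25Okapi([c.lower().split() for c in chunks])

def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over every ranking a doc appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):  # rank is 0-based, hence the +1
            scores[doc] = scores.get(doc, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_retrieve(question, k=3, pool=20):
    vec_hits = retrieve(question, k=pool)  # top `pool` by embedding similarity
    kw_hits = bm25.get_top_n(question.lower().split(), chunks, n=pool)  # top `pool` by BM25
    return rrf([vec_hits, kw_hits])[:k]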

A chunk ranked #1 by either method gets the largest single contribution, 1/61 with k=60; a chunk ranked mid-list by both still beats a chunk only one method liked, because its two smaller contributions add up. That’s the whole algorithm: about ten lines, no tuning, no score normalization.
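The arithmetic behind that claim is easy to check by hand. Each document scores the sum of 1 / (k + rank) over the lists it appears in, with rank counted from 1. The sketch below uses a variant of the rrf function that returns the raw scores instead of the sorted order (rrf_scores is a name introduced here for illustration), with made-up document labels.

```python
def rrf_scores(rankings, k=60):
    """Same fusion as rrf(), but returns the score dict for inspection."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):  # 0-based position, so +1 below
            scores[doc] = scores.get(doc, 0.0) + 1 / (k + rank + 1)
    return scores

# "A" is #1 in the vector list only; "B" is #10 in both lists.
vec_hits = ["A"] + [f"v{i}" for i in range(8)] + ["B"]
kw_hits = [f"k{i}" for i in range(9)] + ["B"]

scores = rrf_scores([vec_hits, kw_hits])
print(round(scores["A"], 4))  # 1/61
# → 0.0164
print(round(scores["B"], 4))  # 1/70 + 1/70
# → 0.0286
```

With k=60 the gap between rank 1 and rank 10 in a single list is small (1/61 versus 1/70), which is why agreement between the two retrievers outweighs a single first place.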

One caveat to keep the persistence story honest: the BM25 index doesn’t persist — it’s rebuilt from chunks at startup. That’s fine, because building it is a cheap local computation, not an API bill; re-embedding was the expensive part. On a fresh run, pull the chunks back out of Chroma instead of re-reading files: chunks = col.get()["documents"].
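Since the BM25 index is rebuilt at startup anyway, that rebuild is a cheap place to tighten tokenization. Splitting on whitespace leaves punctuation attached, so "ERR_QUOTA_429?" in a question and "ERR_QUOTA_429." in a chunk become different tokens. A minimal sketch of a regex tokenizer that keeps identifiers whole follows; the tokenize helper is an assumption for illustration, not part of rank_bm25.

```python
import re

TOKEN = re.compile(r"[a-z0-9_]+")

def tokenize(text):
    """Lowercase, then keep runs of letters, digits, and underscores."""
    return TOKEN.findall(text.lower())

question = "What is ERR_QUOTA_429?"
print(question.lower().split())
# → ['what', 'is', 'err_quota_429?']
print(tokenize(question))
# → ['what', 'is', 'err_quota_429']
print(tokenize("Call chunk_text(doc) first."))
# → ['call', 'chunk_text', 'doc', 'first']
```

If you adopt something like this, use it on both sides: BM25Okapi([tokenize(c) for c in chunks]) at build time and tokenize(question) at query time. Mismatched tokenizers fail in the same quiet way mismatched embedding models do.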

📌 Why k=60?

The k=60 constant comes from the original RRF paper and dampens how much the very top ranks dominate. Results are remarkably insensitive to its exact value, and production systems from Azure AI Search to Elasticsearch ship it as the default. Leave it alone.

The bottom line: hybrid search isn’t an alternative to your vector store — it’s a second, cheap ranking that rescues the exact-match queries embeddings systematically miss.
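One refinement worth considering: as far as I can tell, get_top_n sorts every score and slices the first n, so a question with no keyword overlap still returns a full pool of zero-score chunks, and RRF then rewards them for their arbitrary positions. The sketch below ranks only chunks with a positive score. It takes a plain list of scores, which in the real pipeline would plausibly come from bm25.get_scores(tokens); the keyword_ranking name is hypothetical.

```python
def keyword_ranking(scores, chunks, pool=20):
    """Rank chunks by BM25 score, keeping only chunks that matched a query term."""
    matched = [(score, i) for i, score in enumerate(scores) if score > 0]
    matched.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by position
    return [chunks[i] for _, i in matched[:pool]]

chunks = ["quota error ERR_QUOTA_429", "how chunking works", "billing overview"]

print(keyword_ranking([2.3, 0.0, 0.0], chunks))
# → ['quota error ERR_QUOTA_429']
print(keyword_ranking([0.0, 0.0, 0.0], chunks))
# → []
```

An empty keyword ranking is harmless to rrf: the fused order then simply equals the vector order, which is the behavior you want for purely conceptual questions.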

Step 3 — Measure It (Same Eval, New Retriever)

You know the Part 2 discipline: change one thing, re-run hit_rate, watch the number. The eval needs the same one-line adaptation as everything else — drop store — plus one upgrade that makes it better: take the retriever as a parameter, so the same function scores both:

python

def hit_rate(gold, k=3, retriever=retrieve):
    hits = 0
    for question, needle in gold:
        retrieved = " ".join(retriever(question, k=k)).lower()
        hits += needle.lower() in retrieved
    return hits / len(gold)
 
print(f"vector only: {hit_rate(gold):.0%}")
print(f"hybrid     : {hit_rate(gold, retriever=hybrid_retrieve):.0%}")

Two lines changed from Part 2: store is gone, and retriever is now an argument. Where you should expect movement: questions containing rare exact terms — error codes, config keys, product names, function names. Where you shouldn’t: purely conceptual questions, which vector search already handled.

Common mistake: benchmarking hybrid search on a gold set with no keyword-style questions in it. The fused retriever will score the same as plain vector search and you’ll conclude hybrid “does nothing.” Your gold set has to contain the failure you’re trying to fix — add a few identifier-heavy questions before you judge.
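A simple way to guard against that mistake is to tag each gold question and report the hit rate per tag, so a gain on keyword-style questions is not averaged away. This is a sketch under assumptions: it extends the gold rows to (question, needle, tag) triples, which is not the Part 2 format, and uses a stub retriever so it runs standalone.

```python
from collections import defaultdict

def hit_rate_by_tag(gold, retriever, k=3):
    """Hit rate per question tag; gold rows are (question, needle, tag)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for question, needle, tag in gold:
        retrieved = " ".join(retriever(question, k=k)).lower()
        totals[tag] += 1
        hits[tag] += needle.lower() in retrieved
    return {tag: hits[tag] / totals[tag] for tag in totals}

# Stub retriever for the demo; swap in retrieve or hybrid_retrieve.
def fake_retriever(question, k=3):
    corpus = {"what is err_quota_429?": ["ERR_QUOTA_429 means the quota is exhausted"]}
    return corpus.get(question.lower(), ["unrelated chunk"])[:k]

gold = [
    ("What is ERR_QUOTA_429?", "quota is exhausted", "keyword"),
    ("How do I speed up indexing?", "batch your embeddings", "conceptual"),
]

print(hit_rate_by_tag(gold, fake_retriever))
# → {'keyword': 1.0, 'conceptual': 0.0}
```

Running it once with retrieve and once with hybrid_retrieve shows which tag moved, and an empty keyword bucket tells you the gold set cannot judge hybrid search yet.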

And the Part 3 reranker still applies afterwards: hybrid widens what makes the pool, the cross-encoder reorders the pool. They stack, and the eval tells you what each layer earned. As always, any specific numbers you see in posts like this — including mine — are illustrative until you’ve measured your own corpus.
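The stacking can be sketched as one small composition: fetch a wide pool, then let the reranker choose the final k. The reranker signature used here, (question, docs) returning docs in best-first order, is an assumption; adapt it to whatever the Part 3 function actually looks like. Both stubs exist only so the sketch runs on its own.

```python
def retrieve_then_rerank(question, retriever, reranker, k=3, pool=20):
    """Fetch a wide candidate pool, then let the reranker pick the final k."""
    candidates = retriever(question, k=pool)
    return reranker(question, candidates)[:k]

# Stub retriever: returns a fixed pool regardless of the question.
def stub_retriever(question, k=3):
    return ["chunk about billing", "chunk about ERR_QUOTA_429", "chunk about auth"][:k]

# Stub reranker: toy relevance = number of question words found in the chunk.
def stub_reranker(question, docs):
    words = question.lower().split()
    return sorted(docs, key=lambda d: -sum(w in d.lower() for w in words))

print(retrieve_then_rerank("ERR_QUOTA_429 meaning", stub_retriever, stub_reranker, k=1))
# → ['chunk about ERR_QUOTA_429']
```

Because the function has the same (question, k) shape once the retriever and reranker are bound, it can be passed to hit_rate as a third retriever to measure what the reranking layer adds.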

Chroma vs pgvector vs Qdrant: An Honest Decision

Sooner or later someone asks why you didn’t pick a different database, so here’s the short version of a long argument:

	Chroma	pgvector	Qdrant
What it is	Embedded library, runs in-process	Postgres extension	Dedicated Rust vector service
Setup cost	`pip install`, zero infra	You already run it (if you run Postgres)	A server to operate
Sweet spot	Local dev, prototypes, small apps	Vectors next to relational data, SQL joins	Millions of vectors, heavy filtering, scale
Watch out for	Not built for high-throughput production	Postgres tuning is on you	Operational overhead you may not need yet

My take: for this series’ scale, Chroma is correct. The moment your app already has Postgres, pgvector is the pragmatic answer — it adds no new system to operate. Qdrant is what you graduate to when retrieval is a real service with real load — not before. Picking the scale-tier database on day one is how you spend your first month on infrastructure instead of retrieval quality. If your corpus is messy PDFs rather than clean text, that’s a parsing problem no database fixes — see the RAGFlow deep-dive.

Quick Recap

Three triggers — persistence, metadata filters, scale — justify a vector database; below them, the list is fine.
Chroma is a drop-in: PersistentClient + a collection, and retrieve() keeps its contract.
Pass your own embeddings to Chroma, or index and query vectors won’t share a space.
Hybrid search = BM25 + vectors fused by RRF — rank-based, no score normalization, ten lines.
Measure with a gold set that contains keyword-style questions, or hybrid will look useless.
Chroma → pgvector → Qdrant is a graduation path, not a day-one menu.

Frequently Asked Questions

When should I move to a vector database? When re-embedding on every run hurts, when you need metadata filters, or when the corpus outgrows a brute-force scan. Not because a tutorial told you to.

What is hybrid search in RAG? BM25 keyword ranking and vector ranking run in parallel, fused by reciprocal rank fusion — exact terms and semantic matches both make the pool.

Do I still need the Part 3 reranker? Yes — hybrid changes what gets retrieved, the reranker changes the order. They fix different failures and stack cleanly.

Is a Python list ever the right store? Absolutely — for learning, prototypes, and small stable corpora it’s simpler, debuggable, and has no moving parts.

Conclusion

Part 4 closes the storage question the series opened deliberately naive. The list taught you what the database does, so you can now adopt a vector database for RAG for reasons instead of fashion. The index persists, filters work, exact-term queries stop falling through — and the four functions from Part 1 are still recognizably the same system.

Which trigger pushed you off the list — persistence, filters, or scale — or are you happily still on it? Tell me in the comments.

Read next: Semantic Chunking & Re-Ranking (Part 3) — the retrieval-quality layer this store now serves.

🧭 Where to go from here

Just joining the series? Part 1 builds the four-function RAG system this post upgrades.
Retrieval still failing on hard questions? Agentic RAG adds a grade-and-retry loop above the store.
Corpus full of messy PDFs? RAGFlow fixes parsing — the failure no vector database can.

Frequently asked questions

When should I move my RAG system to a vector database?

When one of three things happens — the index must survive between runs so you stop re-embedding on every start, you need metadata filtering such as "only search the 2026 docs", or the corpus outgrows what a brute-force scan over a list handles comfortably. Below those triggers, a Python list is a perfectly good store.

What is hybrid search in RAG?

Running two retrievals in parallel — BM25 keyword ranking for exact terms and vector search for meaning — and fusing the two ranked lists with reciprocal rank fusion (RRF). It catches the identifier-style queries that embeddings fumble without giving up semantic matching.

Do I need hybrid search if I already added a reranker?

They fix different failures and stack well. Hybrid search widens what gets retrieved so exact-term matches make the candidate pool at all; the Part 3 reranker reorders that pool more precisely. Add hybrid when your eval fails on queries with rare exact terms.

Should I pick Chroma, pgvector, or Qdrant?

Chroma for local development and small deployments — zero infrastructure. pgvector when your team already runs Postgres and wants vectors next to relational data. Qdrant when you are shipping a dedicated retrieval service at millions-of-vectors scale.


Written by

Sukhveer Kaur, Software Developer & AI Engineer

Sukhveer is a software developer specialising in AI systems and backend engineering. She has hands-on experience designing agentic AI applications, working with large language model pipelines, autonomous agent frameworks, and cloud-native services in Java and Python. At InfoWok, she bridges the gap between cutting-edge AI research and practical implementation — helping developers understand and apply emerging technologies through clear, experience-backed writing.
