Series: RAG in Python: Zero to Production. This is Part 3. Part 1 built a RAG system from four functions; Part 2 fixed the chunking and added a hit-rate eval. Now we break the ceiling that fixed-size chunking hits, using re-ranking and semantic chunking. New here? You need Part 1's retrieve() and Part 2's hit_rate eval to follow along.
In Part 2 you measured your retrieval and tuned the chunk size. Then the number stopped moving. Mine parked at hit rate@3 = 85% and no amount of size-fiddling budged it — because fixed-size chunking has a ceiling, and you don't break it with a bigger dial. Two upgrades do: re-ranking what you retrieved, and semantic chunking that splits on meaning. One of them is a much better first move than the other, and I'll show you which — with the eval to prove it.
- You built and measured the Part 2 RAG system; we reuse its retrieve() and hit_rate
- You know what embeddings are; the embeddings primer covers "find by meaning"
- New to RAG entirely? Start with the RAG explainer
- Fixed-size chunking has a ceiling that a bigger chunk size won't break.
- Re-ranking is the higher-ROI fix — a cross-encoder rescores what you retrieved.
- Semantic chunking splits on meaning, starting a new chunk where similarity drops.
- Add the reranker first; reach for semantic chunking only if the eval still shows a gap.
Two Ceilings Fixed-Size Chunking Can't Break
Part 2's system does two things that quietly cap its accuracy, and neither is about chunk size.
First, the ranking is approximate. The retrieve() from Part 1 compares one question vector against one chunk vector — a bi-encoder setup. It's fast because every chunk was embedded ahead of time, but each vector is a lossy summary, so the top result by cosine isn't always the most relevant chunk. It's a good guess, not a precise judgement.
Second, a fixed window still cuts across ideas. An 1,800-character chunk can start in the middle of one topic and end in another, so its embedding blurs two meanings together. Bigger or smaller windows just move the blur around.
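To make that second ceiling concrete, here is a toy sketch. The `fixed_chunks` helper is a hypothetical stand-in for Part 2's chunker (not its actual code), and the two-sentence document is made up for illustration. With a 60-character window, the first chunk holds a complete refund sentence plus the start of an unrelated shipping sentence.

```python
def fixed_chunks(text, size):
    """Cut text into consecutive windows of `size` characters (no overlap)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Made-up document with two unrelated topics back to back.
doc = (
    "Refunds are issued within 14 days of a return. "
    "Shipping to Canada takes 5 to 7 business days."
)

chunks = fixed_chunks(doc, 60)
for chunk in chunks:
    # The first window ends partway through the shipping sentence.
    print(repr(chunk))
```

A single embedding now has to represent both refunds and shipping, which is the blur described above. Changing `size` moves the cut point but does not align it with the topic boundary.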
The bottom line: you've hit the limit of "retrieve once, trust the cosine order." Breaking it means either judging relevance more carefully after retrieval, or cutting chunks on meaning in the first place.
Re-Ranking: The Higher-ROI Fix (Do This First)
Re-ranking keeps your fast retrieval but adds a second, sharper opinion. You pull a wide pool of candidates cheaply with the bi-encoder, then a cross-encoder reads the question and each candidate together and scores the real relevance. Because it attends across both texts at once, it catches matches the cosine missed.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_reranked(question, store, k=3, pool=20):
    # 1. cheap, wide recall with the bi-encoder from Part 1
    candidates = retrieve(question, store, k=pool)
    # 2. precise: the cross-encoder scores each (question, chunk) pair
    scores = reranker.predict([(question, c) for c in candidates])
    ranked = [c for _, c in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)]
    return ranked[:k]
```

Swap retrieve for retrieve_reranked and nothing else changes: same store, same chunks, same answer step. That's why I reach for the reranker first: it's a drop-in that works on the chunks you already have. The pattern is retrieve-wide-then-rerank (Sentence-Transformers documents it well): the bi-encoder gives cheap recall, the cross-encoder gives precise ranking.
The cross-encoder scores every candidate, so cost scales with pool. Twenty candidates down to three is plenty for most apps; a pool of 100 means five times as many pairs to score, and roughly five times the rerank latency, for a gain you probably can't measure.
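The link between pool size and cost can be checked without loading a model. The sketch below is an assumption-heavy stand-in: `rerank`, `CountingScorer`, and `pairs_scored_for` are hypothetical names, and the word-overlap score only imitates the shape of a cross-encoder call (a list of pairs in, a list of scores out). It counts how many (question, chunk) pairs get scored for a given pool.

```python
def rerank(question, candidates, score_pairs, k=3):
    """Reorder candidates by the scores from `score_pairs` and keep the top k."""
    pairs = [(question, c) for c in candidates]
    scores = score_pairs(pairs)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:k]]


class CountingScorer:
    """Hypothetical stand-in for a cross-encoder: word-overlap score plus a pair counter."""

    def __init__(self):
        self.pairs_scored = 0

    def __call__(self, pairs):
        self.pairs_scored += len(pairs)
        return [len(set(q.lower().split()) & set(c.lower().split())) for q, c in pairs]


def pairs_scored_for(pool):
    """Rerank a synthetic pool of `pool` chunks and report how many pairs were scored."""
    scorer = CountingScorer()
    candidates = [f"chunk number {i} about refunds" for i in range(pool)]
    rerank("how do refunds work", candidates, scorer, k=3)
    return scorer.pairs_scored


for pool in (20, 100):
    print(pool, pairs_scored_for(pool))
```

The number of scored pairs equals the pool size regardless of `k`, so widening the pool is where the rerank cost comes from. Timing your real reranker at a few pool sizes is the reliable way to pick the value.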
Semantic Chunking: Split on Meaning, Not Length
The other lever attacks the problem earlier — at how you cut the text. Semantic chunking embeds adjacent sentences and starts a new chunk wherever their meaning diverges, so each chunk holds one coherent idea instead of a fixed character count.
```python
import re

def semantic_chunks(text, threshold=0.6):
    # naive sentence split: break after ., ! or ? when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    vecs = [embed(s) for s in sentences]  # one embedding call per sentence
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vecs[i - 1], vecs[i]) < threshold:  # meaning shifted → new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))  # flush the final chunk
    return chunks
```

It reuses the embed() and cosine() you already wrote. The one knob is threshold: lower it and you get fewer, longer chunks; raise it and boundaries appear at the smallest topic shift. It costs an embedding call per sentence at index time, which is why it's the second upgrade, not the first: you pay more to build it and the payoff depends on your documents.
Splitting on .!? breaks on abbreviations ("Inc.", "e.g.") and falls apart on tables or code. If your corpus is messy, a reranker on fixed-size chunks will beat a fragile semantic splitter — clean sentence segmentation is a prerequisite, not a given.
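To see the threshold knob move, here is a self-contained toy run. The `embed` and `cosine` below are bag-of-words stand-ins I made up so the example runs without a model; they are not Part 1's functions, and real embeddings produce very different similarity values, so the thresholds here only make sense for this toy. `semantic_chunks` repeats the logic from this section.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for a real embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text, threshold=0.6):
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    vecs = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vecs[i - 1], vecs[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


doc = (
    "Refunds are issued within 14 days. "
    "Refunds are issued to the original card. "
    "Shipping takes 5 days. "
    "Shipping takes longer in winter."
)

for threshold in (0.3, 0.6):
    chunks = semantic_chunks(doc, threshold)
    print(threshold, len(chunks), chunks)
```

With this toy similarity, the lower threshold splits only at the refunds-to-shipping boundary, while the higher one splits between every pair of sentences. On a real corpus it is worth printing the adjacent-sentence similarities first and choosing the threshold from their distribution.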
Measure It: Did the Ceiling Move?
You know the rule from Part 2 — change one thing, re-run the eval, watch the number. On a small hit_rate gold set you'll typically see a pattern like this, one change at a time (illustrative figures — measure your own):
| Setup | hit rate@3 |
|---|---|
| Part 2: fixed-size chunks, plain retrieve | 85% |
| + cross-encoder reranker | 93% |
| + semantic chunking on top | 95% |
The reranker did the heavy lifting for a five-line change; chunking on meaning added a smaller bump for a lot more build cost. That order — reranker first, chunking on meaning only if a gap remains — is the whole recommendation, and your own numbers might rank them differently. That's the point of measuring instead of trusting a blog's defaults, including mine.
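The "change one thing, re-run the eval" loop can be sketched as a small A/B harness. Everything here is an assumption for illustration: this `hit_rate` has a guessed signature and may not match Part 2's, and `first_k`, `by_overlap`, `CHUNKS`, and `GOLD` are made-up toys rather than the series' retrievers or data. In practice you would pass in your plain retrieve and your reranked variant.

```python
def hit_rate(retrieve_fn, gold, k=3):
    """Share of gold questions whose expected snippet appears in the top-k chunks."""
    hits = sum(
        any(expected in chunk for chunk in retrieve_fn(question, k))
        for question, expected in gold
    )
    return hits / len(gold)


# Made-up corpus and gold set, for illustration only.
CHUNKS = [
    "Refunds are issued within 14 days of a return.",
    "Shipping to Canada takes 5 to 7 business days.",
    "Gift cards never expire and cannot be refunded.",
    "Support is open on weekdays from 9 to 5.",
]

GOLD = [
    ("how long do refunds take", "14 days"),
    ("when is support open", "weekdays"),
    ("do gift cards expire", "never expire"),
]


def first_k(question, k):
    """Deliberately weak baseline: ignores the question and returns the first k chunks."""
    return CHUNKS[:k]


def by_overlap(question, k):
    """Toy improvement: rank chunks by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        CHUNKS,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]


# Same gold set, same k, one variable changed: the retriever.
for name, fn in [("first_k", first_k), ("by_overlap", by_overlap)]:
    print(f"{name}: hit rate@1 = {hit_rate(fn, GOLD, k=1):.0%}")
```

Holding the gold set and `k` fixed while swapping only the retriever is what makes the two numbers comparable.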
When to Stop
These two upgrades cover most of the retrieval-quality gap. Past them, the returns get thinner and the cost climbs: hierarchical chunking (retrieve on small chunks, generate from their larger parents), hybrid search (blend keyword BM25 with vectors for exact terms), and a real vector database once you're past a few thousand chunks. Each is worth it only when your eval says the simpler stack has run dry.
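For orientation, the hierarchical idea (match on small chunks, hand their larger parents to the generator) can be sketched in a few lines. This is a bare illustration under my own assumptions: `build_hierarchy` and `parents_for` are hypothetical names, the character-window sizes are arbitrary, and the retrieval step itself is left out.

```python
def build_hierarchy(text, parent_size=200, child_size=50):
    """Cut text into parent windows, then cut each parent into child windows.

    Returns (parents, children), where each child is (child_text, parent_index).
    """
    parents = [text[i:i + parent_size] for i in range(0, len(text), parent_size)]
    children = []
    for p_idx, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            children.append((parent[j:j + child_size], p_idx))
    return parents, children


def parents_for(matched_children, parents):
    """Map retrieved children to their parents, deduplicated, keeping rank order."""
    seen, out = set(), []
    for _, p_idx in matched_children:
        if p_idx not in seen:
            seen.add(p_idx)
            out.append(parents[p_idx])
    return out


# Synthetic text: 200 "a" characters followed by 200 "b" characters.
text = "a" * 200 + "b" * 200
parents, children = build_hierarchy(text)

# Pretend retrieval ranked these three children highest (two share a parent).
top_children = [children[5], children[4], children[0]]
context = parents_for(top_children, parents)
print(len(parents), len(children), len(context))
```

You would embed and search the children, then pass the deduplicated parents to the answer step, so matching stays precise while the generator sees more surrounding context.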
The bottom line: stop upgrading when the hit rate plateaus and your answers are good enough — not when you run out of techniques to add. Over-engineering retrieval is as real a failure as under-building it.
Quick Recap
- Fixed-size chunking plateaus because ranking is approximate and windows cut across ideas.
- Re-ranking with a cross-encoder is the biggest win for the smallest change.
- Semantic chunking splits on meaning — more setup, corpus-dependent payoff.
- Order matters: reranker first, meaning-based chunks only if the eval still shows a gap.
- Stop when the number plateaus, not when you run out of ideas.
Frequently Asked Questions
What is semantic chunking? Splitting text on meaning: embed adjacent sentences and start a new chunk wherever their similarity drops below a threshold, so each chunk holds one idea.
What is re-ranking in RAG? A second scoring pass — a cross-encoder reads the question and each retrieved chunk together and reorders them by true relevance, more accurately than the bi-encoder cosine.
Which do I add first? The reranker. It's a drop-in on your existing chunks and usually the bigger jump; the meaning-based split is the follow-up if a gap remains.
Does re-ranking slow things down? Somewhat — cost scales with the candidate pool, so keep it around 20 and use a GPU or hosted reranker if latency matters.
Is semantic chunking always better? No — fixed-size plus a reranker is the 2026 baseline. Measure before and after; keep it only if the number moves.
Conclusion
Part 3 is where RAG retrieval gets good: a cross-encoder reranker turns approximate cosine order into precise relevance, and semantic chunking cuts text where the meaning actually changes. But the real lesson is the discipline from Part 2 — both upgrades earn their place only when the hit-rate eval says so, and they earn it in a specific order.
What broke your retrieval ceiling — the reranker, chunking on meaning, or something else entirely? Tell me in the comments. If you're just joining, build the base system in Part 1 first, then measure it in Part 2.
Read next: RAG Chunking & Retrieval Quality (Part 2) — the eval this post builds on.
- Missed the measurement step? Part 2 builds the hit_rate eval these upgrades are judged against.
- Retrieval still weak on hard questions? Agentic RAG adds a grade-and-retry loop on top of better chunks and ranking.
- Heading to production? A full evaluation harness extends hit rate to faithfulness and answer quality.

