
Semantic Chunking & Re-Ranking for Better RAG (Part 3)

When fixed-size chunking plateaus, two upgrades break the ceiling — re-ranking retrieved chunks with a cross-encoder, and semantic chunking that splits on meaning. Which to reach for first, with a runnable eval.


Sukhveer Kaur

Published July 2, 2026

5 min read


🧰 New here? Set up your environment first · ~5 min

Install Python 3.11+ — confirm with python3 --version.
Create and activate a virtual environment: python3 -m venv .venv then source .venv/bin/activate (Windows: .venv\Scripts\activate). venv, pip & uv primer →
Install the packages this tutorial lists: pip install -U pip <packages>.
Put your LLM API key in a .env file and never commit it. API key + .env primer →

Full walkthrough → Environment Setup primer

Series: RAG in Python: Zero to Production. This is Part 3. Part 1 built a RAG system from four functions; Part 2 fixed the chunking and added a hit-rate eval. Now we break the ceiling that fixed-size chunking hits, with re-ranking and semantic chunking. New here? You need Part 1's retrieve() and Part 2's hit_rate eval to follow along.

In Part 2 you measured your retrieval and tuned the chunk size. Then the number stopped moving. Mine parked at hit rate@3 = 85% and no amount of size-fiddling budged it — because fixed-size chunking has a ceiling, and you don't break it with a bigger dial. Two upgrades do: re-ranking what you retrieved, and semantic chunking that splits on meaning. One of them is a much better first move than the other, and I'll show you which — with the eval to prove it.

🟡 Intermediate · ⏱️ 12 min read · Stack: Python, the Part 1–2 RAG code, sentence-transformers

✅ Before you start

You built and measured the Part 2 RAG system — we reuse its retrieve() and hit_rate
You know what embeddings are — the embeddings primer covers "find by meaning"
New to RAG entirely? Start with the RAG explainer

🎯 Key takeaways

Fixed-size chunking has a ceiling that a bigger chunk size won't break.
Re-ranking is the higher-ROI fix — a cross-encoder rescores what you retrieved.
Semantic chunking splits on meaning, starting a new chunk where similarity drops.
Add the reranker first; reach for semantic chunking only if the eval still shows a gap.

Two Ceilings Fixed-Size Chunking Can't Break

Part 2's system does two things that quietly cap its accuracy, and neither is about chunk size.

First, the ranking is approximate. The retrieve() from Part 1 compares one question vector against one chunk vector — a bi-encoder setup. It's fast because every chunk was embedded ahead of time, but each vector is a lossy summary, so the top result by cosine isn't always the most relevant chunk. It's a good guess, not a precise judgement.

Second, a fixed window still cuts across ideas. An 1,800-character chunk can start in the middle of one topic and end in another, so its embedding blurs two meanings together. Bigger or smaller windows just move the blur around.
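To make the "cuts across ideas" problem concrete, here is a minimal sketch. The fixed_chunks helper and the two-sentence document are made up for illustration (they are not Part 2's chunker), and the window is shrunk to 60 characters so the effect shows up on a tiny input.

```python
def fixed_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Hypothetical two-topic document: refunds first, then shipping.
doc = (
    "Refunds are issued within 14 days of purchase. "
    "Shipping to Canada takes five business days."
)

chunks = fixed_chunks(doc, size=60)
for n, chunk in enumerate(chunks):
    print(n, repr(chunk))
```

The first window swallows the whole refund sentence plus the start of the shipping one, so a single embedding has to stand for both topics. Changing size moves the cut but does not make it land on the topic boundary.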

The bottom line: you've hit the limit of "retrieve once, trust the cosine order." Breaking it means either judging relevance more carefully after retrieval, or cutting chunks on meaning in the first place.
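For reference, "retrieve once, trust the cosine order" can be sketched as below. Part 1's retrieve() isn't reproduced in this post, so treat this as a stand-in under assumptions: the name retrieve_by_cosine, the store shape (a list of (chunk, vector) pairs) and the three-dimensional toy vectors are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_by_cosine(question_vec, store, k=3):
    """store: list of (chunk_text, chunk_vector) pairs embedded ahead of time."""
    ranked = sorted(store, key=lambda item: cosine(question_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy vectors standing in for real embeddings.
store = [
    ("chunk about refunds", [0.9, 0.1, 0.0]),
    ("chunk about shipping", [0.1, 0.9, 0.0]),
    ("chunk about warranties", [0.6, 0.3, 0.1]),
]
question_vec = [1.0, 0.0, 0.0]

top = retrieve_by_cosine(question_vec, store, k=2)
print(top)
```

Notice that the question text never meets the chunk text here. Only two pre-computed vectors are compared, which is exactly why the order is fast to compute and only approximately right.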

Re-Ranking: The Higher-ROI Fix (Do This First)

Re-ranking keeps your fast retrieval but adds a second, sharper opinion. You pull a wide pool of candidates cheaply with the bi-encoder, then a cross-encoder reads the question and each candidate together and scores the real relevance. Because it attends across both texts at once, it catches matches the cosine missed.

Retrieve then re-rank: a bi-encoder cheaply retrieves the top 20 candidate chunks, a cross-encoder rescores each question-chunk pair for precise relevance, and only the top 3 go to the model — the rerank step static RAG skips

python

from sentence_transformers import CrossEncoder
 
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
 
def retrieve_reranked(question, store, k=3, pool=20):
    # 1. cheap, wide recall with the bi-encoder retrieve() from Part 1
    #    (assumed to return a list of chunk strings)
    candidates = retrieve(question, store, k=pool)
    if not candidates:
        return []  # nothing retrieved, nothing to rerank
    # 2. precise: the cross-encoder scores each (question, chunk) pair
    scores = reranker.predict([(question, c) for c in candidates])
    # 3. sort by score only, highest first, and keep the top k
    ranked = [c for _, c in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)]
    return ranked[:k]

Swap retrieve for retrieve_reranked and nothing else changes — same store, same chunks, same answer step. That's why I reach for the reranker first: it's a drop-in that works on the chunks you already have. The pattern is retrieve-wide-then-rerank (Sentence-Transformers documents it well): the bi-encoder gives cheap recall, the cross-encoder gives precise ranking.

💡 Keep the pool small

The cross-encoder scores every candidate, so cost scales with pool. Twenty candidates down to three is plenty for most apps. A pool of 100 costs roughly five times the rerank latency of a pool of 20, for a gain you probably can't measure.
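A rough way to see the scaling without downloading a model: the sketch below swaps the cross-encoder for a hypothetical CountingScorer that scores by shared words and counts how many pairs it was asked to score. The rerank helper, the scorer and the filler corpus are all assumptions for illustration, and pair count is only a proxy for latency (real timings depend on batching and hardware).

```python
def rerank(question, candidates, score_pairs, k=3):
    """Rescore candidates with `score_pairs` and keep the best k."""
    if not candidates:
        return []
    scores = score_pairs([(question, c) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:k]]

class CountingScorer:
    """Stand-in for a cross-encoder: counts pairs, scores by shared words."""

    def __init__(self):
        self.pairs_scored = 0

    def __call__(self, pairs):
        self.pairs_scored += len(pairs)
        return [len(set(q.lower().split()) & set(c.lower().split())) for q, c in pairs]

question = "how long do refunds take"
# Hypothetical corpus: 99 filler chunks and one relevant chunk at the end.
corpus = [f"filler chunk number {i}" for i in range(99)] + ["refunds take 14 days"]

for pool in (20, 100):
    scorer = CountingScorer()
    top = rerank(question, corpus[-pool:], scorer, k=3)
    print(pool, scorer.pairs_scored, top[0])
```

The work grows in step with pool while the top result stays the same, which is the case for keeping the pool modest and letting your eval decide whether a bigger one earns its latency.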

Semantic Chunking: Split on Meaning, Not Length

The other lever attacks the problem earlier — at how you cut the text. Semantic chunking embeds adjacent sentences and starts a new chunk wherever their meaning diverges, so each chunk holds one coherent idea instead of a fixed character count.

Semantic chunking: adjacent sentences are embedded and the similarity between each pair is checked; where it drops below the threshold, sentences 1 to 3 form one chunk and sentence 4 begins another

python

import re
 
def semantic_chunks(text, threshold=0.6):
    # naive split after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []  # empty input: no chunks, no embedding calls
    vecs = [embed(s) for s in sentences]   # embed() from Part 1
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vecs[i - 1], vecs[i]) < threshold:   # meaning shifted → new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))   # flush the last chunk
    return chunks

It reuses the embed() and cosine() you already wrote. The one knob is threshold: lower it and you get fewer, longer chunks; raise it and boundaries appear at the smallest topic shift. It costs an embedding call per sentence at index time, which is why it's the second upgrade, not the first — you pay more to build it and the payoff depends on your documents.
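To get a feel for the threshold knob before paying for real embeddings, here is a self-contained sketch. The embed() and cosine() below are toy bag-of-words stand-ins, not the Part 1 versions, and the four-sentence text is invented, so the absolute similarity values mean nothing outside this example.

```python
import math
import re
from collections import Counter

def embed(sentence):
    """Toy stand-in for the real embed(): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(text, threshold):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    vecs = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vecs[i - 1], vecs[i]) < threshold:  # similarity dropped: new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

# Hypothetical text: two sentences on refunds, then two on shipping.
text = (
    "Refunds are issued within 14 days. "
    "Refunds are issued to the original card. "
    "Shipping takes five days. "
    "Shipping is free over fifty dollars."
)

for threshold in (0.1, 0.3, 0.9):
    print(threshold, len(semantic_chunks(text, threshold)))
```

On this toy text the chunk count climbs as the threshold rises: the lowest setting splits only at the refunds-to-shipping boundary, and the highest puts every sentence in its own chunk. With real embeddings the useful range will differ, so sweep it against your eval rather than copying these values.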

⚠️ It's only as good as your sentences

Splitting on .!? breaks on abbreviations ("Inc.", "e.g.") and falls apart on tables or code. If your corpus is messy, a reranker on fixed-size chunks will beat a fragile semantic splitter — clean sentence segmentation is a prerequisite, not a given.
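Here is the failure in miniature, plus one possible patch. The ABBREVIATIONS tuple and merged_sentences helper are hypothetical and deliberately tiny. This is a sketch of the idea, not a replacement for a proper sentence segmenter.

```python
import re

SPLIT = re.compile(r"(?<=[.!?])\s+")  # same pattern semantic_chunks uses

# Hypothetical, deliberately short list; a real corpus needs many more entries.
ABBREVIATIONS = ("Inc.", "e.g.", "i.e.", "Dr.", "vs.")

def naive_sentences(text):
    """Split after ., ! or ? followed by whitespace."""
    return SPLIT.split(text.strip())

def merged_sentences(text):
    """Re-join pieces that were split right after a known abbreviation."""
    merged = []
    for piece in naive_sentences(text):
        if merged and merged[-1].endswith(ABBREVIATIONS):
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    return merged

text = "Acme Inc. shipped v2. It works, e.g. on Linux."
print(naive_sentences(text))   # four fragments from two sentences
print(merged_sentences(text))  # back to two sentences
```

The patch has its own blind spot: a sentence that genuinely ends in "Inc." gets glued to the next one. That trade-off is why messy corpora usually call for a dedicated segmenter, or for staying on fixed-size chunks with a reranker.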

Measure It: Did the Ceiling Move?

You know the rule from Part 2 — change one thing, re-run the eval, watch the number. On a small hit_rate gold set you'll typically see a pattern like this, one change at a time (illustrative figures — measure your own):

Setup	hit rate@3
Part 2: fixed-size chunks, plain retrieve	85%
+ cross-encoder reranker	93%
+ semantic chunking on top	95%

The reranker did the heavy lifting for a five-line change; chunking on meaning added a smaller bump for a lot more build cost. That order — reranker first, chunking on meaning only if a gap remains — is the whole recommendation, and your own numbers might rank them differently. That's the point of measuring instead of trusting a blog's defaults, including mine.
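If you want the comparison loop in one place, a minimal sketch follows. Part 2's hit_rate isn't reproduced in this post, so hit_rate_at_k below is an assumed shape (a retriever callable plus a gold list of question and expected-text pairs), and the corpus, gold set and both toy retrievers are hypothetical. The percentages it prints describe the toy data only.

```python
def hit_rate_at_k(retriever, gold, k=3):
    """Fraction of questions whose expected text appears in the top-k chunks."""
    hits = 0
    for question, expected in gold:
        top = retriever(question, k)
        if any(expected in chunk for chunk in top):
            hits += 1
    return hits / len(gold)

# Hypothetical toy corpus and gold set, only to make the sketch runnable.
CHUNKS = [
    "Refunds are issued within 14 days.",
    "Shipping takes five business days.",
    "Warranty covers parts for two years.",
    "Returns need an original receipt.",
]
GOLD = [
    ("how long do refunds take", "14 days"),
    ("how long is the warranty", "two years"),
]

def first_k(question, k):
    # Baseline stand-in: ignores the question, returns chunks in stored order.
    return CHUNKS[:k]

def overlap_k(question, k):
    # Stand-in for a sharper ranker: order chunks by words shared with the question.
    words = set(question.lower().split())

    def shared(chunk):
        return len(words & set(chunk.lower().rstrip(".").split()))

    return sorted(CHUNKS, key=shared, reverse=True)[:k]

for name, retriever in [("baseline", first_k), ("sharper ranker", overlap_k)]:
    print(f"{name}: hit rate@1 = {hit_rate_at_k(retriever, GOLD, k=1):.0%}")
```

To use it for real, wrap retrieve and retrieve_reranked so each takes (question, k), keep the gold set fixed, and change one thing per run.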

When to Stop

These two upgrades cover most of the retrieval-quality gap. Past them, the returns get thinner and the cost climbs: hierarchical chunking (retrieve on small chunks, generate from their larger parents), hybrid search (blend keyword BM25 with vectors for exact terms), and a real vector database once you're past a few thousand chunks. Each is worth it only when your eval says the simpler stack has run dry.

The bottom line: stop upgrading when the hit rate plateaus and your answers are good enough — not when you run out of techniques to add. Over-engineering retrieval is as real a failure as under-building it.

Quick Recap

Fixed-size chunking plateaus because ranking is approximate and windows cut across ideas.
Re-ranking (rescoring with a cross-encoder) is the biggest win for the smallest change.
Semantic chunking splits on meaning — more setup, corpus-dependent payoff.
Order matters: reranker first, meaning-based chunks only if the eval still shows a gap.
Stop when the number plateaus, not when you run out of ideas.

Frequently Asked Questions

What is semantic chunking? Splitting text on meaning: embed adjacent sentences and start a new chunk wherever their similarity drops below a threshold, so each chunk holds one idea.

What is re-ranking in RAG? A second scoring pass — a cross-encoder reads the question and each retrieved chunk together and reorders them by true relevance, more accurately than the bi-encoder cosine.

Which do I add first? The reranker. It's a drop-in on your existing chunks and usually the bigger jump; the meaning-based split is the follow-up if a gap remains.

Does re-ranking slow things down? Somewhat — cost scales with the candidate pool, so keep it around 20 and use a GPU or hosted reranker if latency matters.

Is semantic chunking always better? No — fixed-size plus a reranker is the 2026 baseline. Measure before and after; keep it only if the number moves.

Conclusion

Part 3 is where RAG retrieval gets good: a cross-encoder reranker turns approximate cosine order into precise relevance, and semantic chunking cuts text where the meaning actually changes. But the real lesson is the discipline from Part 2 — both upgrades earn their place only when the hit-rate eval says so, and they earn it in a specific order.

What broke your retrieval ceiling — the reranker, chunking on meaning, or something else entirely? Tell me in the comments. If you're just joining, build the base system in Part 1 first, then measure it in Part 2.

Read next: RAG Chunking & Retrieval Quality (Part 2) — the eval this post builds on.

🧭 Where to go from here

Missed the measurement step? Part 2 builds the hit_rate eval these upgrades are judged against.
Retrieval still weak on hard questions? Agentic RAG adds a grade-and-retry loop on top of better chunks and ranking.
Heading to production? A full evaluation harness extends hit rate to faithfulness and answer quality.



