# Semantic Chunking & Re-Ranking for Better RAG (Part 3)

> When fixed-size chunking plateaus, two upgrades break the ceiling — re-ranking retrieved chunks with a cross-encoder, and semantic chunking that splits on meaning. Which to reach for first, with a runnable eval.

*Source: https://www.infowok.com/semantic-chunking-reranking-rag-part-3/ · Sukhveer Kaur · Published July 2, 2026*

---

> **Series: RAG in Python: Zero to Production**
> This is Part 3. [Part 1](/build-a-rag-system-in-python-part-1/) built a RAG system from four functions; [Part 2](/rag-chunking-retrieval-quality-part-2/) fixed the chunking and added a hit-rate eval. Now we break the ceiling that fixed-size chunking hits — with re-ranking and semantic chunking.
> New here? You need Part 1's `retrieve()` and Part 2's `hit_rate` eval to follow along.

In Part 2 you measured your retrieval and tuned the chunk size. Then the number stopped moving. Mine parked at **hit rate@3 = 85%** and no amount of size-fiddling budged it — because fixed-size chunking has a ceiling, and you don't break it with a bigger dial. Two upgrades do: **re-ranking** what you retrieved, and **semantic chunking** that splits on meaning. One of them is a much better first move than the other, and I'll show you which — with the eval to prove it.

<Prerequisites>

- You built and measured the [Part 2 RAG system](/rag-chunking-retrieval-quality-part-2/) — we reuse its `retrieve()` and `hit_rate`
- You know what embeddings are — the [embeddings primer](/embeddings-vector-search-primer/) covers "find by meaning"
- New to RAG entirely? Start with the [RAG explainer](/what-is-rag-complete-guide-2026/)

</Prerequisites>

<KeyTakeaways>

- **Fixed-size chunking has a ceiling** that a bigger chunk size won't break.
- **Re-ranking is the higher-ROI fix** — a cross-encoder rescores what you retrieved.
- **Semantic chunking splits on meaning,** starting a new chunk where similarity drops.
- **Add the reranker first;** reach for semantic chunking only if the eval still shows a gap.

</KeyTakeaways>

## Two Ceilings Fixed-Size Chunking Can't Break

Part 2's system does two things that quietly cap its accuracy, and neither is about chunk size.

First, **the ranking is approximate.** The `retrieve()` from Part 1 compares one question vector against one chunk vector — a *bi-encoder* setup. It's fast because every chunk was embedded ahead of time, but each vector is a lossy summary, so the top result by cosine isn't always the most relevant chunk. It's a good guess, not a precise judgement.

Second, **a fixed window still cuts across ideas.** An 1,800-character chunk can start in the middle of one topic and end in another, so its embedding blurs two meanings together. Bigger or smaller windows just move the blur around.

**The bottom line: you've hit the limit of "retrieve once, trust the cosine order."** Breaking it means either judging relevance more carefully after retrieval, or cutting chunks on meaning in the first place.
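To make "trust the cosine order" concrete, here is a minimal sketch of bi-encoder ranking. The three-dimensional vectors and chunk ids are made up for illustration (a real `embed()` returns hundreds of dimensions), and `rank_by_cosine` is a hypothetical helper, not Part 1's `retrieve()`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_by_cosine(question_vec, chunk_vecs):
    """Return chunk ids sorted by similarity to the question, best first."""
    scored = [(cosine(question_vec, vec), cid) for cid, vec in chunk_vecs.items()]
    return [cid for _, cid in sorted(scored, key=lambda x: x[0], reverse=True)]

# ASSUMPTION: toy vectors, chosen only to show how close the top scores can be
question = [1.0, 0.2, 0.0]
chunks = {
    "chunk-a": [0.9, 0.3, 0.1],
    "chunk-b": [0.95, 0.2, 0.0],
    "chunk-c": [0.0, 0.1, 1.0],
}
order = rank_by_cosine(question, chunks)
gap = cosine(question, chunks["chunk-b"]) - cosine(question, chunks["chunk-a"])
```

Here `chunk-b` edges out `chunk-a` by a cosine gap of under 0.02. The ordering is purely geometric: nothing in it says which of the two actually answers the question, and that is the judgement a second pass has to supply.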

## Re-Ranking: The Higher-ROI Fix (Do This First)

Re-ranking keeps your fast retrieval but adds a second, sharper opinion. You pull a **wide pool** of candidates cheaply with the bi-encoder, then a **cross-encoder** reads the question and each candidate *together* and scores the real relevance. Because it attends across both texts at once, it catches matches the cosine missed.

![Retrieve then re-rank: a bi-encoder cheaply retrieves the top 20 candidate chunks, a cross-encoder rescores each question-chunk pair for precise relevance, and only the top 3 go to the model — the rerank step static RAG skips](./semantic-chunking-reranking-rag-part-3-rerank.svg)

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_reranked(question, store, k=3, pool=20):
    # 1. cheap, wide recall with the bi-encoder from Part 1
    candidates = retrieve(question, store, k=pool)
    # 2. precise: the cross-encoder scores each (question, chunk) pair
    scores = reranker.predict([(question, c) for c in candidates])
    ranked = [c for _, c in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)]
    return ranked[:k]
```

Swap `retrieve` for `retrieve_reranked` and nothing else changes — same store, same chunks, same answer step. **That's why I reach for the reranker first: it's a drop-in that works on the chunks you already have.** The pattern is retrieve-wide-then-rerank ([Sentence-Transformers documents it well](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)): the bi-encoder gives cheap recall, the cross-encoder gives precise ranking.
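If you want to test the retrieve-then-rerank flow without downloading a model, one option is to pass the scorer in as an argument. The sketch below assumes candidates are plain chunk strings, as the code above implies. `rerank` and `overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in with the same input shape as `reranker.predict`, not a real relevance model.

```python
def rerank(question, candidates, score_pairs, k=3):
    """Reorder candidates by a pairwise scorer and keep the top k.

    score_pairs takes a list of (question, chunk) tuples and returns
    one score per pair.
    """
    if not candidates:
        return []
    scores = score_pairs([(question, c) for c in candidates])
    # Sort indices rather than (score, text) tuples so ties keep retrieval order
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:k]]

def overlap_score(pairs):
    """Stand-in scorer: fraction of question words that appear in the chunk."""
    out = []
    for q, c in pairs:
        q_words = set(q.lower().split())
        c_words = set(c.lower().split())
        out.append(len(q_words & c_words) / max(len(q_words), 1))
    return out

candidates = [
    "Shipping takes five business days.",
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
]
top = rerank("how many days for refunds", candidates, overlap_score, k=2)
```

In a real pipeline you would pass `reranker.predict` as `score_pairs`; the stand-in just keeps the test fast and deterministic.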

<Callout type="tip" title="Keep the pool small">
The cross-encoder scores every candidate, so cost scales roughly linearly with `pool`. Twenty candidates down to three is plenty for most apps; a pool of 100 means about five times the rerank work for a gain you probably can't measure.
</Callout>
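Rather than guessing at the cost, you can time the scoring pass at a few pool sizes. This is a rough sketch: `time_rerank` is a hypothetical helper, and the dummy scorer exists only so the snippet runs without a model. Pass `reranker.predict` instead to measure your own setup.

```python
import time

def time_rerank(score_pairs, question, candidates, pools=(5, 20, 100)):
    """Time one scoring pass per pool size; returns {pool: seconds}."""
    timings = {}
    for pool in pools:
        pairs = [(question, c) for c in candidates[:pool]]
        start = time.perf_counter()
        score_pairs(pairs)
        timings[pool] = time.perf_counter() - start
    return timings

# ASSUMPTION: placeholder scorer and chunks, for illustration only
def dummy_scorer(pairs):
    return [0.0 for _ in pairs]

fake_chunks = [f"chunk number {i}" for i in range(100)]
timings = time_rerank(dummy_scorer, "example question", fake_chunks)
```

Single runs are noisy, so repeat the measurement a few times and compare medians before settling on a pool size.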

## Semantic Chunking: Split on Meaning, Not Length

The other lever attacks the problem earlier — at how you cut the text. **Semantic chunking** embeds adjacent sentences and starts a new chunk wherever their meaning diverges, so each chunk holds one coherent idea instead of a fixed character count.

![Semantic chunking: adjacent sentences are embedded and the similarity between each pair is checked; where it drops below the threshold, sentences 1 to 3 form one chunk and sentence 4 begins another](./semantic-chunking-reranking-rag-part-3-semantic.svg)

```python
import re

def semantic_chunks(text, threshold=0.6):
    # embed() and cosine() are the helpers from Part 1
    if not text.strip():
        return []   # nothing to chunk
    # naive sentence split: break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    vecs = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vecs[i - 1], vecs[i]) < threshold:   # meaning shifted → new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))   # flush the last chunk
    return chunks
```

It reuses the `embed()` and `cosine()` you already wrote. The one knob is `threshold`: lower it and you get fewer, longer chunks; raise it and boundaries appear at the smallest topic shift. **It costs an embedding call per sentence at index time**, which is why it's the second upgrade, not the first — you pay more to build it and the payoff depends on your documents.
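To see how the `threshold` knob behaves, you can sweep it over a short text and count the chunks. The sketch below is self-contained, so it swaps in `toy_embed`, a hypothetical bag-of-words stand-in for your real `embed()`. The similarity values it produces are not comparable to a real embedding model's, so treat the thresholds as illustrative.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """ASSUMPTION: stand-in for embed(), a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(text, threshold, embed=toy_embed):
    """Same splitting logic as above, with the embedder passed in."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    vecs = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vecs[i - 1], vecs[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

text = (
    "Refunds take five days. Refunds over fifty dollars take ten days. "
    "Our office opens at nine. Our office closes at six."
)
# Number of chunks produced at each threshold
counts = {t: len(semantic_chunks(text, t)) for t in (0.0, 0.3, 0.9)}
```

On this toy text the sweep yields 1, 2, and 4 chunks: at 0.3 the only boundary falls between the refund sentences and the office sentences, while at 0.9 every sentence becomes its own chunk. Running the same sweep on a sample of your own corpus is a cheap way to pick a starting threshold before you re-run the eval.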

<Callout type="warning" title="It's only as good as your sentences">
Splitting on `.!?` breaks on abbreviations ("Inc.", "e.g.") and falls apart on tables or code. If your corpus is messy, a reranker on fixed-size chunks will beat a fragile semantic splitter — clean sentence segmentation is a prerequisite, not a given.
</Callout>
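One cheap mitigation for the abbreviation problem is to rejoin pieces that end in a known abbreviation. This is a sketch under stated assumptions: `split_sentences` and the `ABBREVIATIONS` set are hypothetical, the list is deliberately short, and it does nothing for tables or code.

```python
import re

# ASSUMPTION: a small illustrative list; extend it for your corpus
ABBREVIATIONS = {"inc.", "e.g.", "i.e.", "dr.", "mr.", "ms.", "vs.", "etc."}

def split_sentences(text):
    """Split after .!? plus whitespace, but keep going after known abbreviations."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences, buffer = [], ""
    for piece in pieces:
        if not piece:
            continue
        buffer = f"{buffer} {piece}" if buffer else piece
        last_word = buffer.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # probably not a real sentence end; keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)  # text ended on an abbreviation
    return sentences

parts = split_sentences("Acme Inc. ships tools, e.g. hammers. Orders arrive fast.")
```

The trade-off is that a sentence genuinely ending in "etc." gets merged with the next one. If that matters for your documents, a dedicated sentence segmenter is likely a better fit than a growing abbreviation list.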

## Measure It: Did the Ceiling Move?

You know the rule from Part 2 — **change one thing, re-run the eval, watch the number.** On a small `hit_rate` gold set you'll typically see a pattern like this, one change at a time (illustrative figures — measure your own):

| Setup | hit rate@3 |
| --- | --- |
| Part 2: fixed-size chunks, plain retrieve | 85% |
| **+ cross-encoder reranker** | **93%** |
| + semantic chunking on top | 95% |

The reranker did the heavy lifting for a five-line change; chunking on meaning added a smaller bump for a lot more build cost. **That order (reranker first, chunking on meaning only if a gap remains) is the whole recommendation, though your own numbers might rank them differently.** That's the point of measuring instead of trusting a blog's defaults, including mine.
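A small harness makes the "one change at a time" comparison repeatable. The sketch below is an assumption about shape, not Part 2's actual code: `hit_rate_at_k` and `compare` are hypothetical names, the gold set is taken to be `(question, expected_substring)` pairs, and the two toy retrievers only stand in for `retrieve` and `retrieve_reranked`.

```python
def hit_rate_at_k(retriever, gold, k=3):
    """Share of gold questions whose expected text appears in the top-k chunks.

    ASSUMPTION: gold is a list of (question, expected_substring) pairs and
    retriever(question, k) returns a list of chunk strings.
    """
    if not gold:
        return 0.0
    hits = 0
    for question, expected in gold:
        chunks = retriever(question, k)
        if any(expected in chunk for chunk in chunks):
            hits += 1
    return hits / len(gold)

def compare(setups, gold, k=3):
    """Run the same gold set against each named retriever."""
    return {name: hit_rate_at_k(fn, gold, k) for name, fn in setups.items()}

# Toy corpus and retrievers, for illustration only
corpus = ["Refunds take 14 days.", "Shipping takes 5 days.", "Support is open 9 to 5."]

def first_k(question, k):
    return corpus[:k]

def last_k(question, k):
    return corpus[-k:]

gold = [
    ("How long do refunds take?", "14 days"),
    ("How long does shipping take?", "5 days"),
]
results = compare({"first_k": first_k, "last_k": last_k}, gold, k=2)
# With the real system you might pass wrappers such as:
#   lambda q, k: retrieve(q, store, k=k)
#   lambda q, k: retrieve_reranked(q, store, k=k)
```

Keeping every setup behind the same `(question, k)` signature means the gold set and `k` stay fixed while only the retriever changes, which is what makes the rows of a table like the one above comparable.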

## When to Stop

These two upgrades cover most of the retrieval-quality gap. Past them, the returns get thinner and the cost climbs: **hierarchical chunking** (retrieve on small chunks, generate from their larger parents), **hybrid search** (blend keyword BM25 with vectors for exact terms), and a real **vector database** once you're past a few thousand chunks. Each is worth it only when your eval says the simpler stack has run dry.
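For a sense of what the first of those involves, here is a minimal sketch of the small-to-parent idea behind hierarchical chunking: index small children, then hand the model the larger parents they came from. `build_parent_index` and `parents_for` are hypothetical helpers, the fixed-size child split is an assumption, and the retrieval step that produces the hit indices is left out.

```python
def build_parent_index(parents, child_size=200):
    """Split each parent chunk into fixed-size children and remember the parent id."""
    children = []  # list of (child_text, parent_id)
    for pid, parent in enumerate(parents):
        for start in range(0, len(parent), child_size):
            children.append((parent[start:start + child_size], pid))
    return children

def parents_for(hit_indices, children, parents):
    """Map retrieved child hits back to unique parents, preserving rank order."""
    seen, out = set(), []
    for idx in hit_indices:
        pid = children[idx][1]
        if pid not in seen:
            seen.add(pid)
            out.append(parents[pid])
    return out

# Toy data: two parents of 450 and 300 characters
parents = ["A" * 450, "B" * 300]
children = build_parent_index(parents, child_size=200)
# ASSUMPTION: pretend retrieval ranked children 4, 0, 1 in that order
context = parents_for([4, 0, 1], children, parents)
```

Two of the three hits share a parent, so the model receives two parent chunks rather than three overlapping fragments. Whether that helps is, again, a question for the eval.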

**The bottom line: stop upgrading when the hit rate plateaus and your answers are good enough — not when you run out of techniques to add.** Over-engineering retrieval is as real a failure as under-building it.

## Quick Recap

- **Fixed-size chunking plateaus** because ranking is approximate and windows cut across ideas.
- **Re-ranking** with a cross-encoder is the biggest win for the smallest change.
- **Semantic chunking** splits on meaning — more setup, corpus-dependent payoff.
- **Order matters:** reranker first, meaning-based chunks only if the eval still shows a gap.
- **Stop when the number plateaus,** not when you run out of ideas.

## Frequently Asked Questions

**What is semantic chunking?** Splitting text on meaning: embed adjacent sentences and start a new chunk wherever their similarity drops below a threshold, so each chunk holds one idea.

**What is re-ranking in RAG?** A second scoring pass — a cross-encoder reads the question and each retrieved chunk together and reorders them by true relevance, more accurately than the bi-encoder cosine.

**Which do I add first?** The reranker. It's a drop-in on your existing chunks and usually the bigger jump; the meaning-based split is the follow-up if a gap remains.

**Does re-ranking slow things down?** Somewhat — cost scales with the candidate pool, so keep it around 20 and use a GPU or hosted reranker if latency matters.

**Is semantic chunking always better?** No — fixed-size plus a reranker is the 2026 baseline. Measure before and after; keep it only if the number moves.

## Conclusion

Part 3 is where RAG retrieval gets good: a cross-encoder reranker turns approximate cosine order into precise relevance, and semantic chunking cuts text where the meaning actually changes. But the real lesson is the discipline from Part 2 — both upgrades earn their place only when the hit-rate eval says so, and they earn it in a specific order.

**What broke your retrieval ceiling — the reranker, chunking on meaning, or something else entirely?** Tell me in the comments. If you're just joining, build the base system in Part 1 first, then measure it in Part 2.

**Read next: [RAG Chunking & Retrieval Quality (Part 2)](/rag-chunking-retrieval-quality-part-2/)** — the eval this post builds on.

<NextSteps>

- **Missed the measurement step?** [Part 2](/rag-chunking-retrieval-quality-part-2/) builds the `hit_rate` eval these upgrades are judged against.
- **Retrieval still weak on hard questions?** [Agentic RAG](/agentic-rag-vs-static-rag-2026/) adds a grade-and-retry loop on top of better chunks and ranking.
- **Heading to production?** A full [evaluation harness](/ai-agent-evaluation-metrics-frameworks-2026/) extends hit rate to faithfulness and answer quality.

</NextSteps>
