Same question, same document, two very different answers. Feed a PDF with a pricing table to a hand-built RAG and it will confidently report a tier that doesn't exist — because the table gets flattened into a jumbled line of numbers before it is ever embedded. That's the failure RAGFlow is built to fix: it reads the table as rows and columns instead of shredding it. This post is about that gap — why real PDFs break roll-your-own RAG, and how RAGFlow's deep parsing closes it.
Prerequisites
- You know what RAG is: the RAG explainer covers the retrieve-augment-generate core
- Ideally you've built one by hand: the from-scratch RAG series is the baseline this graduates from
- Docker installed, with a few gigabytes of free RAM and disk
Key takeaways
- Bad RAG answers on PDFs are a parsing problem, not a model problem.
- RAGFlow leads with deep document understanding: it reads tables and layout instead of flattening them.
- You self-host it with Docker in about fifteen minutes, no code required to start.
- Reach for it on messy real documents; keep rolling your own for clean text and full control.
Why Your RAG Returns Garbage on Real PDFs
A from-scratch RAG system reads a PDF as one long string. That's fine for prose and a disaster for anything structured. A table becomes a run-on of cell values with no rows or columns, a two-column layout interleaves into gibberish, and a scanned page returns nothing at all. The embedding faithfully captures the meaning of that nonsense — and the model answers from it.
The bottom line: on real documents the retrieval was doomed at parse time, long before the model ran. You can't out-prompt a chunk that never held the answer in a readable form. That's the specific failure RAGFlow was built to fix.
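To make the failure concrete, here is a minimal sketch in Python. The pricing table, the 24-character chunk size, and the function names are all made up for illustration; this is not RAGFlow code, and a real extractor garbles text in messier ways.

```python
# A toy pricing table, as a structure-aware parser would see it.
HEADER = ["Plan", "Monthly", "Annual"]
ROWS = [
    ["Starter", "$10", "$100"],
    ["Pro", "$30", "$300"],
    ["Team", "$80", "$800"],
]


def flatten(header, rows):
    """What a naive text extractor yields: every cell in one run-on string."""
    cells = header + [cell for row in rows for cell in row]
    return " ".join(cells)


def fixed_size_chunks(text, size):
    """Blind fixed-size chunking by character count."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def row_chunks(header, rows):
    """Structure-preserving alternative: one chunk per row, cells labelled."""
    return ["; ".join(f"{h}: {c}" for h, c in zip(header, row)) for row in rows]


flat = flatten(HEADER, ROWS)
naive = fixed_size_chunks(flat, 24)
structured = row_chunks(HEADER, ROWS)

for chunk in naive:
    print("naive:     ", repr(chunk))
for chunk in structured:
    print("structured:", repr(chunk))
```

With these toy values the fixed-size split cuts the Pro plan's annual price in two, so no naive chunk contains "$300" intact, while every row-based chunk still says which plan and which column each number belongs to.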
RAGFlow in ~15 Minutes: Install and Ingest
RAGFlow is a self-hosted engine (open-source on GitHub), so you run it with Docker rather than pip install. The quickstart is a clone and a compose-up:
```bash
git clone https://github.com/infiniflow/ragflow.git
cd ragflow/docker
docker compose -f docker-compose.yml up -d
# then open http://localhost and create your first knowledge base
```
From the web UI you create a knowledge base, upload a few PDFs, and watch it parse them. The thing to look at is the parsed preview: RAGFlow shows you how it segmented each document, tables and all, before you ask a single question. That preview is the whole value proposition made visible.
RAGFlow runs several services (a document engine, a store, and the API), so give Docker plenty of RAM. If containers keep restarting on boot, the cause is almost always memory, not configuration. A cheap 4 GB VPS is an optimistic floor: the project README lists 16 GB of RAM as its minimum, so check the current requirements before you size a host.
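The stack can take a while to come up after `docker compose up -d`, so a small polling helper saves some browser refreshing. This is a sketch using only the Python standard library; `http_ok` and `wait_until_up` are hypothetical helper names, and the URL, attempt count, and delay are assumptions to adjust for your host.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=3.0):
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP error status, malformed reply.
        return False


def wait_until_up(url, attempts=30, delay=2.0, probe=http_ok, sleep=time.sleep):
    """Poll until the probe succeeds.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. `probe` and `sleep` are injectable so the loop can be
    exercised without a network.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        sleep(delay)
    return None
```

Calling `wait_until_up("http://localhost")` after the compose command returns as soon as the web UI responds. If it returns `None`, look at the container logs for restarts before you suspect configuration.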
Deep Document Understanding: The Part That Fixes Accuracy
The feature that earns RAGFlow its ~78,000 GitHub stars is deep document understanding. Instead of dumping text, it runs layout and table models that recognise a document's structure — headers, columns, and especially tables — and keeps that structure intact through chunking.
- Table extraction keeps cells addressable, so "Pro plan, monthly price" survives as a fact instead of a stray number.
- Layout awareness reads multi-column pages and figures in the right order.
- Template-driven chunking picks a splitting strategy per document type rather than a blind fixed size — the automated version of the chunking work you'd otherwise do by hand.
The bottom line: RAGFlow spends its effort before retrieval, on getting the text right — which is exactly where hand-built pipelines cut corners.
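The idea behind template-driven chunking can be sketched as a lookup from document type to splitting strategy. Everything here is an assumption for illustration: the template names (`prose`, `table`, `qa`), the function names, and the 200-character limit are hypothetical and are not RAGFlow's template identifiers or internals.

```python
import re


def chunk_prose(text, max_chars=200):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_table(table):
    """One chunk per row, each cell labelled with its column header."""
    header, *rows = table
    return [", ".join(f"{h}={c}" for h, c in zip(header, row)) for row in rows]


def chunk_qa(pairs):
    """Keep each question together with its answer."""
    return [f"Q: {q}\nA: {a}" for q, a in pairs]


# Hypothetical template registry: document type -> splitting strategy.
TEMPLATES = {"prose": chunk_prose, "table": chunk_table, "qa": chunk_qa}


def chunk_document(kind, content):
    """Dispatch to the strategy registered for this document type."""
    try:
        strategy = TEMPLATES[kind]
    except KeyError:
        raise ValueError(f"no chunking template for {kind!r}") from None
    return strategy(content)


print(chunk_document("table", [["Plan", "Monthly"], ["Pro", "$30"]]))
```

The design point is that the split boundary follows the document's own units (sentence, row, question) rather than a character count, which is the part a blind fixed-size splitter cannot do.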
Hybrid Retrieval, Reranking, and the Agentic Layer
Good parsing feeds better retrieval, and RAGFlow doesn't stop at vector search. It runs two-way retrieval — full-text keyword search alongside vector similarity — so exact terms like part numbers aren't lost to fuzzy matching, then reranks the merged results for precision. It's the retrieve-wide-then-rerank pattern from Part 3, built in.
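One common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below uses it as a stand-in: this post does not say which fusion method RAGFlow applies, and the document ids and the constant `k=60` are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document ids into one list, best first.

    Each list contributes 1 / (k + rank) per document, so a document that
    ranks well in either list stays near the top of the merged result.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, then by id so ties come out in a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Keyword search nails the exact part number; vector search ranks it third.
keyword_hits = ["part-AX42", "pricing", "faq"]
vector_hits = ["pricing", "overview", "part-AX42"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['pricing', 'part-AX42', 'overview', 'faq']
```

The exact-match document survives near the top even though the vector side ranked it last. A reranker would then rescore this short merged list against the query for precision.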
Since the v0.8 release, RAGFlow has included an agentic layer: a graph-based, no-code workflow editor with prebuilt pipeline templates such as Retrieve → Rerank → Answer, plus deep-research and multi-agent flows. If you've read about agentic RAG, this is that idea packaged as drag-and-drop instead of code.
RAGFlow's defaults are sensible, not magic. Check the parsed preview on your worst document, adjust the chunking template, and confirm a known table value comes back correctly before you point real users at it — the same measure-first habit that keeps any RAG honest.
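That measure-first check can be as small as a list of questions paired with values you already know are in the documents. In this sketch, `retrieve` is a placeholder for whatever function returns chunk texts from your setup (hand-built or an engine's API), and the questions and values are invented.

```python
def check_known_answers(retrieve, cases, top_k=3):
    """Return the cases whose expected text is missing from the top_k chunks.

    `retrieve` takes a question and returns a list of chunk strings, best
    first. An empty result means every known value came back.
    """
    failures = []
    for question, expected in cases:
        chunks = retrieve(question)[:top_k]
        if not any(expected in chunk for chunk in chunks):
            failures.append((question, expected))
    return failures


def fake_retrieve(question):
    # Stand-in for a real retrieval call; always returns the same chunk.
    return ["Plan: Pro; Monthly: $30; Annual: $300"]


cases = [
    ("What does Pro cost per month?", "$30"),
    ("What does Team cost per year?", "$800"),
]
failures = check_known_answers(fake_retrieve, cases)
print(failures)  # [('What does Team cost per year?', '$800')]
```

Run it against your worst document after each change to the chunking template; a value that disappears from the top results points to a parsing or chunking regression rather than a model problem.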
When to Use RAGFlow vs Roll-Your-Own
RAGFlow isn't a replacement for understanding RAG — it's what you graduate to once you do. Here's the honest split.
| | Roll-your-own (Parts 1–3) | RAGFlow |
|---|---|---|
| Documents | Clean text, Markdown | Real PDFs, tables, scans |
| Setup | A Python script | Docker stack |
| Parsing | You (and it's the hard part) | Deep layout + table models |
| Chunking / rerank | You build it | Built in |
| Control & transparency | Total | Configurable, not total |
| Best for | Learning, clean corpora, custom logic | Messy documents, shipping fast |
My rule: build it yourself once so you understand every moving part, then reach for RAGFlow the moment real PDFs enter the picture. If your data is a tidy wiki, a from-scratch pipeline is lighter and fully yours. If it's a folder of contracts and financial statements, fighting your own PDF parser is a worse use of time than deploying an engine that already solved it.
Quick Recap
- PDF and table parsing is the hidden cause of most wrong RAG answers.
- RAGFlow's deep document understanding reads structure a text extractor destroys.
- Self-host with Docker; inspect the parsed preview before trusting it.
- Hybrid retrieval + reranking + an agentic layer come built in.
- Roll your own to learn and for clean text; use RAGFlow for messy, real-world documents.
Frequently Asked Questions
What is RAGFlow? An open-source, self-hosted RAG engine built around deep document parsing — it reads PDFs, tables, and layouts, then chunks, indexes, reranks, and serves answers.
How is it different from LangChain or LlamaIndex? Those are libraries you build a pipeline with; RAGFlow is a running engine you deploy, and its standout is parsing messy documents correctly.
Does it need a GPU? No, but one speeds up the parsing models on big document sets. Budget a few gigabytes of RAM for the Docker stack.
When should I build my own instead? When the corpus is clean text, you want full control, or you're learning how retrieval works under the hood.
Is it free? Yes — open-source on GitHub. You still pay for the LLM and embedding calls and the compute you host it on.
Conclusion
RAGFlow doesn't make RAG smarter — it makes the input honest. Most "the model hallucinated" bugs on real documents are really "the table was shredded before anyone embedded it," and no prompt fixes that. Build a RAG system by hand first so you know what the engine is doing for you, then let RAGFlow handle the parsing you don't want to reinvent.
What finally broke your hand-built RAG — a table, a scanned PDF, a two-column layout? Tell me in the comments. If you haven't built one from scratch yet, do that first so you can tell what RAGFlow is actually buying you.
Read next: Build a RAG System in Python From Scratch (Part 1) — the baseline this graduates from.
- Never built one? The from-scratch RAG series shows every part RAGFlow automates.
- Curious how the chunking works? Part 2 and Part 3 are the manual versions of RAGFlow's parsing and reranking.
- Comparing tools? Best AI agent frameworks (2026) covers where a full engine fits versus a library.

