# Local RAG: Chat With Your Documents 100% Privately (2026)

> Set up local RAG and chat with your documents fully offline: Open WebUI knowledge bases for the no-code path, plus a tiny Python pipeline with Ollama.

*Source: https://www.infowok.com/local-rag-chat-with-your-documents/ · Sukhveer Kaur · Published July 5, 2026*

---

> **Local AI, Zero Cost — Part 3.** [Part 1](/best-local-llm-for-your-laptop/) put a free model on your laptop; [Part 2](/local-ai-coding-assistant-vscode/) turned it into a coding assistant. Now the same stack learns to answer questions about *your* files — with **local RAG**, fully private and offline.

Your notes, contracts, research papers and meeting minutes hold answers no chatbot knows. The usual fix is uploading them to a cloud AI. That's exactly what you can't do with client contracts, medical records or anything under NDA. Local RAG solves this the other way around: instead of sending documents to the model, you bring the model to the documents. Everything — the files, the search index, the conversation — stays on your laptop.

This guide covers two paths. The no-code route uses Open WebUI. The developer route is a ~30-line Python pipeline that shows you the machinery.

<Prerequisites>

- Ollama running with a chat model from [Part 1](/best-local-llm-for-your-laptop/)
- New to embeddings? The [embeddings and vector search primer](/embeddings-vector-search-primer/) explains the core idea in ten minutes
- For the Python path: basic Python and `pip` — nothing else

</Prerequisites>

<KeyTakeaways>

- **Local RAG = your documents + a local embedding model + a local LLM.** The pipeline chunks your files, embeds them with `nomic-embed-text`, stores them in a local vector database, and answers with the model you already run — no upload, ever.
- **Open WebUI gives you document chat with zero code.** Create a knowledge base, drag in PDFs, and reference it in any chat.
- **The #1 failure is Ollama's 2048-token default context.** Retrieved chunks don't fit, answers go vague — raise it to 8192+ and most "RAG is broken" complaints disappear.

</KeyTakeaways>

## Why local RAG instead of uploading your files

A local model knows nothing about your world: its training data stops at a cutoff date and never included your private files. RAG fixes that by retrieving the right passages from your files at question time. Doing it locally adds the properties that matter for private material:

- **Confidential by construction.** Contracts, patient notes, unpublished research — the entire pipeline runs on hardware you own. There is no upload step to audit because there is no upload step.
- **Offline and permanent.** Your document chat works on a plane, and no provider can retire the feature or change its pricing.
- **Free at any scale.** Embedding a thousand pages costs electricity, not tokens. Ask unlimited questions.
- **You learn the real pipeline.** Chunking, embeddings, retrieval — the same concepts behind every production RAG system, running where you can poke at them.

The honest trade-off is the same one from Parts 1 and 2: a laptop model summarizes and answers from retrieved context well, but a frontier cloud model reasons better over long, tangled documents. For private material, local is not the compromise — it's the requirement.

**Bottom line: local RAG is how you get "chat with your documents" for files you'd never be allowed to upload.**

## How local RAG works

One diagram, two phases — index once, then ask forever:

![How local RAG works — documents are split into chunks, embedded with nomic-embed-text and stored in a local vector store; each question retrieves the top matching chunks and a local model answers, all on one laptop](./local-rag-pipeline.svg)

The indexing phase splits each document into chunks and runs each chunk through an embedding model, which turns the text into a vector (a list of numbers that captures its meaning). [nomic-embed-text](https://ollama.com/library/nomic-embed-text) is a common local choice, small enough to run quickly on CPU. The resulting vectors go into a local vector store. At question time, the pipeline embeds your query with the same model, the store returns the chunks whose vectors sit closest to it, and your chat model is prompted to answer using only that context.
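
To make "most similar chunks" concrete, here is a minimal sketch of the retrieval step in plain Python. It assumes cosine similarity as the metric (one common choice; your vector store may default to a different one), and the three-number vectors are invented for illustration. Real embeddings have hundreds of dimensions. The helper names `cosine_similarity` and `top_k` are hypothetical, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,  # highest similarity first
    )
    return [text for text, _ in ranked[:k]]

# Toy 3-dimensional "embeddings", invented for illustration only.
store = [
    ("Notice period is 60 days", [0.9, 0.1, 0.0]),
    ("Invoices are due in 30 days", [0.1, 0.9, 0.0]),
    ("Office closes on public holidays", [0.0, 0.1, 0.9]),
]
query_vec = [0.8, 0.2, 0.0]  # pretend embedding of "How much notice?"

print(top_k(query_vec, store, k=2))
```

A vector store does the same ranking, only with an index that stays fast when you have thousands of chunks.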

**Bottom line: two small models split the work — one turns text into searchable vectors, the other writes the answer.**

## The no-code path: Open WebUI knowledge bases

[Open WebUI](https://docs.openwebui.com/features/chat-conversations/rag/) is a free, self-hosted ChatGPT-style interface that sits on top of Ollama and has RAG built in:

1. **Install and start it:**

```bash
pip install open-webui
open-webui serve
# open http://localhost:8080 in your browser
```

2. **Pull the embedding model** once: `ollama pull nomic-embed-text`, then in Open WebUI's admin settings set the embedding engine to Ollama with `nomic-embed-text`.
3. **Create a knowledge base** (Workspace → Knowledge), and drag in your files — PDF, TXT, Markdown and DOCX all work.
4. **Chat with it.** In any conversation, type `#` and select your knowledge base. Ask away; Open WebUI retrieves the relevant chunks and cites which file they came from.

<Callout type="warning">

Before judging the answers, fix the context window. Ollama has long defaulted to 2048 tokens (the default differs between releases, so check yours), which is too small to hold the retrieved chunks. That's the classic cause of vague "I don't see that in the document" answers. In Open WebUI: Admin Panel → Models → your model → Advanced Parameters → set context length to 8192 or more. [Open WebUI's own troubleshooting guide](https://docs.openwebui.com/troubleshooting/rag/) calls this the most common RAG failure.

</Callout>
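
If you want a feel for why a small window bites, the arithmetic can be sketched in a few lines. This is a back-of-the-envelope estimate, not a real tokenizer: it assumes roughly four characters per token (a rule of thumb that varies by model and language), and the chunk count, chunk size and 512-token answer reserve are illustrative numbers, not Open WebUI's settings. Both function names are hypothetical.

```python
def estimated_tokens(text, chars_per_token=4):
    """Very rough token estimate; real tokenizers vary by model and language."""
    return -(-len(text) // chars_per_token)  # ceiling division

def fits_context(chunks, question, num_ctx, reserve_for_answer=512):
    """True if the prompt plus an answer budget fits within num_ctx tokens."""
    prompt_tokens = sum(estimated_tokens(c) for c in chunks) + estimated_tokens(question)
    return prompt_tokens + reserve_for_answer <= num_ctx

chunks = ["x" * 1000] * 8  # eight retrieved chunks of ~1,000 characters each
question = "How much notice do I have to give?"

print(fits_context(chunks, question, num_ctx=2048))
print(fits_context(chunks, question, num_ctx=8192))
```

Under these assumptions the eight chunks alone take about 2,000 tokens, so the 2048 window overflows before the model has room to answer, while 8192 leaves plenty of headroom. System prompts and chat history eat into the same budget.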

**Bottom line: pip install, pull one embedding model, drag in files — private document chat in a browser tab.**

## The developer path: a ~30-line Python pipeline

The same pipeline in code, using [Chroma](https://docs.trychroma.com/) as the local vector store:

```bash
pip install ollama chromadb
ollama pull nomic-embed-text
```

```python
import ollama
import chromadb

# 1. Index once: chunk, embed, store
docs = [
    "The notice period for termination is 60 days...",
    "Payment terms: invoices are due within 30 days...",
    # in real use: read your files and split into ~1,000-char chunks
]
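# In-memory store: the index is rebuilt on every run. To keep it on disk,
# use chromadb.PersistentClient(path="./rag_index") instead (example path).
client = chromadb.Client()
col = client.get_or_create_collection("my_docs")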
for i, doc in enumerate(docs):
    emb = ollama.embed(model="nomic-embed-text", input=doc)["embeddings"][0]
    col.add(ids=[str(i)], embeddings=[emb], documents=[doc])

# 2. Every question: embed, retrieve, answer
q = "How much notice do I have to give?"
q_emb = ollama.embed(model="nomic-embed-text", input=q)["embeddings"][0]
hits = col.query(query_embeddings=[q_emb], n_results=3)
context = "\n\n".join(hits["documents"][0])

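# num_ctx raises the context window so the retrieved chunks fit in the prompt
reply = ollama.chat(
    model="qwen3.5:9b",
    messages=[{
        "role": "user",
        "content": f"Answer using only this context:\n{context}\n\nQuestion: {q}",
    }],
    options={"num_ctx": 8192},
)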
print(reply["message"]["content"])
```

Swap `qwen3.5:9b` for your Part 1 model. That's a complete, private RAG system. Every production concept has a seat in it: the chunking strategy, the embedding model, the retrieval count, the prompt that constrains the model to the context.
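
The pipeline's comment about splitting files into ~1,000-character chunks can be sketched as a small helper. This is one simple approach among many: fixed-size character windows with some overlap, so a sentence cut at a boundary still appears whole in the neighbouring chunk. The 200-character overlap is an assumption for illustration, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size].strip()
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "word " * 500  # 2,500 characters of filler text
pieces = chunk_text(sample)
print(len(pieces), [len(p) for p in pieces])
```

To use it, read a file's text into a string and pass the result of `chunk_text` as the `docs` list. Splitting on paragraphs or headings often retrieves better than fixed windows, which makes it a good first experiment.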

<Callout type="tip">

When answers disappoint, the fix is almost never a bigger model — it's better retrieval. Our [chunking and retrieval quality guide](/rag-chunking-retrieval-quality-part-2/) covers the levers. The [from-scratch RAG series](/build-a-rag-system-in-python-part-1/) builds the whole pipeline without frameworks if you want full depth.

</Callout>
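
A quick way to check retrieval before blaming the model is to look at what actually came back. The sketch below assumes the result has the shape Chroma's `query` returns by default (parallel lists of `ids`, `documents` and `distances`, one inner list per query). The `fake_result` dictionary is a hand-written stand-in with invented distances, and `rank_hits` is a hypothetical helper.

```python
def rank_hits(result, max_distance=None):
    """Flatten a single-query Chroma-style result into (distance, id, text) rows."""
    rows = list(zip(result["distances"][0], result["ids"][0], result["documents"][0]))
    rows.sort(key=lambda row: row[0])  # smaller distance = closer match
    if max_distance is not None:
        rows = [row for row in rows if row[0] <= max_distance]
    return rows

# Hand-written stand-in for what col.query(...) returns; the numbers are invented.
fake_result = {
    "ids": [["0", "1"]],
    "documents": [[
        "The notice period for termination is 60 days...",
        "Payment terms: invoices are due within 30 days...",
    ]],
    "distances": [[0.42, 1.37]],
}

for distance, doc_id, text in rank_hits(fake_result):
    print(f"{distance:.2f}  id={doc_id}  {text[:50]}")
```

If the chunk that holds the answer is missing from this list or ranks near the bottom, no chat model can recover it. The `max_distance` cutoff is optional, and a sensible value depends on your embedding model and distance metric, so tune it on your own data.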

**Bottom line: thirty lines of Python — embed, store, retrieve, answer — and you own every stage of the pipeline.**

## Quick recap

| Decision | Answer |
| --- | --- |
| Embedding model | nomic-embed-text (tiny, CPU-fast, free) |
| No-code tool | Open WebUI knowledge bases (`pip install open-webui`) |
| Code path | ollama + chromadb, ~30 lines |
| Answering model | Whatever your RAM supports from Part 1 |
| #1 fix for bad answers | Raise Ollama context length from 2048 to 8192+ |
| #2 fix | Better chunking — not a bigger model |
| Privacy | Files, index and chats never leave your laptop |

Three parts in, the pattern holds: the models got small enough, the tools got simple enough, and the paid tier stopped being the price of entry. Your laptop now chats, codes and reads your documents — for nothing.

<NextSteps>

- **Start here:** install Open WebUI, drag three PDFs into a knowledge base, and ask a question you know the answer to — verify it cites the right file.
- **Level up:** run the Python pipeline on a real folder of notes, then experiment with chunk sizes and `n_results`.
- **Go deeper:** the [from-scratch RAG series](/build-a-rag-system-in-python-part-1/) rebuilds this pipeline without frameworks, and [What Is RAG?](/what-is-rag-complete-guide-2026/) fills in the theory.

</NextSteps>
