
Local RAG: Chat With Your Documents 100% Privately (2026)

Set up local RAG and chat with your documents fully offline: Open WebUI knowledge bases for the no-code path, plus a tiny Python pipeline with Ollama.


Sukhveer Kaur

Published July 5, 2026

4 min read


🧰 New here? Set up your environment first · ~5 min

Install Python 3.11+ — confirm with python3 --version.
Create and activate a virtual environment: python3 -m venv .venv then source .venv/bin/activate (Windows: .venv\Scripts\activate). venv, pip & uv primer →
Install the packages this tutorial lists: pip install -U pip, then pip install open-webui (no-code path) or pip install ollama chromadb (Python path).
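This tutorial needs no LLM API key, because every model runs locally through Ollama. If you add a cloud provider later, put its key in a .env file and never commit it. API key + .env primer →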

Full walkthrough → Environment Setup primer

Local AI, Zero Cost — Part 3. Part 1 put a free model on your laptop; Part 2 turned it into a coding assistant. Now the same stack learns to answer questions about your files — with local RAG, fully private and offline.

Your notes, contracts, research papers and meeting minutes hold answers no chatbot knows. The usual fix is uploading them to a cloud AI. That’s exactly what you can’t do with client contracts, medical records or anything under NDA. Local RAG solves this the other way around: instead of sending documents to the model, you bring the model to the documents. Everything — the files, the search index, the conversation — stays on your laptop.

This guide covers two paths. The no-code route uses Open WebUI. The developer route is a ~30-line Python pipeline that shows you the machinery.

🟢 Beginner · ⏱️ 25 min · Stack: Ollama + nomic-embed-text, plus Open WebUI (no-code) or Python + Chroma (code)

✅ Before you start

Ollama running with a chat model from Part 1
New to embeddings? The embeddings and vector search primer explains the core idea in ten minutes
For the Python path: basic Python and pip — nothing else

🎯 Key takeaways

Local RAG = your documents + a local embedding model + a local LLM. The pipeline chunks your files, embeds them with nomic-embed-text, stores them in a local vector database, and answers with the model you already run — no upload, ever.
Open WebUI gives you document chat with zero code. Create a knowledge base, drag in PDFs, and reference it in any chat.
The #1 failure is Ollama’s 2048-token default context. Retrieved chunks don’t fit, answers go vague — raise it to 8192+ and most “RAG is broken” complaints disappear.

Why local RAG instead of uploading your files

A local model knows nothing about your world: its training data ends months before you ask, and it has never seen your files. RAG fixes that by retrieving the right passages from your files at question time. Doing it locally adds the properties that matter for private material:

Confidential by construction. Contracts, patient notes, unpublished research — the entire pipeline runs on hardware you own. There is no upload step to audit because there is no upload step.
Offline and permanent. Your document chat works on a plane, and no provider can retire the feature or change its pricing.
Free at any scale. Embedding a thousand pages costs electricity, not tokens. Ask unlimited questions.
You learn the real pipeline. Chunking, embeddings, retrieval — the same concepts behind every production RAG system, running where you can poke at them.

The honest trade-off is the same one from Parts 1 and 2: a laptop model summarizes and answers from retrieved context well, but a frontier cloud model reasons better over long, tangled documents. For private material, local is not the compromise — it’s the requirement.

Bottom line: local RAG is how you get “chat with your documents” for files you’d never be allowed to upload.

How local RAG works

Local RAG has two phases: index once, then ask as often as you like.

The indexing phase splits each document into chunks and runs each chunk through an embedding model. nomic-embed-text is a popular local choice, small enough to run quickly on CPU. The resulting vectors go into a local vector store. At question time, the pipeline embeds your query with the same model, the store returns the chunks whose vectors are most similar to it, and your chat model answers using only that context.
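To make "most similar chunks" concrete, here is a minimal sketch of the retrieval step using cosine similarity. The three-number vectors are hand-made stand-ins for real embeddings, and the helper names cosine_similarity and top_k are illustrative, not part of Ollama or Chroma. Cosine similarity is one common measure; a vector store may use a different metric by default.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Hand-made 3-dimensional "embeddings", purely for illustration.
# Real embedding models produce vectors with hundreds of dimensions.
index = [
    ("Notice period is 60 days.", [0.9, 0.1, 0.0]),
    ("Invoices are due within 30 days.", [0.1, 0.9, 0.0]),
    ("The office closes on public holidays.", [0.0, 0.1, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # stands in for the embedded question

print(top_k(query_vec, index, k=1))  # ['Notice period is 60 days.']
```

A real vector store does the same ranking, only over thousands of chunks and with an index that avoids comparing the query against every vector.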

Bottom line: two small models split the work — one turns text into searchable vectors, the other writes the answer.

The no-code path: Open WebUI knowledge bases

Open WebUI is a free, self-hosted ChatGPT-style interface that sits on top of Ollama and has RAG built in:

Install and start it:

bash

pip install open-webui
open-webui serve
# open http://localhost:8080 in your browser

Pull the embedding model once: ollama pull nomic-embed-text, then in Open WebUI’s admin settings set the embedding engine to Ollama with nomic-embed-text.
Create a knowledge base (Workspace → Knowledge), and drag in your files — PDF, TXT, Markdown and DOCX all work.
Chat with it. In any conversation, type # and select your knowledge base. Ask away; Open WebUI retrieves the relevant chunks and cites which file they came from.

⚠️ Warning

Before judging the answers, fix the context window. Ollama defaults to 2048 tokens, too small to hold the retrieved chunks. That’s the classic cause of vague “I don’t see that in the document” answers. In Open WebUI: Admin Panel → Models → your model → Advanced Parameters → set context length to 8192 or more. Open WebUI’s own troubleshooting guide calls this the most common RAG failure.
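To see why a small window hurts, a back-of-the-envelope budget check helps. The sketch below assumes roughly four characters per token, which is only a rule of thumb, and a hypothetical retrieval of eight 1,000-character chunks. The function names and the 512-token answer reserve are illustrative choices, not settings from Ollama or Open WebUI.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; real tokenizers vary by model and language."""
    return -(-len(text) // chars_per_token)  # ceiling division

def fits_in_context(chunks, question, num_ctx, reserve_for_answer=512):
    """Check whether the prompt plus room for an answer fits in num_ctx tokens."""
    prompt_tokens = sum(estimate_tokens(c) for c in chunks) + estimate_tokens(question)
    return prompt_tokens + reserve_for_answer <= num_ctx

# Hypothetical retrieval result: eight chunks of 1,000 characters each.
chunks = ["x" * 1000 for _ in range(8)]
question = "How much notice do I have to give?"

print(fits_in_context(chunks, question, num_ctx=2048))  # False
print(fits_in_context(chunks, question, num_ctx=8192))  # True
```

The system prompt and earlier chat turns also take up space, so the real budget is tighter than this estimate suggests.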

Bottom line: pip install, pull one embedding model, drag in files — private document chat in a browser tab.

The developer path: a ~30-line Python pipeline

The same pipeline in code, using Chroma as the local vector store:

bash

pip install ollama chromadb
ollama pull nomic-embed-text

python

import ollama
import chromadb

# 1. Index once: chunk, embed, store
docs = [
    "The notice period for termination is 60 days...",
    "Payment terms: invoices are due within 30 days...",
    # in real use: read your files and split into ~1,000-char chunks
]
# chromadb.Client() keeps the index in memory only. To keep it between runs,
# use chromadb.PersistentClient(path="./chroma_db") (example path).
client = chromadb.Client()
col = client.get_or_create_collection("my_docs")
for i, doc in enumerate(docs):
    emb = ollama.embed(model="nomic-embed-text", input=doc)["embeddings"][0]
    col.add(ids=[str(i)], embeddings=[emb], documents=[doc])

# 2. Every question: embed, retrieve, answer
q = "How much notice do I have to give?"
q_emb = ollama.embed(model="nomic-embed-text", input=q)["embeddings"][0]
# Never ask for more results than there are stored chunks
hits = col.query(query_embeddings=[q_emb], n_results=min(3, col.count()))
context = "\n\n".join(hits["documents"][0])

reply = ollama.chat(
    model="qwen3.5:9b",
    messages=[{
        "role": "user",
        "content": f"Answer using only this context:\n{context}\n\nQuestion: {q}",
    }],
    options={"num_ctx": 8192},  # raise the context window so the chunks fit
)
print(reply["message"]["content"])

Swap qwen3.5:9b for your Part 1 model. That’s a complete, private RAG system. Every production concept has a seat in it: the chunking strategy, the embedding model, the retrieval count, the prompt that constrains the model to the context.
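The pipeline's comment about splitting files into ~1,000-character chunks can be sketched as a small helper. This is one simple strategy, fixed-size character windows with overlap, not the only way to chunk. The name chunk_text and the 200-character overlap are assumptions chosen for illustration.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters so a sentence cut at a
    boundary still appears whole in one of them.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = "abcdefghij" * 250  # 2,500 characters of stand-in text
chunks = chunk_text(sample, size=1000, overlap=200)
print(len(chunks), [len(c) for c in chunks])  # 3 [1000, 1000, 900]
```

Each returned chunk would take the place of one string in the docs list. Splitting on paragraph or sentence boundaries usually retrieves better than blind character windows, but this version is enough to get a real file into the index.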

💡 Tip

When answers disappoint, the fix is almost never a bigger model — it’s better retrieval. Our chunking and retrieval quality guide covers the levers. The from-scratch RAG series builds the whole pipeline without frameworks if you want full depth.

Bottom line: thirty lines of Python — embed, store, retrieve, answer — and you own every stage of the pipeline.

Quick recap

Decision	Answer
Embedding model	nomic-embed-text (tiny, CPU-fast, free)
No-code tool	Open WebUI knowledge bases (`pip install open-webui`)
Code path	ollama + chromadb, ~30 lines
Answering model	Whatever your RAM supports from Part 1
#1 fix for bad answers	Raise Ollama context length from 2048 to 8192+
#2 fix	Better chunking — not a bigger model
Privacy	Files, index and chats never leave your laptop

Three parts in, the pattern holds: the models got small enough, the tools got simple enough, and the paid tier stopped being the price of entry. Your laptop now chats, codes and reads your documents — for nothing.

🧭 Where to go from here

Start here: install Open WebUI, drag three PDFs into a knowledge base, and ask a question you know the answer to — verify it cites the right file.
Level up: run the Python pipeline on a real folder of notes, then experiment with chunk sizes and n_results.
Go deeper: the from-scratch RAG series rebuilds this pipeline without frameworks, and What Is RAG? fills in the theory.
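For the "real folder of notes" step, a loader along these lines is one way to start. It is a sketch that handles only plain-text and Markdown files; PDF and DOCX need a text-extraction library, which is out of scope here. The function name load_notes and the record fields are hypothetical, and the demo writes to a temporary folder so it runs anywhere.

```python
from pathlib import Path
import tempfile

def load_notes(folder, suffixes=(".txt", ".md")):
    """Return one record per plain-text file found under `folder`."""
    records = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            records.append({
                "id": str(path.relative_to(folder)),  # unique within the folder
                "source": path.name,                  # handy for citing the file
                "text": path.read_text(encoding="utf-8", errors="replace"),
            })
    return records

# Demo against a throwaway folder; point load_notes at your own notes instead.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "contract.md").write_text("Notice period is 60 days.", encoding="utf-8")
    Path(tmp, "photo.png").write_bytes(b"\x89PNG")  # ignored: not a text file
    demo_records = load_notes(tmp)

print([r["source"] for r in demo_records])  # ['contract.md']
```

From here, split each record's text into chunks before embedding, and consider passing the source name to Chroma through the metadatas argument of col.add so answers can point back to a file.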

Frequently asked questions

What is local RAG?

Local RAG is retrieval-augmented generation running entirely on your own machine. Your documents are chunked and embedded by a local model, stored in a local vector database, and answered by a local LLM — so nothing is ever uploaded to a cloud service.

Do I need a GPU to chat with my documents locally?

No. Embedding models like nomic-embed-text are tiny and run fast on CPU, and the answering model is whatever your RAM already supports from Part 1 of this series. A GPU makes answers faster but is not required.

Why does my local RAG give bad or empty answers?

The most common cause is Ollama's default 2048-token context window, which is too small to hold the retrieved chunks. Raise the context length to 8192 or more in your settings. After that, poor chunking is the usual suspect.

Is Open WebUI free and private?

Yes. Open WebUI is open source and self-hosted. When paired with Ollama on the same machine, your documents, embeddings and chats stay on your hardware — there is no external service involved.

How is this different from uploading a PDF to ChatGPT?

Functionally it feels similar, but the mechanics are opposite. Uploading sends your file to someone else's servers. Local RAG keeps the file, the search index and the conversation on your laptop, works offline, and costs nothing per question.


Written by

Sukhveer Kaur, Software Developer & AI Engineer

Sukhveer is a software developer specialising in AI systems and backend engineering. She has hands-on experience designing agentic AI applications, working with large language model pipelines, autonomous agent frameworks, and cloud-native services in Java and Python. At InfoWok, she bridges the gap between cutting-edge AI research and practical implementation — helping developers understand and apply emerging technologies through clear, experience-backed writing.
