Local AI, Zero Cost — Part 3. Part 1 put a free model on your laptop; Part 2 turned it into a coding assistant. Now the same stack learns to answer questions about your files — with local RAG, fully private and offline.
Your notes, contracts, research papers and meeting minutes hold answers no chatbot knows. The usual fix is uploading them to a cloud AI. That’s exactly what you can’t do with client contracts, medical records or anything under NDA. Local RAG solves this the other way around: instead of sending documents to the model, you bring the model to the documents. Everything — the files, the search index, the conversation — stays on your laptop.
This guide gives you both paths from the series so far. The no-code route uses Open WebUI. The developer route is a ~30-line Python pipeline that shows you the machinery.
- Ollama running with a chat model from Part 1
- New to embeddings? The embeddings and vector search primer explains the core idea in ten minutes
- For the Python path: basic Python and `pip`, nothing else
- Local RAG = your documents + a local embedding model + a local LLM. The pipeline chunks your files, embeds them with `nomic-embed-text`, stores them in a local vector database, and answers with the model you already run. Nothing is ever uploaded.
- Open WebUI gives you document chat with zero code. Create a knowledge base, drag in PDFs, and reference it in any chat.
- The #1 failure is Ollama’s 2048-token default context. Retrieved chunks don’t fit, answers go vague — raise it to 8192+ and most “RAG is broken” complaints disappear.
Why local RAG instead of uploading your files
A local model knows nothing about your world: its training data ends months before you ask, and your files were never in it. RAG fixes that by retrieving the right passages from your files at question time. Doing it locally adds the properties that matter for private material:
- Confidential by construction. Contracts, patient notes, unpublished research — the entire pipeline runs on hardware you own. There is no upload step to audit because there is no upload step.
- Offline and permanent. Your document chat works on a plane, and no provider can retire the feature or change its pricing.
- Free at any scale. Embedding a thousand pages costs electricity, not tokens. Ask unlimited questions.
- You learn the real pipeline. Chunking, embeddings, retrieval — the same concepts behind every production RAG system, running where you can poke at them.
The honest trade-off is the same one from Parts 1 and 2: a laptop model summarizes and answers from retrieved context well, but a frontier cloud model reasons better over long, tangled documents. For private material, local is not the compromise — it’s the requirement.
Bottom line: local RAG is how you get “chat with your documents” for files you’d never be allowed to upload.
How local RAG works
One diagram, two phases — index once, then ask forever:
The indexing phase splits each document into chunks and runs each chunk through an embedding model. nomic-embed-text is the local standard, small enough to run quickly on a CPU. The resulting vectors go into a local vector store. At question time, the pipeline embeds your query with the same model. The store returns the most similar chunks, and your chat model answers using only that context.
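To make "most similar chunks" concrete, here is a minimal sketch of the ranking step, assuming cosine similarity as the measure. The three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions, and a vector store does this ranking for you.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Hypothetical toy index: (chunk text, made-up embedding) pairs.
toy_index = [
    ("The notice period for termination is 60 days.", [0.9, 0.1, 0.0]),
    ("Payment terms: invoices are due within 30 days.", [0.1, 0.9, 0.0]),
    ("The office closes at 6 pm on Fridays.", [0.0, 0.1, 0.9]),
]

# Made-up embedding standing in for a question about notice periods.
query_vec = [0.8, 0.2, 0.1]
best = top_k(query_vec, toy_index, k=2)
print(best)
```

The chunk about notice periods ranks first because its vector points in nearly the same direction as the query vector.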
Bottom line: two small models split the work — one turns text into searchable vectors, the other writes the answer.
The no-code path: Open WebUI knowledge bases
Open WebUI is a free, self-hosted ChatGPT-style interface that sits on top of Ollama and has RAG built in:
- Install and start it:

  ```shell
  pip install open-webui
  open-webui serve
  # open http://localhost:8080 in your browser
  ```

- Pull the embedding model once: `ollama pull nomic-embed-text`, then in Open WebUI’s admin settings set the embedding engine to Ollama with `nomic-embed-text`.
- Create a knowledge base (Workspace → Knowledge), and drag in your files: PDF, TXT, Markdown and DOCX all work.
- Chat with it. In any conversation, type `#` and select your knowledge base. Ask away; Open WebUI retrieves the relevant chunks and cites which file they came from.
Before judging the answers, fix the context window. Ollama defaults to 2048 tokens, too small to hold the retrieved chunks. That’s the classic cause of vague “I don’t see that in the document” answers. In Open WebUI: Admin Panel → Models → your model → Advanced Parameters → set context length to 8192 or more. Open WebUI’s own troubleshooting guide calls this the most common RAG failure.
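If you call Ollama from Python instead of going through Open WebUI, the same setting can be passed per request. The sketch below assumes the `ollama` Python package's `options` argument with the `num_ctx` key. The helper names and the four-characters-per-token estimate are illustrative assumptions, not part of either tool, and the model name is a placeholder.

```python
def estimate_tokens(text):
    """Very rough token estimate, assuming about 4 characters per token (heuristic)."""
    return len(text) // 4

def build_chat_request(model, context, question, num_ctx=8192):
    """Build keyword arguments for ollama.chat with a larger context window."""
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},  # context window size in tokens
    }

# Hypothetical example: five retrieved chunks of about 2,000 characters each.
chunks = ["x" * 2000 for _ in range(5)]
context = "\n\n".join(chunks)
request = build_chat_request(
    "your-part-1-model", context, "How much notice do I have to give?"
)

prompt_tokens = estimate_tokens(request["messages"][0]["content"])
print(prompt_tokens, "estimated prompt tokens")
print("fits in 2048:", prompt_tokens < 2048)
print("fits in 8192:", prompt_tokens < 8192)

# With Ollama running, the request could then be sent like this:
#   import ollama
#   reply = ollama.chat(**request)
```

The estimate is crude, but it is enough to show how a handful of retrieved chunks can outgrow a small window before the model has written a single word of its answer.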
Bottom line: pip install, pull one embedding model, drag in files — private document chat in a browser tab.
The developer path: a ~30-line Python pipeline
The same pipeline in code, using Chroma as the local vector store:
```shell
pip install ollama chromadb
ollama pull nomic-embed-text
```

```python
import ollama
import chromadb

# 1. Index once: chunk, embed, store
docs = [
    "The notice period for termination is 60 days...",
    "Payment terms: invoices are due within 30 days...",
    # in real use: read your files and split into ~1,000-char chunks
]
# Client() keeps the index in memory, so it is rebuilt on every run;
# chromadb.PersistentClient(path=...) stores it on disk instead.
client = chromadb.Client()
col = client.create_collection("my_docs")
for i, doc in enumerate(docs):
    emb = ollama.embed(model="nomic-embed-text", input=doc)["embeddings"][0]
    col.add(ids=[str(i)], embeddings=[emb], documents=[doc])

# 2. Every question: embed, retrieve, answer
q = "How much notice do I have to give?"
q_emb = ollama.embed(model="nomic-embed-text", input=q)["embeddings"][0]
hits = col.query(query_embeddings=[q_emb], n_results=3)
context = "\n\n".join(hits["documents"][0])
reply = ollama.chat(model="qwen3.5:9b", messages=[{
    "role": "user",
    "content": f"Answer using only this context:\n{context}\n\nQuestion: {q}",
}])
print(reply["message"]["content"])
```

Swap `qwen3.5:9b` for your Part 1 model. That’s a complete, private RAG system. Every production concept has a seat in it: the chunking strategy, the embedding model, the retrieval count, the prompt that constrains the model to the context.
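The comment in the `docs` list glosses over chunking. One simple way to do it is fixed-size character windows with a small overlap, so a sentence cut at a boundary still appears whole in one of the two neighbouring chunks. The function name and the 1,000/200 defaults are illustrative choices, not something the tools used here prescribe.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into windows of `size` characters; neighbours share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "A" * 2500
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

Each string returned by `chunk_text` would then take the place of one entry in `docs`.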
When answers disappoint, the fix is almost never a bigger model — it’s better retrieval. Our chunking and retrieval quality guide covers the levers. The from-scratch RAG series builds the whole pipeline without frameworks if you want full depth.
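A quick way to check retrieval before blaming the model is to look at what came back. The sketch below assumes a result shaped like the one Chroma's `query` returns by default, with parallel `documents` and `distances` lists per query. The `inspect_hits` helper, the hand-written result, and the 0.5 cutoff are hypothetical, and a sensible cutoff depends on the distance metric and the embedding model.

```python
def inspect_hits(hits, max_distance=None):
    """Pair each retrieved chunk with its distance, closest first; optionally drop far ones."""
    pairs = list(zip(hits["documents"][0], hits["distances"][0]))
    pairs.sort(key=lambda pair: pair[1])
    if max_distance is not None:
        pairs = [pair for pair in pairs if pair[1] <= max_distance]
    return pairs

# Hand-written stand-in for a query result; the distances are invented.
fake_hits = {
    "documents": [[
        "Payment terms: invoices are due within 30 days...",
        "The notice period for termination is 60 days...",
    ]],
    "distances": [[0.92, 0.31]],
}

for text, dist in inspect_hits(fake_hits):
    print(f"{dist:.2f}  {text}")

kept = inspect_hits(fake_hits, max_distance=0.5)
```

If the chunk that holds the answer is missing from this list, or sits far from the query, changing the chat model will not help; the chunking or the retrieval count is the place to look.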
Bottom line: thirty lines of Python — embed, store, retrieve, answer — and you own every stage of the pipeline.
Quick recap
| Decision | Answer |
|---|---|
| Embedding model | nomic-embed-text (tiny, CPU-fast, free) |
| No-code tool | Open WebUI knowledge bases (pip install open-webui) |
| Code path | ollama + chromadb, ~30 lines |
| Answering model | Whatever your RAM supports from Part 1 |
| #1 fix for bad answers | Raise Ollama context length from 2048 to 8192+ |
| #2 fix | Better chunking — not a bigger model |
| Privacy | Files, index and chats never leave your laptop |
Three parts in, the pattern holds: the models got small enough, the tools got simple enough, and the paid tier stopped being the price of entry. Your laptop now chats, codes and reads your documents — for nothing.
- Start here: install Open WebUI, drag three PDFs into a knowledge base, and ask a question you know the answer to — verify it cites the right file.
- Level up: run the Python pipeline on a real folder of notes, then experiment with chunk sizes and `n_results`.
- Go deeper: the from-scratch RAG series rebuilds this pipeline without frameworks, and What Is RAG? fills in the theory.
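For the "real folder of notes" step, a loader along these lines could feed the pipeline. It is a sketch that assumes plain-text and Markdown files encoded as UTF-8. `load_notes` is a hypothetical helper, and PDF or DOCX files would need a separate text-extraction step first.

```python
import tempfile
from pathlib import Path

def load_notes(folder, suffixes=(".txt", ".md")):
    """Return (relative path, text) pairs for every matching file under folder."""
    root = Path(folder)
    notes = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            notes.append((str(path.relative_to(root)), text))
    return notes

# Demo on a throwaway folder so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.md").write_text("# Notes\nNotice period: 60 days.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Invoices are due within 30 days.", encoding="utf-8")
    Path(tmp, "skip.pdf").write_bytes(b"%PDF-")  # ignored: not a text suffix
    loaded = load_notes(tmp)

print([name for name, _ in loaded])
```

Keeping the relative path next to each text makes it possible to report which file an answer came from once the chunks are stored.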

