Local AI, Zero Cost — Part 2. Part 1 matched your laptop’s RAM to the right open-source model. This part turns that setup into a working local AI coding assistant in VS Code.
GitHub Copilot costs money every month. Cursor costs more. Meanwhile the models that power a very usable coding assistant have become free, small enough for your laptop, and one config file away from living inside VS Code. The stack is three free pieces: VS Code, the Continue extension, and Ollama serving a Qwen coder model. No account, no trial, no token meter — and your code never leaves your machine.
This guide sets the whole thing up in about fifteen minutes: chat with your codebase, inline edits, and tab autocomplete, all running locally.
- Ollama installed with at least one model pulled — Part 1 covers this in ten minutes
- VS Code installed, and comfort pasting a command into a terminal
- 8 GB+ RAM (16 GB gives you the noticeably smarter 7B chat model)
- One assistant needs two models. A small, fast model (Qwen2.5-Coder 1.5B) handles keystroke-speed autocomplete; a larger one (Qwen2.5-Coder 7B or Qwen3-Coder 30B) handles chat and edits. Continue wires each to a role.
- The whole stack is free and private. Continue talks to Ollama at
localhost:11434, so chat, edits and autocomplete keep working offline and nothing is ever uploaded. - Setup is one YAML file. Pull two models, install the Continue extension, paste a ten-line
config.yaml— done.
Why go local instead of paying for Copilot#
The case mirrors the one from Part 1, sharpened for code:
- Your code stays yours. Client work under NDA, proprietary repos, unreleased features — none of it leaves the laptop. There is no cloud inference endpoint to trust, audit or get breached.
- Zero recurring cost. Copilot-class autocomplete and chat without a subscription. For students and self-learners, that removes the last excuse.
- It works on flights and bad Wi-Fi. Autocomplete keeps firing with the network cable pulled.
- Unlimited usage. No monthly premium-request caps, no throttling during a long refactoring session.
And the honest other side: cloud assistants still win on frontier-model reasoning and big agentic multi-file changes. If that’s your daily need, our Cursor vs Claude Code comparison covers the paid landscape. Many developers sensibly run both — local for the 80% of daily completions and questions, cloud for the hardest 20%.
Bottom line: a local AI coding assistant covers everyday completions, explanations and edits for exactly ₹0 — and it’s the only option that’s fully private and offline.
How a local AI coding assistant fits together#
Three pieces, one local port:
VS Code is the editor you already use. Continue is a free, open-source extension that adds the assistant UI — a chat sidebar, an inline-edit command, and tab autocomplete. Ollama (from Part 1) serves models at http://localhost:11434. Continue sends every request there, tagged with a role: chat, edit, apply or autocomplete.
The key design decision is two models, split by job:
- Autocomplete fires on almost every keystroke, so it must respond in milliseconds. A 1.5B model trained for FIM completion is ideal — big models here feel laggy, not smart.
- Chat and edits are manual, so a few seconds of thinking is fine. Use the largest coder model your RAM allows.
Bottom line: Continue is the cockpit, Ollama is the engine room, and the two-model split is what makes local autocomplete feel instant.
Pick your two models by RAM#
Qwen’s coder family dominates this niche in 2026 — qwen2.5-coder remains the most-pulled dedicated code model on Ollama, and qwen3-coder brings a 30B MoE that activates only ~3B parameters per token:
| Your laptop | Chat + edit model | Autocomplete model |
|---|---|---|
| 8 GB RAM | qwen2.5-coder:3b | qwen2.5-coder:0.5b |
| 16 GB RAM | qwen2.5-coder:7b | qwen2.5-coder:1.5b |
| 32 GB+ RAM | qwen3-coder:30b | qwen2.5-coder:1.5b |
Both models sit loaded side by side, so budget memory for the pair — on 16 GB that’s roughly 5 GB + 1 GB, which leaves comfortable headroom.
Pull models with their exact size tag (qwen2.5-coder:7b, not qwen2.5-coder). A bare name pulls :latest, which may be a different size than your config expects — the classic cause of Continue’s “404 model not found” error.
Bottom line: at 16 GB, qwen2.5-coder:7b for chat plus qwen2.5-coder:1.5b for autocomplete is the proven pairing.
Set it up in fifteen minutes#
Step 1 — pull the two models (Ollama installed in Part 1):
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:1.5bStep 2 — install Continue. In VS Code, open Extensions, search “Continue”, install the one by Continue Dev. A sidebar icon appears.
Step 3 — point Continue at Ollama. Open Continue’s config (gear icon → Open config file) and replace it with:
name: Local Assistant
version: 0.0.1
schema: v1
models:
- name: Qwen2.5-Coder 7B
provider: ollama
model: qwen2.5-coder:7b
roles:
- chat
- edit
- apply
- name: Qwen2.5-Coder 1.5B
provider: ollama
model: qwen2.5-coder:1.5b
roles:
- autocompleteSwap the model tags for your RAM tier from the table above. That’s the entire configuration.
Step 4 — test all three features:
- Chat: open the Continue sidebar, ask “explain this file” with a file open.
- Inline edit: select a function, press
Cmd+I(Mac) orCtrl+I(Windows/Linux), type “add error handling”. - Autocomplete: start typing a function body and watch gray ghost text appear; accept with
Tab.
Continue’s Agent mode needs a model with tool-calling support. Small coder models often advertise it but fail in practice — if you see “Agent mode is not supported”, stick to Chat and Edit modes locally, or add capabilities: [tool_use] and test. Per Continue’s own docs, advertised and actual tool support don’t always match.
Bottom line: two pulls, one extension, ten lines of YAML — then chat, edit and autocomplete are live.
A realistic daily workflow#
Where the local setup shines day to day:
- “What does this do?” — select code, ask in chat. The 7B coder explains unfamiliar code well, and it’s reading the actual file, not a paste.
- Small refactors —
Cmd+I→ “convert to async”, “add type hints”, “extract this into a helper”. Review the diff, accept or reject. - Boilerplate autocomplete — tests, argument parsing, dataclasses, API handlers. This is where the 1.5B FIM model quietly earns its keep, suggesting the next two or three lines faster than you’d type them.
- Learning — because it’s free and unlimited, you can ask every “why” question you’d hesitate to burn paid tokens on. Pair it with the agent-building series and the assistant explains the code you’re writing as you write it.
Where to stay realistic: long multi-file features, subtle architectural decisions, and gnarly debugging across a big codebase are still better served by frontier models. Know which tool you’re holding.
When something breaks#
Four errors cover nearly every first-day problem:
- “404 model not found, try pulling it first” — your config names a tag you haven’t pulled. Run
ollama list, then pull the exact tag from the error. A bareollama pull qwen2.5-codergrabs:latest, which is not the same model as:7b. - “Model requires more system memory” — Continue defaults to a larger context window than some tools. Lower
contextLength(try 2048) underdefaultCompletionOptionsin that model’s config block, or step down one model size. - Autocomplete feels laggy — the autocomplete role is running on a model that’s too big. Keep it on the 1.5B; on every keystroke, speed beats brains.
- Nothing responds at all — check the engine:
curl http://localhost:11434should answer “Ollama is running”. If not, start Ollama and try again.
Bottom line: almost every failure is a tag mismatch, a memory ceiling, or Ollama not running — each fixable in under a minute.
Quick recap#
| Decision | Answer |
|---|---|
| The stack | VS Code + Continue extension + Ollama |
| Chat/edit model (16 GB) | qwen2.5-coder:7b |
| Autocomplete model | qwen2.5-coder:1.5b (FIM-trained, keystroke-fast) |
| 32 GB upgrade | qwen3-coder:30b for chat, same 1.5B for autocomplete |
| Configuration | ~10 lines of config.yaml, roles per model |
| Cost | ₹0 — no subscription, no tokens, no caps |
| Privacy | Everything stays on localhost:11434 |
A year ago “free coding assistant” meant a crippled trial. Now it means a private, offline stack you control end to end — built from the same pieces you set up in Part 1, plus one extension and ten lines of YAML.
- Start here: pull the two Qwen coder models and paste the config — you’ll have working autocomplete before lunch.
- Level up: point the Pydantic AI tutorial at the same Ollama endpoint and let your new assistant help you write your first type-safe agent.
- Coming in this series: using your local model to chat with your own documents — fully private RAG on your laptop.

