Pick the right open model for your RAM. Verified params, quant levels, and VRAM for Qwen3, Gemma 3, Llama, and DeepSeek distills across 8GB, 16GB, and 32GB machines — plus the runner to use.

Match your machine's RAM to a specific open model you can run tonight, and the runner to load it with.
The short answer: match the model to your RAM at Q4_K_M quantization. On 8GB, run Qwen3 8B or Gemma 3 4B — fast, capable, leaves headroom for your OS. On 16GB, run Qwen3 14B or Gemma 3 12B — the sweet spot for daily work. On 32GB, run Qwen3 32B, Gemma 3 27B, or a DeepSeek-R1 distill 32B for reasoning. Load any of them with Ollama (developers), LM Studio (tinkerers), or Jan (newcomers). All three wrap the same llama.cpp engine, so speed is near-identical — pick on workflow, not benchmarks.
I run local models daily on a 32GB Intel rig — no cloud, no API bill, no data leaving the box. Local inference crossed the line from "fun demo" to "actually useful" sometime in 2025, and in 2026 a $0 open model running on a laptop you already own handles most of what people still pay per-token for. The catch is matching the model to your memory. Pick too big and it swaps to disk and crawls. Pick too small and you leave quality on the table. This is the lookup table for getting it right.
RAM (or VRAM, if you have a discrete GPU) is the hard ceiling on what you can load. A model's file has to fit in memory alongside your operating system and the context window. The rule of thumb that holds across every model: at Q4 quantization, budget roughly 0.5–1 GB of memory per billion parameters, then leave 2–4 GB of headroom for the OS and the KV cache that grows with your prompt length.
Quantization is the lever that makes this work. A 14B model at full BF16 precision needs ~28 GB. The same model at Q4_K_M — 4-bit with K-quant mixing — drops to ~7–8 GB while losing under 1% accuracy on most benchmarks. Q4_K_M is the de-facto standard for local inference for exactly this reason: roughly 75% memory savings, minimal quality cost. If you have spare memory, step up to Q5_K_M or Q6. If you're tight, Q4_K_M is the floor I'd accept.
This is the citable unit. Every model below is Apache-licensed or open-weight, available today, and verified at Q4_K_M unless noted.
| Your RAM | Top pick | Params / quant | Approx. memory | Also good | Best for |
|---|---|---|---|---|---|
| 8 GB | Qwen3 8B | 8B / Q4_K_M | ~5 GB | Gemma 3 4B (~2.6 GB), Llama 3.2 3B | Chat, coding help, summarizing on a thin laptop |
| 16 GB | Qwen3 14B | 14B / Q4_K_M | ~8.5 GB | Gemma 3 12B (~6.7 GB), DeepSeek-R1 distill 14B (~6.5 GB) | The daily-driver sweet spot — real work, comfortable headroom |
| 32 GB | Qwen3 32B | 32B / Q4_K_M | ~19 GB | Gemma 3 27B (~15.1 GB), DeepSeek-R1 distill 32B (~18 GB) | Heavier reasoning, longer context, near-frontier quality offline |
Notes that matter:
Qwen3 14B at Q4_K_M. It lands around 8.5 GB, leaving you ~7 GB for the OS, your editor, and a generous context window. It's the closest thing to a free, private, always-available assistant that a mainstream laptop can run without compromise.
If 14B feels heavy on your machine — older CPU, integrated graphics, lots of other apps open — drop to Gemma 3 12B (~6.7 GB) for more breathing room, or run Qwen3 8B and spend the saved memory on a longer context window. For reasoning-heavy tasks on a 16GB box, the DeepSeek-R1 distill 14B (~6.5 GB) is the pick. 16GB is genuinely the comfort zone of local AI in 2026: enough for a 14B-class model that handles coding, drafting, and analysis without you babysitting memory.
8GB is the floor, and it works — you just stay in the 3B–8B range. Qwen3 8B at Q4_K_M (~5 GB) is the headline pick and leaves enough for a browser. If you also run heavy apps, step down to Gemma 3 4B (~2.6 GB) or Llama 3.2 3B, both of which are quick even on CPU-only inference.
The honest trade-off at 8GB: shorter context windows and the occasional "I don't know" where a 14B would've answered. But for summarizing a document, drafting an email, or rubber-ducking a bug, an 8GB machine running Qwen3 8B is more than enough — and it's running entirely on hardware you already paid for.
32GB is where local inference stops feeling like a compromise. You can run Qwen3 32B (~19 GB), Gemma 3 27B (~15.1 GB), or a DeepSeek-R1 distill 32B (~18 GB) at Q4_K_M and still have room for the OS plus a long context window. These are near-frontier on a lot of everyday tasks — and they're running on my desk with no network connection.
The move on 32GB is to keep two models pulled: a fast small one (Qwen3 8B) for quick turns, and a 32B for the questions that deserve it. Swapping between them in Ollama is one command. For how this slots into a full offline assistant stack, see Build Your Own Jarvis with Claude Code — the same routing logic applies whether the brain is local or cloud.
All three use llama.cpp under the hood, so raw tokens-per-second is within a few percent on identical hardware. You choose on workflow, not speed.
My default is Ollama for scripting plus LM Studio when I'm shopping for a new quant. Jan is the one I point non-developers to.
Two reasons it's worth it, and one honest caveat.
Privacy. Nothing leaves your machine. No prompt, no document, no codebase gets logged on someone else's server or used to train the next model. For anything sensitive — client work, personal data, unreleased writing — that alone settles it.
Cost. A frontier API call costs per-token forever. A local model costs the electricity to run a laptop you already own. If you're a heavy user, the math flips fast — and there's no metered anxiety about how many times you hit the model. For the full picture of where local fits alongside the paid frontier models, see the frontier model landscape for 2026 and the best AI superpowers stack.
The caveat: a 32B local model is not GPT-class or Claude-class on the hardest reasoning. For frontier-tier work — deep agentic coding, long-horizon planning — the cloud still wins. The right architecture in 2026 is both: local for the 80% that's private, cheap, and offline-capable; cloud for the 20% that genuinely needs the biggest brain. I build the creator-side of that split into GenCreator.
You don't strictly need a GPU. llama.cpp runs on CPU, and on Apple Silicon the unified memory architecture makes it genuinely fast. On a typical Intel or AMD laptop, CPU-only inference of an 8B model is usable — think reading-speed, not instant.
A GPU changes the experience, not the possibility. VRAM is the constraint that matters: a discrete card with 8–16 GB of VRAM lets the model live entirely on the GPU and run several times faster than CPU. If you're buying for local AI, more VRAM beats a faster core — a card with 16 GB of VRAM running a model fully in-GPU will outrun a faster card with 8 GB that has to spill to system RAM. And on the system side, 32 GB of RAM is the upgrade that unlocks the 32B tier. I name specific hardware honestly below.
What is the best local LLM for 16GB RAM in 2026? Qwen3 14B at Q4_K_M quantization. It uses about 8.5 GB, leaving comfortable headroom for your OS and a long context window. For reasoning-heavy tasks, swap in the DeepSeek-R1 distill 14B (~6.5 GB); for more breathing room, Gemma 3 12B (~6.7 GB).
What quantization should I use? Q4_K_M for almost everyone — it's 4-bit with K-quant mixing, cuts memory ~75%, and loses under 1% accuracy on most benchmarks. If you have spare memory, step up to Q5_K_M or Q6 for a small quality bump. Q4_K_M is the floor I'd accept for serious use.
Can I run a local LLM without a GPU? Yes. llama.cpp (which Ollama, LM Studio, and Jan all use) runs on CPU, and it's genuinely fast on Apple Silicon's unified memory. On a standard Intel or AMD laptop, an 8B model runs at roughly reading speed CPU-only. A GPU with enough VRAM makes it several times faster but isn't required to start.
Is Ollama or LM Studio better? Neither is faster — both wrap llama.cpp. Ollama is CLI-first with the leanest footprint and an OpenAI-compatible API, ideal for developers and scripting. LM Studio is a polished GUI with the best model discovery, ideal for tinkerers comparing quants. Jan is the open-source, MCP-capable option for newcomers.
Will a local model match GPT or Claude? Not on the hardest reasoning. A 32B local model at Q4 is excellent for everyday chat, drafting, summarizing, and a lot of coding — but frontier cloud models still lead on deep agentic and long-horizon tasks. The smart setup runs both: local for private, cheap, offline work; cloud for the questions that need the biggest brain.
Do I need to buy more RAM or a GPU? Only if you want a bigger model than your current memory allows. 8 GB runs an 8B model fine. 16 GB unlocks the 14B sweet spot. 32 GB of system RAM unlocks the 32B tier. If you buy a GPU, prioritize VRAM over raw speed — a 16 GB card that holds the whole model beats a faster 8 GB card that spills to system memory. (Hardware links below go through Amazon Associates; I only name parts I'd actually run, and I earn a small commission if you buy through them — no fake links, no padding.)
Match the model to the memory, load it with the runner that fits how you work, and you have a private, free assistant running tonight. The cloud is for the 20% that needs it. The other 80% can live on your own machine.
Step-by-step guide to setting up ACOS, creating your first agent, and shipping real products with AI.
Start buildingDownload AI architecture templates, multi-agent blueprints, and prompt engineering patterns.
Browse templatesConnect with creators and architects shipping AI products. Weekly office hours, shared resources, direct access.
Join the circleRead on FrankX.AI — AI Architecture, Music & Creator Intelligence
Weekly field notes on AI systems, production patterns, and builder strategy.

A tested comparison of the three local-LLM runners in June 2026 — Ollama, LM Studio, and Jan — on ease of use, model library, GUI vs CLI, OpenAI-compatible API, hardware support, and privacy.
Read articleWhich open-weight model for which hardware — Gemma 4, gpt-oss, Phi-4, Mistral Large 3, Llama 4, DeepSeek V4, and Kimi K2.6 compared by VRAM, license, and use case. When self-hosting beats an API, with verified benchmarks.
Read article
How we built a curated AI agent commentary system without logging sessions. The journey from raw surveillance to smart curation.
Read article