Intelligence DispatchesJune 6, 202611 min read

Best Local LLM to Run on Your Own Machine in 2026 (by RAM: 8GB / 16GB / 32GB)

Pick the right open model for your RAM. Verified params, quant levels, and VRAM for Qwen3, Gemma 3, Llama, and DeepSeek distills across 8GB, 16GB, and 32GB machines — plus the runner to use.

FrankX

AI Architect & Creator

Former Oracle AI architect · helped build Oracle's AI CoE

Share Share

Reading Goal

Match your machine's RAM to a specific open model you can run tonight, and the runner to load it with.

The short answer: match the model to your RAM at Q4_K_M quantization. On 8GB, run Qwen3 8B or Gemma 3 4B — fast, capable, leaves headroom for your OS. On 16GB, run Qwen3 14B or Gemma 3 12B — the sweet spot for daily work. On 32GB, run Qwen3 32B, Gemma 3 27B, or a DeepSeek-R1 distill 32B for reasoning. Load any of them with Ollama (developers), LM Studio (tinkerers), or Jan (newcomers). All three wrap the same llama.cpp engine, so speed is near-identical — pick on workflow, not benchmarks.

I run local models daily on a 32GB Intel rig — no cloud, no API bill, no data leaving the box. Local inference crossed the line from "fun demo" to "actually useful" sometime in 2025, and in 2026 a $0 open model running on a laptop you already own handles most of what people still pay per-token for. The catch is matching the model to your memory. Pick too big and it swaps to disk and crawls. Pick too small and you leave quality on the table. This is the lookup table for getting it right.

What does RAM actually decide?

RAM (or VRAM, if you have a discrete GPU) is the hard ceiling on what you can load. A model's file has to fit in memory alongside your operating system and the context window. The rule of thumb that holds across every model: at Q4 quantization, budget roughly 0.5–1 GB of memory per billion parameters, then leave 2–4 GB of headroom for the OS and the KV cache that grows with your prompt length.

Quantization is the lever that makes this work. A 14B model at full BF16 precision needs ~28 GB. The same model at Q4_K_M — 4-bit with K-quant mixing — drops to ~7–8 GB while losing under 1% accuracy on most benchmarks. Q4_K_M is the de-facto standard for local inference for exactly this reason: roughly 75% memory savings, minimal quality cost. If you have spare memory, step up to Q5_K_M or Q6. If you're tight, Q4_K_M is the floor I'd accept.

Which local LLM should I run for my RAM?

This is the citable unit. Every model below is Apache-licensed or open-weight, available today, and verified at Q4_K_M unless noted.

Your RAM	Top pick	Params / quant	Approx. memory	Also good	Best for
8 GB	Qwen3 8B	8B / Q4_K_M	~5 GB	Gemma 3 4B (~2.6 GB), Llama 3.2 3B	Chat, coding help, summarizing on a thin laptop
16 GB	Qwen3 14B	14B / Q4_K_M	~8.5 GB	Gemma 3 12B (~6.7 GB), DeepSeek-R1 distill 14B (~6.5 GB)	The daily-driver sweet spot — real work, comfortable headroom
32 GB	Qwen3 32B	32B / Q4_K_M	~19 GB	Gemma 3 27B (~15.1 GB), DeepSeek-R1 distill 32B (~18 GB)	Heavier reasoning, longer context, near-frontier quality offline

Notes that matter:

Qwen3 (released April 2025, Apache 2.0) is the strongest all-rounder across every tier. The dense lineup runs 0.6B / 1.7B / 4B / 8B / 14B / 32B, and Qwen3's own benchmarks put each size roughly on par with the next size up from the previous generation.
Gemma 3 (Google) comes in 1B / 4B / 12B / 27B. The 4B, 12B, and 27B are multimodal with a 128K context window. Google's Quantization-Aware Training (QAT) builds preserve near-BF16 quality at ~3x lower memory — worth grabbing the QAT GGUF when offered.
DeepSeek-R1 distills are reasoning-tuned models built on Qwen backbones. Reach for these when you want chain-of-thought on math, logic, or code — not for fast casual chat, where they over-think.
Llama 4 Scout (109B MoE, 17B active) gets cited as a "fits in 10GB VRAM" headline, but in practice its INT4 weights are ~55 GB — that's a workstation-GPU model, not a 16GB-laptop model. Don't let the active-parameter number fool you; you still load the full weights.

What's the best local LLM for 16GB RAM specifically?

Qwen3 14B at Q4_K_M. It lands around 8.5 GB, leaving you ~7 GB for the OS, your editor, and a generous context window. It's the closest thing to a free, private, always-available assistant that a mainstream laptop can run without compromise.

If 14B feels heavy on your machine — older CPU, integrated graphics, lots of other apps open — drop to Gemma 3 12B (~6.7 GB) for more breathing room, or run Qwen3 8B and spend the saved memory on a longer context window. For reasoning-heavy tasks on a 16GB box, the DeepSeek-R1 distill 14B (~6.5 GB) is the pick. 16GB is genuinely the comfort zone of local AI in 2026: enough for a 14B-class model that handles coding, drafting, and analysis without you babysitting memory.

What runs an 8GB machine without choking?

8GB is the floor, and it works — you just stay in the 3B–8B range. Qwen3 8B at Q4_K_M (~5 GB) is the headline pick and leaves enough for a browser. If you also run heavy apps, step down to Gemma 3 4B (~2.6 GB) or Llama 3.2 3B, both of which are quick even on CPU-only inference.

The honest trade-off at 8GB: shorter context windows and the occasional "I don't know" where a 14B would've answered. But for summarizing a document, drafting an email, or rubber-ducking a bug, an 8GB machine running Qwen3 8B is more than enough — and it's running entirely on hardware you already paid for.

What can a 32GB machine actually run?

32GB is where local inference stops feeling like a compromise. You can run Qwen3 32B (~19 GB), Gemma 3 27B (~15.1 GB), or a DeepSeek-R1 distill 32B (~18 GB) at Q4_K_M and still have room for the OS plus a long context window. These are near-frontier on a lot of everyday tasks — and they're running on my desk with no network connection.

The move on 32GB is to keep two models pulled: a fast small one (Qwen3 8B) for quick turns, and a 32B for the questions that deserve it. Swapping between them in Ollama is one command. For how this slots into a full offline assistant stack, see Build Your Own Jarvis with Claude Code — the same routing logic applies whether the brain is local or cloud.

Ollama, LM Studio, or Jan — which runner?

All three use llama.cpp under the hood, so raw tokens-per-second is within a few percent on identical hardware. You choose on workflow, not speed.

Ollama — CLI-first, leanest memory footprint (small background service, no Chromium window), OpenAI-compatible API. Every AI framework and IDE integrates with it. Pick this if you're a developer who wants the model to behave like any other service you script against.
LM Studio — polished GUI with the best model-discovery experience. Adds ~300–500 MB for the Electron shell. Pick this if you're a tinkerer comparing quantizations before committing, and you don't mind a closed-source app.
Jan — open-source, privacy-first, with native MCP (Model Context Protocol) support so local models can call tools — which neither Ollama nor LM Studio do natively. Pick this if you're new, want a clean chat UI, or care about license purity and a codebase you can inspect.

My default is Ollama for scripting plus LM Studio when I'm shopping for a new quant. Jan is the one I point non-developers to.

Is running local actually worth it versus the cloud?

Two reasons it's worth it, and one honest caveat.

Privacy. Nothing leaves your machine. No prompt, no document, no codebase gets logged on someone else's server or used to train the next model. For anything sensitive — client work, personal data, unreleased writing — that alone settles it.

Cost. A frontier API call costs per-token forever. A local model costs the electricity to run a laptop you already own. If you're a heavy user, the math flips fast — and there's no metered anxiety about how many times you hit the model. For the full picture of where local fits alongside the paid frontier models, see the frontier model landscape for 2026 and the best AI superpowers stack.

The caveat: a 32B local model is not GPT-class or Claude-class on the hardest reasoning. For frontier-tier work — deep agentic coding, long-horizon planning — the cloud still wins. The right architecture in 2026 is both: local for the 80% that's private, cheap, and offline-capable; cloud for the 20% that genuinely needs the biggest brain. I build the creator-side of that split into GenCreator.

Do I need a fancy GPU, or will my CPU do?

You don't strictly need a GPU. llama.cpp runs on CPU, and on Apple Silicon the unified memory architecture makes it genuinely fast. On a typical Intel or AMD laptop, CPU-only inference of an 8B model is usable — think reading-speed, not instant.

A GPU changes the experience, not the possibility. VRAM is the constraint that matters: a discrete card with 8–16 GB of VRAM lets the model live entirely on the GPU and run several times faster than CPU. If you're buying for local AI, more VRAM beats a faster core — a card with 16 GB of VRAM running a model fully in-GPU will outrun a faster card with 8 GB that has to spill to system RAM. And on the system side, 32 GB of RAM is the upgrade that unlocks the 32B tier. I name specific hardware honestly below.

FAQ

What is the best local LLM for 16GB RAM in 2026? Qwen3 14B at Q4_K_M quantization. It uses about 8.5 GB, leaving comfortable headroom for your OS and a long context window. For reasoning-heavy tasks, swap in the DeepSeek-R1 distill 14B (~6.5 GB); for more breathing room, Gemma 3 12B (~6.7 GB).

What quantization should I use? Q4_K_M for almost everyone — it's 4-bit with K-quant mixing, cuts memory ~75%, and loses under 1% accuracy on most benchmarks. If you have spare memory, step up to Q5_K_M or Q6 for a small quality bump. Q4_K_M is the floor I'd accept for serious use.

Can I run a local LLM without a GPU? Yes. llama.cpp (which Ollama, LM Studio, and Jan all use) runs on CPU, and it's genuinely fast on Apple Silicon's unified memory. On a standard Intel or AMD laptop, an 8B model runs at roughly reading speed CPU-only. A GPU with enough VRAM makes it several times faster but isn't required to start.

Is Ollama or LM Studio better? Neither is faster — both wrap llama.cpp. Ollama is CLI-first with the leanest footprint and an OpenAI-compatible API, ideal for developers and scripting. LM Studio is a polished GUI with the best model discovery, ideal for tinkerers comparing quants. Jan is the open-source, MCP-capable option for newcomers.

Will a local model match GPT or Claude? Not on the hardest reasoning. A 32B local model at Q4 is excellent for everyday chat, drafting, summarizing, and a lot of coding — but frontier cloud models still lead on deep agentic and long-horizon tasks. The smart setup runs both: local for private, cheap, offline work; cloud for the questions that need the biggest brain.

Do I need to buy more RAM or a GPU? Only if you want a bigger model than your current memory allows. 8 GB runs an 8B model fine. 16 GB unlocks the 14B sweet spot. 32 GB of system RAM unlocks the 32B tier. If you buy a GPU, prioritize VRAM over raw speed — a 16 GB card that holds the whole model beats a faster 8 GB card that spills to system memory. (Hardware links below go through Amazon Associates; I only name parts I'd actually run, and I earn a small commission if you buy through them — no fake links, no padding.)

Match the model to the memory, load it with the runner that fits how you work, and you have a private, free assistant running tonight. The cloud is for the 20% that needs it. The other 80% can live on your own machine.

Get Started

Build your first AI system

Step-by-step guide to setting up ACOS, creating your first agent, and shipping real products with AI.

Start building

Templates & Blueprints

Production-ready architecture

Download AI architecture templates, multi-agent blueprints, and prompt engineering patterns.

Browse templates

Inner Circle

Join the builder community

Connect with creators and architects shipping AI products. Weekly office hours, shared resources, direct access.

Join the circle

Stay in the intelligence loop

Weekly field notes on AI systems, production patterns, and builder strategy.

Continue Reading

Intelligence Dispatches12 min read

Ollama vs LM Studio vs Jan 2026: The Best Way to Run AI Locally

A tested comparison of the three local-LLM runners in June 2026 — Ollama, LM Studio, and Jan — on ease of use, model library, GUI vs CLI, OpenAI-compatible API, hardware support, and privacy.

Read article

Intelligence Dispatches9 min read

The Best Open & Local LLMs in 2026: A Self-Host Field Guide

Which open-weight model for which hardware — Gemma 4, gpt-oss, Phi-4, Mistral Large 3, Llama 4, DeepSeek V4, and Kimi K2.6 compared by VRAM, license, and use case. When self-hosting beats an API, with verified benchmarks.

Read article

AI Architecture5 min read

Building Privacy-First AI Transparency: The Agent Feed Architecture

How we built a curated AI agent commentary system without logging sessions. The journey from raw surveillance to smart curation.

Read article

Intelligence DispatchesJune 6, 202611 min read

Best Local LLM to Run on Your Own Machine in 2026 (by RAM: 8GB / 16GB / 32GB)

Pick the right open model for your RAM. Verified params, quant levels, and VRAM for Qwen3, Gemma 3, Llama, and DeepSeek distills across 8GB, 16GB, and 32GB machines — plus the runner to use.

FrankX

AI Architect & Creator

Former Oracle AI architect · helped build Oracle's AI CoE

Share Share

Reading Goal

Match your machine's RAM to a specific open model you can run tonight, and the runner to load it with.

What does RAM actually decide?

Which local LLM should I run for my RAM?

This is the citable unit. Every model below is Apache-licensed or open-weight, available today, and verified at Q4_K_M unless noted.

Your RAM	Top pick	Params / quant	Approx. memory	Also good	Best for
8 GB	Qwen3 8B	8B / Q4_K_M	~5 GB	Gemma 3 4B (~2.6 GB), Llama 3.2 3B	Chat, coding help, summarizing on a thin laptop
16 GB	Qwen3 14B	14B / Q4_K_M	~8.5 GB	Gemma 3 12B (~6.7 GB), DeepSeek-R1 distill 14B (~6.5 GB)	The daily-driver sweet spot — real work, comfortable headroom
32 GB	Qwen3 32B	32B / Q4_K_M	~19 GB	Gemma 3 27B (~15.1 GB), DeepSeek-R1 distill 32B (~18 GB)	Heavier reasoning, longer context, near-frontier quality offline

Notes that matter:

Qwen3 (released April 2025, Apache 2.0) is the strongest all-rounder across every tier. The dense lineup runs 0.6B / 1.7B / 4B / 8B / 14B / 32B, and Qwen3's own benchmarks put each size roughly on par with the next size up from the previous generation.
Gemma 3 (Google) comes in 1B / 4B / 12B / 27B. The 4B, 12B, and 27B are multimodal with a 128K context window. Google's Quantization-Aware Training (QAT) builds preserve near-BF16 quality at ~3x lower memory — worth grabbing the QAT GGUF when offered.
DeepSeek-R1 distills are reasoning-tuned models built on Qwen backbones. Reach for these when you want chain-of-thought on math, logic, or code — not for fast casual chat, where they over-think.
Llama 4 Scout (109B MoE, 17B active) gets cited as a "fits in 10GB VRAM" headline, but in practice its INT4 weights are ~55 GB — that's a workstation-GPU model, not a 16GB-laptop model. Don't let the active-parameter number fool you; you still load the full weights.

What's the best local LLM for 16GB RAM specifically?

What runs an 8GB machine without choking?

What can a 32GB machine actually run?

Ollama, LM Studio, or Jan — which runner?

All three use llama.cpp under the hood, so raw tokens-per-second is within a few percent on identical hardware. You choose on workflow, not speed.

Ollama — CLI-first, leanest memory footprint (small background service, no Chromium window), OpenAI-compatible API. Every AI framework and IDE integrates with it. Pick this if you're a developer who wants the model to behave like any other service you script against.
LM Studio — polished GUI with the best model-discovery experience. Adds ~300–500 MB for the Electron shell. Pick this if you're a tinkerer comparing quantizations before committing, and you don't mind a closed-source app.
Jan — open-source, privacy-first, with native MCP (Model Context Protocol) support so local models can call tools — which neither Ollama nor LM Studio do natively. Pick this if you're new, want a clean chat UI, or care about license purity and a codebase you can inspect.

My default is Ollama for scripting plus LM Studio when I'm shopping for a new quant. Jan is the one I point non-developers to.

Is running local actually worth it versus the cloud?

Two reasons it's worth it, and one honest caveat.

Do I need a fancy GPU, or will my CPU do?