Which open-weight model for which hardware — Gemma 4, gpt-oss, Phi-4, Mistral Large 3, Llama 4, DeepSeek V4, and Kimi K2.6 compared by VRAM, license, and use case. When self-hosting beats an API, with verified benchmarks.
TL;DR: In June 2026 you can run genuinely capable models on your own hardware — from a 2GB edge model to a frontier-class 1.6-trillion-parameter MoE. The honest picture: Gemma 4 (Apache 2.0) is the best quality-per-consumer-GPU pick, gpt-oss wins reasoning-per-VRAM, Phi-4 is the laptop/edge specialist, Mistral Large 3 is the EU-sovereign open frontier, Llama 4 brings multimodality and a 10M-context sibling, and DeepSeek V4 and Kimi K2.6 are the open-weight models that get closest to the closed frontier. None of them dethrone Claude Opus 4.8 or GPT-5.5 — the reason to self-host is control, privacy, license terms, and cost at volume, not the top of the leaderboard. Here's which one to run, and on what.
The closed frontier (Claude Opus 4.8, GPT-5.5, Gemini 3.5) still leads on raw capability. So the case for open weights isn't "it's better." It's four concrete things: control, privacy, license terms, and cost at volume.
If none of those apply to your workload, route to a frontier API and move on. If one or more do, read on.
| Model | Org | License | Params | Min VRAM (realistic) | Context | Best for |
|---|---|---|---|---|---|---|
| Phi-4 | Microsoft | MIT | 3.8B–14B | ~3-4GB (mini) → ~8-10GB (14B Q4) | 16K–128K | Laptops, edge, STEM/extraction |
| Gemma 4 | Google | Apache 2.0 | E2B → 31B | ~2GB → ~18GB Q4 (31B) | 256K | Best quality on one consumer GPU |
| gpt-oss | OpenAI | Apache 2.0 | 20B / 120B (MoE) | ~16GB (20b) / ~80GB (120b) | 131K | Reasoning-per-VRAM on one card |
| Mistral Large 3 | Mistral AI | Apache 2.0 | 675B / 41B (MoE) | 8×H200 (FP8) | 256K | EU sovereignty, multilingual |
| Llama 4 Maverick | Meta | Llama Community | 400B / 17B (MoE) | 8×H100 (Scout: 1×H100) | 1M (10M Scout) | Multimodal, permissive ecosystem |
| DeepSeek V4 | DeepSeek | MIT | 1.6T / 49B (Pro) | Data-center (Flash 284B self-hostable) | 1M | Max open coding capability |
| Kimi K2.6 | Moonshot | Modified MIT | 1T / 32B (MoE) | Data-center | 256K | Top open-weights intelligence |
Numbers are cross-checked against vendor model cards and the neutral Artificial Analysis Intelligence Index; "Min VRAM" reflects realistic quantized self-host, not the theoretical floor. MoE "active params" is misleading for memory — you still need every expert resident, so size for total parameters.
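The "size for total parameters" rule can be sketched as a weights-only estimate. This is a rough calculation, not a vendor formula: the function name is illustrative, the bits-per-weight values are assumptions about the quantization used, and the result excludes KV cache and runtime overhead, so real requirements sit above it.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Uses TOTAL parameters, not active ones: every expert of an MoE
    must be resident in memory, even if only a few run per token.
    """
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# (label, total params in billions, assumed bits per weight)
examples = [
    ("Gemma 4 31B @ 4-bit", 31, 4),
    ("Llama 4 Maverick @ FP8", 400, 8),
    ("Mistral Large 3 @ FP8", 675, 8),
]

for label, params_b, bits in examples:
    print(f"{label}: ~{weight_memory_gb(params_b, bits):.1f} GB of weights")
```

For Gemma 4 31B this gives about 15.5 GB of weights, which is why the table's realistic figure of ~18GB is higher. Llama 4 Maverick at FP8 comes out near 400 GB despite its 17B active parameters.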
On a laptop or a GPU with roughly 16GB or less, run Phi-4 (the 3.8B mini fits in ~3-4GB; the 14B at Q4 needs ~8-10GB), gpt-oss-20b (~16GB, and it reasons well above its size), or Gemma 4's 12B / E4B tiers. Phi-4 is the STEM and function-calling specialist; Gemma 4 12B adds native audio and vision; gpt-oss-20b is the strongest pure reasoner of the three. See the Phi-4 deep dive and the Gemma 4 deep dive.
Gemma 4 31B at Q4 (~18GB) is the sweet spot: frontier-tier quality, native multimodal, Apache 2.0, on a card you can buy. This is the single best "serious work on my own machine" option in 2026.
gpt-oss-120b is purpose-built for a single 80GB card: MXFP4 quantization puts a near-o4-mini-class reasoner in that footprint. DeepSeek V4-Flash (284B/13B) is the other strong single-node-ish option. Details in the gpt-oss deep dive and the DeepSeek V4 deep dive.
Multi-GPU, data-center-class hardware unlocks the big MoEs: Mistral Large 3 (Apache 2.0, FP8 on one 8×H200 node), Llama 4 Maverick, DeepSeek V4-Pro (1.6T/49B), and Kimi K2.6 (1T/32B). These are the open models that get closest to the closed frontier, at the cost of a real infrastructure project.
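The single-GPU picks above can be sketched as a simple filter over available VRAM. The thresholds are the upper ends of the "Min VRAM (realistic)" figures in the comparison table; the dictionary and function names are illustrative, and the filter leaves no headroom for long contexts, so treat each number as a floor.

```python
# Upper end of the "Min VRAM (realistic)" figures from the comparison table, in GB.
MIN_VRAM_GB = {
    "Phi-4 mini (3.8B)": 4,
    "Phi-4 14B (Q4)": 10,
    "gpt-oss-20b": 16,
    "Gemma 4 31B (Q4)": 18,
    "gpt-oss-120b": 80,
}


def models_that_fit(vram_gb: float) -> list[str]:
    """Return models whose realistic minimum fits in vram_gb, largest first."""
    fitting = [(need, name) for name, need in MIN_VRAM_GB.items() if need <= vram_gb]
    return [name for need, name in sorted(fitting, reverse=True)]


# Example: a 24 GB consumer card.
print(models_that_fit(24))
```

Sorting largest-first puts the most demanding model that still fits at the front of the list. It is a starting point for evaluation, not a ranking by quality.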
A quiet but important 2026 story: Gemma 4 moved to Apache 2.0, dropping the custom "Gemma Terms" that governed Gemma 3. That puts Google's open family on the same permissive footing as gpt-oss and Mistral — unrestricted commercial use, no fine-tune disclosure, no redistribution friction.
The hierarchy, from most to least permissive for a commercial product:
If you're embedding a model in a product you intend to scale, the license is as load-bearing as the benchmark.
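A first-pass license check can be sketched as a lookup. The license names come from the comparison table and the 700M monthly-active-user threshold from the Llama 4 terms as quoted in this article; the function and its return values are hypothetical. Kimi K2.6's "Modified MIT" is flagged for review because the article does not spell out the modifications.

```python
LLAMA_MAU_LIMIT = 700_000_000  # threshold quoted in this article for Llama 4

# License per model family, as listed in the comparison table.
LICENSES = {
    "Phi-4": "MIT",
    "DeepSeek V4": "MIT",
    "Gemma 4": "Apache 2.0",
    "gpt-oss": "Apache 2.0",
    "Mistral Large 3": "Apache 2.0",
    "Kimi K2.6": "Modified MIT",
    "Llama 4": "Llama Community",
}


def license_flag(model: str, monthly_active_users: int) -> str:
    """Return 'ok' or 'needs-review' as a first-pass triage, not a legal opinion."""
    license_name = LICENSES[model]
    if license_name in ("MIT", "Apache 2.0"):
        return "ok"
    if license_name == "Llama Community":
        return "ok" if monthly_active_users < LLAMA_MAU_LIMIT else "needs-review"
    # Anything else (e.g. Modified MIT): read the actual license text first.
    return "needs-review"


print(license_flag("Gemma 4", 1_000_000_000))
print(license_flag("Llama 4", 1_000_000_000))
```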
Open weights are free; inference is not. A single 8×H100 node is a six-figure capital expense or roughly $20–40/hour rented. The math that decides it:
The trap is treating "$0 weights" as "$0 to run." For most teams, the right pattern is hybrid: prototype and burst on a hosted open endpoint, and only stand up your own GPUs once volume and/or compliance justify it.
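The break-even arithmetic can be sketched under simple assumptions. It uses the rented-node range above ($20–40/hour) and the hosted open-model price range this article quotes ($0.04–$1.50 per 1M tokens). The function name is illustrative, and the calculation ignores ops staffing, idle time, and whether a node can actually sustain the required throughput.

```python
def breakeven_tokens_per_hour(node_usd_per_hour: float, api_usd_per_million: float) -> float:
    """Tokens per hour at which a rented node costs the same as a hosted API."""
    return node_usd_per_hour / api_usd_per_million * 1_000_000


for node_rate in (20.0, 40.0):       # rented 8-GPU node, USD per hour
    for api_rate in (0.04, 1.50):    # hosted endpoint, USD per 1M tokens
        tokens = breakeven_tokens_per_hour(node_rate, api_rate)
        print(
            f"node ${node_rate:.0f}/h vs API ${api_rate:.2f}/1M: "
            f"break-even at ~{tokens / 1e6:.1f}M tokens/hour"
        )
```

Below the break-even volume the hosted endpoint is cheaper; above it, and only if the node stays busy, self-hosting starts to pay off. Against the cheapest hosted prices the break-even volume is in the hundreds of millions of tokens per hour, which is the reasoning behind the hybrid pattern.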
Be honest about the gap. On the neutral Artificial Analysis Intelligence Index, Claude Opus 4.8 sits around 61 and GPT-5.5 around 59–60. The best open-weight model, Kimi K2.6 (54), is a real step behind; Qwen3.7-Max scores higher at 56.6 but is closed. The locally runnable models (Gemma 4, gpt-oss, Phi-4) trade more capability for the privilege of fitting on your hardware.
That's the correct trade, not a disappointment. You self-host to own the model, run it offline, keep data in-house, and avoid a per-token meter — and in 2026 you can do all of that with quality that would have been frontier-class barely a year ago. For the full proprietary comparison, see the Frontier AI Models Intelligence Hub.
For raw capability, DeepSeek V4 and Kimi K2.6 lead the open-weight field (Kimi tops the neutral Artificial Analysis Intelligence Index among open models at 54). For something you can actually run on your own hardware, Gemma 4 (best on one consumer GPU) and gpt-oss (best reasoning-per-VRAM) are the picks. There's no single winner — it depends on your hardware and license needs.
Phi-4 (3.8B mini in ~3-4GB, 14B in ~8-10GB at Q4), gpt-oss-20b (~16GB), or Gemma 4's 12B / E4B tiers. Phi-4 is the STEM/extraction specialist; gpt-oss-20b is the strongest reasoner; Gemma 4 12B adds native audio and vision.
No. Mixture-of-Experts models like Llama 4 Maverick (400B total / 17B active) or DeepSeek V4 still need every expert resident in memory, so you size for total parameters, not active. "17B active" describes compute per token, not memory footprint.
The weights are free; the inference is not. You pay for GPUs (capex or rental). Self-hosting only beats a hosted API at high steady volume or under a data-residency requirement. For low or spiky usage, a hosted open-model endpoint (~$0.04–$1.50 per 1M tokens) is usually cheaper.
The MIT-licensed models (Phi-4, DeepSeek V4) and the Apache 2.0 models (Gemma 4, gpt-oss, Mistral Large 3) all allow unrestricted commercial use, fine-tuning, and redistribution. Gemma 4 notably moved to Apache 2.0 from the custom Gemma Terms. Llama 4's Community License is free only under 700M monthly active users.
Not on raw capability — the best open weights trail the closed frontier by a meaningful margin on neutral benchmarks. They match or beat the frontier on the axes that make people self-host: data control, license freedom, offline operation, and cost at scale.
Analysis by Frank — AI Architect, builder of the Agentic Creator Operating System. Compiled June 5, 2026 from vendor model cards and independent sources (Artificial Analysis, llm-stats, LMArena, NIST/CAISI). Verified vs vendor-claimed figures are distinguished in each model's linked deep dive.