Google DeepMind · GA
Gemma 4
Google’s open-weight flagship: a 31B frontier-tier model on one GPU, now Apache 2.0.
| Context | 256K |
| Max output | not listed |
| Input /1M | Open |
| Output /1M | Open |
Best for
- Local-first / privacy-sensitive products (on-prem, healthcare, legal, finance)
- Cost-conscious agentic systems (26B A4B MoE + vLLM)
- Commercial fine-tuning under a clean license
Watch out
The academic benchmark gains (AIME 89.2%, GPQA 84.3%) come largely from Google’s own evals, so treat them as vendor-claimed until independently reproduced. Qwen3.7-Max narrowly outscores it on pure reasoning.
For creators. Run the 12B locally for offline multimodal work (image/audio on a 16GB laptop); use the 31B for on-prem RAG and coding where data can’t leave your box.
Benchmarks
| Benchmark | Score |
| --- | --- |
| LMArena Elo | 1452 |
| MMLU-Pro | 85.2 |
| GPQA Diamond | 84.3 |
| LiveCodeBench | 80 |
| AIME 2026 | 89.2 |
Capabilities
- Apache 2.0 license (replaces the custom Gemma Terms used through Gemma 3)
- 31B dense flagship runs in ~18GB VRAM at Q4 on a single 24GB consumer GPU
- 26B A4B — Gemma's first MoE (~3.8B active, ~15.6GB int4) for agentic throughput
- Encoder-free 12B multimodal (native audio) runs on a 16GB laptop
- E2B (~2GB) to E4B (~8GB) tiers for edge / on-device
- Up to 256K context; runs via Ollama / llama.cpp / LM Studio / vLLM / HF
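A rough way to sanity-check the VRAM figures above is back-of-envelope arithmetic: parameters times bits per weight, divided by 8. The sketch below is only an estimate. The function names, the 4.5 bits-per-weight value (an assumed effective rate for a Q4-style quantization), and the 2 GB overhead allowance for KV cache and runtime buffers are all illustrative assumptions, not published numbers.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in vram_gb.

    overhead_gb is a placeholder for KV cache and runtime buffers;
    real usage grows with context length and batch size.
    """
    return weight_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Parameter counts come from the card; the bit widths are assumptions.
    for name, params in [("31B dense", 31.0), ("26B A4B MoE", 26.0)]:
        exact_4bit = weight_gb(params, 4.0)
        assumed_q4 = weight_gb(params, 4.5)
        print(f"{name}: {exact_4bit:.1f} GB at 4.0 bpw, "
              f"{assumed_q4:.1f} GB at an assumed 4.5 bpw, "
              f"fits in 24 GB: {fits_in_vram(params, 4.5, 24.0)}")
```

For the MoE tier, the estimate uses the full 26B because every expert has to be resident in memory; the ~3.8B active parameters reduce compute per token, not the weight footprint.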