Google DeepMind (GA)

Gemma 4

Name: Gemma 4
Author: Google DeepMind

Google’s open-weight flagship: a 31B frontier-tier model that runs on one GPU, now under Apache 2.0.


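Context window: 256K tokens
Max output: not listed
Input price / 1M tokens: Open (open weights; no per-token list price)
Output price / 1M tokens: Open (open weights; no per-token list price)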

Best for

Local-first / privacy-sensitive products (on-prem, healthcare, legal, finance)
Cost-conscious agentic systems (26B A4B MoE + vLLM)
Commercial fine-tuning under a clean license

Watch out

The academic benchmark jumps (AIME 89.2%, GPQA 84.3%) come largely from Google’s own evaluations, so treat them as vendor claims until they are independently reproduced. Qwen3.7-Max narrowly outscores it on pure reasoning.

For creators: run the 12B locally for offline multimodal work (image and audio on a 16GB laptop), or the 31B for on-prem RAG and coding where data can’t leave your machine.
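Local apps usually talk to a model served by Ollama (one of the runtimes listed under Capabilities) through its REST API. Below is a minimal sketch using only the Python standard library. It assumes Ollama is running on its default port 11434. The model tag `gemma4:12b` is hypothetical, so substitute whatever tag `ollama list` shows on your machine.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint for single-shot generation.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Hypothetical model tag; replace with the tag your Ollama install reports.
MODEL_TAG = "gemma4:12b"


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming JSON POST for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = MODEL_TAG, timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With stream=False, Ollama returns a single JSON object whose
    # "response" field holds the full completion.
    return body["response"]
```

Calling `generate("Summarize this contract clause: ...")` keeps the prompt and the output on the local machine. Because nothing leaves localhost, this pattern suits the privacy-sensitive use cases above.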

Benchmarks

LMArena Elo	1452
MMLU-Pro	85.2%
GPQA Diamond	84.3%
LiveCodeBench	80%
AIME 2026	89.2%
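The LMArena figure is a rating, not a percentage. It only means something relative to other models. The sketch below assumes the standard Elo logistic (base 10, 400-point scale) and shows how a rating gap turns into an expected head-to-head preference rate. The opponent rating of 1400 is hypothetical and chosen only for illustration.

```python
def expected_win_prob(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (base-10 logistic, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# 1400 is a hypothetical opponent rating, not a listed model.
p = expected_win_prob(1452, 1400)
print(f"{p:.3f}")  # prints 0.574
```

Under this assumption, a 52-point lead works out to being preferred about 57% of the time. The takeaway is that small Elo gaps between top models are close contests, not clear wins.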

Capabilities

Apache 2.0 license (replaces the custom Gemma Terms used through Gemma 3)
31B dense flagship needs ~18GB VRAM at Q4, so it fits on a single 24GB consumer GPU
26B A4B: Gemma's first mixture-of-experts (MoE) model (~3.8B active parameters, ~15.6GB at int4), aimed at agentic throughput
Encoder-free 12B multimodal model with native audio input runs on a 16GB laptop
E2B (~2GB) to E4B (~8GB) tiers for edge / on-device
Up to 256K context; runs via Ollama, llama.cpp, LM Studio, vLLM, and Hugging Face
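The memory and throughput claims above can be sanity-checked with back-of-envelope arithmetic. Raw weight size is parameters × bits ÷ 8 bytes. A common rule of thumb puts forward-pass cost at about 2 FLOPs per *active* parameter per token. This is a sketch, not a profiler. It ignores the KV cache (which grows with context length), attention FLOPs, and MoE routing overhead. It also assumes every expert stays resident in memory, with no offloading.

```python
def weight_gb(params: float, bits: float) -> float:
    """Raw weight storage in decimal GB: params * bits / 8 bytes."""
    return params * bits / 8 / 1e9


def forward_flops(active_params: float) -> float:
    """Rule of thumb: ~2 FLOPs per active parameter per generated token."""
    return 2 * active_params


# 31B dense at 4 bits: raw weights only, before runtime overhead.
dense_31b_gb = weight_gb(31e9, 4)    # 15.5 GB
# 26B MoE at 4 bits: all experts must be loaded even though few are used per token.
moe_26b_gb = weight_gb(26e9, 4)      # 13.0 GB
# Per-token compute ratio: 31B dense vs ~3.8B active parameters in the MoE.
compute_ratio = forward_flops(31e9) / forward_flops(3.8e9)

print(f"31B dense @4-bit: {dense_31b_gb:.1f} GB raw weights")
print(f"26B MoE   @4-bit: {moe_26b_gb:.1f} GB raw weights")
print(f"Dense/MoE per-token FLOPs: ~{compute_ratio:.1f}x")
```

The listed ~18GB and ~15.6GB figures sit a few GB above these raw numbers. Plausibly, the gap is quantization scales, any layers kept at higher precision, and runtime buffers, though that breakdown is an assumption. The roughly 8x lower per-token compute of the MoE, with a comparable memory footprint, is what makes it the better fit for high-throughput agentic loops.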

Compare Gemma 4

gpt-oss vs Gemma 4

Gemma 4 wins for multimodal work on a single consumer GPU; gpt-oss wins on reasoning per gigabyte and scales up to gpt-oss-120b on a single 80GB GPU. Both are Apache 2.0.
