Mac or NVIDIA for local LLMs?

Mac if you want big-model capacity at low power and one box (unified memory wins for MoE). NVIDIA if you want creator workloads (FLUX/Hunyuan/training), multi-user serving via vLLM, or maximum decode speed on 7-32B dense models. RTX 5090 prefill is 10× a Mac. Mac Studio M3 Ultra capacity is 8-16× a 5090.

How loud is an RTX 5090 in a quiet office?

~32.5 dBA at idle (inaudible), ~45-50 dBA under sustained load (audible but not bad), with possible coil whine that some workstations report as more annoying than the fans. Founders Edition is fine; cheaper AIB cards vary widely. Plan for it on the desk side of a wall, not on top of it.

Is DGX Spark worth $4,699 over a Mac Studio?

Only if you need CUDA-native development end-to-end with the same GPU architecture you'll deploy on. Spark's 273 GB/s bandwidth is half of M4 Max and a third of M3 Ultra — decode speed loses. The pitch is "develop locally, deploy to DGX cloud unchanged." For pure inference, Mac wins per dollar.

Which quant should I run?

Q6_K if you have the memory — it's effectively lossless (102% of baseline in some measurements) and the quality gap from Q4_K_M to Q6_K is larger than Q6_K to Q8_0. Q4_K_M is the right call when memory-constrained — 92-95% quality, half the size. On Apple Silicon use MLX 4-bit; the GGUF/MLX quant formats don't cross-pollinate.

Can I run this on a normal home circuit?

Single RTX 5090 + workstation = ~750W peak, fine on a dedicated 15A or 20A 120V circuit. Mac Studio M3 Ultra peaks ~480W, trivially fine. Dual 5090 (2,800W PSU) requires a 200-240V circuit — usually means an electrician. Used H100 SXM rigs need 3-phase power; don't put one at home unless you've already done the wiring.

Research Hub/Home AI Labs & Personal Compute

Home AI Labs & Personal Compute

Prosumer hardware and stacks for serious local AI in 2026

TL;DR

Mac Studio M3 Ultra and NVIDIA DGX Spark ($4,699) anchor sub-$5K personal compute. Strix Halo mini-PCs (Framework Desktop $1,999 / GMKtec EVO-X2 $1,499) run 70B at 65-120W. RTX 5090 (32GB, 1,792 GB/s) is the single-GPU creator king. MLX has overtaken llama.cpp on Apple Silicon for sub-27B decode.

Updated 2026-06-2226 sources validated

Research briefs like this — one per week. Validated sources, no filler.

17-18 tok/s

DeepSeek-R1 671B (Q4) on M3 Ultra Mac Studio under 200W

MacRumors / Apple

$4,699

NVIDIA DGX Spark Founders Edition (Feb 2026), 128GB unified, 1 PFLOP FP4

NVIDIA / Constellation Research

$1,499

GMKtec EVO-X2 128GB unified — cheapest 70B-capable mini-PC

Tom's Hardware

32 tok/s

Qwen3 235B on 4× M3 Ultra cluster via exo + RDMA over Thunderbolt 5

exo / Hardware Corner

Reference Builds by Budget Tier

Three things determine what runs: unified-memory ceiling (Apple/AMD) or VRAM ceiling (NVIDIA), memory bandwidth (decode speed), and wall-socket reality (a dual-5090 needs a 200-240V circuit). Below $5K, the smart money is on unified memory. Above $10K, NVIDIA's discrete-GPU bandwidth pulls ahead for prefill-heavy and creative workloads.

Under $5K

Entry

Framework Desktop 128GB ($1,999) or GMKtec EVO-X2 128GB ($1,499) for a 65-120W Strix Halo box running 70B Q4 at ~6-8 tok/s. NVIDIA DGX Spark ($4,699, 128GB unified, 273 GB/s, 300W) for a Blackwell-native dev box. Mac Studio M4 Max 128GB ($3,499) for the MLX path — Llama 4 Maverick unquantized at 8-12 tok/s.

$5K-$10K

Prosumer

Mac Studio M3 Ultra 256GB (current top SKU after Apple pulled 512GB in early 2026 amid the RAM price squeeze) — 819 GB/s bandwidth, runs Qwen3 235B at 28+ tok/s. Or single RTX 5090 workstation (32GB GDDR7, 1,792 GB/s) for creator workloads — FLUX.1 dev in 9s/image, Qwen3 30B-A3B at 200+ tok/s.

$10K-$25K

Pro

Dual RTX 5090 workstation (Puget Systems config, 2,800W PSU, requires 240V circuit) for parallel SD/Flux/training. Or 2-4× Mac Studio M3 Ultra cluster via Thunderbolt 5 + exo + RDMA (macOS 26.2+) — up to 200B+ models, 1.5TB pooled VRAM. Used H100 PCIe 80GB (~$20K) for an NVLink-free single-GPU rig.

Above $25K

Lab

Used H100 SXM5 80GB modules surface on eBay $20K-$32K but require enterprise SXM baseboard and 3-phase power — almost never the right answer at home. Better: 4× RTX 6000 Ada (48GB each) in a Threadripper PRO chassis for 192GB pooled VRAM at sane power. Or NVIDIA DGX Station (Blackwell Ultra, 784GB coherent memory) for a turnkey personal datacenter.

Local Inference Stack

Pick by user count and platform. Single user on Apple Silicon: MLX (or Ollama 0.19+ which now uses MLX as backend) — 230 tok/s sustained on M5 Max, vs llama.cpp at ~150. Single user on NVIDIA: llama.cpp or Ollama for ergonomics, LM Studio for a GUI. Multi-user serving (5+ concurrent): vLLM, which hits ~16-20× Ollama's concurrent throughput via PagedAttention. Distributed across machines: exo with RDMA-over-Thunderbolt-5 for Mac clusters, or vLLM tensor-parallel for NVIDIA.

MLX (Apple Silicon)

Mac default

Now the default on Mac. Zero-copy unified memory. 4.06× faster TTFT than llama.cpp's Metal backend on M5. Caveat: decode-only benchmarks overstate the win at >40K context.

Ollama 0.19+

Easy mode

Wraps llama.cpp on NVIDIA/AMD, MLX on Apple Silicon. CLI-first, OpenAI-compatible API. Best ergonomics for single-user local. Ollama 0.19 + MLX = prefill 1,154 → 1,810 tok/s, decode 58 → 112 tok/s on M5 Max Qwen3.5-35B.

vLLM

Production

Production serving. PagedAttention + continuous batching = 16-20× Ollama's concurrent throughput. Use when you're putting a local model behind an API for a team. Requires NVIDIA (or AMD ROCm) — not a single-user tool.

llama.cpp / LM Studio / Jan / OpenWebUI / exo

Ecosystem

llama.cpp is the upstream engine everything else wraps. LM Studio is the Electron GUI (~400MB RSS baseline). Jan and OpenWebUI are open-source ChatGPT clones. exo distributes inference across a cluster. Llamafile bundles a model + runtime in one binary.

Model Tier List by RAM

What actually runs well at what memory ceiling, Q2 2026. The cliffs are: 16GB unlocks 7-8B usable; 32GB unlocks 30-35B MoE (Qwen3 30B-A3B, Llama 4 Scout activated weights); 64GB unlocks dense 70B at Q4_K_M; 128GB unlocks Llama 4 Maverick unquantized + room for context; 192GB+ unlocks Qwen3 235B at decent quant. Quality cliffs: Q6_K is ~102% baseline (effectively lossless), Q4_K_M is 92-95% — the jump from Q4 to Q6 is bigger than Q6 to Q8.

16-24GB VRAM (RTX 4090/5080)

24GB

Llama 3.3 8B, Qwen3 8B, Gemma 3 9B, Phi-4 14B at Q4-Q6. Qwen3 30B-A3B MoE squeezes in at Q4. FLUX.1 dev runs. HunyuanVideo 1.5 at 720p with fp8 + offload. The "good enough for one engineer" tier.

32GB VRAM (RTX 5090)

32GB

Qwen3 30B-A3B at Q4_K_XL with 147K context fully in VRAM at ~52 tok/s. Llama 4 Scout activated experts. FLUX.1 + ControlNet stacks. 70B class is out of reach single-card — needs offload or unified-memory rig. Prompt processing screams (10,400 tok/s on Qwen3 8B).

64-96GB unified (M4 Max / Strix Halo)

70B

Llama 3.3 70B and Qwen3 72B at Q4_K_M, ~6-10 tok/s decode. Llama 4 Scout (109B MoE, ~17B active) comfortably. Qwen3 30B-A3B at FP16. Strix Halo's 212 GB/s bandwidth caps decode below Apple — but 65W power envelope is unbeatable for always-on.

128GB unified (Framework Desktop / Mac Studio M4 Max)

Maverick

Llama 4 Maverick (400B MoE, 17B active) unquantized at 8-12 tok/s on M4 Max. Qwen3 235B-A22B at Q4 at ~5.5 tok/s. The "real frontier-class private LLM" tier opens here. 70B BF16 in a 65W mini-PC — the 2026 inflection point.

192GB+ unified (Mac Studio M3 Ultra 256GB)

Frontier

Qwen3 235B at Q8 at 28+ tok/s. Mistral Large 3 dense unquantized. DeepSeek R1 671B is out of reach on 256GB but ran at 17-18 tok/s on the now-discontinued 512GB SKU. 819 GB/s bandwidth is what makes this fast — not just the capacity.

Creator Workflows on Local

Image: FLUX.1 dev/pro on ComfyUI is the daily driver — 9s/image on RTX 5090, ~25s on RTX 4090. FP4 via TensorRT Model Optimizer on Blackwell is the new quality/speed Pareto. Video: HunyuanVideo 1.5 (8.3B, 14GB VRAM with offload) and Wan 2.2 (1.3B at 8GB, 14B at 24GB+) are the open leaders — 4090 is the sweet spot, H200 if you need reliable 1080p. Voice: XTTS-v2 (4-6GB VRAM, 17 languages, 6s reference audio) remains the local default but is non-commercial license; F5-TTS and Chatterbox have caught up for commercial use. Music: no Suno-class open model exists yet — closest are MusicGen and Stable Audio, both well behind commercial APIs. Run an LLM + Whisper + XTTS stack on a single 5090 with ~24GB headroom left for the LLM.

Key Findings

Mac Studio M3 Ultra runs DeepSeek-R1 671B (Q4) at 17-18 tok/s drawing under 200W — but only on the 512GB SKU Apple silently discontinued in early 2026; current top SKU is 256GB (MacRumors, MacSparky).

NVIDIA DGX Spark hit $4,699 in February 2026, delivers 1 PFLOP FP4 with 128GB unified memory and 273 GB/s bandwidth at 300W — CES 2026 software update added 2.5× speedup via TensorRT-LLM (NVIDIA Newsroom, Constellation Research).

Framework Desktop 128GB at $1,999 and GMKtec EVO-X2 128GB at $1,499 run Llama 70B at Q4_K_M in a 65-120W mini-PC envelope — measured 96-100 tok/s on Qwen3 30B-A3B MoE (Framework Community, Tom's Hardware).

RTX 5090 (32GB GDDR7, 1,792 GB/s) hits 10,400 tok/s prefill on Qwen3 8B and sustains 52 tok/s on Qwen3 30B-A3B at 147K context — but at 575W TGP with 900W transient spikes and 50 dBA under load (Hardware Corner, TechPowerUp).

MLX has overtaken llama.cpp on Apple Silicon for sub-27B models: 230 tok/s sustained on M5 Max with 4.06× faster time-to-first-token via M5 Neural Accelerators — Ollama switched to MLX as backend in 0.19 (yage.ai, Towards AI).

exo + RDMA-over-Thunderbolt-5 (macOS 26.2+) cuts inter-device latency from ms to µs, giving 4× Mac Studios linear-scaling Qwen3 235B at 28-32 tok/s on a $30K cluster — pooled 1.5TB unified memory (Jeff Geerling, Hardware Corner).

Dual RTX 5090 workstations need 2,800W PSU and 200-240V circuits — a standard 20A/120V wall outlet (2,400W) cannot power them. Single 5090 is fine on 20A but coil whine and 50 dBA noise are real workstation considerations (Puget Systems, Tom's Hardware).

Research Transparency

Limitations

•Benchmark numbers vary substantially with quantization, context length, and batch — most cited tok/s figures are decode-only on short prompts and will degrade at long context (Groundy's MLX analysis: 51 tok/s decode collapsed to 3 tok/s effective at 8.5K context once prefill counted).
•Apple's removal of the 512GB Mac Studio M3 Ultra SKU in early 2026 means anyone reading 512GB DeepSeek-R1 benchmarks from 2025 cannot reproduce the build new — only used or M4/M5 Ultra successors when released.
•Pricing on consumer Strix Halo mini-PCs (GMKtec, Beelink) swings $500-$1,000 on promotions; the $1,499 EVO-X2 figure is promotional, not steady-state.
•Used H100 SXM listings on eBay are frequently "qualified samples" or parts-only — the headline prices ($4,500-$20,000) often don't correspond to a buildable home rig once SXM baseboard, cooling, and power are accounted for.

What We Don't Know

?Whether Apple will reintroduce a >256GB unified-memory SKU on M4 Ultra or M5 Ultra Mac Studio in late 2026 — the 512GB pull was framed as a RAM-price-squeeze decision, not a roadmap one.
?How much of the MLX vs llama.cpp gap holds at >40K context once prefill is honestly measured — most published comparisons are decode-only short-prompt benchmarks.
?Whether a true Suno-class open-source music model will land in 2026; current options (MusicGen, Stable Audio) remain meaningfully behind commercial APIs on local hardware.

Evidence Grade:Grade A(Peer-reviewed / meta-analyses)

Frequently Asked Questions

128GB unified memory is the floor. Mac Studio M4 Max 128GB ($3,499) runs Maverick unquantized at 8-12 tok/s — no consumer NVIDIA GPU offers 128GB at any price. Strix Halo at 128GB works but at lower bandwidth (~212 GB/s vs Mac's 546-819 GB/s).