The complete tutorial for head-to-head model evals inside Claude Code: per-spawn model overrides, ground truth before dispatch, self-verifying tasks, blind judging, and JSON receipts. The exact harness behind our Fable 5 vs Opus 4.8 rounds.
Make evals a routing reflex, not a platform project. The Claude Code Agent tool's per-spawn model override gives you a zero-infrastructure harness for head-to-head rounds — fix ground truth before dispatch, verify mechanically, judge blind, write a receipt. Adopt promptfoo only for prompt regression, Langfuse only once an app serves real users, and skip LangSmith entirely unless you already live in LangChain.
AI CoE pillar: Technology · evaluation discipline + Governance · evidence standards
TL;DR: You do not need an eval platform to know which model to route where. The Claude Code Agent tool accepts a per-spawn model override, which makes the CLI itself a head-to-head eval harness: dispatch the same task to two models as parallel subagents, verify objective tasks with asserts you wrote before dispatch, judge subjective tasks with a blind non-contestant model under shuffled labels, and write a JSON receipt. This is the exact loop behind our four receipted Fable 5 vs Opus 4.8 rounds, run within 24 hours of Fable 5's release. Full walkthrough below, including the task-design rules and the three caveats that never leave a receipt.
Vendor benchmarks answer the vendor's question on the vendor's tasks. Your routing decision — which model runs your coding agents, your review gates, your fan-out workers — depends on behavior under your constraints: your output contracts, your repo, your governance rules. When Claude Fable 5 launched, its model card said nothing about whether it would respect a two-line output contract under load, or flag a governance-gated edit instead of executing it. Two afternoons of harness time answered both. A leaderboard you can't audit is marketing with decimals; an eval you ran yourself is a routing decision with receipts.
One capability makes the whole pattern possible: when Claude Code spawns a subagent through the Agent tool, you can pin that subagent to a specific model (fable, opus, sonnet, haiku). The orchestrating session becomes the harness; the subagents become contestants.
The loop has six steps:
Four to six tasks, one per capability axis you actually route on: reasoning under output constraints, coding with shipped asserts, repo-grounded or agentic tool use, constraint-stacked writing. Two design rules carry most of the integrity: prefer self-verifying tasks (asserts, known answers) and keep judged tasks to half the card or less.
Compute the answers yourself, first — script them if needed, write them down. Never derive truth from a contestant's output; the moment you do, the eval grades itself. For our Round 3, the harness computed the reasoning answer with a five-line script and counted the repo facts live before either contestant saw the prompt.
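One way to make "before dispatch" checkable is to fingerprint the expected answers as soon as they are computed, so any later adjustment shows up as a hash mismatch. The sketch below is a minimal illustration, not the harness from our rounds: the function names, the `reasoning_task` key, and the known-answer computation are all hypothetical stand-ins.

```python
import hashlib
import json


def _fingerprint(answers: dict) -> str:
    # Deterministic serialization so the same answers always hash the same.
    payload = json.dumps(answers, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def freeze_ground_truth(answers: dict) -> dict:
    """Record expected answers with a fingerprint, before any contestant runs."""
    return {"answers": answers, "sha256": _fingerprint(answers)}


def is_unmodified(frozen: dict) -> bool:
    """True if the answers still match the fingerprint taken at freeze time."""
    return _fingerprint(frozen["answers"]) == frozen["sha256"]


# Hypothetical known-answer task: the harness computes the truth itself.
expected = {
    "reasoning_task": sum(n for n in range(1, 1000) if n % 3 == 0 or n % 5 == 0)
}
frozen = freeze_ground_truth(expected)
```

Storing the fingerprint in the round's receipt lets a reader confirm that the answers graded against are the answers fixed up front.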
Send the same prompt to each contestant in one parallel block, one model override each. Tell contestants their final message is raw harness data, not user-facing prose — instruction compliance is part of what you're measuring. Cap concurrency to your machine's capacity.
Re-run the contestants' test suites yourself. Grep for banned patterns. Count the words. Check the format against the contract character by character. Anything a script can check, a script should check — mechanical verification is immune to the judge biases that plague LLM-graded evals.
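The checks in this step are simple enough to script in a few lines. The following is a sketch under stated assumptions: the two-line `ANSWER` / `CONFIDENCE` contract, the banned patterns, and the word limit are invented for illustration and are not the contracts used in our rounds.

```python
import re


def verify_output(text, max_words, banned_patterns, contract_regex):
    """Mechanically check one contestant output against its hard constraints."""
    word_count = len(text.split())
    banned_hits = [p for p in banned_patterns if re.search(p, text, re.IGNORECASE)]
    checks = {
        "within_word_limit": word_count <= max_words,
        "no_banned_patterns": not banned_hits,
        # fullmatch: the entire output must fit the contract, not just part of it.
        "matches_contract": re.fullmatch(contract_regex, text) is not None,
    }
    return {
        "word_count": word_count,
        "banned_hits": banned_hits,
        "checks": checks,
        "passed": all(checks.values()),
    }


# Hypothetical two-line output contract and banned phrases.
CONTRACT = r"ANSWER: \d+\nCONFIDENCE: (low|medium|high)"
BANNED = [r"\bas an ai\b", r"\btodo\b"]

good = verify_output("ANSWER: 42\nCONFIDENCE: high", 10, BANNED, CONTRACT)
bad = verify_output("Sure! ANSWER: 42\nCONFIDENCE: high", 10, BANNED, CONTRACT)
```

Here `good` passes every check, while `bad` fails only the contract check: the friendly preamble breaks the format even though the answer is present.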
For taste tasks (voice, code craft), use a non-contestant model as judge, shuffle which output is "A" and which is "B" per task, record the assignment, and never show the judge a model name. Crucially: the harness enforces hard constraints separately, so the judge's preference can't launder a violation — a beautiful answer over the word limit still fails the word limit.
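The shuffle, the recorded assignment, and the separation of taste from compliance can be sketched as below. This is an illustrative sketch rather than our harness code: the function names and the `contestant-1` / `contestant-2` placeholders are hypothetical, and the call to the judge model itself is left out.

```python
import random


def blind_pair(outputs, rng):
    """Shuffle two contestant outputs behind the labels A and B.

    outputs maps model name -> output text. Returns (judge_view, assignment):
    judge_view is all the judge may see; assignment goes into the receipt.
    """
    if len(outputs) != 2:
        raise ValueError("expected exactly two contestants")
    models = list(outputs)
    rng.shuffle(models)
    assignment = dict(zip(("A", "B"), models))
    judge_view = {label: outputs[model] for label, model in assignment.items()}
    return judge_view, assignment


def score_judged_task(preferred_label, assignment, constraints_ok):
    """Map the judge's label back to a model; a constraint failure blocks the win."""
    preferred_model = assignment[preferred_label]
    return {
        model: {
            "judge_preferred": model == preferred_model,
            "constraints_ok": constraints_ok[model],
            "wins": model == preferred_model and constraints_ok[model],
        }
        for model in assignment.values()
    }


outputs = {"contestant-1": "First draft text.", "contestant-2": "Second draft text."}
judge_view, assignment = blind_pair(outputs, random.Random())
```

Calling `blind_pair` once per task gives each task its own label order, and `constraints_ok` comes from the scripted checks, never from the judge.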
One JSON file per round: contestants, judge and label assignments, per-task results with attempts and durations, the tally, and the caveats. Publish it. The receipt is what separates an eval from an anecdote — and it's what lets a claim survive the question "says who?"
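A receipt with those fields might be assembled as follows. The schema is an assumption made for illustration: the key names, the placeholder contestants, and every number in the sample tasks are invented, and the published receipts may be laid out differently.

```python
import json


def build_receipt(round_id, contestants, judge, tasks, caveats):
    """Assemble one round's receipt; the tally is derived from per-task winners."""
    tally = {name: 0 for name in contestants}
    for task in tasks:
        if task["winner"] is not None:  # None marks a tie or a void task
            tally[task["winner"]] += 1
    return {
        "round": round_id,
        "contestants": contestants,
        "judge": judge,
        "tasks": tasks,
        "tally": tally,
        "caveats": caveats,
    }


# Illustrative data only; none of these values come from a real round.
tasks = [
    {
        "id": "coding-asserts",
        "verification": "mechanical",
        "label_assignment": None,
        "results": {
            "contestant-1": {"passed": True, "attempts": 1, "duration_s": 41.0},
            "contestant-2": {"passed": False, "attempts": 2, "duration_s": 58.5},
        },
        "winner": "contestant-1",
    },
    {
        "id": "constrained-writing",
        "verification": "blind-judge",
        "label_assignment": {"A": "contestant-2", "B": "contestant-1"},
        "results": {
            "contestant-1": {"passed": True, "attempts": 1, "duration_s": 12.0},
            "contestant-2": {"passed": True, "attempts": 1, "duration_s": 15.5},
        },
        "winner": "contestant-2",
    },
]

receipt = build_receipt(
    round_id="example-round",
    contestants=["contestant-1", "contestant-2"],
    judge={"model": "judge-model"},
    tasks=tasks,
    caveats=["Illustrative caveat: one run per task."],
)
receipt_json = json.dumps(receipt, indent=2)
```

Writing `receipt_json` to one file per round keeps the tally traceable to the per-task results it was computed from.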
The complete harness doctrine, task-design rules, and every receipt from our rounds are open source in the arena repo.
Things no model card mentions: Opus 4.8 answering a hard reasoning task confidently wrong in 2.7 seconds while Fable 5 solved it; Fable 5 silently executing a governance-gated edit that Opus 4.8 flagged; a blind style verdict that flipped between rounds (which is exactly why single-judge n=1 style scores should never drive routing); and every model's output discipline degrading under heavy task load, the finding that moved "enforce contracts structurally" from preference to doctrine. Four rounds, two afternoons, zero infrastructure.
| Need | Use | Skip |
|---|---|---|
| Head-to-head model rounds | This harness — Agent overrides, native to Claude Code | Standing up eval servers |
| Prompt/pattern regression | promptfoo — declarative YAML, local, free, colocated with prompts | — |
| Runtime tracing of a live app | Langfuse — once real users exist; tracing is a production concern | Tracing infra for benchmarks |
| — | — | LangSmith as an eval layer: hosted and paid where promptfoo is local and free, unless you already live in LangChain |
Use the Agent tool to spawn parallel subagents with different model overrides, give both the same task prompt, verify objective results with asserts you fixed before dispatch, and judge subjective outputs with a blind non-contestant model under shuffled labels. Record everything in a JSON receipt.
You do not need LangSmith or Langfuse for model comparison. LangSmith adds a hosted, paid framework dependency and offers nothing this harness lacks; Langfuse is runtime tracing, valuable once an app serves real users and irrelevant for benchmarking. For prompt regression testing, promptfoo (local, free, declarative) covers it.
Three layers guard against judge bias: prefer mechanical verification so most tasks need no judge; use a non-contestant model from outside the matchup where possible; and always shuffle A/B labels per task while enforcing hard constraints in script, so the judge scores taste, never compliance.
Ground truth before dispatch is the integrity rule that makes a homemade eval trustworthy: compute correct answers before any contestant runs, and never adjust them afterward. If the harness derives truth from a contestant's output, the eval silently becomes self-grading.
Our Round 3 — four tasks, two models, mechanical verification, one blind judgment, receipt written — took under an hour of wall-clock time inside a normal Claude Code session. New model day can be eval day.
By Frank — AI Architect at Oracle's EMEA AI Center of Excellence. The harness, task-design rules, and all four Fable 5 vs Opus 4.8 receipts are open source: methodology · receipts · live results.