The Model Blindfold
Three models. Three coding challenges. You pick the best response before seeing who wrote it.
// Released Feb 17, 2026
Claude Sonnet 4.6
Anthropic's new flagship coding model. Same price as Sonnet 4.5, better at almost everything. The headline feature is a 1 million token context window (your entire codebase, at once). But the coding improvements are what matter day-to-day.
· Context window: 1M tokens (beta, entire codebase at once)
· Price vs Sonnet 4.5: unchanged at $3 input / $15 output per 1M tokens
· Preferred over 4.5: 70% in Claude Code user testing
· vs. Opus 4.5: 59% of users preferred Sonnet 4.6
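For the curious, here's a minimal sketch of what the 1M-token window looks like through the Anthropic Python SDK. The model ID and the beta flag are assumptions (the flag shown is the one Anthropic published for the earlier Sonnet long-context beta); verify both against the current docs before relying on either.

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # ASSUMPTIONS: the model ID "claude-sonnet-4-6" and the beta flag
    # "context-1m-2025-08-07" (from the earlier Sonnet long-context beta)
    # are guesses -- check Anthropic's docs for the real values.
    response = client.beta.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        betas=["context-1m-2025-08-07"],
        messages=[{
            "role": "user",
            # At the listed $3 per 1M input tokens, filling the whole
            # window costs roughly $3 of input per request.
            "content": "Here is my repo:\n...\nTrace the bug in the auth flow.",
        }],
    )
    print(response.content[0].text)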
// What changed in coding
· Reads full context before touching anything
· Less overengineering: fixes only what needs fixing
· Fewer false success claims and hallucinations
· Better multi-step task follow-through
// Interesting from the press release
💰 VendingBench Arena
Anthropic benchmarks models on running a simulated vending machine business over 12 months. Sonnet 4.6 developed a novel strategy: invest heavily in capacity for the first 10 months, then pivot hard to profitability. It beat competitors who optimised for short-term gains. Long-horizon business planning, not just code.
📄 Enterprise document reading
Matches Opus 4.6 on OfficeQA: reading charts, PDFs, and tables, then reasoning from them. 94% accuracy on insurance benchmark workflows (submission intake, first notice of loss).
🖥️ Computer use
Major improvement over prior Sonnet models. Human-level on spreadsheet navigation and multi-step web forms. Better resistance to prompt injection. Also: noticeably more polished frontend/design output.
// The experiment
Everyone's posting benchmarks. Here's something more honest: three anonymous models answer the same coding challenges. You read the responses blind and pick the one you'd trust. Then the reveal.
No trick questions. Real code, real answers, your own judgment.
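To make the mechanic concrete, here's a hypothetical sketch of a blind-pick harness (not the site's actual code, and the answers are placeholders): one response is collected per model, the set is shuffled under neutral labels, and the authors are revealed only after you choose.

    import random

    # Hypothetical harness, not the site's implementation: collect one
    # answer per model, shuffle, present under neutral labels, reveal last.
    responses = {
        "model-one":   "def dedupe(xs): return list(dict.fromkeys(xs))",
        "model-two":   "def dedupe(xs): return list(set(xs))",
        "model-three": "def dedupe(xs): return sorted(set(xs), key=xs.index)",
    }

    entries = list(responses.items())
    random.shuffle(entries)                # detach answers from author names
    labels = dict(zip("ABC", entries))

    for label, (_, answer) in labels.items():
        print(f"--- Response {label} ---\n{answer}\n")

    pick = ""
    while pick not in labels:              # keep asking until A, B, or C
        pick = input("Which response would you trust? (A/B/C) ").strip().upper()

    author, _ = labels[pick]
    print(f"Response {pick} was written by: {author}")  # the reveal

The shuffle-before-label step is the whole trick: nothing in the presentation can leak which model wrote which answer.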
3 challenges
3 models
0 marketing