// EXPERIMENTS
🕶️

The Model Blindfold

Three models. Three coding challenges. You pick the best response before seeing who wrote it.

// Released Feb 17, 2026

Claude Sonnet 4.6

Anthropic's new flagship coding model. Same price as Sonnet 4.5, and better at almost everything. The headline feature is a 1 million token context window (your entire codebase, at once). But the coding improvements are what matter day-to-day.

Context window

1M tokens

beta: entire codebase at once

Price vs Sonnet 4.5

unchanged

$3 / $15 per 1M tokens

Preferred over 4.5

70%

in Claude Code user testing

vs. Opus 4.5

59%

users preferred Sonnet 4.6
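To put the pricing in concrete terms, here is a minimal cost sketch. It assumes "$3 / $15" means USD per 1M input tokens and per 1M output tokens respectively, and that the flat rate applies to the whole request; the function name and the example token counts are hypothetical.

```python
# Assumption: "$3 / $15 per 1M tokens" = input rate / output rate, in USD.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD at the flat listed rates."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical request: 200,000 tokens of code in, 4,000 tokens of answer out.
print(f"${estimate_cost_usd(200_000, 4_000):.2f}")  # prints $0.66
```

Input tokens dominate the bill when you send a large codebase, so actual costs for long-context requests may differ from this flat-rate estimate.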

// What changed in coding

· Reads full context before touching anything

· Less overengineering: fixes only what needs fixing

· Fewer false success claims and hallucinations

· Better multi-step task follow-through

// Interesting from the PR

🎰 VendingBench Arena

Anthropic benchmarks models on running a simulated vending machine business over 12 months. Sonnet 4.6 developed a novel strategy: invest heavily in capacity for the first 10 months, then pivot hard to profitability. It beat competitors who optimised for short-term gains. Long-horizon business planning, not just code.

📊 Enterprise document reading

Matches Opus 4.6 on OfficeQA: reading charts, PDFs, and tables, then reasoning from them. 94% accuracy on insurance benchmark workflows (submission intake, first notice of loss).

🖥️ Computer use

Major improvement over prior Sonnet models. Human-level on spreadsheet navigation and multi-step web forms. Better resistance to prompt injection. Also: noticeably more polished frontend/design output.

// The experiment

Everyone's posting benchmarks. Here's something more honest: three anonymous models answer the same code questions. You read the responses blind and pick the one you'd trust. Then the reveal.

No trick questions. Real code, real answers, your own judgment.
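The blinding step can be sketched in a few lines. This is a minimal illustration, not the code behind this page: the `blind` helper, the placeholder model names, and the placeholder responses are all hypothetical.

```python
import random


def blind(responses: dict[str, str], rng: random.Random):
    """Shuffle model responses and relabel them A, B, C, ...

    Returns (shown, key): `shown` is a list of (label, response) pairs with
    no model names, `key` maps each label back to its model for the reveal.
    """
    models = list(responses)
    rng.shuffle(models)  # random order so position gives nothing away
    labels = [chr(ord("A") + i) for i in range(len(models))]
    shown = [(label, responses[model]) for label, model in zip(labels, models)]
    key = dict(zip(labels, models))
    return shown, key


# Placeholder data: real responses would come from each model's API.
responses = {
    "model-1": "response text 1",
    "model-2": "response text 2",
    "model-3": "response text 3",
}
shown, key = blind(responses, random.Random())

for label, text in shown:
    print(f"Response {label}: {text}")
# The reader picks a label first; only then is `key` revealed.
```

Keeping the label-to-model key separate from what is displayed is the whole trick: the reader's choice is recorded against a neutral label, and the mapping is looked up afterwards.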

3 challenges

3 models

0 marketing

by Pawel Jozefiak

More on AI, experiments & building things

Read Digital Thoughts →