Benchmark · same prompt, 3 agents

I raced 3 AI agents to build a solar system.

One prompt: a stunning interactive solar system, single self-contained index.html, no libraries. Claude Code, GLM-5.2 and MiniMax M3, recorded and played back side by side. Watch the 40-second race, then read what actually happened.

Each agent's terminal on the left, the solar system it actually built on the right. Real runs — 65s, 327s, 1200s — compressed to ~40 seconds. MiniMax wrote the most code and rendered nothing.

How each one did

Claude Code

65s

Opus · claude -p

Fastest, and the best-looking design of the three — undone by one missing line.

It built the starfield in a helper that never returned the array, so the page crashed on the first frame: a blank screen. A single `return` fixes it, and then it's gorgeous.
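A minimal sketch of that failure mode, for anyone who hasn't hit it before. This is a hypothetical reconstruction, not Claude's actual output: the function names, the star fields and the counts are all assumptions made for illustration.

```javascript
// Hypothetical reconstruction of the bug; names and values are illustrative.

// Broken: builds the array but never hands it back to the caller.
function buildStarfieldBroken(count, width, height) {
  const stars = [];
  for (let i = 0; i < count; i++) {
    stars.push({ x: Math.random() * width, y: Math.random() * height });
  }
  // Missing `return stars;`, so the function returns undefined.
}

// Fixed: identical, plus the one line.
function buildStarfield(count, width, height) {
  const stars = [];
  for (let i = 0; i < count; i++) {
    stars.push({ x: Math.random() * width, y: Math.random() * height });
  }
  return stars;
}

// Stand-in for the first frame of a draw loop: it iterates the stars.
// With `undefined` the for...of throws a TypeError before anything is drawn.
function countDrawable(stars) {
  let drawn = 0;
  for (const star of stars) {
    drawn++;
  }
  return drawn;
}
```

Calling `countDrawable(buildStarfieldBroken(200, 1280, 720))` throws on the first iteration, which in a browser stops the animation loop and leaves the canvas blank. The fixed helper returns a 200-element array and the loop runs normally.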

GLM-5.2

327s

Synthetic · pi

The only one that just worked. Correct on the first try, and it looked great.

Five and a half minutes, no drama, no bug: just a finished, working solar system. Not the flashiest run. The quiet winner.

MiniMax M3

1200s

Synthetic · pi

Wrote the most code by far — 1,000+ lines — and ran all the way to the 20-minute cap.

Thorough to a fault for a one-file sketch, and in my testing it rendered blank anyway. Most code, slowest, least to show for it.

Side-by-side results for Claude, GLM-5.2 and MiniMax M3: what each shipped. Claude's tile is after the one-line fix; MiniMax never rendered.

My take after 20 years of shipping production systems: stop optimizing for the leaderboard. The model that tops a benchmark and the one you can hand real work to are rarely the same model. The only score I trust is simple — did it do the job without me babysitting it? This week the boring, correct one won. It usually does.

I run little races like this most weeks. The club is where they get shared first.

Join the waitlist →