AI Coding Bake-off Ep. 5: Two Models Tied at 96 — With 5× Different Code (Issue #1124)

i.The assignment

A label that lied by omission.

In plain English — my bot sorts the market into regimes: trending up, trending down, choppy, quiet. Different regimes get different risk settings. One label was direction-blind: ranging_directional means "going sideways, but drifting" — and it never said which way. The trending labels already carry direction in their names; the ranging family didn't. Issue #1124 asked to fix that asymmetry: split the one blind label into ranging_directional_up and ranging_directional_down, and keep the bare label for the genuinely flat case. Sounds like a rename. It wasn't — that one label is read by about a dozen places across two languages, the live trading path and the backtester, in Go and Python, and they all have to agree in one pull request or the system splits its brain.

Read the full issue #1124 on GitHub ↗

ii.The landmine

One reader could silently lose its safety net.

Of those dozen readers, one is dangerous. An auto-protective exit — a "ratchet" that tightens your stop as a trade moves your way — looks up its settings by regime label. If the two new labels weren't explicitly wired in, they'd fall through to a generic bucket that has no ratchet ladder at all — and the protective exit would simply never arm. No error. No failed test. Just a money-guarding mechanism that quietly stops existing for trades in that regime. The issue called this out by name. So the real contest wasn't "can you split a label." It was "can you split a label without silently disarming a safety mechanism, across twelve readers in two languages, in one shot."

⚠️ The trap. The bug doesn't crash. It doesn't fail a test. It just removes a protection on real-money trades, quietly. A green checkmark proves nothing here — the only thing that proves it is reading the code and routing the new labels to the real ladder.

iii.The contestants

Anthropic · Claude Code · high

iv.How they were judged

Three blind reviews. Both suites re-run.

One independent, blind reviewer per PR — each sealed off from the other two branches so nobody anchored on another's work. Each checked out the branch, enumerated the acceptance criteria, verified every claim against the actual code with file-and-line citations, and re-ran the Go and Python suites from scratch instead of trusting the green checkmarks in the description.

Read the issue. Enumerate the acceptance criteria and the named safety landmine.

Check out the PR at HEAD. Verify every claim against the actual code, with file:line citations.

Re-run both suites. Go scheduler tests + touched Python suites, locally, from scratch.

Score /100 against the criteria. Report every miss, cited.

⚙️ A note on the judge. The reviewers all ran on a single model — an Anthropic model (Opus 4.8) — and one of the contestants is also Opus 4.8. Hold that thought. We'll come back to it at the bottom, next to the result it produced — because the result is what defuses it.

v.The highest floor yet

Everyone cleared the bar — completely.

This was not a "spot the broken build" episode. All three avoided the landmine: each routed the two new labels to the real ratchet ladder and shipped a test that fails if the protective exit ever returns empty. All three moved every reader in lockstep across Go and Python, kept the bare label valid for back-compat, and shipped genuinely green suites my reviewers re-ran themselves. Nobody broke the build, nobody disarmed the exit, nobody faked a test. So the contest came down to how — and the stats below describe that how. They are flavor, never a ranking factor.

Model	Diff	Files	First-pass time	First-pass cost (native unit)
Opus 4.8	+161 / −21	15	~24 min (slowest)	~240k peak context · 24% of 1M window
GLM 5.2	+848 / −56	24	~13 min	~25.5M tokens · ~15% of a Cursor Pro window
GPT 5.5	+573 / −67	29 (most)	~10 min (fastest)	~11% of a 5-hour usage window

📐 Side note — code volume and speed are not scored. One contestant did the whole job in a fifth of another's diff. The fastest was the briefest first-pass time. Hold both numbers in your head — the scores below ignore them on purpose, and that's exactly what makes the result strange.

The reckoning

So who
actually won?

A green build tells you nothing about how a safety-critical split was made. All three were scored against the issue's acceptance criteria — every claim verified against the code at HEAD, both suites re-run. The gaps are tiny. The story is in them.

Counting down, from third place…

— Third place —

GPT 5.5 A belt-and-suspenders take: it added the new labels as first-class labels and an alias fallback in every consumer — more robust than the minimum — and placed a clear comment naming the exact never-arm risk right at the dangerous spot. It's correct and guarded. Two small things kept it a hair back: it touches the most files (29) because it also threaded the change through a higher-timeframe classifier — where it then quietly collapses the split back to a single direction, a defensible choice that goes undocumented. And its PR description slightly undersells the work, calling the new labels "aliases" when the producer emits them for real.

95/100

✦Before the result — three things that shouldn't both be true

⚖️

Same score, 5× the code

161 vs 848 lines

Two PRs reached the identical top score with wildly different work — 161 lines by reusing what was there, 848 lines by systematically covering every case. Both legitimate. A clean reminder: more code is not more correctness.

⏱️

The fastest came last

Speed ≠ rank

The fastest model (GPT 5.5, ~10 min) finished last. The slowest (Opus, ~24 min) tied for first — while writing the least code. Whatever those extra minutes bought, it wasn't padding.

🔍

One audited the spec itself

Opus 4.8

The only entry to catch that the issue's own safety analysis was wrong: the unhandled-label ATR path drifts to a looser value (2.0), it does not never-arm — only the ratchet ladder never-arms. It fixed the drift anyway and documented the correction.

🏆

— First place —

It's a tie.

You scrolled the whole way down expecting one winner. There are two — 96 each, separated by nothing.

Opus 4.8

Anthropic · Claude Code

PR #1126 ↗

96 /100

Minimal, reuse-first. Routed the new labels through the existing ladder, ATR defaults, and family fallback — 161 lines across 15 files. The only entry to catch and document the issue's own ATR-never-arm error.

tied with

GLM 5.2

Z.ai · Cursor

PR #1125 ↗

96 /100

Most systematic. Installed one uniform "directional-family" rule across every exhaustiveness check in both languages — the most thorough, most-tested treatment. The price: 848 lines across 24 files.

Same correctness. Same complete acceptance-criteria coverage. Same avoided landmine. One reached it by reusing the machinery that already existed; the other by covering every case from scratch — and they landed in exactly the same place. GPT 5.5 sits one point back at 95, on reproducible facts: the most files touched, an undocumented HTF collapse, a description that undersells. The gap to third is soft; the dead heat at the top is real.

⚖️ Conflict of interest — read before trusting the result The judge is an Anthropic model (Opus 4.8). One co-winner is Anthropic too (Opus 4.8). Here's why it doesn't get to decide anything this time: it's a tie, and the co-winner sitting in exactly the same spot is Z.ai's GLM 5.2 — not Anthropic. There's no Anthropic-shaped thumb on a scale that lands two different families at the identical 96. As extra mitigation, the Opus PR's reviewer was told to apply a stricter bar and give no benefit of the doubt — it still came back at 96, having independently verified the ATR catch against the code. What doesn't depend on taste: either all three routed the new labels to the real ladder or they didn't; either Opus's ATR correction is true or it isn't. Check out the branches and verify.

The headline isn't who won — it's that two models reached an identical score with five times the difference in code, the fastest model came last, and the best work was the one that questioned the assignment.

✦Summary

All three, at a glance.

Anthropic · Claude Code

Opus 4.8

Co-winner

Score

96 /100

First-pass time

~24 min

Cost (native)

~240k ctx · 24% of 1M

Verdict

Minimal, reuse-first · caught the issue's own ATR error · +161 / −21 across 15 files

Z.ai · Cursor

GLM 5.2

Co-winner

Score

96 /100

First-pass time

~13 min

Cost (native)

~25.5M tokens · ~15%

Verdict

Uniform "directional-family" rule everywhere · most thorough · +848 / −56 across 24 files

OpenAI · Codex

GPT 5.5

Score

95 /100

First-pass time

~10 min

Cost (native)

~11% of 5-hr window

Verdict

Belt-and-suspenders labels + aliases · most files · undocumented HTF collapse · +573 / −67 across 29 files

✦Epilogue — what shipped

Both of them shipped.

I didn't merge a single contestant — I shipped a synthesis, and it's built from both co-winners at once. The merge candidate starts with Opus's minimal 161-line base and grafts GLM's one-directional "directional-family rule" on top: a plain ranging_directional configured anywhere automatically covers its up and down variants, everywhere, for good. That single rule closes the two small things reviewers flagged on the Opus PR — including one on the live entry path where a config could have quietly stopped taking trades.

Every relaxation that keeps an old config working is paired with a runtime fallback so a protective stop can never silently go missing — the exact mistake GLM's own first pass made and caught. Green in both languages. It's open as the pull request that closes the issue. So the bake-off didn't just pick a winner; it picked two, and the thing that ships is literally both of them — about as honest as a tie gets.

Synthesis PR #1130 · Opus base + GLM family rule ↗ Issue #1124 ↗

A label that lied by omission.

One reader could silently lose its safety net.

Opus 4.8

GLM 5.2

GPT 5.5

Three blind reviews. Both suites re-run.

Everyone cleared the bar — completely.

So whoactually won?

Same score, 5× the code

The fastest came last

One audited the spec itself

All three, at a glance.

Both of them shipped.

So who
actually won?