A label that lied by omission.
In plain English, my bot sorts the market into regimes: trending up, trending down, choppy, quiet. Different regimes get different risk settings. One label was direction-blind: ranging_directional means "going sideways, but drifting", and it never said which way. The trending labels already carry direction in their names; the ranging family didn't. Issue #1124 asked to fix that asymmetry: split the one blind label into ranging_directional_up and ranging_directional_down, and keep the bare label for the genuinely flat case. Sounds like a rename. It wasn't. That one label is read in about a dozen places across Go and Python, in both the live trading path and the backtester, and they all have to agree in one pull request or the system splits its brain.
One reader could silently lose its safety net.
Of those dozen readers, one is dangerous. An auto-protective exit — a "ratchet" that tightens your stop as a trade moves your way — looks up its settings by regime label. If the two new labels weren't explicitly wired in, they'd fall through to a generic bucket that has no ratchet ladder at all — and the protective exit would simply never arm. No error. No failed test. Just a money-guarding mechanism that quietly stops existing for trades in that regime. The issue called this out by name. So the real contest wasn't "can you split a label." It was "can you split a label without silently disarming a safety mechanism, across twelve readers in two languages, in one shot."
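To make the failure mode concrete, here is a minimal Python sketch of a label-keyed lookup with a generic fallback bucket. Everything in it is an assumption for illustration: the names `LADDERS`, `ladder_for`, and `ratchet_stop`, the rung values, and the idea of measuring profit in risk multiples (R) are not taken from the real codebase. Only the three regime labels come from the issue.

```python
from typing import Dict, List, Optional, Tuple

# Each rung is (profit threshold in R, stop level in R to tighten to).
# The numbers are illustrative, not the bot's real settings.
Ladder = List[Tuple[float, float]]

LADDERS: Dict[str, Ladder] = {
    "ranging_directional": [(1.0, 0.0), (2.0, 1.0)],
    "default": [],  # hypothetical generic bucket: no ratchet ladder at all
}


def ladder_for(label: str, table: Dict[str, Ladder]) -> Ladder:
    # An unknown label falls through to the generic bucket without any error.
    return table.get(label, table["default"])


def ratchet_stop(label: str, profit_r: float, table: Dict[str, Ladder]) -> Optional[float]:
    """Return the tightened stop for the current profit, or None if the ratchet never arms."""
    stop = None
    for threshold, new_stop in ladder_for(label, table):
        if profit_r >= threshold:
            stop = new_stop
    return stop


# Before wiring: the new label is unknown, so the exit silently never arms.
unwired = ratchet_stop("ranging_directional_up", 2.5, LADDERS)

# After wiring: both new labels are routed to the existing ladder.
wired_table = dict(LADDERS)
for variant in ("ranging_directional_up", "ranging_directional_down"):
    wired_table[variant] = LADDERS["ranging_directional"]
wired = ratchet_stop("ranging_directional_up", 2.5, wired_table)

print(unwired, wired)  # None 1.0
```

The unwired call raises nothing and logs nothing. It just returns `None`, which is why this kind of gap survives a green build.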
Three blind reviews. Both suites re-run.
One independent, blind reviewer per PR — each sealed off from the other two branches so nobody anchored on another's work. Each checked out the branch, enumerated the acceptance criteria, verified every claim against the actual code with file-and-line citations, and re-ran the Go and Python suites from scratch instead of trusting the green checkmarks in the description.
Read the issue. Enumerate the acceptance criteria and the named safety landmine.
Check out the PR at HEAD. Verify every claim against the actual code, with file:line citations.
Re-run both suites. Go scheduler tests + touched Python suites, locally, from scratch.
Score /100 against the criteria. Report every miss, cited.
Everyone cleared the bar — completely.
This was not a "spot the broken build" episode. All three avoided the landmine: each routed the two new labels to the real ratchet ladder and shipped a test that fails if the protective exit ever returns empty. All three moved every reader in lockstep across Go and Python, kept the bare label valid for back-compat, and shipped genuinely green suites my reviewers re-ran themselves. Nobody broke the build, nobody disarmed the exit, nobody faked a test. So the contest came down to how — and the stats below describe that how. They are flavor, never a ranking factor.
| Model | Diff | Files | First-pass time | First-pass cost (native unit) |
|---|---|---|---|---|
| Opus 4.8 | +161 / −21 | 15 | ~24 min (slowest) | ~240k peak context · 24% of 1M window |
| GLM 5.2 | +848 / −56 | 24 | ~13 min | ~25.5M tokens · ~15% of a Cursor Pro window |
| GPT 5.5 | +573 / −67 | 29 (most) | ~10 min (fastest) | ~11% of a 5-hour usage window |
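The guard all three entries shipped, a test that fails if the protective exit ever comes back empty, might look roughly like the sketch below. It is not any contestant's actual test. The helper names `labels_without_ladder` and `check_ratchet_armed` and the shape of the ladder table are assumptions; only the three labels are from the issue.

```python
from typing import Dict, List, Sequence, Tuple

Ladder = List[Tuple[float, float]]

# The three labels named in the issue; a real registry would list every regime.
DIRECTIONAL_LABELS = (
    "ranging_directional",
    "ranging_directional_up",
    "ranging_directional_down",
)


def labels_without_ladder(
    ladders: Dict[str, Ladder],
    labels: Sequence[str] = DIRECTIONAL_LABELS,
) -> List[str]:
    """Return every label whose ratchet ladder is missing or empty."""
    return [label for label in labels if not ladders.get(label)]


def check_ratchet_armed(ladders: Dict[str, Ladder]) -> None:
    """Fail loudly if any directional label would leave the ratchet unarmed."""
    missing = labels_without_ladder(ladders)
    assert not missing, f"ratchet would never arm for: {missing}"
```

Calling `check_ratchet_armed` on a table that only knows the bare label raises an `AssertionError` naming both new variants, which turns the silent fall-through into a failed test.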
So who actually won?
A green build tells you nothing about how a safety-critical split was made. All three were scored against the issue's acceptance criteria — every claim verified against the code at HEAD, both suites re-run. The gaps are tiny. The story is in them.
Counting down, from third place…
Same score, 5× the code
Two PRs reached the identical top score with wildly different work — 161 lines by reusing what was there, 848 lines by systematically covering every case. Both legitimate. A clean reminder: more code is not more correctness.
The fastest came last
The fastest model (GPT 5.5, ~10 min) finished last. The slowest (Opus, ~24 min) tied for first — while writing the least code. Whatever those extra minutes bought, it wasn't padding.
One audited the spec itself
Opus 4.8 was the only entry to catch that the issue's own safety analysis was wrong: for an unhandled label, the ATR path drifts to a looser value (2.0) rather than never arming; only the ratchet ladder fails to arm. It fixed the drift anyway and documented the correction.
You scrolled the whole way down expecting one winner. There are two — 96 each, separated by nothing.
Opus 4.8: minimal, reuse-first. Routed the new labels through the existing ladder, ATR defaults, and family fallback in 161 lines across 15 files. The only entry to catch and document the issue's own ATR-never-arm error.
GLM 5.2: most systematic. Installed one uniform "directional-family" rule across every exhaustiveness check in both languages, the most thorough and most-tested treatment. The price: 848 lines across 24 files.
Same correctness. Same complete acceptance-criteria coverage. Same avoided landmine. One reached it by reusing the machinery that already existed; the other by covering every case from scratch — and they landed in exactly the same place. GPT 5.5 sits one point back at 95, on reproducible facts: the most files touched, an undocumented HTF collapse, a description that undersells. The gap to third is soft; the dead heat at the top is real.
The headline isn't who won. It's that two models reached an identical score with a fivefold difference in code size, the fastest model came last, and the best work was the one that questioned the assignment.
Both of them shipped.
I didn't merge any one contestant's PR. I shipped a synthesis built from both co-winners. The merge candidate starts from Opus's minimal 161-line base and grafts GLM's one-directional "directional-family rule" on top: a plain ranging_directional configured anywhere automatically covers its up and down variants, everywhere, for good. That single rule closes the two small issues reviewers flagged on the Opus PR, including one on the live entry path where a config could have quietly stopped taking trades.
Every relaxation that keeps an old config working is paired with a runtime fallback so a protective stop can never silently go missing — the exact mistake GLM's own first pass made and caught. Green in both languages. It's open as the pull request that closes the issue. So the bake-off didn't just pick a winner; it picked two, and the thing that ships is literally both of them — about as honest as a tie gets.
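As a rough illustration of the directional-family rule paired with a runtime fallback, here is a Python sketch under stated assumptions. The function `resolve`, the `atr_mult` key, and all numeric values are hypothetical and not taken from the merged PR; only the three labels are real.

```python
from typing import Dict

FAMILY = "ranging_directional"
VARIANTS = (FAMILY + "_up", FAMILY + "_down")


def resolve(config: Dict[str, dict], label: str, safe_default: dict) -> dict:
    """Look up settings for a regime label.

    Order: exact match, then the bare family entry for a directional
    variant, then a protective default so the stop never goes missing.
    """
    if label in config:
        return config[label]
    # One-directional: the bare label covers its variants, never the reverse.
    if label in VARIANTS and FAMILY in config:
        return config[FAMILY]
    return safe_default


# Hypothetical values for illustration only.
SAFE = {"atr_mult": 1.0}
old_config = {"ranging_directional": {"atr_mult": 1.5}}

print(resolve(old_config, "ranging_directional_up", SAFE))  # {'atr_mult': 1.5}
print(resolve({}, "ranging_directional_down", SAFE))        # {'atr_mult': 1.0}
```

The design choice worth noting is the last line of `resolve`: an old config keeps working through the family entry, and a config with no matching entry still returns a protective setting instead of nothing.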