AI Coding Bake-off Ep. 3: Six Runs, One Refactor — The Judge Picked Itself (Issue #879)

i. The assignment

One regime to rule them all.

In plain English — the bot decides when to trade partly by classifying the market's regime: trending up, trending down, ranging, with finer-grained flavors. Until now, every strategy computed that label privately, inside its own per-cycle subprocess — two strategies on the same coin redid the same math, and nothing else could ask "what's the regime right now?" Issue #879 moves it all into one Go-scheduler-owned calculator: computed once per cycle in a per-cycle store, injected into every consumer — entry gates, position stamping, dynamic stop-loss, the dashboard, even strategies sitting flat. The issue also settles the scariest question in writing: if the calculator fails, clear the value to empty and fail open — and it names the two alternatives it considered and rejected. This is a refactor of safety-critical plumbing, not a feature bolt-on. Spec-reading discipline was a scored event.
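The "compute once, inject everywhere, clear to empty on failure" shape can be sketched in a few lines. This is a minimal illustration, not the project's code: the real store is owned by the Go scheduler, and the class name, function names, and label strings below are all hypothetical. It also assumes "fail open" means an empty regime never blocks an entry.

```python
from typing import Callable, Dict, FrozenSet


class CycleRegimeStore:
    """Holds one regime label per symbol for a single scheduler cycle (hypothetical)."""

    def __init__(self) -> None:
        self._labels: Dict[str, str] = {}

    def refresh(self, symbol: str, compute: Callable[[str], str]) -> None:
        """Run the calculator once per symbol; on any failure, clear to empty."""
        try:
            self._labels[symbol] = compute(symbol)
        except Exception:
            # Failure policy as the issue is described above: clear the value.
            self._labels[symbol] = ""

    def get(self, symbol: str) -> str:
        """Every consumer reads the stored label instead of recomputing it."""
        return self._labels.get(symbol, "")


def entry_allowed(regime: str, blocked: FrozenSet[str]) -> bool:
    """Assumed fail-open gate: an empty regime never blocks an entry."""
    if regime == "":
        return True
    return regime not in blocked


calls = []


def fake_compute(symbol: str) -> str:
    calls.append(symbol)  # count how often the math actually runs
    return "trending_up"  # hypothetical label name


def broken_compute(symbol: str) -> str:
    raise RuntimeError("calculator failed")


store = CycleRegimeStore()
store.refresh("BTC", fake_compute)
store.refresh("ETH", broken_compute)

# Three consumers on the same coin share one computation.
consumer_views = [store.get("BTC") for _ in range(3)]
```

Here `fake_compute` runs once however many consumers read `BTC`, and the failed `ETH` calculation leaves an empty label that the gate treats as "do not block".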

Read the full issue #879 on GitHub ↗

ii. The contestants — the biggest field yet

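Provider	Harness	Effort	Model
OpenAI	Codex	high	GPT-5.5
Cursor	Cursor	default	Composer 2.5
Anthropic	Claude Code	high	Opus 4.8
OpenAI	Codex	xhigh	GPT-5.5
Anthropic	Claude Code	xhigh	Opus 4.8
Anthropic	Claude Code	high	Fable 5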

iii. How they were judged

Six blind reviews.

One independent, blind reviewer per PR — six reviewers, each sealed off from the other five branches. PR descriptions were treated as claims to falsify, not facts.

Read the issue. Enumerate all 8 acceptance criteria.

Check out the PR at HEAD. Verify every claim against the actual code, with file:line citations.

Build & run everything. go build, go vet, the full Go test suite, and the complete Python suites — locally.

Score /100 against the acceptance criteria. Report the misses, cited.

All six PRs pass every local check — go build, go vet, the full Go and Python suites. Hold that thought: a fully green CI does not mean the feature works. This episode proves it in the harshest way possible.

⚖️ A note on the judge. The reviewers all ran on a single model — and that model is one of the six contestants. We'll name it at the bottom, next to the result it produced. Keep scrolling.

iv. Time to first pass

How long it took.

Wall-clock to produce the first pass: 11 minutes to 53 minutes, a nearly 5× spread for the same issue. And the speed ranking runs close to the reverse of the final ranking: the three fastest runs finished in the bottom three places.

Model	First-pass time
Composer 2.5	11 min
GPT-5.5 high	16 min
GPT-5.5 xhigh	19.5 min
Fable 5	29.5 min
Opus 4.8 high	42 min
Opus 4.8 xhigh	53 min
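The spread and the time-versus-score relationship are easy to check by hand. The sketch below uses the first-pass times from the table above and the final scores reported later in this episode. The Spearman rank formula is a choice made for this illustration, not something the reviewers are described as using.

```python
# First-pass minutes (table above) and final scores (from the countdown below).
first_pass_min = {
    "Composer 2.5": 11.0,
    "GPT-5.5 high": 16.0,
    "GPT-5.5 xhigh": 19.5,
    "Fable 5": 29.5,
    "Opus 4.8 high": 42.0,
    "Opus 4.8 xhigh": 53.0,
}
final_score = {
    "Composer 2.5": 38,
    "GPT-5.5 high": 66,
    "GPT-5.5 xhigh": 70,
    "Fable 5": 91,
    "Opus 4.8 high": 86,
    "Opus 4.8 xhigh": 84,
}


def ranks(values: dict) -> dict:
    """Rank 1 = smallest value. There are no ties in this data."""
    ordered = sorted(values, key=values.get)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman(a: dict, b: dict) -> float:
    """Spearman rank correlation for tie-free data."""
    rank_a, rank_b = ranks(a), ranks(b)
    n = len(a)
    d_squared = sum((rank_a[k] - rank_b[k]) ** 2 for k in a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


spread = max(first_pass_min.values()) / min(first_pass_min.values())
rho = spearman(first_pass_min, final_score)
print(f"spread = {spread:.2f}x, rank correlation = {rho:.2f}")
```

The spread is 53 / 11, or about 4.8×, and the rank correlation between minutes spent and final score is about 0.77: slower runs tended to score higher, though with six data points this is a description of this episode, not a law.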

v. Quota cost

The quota bill.

First-pass spend. Units differ by provider, so these aren't directly comparable — but the gap in magnitude is the story: from ~2% of a monthly plan to half a 5-hour window in under 20 minutes.

Model	Quota consumed
Composer 2.5	~2% of Cursor Pro monthly
Opus 4.8 high	14% of a 5-hour window
Opus 4.8 xhigh	20% of a 5-hour window
Fable 5	27% of a 5-hour window
GPT-5.5 high	44% of a 5-hour window · 7% of weekly
GPT-5.5 xhigh	51% of a 5-hour window · 8% of weekly — in 19.5 min

⛽ Highest burn rate ever recorded in this series: GPT-5.5 at xhigh consumed half a 5-hour quota window in under twenty minutes. Composer, at the other extreme, was roughly 10× cheaper than the Opus runs and roughly 4× to 5× faster (11 min against 42 and 53).

vi. Fix commits & footprint

How many tries — and how much code.

Fix commits after the first pass (the fix→re-review loops ran outside GitHub, so branch history stands in for re-review count) — and how big each diff was.

Model	Fix commits	Files	Lines
Opus 4.8 xhigh #903	0 — single commit	20	+1,241 / −46
GPT-5.5 #899	1	22	+1,070 / −68
Composer 2.5 #900	1	27	+1,440 / −112
Opus 4.8 high #901	1	27	+2,334 / −53
GPT-5.5 #902	1	19	+1,237 / −103
Fable 5 #910	2	20	+1,749 / −46

📐 Side note — diff size is not scored. Diffs ranged from +1,070 to +2,334 lines for the same issue, and size predicted nothing about the final order — in either direction. The one footnote worth keeping: #903 delivered its result in a single commit with zero fix rounds, the only run in the field to do so.

The reckoning

But which one was actually right?

Time and quota tell you about effort. They don't tell you whether a live trading-safety mechanism still fires. All six were scored against the issue's 8 acceptance criteria — every claim verified against the code at HEAD.

Counting down, from sixth place…

— Sixth place —

Composer 2.5

Fastest (11 min), cheapest (~2% of a monthly plan), and the broadest cosmetic sweep of the field: status JSON, web UI, even Discord output. One problem: the new regime subprocess reads market-data rows as dictionaries when the fetcher returns lists, so it crashes on every single real invocation. The store stays permanently empty, and because every consumer was switched away from the old inline math, regime entry-gating (a live trading-safety mechanism) silently stops firing. The kicker: CI stayed green the whole way down, because the only subprocess test exits before reaching the broken line.

38/100
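The failure class is worth seeing in miniature. The sketch below is not the PR's code: the row layout, the `CLOSE_INDEX` constant, and the function names are hypothetical, and the empty-input case stands in for any test that returns before the faulty line runs.

```python
from typing import List, Sequence, Union

# Hypothetical candle layout: [timestamp, open, high, low, close, volume].
CLOSE_INDEX = 4

Row = Union[Sequence[float], dict]


def closes_broken(rows: List[Row]) -> List[float]:
    """Assumes dict rows; indexing a list with a string raises TypeError."""
    return [float(row["close"]) for row in rows]


def closes_fixed(rows: List[Row]) -> List[float]:
    """Accepts either shape, so a fetcher that returns lists still works."""
    out = []
    for row in rows:
        if isinstance(row, dict):
            out.append(float(row["close"]))
        else:
            out.append(float(row[CLOSE_INDEX]))
    return out


def crashes(fn, rows: List[Row]) -> bool:
    """True if fn raises TypeError on these rows."""
    try:
        fn(rows)
    except TypeError:
        return True
    return False


real_rows = [
    [1700000000, 100.0, 101.0, 99.0, 100.5, 12.0],
    [1700000060, 100.5, 102.0, 100.0, 101.25, 9.0],
]

# A test that never feeds a real row stays green.
green_but_useless = not crashes(closes_broken, [])
# The first real invocation does not.
fails_in_production = crashes(closes_broken, real_rows)
```

A test fed the fetcher's actual output shape would have turned CI red on the first run, which is the gap this finding describes.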

— Fifth place —

GPT-5.5 high

Honest description, clean build, but it rebuilt the Python label classifiers in Go instead of emitting them from Python, and the two copies disagree: a zero-ATR market maps to a different label family, and a 4-decimal rounding step flips labels exactly at thresholds. It shipped zero Go-vs-Python parity tests, the one safety net that redesign demanded. Stale window labels also leak through on bundle failure, and there's no dashboard surface at all.

66/100
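To see how a rounding step can flip a label at a threshold, and what a parity test would catch, consider this sketch. Both classifiers are written in Python here purely for illustration; the threshold value, label names, and function names are hypothetical, and `label_port` merely stands in for a reimplementation that rounds to four decimals before comparing.

```python
from typing import List

THRESHOLD = 0.5  # hypothetical trend-strength cutoff


def label_reference(strength: float) -> str:
    """Reference classifier: compares the raw value."""
    return "trending" if strength >= THRESHOLD else "ranging"


def label_port(strength: float) -> str:
    """Stand-in for a port that rounds to 4 decimals first."""
    return "trending" if round(strength, 4) >= THRESHOLD else "ranging"


def parity_mismatches(samples: List[float]) -> List[float]:
    """Inputs on which the two classifiers disagree."""
    return [s for s in samples if label_reference(s) != label_port(s)]


# Probe well away from the threshold and just either side of it.
samples = [0.2, 0.49996, 0.5, 0.50004, 0.8]
mismatches = parity_mismatches(samples)
print(mismatches)
```

Only the value just under the threshold disagrees: 0.49996 rounds up to 0.5 and crosses the cutoff in the port but not in the reference. A parity suite that samples around every threshold is what surfaces this kind of drift.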

— Fourth place —

GPT-5.5 xhigh

A genuinely solid, race-free core that passes everything. But where the issue enumerated three failure policies, picked (b), and documented why the others were rejected, GPT-5.5 quietly invented a fourth option of its own: silent inline recompute instead of clear-to-empty. Defensible engineering, except the whole point of that section was that the decision had already been made. No portfolio dashboard surface, no backtest work, most of the parity-test matrix unwritten, and the highest burn rate of the field.

70/100

— Third place —

Opus 4.8 xhigh

The most disciplined run of the field: a single commit, zero fix rounds, textbook reuse, and an exact implementation of the failure policy. It also shipped the field's subtlest real bug: its subprocess fetches spot candles for OKX perpetuals, so swap strategies silently read a regime computed on the wrong market, which falsifies the PR's "byte-identical" claim. The effort-knob surprise: xhigh scored below high, after 11 more minutes of deliberation.

84/100
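The wrong-market bug comes down to which instrument the candle request names. OKX identifies a spot pair as, for example, `BTC-USDT` and the matching perpetual swap as `BTC-USDT-SWAP`. The helper below is a hypothetical sketch of keeping that choice explicit; how the project actually tags a strategy's market type is an assumption here.

```python
def okx_inst_id(base: str, quote: str, market: str) -> str:
    """Build an OKX instrument ID for a spot pair or a perpetual swap.

    `market` is a hypothetical strategy attribute: "spot" or "swap".
    """
    pair = f"{base}-{quote}".upper()
    if market == "spot":
        return pair
    if market == "swap":
        return f"{pair}-SWAP"
    # Refuse to guess: a silent default is how the wrong market gets read.
    raise ValueError(f"unknown market type: {market!r}")


spot_id = okx_inst_id("btc", "usdt", "spot")
swap_id = okx_inst_id("btc", "usdt", "swap")
print(spot_id, swap_id)
```

Raising on an unknown market type rather than defaulting to spot is the design point: a regime computed on the wrong instrument looks perfectly valid downstream.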

— Second place —

Opus 4.8 high

Zero correctness bugs found: the only contestant with a clean sheet. It deliberately replaced the spec'd two-layer store with a per-window-exact redesign, well-argued and openly disclosed in the PR body, and shipped the only live/backtest parity test in the top three. The misses: options never migrated off inline compute (the store is populated from the result, which is the reverse direction), and a "no double network fetch" claim that's only true on one platform. The biggest diff of the field, +2,334 lines.

86/100

✦ Before the winner — four notes

⚡

Fastest & cheapest

Composer 2.5

A full first draft in 11 min at ~2% of a monthly quota: roughly 4× to 5× faster and ~10× cheaper than the Opus runs. It also produced the one PR that crashes in production.

🎯

Only zero-fix-commit run

Opus 4.8 xhigh

One commit, zero fix rounds, second-smallest diff. Maximum polish — yet more deliberation didn't buy more correctness: it scored two points below its own high-effort twin.

⛽

Highest burn rate

GPT-5.5 xhigh

51% of a 5-hour quota window in 19.5 minutes — the fastest quota burn ever recorded in this series, for a fourth-place finish.

💡

Shared insight

The Claude Code trio

All three Claude Code runs independently discovered the same simplification — the issue's two-layer store collapses to one in this codebase — and argued it openly in their PR bodies. None of the Codex or Cursor runs surfaced it.

🏆

— First place —

Fable 5

PR #910 ↗ · cc/regime-global-store-879

The only PR that migrated every runtime consumer onto the store — six dispatch sites, options, flat manual, every check script — with the failure policy implemented exactly as the issue specified, then hardened past it: a phase time-budget, a store seal against late writes, and a generation guard against cross-cycle stragglers, each with its own regression test. Every line of label math stayed in Python, single-sourced. Its misses were real but bounded: the backtest bullet untouched, a context-cancel leak on kill-switch cycles, one-bar staleness edges.

91/100
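Two of the hardening ideas named above, a seal against late writes and a generation guard against cross-cycle stragglers, can be sketched as follows. This is a single-threaded illustration with hypothetical names; the actual store lives in the Go scheduler and would need locking, and the phase time-budget is left out.

```python
from typing import Dict


class GuardedRegimeStore:
    """Per-cycle store that rejects late and stale writes (hypothetical sketch)."""

    def __init__(self) -> None:
        self.generation = 0
        self._sealed = False
        self._labels: Dict[str, str] = {}

    def begin_cycle(self) -> int:
        """Start a new cycle: bump the generation and drop old labels."""
        self.generation += 1
        self._sealed = False
        self._labels = {}
        return self.generation

    def seal(self) -> None:
        """Close the write phase; consumers now see a frozen snapshot."""
        self._sealed = True

    def write(self, generation: int, symbol: str, label: str) -> bool:
        """Accept a write only from the current, unsealed cycle."""
        if generation != self.generation or self._sealed:
            return False
        self._labels[symbol] = label
        return True

    def get(self, symbol: str) -> str:
        return self._labels.get(symbol, "")


store = GuardedRegimeStore()
gen1 = store.begin_cycle()
on_time = store.write(gen1, "BTC", "ranging")
store.seal()
late = store.write(gen1, "ETH", "trending_up")  # arrives after the seal

gen2 = store.begin_cycle()
straggler = store.write(gen1, "BTC", "stale")   # carries last cycle's generation
```

The late write and the straggler are both rejected, so a consumer in the second cycle reads an empty label for `BTC` rather than a value left over from the first.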

⚖️ Conflict of interest: read before trusting the margin

The judge is Fable 5. The winner is Fable 5. Each PR was scored by an independent, blind per-PR sub-reviewer, but the family bias is structural, and the margin over Opus 4.8 high (86) is just 5 points. Read that margin with the bias in mind. What doesn't depend on taste: every finding above is file:line-cited and falsifiable (a crash on every invocation, a wrong-market data feed, an explicitly rejected failure policy, an unmigrated consumer). Check out the branches and verify.

The headline isn't the top-two order; it's that for the third episode running, the cheapest, fastest runs got the safety-critical detail wrong — and the failure mode escalated from "subtle geometry bug" to "feature does not run, CI green."

✦ Summary

All six, at a glance.

Provider · effort	Model	Score	First-pass time	Fix commits	Quota
Anthropic · high	Fable 5	91/100	29.5 min	2	27% of a 5-hour window
Anthropic · high	Opus 4.8	86/100	42 min	1	14% of a 5-hour window
Anthropic · xhigh	Opus 4.8	84/100	53 min	0	20% of a 5-hour window
OpenAI · xhigh	GPT-5.5	70/100	19.5 min	1	51% of 5h · 8% of weekly
OpenAI · high	GPT-5.5	66/100	16 min	1	44% of 5h · 7% of weekly
Cursor · default	Composer 2.5	38/100	11 min	1	~2% of Cursor Pro monthly

✦ Epilogue — what ships next

Still open.

As of scoring, issue #879 is still open — none of the six PRs has merged. But the precedent is two-for-two: Episodes 1 and 2 both ended with a best-of-N synthesis built on the winner's branch, with the strongest parts of the field folded in. If the pattern holds, Fable 5's #910 — the fully-migrated store and the hardened failure policy — will be the foundation of what ships. Watch the issue for the close.

Issue #879 · watch for the synthesis ↗
