Go-Trader · Issue #879 · Bake-off Episode 3

Six runs.
One refactor.
The judge picked itself.

Six runs. Four models. One winner. Fable 5, Composer 2.5, plus Opus 4.8 and GPT-5.5 each at two effort levels, rebuilt the same safety-critical plumbing in a live trading bot — fully independently, from one issue. We measured what it cost, how long it took, and how correct the code actually was.

One more thing: the judge is one of the contestants. Keep scrolling — the winner is at the bottom.

i. The assignment

One regime to rule them all.

In plain English — the bot decides when to trade partly by classifying the market's regime: trending up, trending down, ranging, with finer-grained flavors. Until now, every strategy computed that label privately, inside its own per-cycle subprocess — two strategies on the same coin redid the same math, and nothing else could ask "what's the regime right now?" Issue #879 moves it all into one Go-scheduler-owned calculator: computed once per cycle in a per-cycle store, injected into every consumer — entry gates, position stamping, dynamic stop-loss, the dashboard, even strategies sitting flat. The issue also settles the scariest question in writing: if the calculator fails, clear the value to empty and fail open — and it names the two alternatives it considered and rejected. This is a refactor of safety-critical plumbing, not a feature bolt-on. Spec-reading discipline was a scored event.
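To make the shape of that change concrete, here is a minimal sketch of a per-cycle store with the clear-to-empty policy. Every name in it (`RegimeStore`, `Compute`, `AllowEntry`, the label strings) is hypothetical and not taken from the Go-Trader codebase, and it reads "fail open" as "an empty label never blocks a trade", which is an assumption about the issue's intent.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errCalc stands in for any calculator failure (hypothetical).
var errCalc = errors.New("calculator failed")

// RegimeStore holds one regime label per symbol for a single scheduler cycle.
// Hypothetical type: the real store's shape is not shown in the write-up.
type RegimeStore struct {
	mu     sync.RWMutex
	labels map[string]string
}

func NewRegimeStore() *RegimeStore {
	return &RegimeStore{labels: make(map[string]string)}
}

// Compute runs the calculator once for a symbol and records the result.
// On failure it clears the value to empty instead of keeping an older label.
func (s *RegimeStore) Compute(symbol string, calc func(string) (string, error)) {
	label, err := calc(symbol)
	s.mu.Lock()
	defer s.mu.Unlock()
	if err != nil {
		s.labels[symbol] = ""
		return
	}
	s.labels[symbol] = label
}

// Get returns the stored label; an empty string means "unknown".
func (s *RegimeStore) Get(symbol string) string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.labels[symbol]
}

// AllowEntry is a sample consumer. An empty regime lets the trade through
// (fail open); a known regime must be in the allowed set.
func AllowEntry(regime string, allowed map[string]bool) bool {
	if regime == "" {
		return true
	}
	return allowed[regime]
}

func main() {
	store := NewRegimeStore()
	store.Compute("BTC", func(string) (string, error) { return "trending_up", nil })
	store.Compute("ETH", func(string) (string, error) { return "", errCalc })

	allowed := map[string]bool{"trending_up": true}
	fmt.Println(store.Get("BTC"), AllowEntry(store.Get("BTC"), allowed))
	fmt.Println(store.Get("ETH") == "", AllowEntry(store.Get("ETH"), allowed))
}
```

The point of the sketch is that every consumer reads the same value, so a failure is visible in exactly one place, and the cost of that is that a permanently failing calculator disables gating everywhere at once.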

iii. How they were judged

Six blind reviews.

One independent, blind reviewer per PR — six reviewers, each sealed off from the other five branches. PR descriptions were treated as claims to falsify, not facts.

01. Read the issue. Enumerate all 8 acceptance criteria.

02. Check out the PR at HEAD. Verify every claim against the actual code, with file:line citations.

03. Build & run everything: go build, go vet, the full Go test suite, and the complete Python suites, all locally.

04. Score /100 against the acceptance criteria. Report the misses, cited.

All six PRs pass every local check — go build, go vet, the full Go and Python suites. Hold that thought: a fully green CI does not mean the feature works. This episode proves it in the harshest way possible.
⚖️ A note on the judge. The reviewers all ran on a single model — and that model is one of the six contestants. We'll name it at the bottom, next to the result it produced. Keep scrolling.
iv. Time to first pass

How long it took.

Wall-clock to produce the first pass: 11 minutes to 53 minutes, a roughly 5× spread for the same issue. And speed ran against the final scores: the three fastest runs are the three lowest-scoring.

Model | First-pass time
Composer 2.5 | 11 min
GPT-5.5 high | 16 min
GPT-5.5 xhigh | 19.5 min
Fable 5 | 29.5 min
Opus 4.8 high | 42 min
Opus 4.8 xhigh | 53 min
v. Quota cost

The quota bill.

First-pass spend. Units differ by provider, so these aren't directly comparable — but the gap in magnitude is the story: from ~2% of a monthly plan to half a 5-hour window in under 20 minutes.

Model | Quota consumed
Composer 2.5 | ~2% of Cursor Pro monthly
Opus 4.8 high | 14% of a 5-hour window
Opus 4.8 xhigh | 20% of a 5-hour window
Fable 5 | 27% of a 5-hour window
GPT-5.5 high | 44% of a 5-hour window · 7% of weekly
GPT-5.5 xhigh | 51% of a 5-hour window · 8% of weekly (in 19.5 min)
Highest burn rate ever recorded in this series: GPT-5.5 at xhigh consumed half a 5-hour quota window in under twenty minutes. Composer, at the other extreme, was roughly 10× cheaper than the Opus runs and about 4 to 5× faster (11 min against 42 and 53).
vi. Fix commits & footprint

How many tries — and how much code.

Fix commits after the first pass (the fix→re-review loops ran outside GitHub, so branch history stands in for re-review count) — and how big each diff was.

Model | Fix commits | Files | Lines
Opus 4.8 xhigh #903 | 0 (single commit) | 20 | +1,241 / −46
GPT-5.5 #899 | 1 | 22 | +1,070 / −68
Composer 2.5 #900 | 1 | 27 | +1,440 / −112
Opus 4.8 high #901 | 1 | 27 | +2,334 / −53
GPT-5.5 #902 | 1 | 19 | +1,237 / −103
Fable 5 #910 | 2 | 20 | +1,749 / −46
📐 Side note — diff size is not scored. Diffs ranged from +1,070 to +2,334 lines for the same issue, and size predicted nothing about the final order — in either direction. The one footnote worth keeping: #903 delivered its result in a single commit with zero fix rounds, the only run in the field to do so.
The reckoning

But which one was actually right?

Time and quota tell you about effort. They don't tell you whether a live trading-safety mechanism still fires. All six were scored against the issue's 8 acceptance criteria — every claim verified against the code at HEAD.

Counting down, from sixth place…

— Sixth place —
Composer 2.5

Fastest (11 min), cheapest (~2% of a monthly plan), and the broadest cosmetic sweep of the field: status JSON, web UI, even Discord output. One problem: the new regime subprocess reads market-data rows as dictionaries when the fetcher returns lists, so it crashes on every single real invocation. The store stays permanently empty, and because every consumer was switched away from the old inline math, regime entry-gating (a live trading-safety mechanism) silently stops firing. The kicker: CI stayed green the whole way down, because the only subprocess test exits before reaching the broken line.
38/100
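The shape mismatch is easy to reproduce in miniature. The real subprocess is Python; the sketch below restates the same failure in Go purely for illustration, and the row layout (`[timestamp, open, high, low, close]`), the fixture, and both function names are assumptions rather than anything from the PR. A test that feeds the parser a fixture shaped like real fetcher output fails immediately, which is the check the green CI run never performed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// listRows mimics list-shaped candle rows: [timestamp, open, high, low, close].
// The field order is an assumption made for this example.
const listRows = `[[1700000000, 100, 101, 99, 100.5]]`

// closesFromDicts assumes every row is an object with a "close" key.
// Fed list-shaped rows, decoding fails before any value is read.
func closesFromDicts(raw []byte) ([]float64, error) {
	var rows []map[string]float64
	if err := json.Unmarshal(raw, &rows); err != nil {
		return nil, err
	}
	out := make([]float64, 0, len(rows))
	for _, r := range rows {
		out = append(out, r["close"])
	}
	return out, nil
}

// closesFromLists reads the close from index 4 of each list-shaped row.
func closesFromLists(raw []byte) ([]float64, error) {
	var rows [][]float64
	if err := json.Unmarshal(raw, &rows); err != nil {
		return nil, err
	}
	out := make([]float64, 0, len(rows))
	for _, r := range rows {
		if len(r) < 5 {
			return nil, fmt.Errorf("row has %d fields, want at least 5", len(r))
		}
		out = append(out, r[4])
	}
	return out, nil
}

func main() {
	_, err := closesFromDicts([]byte(listRows))
	fmt.Println("dict reader failed:", err != nil)

	closes, err := closesFromLists([]byte(listRows))
	fmt.Println("list reader:", closes, err)
}
```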
— Fifth place —
GPT-5.5 high

Honest description, clean build. But it rebuilt the Python label classifiers in Go instead of emitting them from Python, and the two copies disagree: a zero-ATR market maps to a different label family, and a 4-decimal rounding step flips labels exactly at thresholds. It shipped zero Go-vs-Python parity tests, the one safety net that redesign demanded. Stale window labels also leak through on bundle failure, and there's no dashboard surface at all.
66/100
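A small sketch shows why a rounding step matters at a threshold and what a parity check looks like. The threshold value, the label names, and the choice of which classifier rounds are all hypothetical; the write-up only says that a 4-decimal rounding step exists in one copy and that the labels flip at thresholds.

```go
package main

import (
	"fmt"
	"math"
)

// trendThreshold is a hypothetical cutoff; the real thresholds are not given.
const trendThreshold = 0.25

// labelRaw compares the unrounded value against the threshold.
func labelRaw(strength float64) string {
	if strength > trendThreshold {
		return "trending"
	}
	return "ranging"
}

// labelRounded rounds to 4 decimals first, then compares.
func labelRounded(strength float64) string {
	rounded := math.Round(strength*1e4) / 1e4
	if rounded > trendThreshold {
		return "trending"
	}
	return "ranging"
}

// parityMismatches returns every input on which the two classifiers disagree.
func parityMismatches(inputs []float64) []float64 {
	var bad []float64
	for _, x := range inputs {
		if labelRaw(x) != labelRounded(x) {
			bad = append(bad, x)
		}
	}
	return bad
}

func main() {
	// 0.25004 sits just above the threshold and rounds down onto it.
	inputs := []float64{0.1, 0.25, 0.25004, 0.2501, 0.9}
	fmt.Println("disagreements:", parityMismatches(inputs))
}
```

A parity suite in this spirit would sweep values around each threshold and fail on any disagreement, which is cheap to write once the two implementations can be called side by side.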
— Fourth place —
GPT-5.5 xhigh

A genuinely solid, race-free core that passes everything. But where the issue enumerated three failure policies, picked (b), and documented why the others were rejected, GPT-5.5 quietly invented a fourth option of its own: silent inline recompute instead of clear-to-empty. Defensible engineering, except the whole point of that section was that the decision had already been made. No portfolio dashboard surface, no backtest work, most of the parity-test matrix unwritten, and the highest burn rate of the field.
70/100
— Third place —
Opus 4.8 xhigh

The most disciplined run of the field: a single commit, zero fix rounds, textbook reuse, and an exact implementation of the failure policy. It also shipped the field's subtlest real bug: its subprocess fetches spot candles for OKX perpetuals, so swap strategies silently read a regime computed on the wrong market, which falsifies the PR's "byte-identical" claim. The effort-knob surprise: xhigh scored below high, after 11 more minutes of deliberation.
84/100
— Second place —
Opus 4.8 high

Zero correctness bugs found: the only contestant with a clean sheet. It deliberately replaced the spec'd two-layer store with a per-window-exact redesign, well-argued and openly disclosed in the PR body, and shipped the only live/backtest parity test in the top three. The misses: options never migrated off inline compute (the store is populated from the result, which is the reverse direction), and a "no double network fetch" claim that's only true on one platform. The biggest diff of the field, +2,334 lines.
86/100
Before the winner — four notes

Fastest & cheapest

Composer 2.5

A full first draft in 11 min at ~2% of a monthly quota, about 4 to 5× faster and ~10× cheaper than the Opus runs. It also produced the one PR that crashes in production.

🎯

Only zero-fix-commit run

Opus 4.8 xhigh

One commit, zero fix rounds, second-smallest diff. Maximum polish — yet more deliberation didn't buy more correctness: it scored two points below its own high-effort twin.

Highest burn rate

GPT-5.5 xhigh

51% of a 5-hour quota window in 19.5 minutes — the fastest quota burn ever recorded in this series, for a fourth-place finish.

💡

Shared insight

The Claude Code trio

All three Claude Code runs independently discovered the same simplification — the issue's two-layer store collapses to one in this codebase — and argued it openly in their PR bodies. None of the Codex or Cursor runs surfaced it.

🏆
— First place —

Fable 5

PR #910 · cc/regime-global-store-879

The only PR that migrated every runtime consumer onto the store — six dispatch sites, options, flat manual, every check script — with the failure policy implemented exactly as the issue specified, then hardened past it: a phase time-budget, a store seal against late writes, and a generation guard against cross-cycle stragglers, each with its own regression test. Every line of label math stayed in Python, single-sourced. Its misses were real but bounded: the backtest bullet untouched, a context-cancel leak on kill-switch cycles, one-bar staleness edges.

91 /100
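Two of those hardening steps, the seal against late writes and the generation guard, can be sketched in a few lines. This is an illustration under stated assumptions, not the code in #910: `CycleStore`, `Begin`, `Seal`, and `Put` are hypothetical names, and the phase time-budget is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// CycleStore is a hypothetical per-cycle label store with two guards.
type CycleStore struct {
	mu         sync.Mutex
	generation uint64
	sealed     bool
	labels     map[string]string
}

func NewCycleStore() *CycleStore {
	return &CycleStore{labels: make(map[string]string)}
}

// Begin starts a new cycle: it bumps the generation, drops old labels,
// unseals the store, and returns the generation writers must present.
func (s *CycleStore) Begin() uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.generation++
	s.sealed = false
	s.labels = make(map[string]string)
	return s.generation
}

// Seal closes the compute phase; later writes in this cycle are rejected.
func (s *CycleStore) Seal() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sealed = true
}

// Put records a label only if the store is still open and the writer
// belongs to the current generation. It reports whether the write landed.
func (s *CycleStore) Put(gen uint64, symbol, label string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.sealed || gen != s.generation {
		return false
	}
	s.labels[symbol] = label
	return true
}

// Get returns the stored label; an empty string means "unknown".
func (s *CycleStore) Get(symbol string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.labels[symbol]
}

func main() {
	store := NewCycleStore()
	gen1 := store.Begin()
	fmt.Println("in-phase write:", store.Put(gen1, "BTC", "ranging"))

	store.Seal()
	fmt.Println("late write:", store.Put(gen1, "ETH", "trending_up"))

	store.Begin() // next cycle
	fmt.Println("straggler write:", store.Put(gen1, "BTC", "stale"))
	fmt.Println("BTC empty in new cycle:", store.Get("BTC") == "")
}
```

Returning a boolean from `Put` keeps rejected writes observable, so each guard can be pinned down by its own regression test.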
⚖️ Conflict of interest: read before trusting the margin. The judge is Fable 5. The winner is Fable 5. Each PR was scored by an independent, blind per-PR sub-reviewer, but the family bias is structural, and the margin over Opus 4.8 high (86) is just 5 points. Read that margin with the bias in mind. What doesn't depend on taste is the evidence. Every finding above is file:line-cited and falsifiable: a crash on every invocation, a wrong-market data feed, an explicitly rejected failure policy, an unmigrated consumer. Check out the branches and verify.

The headline isn't the top-two order; it's that for the third episode running, the cheapest, fastest runs got the safety-critical detail wrong — and the failure mode escalated from "subtle geometry bug" to "feature does not run, CI green."

Summary

All six, at a glance.

Provider · effort | Model | Score | First-pass time | Fix commits | Quota
Anthropic · high | Fable 5 | 91/100 | 29.5 min | 2 | 27% of a 5-hour window
Anthropic · high | Opus 4.8 | 86/100 | 42 min | 1 | 14% of a 5-hour window
Anthropic · xhigh | Opus 4.8 | 84/100 | 53 min | 0 | 20% of a 5-hour window
OpenAI · xhigh | GPT-5.5 | 70/100 | 19.5 min | 1 | 51% of 5h · 8% of weekly
OpenAI · high | GPT-5.5 | 66/100 | 16 min | 1 | 44% of 5h · 7% of weekly
Cursor · default | Composer 2.5 | 38/100 | 11 min | 1 | ~2% of Cursor Pro monthly
Epilogue — what ships next

Still open.

As of scoring, issue #879 is still open — none of the six PRs has merged. But the precedent is two-for-two: Episodes 1 and 2 both ended with a best-of-N synthesis built on the winner's branch, with the strongest parts of the field folded in. If the pattern holds, Fable 5's #910 — the fully-migrated store and the hardened failure policy — will be the foundation of what ships. Watch the issue for the close.