One regime to rule them all.
In plain English — the bot decides when to trade partly by classifying the market's regime: trending up, trending down, ranging, with finer-grained flavors. Until now, every strategy computed that label privately, inside its own per-cycle subprocess — two strategies on the same coin redid the same math, and nothing else could ask "what's the regime right now?" Issue #879 moves it all into one Go-scheduler-owned calculator: computed once per cycle in a per-cycle store, injected into every consumer — entry gates, position stamping, dynamic stop-loss, the dashboard, even strategies sitting flat. The issue also settles the scariest question in writing: if the calculator fails, clear the value to empty and fail open — and it names the two alternatives it considered and rejected. This is a refactor of safety-critical plumbing, not a feature bolt-on. Spec-reading discipline was a scored event.
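To make the shape of that change concrete, here is a minimal sketch of a compute-once, per-cycle store with the clear-to-empty, fail-open policy. It is written in Python for brevity, although the issue places the store in the Go scheduler. Every name in it (`CycleRegimeStore`, `compute_once`, `entry_gate_allows`, the label strings) is hypothetical, and reading "fail open" as "an empty label does not block an entry" is an assumption, not something the issue text quoted here confirms.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

EMPTY = ""  # the cleared value, meaning "regime unknown this cycle"


@dataclass
class CycleRegimeStore:
    """Hypothetical per-cycle store: one regime label per symbol."""

    labels: Dict[str, str] = field(default_factory=dict)

    def compute_once(self, symbol: str, calc: Callable[[str], str]) -> str:
        # A second consumer on the same symbol reuses the first result.
        if symbol in self.labels:
            return self.labels[symbol]
        try:
            label = calc(symbol)
        except Exception:
            # Failure policy: clear the value to empty rather than guess.
            label = EMPTY
        self.labels[symbol] = label
        return label


def entry_gate_allows(label: str, allowed: Set[str]) -> bool:
    # Assumed reading of "fail open": an unknown regime does not block.
    return label == EMPTY or label in allowed


# Usage: two strategies on the same coin trigger only one calculation.
calls = []


def fake_calc(symbol: str) -> str:
    calls.append(symbol)
    return "trending_up"


def broken_calc(symbol: str) -> str:
    raise RuntimeError("calculator failed")


store = CycleRegimeStore()
first = store.compute_once("BTC", fake_calc)
second = store.compute_once("BTC", fake_calc)
failed = store.compute_once("ETH", broken_calc)
```

A fresh store would be created at the start of each scheduler cycle, so a label never outlives the cycle that computed it.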
GPT-5.5
Composer 2.5
Opus 4.8
GPT-5.5
Opus 4.8
Fable 5
Six blind reviews.
One independent, blind reviewer per PR — six reviewers, each sealed off from the other five branches. PR descriptions were treated as claims to falsify, not facts.
1. Read the issue. Enumerate all 8 acceptance criteria.
2. Check out the PR at HEAD. Verify every claim against the actual code, with file:line citations.
3. Build & run everything. go build, go vet, the full Go test suite, and the complete Python suites, all locally.
4. Score /100 against the acceptance criteria. Report the misses, cited.
How long it took.
Wall-clock time to produce the first pass ranged from 11 to 53 minutes, a roughly 5× spread for the same issue. The speed ranking runs almost exactly opposite to the final scores.
| Model | First-pass time |
|---|---|
| Composer 2.5 | 11 min |
| GPT-5.5 high | 16 min |
| GPT-5.5 xhigh | 19.5 min |
| Fable 5 | 29.5 min |
| Opus 4.8 high | 42 min |
| Opus 4.8 xhigh | 53 min |
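
The spread quoted above is the slowest first pass divided by the fastest. A quick check using the figures from the table (the variable names are illustrative only):

```python
# First-pass wall-clock times in minutes, copied from the table above.
first_pass_minutes = {
    "Composer 2.5": 11,
    "GPT-5.5 high": 16,
    "GPT-5.5 xhigh": 19.5,
    "Fable 5": 29.5,
    "Opus 4.8 high": 42,
    "Opus 4.8 xhigh": 53,
}

fastest_model = min(first_pass_minutes, key=first_pass_minutes.get)
slowest_model = max(first_pass_minutes, key=first_pass_minutes.get)
spread = first_pass_minutes[slowest_model] / first_pass_minutes[fastest_model]

print(f"{slowest_model} vs {fastest_model}: {spread:.1f}x")  # 4.8x
```

The exact ratio is about 4.8, which the headline rounds to 5×.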
The quota bill.
First-pass spend. Units differ by provider, so these aren't directly comparable — but the gap in magnitude is the story: from ~2% of a monthly plan to half a 5-hour window in under 20 minutes.
| Model | Quota consumed |
|---|---|
| Composer 2.5 | ~2% of Cursor Pro monthly |
| Opus 4.8 high | 14% of a 5-hour window |
| Opus 4.8 xhigh | 20% of a 5-hour window |
| Fable 5 | 27% of a 5-hour window |
| GPT-5.5 high | 44% of a 5-hour window · 7% of weekly |
| GPT-5.5 xhigh | 51% of a 5-hour window · 8% of weekly — in 19.5 min |
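
For the five runs billed against a 5-hour window, a rough per-minute burn rate can be derived by dividing the window share by the first-pass time. This sketch assumes the quota was consumed over exactly the first-pass wall-clock time, which the source does not state for every run. Composer 2.5 is left out because its quota is reported in different units.

```python
# Share of a 5-hour window consumed (percent), from the quota table.
window_share_pct = {
    "Opus 4.8 high": 14,
    "Opus 4.8 xhigh": 20,
    "Fable 5": 27,
    "GPT-5.5 high": 44,
    "GPT-5.5 xhigh": 51,
}

# First-pass wall-clock time in minutes, from the timing table.
minutes = {
    "Opus 4.8 high": 42,
    "Opus 4.8 xhigh": 53,
    "Fable 5": 29.5,
    "GPT-5.5 high": 16,
    "GPT-5.5 xhigh": 19.5,
}

# Percent of a 5-hour window consumed per minute of first-pass work.
burn_rate = {m: window_share_pct[m] / minutes[m] for m in window_share_pct}

for model, rate in sorted(burn_rate.items(), key=lambda kv: kv[1]):
    print(f"{model}: {rate:.2f}% of window per minute")
```

On these figures both GPT-5.5 runs land between roughly 2.6% and 2.75% of a window per minute, several times the rate of any Opus or Fable run.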
How many tries — and how much code.
Fix commits after the first pass (the fix→re-review loops ran outside GitHub, so branch history stands in for re-review count) — and how big each diff was.
| Model / PR | Fix commits | Files changed | Lines (added / removed) |
|---|---|---|---|
| Opus 4.8 xhigh #903 | 0 — single commit | 20 | +1,241 / −46 |
| GPT-5.5 #899 | 1 | 22 | +1,070 / −68 |
| Composer 2.5 #900 | 1 | 27 | +1,440 / −112 |
| Opus 4.8 high #901 | 1 | 27 | +2,334 / −53 |
| GPT-5.5 #902 | 1 | 19 | +1,237 / −103 |
| Fable 5 #910 | 2 | 20 | +1,749 / −46 |
But which one was actually right?
Time and quota tell you about effort. They don't tell you whether a live trading-safety mechanism still fires. All six were scored against the issue's 8 acceptance criteria — every claim verified against the code at HEAD.
Counting down, from sixth place…
Fastest & cheapest
Composer 2.5 delivered a full first draft in 11 min at ~2% of a monthly quota, roughly 4 to 5× faster than the Opus runs and nominally ~10× cheaper (quota units differ by provider). It also produced the one PR that crashes in production.
Only zero-fix-commit run
Opus 4.8 xhigh shipped one commit with zero fix rounds and the second-smallest diff. Maximum polish, yet more deliberation didn't buy more correctness: it scored two points below its own high-effort twin.
Highest burn rate
GPT-5.5 xhigh used 51% of a 5-hour quota window in 19.5 minutes, the fastest quota burn ever recorded in this series, for a fourth-place finish.
Shared insight
All three Claude Code runs independently discovered the same simplification — the issue's two-layer store collapses to one in this codebase — and argued it openly in their PR bodies. None of the Codex or Cursor runs surfaced it.
Fable 5
The only PR that migrated every runtime consumer onto the store — six dispatch sites, options, flat manual, every check script — with the failure policy implemented exactly as the issue specified, then hardened past it: a phase time-budget, a store seal against late writes, and a generation guard against cross-cycle stragglers, each with its own regression test. Every line of label math stayed in Python, single-sourced. Its misses were real but bounded: the backtest bullet untouched, a context-cancel leak on kill-switch cycles, one-bar staleness edges.
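The three hardening measures named above (a phase time-budget, a seal against late writes, a generation guard against cross-cycle stragglers) can be illustrated with a small sketch. This is not the code from #910: the class `SealedCycleStore`, its methods, and the injectable clock are all hypothetical, chosen only to show how the three checks could combine on the write path.

```python
import time
from typing import Callable, Dict


class SealedCycleStore:
    """Hypothetical per-cycle store that rejects late or stale writes."""

    def __init__(
        self,
        generation: int,
        budget_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.generation = generation          # id of the cycle that owns this store
        self._clock = clock
        self._deadline = clock() + budget_seconds  # phase time-budget
        self._sealed = False
        self._labels: Dict[str, str] = {}

    def write(self, symbol: str, label: str, generation: int) -> bool:
        # Generation guard: drop results from a different cycle.
        if generation != self.generation:
            return False
        # Seal and time-budget: drop writes once the phase is over.
        if self._sealed or self._clock() > self._deadline:
            return False
        self._labels[symbol] = label
        return True

    def seal(self) -> None:
        # After sealing, consumers see a stable snapshot for the cycle.
        self._sealed = True

    def read(self, symbol: str) -> str:
        # A missing label reads as empty, matching the fail-open policy.
        return self._labels.get(symbol, "")


# Usage with a fake clock so the behavior is deterministic.
now = [0.0]
store = SealedCycleStore(generation=7, budget_seconds=5.0, clock=lambda: now[0])

accepted = store.write("BTC", "ranging", generation=7)
stale = store.write("ETH", "trending_up", generation=6)    # wrong cycle
now[0] = 6.0
over_budget = store.write("ETH", "trending_up", generation=7)  # past deadline
now[0] = 1.0
store.seal()
after_seal = store.write("SOL", "trending_down", generation=7)  # sealed
```

Returning a boolean instead of raising keeps a straggling writer from crashing the cycle, and each rejected write leaves the symbol's label empty.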
The headline isn't the top-two order; it's that for the third episode running, the cheapest, fastest runs got the safety-critical detail wrong — and the failure mode escalated from "subtle geometry bug" to "feature does not run, CI green."
All six, at a glance.
Still open.
As of scoring, issue #879 is still open — none of the six PRs has merged. But the precedent is two-for-two: Episodes 1 and 2 both ended with a best-of-N synthesis built on the winner's branch, with the strongest parts of the field folded in. If the pattern holds, Fable 5's #910 — the fully-migrated store and the hardened failure policy — will be the foundation of what ships. Watch the issue for the close.