AI Coding Bake-off Ep. 4: Three Models, One Audit — The Winner Wrote the Least Code (Issue #906)

i. The assignment

Two brains that have to agree.

In plain English — my bot has two brains. The backtester replays years of price history in seconds to tell me whether a strategy makes money. The live engine makes the real decisions, bar by bar, with real money. The entire premise of backtesting is that those two brains decide the same thing on the same data — and they quietly drift apart. The backtester sees the whole price history at once and can normalize a number against data that, live, wouldn't exist yet; the live engine only ever sees a trailing window ending at the last closed bar. Issue #906 asked for a full audit of the engine, but it named one deliverable as the prize: a parity tool that runs both brains over the same candles and prints a bar-by-bar diff of every decision. Build that, and you catch drift the moment it appears — instead of after it's cost you money. That one sentence, "runs both decision paths," turned out to be the whole contest.
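To make the drift concrete, here is a minimal sketch of the mechanism. It is illustrative only: the z-score signal, the `window` and `threshold` values, and the function names are assumptions made for the example, not the bot's actual code. The backtest framing normalizes every bar against the full series; the live framing normalizes against a trailing window ending at the current bar.

```python
from statistics import mean, pstdev

def zscore(value, sample):
    """Z-score of value against sample; 0.0 when the sample has no spread."""
    sd = pstdev(sample)
    return 0.0 if sd == 0 else (value - mean(sample)) / sd

def backtest_signals(closes, threshold=1.0):
    """Backtest framing: each bar is normalized against the FULL series,
    including bars that would not exist yet in live trading."""
    return [zscore(c, closes) > threshold for c in closes]

def live_signals(closes, window=5, threshold=1.0):
    """Live framing: each bar only sees a trailing window ending at that bar."""
    out = []
    for i, c in enumerate(closes):
        trailing = closes[max(0, i - window + 1): i + 1]
        out.append(zscore(c, trailing) > threshold)
    return out

# Hypothetical closing prices: a quiet stretch followed by a strong trend.
closes = [100, 101, 102, 103, 104, 110, 120, 130, 140, 150]
bt = backtest_signals(closes)
lv = live_signals(closes)
mismatches = [i for i, (a, b) in enumerate(zip(bt, lv)) if a != b]
print(f"{len(mismatches)}/{len(closes)} bars disagree: {mismatches}")
```

Same strategy, same candles, different answers on several bars. Nothing here is a bug in the usual sense; the two framings simply see different data.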

Read the full issue #906 on GitHub ↗

ii. The contestants

Cursor · Cursor Agent · high

iii. How they were judged

Three blind reviews.

One independent, blind reviewer per PR, each sealed off from the other two branches. Crucially, no reviewer stopped at reading the diff: each one ran the tool against real and synthetic price data. PR descriptions were treated as claims to falsify, not facts.

Read the issue. Enumerate the four acceptance criteria.

Check out the PR at HEAD. Verify every claim against the actual code, with file:line citations.

Run the tool, run the tests. Drive the parity tool on real data; run the new and surrounding suites — locally.

Score /100 against the criteria. Report the misses, cited.

All three PRs pass their own and the surrounding test suites — this was not a "spot the broken build" episode. The floor was genuinely high: every one is a plausible, test-passing audit. The ranking is decided on the one thing that's hard — does the parity tool actually run two independent decision paths — and on whether the audit found real bugs.

⚖️ A note on the judge. The reviewers all ran on a single model — an Anthropic model (Opus 4.8) — and the winning PR is also Anthropic (Fable 5), while the runner-up two points back is not. We'll come back to this at the bottom, next to the result it produced.

iv. The deciding sentence

"Runs both decision paths."

The tool's value rests entirely on running two genuinely independent paths: the vectorized, full-history backtest, and a trailing-window replay of how the bot actually decides live. That cross-framing divergence — full-series normalization, unseeded rolling state — is the silent mechanism behind most parity bugs, and the engine's only guard against look-ahead cannot catch it. A tool that diffs a backtest against itself, or only emits a backtest trace and asks you to bring the "live" half, doesn't deliver the criterion — no matter how clean the code is.

Model	Runs two real paths?	Catches frame-dependent drift (verified)
Fable 5	Yes — vector + trailing-window	0 mismatches clean · 42/201 on a drift-prone strategy
Composer 2.5	Yes — reuses the real live helpers	0 mismatches clean · 48/50 on a drift-prone strategy
GPT-5.5	No — one path, live side is BYO	Cannot — no independent live path

🎯 Two of three took it literally. Fable modeled the live check-script semantics; Composer reused the actual live helpers (the most faithful version in the field). GPT-5.5 built a tool that runs the backtest and asks you to hand it the live trace as a spreadsheet — so it structurally can't catch the drift it exists to catch.
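For readers who want the shape of a genuine two-path tool, here is a rough sketch. Everything in it is hypothetical: `parity_diff`, `vector_path`, `live_step`, the toy "above the mean" strategy, and the decision labels are stand-ins, and none of the three PRs necessarily looks like this. The point is only that the tool itself has to execute both paths over the same candles.

```python
from typing import Callable, List, Sequence, Tuple

Decision = str  # illustrative labels such as "buy" or "hold"

def parity_diff(
    candles: Sequence[float],
    vector_path: Callable[[Sequence[float]], List[Decision]],
    live_step: Callable[[Sequence[float]], Decision],
    window: int,
) -> List[Tuple[int, Decision, Decision]]:
    """Run both paths over the same candles.
    Returns (bar index, backtest decision, live decision) for every mismatch."""
    backtest = vector_path(candles)            # sees the whole series at once
    mismatches = []
    for i in range(len(candles)):
        trailing = candles[max(0, i - window + 1): i + 1]
        live = live_step(trailing)             # sees only the trailing window
        if live != backtest[i]:
            mismatches.append((i, backtest[i], live))
    return mismatches

# Toy strategy: buy when the price is above the mean of the data in view.
def vector_path(candles):
    avg = sum(candles) / len(candles)
    return ["buy" if c > avg else "hold" for c in candles]

def live_step(trailing):
    avg = sum(trailing) / len(trailing)
    return "buy" if trailing[-1] > avg else "hold"

candles = [10, 11, 12, 13, 14, 15]
for bar, bt, lv in parity_diff(candles, vector_path, live_step, window=3):
    print(f"bar {bar}: backtest={bt} live={lv}")
```

On this toy data the diff flags bars 1 and 2, where the trailing window says "buy" and the full-series mean says "hold". A tool that only produces the `backtest` list and expects the live decisions to arrive from elsewhere never executes the loop body, which is the part that does the catching.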

v. Diff size & what the audit found

Least code. Most bugs found.

Diff size is never scored, but here the smallest diff also took the top score. The winner shipped the smallest diff of the field, entirely additive (no production code touched, so zero regression risk), and was the only run that came back holding real bugs.

Model	Lines	Files	Real bugs found
Fable 5 #946	+537 / −0	3	2 — cited & filed
GPT-5.5 #936	+897 / −6	8	0 — shipped a broken flag
Composer 2.5 #935	+1,062 / −0	6	0

📐 Side note: diff size is not scored. The winner did the most with about half of the runner-up's code (+537 vs. +1,062), across half the files, touching nothing already running. The lowest score sits in the middle on lines but touched the most files, and a chunk of that is the broken composite-regime flag.

The reckoning

But which one was actually right?

A clean build tells you nothing about whether the tool catches the drift it exists to catch. All three were scored against the issue's four acceptance criteria — every claim verified against the code at HEAD, the tool run on real data.

Counting down, from third place…

— Third place —

GPT-5.5

Its tool runs the backtest, produces a per-bar trace… and then asks you to hand it the live trace as a spreadsheet from somewhere else. It never runs the live path itself, so it structurally cannot catch the drift the tool exists to catch. On top of that it shipped a feature nobody asked for: a new market-regime classifier flag. Run it, and it's a silent no-op. It only does anything if you also pass a separate config it never mentions, the help text claims otherwise, and no test would ever notice. A feature documented as done, passing its tests, doing nothing.
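One common way to keep a dependent flag from becoming a silent no-op is to fail at argument-parsing time. The sketch below uses Python's standard `argparse`; the program name and the `--regime-filter` / `--regime-config` flag names are invented for illustration and are not the names used in the PR.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="parity-tool")  # hypothetical CLI name
    parser.add_argument(
        "--regime-filter", action="store_true",
        help="apply the regime classifier (requires --regime-config)",
    )
    parser.add_argument(
        "--regime-config", default=None,
        help="path to the regime classifier config",
    )
    return parser

def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Refuse to run rather than quietly ignore the flag.
    if args.regime_filter and args.regime_config is None:
        parser.error("--regime-filter has no effect without --regime-config")
    return args

if __name__ == "__main__":
    args = parse_args(["--regime-filter", "--regime-config", "regime.yaml"])
    print(args.regime_filter, args.regime_config)
```

`parser.error` prints the usage message and exits with status 2, so a test that passes only the flag can assert on the failure instead of on a green run that did nothing.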

58/100

— Second place —

Composer 2.5

This one genuinely impressed me. Its tool runs both paths for real, and its "live" side calls the same code the live bot calls, the most faithful version in the field. Verified: zero disagreements on a normal strategy, 48 of 50 bars flagged on a drift-prone one. It also shipped the most polish: config-driven runs, real trade extraction, structured output. What cost it the win: its brand-new "regression tests" are hollow. They check that a test file exists and mentions a bug number, not that the bug is caught. And it took the most code in the field to get there.
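The difference between a hollow regression test and a behavioral one is easy to show in miniature. This is a toy reconstruction, not the PR's test code: the bug number `123`, the file name, and the `apply_signal` function are all made up for the example.

```python
from pathlib import Path
import tempfile

def hollow_check(test_file: Path, bug_id: str) -> bool:
    """Passes whenever the file exists and mentions the bug number."""
    return test_file.exists() and bug_id in test_file.read_text()

def fixed_apply_signal(raw: int, invert_signal: bool) -> int:
    """Toy code under test: flips the signal when asked."""
    return -raw if invert_signal else raw

def broken_apply_signal(raw: int, invert_signal: bool) -> int:
    """Same function with the bug present: the setting is ignored."""
    return raw

def behavioral_check(fn) -> bool:
    """Passes only if the behavior the bug broke is actually correct."""
    return fn(1, invert_signal=True) == -1

with tempfile.TemporaryDirectory() as tmp:
    stub = Path(tmp) / "test_bug_123.py"
    stub.write_text("# regression for bug 123\n")  # no assertions at all
    hollow_passes = hollow_check(stub, "123")

print("hollow check:", hollow_passes)
print("behavioral check, bug present:", behavioral_check(broken_apply_signal))
print("behavioral check, bug fixed:", behavioral_check(fixed_apply_signal))
```

The hollow check stays green whether or not the bug is present, because it never calls the code. Only the behavioral check changes its answer when the bug comes back.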

88/100

✦ Before the winner: four notes

🪶

Did the most with the least

Fable 5

+537 lines across 3 files, entirely additive: about half of the runner-up's diff, touching no production code. Smallest footprint, top score.

🎛️

Most faithful live path

Composer 2.5

Its parity tool reuses the actual live check-script helpers in-process, rather than modeling them — the most accurate "live" side of the field. The win was two points away.

🫥

The silent no-op

GPT-5.5

Shipped a new regime-classifier flag that's documented as working, passes its tests, and does nothing on its own. Green checkmark over a hollow core — the recurring lower-tier failure mode.

🐛

Only one came back with bugs

Fable 5

Two real, cited parity bugs: the backtester silently drops direction and invert_signal — and a related error message tells you to use exactly those dropped settings — plus a class of legacy config keys that no-op in backtest but work live.
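Bugs of this class, where one path honors a setting and the other silently drops it, can often be surfaced by comparing which config keys each path actually consumes. The sketch below assumes each engine can declare its supported keys. The key sets are hypothetical apart from `direction` and `invert_signal`, which are the two settings the audit names.

```python
# Hypothetical key sets; only `direction` and `invert_signal` come from the audit.
BACKTEST_KEYS = {"symbol", "timeframe", "threshold"}
LIVE_KEYS = BACKTEST_KEYS | {"direction", "invert_signal"}

def keys_dropped_by_backtest(config: dict) -> list:
    """Keys the live engine would honor but the backtester would ignore."""
    return sorted(k for k in config if k in LIVE_KEYS and k not in BACKTEST_KEYS)

def require_config_parity(config: dict) -> None:
    """Fail loudly instead of letting the two paths read different configs."""
    dropped = keys_dropped_by_backtest(config)
    if dropped:
        raise ValueError(f"backtest would silently ignore: {dropped}")

# Example config with made-up values.
config = {"symbol": "BTC-USD", "threshold": 1.5, "invert_signal": True}
print(keys_dropped_by_backtest(config))
```

Calling `require_config_parity(config)` on this example raises a `ValueError` naming `invert_signal`, which turns a quiet divergence into a startup failure.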

🏆

— First place —

Fable 5

PR #946 ↗ · cc/issue-906-backtest-audit

It built the same genuine two-path tool as the runner-up — verified clean on a normal strategy and lighting up on a drift-prone one — but did three things the others didn't. It wrote the least code of the field (+537 across 3 files) and touched no production code, so it can't break anything running. It came back with two real bugs, both cited to the line. And it was honest about its own limit: its "live" path is a faithful model of live behavior, not literally the live code — the one place the runner-up was better.

90/100

⚖️ Conflict of interest: read before trusting the margin. The judge is an Anthropic model (Opus 4.8). The winner is Anthropic too (Fable 5). The runner-up, two points back, is Cursor's, not Anthropic's. Each PR was scored by an independent, blind per-PR sub-reviewer, but the family bias is structural, and a two-point margin in favor of the judge's own family is exactly what you should squint at. What doesn't depend on taste: either the tool runs two independent paths or it runs one; either the two bugs are real or they aren't; either the new regression tests assert behavior or they assert that a file mentions a number. Check out the branches and verify. The two-point gap is soft; the chasm down to third is not.

The headline isn't the top-two order; it's that the whole contest rode on one sentence — "runs both decision paths" — and the model that won did the most with the least code, and was the only one to find real bugs.

✦ Summary

All three, at a glance.

Model	Vendor · harness	Score	Diff	Files	Verdict
Fable 5	Anthropic · Claude Code	90/100	+537 / −0	3	Genuine two-path tool · found 2 real parity bugs · zero production-code risk
Composer 2.5	Cursor · Cursor Agent	88/100	+1,062 / −0	6	Two-path tool, most faithful live path · new regression tests assert little
GPT-5.5	OpenAI · Codex	58/100	+897 / −6	8	Single-path tool (live side is BYO) · shipped a silent no-op regime flag

✦ Epilogue: what ships next

Still open.

As of scoring, issue #906 is still open — none of the three PRs has merged. But the precedent is three-for-three: every prior episode ended with a best-of-N synthesis built on the winner's branch, strongest parts of the field folded in. If the pattern holds, the foundation will be Fable 5's #946 — the genuine two-path engine and the real-bug findings — with Composer's more faithful live-helper reuse and fill extraction grafted on. Watch the issue for the close.

Issue #906 · watch for the synthesis ↗
