Two brains that have to agree.
In plain English: my bot has two brains. The backtester replays years of price history in seconds to tell me whether a strategy makes money. The live engine makes the real decisions, bar by bar, with real money. The entire premise of backtesting is that those two brains decide the same thing on the same data, yet in practice they quietly drift apart. The backtester sees the whole price history at once and can normalize a number against data that, live, wouldn't exist yet; the live engine only ever sees a trailing window ending at the last closed bar. Issue #906 asked for a full audit of the engine, but it named one deliverable as the prize: a parity tool that runs both brains over the same candles and prints a bar-by-bar diff of every decision. Build that, and you catch drift the moment it appears instead of after it has cost you money. That one sentence, "runs both decision paths," turned out to be the whole contest.
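To make the drift concrete, here is a minimal sketch of the normalization problem. Everything in it is an illustrative assumption rather than code from the bot: the function names, the z-score rule, the threshold, and the toy prices. It shows the same rule producing different decisions depending on whether it normalizes against the full series or against a trailing window.

```python
from statistics import mean, pstdev

def zscore_full(prices):
    """Backtest-style: normalize every bar against the whole series (peeks at the future)."""
    mu, sigma = mean(prices), pstdev(prices)
    return [(p - mu) / sigma for p in prices]

def zscore_trailing(prices, window):
    """Live-style: normalize each bar against only the trailing window ending at that bar."""
    out = []
    for i in range(len(prices)):
        seen = prices[max(0, i - window + 1): i + 1]
        sigma = pstdev(seen)
        out.append(0.0 if sigma == 0 else (prices[i] - mean(seen)) / sigma)
    return out

def signal(z, threshold=1.0):
    """Hypothetical rule: go long when the normalized value exceeds the threshold."""
    return [1 if v > threshold else 0 for v in z]

# Toy uptrend: early bars only look "low" with hindsight.
prices = [100, 101, 102, 103, 104, 110, 120, 130, 140, 150]
full = signal(zscore_full(prices))
live = signal(zscore_trailing(prices, window=5))
mismatches = [i for i, (a, b) in enumerate(zip(full, live)) if a != b]

print("full-history signals:   ", full)
print("trailing-window signals:", live)
print("bars where they disagree:", mismatches)
```

Neither function is buggy on its own terms. The disagreement comes purely from what each one is allowed to see.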
Three blind reviews.
One independent, blind reviewer per PR, each sealed off from the other two branches. Crucially, no reviewer stopped at reading the diff: each one ran the tool against real and synthetic price data. PR descriptions were treated as claims to falsify, not facts.
1. Read the issue. Enumerate the four acceptance criteria.
2. Check out the PR at HEAD. Verify every claim against the actual code, with file:line citations.
3. Run the tool, run the tests. Drive the parity tool on real data; run the new and surrounding test suites locally.
4. Score /100 against the criteria. Report the misses, cited.
"Runs both decision paths."
The tool's value rests entirely on running two genuinely independent paths: the vectorized, full-history backtest, and a trailing-window replay of how the bot actually decides live. The divergence between those two framings (full-series normalization, unseeded rolling state) is the silent mechanism behind most parity bugs, and the engine's only guard against look-ahead cannot catch it. A tool that diffs a backtest against itself, or that only emits a backtest trace and asks you to bring the "live" half, does not meet the criterion, no matter how clean the code is.
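"Unseeded rolling state" is the less obvious half of that divergence, so here is a small sketch of it. It assumes a plain exponential moving average seeded with the first value it receives; the helper names and prices are hypothetical and not taken from the engine. The full-history pass carries state from bar 0, while the trailing replay restarts from a fresh seed at the start of each window.

```python
def ema(values, span):
    """Exponential moving average seeded with the first value it is given."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def ema_full_history(prices, span):
    """Backtest-style: one pass over the whole series, seeded at bar 0."""
    return ema(prices, span)

def ema_trailing_replay(prices, span, window):
    """Live-style: at each bar, recompute from a fresh seed at the window start."""
    out = []
    for i in range(len(prices)):
        seen = prices[max(0, i - window + 1): i + 1]
        out.append(ema(seen, span)[-1])
    return out

prices = [100, 90, 95, 105, 110, 108, 112, 120, 118, 125]
full = ema_full_history(prices, span=5)
live = ema_trailing_replay(prices, span=5, window=4)

# Per-bar gap between the two framings of the same indicator.
gaps = [round(abs(a - b), 4) for a, b in zip(full, live)]
print(gaps)
```

While the window still reaches back to bar 0 the two values are identical. Once it no longer does, the replay has forgotten history that the full pass still carries, and the gap is nonzero on every bar of this series.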
| Model | Runs two real paths? | Catches frame-dependent drift (verified) |
|---|---|---|
| Fable 5 | Yes: vectorized + trailing-window | Clean strategy: 0 mismatches · drift-prone strategy: 42/201 |
| Composer 2.5 | Yes: reuses the real live helpers | Clean strategy: 0 mismatches · drift-prone strategy: 48/50 |
| GPT-5.5 | No: one path, live side is bring-your-own | Cannot: no independent live path |
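The shape of a two-path parity check can be sketched in a few lines. This is not any of the three PRs' tools: the `Mismatch` record, the function names, and both toy strategies are assumptions made for illustration. The point is only that the backtest path gets the whole series once, the live path gets a trailing window per bar, and the diff is taken bar by bar.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Mismatch:
    bar: int
    backtest: int
    live: int

def run_live_path(closes, decide_last, window):
    """Replay bar by bar, showing the strategy only the trailing window."""
    return [decide_last(closes[max(0, i - window + 1): i + 1])
            for i in range(len(closes))]

def parity_diff(closes, decide_all, decide_last, window):
    """Run both paths over the same candles and report every disagreement."""
    backtest = list(decide_all(closes))          # vectorized, full history
    live = run_live_path(closes, decide_last, window)
    return [Mismatch(i, b, l)
            for i, (b, l) in enumerate(zip(backtest, live)) if b != l]

# Frame-independent strategy: long when the close rose versus the prior bar.
def clean_all(closes):
    return [0] + [1 if closes[i] > closes[i - 1] else 0
                  for i in range(1, len(closes))]

def clean_last(seen):
    return 1 if len(seen) >= 2 and seen[-1] > seen[-2] else 0

# Drift-prone strategy: long when the close is above "the mean",
# where the mean depends on how much history the path can see.
def drifty_all(closes):
    mu = mean(closes)
    return [1 if c > mu else 0 for c in closes]

def drifty_last(seen):
    return 1 if seen[-1] > mean(seen) else 0

closes = [100, 101, 102, 103, 104, 110, 120, 130, 140, 150]
clean = parity_diff(closes, clean_all, clean_last, window=5)
drifty = parity_diff(closes, drifty_all, drifty_last, window=5)
print("clean strategy mismatches:", len(clean))
print("drift-prone mismatched bars:", [m.bar for m in drifty])
```

A tool built this way should stay silent on the frame-independent strategy and light up on the drift-prone one, which is the behavior the reviewers checked for.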
Least code. Most bugs found.
Diff size is never scored, but here the smallest diff finished first. The winner's diff was entirely additive (no production code touched, so zero regression risk), and it was the only run that came back holding real bugs.
| Model | Lines | Files | Real bugs found |
|---|---|---|---|
| Fable 5 #946 | +537 / −0 | 3 | 2 — cited & filed |
| GPT-5.5 #936 | +897 / −6 | 8 | 0 — shipped a broken flag |
| Composer 2.5 #935 | +1,062 / −0 | 6 | 0 |
But which one was actually right?
A clean build tells you nothing about whether the tool catches the drift it exists to catch. All three were scored against the issue's four acceptance criteria — every claim verified against the code at HEAD, the tool run on real data.
Counting down, from third place…
Did the most with the least
+537 lines across 3 files, entirely additive: about half of the runner-up's +1,062-line diff, touching no production code. Smallest footprint, top score.
Most faithful live path
Its parity tool reuses the actual live check-script helpers in-process, rather than modeling them — the most accurate "live" side of the field. The win was two points away.
The silent no-op
Shipped a new regime-classifier flag that's documented as working, passes its tests, and does nothing on its own. Green checkmark over a hollow core — the recurring lower-tier failure mode.
Only one came back with bugs
Two real, cited parity bugs. First, the backtester silently drops direction and invert_signal, and a related error message tells you to use exactly those dropped settings. Second, a class of legacy config keys no-op in backtest but work live.
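Bugs of that second kind, settings that one path honours and the other ignores, can be surfaced mechanically. The sketch below is a hypothetical check, not something from any of the PRs: the `ignored_keys` helper, the sets of honoured keys, and the `lookback` key are all assumptions, with only direction and invert_signal taken from the findings above.

```python
def ignored_keys(config, honoured_by_backtest, honoured_by_live):
    """Report config keys that are set but honoured by only one of the two paths."""
    report = {}
    for key in sorted(config):
        in_backtest = key in honoured_by_backtest
        in_live = key in honoured_by_live
        if in_backtest != in_live:
            report[key] = "live only" if in_live else "backtest only"
    return report

# Hypothetical strategy config and hypothetical per-path key sets.
config = {"direction": "short", "invert_signal": True, "lookback": 20}
backtest_keys = {"lookback"}
live_keys = {"lookback", "direction", "invert_signal"}

report = ignored_keys(config, backtest_keys, live_keys)
for key, where in report.items():
    print(f"{key}: honoured {where}")
```

A key that shows up in this report changes live decisions without changing the backtest, so any backtest result for that config is describing a different strategy than the one that trades.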
Fable 5
It built the same genuine two-path tool as the runner-up — verified clean on a normal strategy and lighting up on a drift-prone one — but did three things the others didn't. It wrote the least code of the field (+537 across 3 files) and touched no production code, so it can't break anything running. It came back with two real bugs, both cited to the line. And it was honest about its own limit: its "live" path is a faithful model of live behavior, not literally the live code — the one place the runner-up was better.
The headline isn't the top-two order; it's that the whole contest rode on one sentence — "runs both decision paths" — and the model that won did the most with the least code, and was the only one to find real bugs.
All three, at a glance.
Still open.
As of scoring, issue #906 is still open — none of the three PRs has merged. But the precedent is three-for-three: every prior episode ended with a best-of-N synthesis built on the winner's branch, strongest parts of the field folded in. If the pattern holds, the foundation will be Fable 5's #946 — the genuine two-path engine and the real-bug findings — with Composer's more faithful live-helper reuse and fill extraction grafted on. Watch the issue for the close.