Go-Trader · Issue #873 · Model Bake-off · Episode 2

Three models.
One feature.
One winner.

Composer 2.5, GPT 5.5, and Opus 4.8 each built the same risk-critical trading feature from one issue. We measured what it cost, how many review rounds it took to pass, and how correct the code actually was.

Scroll to the bottom to see who won.

i. The assignment

Add to a winner — without moving the net.

In plain English — a trading bot normally only ever shrinks a position until it's closed. This feature lets you add to a position that's working (called scale-in, or pyramiding). The catch: adding more must not disturb the safety net you set at the start. Your stop-loss distance and your profit targets stay frozen to the original entry — only the size of the protective orders grows to cover the bigger position. The issue even named the tempting wrong way to do it — recompute everything around the new average price — and said, in writing, don't.
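A minimal Go sketch of the rule above, assuming a long position with one stop-loss and one take-profit. The field name RiskAnchorPrice is borrowed from the winning PR described later in this write-up. Every other type, field, and function name is hypothetical and is not taken from the Go-Trader codebase.

```go
package main

import "errors"

// Position is a hypothetical long position. Only the name RiskAnchorPrice
// comes from the write-up; the rest is illustrative.
type Position struct {
	Qty             float64 // total size after all adds
	AvgEntryPrice   float64 // blended average, used for PnL
	RiskAnchorPrice float64 // first entry price, frozen for the life of the position
	StopDistance    float64 // stop-loss distance below the anchor
	TargetDistance  float64 // take-profit distance above the anchor
}

// Open starts a position. The anchor and the average both begin at the first fill.
func Open(price, qty, stopDist, targetDist float64) Position {
	return Position{
		Qty:             qty,
		AvgEntryPrice:   price,
		RiskAnchorPrice: price,
		StopDistance:    stopDist,
		TargetDistance:  targetDist,
	}
}

// StopPrice and TargetPrice read the frozen anchor, never the blended average.
func (p Position) StopPrice() float64   { return p.RiskAnchorPrice - p.StopDistance }
func (p Position) TargetPrice() float64 { return p.RiskAnchorPrice + p.TargetDistance }

// ProtectiveQty is the size the protective orders must cover.
func (p Position) ProtectiveQty() float64 { return p.Qty }

// ScaleIn adds to the position. It updates the size and the blended average
// and deliberately leaves RiskAnchorPrice and both distances untouched.
func (p Position) ScaleIn(fillPrice, addQty float64) (Position, error) {
	if fillPrice <= 0 || addQty <= 0 {
		return p, errors.New("scale-in needs a positive price and quantity")
	}
	total := p.Qty + addQty
	p.AvgEntryPrice = (p.AvgEntryPrice*p.Qty + fillPrice*addQty) / total
	p.Qty = total
	return p, nil
}
```

Opening 2 units at 100 with a stop distance of 5 and a target distance of 10, then adding 2 units at 110, moves the blended average to 105 and the protected size to 4, while the stop stays at 95 and the target at 110.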

iii. How they were judged

The LGTM loop.

Every PR ran the same gate: an automated Opus 4.8 reviewer that builds the code and runs the tests on every round — verifying claims against the tree, not just reading the diff.

01. Review. The action posts a verdict on the PR.

02. Not LGTM? Send it back to the model's harness to fix every finding.

03. Re-review. The action re-checks the updated PR.

04. Repeat until LGTM.
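The four steps above can be sketched as a small loop in Go. This is an illustration only: the Verdict, ReviewFunc, and FixFunc types are hypothetical stand-ins for the review action and the model's harness, and the maxRounds cap is an added assumption (the process as described simply repeats until LGTM).

```go
package main

import "errors"

// Verdict is a hypothetical review result.
type Verdict struct {
	LGTM     bool
	Findings []string
}

// ReviewFunc stands in for the automated reviewer; FixFunc stands in for
// sending the findings back to the model's harness.
type ReviewFunc func(pr int) Verdict
type FixFunc func(pr int, findings []string)

// ErrNoLGTM is returned when the assumed round cap is hit without approval.
var ErrNoLGTM = errors.New("no LGTM within the allowed rounds")

// RunLGTMLoop reviews once, then alternates fix and re-review until the
// reviewer approves. It returns how many re-reviews were needed.
func RunLGTMLoop(review ReviewFunc, fix FixFunc, pr, maxRounds int) (int, error) {
	reReviews := 0
	v := review(pr) // step 01: first review
	for !v.LGTM {
		if reReviews >= maxRounds {
			return reReviews, ErrNoLGTM
		}
		fix(pr, v.Findings) // step 02: fix every finding
		v = review(pr)      // step 03: re-review the updated PR
		reReviews++         // step 04: repeat until LGTM
	}
	return reReviews, nil
}
```

The returned count corresponds to the "re-reviews to LGTM" figure reported for each model: a PR approved on the first review returns 0, and one that needs three fix rounds returns 3.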

Then a second, stricter pass scored each PR against all 7 acceptance criteria, verifying every claim against the codebase at HEAD. A passing LGTM does not guarantee a perfect score — and in this round, the gate approved a PR with a risk-critical defect the deeper pass caught.
⚖️ A note on the judge. The scoring pass was run by Opus 4.8 — the same model that authored the winning PR (#877). The margin over second place is only 3 points; read the verdict with that conflict of interest in mind, and treat the top two as co-leaders.
iv. Time to first pass

How long it took.

Wall-clock time to produce the first pass. The spread is wide, roughly 6× between the fastest and the slowest, and the fastest model ended up with the lowest final score.

| Model | First-pass time |
| --- | --- |
| Composer 2.5 Fast | 7.5 min |
| Opus 4.8 | 33 min |
| GPT 5.5 | 47 min |
v. Token & quota cost

The token bill.

First-pass spend. Units differ by provider, so these aren't directly comparable — but the gap in magnitude is the story.

| Model | First-pass token / quota cost |
| --- | --- |
| Composer 2.5 Fast | ~2% of Cursor Pro monthly · 128k tokens · 64% of context |
| Opus 4.8 | ~340k tokens · 34% of 1M context |
| GPT 5.5 | >96% of Pro plan · compacted context once mid-task |
vi. Rounds & footprint

How many tries to pass?

Re-reviews needed before reaching LGTM — and how much code each touched. The cheapest, fastest model needed the most rounds.

| Model | Re-reviews to LGTM | First-pass verdict | Files | Lines added / removed |
| --- | --- | --- | --- | --- |
| Opus 4.8 #877 | 1 | 2 findings, then LGTM | 13 | +1495 / −35 |
| GPT 5.5 #876 | 1 | 2 blocking, then LGTM | 29 | +1626 / −188 |
| Composer 2.5 Fast #875 | 3 | Changes requested ×3 rounds | 12 | +1075 / −64 |
The reckoning

But which one was actually right?

Cost and review rounds tell you about effort. They don't tell you if the code is correct. We scored all three against the 7 acceptance criteria — verifying every claim against the code at HEAD.

Counting down, from third place…

— Third place —
Composer 2.5 Fast

Got the shape right (the opt-in flag, the blend math, the manual command, the bookkeeping) and did it in 7.5 minutes. But on the single most important rule it did the thing the issue forbade: it re-based the stop-loss and take-profit triggers around the new blended average, so adding to a position quietly shifts your safety net. That path had no test, and the PR described it as "frozen trigger prices", which is not what the code does. Most review rounds (3), lowest score.
62/100
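To make the defect concrete, here is a hedged Go sketch of the kind of check that the untested path lacked. It contrasts triggers anchored on the first entry with triggers re-based on the blended average. All names are hypothetical and the numbers are made up for illustration; none of this is code from PR #875.

```go
package main

// Triggers holds the stop-loss and take-profit prices for a long position.
type Triggers struct {
	Stop   float64
	Target float64
}

// BlendedAverage returns the size-weighted average entry after an add.
func BlendedAverage(qty, avg, addQty, fillPrice float64) float64 {
	return (avg*qty + fillPrice*addQty) / (qty + addQty)
}

// FrozenTriggers anchors both triggers on the first entry price, as the
// issue required.
func FrozenTriggers(firstEntry, stopDist, targetDist float64) Triggers {
	return Triggers{Stop: firstEntry - stopDist, Target: firstEntry + targetDist}
}

// RebasedTriggers recomputes both triggers around the blended average.
// This is the approach the issue explicitly ruled out.
func RebasedTriggers(blendedAvg, stopDist, targetDist float64) Triggers {
	return Triggers{Stop: blendedAvg - stopDist, Target: blendedAvg + targetDist}
}

// TriggersMoved reports whether an add changed either trigger price.
// A regression test would assert this is false after every scale-in.
func TriggersMoved(before, after Triggers) bool {
	return before.Stop != after.Stop || before.Target != after.Target
}
```

With a first entry of 2 units at 100, a stop distance of 5, a target distance of 10, and an add of 2 units at 110, the blended average becomes 105. The frozen triggers stay at 95 and 110, while the re-based ones shift to 100 and 115, so TriggersMoved flags the re-based result.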
— Second place —
GPT 5.5

Correct, complete, and honest: every geometry consumer correctly pinned to the frozen entry price, backed by assertion-rich tests for both the blend math and the protective-order resizing. The cost: it was the slowest (47 min) and most expensive by a mile, at over 96% of a plan with a mid-task context compaction. Deductions were minor: a redundant stop-loss replace per add and an uncapped manual-add path.
92/100
Before the winner — two notes

Fastest & cheapest

Composer 2.5 Fast

A full first draft in 7.5 min at ~2% of a monthly quota, roughly 4× faster than Opus 4.8 (33 min) and 6× faster than GPT 5.5 (47 min). Unbeatable for a fast scaffold, as long as a human hardens the risk-critical parts.

🤝

Photo finish

GPT 5.5

Landed a correct, fully-tested result 3 points behind the winner. Given the Opus-judges-Opus conflict of interest, this is effectively a tie at the top.

🏆
— First place —

Opus 4.8

PR #877 ↗ · cc/issue-873-scale-in

The only PR that honored the "freeze the entry" decision cleanly: a dedicated RiskAnchorPrice pins every trigger to the first entry while PnL and order sizing use the blended average — exactly as specified. It closed the live-fill consistency gap with a single source of truth, reused the existing protection machinery, and backed it all with meaningful tests. No acceptance-criteria misses.

95/100

The asterisk: the judge was Opus 4.8 — the same family that wrote this PR — and the margin over GPT 5.5 (92) is just three points. Treat them as co-leaders. The headline isn't the top-two order; it's that the cheapest, fastest model got the risk-critical detail wrong, and the slow, expensive ones got it right.

Summary

All three, at a glance.

| Vendor | Model | Score | First-pass time | Re-reviews | Token / quota cost |
| --- | --- | --- | --- | --- | --- |
| Anthropic | Opus 4.8 | 95/100 | 33 min | 1 | ~340k tokens · 34% of 1M ctx |
| OpenAI | GPT 5.5 | 92/100 | 47 min | 1 | >96% of Pro plan · compacted 1× |
| Cursor | Composer 2.5 Fast | 62/100 | 7.5 min | 3 | ~2% monthly · 128k tok · 64% ctx |
Epilogue — what shipped

The winner became the base.

The bake-off wasn't academic. Issue #873 shipped to main as PR #882, a best-of-three synthesis built on the winner's architecture, Opus 4.8's #877 (its RiskAnchorPrice freeze-the-entry design and its single-source-of-truth decision), with the best parts of GPT 5.5 and Composer folded in. The three contestant PRs stayed open; the winning design is what's now running.