AI Coding Bake-off Ep. 2: Opus 4.8 vs GPT 5.5 vs Composer 2.5 (Issue #873)

i. The assignment

Add to a winner — without moving the net.

In plain English — a trading bot normally only ever shrinks a position until it's closed. This feature lets you add to a position that's working (called scale-in, or pyramiding). The catch: adding more must not disturb the safety net you set at the start. Your stop-loss distance and your profit targets stay frozen to the original entry — only the size of the protective orders grows to cover the bigger position. The issue even named the tempting wrong way to do it — recompute everything around the new average price — and said, in writing, don't.
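A minimal sketch of that rule, assuming a long position. The names (`Position`, `anchor_price`, `stop_distance`, `target_offsets`) are hypothetical and not taken from the issue or any of the PRs. Each add moves the blended average, while the stop and target prices stay pinned to the first entry and only the protective quantity grows.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Position:
    """Long position whose risk geometry is frozen at the first entry (illustrative only)."""
    anchor_price: float     # first entry price; never changes after opening
    quantity: float         # current total size
    avg_price: float        # blended average entry; moves on every add
    stop_distance: float    # stop-loss distance chosen at entry
    target_offsets: List[float] = field(default_factory=list)  # take-profit offsets from the anchor

    def add(self, fill_price: float, fill_qty: float) -> None:
        """Scale in: blend the average price, leave the anchor untouched."""
        total = self.quantity + fill_qty
        self.avg_price = (self.avg_price * self.quantity + fill_price * fill_qty) / total
        self.quantity = total

    def stop_price(self) -> float:
        # Measured from the anchor, not from the blended average.
        return self.anchor_price - self.stop_distance

    def target_prices(self) -> List[float]:
        return [self.anchor_price + off for off in self.target_offsets]

    def protective_qty(self) -> float:
        # Protective orders must cover the whole, now larger, position.
        return self.quantity


pos = Position(anchor_price=100.0, quantity=1.0, avg_price=100.0,
               stop_distance=5.0, target_offsets=[10.0, 20.0])
pos.add(fill_price=110.0, fill_qty=1.0)
print(pos.avg_price, pos.stop_price(), pos.target_prices(), pos.protective_qty())
# prints: 105.0 95.0 [110.0, 120.0] 2.0
```

After the add, the average has moved from 100 to 105, but the stop is still at 95 and the targets are still at 110 and 120; only the quantity the protective orders must cover has doubled.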

Read the full issue #873 on GitHub ↗

ii. The contestants

Opus 4.8 (Anthropic), GPT 5.5 (OpenAI), and Composer 2.5 Fast (Cursor).

iii. How they were judged

The LGTM loop.

Every PR ran the same gate: an automated Opus 4.8 reviewer that builds the code and runs the tests on every round — verifying claims against the tree, not just reading the diff.

Review. The action posts a verdict on the PR.

Not LGTM? Send it back to the model's harness to fix every finding.

Re-review. The action re-checks the updated PR.

Repeat until LGTM.
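The loop above can be sketched as a small driver. This is an illustration only: `review` and `fix` are hypothetical callables standing in for the review action and the model's harness, and the `max_rounds` safety cap is an assumption the article does not mention.

```python
from typing import Callable, List


def lgtm_loop(review: Callable[[], List[str]],
              fix: Callable[[List[str]], None],
              max_rounds: int = 10) -> int:
    """Run review -> fix -> re-review until no findings remain.

    `review` returns the list of open findings (empty means LGTM);
    `fix` sends the findings back to the model's harness.
    Returns the number of re-reviews that were needed.
    """
    findings = review()              # first review
    re_reviews = 0
    while findings:
        if re_reviews >= max_rounds:  # assumed cap, not from the article
            raise RuntimeError("no LGTM within max_rounds")
        fix(findings)
        findings = review()          # re-review of the updated PR
        re_reviews += 1
    return re_reviews


# Simulated PR: two findings on the first pass, clean after one fix round.
queue = [["finding A", "finding B"], []]
fixed: List[List[str]] = []
rounds = lgtm_loop(lambda: queue.pop(0), fixed.append)
print(rounds)  # prints: 1
```

Counting re-reviews rather than total reviews matches the "Re-reviews to LGTM" column used later in the write-up.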

Then a second, stricter pass scored each PR against all 7 acceptance criteria, verifying every claim against the codebase at HEAD. A passing LGTM does not guarantee a perfect score — and in this round, the gate approved a PR with a risk-critical defect the deeper pass caught.

⚖️ A note on the judge. The scoring pass was run by Opus 4.8 — the same model that authored the winning PR (#877). The margin over second place is only 3 points; read the verdict with that conflict of interest in mind, and treat the top two as co-leaders.

iv. Time to first pass

How long it took.

Wall-clock time to produce the first pass. The spread is roughly 6× end to end (7.5 min to 47 min), and the fastest model ended up with the lowest final score.

Model	First-pass time
Composer 2.5 Fast	7.5 min
Opus 4.8	33 min
GPT 5.5	47 min
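For reference, the spread can be checked directly from the table. The snippet below only restates the three published times as ratios to the fastest run.

```python
# First-pass wall-clock times from the table above, in minutes.
first_pass_minutes = {"Composer 2.5 Fast": 7.5, "Opus 4.8": 33.0, "GPT 5.5": 47.0}

fastest = min(first_pass_minutes.values())
ratios = {model: round(t / fastest, 1) for model, t in first_pass_minutes.items()}
print(ratios)
# prints: {'Composer 2.5 Fast': 1.0, 'Opus 4.8': 4.4, 'GPT 5.5': 6.3}
```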

v. Token & quota cost

The token bill.

First-pass spend. Units differ by provider, so these aren't directly comparable — but the gap in magnitude is the story.

Model	First-pass token / quota cost
Composer 2.5 Fast	~2% of Cursor Pro monthly · 128k tokens · 64% of context
Opus 4.8	~340k tokens · 34% of 1M context
GPT 5.5	>96% of Pro plan · compacted context once mid-task

vi. Rounds & footprint

How many tries to pass?

Re-reviews needed before reaching LGTM — and how much code each touched. The cheapest, fastest model needed the most rounds.

Model	Re-reviews to LGTM	First-pass verdict	Files	Lines added / removed
Opus 4.8 #877	1	2 findings, then LGTM	13	+1495 / −35
GPT 5.5 #876	1	2 blocking, then LGTM	29	+1626 / −188
Composer 2.5 Fast #875	3	Changes requested ×3 rounds	12	+1075 / −64

The reckoning

But which one was actually right?

Cost and review rounds tell you about effort. They don't tell you if the code is correct. We scored all three against the 7 acceptance criteria — verifying every claim against the code at HEAD.

Counting down, from third place…

— Third place —

Composer 2.5 Fast

Got the shape right (the opt-in flag, the blend math, the manual command, the bookkeeping) and did it in 7.5 minutes. But on the single most important rule it did the thing the issue forbade: it re-based the stop-loss and take-profit triggers around the new blended average, so adding to a position quietly shifts your safety net. That path had no test, and the PR described it as "frozen trigger prices", which is not what the code does. It needed the most review rounds (3) and earned the lowest score.

62/100
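To make the third-place defect concrete, here is a sketch of the kind of regression check that would have caught it. It is an assumption-laden illustration for a long position: `blended_average` and `stop_after_add` are hypothetical helpers, not code from PR #875 or the repository.

```python
def blended_average(avg: float, qty: float, fill_price: float, fill_qty: float) -> float:
    """Quantity-weighted average entry price after an add."""
    return (avg * qty + fill_price * fill_qty) / (qty + fill_qty)


def stop_after_add(rebase: bool) -> float:
    """Stop-loss price after one scale-in, under either policy.

    rebase=False: stop stays measured from the original entry (what the issue asks for).
    rebase=True:  stop is recomputed around the blended average (what the issue forbids).
    """
    entry, stop_distance = 100.0, 5.0
    avg = blended_average(entry, 1.0, fill_price=110.0, fill_qty=1.0)
    reference = avg if rebase else entry
    return reference - stop_distance


original_stop = 100.0 - 5.0
drift_frozen = stop_after_add(rebase=False) - original_stop
drift_rebased = stop_after_add(rebase=True) - original_stop
print(drift_frozen, drift_rebased)  # prints: 0.0 5.0

# The regression check: a scale-in must not move the stop.
assert drift_frozen == 0.0
```

In this example the re-based stop moves from 95 to 100 after a single add, which is exactly the silent shift in the safety net the issue warned against.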

— Second place —

GPT 5.5

Correct, complete, and honest: every geometry consumer is pinned to the frozen entry price, backed by assertion-rich tests for both the blend math and the protective-order resizing. The cost: it was the slowest (47 min) and by far the most expensive, using over 96% of a Pro plan with one mid-task context compaction. Deductions were minor: a redundant stop-loss replace per add and an uncapped manual-add path.

92/100

✦ Before the winner: two notes

⚡

Fastest & cheapest

Composer 2.5 Fast

A full first draft in 7.5 min at ~2% of a monthly quota, roughly 4× to 6× faster than the other two. Unbeatable for a fast scaffold, as long as a human hardens the risk-critical parts.

🤝

Photo finish

GPT 5.5

Landed a correct, fully-tested result 3 points behind the winner. Given the Opus-judges-Opus conflict of interest, this is effectively a tie at the top.

🏆

— First place —

Opus 4.8

PR #877 ↗ · cc/issue-873-scale-in

The only PR that honored the "freeze the entry" decision cleanly: a dedicated RiskAnchorPrice pins every trigger to the first entry while PnL and order sizing use the blended average — exactly as specified. It closed the live-fill consistency gap with a single source of truth, reused the existing protection machinery, and backed it all with meaningful tests. No acceptance-criteria misses.

95/100
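The split described for the winning PR, with triggers measured from the anchor and PnL from the blended average, matters because the two prices answer different questions. The sketch below uses hypothetical fills and a hypothetical mark price; only the name `risk_anchor_price` echoes the `RiskAnchorPrice` mentioned above, and none of this is code from PR #877.

```python
# Hypothetical long position built from two fills: (price, quantity).
fills = [(100.0, 1.0), (110.0, 1.0)]
mark = 108.0  # hypothetical current market price

qty = sum(q for _, q in fills)
avg_price = sum(p * q for p, q in fills) / qty  # blended average entry
risk_anchor_price = fills[0][0]                 # first entry; triggers hang off this

pnl_per_fill = sum((mark - p) * q for p, q in fills)  # ground truth, fill by fill
pnl_blended = (mark - avg_price) * qty                # matches the ground truth
pnl_anchor = (mark - risk_anchor_price) * qty         # overstates the gain

print(pnl_per_fill, pnl_blended, pnl_anchor)  # prints: 6.0 6.0 16.0
```

Using the anchor for PnL would report 16 instead of 6 here, so the anchor is suited to trigger geometry only, while accounting needs the blended average.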

The asterisk: the judge was Opus 4.8, the same model that wrote this PR, and the margin over GPT 5.5 (92) is just three points. Treat them as co-leaders. The headline isn't the top-two order; it's that the cheapest, fastest model got the risk-critical detail wrong, and the slow, expensive ones got it right.

✦ Summary

All three, at a glance.

Model	Provider	Score	First-pass time	Re-reviews	Token / quota cost
Opus 4.8	Anthropic	95/100	33 min	1	~340k tokens · 34% of 1M ctx
GPT 5.5	OpenAI	92/100	47 min	1	>96% of Pro plan · compacted 1×
Composer 2.5 Fast	Cursor	62/100	7.5 min	3	~2% monthly · 128k tok · 64% ctx

✦ Epilogue: what shipped

The winner became the base.

The bake-off wasn't academic. Issue #873 shipped to main as PR #882 — a best-of-three synthesis built on the winner's architecture, Opus 4.8's #877 (its RiskAnchorPrice freeze-the-entry design and single-source decision), with the best parts of GPT 5.5 and Composer folded in. The three contestant PRs stayed open; the winning design is what's now running.

PR #882 · best-of-three synthesis ↗
