AI Coding Bake-off Ep. 1: GPT 5.5 vs Opus 4.8 vs Composer 2.5 (Issue #844)

i. The assignment

A trailing-stop that ratchets.

In plain English: this is a new auto-pilot rule for a trading bot. When a trade is winning, the bot keeps a “safety net” price below it: if the market falls back to that line, the trade closes and locks in the gains. This feature makes the net smarter. As the price climbs past milestones, the net follows it upward so more profit is protected, and at each milestone the bot can optionally cash out a slice of the position.
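To make the ratchet concrete, here is a minimal Python sketch of the idea for a long position. It is not the implementation from issue #844 or any of the PRs: the names `RatchetTier` and `RatchetingStop`, the use of absolute trigger prices, and the percentage-based trail are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RatchetTier:
    """One milestone. All fields are illustrative, not taken from issue #844."""
    trigger_price: float         # milestone price that activates this tier
    trail_pct: float             # stop trails the high-water mark by this fraction
    close_fraction: float = 0.0  # optional slice of the position to cash out


class RatchetingStop:
    """Trailing stop for a long position that tightens at each milestone."""

    def __init__(self, entry, initial_trail_pct, tiers):
        self.tiers = sorted(tiers, key=lambda t: t.trigger_price)
        self.trail_pct = initial_trail_pct
        self.high_water = entry
        self.stop = entry * (1 - initial_trail_pct)
        self.next_tier = 0

    def on_price(self, price):
        """Process one price tick.

        Returns (stopped_out, close_fractions): whether the stop was hit,
        and the partial-close fractions triggered by this tick.
        """
        if price <= self.stop:
            return True, []

        self.high_water = max(self.high_water, price)

        # Activate every milestone the high-water mark has now reached.
        closes = []
        while (self.next_tier < len(self.tiers)
               and self.high_water >= self.tiers[self.next_tier].trigger_price):
            tier = self.tiers[self.next_tier]
            self.trail_pct = tier.trail_pct
            if tier.close_fraction > 0:
                closes.append(tier.close_fraction)
            self.next_tier += 1

        # Ratchet: the stop may only move up, never back down.
        self.stop = max(self.stop, self.high_water * (1 - self.trail_pct))
        return False, closes
```

With an entry at 100, a 5% initial trail, and a hypothetical tier at 110 that tightens the trail to 3% and closes a quarter of the position, a tick at 110 lifts the stop to about 106.7 and returns `[0.25]`; a later dip to 108 leaves the stop where it is, and a fall to 106 stops the trade out.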

Read the full issue #844 on GitHub ↗

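ii. The contestants

Vendor	Model
Cursor	Composer 2.5
OpenAI	GPT 5.5
Anthropic	Opus 4.8 Max (ultrathink)
Anthropic	Opus 4.8 High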

iii. How they were judged

The LGTM loop.

Every PR ran the same gate: claude-code-action@v1 with Opus 4.8 on xhigh, reviewing locally — building, testing, verifying claims against the tree, not just reading the diff.

Review. The action posts a verdict on the PR.

Not LGTM? Send it back to the model's harness to fix every finding.

Re-review. The action re-checks the updated PR.

Repeat until LGTM.

Then a second, stricter pass scored each PR against all 11 acceptance criteria, verifying every claim against the codebase at HEAD. A passing LGTM does not guarantee a perfect score — the scoring pass goes deeper.
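The review gate described above (review, fix, re-review, repeat until LGTM) can be sketched as a small loop. This is only an illustration of the control flow: `review` and `fix` are hypothetical callables standing in for the review action and the model's harness, and `max_rounds` is an assumed safety limit that the article does not mention.

```python
def run_lgtm_loop(review, fix, pr, max_rounds=10):
    """Review a PR, send findings back for fixes, and re-review until LGTM.

    review(pr) returns a list of blocking findings; an empty list means LGTM.
    fix(pr, findings) returns the updated PR.
    Both callables are hypothetical stand-ins, not real APIs.

    Returns (final_pr, re_reviews), where re_reviews counts the re-checks
    that followed the first review.
    """
    findings = review(pr)
    re_reviews = 0
    while findings:
        if re_reviews >= max_rounds:
            raise RuntimeError(f"no LGTM after {max_rounds} re-reviews")
        pr = fix(pr, findings)   # send every finding back to the harness
        findings = review(pr)    # re-check the updated PR
        re_reviews += 1
    return pr, re_reviews
```

Under this counting, a PR that passes its first review has zero re-reviews, and a PR that needs one fix cycle has one.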

⚖️ A note on the judge. This scoring pass was run by Opus 4.8 (high effort) — which is also one of the contestants (PR #859). For what it's worth, it ranked its own submission 3rd, below two rivals — but read the verdict with that conflict of interest in mind.

iv. Time to complete

How long it took.

Wall-clock time, split into the first pass and the fix pass needed to reach LGTM; the last column is the total. The spread is roughly 10× end-to-end, from 5.5 min to 54.5 min.

Model	1st pass	2nd pass	Total time
Composer 2.5	4.5 min	1 min	5.5 min
GPT 5.5	14 min	8 min	22 min
Opus 4.8 high	38 min	4 min	42 min
Opus 4.8 max	54.5 min	—	54.5 min
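As a quick check of the arithmetic, the totals and the fastest-to-slowest spread can be recomputed from the table above. The figures are the article's; the dictionary layout is just a convenience, and the missing fix pass for Opus 4.8 max is treated as zero minutes.

```python
# Wall-clock minutes from the table above: (first pass, fix pass).
timings = {
    "Composer 2.5": (4.5, 1.0),
    "GPT 5.5": (14.0, 8.0),
    "Opus 4.8 high": (38.0, 4.0),
    "Opus 4.8 max": (54.5, 0.0),  # no fix pass was needed
}

# Total time per model is the sum of both passes.
totals = {model: first + fix for model, (first, fix) in timings.items()}

# Spread: slowest total divided by fastest total (54.5 / 5.5).
spread = max(totals.values()) / min(totals.values())

print(totals)
print(round(spread, 1))  # 9.9
```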

v. Token & quota cost

The token bill.

Total spend per PR — every pass combined, not just the first draft. Units differ by provider, so these aren't directly comparable.

Model	Total token / quota cost
Composer 2.5	~2% Cursor Pro plan Composer monthly usage · ~75% of 200k context
GPT 5.5	77% of Plus plan 5-hr window
Opus 4.8 high	32% of 1M ctx + ~245k subagent tokens
Opus 4.8 max	45% of 1M ctx + ~416k subagent tokens

vi. Rounds & footprint

How many tries to pass?

Re-reviews needed before reaching LGTM — and how much code each touched. One model cleared review on the first pass with zero blocking findings.

Model	Re-reviews to LGTM	First-pass verdict	Files	Net lines
Opus 4.8 max #858	0	LGTM + 4 optional	15	+1408 / −4
Composer 2.5 #856	1	3 blocking + 2 optional	15	+1121 / −6
GPT 5.5 #857	1	3 findings + 1 nit	19	+1544 / −11
Opus 4.8 high #859	1	2 findings + 1 optional	15	+1093 / −12

The reckoning

But which one was actually best?

Cost and review rounds tell you about effort. They don't tell you if the code is right. We scored all four against the 11 acceptance criteria — verifying every claim against the codebase.

Counting down, from fourth place…

— Fourth place —

Composer 2.5

Correct on the hard parts, and the only model whose first pass took just 4.5 minutes. But it shipped a 117-line backtester change with no regression test, and its _regime variant silently no-ops when regime detection is off, where every sibling rejects that config.

86/100
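The _regime finding is a fail-fast versus silent-no-op distinction. The sketch below shows the "reject that config" behavior in Python for illustration only: the project's real validation code is not shown in this article, and the function name, the stop-type strings, and the `regime_detection_enabled` flag are all hypothetical.

```python
def validate_stop_config(stop_type: str, regime_detection_enabled: bool) -> None:
    """Reject a regime-dependent stop type when regime detection is disabled.

    Hypothetical example: the names here are not taken from the repository.
    Raising at config time surfaces the mistake immediately, instead of
    letting the stop quietly do nothing at runtime.
    """
    if stop_type.endswith("_regime") and not regime_detection_enabled:
        raise ValueError(
            f"{stop_type!r} requires regime detection to be enabled"
        )
```

With this check, a config that pairs a `_regime` stop type with regime detection switched off fails at load time, which is the behavior the review attributes to the sibling variants.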

— Third place —

Opus 4.8 High

A clean, well-reused perps-only build with verified tests. It dropped manual from the required "perps + manual" scope, a defensible call, but its PR description claims manual support and manual.go edits that aren't in the diff. Excellent code, misleading writeup.

90/100

— Second place —

Opus 4.8 Max

Correct, idiomatic, thoroughly tested, and the only model to clear code review on the first pass. The single gap: it too rejects manual, a stated acceptance criterion. The difference from third? It disclosed the scope narrowing honestly, in its docs. So close to the top.

91/100

✦ Before the winner: two special mentions

🛡️

Most reliable generator

Opus 4.8 Max

The only PR to reach LGTM with zero fix cycles — no blocking findings on the first review. If you want it right the first time, this was it.

⚡

Best cost-to-quality

Composer 2.5

A correct, manual-supporting draft in 4.5 min at ~2% quota — an order of magnitude cheaper than the field. The value pick for a fast first draft to harden.

🏆

— First place —

GPT 5.5

PR #857 ↗ · codex/issue-844-trailing-ratchet

The only PR that was simultaneously feature-complete against the literal criteria (including manual support), fully tested with a verified-green backtester regression, and accurately described. Its only deductions were cosmetic.

93/100

The asterisk: it was also the most quota-hungry to produce — 77% of the 5-hour Plus window in a single attempt. Best result, steepest bill. If manual can wait for a follow-up, Opus 4.8 Max (91) is the co-best merge candidate and the more dependable one-shot.

✦ Summary

All four, at a glance.

Vendor	Model	Score	Total time	Re-reviews	Token / quota cost
Cursor	Composer 2.5	86/100	5.5 min	1	~2% monthly quota · ~75% of 200k ctx
OpenAI	GPT 5.5	93/100	22 min	1	77% of Plus plan 5-hr window
Anthropic	Opus 4.8 Max	91/100	54.5 min	0	45% of 1M ctx + ~416k subagent tokens
Anthropic	Opus 4.8 High	90/100	42 min	1	32% of 1M ctx + ~245k subagent tokens

✦ Epilogue: what shipped

The winner became the base.

The bake-off wasn't academic. Issue #844 shipped to main as PR #860, a best-of-4 synthesis built on the winner, GPT 5.5's #857, as its foundation, with the strongest parts of the other three folded in. The individual contestant PRs were closed; the winning implementation is what's now running.

PR #860 · best-of-4 synthesis ↗
