Go-Trader · Issue #844 · Model Bake-off · Episode 1

Three models.
One issue.
One winner.

Composer, GPT, and two flavors of Opus each built the same close-strategy evaluator from scratch. We measured what it cost, how many review rounds it took to pass, and how correct the result actually was.

Scroll to the bottom to see who won.

i. The assignment

A trailing-stop that ratchets.

In plain English — This is a new auto-pilot rule for a trading bot. When a trade is winning, the bot keeps a “safety net” price below it: if the market falls back to that line, the trade closes and locks in the gains. This feature makes the net smarter — as the price climbs past milestones, the net follows it upward so more profit is protected, and at each milestone the bot can optionally cash out a slice of the position.
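To make the ratchet concrete, here is a minimal sketch in Go of the behavior described above. It is not the code from any of the contestant PRs: the `Tier`, `Ratchet`, and `Action` types, their field names, and the percent-based tier semantics are all assumptions made for illustration, and it only handles a long position.

```go
package main

import "fmt"

// Tier is a hypothetical milestone: once profit reaches TriggerPct, the stop
// trails the high-water mark by TrailPct, and ClosePct of the position is
// optionally cashed out (0 means no partial close).
type Tier struct {
	TriggerPct float64 // profit from entry, in percent, that activates the tier
	TrailPct   float64 // distance of the stop below the high-water mark, in percent
	ClosePct   float64 // optional share of the position to close, in percent
}

// Ratchet tracks the stop for one long position. All names are illustrative.
type Ratchet struct {
	Entry     float64
	HighWater float64
	Stop      float64
	Tiers     []Tier // assumed sorted by ascending TriggerPct
	nextTier  int    // index of the first tier not yet reached
}

// Action is what the evaluator asks the bot to do after a price update.
type Action struct {
	CloseAll     bool
	PartialClose float64 // percent of the position to close, 0 if none
}

// NewRatchet starts with a fixed stop initialStopPct below the entry price.
func NewRatchet(entry, initialStopPct float64, tiers []Tier) *Ratchet {
	return &Ratchet{
		Entry:     entry,
		HighWater: entry,
		Stop:      entry * (1 - initialStopPct/100),
		Tiers:     tiers,
	}
}

// Update feeds one price into the ratchet and returns the resulting action.
func (r *Ratchet) Update(price float64) Action {
	// Price fell back to the safety net: close the whole trade.
	if price <= r.Stop {
		return Action{CloseAll: true}
	}
	if price > r.HighWater {
		r.HighWater = price
	}
	var act Action
	profitPct := (r.HighWater - r.Entry) / r.Entry * 100
	// Activate every milestone that has been crossed since the last update.
	for r.nextTier < len(r.Tiers) && profitPct >= r.Tiers[r.nextTier].TriggerPct {
		act.PartialClose += r.Tiers[r.nextTier].ClosePct
		r.nextTier++
	}
	// The stop only ever moves up; that is the ratchet.
	if r.nextTier > 0 {
		trail := r.Tiers[r.nextTier-1].TrailPct
		if candidate := r.HighWater * (1 - trail/100); candidate > r.Stop {
			r.Stop = candidate
		}
	}
	return act
}

func main() {
	tiers := []Tier{
		{TriggerPct: 10, TrailPct: 3, ClosePct: 25},
		{TriggerPct: 20, TrailPct: 2},
	}
	r := NewRatchet(100, 5, tiers)
	for _, price := range []float64{105, 112, 110, 125, 121} {
		act := r.Update(price)
		fmt.Printf("price=%.2f stop=%.2f partial=%.0f%% closeAll=%v\n",
			price, r.Stop, act.PartialClose, act.CloseAll)
	}
}
```

In this sketch the stop never moves down when the price dips between milestones, which is the property the plain-English description calls the net "following the price upward".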

iii. How they were judged

The LGTM loop.

Every PR ran the same gate: claude-code-action@v1 with Opus 4.8 on xhigh, reviewing locally — building, testing, verifying claims against the tree, not just reading the diff.

01. Review. The action posts a verdict on the PR.

02. Not LGTM? Send it back to the model's harness to fix every finding.

03. Re-review. The action re-checks the updated PR.

04. Repeat until LGTM.
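The four steps above amount to a simple control loop. The Go sketch below is only an illustration of that loop, not the actual workflow configuration: `Verdict`, `ReviewFunc`, `FixFunc`, and `RunUntilLGTM` are hypothetical names, and the `maxRounds` cap is an added assumption so the loop cannot run forever (the original process simply repeats until LGTM).

```go
package main

import (
	"errors"
	"fmt"
)

// Verdict is a hypothetical summary of one review pass.
type Verdict struct {
	LGTM     bool
	Findings []string
}

// ReviewFunc stands in for the review action; round 0 is the first review.
type ReviewFunc func(round int) Verdict

// FixFunc stands in for sending findings back to the model's harness.
type FixFunc func(findings []string) error

// RunUntilLGTM reviews, fixes, and re-reviews until the verdict is LGTM.
// It returns the number of re-reviews that were needed (0 means the first
// review passed). maxRounds is an assumed safety cap on re-reviews.
func RunUntilLGTM(review ReviewFunc, fix FixFunc, maxRounds int) (int, error) {
	for round := 0; ; round++ {
		v := review(round)
		if v.LGTM {
			return round, nil
		}
		if round >= maxRounds {
			return round, errors.New("no LGTM within the round limit")
		}
		if err := fix(v.Findings); err != nil {
			return round, err
		}
	}
}

func main() {
	// Simulated PR: fails the first review, passes the re-review.
	review := func(round int) Verdict {
		if round == 0 {
			return Verdict{Findings: []string{"missing regression test"}}
		}
		return Verdict{LGTM: true}
	}
	fix := func(findings []string) error {
		fmt.Println("fixing:", findings)
		return nil
	}
	n, err := RunUntilLGTM(review, fix, 5)
	fmt.Println("re-reviews:", n, "err:", err)
}
```

Under this counting, a PR that passes its first review reports zero re-reviews, which matches how the "Re-reviews to LGTM" column is used later in the write-up.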

Then a second, stricter pass scored each PR against all 11 acceptance criteria, verifying every claim against the codebase at HEAD. A passing LGTM does not guarantee a perfect score — the scoring pass goes deeper.
⚖️ A note on the judge. This scoring pass was run by Opus 4.8 (high effort) — which is also one of the contestants (PR #859). For what it's worth, it ranked its own submission 3rd, below two rivals — but read the verdict with that conflict of interest in mind.

iv. Time to complete

How long it took.

Wall-clock time, split into the first pass and the fix pass needed to reach LGTM, plus the total. The spread is roughly 10× end to end: 5.5 min for the fastest against 54.5 min for the slowest.

Model | 1st pass | 2nd pass | Total time
Composer 2.5 | 4.5 min | 1 min | 5.5 min
GPT 5.5 | 14 min | 8 min | 22 min
Opus 4.8 high | 38 min | 4 min | 42 min
Opus 4.8 max | 54.5 min | none | 54.5 min
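
As a quick check on the "roughly 10×" figure, the totals and the spread can be recomputed from the per-pass times in the table. This is a small illustrative sketch: the `Run` type and `Spread` helper are hypothetical names, and the numbers are copied from the table above.

```go
package main

import "fmt"

// Run holds one contestant's wall-clock minutes, copied from the table.
type Run struct {
	Model     string
	FirstPass float64
	FixPass   float64 // 0 when no fix pass was needed
}

// Total is the first pass plus the fix pass.
func (r Run) Total() float64 { return r.FirstPass + r.FixPass }

// Spread returns the slowest total divided by the fastest total.
// It returns 0 when there are no runs or the fastest total is 0.
func Spread(runs []Run) float64 {
	if len(runs) == 0 {
		return 0
	}
	lo, hi := runs[0].Total(), runs[0].Total()
	for _, r := range runs[1:] {
		t := r.Total()
		if t < lo {
			lo = t
		}
		if t > hi {
			hi = t
		}
	}
	if lo == 0 {
		return 0
	}
	return hi / lo
}

func main() {
	runs := []Run{
		{"Composer 2.5", 4.5, 1},
		{"GPT 5.5", 14, 8},
		{"Opus 4.8 high", 38, 4},
		{"Opus 4.8 max", 54.5, 0},
	}
	for _, r := range runs {
		fmt.Printf("%-14s %5.1f min\n", r.Model, r.Total())
	}
	fmt.Printf("spread: %.1fx\n", Spread(runs))
}
```

The ratio works out to 54.5 / 5.5, or about 9.9, which is where the rounded 10× comes from.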

v. Token & quota cost

The token bill.

Total spend per PR — every pass combined, not just the first draft. Units differ by provider, so these aren't directly comparable.

Model | Total token / quota cost
Composer 2.5 | ~2% of the Cursor Pro plan's monthly Composer usage · ~75% of 200k context
GPT 5.5 | 77% of the Plus plan's 5-hr window
Opus 4.8 high | 32% of 1M ctx + ~245k subagent tokens
Opus 4.8 max | 45% of 1M ctx + ~416k subagent tokens

vi. Rounds & footprint

How many tries to pass?

Re-reviews needed before reaching LGTM — and how much code each touched. One model cleared review on the first pass with zero blocking findings.

Model | Re-reviews to LGTM | First-pass verdict | Files | Net lines
Opus 4.8 max #858 | 0 | LGTM + 4 optional | 15 | +1408 / −4
Composer 2.5 #856 | 1 | 3 blocking + 2 optional | 15 | +1121 / −6
GPT 5.5 #857 | 1 | 3 findings + 1 nit | 19 | +1544 / −11
Opus 4.8 high #859 | 1 | 2 findings + 1 optional | 15 | +1093 / −12

The reckoning

But which one was actually best?

Cost and review rounds tell you about effort. They don't tell you if the code is right. We scored all four against the 11 acceptance criteria — verifying every claim against the codebase.

Counting down, from fourth place…

— Fourth place —

Composer 2.5

Correct on the hard parts, and the only model that did it in 4.5 minutes flat. But it shipped a 117-line backtester change with no regression test, and its _regime variant silently no-ops when regime detection is off, where every sibling rejects that config.

86/100

— Third place —

Opus 4.8 High

A clean, well-reused perps-only build with verified tests. It dropped manual from the required "perps + manual" scope, a defensible call, but its PR description claims manual support and manual.go edits that aren't in the diff. Excellent code, misleading writeup.

90/100

— Second place —

Opus 4.8 Max

Correct, idiomatic, thoroughly tested, and the only model to clear code review on the first pass. The single gap: it too rejects manual, a stated acceptance criterion. The difference from third? It disclosed the scope narrowing honestly, in its docs. So close to the top.

91/100

Before the winner — two special mentions
🛡️

Most reliable generator

Opus 4.8 Max

The only PR to reach LGTM with zero fix cycles — no blocking findings on the first review. If you want it right the first time, this was it.

Best cost-to-quality

Composer 2.5

A correct, manual-supporting draft in 4.5 min at ~2% quota — an order of magnitude cheaper than the field. The value pick for a fast first draft to harden.

🏆
— First place —

GPT 5.5

PR #857 · codex/issue-844-trailing-ratchet

The only PR that was simultaneously feature-complete against the literal criteria (including manual support), fully tested with a verified-green backtester regression, and accurately described. Its only deductions were cosmetic.

93/100

The asterisk: it was also the most quota-hungry to produce — 77% of the 5-hour Plus window in a single attempt. Best result, steepest bill. If manual can wait for a follow-up, Opus 4.8 Max (91) is the co-best merge candidate and the more dependable one-shot.

Summary

All four, at a glance.

Vendor | Model | Score | Total time | Re-reviews | Token / quota cost
Cursor | Composer 2.5 | 86/100 | 5.5 min | 1 | ~2% monthly quota · ~75% of 200k ctx
OpenAI | GPT 5.5 | 93/100 | 22 min | 1 | 77% of Plus plan 5-hr window
Anthropic | Opus 4.8 Max | 91/100 | 54.5 min | 0 | 45% of 1M ctx + ~416k subagent tokens
Anthropic | Opus 4.8 High | 90/100 | 42 min | 1 | 32% of 1M ctx + ~245k subagent tokens

Epilogue — what shipped

The winner became the base.

The bake-off wasn't academic. Issue #844 shipped to main as PR #860, a best-of-4 synthesis that uses the winner, GPT 5.5's #857, as its foundation and folds in the strongest parts of the other three. The individual contestant PRs were closed; the winning implementation is what's now running.