Add to a winner — without moving the net.
In plain English — a trading bot normally only ever shrinks a position until it's closed. This feature lets you add to a position that's working (called scale-in, or pyramiding). The catch: adding more must not disturb the safety net you set at the start. Your stop-loss distance and your profit targets stay frozen to the original entry — only the size of the protective orders grows to cover the bigger position. The issue even named the tempting wrong way to do it — recompute everything around the new average price — and said, in writing, don't.
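A minimal sketch of the rule described above, assuming a simplified position model. The `Position` class, its field names, and the `naive_stop_price` helper are hypothetical illustrations, not the repository's actual code. It shows the blended average moving on a scale-in while the stop and targets stay pinned to the first entry.

```python
from dataclasses import dataclass


@dataclass
class Position:
    """Hypothetical position model; names are illustrative only."""
    side: int                   # +1 for long, -1 for short
    qty: float                  # current total size
    avg_entry: float            # blended average price (used for PnL and sizing)
    risk_anchor_price: float    # first entry price; never changes on scale-in
    stop_distance: float        # stop-loss distance set at the start
    target_distances: tuple     # profit-target distances set at the start

    def scale_in(self, fill_qty: float, fill_price: float) -> None:
        """Add to the position: only size and blended average change."""
        total = self.qty + fill_qty
        self.avg_entry = (self.avg_entry * self.qty + fill_price * fill_qty) / total
        self.qty = total
        # risk_anchor_price, stop_distance and target_distances are left untouched.

    def stop_price(self) -> float:
        """Stop trigger, measured from the frozen anchor."""
        return self.risk_anchor_price - self.side * self.stop_distance

    def target_prices(self) -> list:
        """Profit-target triggers, measured from the frozen anchor."""
        return [self.risk_anchor_price + self.side * d for d in self.target_distances]

    def protective_qty(self) -> float:
        """Protective orders grow to cover the whole, larger position."""
        return self.qty

    def naive_stop_price(self) -> float:
        """The tempting wrong way: re-center the stop on the new average."""
        return self.avg_entry - self.side * self.stop_distance


# Long 1 unit at 100 with a stop 5 away and targets 10 and 20 away.
pos = Position(side=1, qty=1.0, avg_entry=100.0, risk_anchor_price=100.0,
               stop_distance=5.0, target_distances=(10.0, 20.0))
pos.scale_in(fill_qty=1.0, fill_price=110.0)

print(pos.avg_entry)           # 105.0: blended average moved
print(pos.stop_price())        # 95.0: stop did not move
print(pos.target_prices())     # [110.0, 120.0]: targets did not move
print(pos.protective_qty())    # 2.0: protection now covers the bigger position
print(pos.naive_stop_price())  # 100.0: what recomputing around the average would give
```

In this toy example the wrong approach drags the stop from 95 up to 100, which is exactly the "moving the net" behavior the issue ruled out.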
The LGTM loop.
Every PR ran the same gate: an automated Opus 4.8 reviewer that builds the code and runs the tests on every round — verifying claims against the tree, not just reading the diff.
1. Review. The action posts a verdict on the PR.
2. Not LGTM? Send it back to the model's harness to fix every finding.
3. Re-review. The action re-checks the updated PR.
4. Repeat until LGTM.
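The loop above can be sketched as a small driver. This is an illustration only: `lgtm_loop`, its `review` and `fix` callables, and the `max_rounds` cap are hypothetical stand-ins for the review action and the model's harness, not the real tooling.

```python
from typing import Callable, List


def lgtm_loop(review: Callable[[], List[str]],
              fix: Callable[[List[str]], None],
              max_rounds: int = 10) -> int:
    """Run review -> fix -> re-review until there are no findings.

    `review` returns the list of findings (an empty list means LGTM).
    `fix` hands the findings back to the model's harness.
    Returns the number of re-reviews needed to reach LGTM.
    """
    findings = review()          # first-pass review
    re_reviews = 0
    while findings:              # not LGTM yet
        if re_reviews >= max_rounds:
            raise RuntimeError("no LGTM within max_rounds")
        fix(findings)            # send every finding back to be fixed
        findings = review()      # re-check the updated PR
        re_reviews += 1
    return re_reviews


# Simulated run: two findings on the first pass, then LGTM.
scripted_verdicts = iter([["finding A", "finding B"], []])
sent_back = []
rounds = lgtm_loop(lambda: next(scripted_verdicts), sent_back.append)
print(rounds)  # 1
```

Counting this way, "2 findings, then LGTM" is one re-review, which is the convention the results table further down appears to use.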
How long it took.
Wall-clock to produce the first pass. The spread is roughly 6× end-to-end — and it runs exactly opposite to the final scores.
| Model | First-pass time |
|---|---|
| Composer 2.5 Fast | 7.5 min |
| Opus 4.8 | 33 min |
| GPT 5.5 | 47 min |
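As a quick check of the spread, the ratios can be computed straight from the table. The dictionary below simply restates the figures above; the variable names are illustrative.

```python
# First-pass wall-clock times in minutes, copied from the table above.
first_pass_minutes = {
    "Composer 2.5 Fast": 7.5,
    "Opus 4.8": 33,
    "GPT 5.5": 47,
}

fastest = min(first_pass_minutes.values())

# How many times longer each model took than the fastest one.
slowdown = {model: round(minutes / fastest, 1)
            for model, minutes in first_pass_minutes.items()}

print(slowdown)  # {'Composer 2.5 Fast': 1.0, 'Opus 4.8': 4.4, 'GPT 5.5': 6.3}
```

The "roughly 6×" figure is the gap between the fastest and the slowest model; the gap to the middle one is closer to 4×.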
The token bill.
First-pass spend. Units differ by provider, so these aren't directly comparable — but the gap in magnitude is the story.
| Model | First-pass token / quota cost |
|---|---|
| Composer 2.5 Fast | ~2% of Cursor Pro monthly · 128k tokens · 64% of context |
| Opus 4.8 | ~340k tokens · 34% of 1M context |
| GPT 5.5 | >96% of Pro plan · compacted context once mid-task |
How many tries to pass?
Re-reviews needed before reaching LGTM — and how much code each touched. The cheapest, fastest model needed the most rounds.
| Model (PR) | Re-reviews to LGTM | First-pass verdict | Files | Lines added / removed |
|---|---|---|---|---|
| Opus 4.8 #877 | 1 | 2 findings, then LGTM | 13 | +1495 / −35 |
| GPT 5.5 #876 | 1 | 2 blocking, then LGTM | 29 | +1626 / −188 |
| Composer 2.5 Fast #875 | 3 | Changes requested ×3 rounds | 12 | +1075 / −64 |
But which one was actually right?
Cost and review rounds tell you about effort. They don't tell you if the code is correct. We scored all three against the 7 acceptance criteria — verifying every claim against the code at HEAD.
Counting down, from third place…
Composer 2.5 Fast: fastest & cheapest
A full first draft in 7.5 min at ~2% of a monthly quota, roughly 4× to 6× faster than the other two (33 and 47 min). Unbeatable for a fast scaffold, as long as a human hardens the risk-critical parts.
GPT 5.5: photo finish
Landed a correct, fully tested result 3 points behind the winner. Given the Opus-judges-Opus conflict of interest, this is effectively a tie at the top.
Opus 4.8
The only PR that honored the "freeze the entry" decision cleanly: a dedicated RiskAnchorPrice pins every trigger to the first entry while PnL and order sizing use the blended average — exactly as specified. It closed the live-fill consistency gap with a single source of truth, reused the existing protection machinery, and backed it all with meaningful tests. No acceptance-criteria misses.
The asterisk: the judge was Opus 4.8 — the same family that wrote this PR — and the margin over GPT 5.5 (92) is just three points. Treat them as co-leaders. The headline isn't the top-two order; it's that the cheapest, fastest model got the risk-critical detail wrong, and the slow, expensive ones got it right.
All three, at a glance.
The winner became the base.
The bake-off wasn't academic. Issue #873 shipped to main as PR #882: a best-of-three synthesis built on the architecture of the winner, Opus 4.8's #877 (its RiskAnchorPrice freeze-the-entry design and its single source of truth for live fills), with the best parts of GPT 5.5 and Composer folded in. The three contestant PRs stayed open; the winning design is what's now running.