Model Royale (retired)

what it was

Model Royale was a weekly tournament: a fixed lineup of coding models, a task each round, a hidden-test gate, blinded peer judging, and, in round two, an adversarial "Break" round where every model attacked every other model's sandbox. The arc was Build → Break → Fix, with one model eliminated each round.

why it's retired

Two rounds in, the tournament framing had taught us what it was going to teach us — and most of it was about the framing itself:

Elimination didn't survive contact with a flat round. Round 2 produced zero real breaches; every model tied. There was no honest basis for a cut, and manufacturing one with a tiebreak would have let the format dictate the methodology.
The tasks were too easy to separate models. Round 1's hidden tests passed for everyone; round 2's exploits landed on nobody. When a round can't produce a loser, it can't produce a ranking.
Peer judging is an opinion, not a verdict. Models judging each other is a closed, correlated loop — useful as colour, not as a score the rest of the result should hang on.
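
The first two points reduce to one check: a round can only justify an elimination if exactly one model sits at the bottom. The sketch below illustrates that check only; it is not the harness's actual code. The function name, the model names, and the scores are all hypothetical, and it assumes each model's round result can be reduced to a single number where higher is better.

```python
from typing import Optional


def lowest_scorer(scores: dict[str, float]) -> Optional[str]:
    """Return the single lowest-scoring model, or None if the bottom is tied.

    `scores` maps a model name to a per-round score (higher is better).
    Both the name and the scoring scheme are assumptions for illustration.
    """
    if not scores:
        return None
    low = min(scores.values())
    bottom = [name for name, score in scores.items() if score == low]
    # A cut is only defensible when exactly one model is at the bottom.
    return bottom[0] if len(bottom) == 1 else None


# Hypothetical data: a flat round where every model ties...
flat_round = {"model_a": 1.0, "model_b": 1.0, "model_c": 1.0}
# ...and a round where one model clearly falls behind.
split_round = {"model_a": 1.0, "model_b": 0.5, "model_c": 1.0}

print(lowest_scorer(flat_round))
print(lowest_scorer(split_round))
```

On the flat round the function returns None: there is nothing to eliminate, and any tiebreak layered on top would be a choice made by the format rather than by the data.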

None of that was wasted. The tournament validated the harness, the committed-artifact pipeline, and — in round 2 — the reference oracle that mechanically catches cheese exploits. It surfaced the real problems early, which is exactly what a first format is for.

what open-bench is now

A set of standalone benchmarks. Each one is a task, run across the model lineup, with the full receipts committed — diffs, transcripts, costs, test results. Objective where it can be, transparent always. No tournament, no elimination, no crowned champion; the artifacts are the product and the reader is the judge.

the rounds it produced

Both rounds live on as benchmark results:

sandbox · 2026-05-05 — round 1, "Build": implement a container sandbox.
break-sandbox · 2026-05-14 — round 2, "Break": adversarial exploit suites.
writeups — the retrospectives, including what round 2 actually showed.