Round 2 is the adversarial round. Each of the seven models writes a black-box
exploit suite against the round-1 sandbox spec — tests that try to escape a
container sandbox. Every suite then runs against every other model’s round-1
sandbox.py: 42 ordered pairs, objective scoring, no judges. A per-test pass
is an escape that landed.
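The 42 comes from pairing each of the seven suites with the six sandboxes it did not write. A minimal sketch of that pairing, assuming the model names from the ranking table below; `attack_pairs` is a hypothetical helper, not the harness's actual code:

```python
from itertools import permutations

# Model names as they appear in the ranking table; the pairing logic is an assumption.
MODELS = ["deepseek", "deepseek-flash", "glm", "kimi", "mimo", "minimax", "qwen"]


def attack_pairs(models):
    """Return every ordered (attacker, defender) pair with attacker != defender."""
    return list(permutations(models, 2))


pairs = attack_pairs(MODELS)
print(len(pairs))  # 7 * 6 = 42
```

The pairs are ordered because glm attacking kimi and kimi attacking glm are separate runs with separate outcomes.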
The result: nothing landed
Round ranking
Round 2 (Break) is scored objectively and is defense-weighted: models are ranked by breaches taken (lower is better), then by breaches landed. Models with identical records share a rank. This is a per-round ranking only; there is no elimination.
| impl | defender score (breaches taken) | attacker score (breaches landed) |
|---|---|---|
| 01 deepseek | 0 | 0 |
| 01 deepseek-flash | 0 | 0 |
| 01 glm | 0 | 0 |
| 01 kimi | 0 | 0 |
| 01 mimo | 0 | 0 |
| 01 minimax | 0 | 0 |
| 01 qwen | 0 | 0 |
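The ranking rule can be sketched as a two-key sort with shared ranks. This is an illustration only: `Record` and `rank` are hypothetical names, and two details are assumptions the write-up does not state, namely that more breaches landed ranks higher and that ties use dense ranking (1, 1, 2 rather than 1, 1, 3).

```python
from typing import NamedTuple


class Record(NamedTuple):
    impl: str
    breaches_taken: int   # defender score: lower is better
    breaches_landed: int  # attacker score: assumed higher is better


def sort_key(record):
    # Fewest breaches taken first, then most breaches landed.
    return (record.breaches_taken, -record.breaches_landed)


def rank(records):
    """Map each impl to its rank; identical records share a rank (dense ranking)."""
    ranks = {}
    current_rank = 0
    previous_key = None
    for record in sorted(records, key=sort_key):
        key = sort_key(record)
        if key != previous_key:
            current_rank += 1
            previous_key = key
        ranks[record.impl] = current_rank
    return ranks


names = ["deepseek", "deepseek-flash", "glm", "kimi", "mimo", "minimax", "qwen"]
round2 = [Record(name, 0, 0) for name in names]
print(rank(round2))  # every model maps to rank 1
```

With all-zero records every sort key is identical, so the function never advances past rank 1, which is the seven-way tie shown in the table.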
Zero real breaches. Every one of the seven round-1 sandboxes held against every honest exploit thrown at it. The attack matrix is all zeros, and every model is tied.
That’s a flat result, and it’s worth being plain about: round 2 did not separate the lineup. No one is eliminated — an inconclusive round doesn’t get to manufacture a verdict. The interesting part isn’t a ranking; it’s why it came out flat, and what it took to see that clearly.
The catch: most of the “escapes” were cheese
The raw matrix did not start at zero. Before filtering, six of the seven models landed escapes against every target they attacked: glm against all six, deepseek against all six, and so on. Read naively, glm “won” the round.
It didn’t. Those weren’t escapes.
Every exploit suite is also run against a reference sandbox — a known-correct implementation transcribed straight from the round-1 spec. An exploit that “escapes” the reference cannot be demonstrating a real vulnerability: the reference has no vulnerability. It’s a test that passes unconditionally — cheese, or a real attempt that asserts on the wrong thing.
13 exploits across 6 models escaped the reference. Each one also escaped every
real target, by the same count — the signature of a test that passes no matter
what it’s pointed at. The oracle excluded all 13. deepseek-flash was the only
model whose suite was completely clean.
Without that control pass, this round’s numbers would have been pure noise — glm alone would have shown two dozen phantom “breaches” off four cheese exploits. This was the first time the reference oracle ran against a real lineup, and it’s the run that proves it isn’t optional.
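The control pass amounts to a filter: any exploit that passes against the reference is dropped before breaches are counted. A minimal sketch under assumed data shapes; `split_cheese`, the `REFERENCE` label, and the `exploit_N` ids are all hypothetical, and the toy data only mimics the glm case of four exploits that pass everywhere:

```python
REFERENCE = "reference"  # hypothetical label for the known-correct sandbox


def split_cheese(results):
    """Split raw results into real breaches and oracle-excluded exploits.

    `results` maps (attacker, exploit_id, target) -> True when the exploit's
    test passed against that target, i.e. an escape was reported.
    """
    # An exploit that "escapes" the reference is excluded everywhere.
    cheese = {
        (attacker, exploit)
        for (attacker, exploit, target), escaped in results.items()
        if target == REFERENCE and escaped
    }
    breaches = [
        (attacker, exploit, target)
        for (attacker, exploit, target), escaped in results.items()
        if escaped and target != REFERENCE and (attacker, exploit) not in cheese
    ]
    return breaches, cheese


# Toy data: four glm exploits that pass against all six targets and the reference.
targets = ["deepseek", "deepseek-flash", "kimi", "mimo", "minimax", "qwen"]
raw = {
    ("glm", f"exploit_{i}", target): True
    for i in range(4)
    for target in targets + [REFERENCE]
}

phantom = sum(1 for (_, _, target), escaped in raw.items() if escaped and target != REFERENCE)
breaches, cheese = split_cheese(raw)
print(phantom, len(breaches), len(cheese))  # 24 0 4
```

Counted naively, the four exploits produce 4 × 6 = 24 reported escapes; after the oracle excludes them, none remain.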
What it actually tells us
A flat round is data, not a failure. Three honest readings, and we can’t yet tell them apart:
- the round-1 sandboxes are genuinely solid;
- the models didn’t write strong black-box attacks (the cheese rate suggests some of them barely tried);
- “every model attacks every other” may just be the wrong structure for producing signal.
What’s not in doubt: the harness did its job. It ran objectively, it caught every piece of cheese mechanically, and it reported a flat result honestly instead of inventing a winner to keep a bracket moving. The next round of work is on the round design — a harder, more discriminating substrate — not on the scoring.