Round 2 (Break): an inconclusive round, and an oracle that earned its keep

Round 2 is the adversarial round. Each of the seven models writes a black-box exploit suite against the round-1 sandbox spec — tests that try to escape a container sandbox. Every suite then runs against every other model’s round-1 sandbox.py: 42 ordered pairs, objective scoring, no judges. A per-test pass is an escape that landed.

The result: nothing landed

round ranking

round 2 (Break) — objective. defense-weighted: ranked by breaches taken (lower better), then breaches landed. models with identical records share a rank. a per-round ranking only — no elimination.

impl	defender score	attacker score
01 deepseek	0	0
01 deepseek-flash	0	0
01 glm	0	0
01 kimi	0	0
01 mimo	0	0
01 minimax	0	0
01 qwen	0	0

Zero real breaches. Every one of the seven round-1 sandboxes held against every honest exploit thrown at it. The attack matrix is all zeros, and every model is tied.

That’s a flat result, and it’s worth being plain about: round 2 did not separate the lineup. No one is eliminated — an inconclusive round doesn’t get to manufacture a verdict. The interesting part isn’t a ranking; it’s why it came out flat, and what it took to see that clearly.

The catch: most of the “escapes” were cheese

The raw matrix did not start at zero. Before filtering, six of the seven models landed escapes against everything — glm against all six targets, deepseek against all six, and so on. Read naively, glm “won” the round.

It didn’t. Those weren’t escapes.

Every exploit suite is also run against a reference sandbox — a known-correct implementation transcribed straight from the round-1 spec. An exploit that “escapes” the reference cannot be demonstrating a real vulnerability: the reference has no vulnerability. It’s a test that passes unconditionally — cheese, or a real attempt that asserts on the wrong thing.

13 exploits across 6 models escaped the reference. Each one also escaped every real target, by the same count — the signature of a test that passes no matter what it’s pointed at. The oracle excluded all 13. deepseek-flash was the only model whose suite was completely clean.

Without that control pass, this round’s numbers would have been pure noise — glm alone would have shown two dozen phantom “breaches” off four cheese exploits. This was the first time the reference oracle ran against a real lineup, and it’s the run that proves it isn’t optional.

What it actually tells us

A flat round is data, not a failure. Three honest readings, and we can’t yet tell them apart:

the round-1 sandboxes are genuinely solid;
the models didn’t write strong black-box attacks (the cheese rate suggests some of them barely tried);
“every model attacks every other” may just be the wrong structure for producing signal.

What’s not in doubt: the harness did its job. It ran objectively, it caught every piece of cheese mechanically, and it reported a flat result honestly instead of inventing a winner to keep a bracket moving. The next round of work is on the round design — a harder, more discriminating substrate — not on the scoring.