break-sandbox · May 14, 2026
7 models, each attacking every other's round-1 sandbox. Every attack was contained — no sandbox was breached. $0.375 spent on outputs. Objective scoring — a per-test escape counts, no judges.
7 models, each attacking every other's round-1 sandbox. Every attack was contained — no sandbox was breached. $0.375 spent on outputs. Objective scoring — a per-test escape counts, no judges.
round 2 (Break) — objective. defense-weighted: ranked by breaches taken (lower better), then breaches landed. models with identical records share a rank. a per-round ranking only — no elimination.
| impl | defender score | attacker score |
|---|---|---|
| 01 deepseek deepseek-v4-pro | 0 | 0 |
| 01 deepseek-flash deepseek-v4-flash | 0 | 0 |
| 01 glm glm-5.1 | 0 | 0 |
| 01 kimi kimi-k2.6 | 0 | 0 |
| 01 mimo mimo-v2.5-pro | 0 | 0 |
| 01 minimax minimax-m2.5 | 0 | 0 |
| 01 qwen qwen3.6-plus | 0 | 0 |
cell = distinct attack classes breached (0–5). rows attack, columns defend. row sum = attacker score, column sum = defender score.
| atk \ def | deepseek | deepseek-flash | glm | kimi | mimo | minimax | qwen |
|---|---|---|---|---|---|---|---|
| deepseek | — | 0 | 0 | 0 | 0 | 0 | 0 |
| deepseek-flash | 0 | — | 0 | 0 | 0 | 0 | 0 |
| glm | 0 | 0 | — | 0 | 0 | 0 | 0 |
| kimi | 0 | 0 | 0 | — | 0 | 0 | 0 |
| mimo | 0 | 0 | 0 | 0 | — | 0 | 0 |
| minimax | 0 | 0 | 0 | 0 | 0 | — | 0 |
| qwen | 0 | 0 | 0 | 0 | 0 | 0 | — |
every exploit suite is also run against a known-correct reference sandbox. an exploit that “escapes” the reference can’t be a real escape — it’s cheese or mis-asserted, and is excluded from the scoring above.
| attacker | exploits run vs reference | excluded as bogus |
|---|---|---|
| deepseek deepseek-v4-pro | 10 | test_escape_fs__proc_root_readtest_escape_privesc__setuidtest_escape_resource__memory |
| deepseek-flash deepseek-v4-flash | 5 | clean |
| glm glm-5.1 | 10 | test_escape_fs__host_etc_readtest_escape_fs__host_shadow_readtest_escape_network__tcp_connecttest_escape_resource__memory_bomb |
| kimi kimi-k2.6 | 10 | test_escape_fs__proc_host_root |
| mimo mimo-v2.5-pro | 10 | test_escape_fs__read_host_etctest_escape_network__outbound_http |
| minimax minimax-m2.5 | 6 | test_escape_fs__host_etc_passwdtest_escape_fs__host_etc_shadow |
| qwen qwen3.6-plus | 7 | test_escape_fs__host_etc_read |
| deepseek deepseek-v4-pro | 1 | 6m38s | $0.071 | 94.7k | — | ✓ |
| deepseek-flash deepseek-v4-flash | 1 | 4m29s | $0.0096 | 135.1k | — | ✓ |
| glm glm-5.1 | 1 | 2m10s | $0.109 | 113.8k | — | ✓ |
| kimi kimi-k2.6 | 1 | 6m40s | $0.104 | 139.6k | — | ✓ |
| mimo mimo-v2.5-pro | 1 | 2m11s | $0.055 | 159.8k | — | ✓ |
| minimax minimax-m2.5 | 1 | 3m31s | $0.0075 | 62.3k | — | ✓ |
| qwen qwen3.6-plus | 1 | 1m5s | $0.019 | 66.4k | — | ✓ |