round May 5, 2026
glm takes the round with 27.5/30 — spec 9.5, quality 18.0. 7 models, $0.968 spent on outputs. Hidden tests: all passed.
total = peer-judged spec /10 + quality /20. Hidden tests gate the verdict.
| impl | total | spec | qual | build | tests | verdict |
|---|---|---|---|---|---|---|
| 01 glm glm-5.1 | 27.5 | 9.5 | 18.0 | pass | 9/9 | ship-with-cleanup |
| 02 deepseek-flash deepseek-v4-flash | 25.5 | 9.5 | 16.0 | pass | 9/9 | ship-with-cleanup |
| 03 deepseek deepseek-v4-pro | 24.0 | 10.0 | 14.0 | pass | 9/9 | ship-with-cleanup |
| 04 kimi kimi-k2.6 | 24.0 | 8.0 | 16.0 | pass | 9/9 | ship-with-cleanup |
| 05 mimo mimo-v2.5-pro | 23.0 | 7.0 | 16.0 | pass | 9/9 | rewrite |
| 06 qwen qwen3.6-plus | 23.0 | 8.0 | 15.0 | pass | 9/9 | ship-with-cleanup |
| 07 minimax minimax-m2.5 | 22.0 | 8.0 | 14.0 | pass | 9/9 | rewrite |
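
As a rough check of the scoring rule, the sketch below recomputes the deepseek row from the per-judge scores listed later in this report. It assumes that each component is the median of the six peer scores (the self score excluded) and that a missing score is simply skipped; the report states neither rule explicitly, and the helper name `peer_median` is hypothetical.

```python
from statistics import median

def peer_median(scores):
    """Median of peer scores, skipping missing entries (None). Assumed rule."""
    return median(s for s in scores if s is not None)

# Peer scores for the deepseek build, in judge order:
# deepseek-flash, glm, kimi, mimo, minimax, qwen.
# None marks the one quality cell that is missing in the per-judge table.
deepseek_spec = [10.0, 10.0, 10.0, 9.0, 10.0, 8.0]
deepseek_qual = [12.0, 13.0, 14.0, 15.0, None, 16.0]

spec = peer_median(deepseek_spec)
qual = peer_median(deepseek_qual)
total = spec + qual
print(spec, qual, total)  # 10.0 14.0 24.0
```

Under these assumptions the result matches the 24.0 / 10.0 / 14.0 shown for deepseek in the table.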

| impl | round | wall | cost | tokens | loc | ok |
|---|---|---|---|---|---|---|
| deepseek deepseek-v4-pro | 1 | 5m59s | $0.075 | 282.5k | 142 | ✓ |
| deepseek deepseek-v4-pro | 2 | 3m0s | $0.052 | 253.2k | 139 | ✓ |
| deepseek deepseek-v4-pro | 3 | 5m59s | $0.082 | 298.7k | 140 | ✓ |
| deepseek-flash deepseek-v4-flash | 1 | 6m19s | $0.011 | 448.3k | 117 | ✓ |
| deepseek-flash deepseek-v4-flash | 2 | 3m10s | $0.0079 | 402.3k | 114 | ✓ |
| deepseek-flash deepseek-v4-flash | 3 | 6m19s | $0.012 | 472.8k | 115 | ✓ |
| glm glm-5.1 | 1 | 3m32s | $0.131 | 198.6k | 111 | ✓ |
| glm glm-5.1 | 2 | 1m46s | $0.092 | 177.3k | 108 | ✓ |
| glm glm-5.1 | 3 | 3m32s | $0.145 | 209.4k | 109 | ✓ |
| kimi kimi-k2.6 | 1 | 3m12s | $0.020 | 157.7k | 98 | ✓ |
| kimi kimi-k2.6 | 2 | 1m36s | $0.014 | 137.8k | 95 | ✓ |
| kimi kimi-k2.6 | 3 | 3m12s | $0.022 | 161.1k | 96 | ✓ |
| mimo mimo-v2.5-pro | 1 | 1m23s | $0.065 | 198.4k | 89 | ✓ |
| mimo mimo-v2.5-pro | 2 | 42s | $0.045 | 180.5k | 86 | ✓ |
| mimo mimo-v2.5-pro | 3 | 1m23s | $0.071 | 209.5k | 87 | ✓ |
| minimax minimax-m2.5 | 1 | 28s | $0.012 | 183.6k | 102 | ✓ |
| minimax minimax-m2.5 | 2 | 14s | $0.0084 | 165.9k | 99 | ✓ |
| minimax minimax-m2.5 | 3 | 28s | $0.013 | 196.1k | 100 | ✓ |
| qwen qwen3.6-plus | 1 | 1m26s | $0.032 | 213.2k | 105 | ✓ |
| qwen qwen3.6-plus | 2 | 43s | $0.022 | 188.3k | 102 | ✓ |
| qwen qwen3.6-plus | 3 | 1m26s | $0.035 | 223.9k | 103 | ✓ |
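
The headline spend figure can be approximated by summing the per-round costs in the table above. This is a minimal sketch that assumes the second column is the round number and that the total covers all three rounds of all seven models; the displayed costs are already rounded, so the sum is only expected to agree with $0.968 to within rounding.

```python
# Per-round output cost in USD, as displayed (rounded) in the per-round table.
costs = {
    "deepseek":       [0.075, 0.052, 0.082],
    "deepseek-flash": [0.011, 0.0079, 0.012],
    "glm":            [0.131, 0.092, 0.145],
    "kimi":           [0.020, 0.014, 0.022],
    "mimo":           [0.065, 0.045, 0.071],
    "minimax":        [0.012, 0.0084, 0.013],
    "qwen":           [0.032, 0.022, 0.035],
}

# Sum each model's rounds, then sum across models.
per_model = {name: sum(rounds) for name, rounds in costs.items()}
total = sum(per_model.values())
print(f"${total:.3f}")  # $0.967
```

The one-tenth-of-a-cent gap to the reported $0.968 is consistent with the table showing rounded values.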
Δ = self score − peer median. Positive (red) = the model overrated itself; negative (green) = it was humble.
| impl | self spec | peer med spec | Δ spec | self qual | peer med qual | Δ qual |
|---|---|---|---|---|---|---|
| deepseek deepseek-v4-pro | 10.0 | 10.0 | 0.0 | 16.0 | 14.0 | +2.0 |
| deepseek-flash deepseek-v4-flash | 10.0 | 9.5 | +0.5 | 16.0 | 16.0 | 0.0 |
| glm glm-5.1 | 10.0 | 9.5 | +0.5 | 16.0 | 18.0 | -2.0 |
| kimi kimi-k2.6 | 7.0 | 8.0 | -1.0 | 14.0 | 16.0 | -2.0 |
| mimo mimo-v2.5-pro | 8.0 | 7.0 | +1.0 | 17.0 | 16.0 | +1.0 |
| minimax minimax-m2.5 | 7.0 | 8.0 | -1.0 | 16.0 | 14.0 | +2.0 |
| qwen qwen3.6-plus | 7.0 | 8.0 | -1.0 | 16.0 | 15.0 | +1.0 |
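
The Δ columns can be reproduced from the per-judge tables further down. The sketch below does this for the glm build; `self_delta` is a hypothetical helper, and the peer lists are the six non-self rows of glm's per-judge table.

```python
from statistics import median

def self_delta(self_score, peer_scores):
    """Return (peer median, self score minus peer median)."""
    med = median(peer_scores)
    return med, self_score - med

# glm build: peer scores in judge order
# deepseek, deepseek-flash, kimi, mimo, minimax, qwen.
glm_peer_spec = [10.0, 10.0, 9.0, 10.0, 9.0, 7.0]
glm_peer_qual = [19.0, 19.0, 17.0, 19.0, 17.0, 17.0]

print(self_delta(10.0, glm_peer_spec))  # (9.5, 0.5)
print(self_delta(16.0, glm_peer_qual))  # (18.0, -2.0)
```

A positive second element means the self score sits above the peer median, a negative one below it.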
| judge | 1st | 2nd | 3rd |
|---|---|---|---|
| deepseek deepseek-v4-pro | deepseek (10) | deepseek-flash (10) | glm (10) |
| deepseek-flash deepseek-v4-flash | deepseek (10) | deepseek-flash (10) | glm (10) |
| glm glm-5.1 | deepseek (10) | glm (10) | deepseek-flash (9) |
| kimi kimi-k2.6 | deepseek (10) | glm (9) | minimax (8) |
| mimo mimo-v2.5-pro | deepseek-flash (10) | glm (10) | deepseek (9) |
| minimax minimax-m2.5 | deepseek (10) | deepseek-flash (10) | glm (9) |
| qwen qwen3.6-plus | deepseek-flash (9) | deepseek (8) | kimi (8) |
| impl | min | max | range | stdev | judges |
|---|---|---|---|---|---|
| deepseek deepseek-v4-pro | 8.0 | 10.0 | 2.0 | 0.79 | deepseek, deepseek-flash, glm, kimi, mimo, minimax, qwen |
| deepseek-flash deepseek-v4-flash | 7.0 | 10.0 | 3.0 | 1.11 | deepseek, deepseek-flash, glm, kimi, mimo, minimax, qwen |
| glm glm-5.1 | 7.0 | 10.0 | 3.0 | 1.11 | deepseek, deepseek-flash, glm, kimi, mimo, minimax, qwen |
| kimi kimi-k2.6 | 7.0 | 9.0 | 2.0 | 0.69 | deepseek, deepseek-flash, glm, kimi, mimo, minimax, qwen |
| mimo mimo-v2.5-pro | 5.0 | 8.0 | 3.0 | 1.00 | deepseek, deepseek-flash, glm, kimi, mimo, minimax, qwen |
| minimax minimax-m2.5 | 6.0 | 8.0 | 2.0 | 0.79 | deepseek, deepseek-flash, glm, kimi, mimo, minimax, qwen |
| qwen qwen3.6-plus | 7.0 | 10.0 | 3.0 | 1.07 | deepseek, deepseek-flash, glm, kimi, mimo, minimax, qwen |
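
The spread columns appear to describe the spec scores from all seven judges, self included, with `stdev` as the sample standard deviation. Both points are inferred from the numbers rather than stated in the report; the sketch below checks them against the deepseek row.

```python
from statistics import stdev

# Spec scores given to the deepseek build by all seven judges, in the order
# deepseek (self), deepseek-flash, glm, kimi, mimo, minimax, qwen.
scores = [10.0, 10.0, 10.0, 10.0, 9.0, 10.0, 8.0]

lo, hi = min(scores), max(scores)
spread = {
    "min": lo,
    "max": hi,
    "range": hi - lo,
    "stdev": round(stdev(scores), 2),  # sample standard deviation (n - 1)
}
print(spread)  # {'min': 8.0, 'max': 10.0, 'range': 2.0, 'stdev': 0.79}
```

Using the population formula (`statistics.pstdev`) instead gives 0.73, which does not match the table, so the sample form is the more likely one.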
builds/deepseek/rounds/sandbox-2026-05-05-r3
| judge | tier | build | spec | qual | verdict | note |
|---|---|---|---|---|---|---|
| deepseek | self | pass | 10.0 | 16.0 | ship-with-cleanup | Thorough timeout handling with cidfile-based container kill, but cidfile mech... |
| deepseek-flash | peer | pass | 10.0 | 12.0 | rewrite | Functionally correct but severely over-engineered with cidfile/tempfile mecha... |
| glm | peer | pass | 10.0 | 13.0 | ship-with-cleanup | Fully spec-compliant with correct character-based truncation, but over-engine... |
| kimi | peer | pass | 10.0 | 14.0 | ship-with-cleanup | The most spec-accurate of the bunch, slightly over-engineered around timeout ... |
| mimo | peer | pass | 9.0 | 15.0 | rewrite | Over-engineered cidfile cleanup for timeout handling; solid spec compliance b... |
| minimax | peer | pass | 10.0 | — | ship-with-cleanup | Excellent implementation with proper timeout handling and container cleanup v... |
| qwen | peer | pass | 8.0 | 16.0 | ship-with-cleanup | Robust CLI -- handling and good output format, but cidfile adds unnecessary c... |

test_exit_code_nonzero, test_network_bridge, test_network_default_isolated, test_no_host_shell_injection, test_output_format, test_simple_echo, test_timeout, test_truncation, test_workspace_mount

builds/deepseek-flash/rounds/sandbox-2026-05-05-r3
| judge | tier | build | spec | qual | verdict | note |
|---|---|---|---|---|---|---|
| deepseek | peer | pass | 10.0 | 16.0 | ship-with-cleanup | Solid spec compliance with container kill on timeout, but UUID naming and con... |
| deepseek-flash | self | pass | 10.0 | 16.0 | ship-with-cleanup | Fully spec-compliant but over-engineered with unnecessary --name/--kill conta... |
| glm | peer | pass | 9.0 | 14.0 | ship-with-cleanup | Solid implementation with correct output format and good timeout handling, bu... |
| kimi | peer | pass | 7.0 | 13.0 | ship-with-cleanup | Solid structure but repeats the common truncation and stdout-mutation mistake... |
| mimo | peer | pass | 10.0 | 18.0 | ship-with-cleanup | Clean, correct implementation with full spec compliance; truncation loop is s... |
| minimax | peer | pass | 10.0 | 17.0 | ship-with-cleanup | Excellent implementation. All spec requirements met with proper output format... |
| qwen | peer | pass | 9.0 | 16.0 | ship-with-cleanup | Best output format handling of the batch; unnecessary uuid/container-name add... |

test_exit_code_nonzero, test_network_bridge, test_network_default_isolated, test_no_host_shell_injection, test_output_format, test_simple_echo, test_timeout, test_truncation, test_workspace_mount

builds/glm/rounds/sandbox-2026-05-05-r3
| judge | tier | build | spec | qual | verdict | note |
|---|---|---|---|---|---|---|
| deepseek | peer | pass | 10.0 | 19.0 | ship-with-cleanup | Flawless spec compliance with careful format construction and character-aware... |
| deepseek-flash | peer | pass | 10.0 | 19.0 | ship-with-cleanup | Spec-compliant, clean, and well-balanced implementation with correct truncati... |
| glm | self | pass | 10.0 | 16.0 | ship-with-cleanup | Clean, spec-compliant implementation with correct Unicode truncation and fait... |
| kimi | peer | pass | 9.0 | 17.0 | ship-with-cleanup | Almost perfect: excellent truncation logic, clean structure, one small format... |
| mimo | peer | pass | 10.0 | 19.0 | ship-with-cleanup | Best overall implementation: full spec compliance, correct output format, saf... |
| minimax | peer | pass | 9.0 | 17.0 | ship-with-cleanup | Solid implementation with proper resource limits and correct output formattin... |
| qwen | peer | pass | 7.0 | 17.0 | ship-with-cleanup | Good output format handling and clean structure, but missing -- separator sup... |

test_exit_code_nonzero, test_network_bridge, test_network_default_isolated, test_no_host_shell_injection, test_output_format, test_simple_echo, test_timeout, test_truncation, test_workspace_mount

builds/kimi/rounds/sandbox-2026-05-05-r3
| judge | tier | build | spec | qual | verdict | note |
|---|---|---|---|---|---|---|
| deepseek | peer | pass | 9.0 | 16.0 | ship-with-cleanup | Solid argv handling and defaults, but stdout body formatting omits required n... |
| deepseek-flash | peer | pass | 8.0 | 17.0 | ship-with-cleanup | Solid implementation with good timeout behaviour but two format/truncation is... |
| glm | peer | pass | 8.0 | 14.0 | ship-with-cleanup | Compact with good timeout output capture, but missing newline before --- stde... |
| kimi | self | pass | 7.0 | 14.0 | ship-with-cleanup | Concise and readable, but output formatting and truncation both have the same... |
| mimo | peer | pass | 8.0 | 16.0 | rewrite | Minimal and clean but has output format bug on empty stdout and truncation th... |
| minimax | peer | pass | 9.0 | 17.0 | ship-with-cleanup | Solid implementation with proper resource limits and truncation. One minor ou... |
| qwen | peer | pass | 8.0 | 15.0 | ship-with-cleanup | Correct argv handling and resource caps, but output format breaks on non-newl... |

test_exit_code_nonzero, test_network_bridge, test_network_default_isolated, test_no_host_shell_injection, test_output_format, test_simple_echo, test_timeout, test_truncation, test_workspace_mount

builds/mimo/rounds/sandbox-2026-05-05-r3
| judge | tier | build | spec | qual | verdict | note |
|---|---|---|---|---|---|---|
| deepseek | peer | pass | 8.0 | 16.0 | ship-with-cleanup | Clean structure but two spec misses: format blank-line on empty stdout, and '... |
| deepseek-flash | peer | pass | 7.0 | 16.0 | rewrite | Core structure is OK but output format and truncation have real spec violatio... |
| glm | peer | pass | 7.0 | 14.0 | ship-with-cleanup | Compact but flawed: f-string output format produces extra newlines, truncatio... |
| kimi | peer | pass | 7.0 | 15.0 | ship-with-cleanup | Compact and readable, but output formatting is sloppy and truncation splits m... |
| mimo | self | pass | 8.0 | 17.0 | rewrite | Concise and readable but has format bugs (extra blank line on empty stdout/st... |
| minimax | peer | pass | 7.0 | 16.0 | rewrite | Good resource handling but output format adds unnecessary newlines, missing -... |
| qwen | peer | pass | 5.0 | 16.0 | rewrite | Minimal and readable but has the most spec violations: output format bug, no ... |

test_exit_code_nonzero, test_network_bridge, test_network_default_isolated, test_no_host_shell_injection, test_output_format, test_simple_echo, test_timeout, test_truncation, test_workspace_mount

builds/minimax/rounds/sandbox-2026-05-05-r3
| judge | tier | build | spec | qual | verdict | note |
|---|---|---|---|---|---|---|
| deepseek | peer | pass | 8.0 | 13.0 | ship-with-cleanup | Two fundamental bugs (output format and byte-vs-character truncation) plus a ... |
| deepseek-flash | peer | pass | 8.0 | 15.0 | rewrite | Clean structure but two spec violations: missing \n after stdout body and cha... |
| glm | peer | pass | 8.0 | 13.0 | ship-with-cleanup | Functional core with Popen-based timeout handling, but output format bug (mis... |
| kimi | peer | pass | 8.0 | 14.0 | ship-with-cleanup | Clean argv-based invocation, but output formatting and byte-level truncation ... |
| mimo | peer | pass | 7.0 | 14.0 | rewrite | Decent Popen-based approach with good timeout cleanup, but truncation is byte... |
| minimax | self | pass | 7.0 | 16.0 | rewrite | Good structure but missing -- separator handling, char-based truncation can e... |
| qwen | peer | pass | 6.0 | 15.0 | rewrite | Character-count truncation is a real bug; -- separator handling via argparse ... |

test_exit_code_nonzero, test_network_bridge, test_network_default_isolated, test_no_host_shell_injection, test_output_format, test_simple_echo, test_timeout, test_truncation, test_workspace_mount

builds/qwen/rounds/sandbox-2026-05-05-r3
| judge | tier | build | spec | qual | verdict | note |
|---|---|---|---|---|---|---|
| deepseek | peer | pass | 10.0 | 16.0 | ship-with-cleanup | Clean, correct implementation; truncation approach and stderr trailing-newlin... |
| deepseek-flash | peer | pass | 9.0 | 18.0 | ship-with-cleanup | Clean, well-structured implementation with correct param handling; truncation... |
| glm | peer | pass | 8.0 | 14.0 | ship-with-cleanup | Well-structured but discards timeout output and exceeds the 50KB cap; stderr ... |
| kimi | peer | pass | 7.0 | 13.0 | ship-with-cleanup | Well-factored but over-cleans stdout/stderr, and byte-wise truncation can cor... |
| mimo | peer | pass | 8.0 | 15.0 | rewrite | Solid structure with good abspath handling, but output format bugs on empty b... |
| minimax | peer | pass | 8.0 | — | rewrite | Good structure but output formatting uses rstrip losing data, truncation exce... |
| qwen | self | pass | 7.0 | 16.0 | ship-with-cleanup | Cleanest code structure of the batch, but stderr stripping and head-truncatio... |

test_exit_code_nonzero, test_network_bridge, test_network_default_isolated, test_no_host_shell_injection, test_output_format, test_simple_echo, test_timeout, test_truncation, test_workspace_mount

| impl | slug | loc | wall | tokens | cost | tests | $/test |
|---|---|---|---|---|---|---|---|
| deepseek-flash | deepseek-v4-flash | 115 | 6m18s | 472.8k | $0.010 | 9 | $0.0014 |
| minimax | minimax-m2.5 | 100 | 0m28s | 196.1k | $0.010 | 9 | $0.0015 |
| kimi | kimi-k2.6 | 96 | 3m11s | 161.1k | $0.020 | 9 | $0.0024 |
| qwen | qwen3.6-plus | 103 | 1m26s | 223.9k | $0.040 | 9 | $0.0039 |
| mimo | mimo-v2.5-pro | 87 | 1m23s | 209.5k | $0.070 | 9 | $0.0079 |
| deepseek | deepseek-v4-pro | 140 | 5m59s | 298.7k | $0.080 | 9 | $0.0091 |
| glm | glm-5.1 | 109 | 3m32s | 209.4k | $0.140 | 9 | $0.0161 |
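
The `$/test` column looks like the round-3 cost divided by the nine hidden tests, with rows sorted from cheapest to most expensive. That reading is an inference, not something the report states. The sketch below uses the three-decimal round-3 costs from the per-round table, because the `cost` column here is rounded to two decimals.

```python
TESTS = 9  # hidden tests passed by every build in this round

# Round-3 output cost in USD, taken from the per-round table.
cost_usd = {"mimo": 0.071, "deepseek": 0.082, "glm": 0.145}

# Cost per passed test, rounded to four decimals as in the table.
per_test = {name: round(cost / TESTS, 4) for name, cost in cost_usd.items()}

# Cheapest first, matching the table's ordering.
for name, value in sorted(per_test.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${value:.4f}")
# mimo: $0.0079
# deepseek: $0.0091
# glm: $0.0161
```

For the cheapest builds the displayed inputs are too coarse to reproduce the last digit exactly, so small differences from the table are expected there.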