Name: open-bench round 2026-05-05
Creator: open-bench
Published: 2026-05-05

scoreboard

total = spec /10 + quality /20, each the median of the peer judges' scores. Hidden tests gate the verdict.

impl	total	spec	qual	build	tests	verdict
01 glm glm-5.1	27.5	9.5	18.0	pass	9/9	ship-with-cleanup
02 deepseek-flash deepseek-v4-flash	25.5	9.5	16.0	pass	9/9	ship-with-cleanup
03 deepseek deepseek-v4-pro	24.0	10.0	14.0	pass	9/9	ship-with-cleanup
04 kimi kimi-k2.6	24.0	8.0	16.0	pass	9/9	ship-with-cleanup
05 mimo mimo-v2.5-pro	23.0	7.0	16.0	pass	9/9	rewrite
06 qwen qwen3.6-plus	23.0	8.0	15.0	pass	9/9	ship-with-cleanup
07 minimax minimax-m2.5	22.0	8.0	14.0	pass	9/9	rewrite
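
The totals above can be reproduced from the per-implementation detail tables. The sketch below assumes each published spec and qual figure is the median of the six peer scores (self-review excluded), which matches the glm glm-5.1 row; the function name `total_score` is hypothetical and not part of open-bench.

```python
from statistics import median

# Peer scores for glm glm-5.1, copied from its detail table
# (the self-review row is left out). Taking the median is an
# assumption that happens to reproduce the published row.
peer_spec = [10.0, 10.0, 9.0, 10.0, 9.0, 7.0]
peer_qual = [19.0, 19.0, 17.0, 19.0, 17.0, 17.0]

def total_score(spec_scores, qual_scores):
    """Return (total, spec, qual) using the median of each peer score list."""
    spec = median(spec_scores)
    qual = median(qual_scores)
    return spec + qual, spec, qual

total, spec, qual = total_score(peer_spec, peer_qual)
print(total, spec, qual)  # 27.5 9.5 18.0
```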

runs

impl	run	wall	cost	tokens	loc	ok
deepseek deepseek-v4-pro	1	5m59s	$0.075	282.5k	142	✓
deepseek deepseek-v4-pro	2	3m0s	$0.052	253.2k	139	✓
deepseek deepseek-v4-pro	3	5m59s	$0.082	298.7k	140	✓
deepseek-flash deepseek-v4-flash	1	6m19s	$0.011	448.3k	117	✓
deepseek-flash deepseek-v4-flash	2	3m10s	$0.0079	402.3k	114	✓
deepseek-flash deepseek-v4-flash	3	6m19s	$0.012	472.8k	115	✓
glm glm-5.1	1	3m32s	$0.131	198.6k	111	✓
glm glm-5.1	2	1m46s	$0.092	177.3k	108	✓
glm glm-5.1	3	3m32s	$0.145	209.4k	109	✓
kimi kimi-k2.6	1	3m12s	$0.020	157.7k	98	✓
kimi kimi-k2.6	2	1m36s	$0.014	137.8k	95	✓
kimi kimi-k2.6	3	3m12s	$0.022	161.1k	96	✓
mimo mimo-v2.5-pro	1	1m23s	$0.065	198.4k	89	✓
mimo mimo-v2.5-pro	2	42s	$0.045	180.5k	86	✓
mimo mimo-v2.5-pro	3	1m23s	$0.071	209.5k	87	✓
minimax minimax-m2.5	1	28s	$0.012	183.6k	102	✓
minimax minimax-m2.5	2	14s	$0.0084	165.9k	99	✓
minimax minimax-m2.5	3	28s	$0.013	196.1k	100	✓
qwen qwen3.6-plus	1	1m26s	$0.032	213.2k	105	✓
qwen qwen3.6-plus	2	43s	$0.022	188.3k	102	✓
qwen qwen3.6-plus	3	1m26s	$0.035	223.9k	103	✓
deepseek deepseek-v4-pro (n=3)	—	5m59s	$0.075	282.5k	140	✓
deepseek-flash deepseek-v4-flash (n=3)	—	6m19s	$0.011	448.3k	115	✓
glm glm-5.1 (n=3)	—	3m32s	$0.131	198.6k	109	✓
kimi kimi-k2.6 (n=3)	—	3m12s	$0.020	157.7k	96	✓
mimo mimo-v2.5-pro (n=3)	—	1m23s	$0.065	198.4k	87	✓
minimax minimax-m2.5 (n=3)	—	28s	$0.012	183.6k	100	✓
qwen qwen3.6-plus (n=3)	—	1m26s	$0.032	213.2k	103	✓
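
The (n=3) rows appear to be per-column medians of the three runs. The sketch below checks that reading against the deepseek-v4-pro runs; the helper names (`parse_wall`, `format_wall`, `summarise`) are hypothetical, and the median rule is an inference from the numbers rather than documented behaviour.

```python
import re
from statistics import median

def parse_wall(text):
    """Convert a wall-clock string such as '5m59s' or '42s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)

def format_wall(seconds):
    """Render seconds back into the table's 'XmYs' or 'Ys' style."""
    minutes, rest = divmod(int(seconds), 60)
    return f"{minutes}m{rest}s" if minutes else f"{rest}s"

# (wall, cost in $, tokens in thousands, loc) for the three deepseek-v4-pro runs
runs = [
    ("5m59s", 0.075, 282.5, 142),
    ("3m0s", 0.052, 253.2, 139),
    ("5m59s", 0.082, 298.7, 140),
]

def summarise(rows):
    """Take the median of every column (assumed aggregation rule)."""
    walls, costs, tokens, locs = zip(*rows)
    return (
        format_wall(median(parse_wall(w) for w in walls)),
        median(costs),
        median(tokens),
        median(locs),
    )

print(summarise(runs))  # ('5m59s', 0.075, 282.5, 140)
```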

self-bias check

Δ = self − peer median. Positive Δ: the model rated itself above its peers. Negative Δ: it rated itself below them.

impl	self spec	peer med	Δ spec	self qual	peer med	Δ qual
deepseek deepseek-v4-pro	10.0	10.0	0.0	16.0	14.0	+2.0
deepseek-flash deepseek-v4-flash	10.0	9.5	+0.5	16.0	16.0	0.0
glm glm-5.1	10.0	9.5	+0.5	16.0	18.0	-2.0
kimi kimi-k2.6	7.0	8.0	-1.0	14.0	16.0	-2.0
mimo mimo-v2.5-pro	8.0	7.0	+1.0	17.0	16.0	+1.0
minimax minimax-m2.5	7.0	8.0	-1.0	16.0	14.0	+2.0
qwen qwen3.6-plus	7.0	8.0	-1.0	16.0	15.0	+1.0
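
As a worked instance of the Δ formula, the sketch below recomputes the quality Δ for deepseek deepseek-v4-pro. It assumes a missing peer score (shown as a dash in the detail table) is simply skipped before taking the median; `self_bias` is a hypothetical helper name.

```python
from statistics import median

def self_bias(self_score, peer_scores):
    """Return self score minus the median of the peer scores that are present."""
    present = [score for score in peer_scores if score is not None]
    return self_score - median(present)

# Peer quality scores for deepseek deepseek-v4-pro; None marks the
# missing minimax score (dropping it is an assumption).
peer_qual = [12.0, 13.0, 14.0, 15.0, None, 16.0]
delta = self_bias(16.0, peer_qual)
print(f"{delta:+.1f}")  # +2.0
```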

per-judge ranking

judge	1st	2nd	3rd
deepseek deepseek-v4-pro	deepseek (10)	deepseek-flash (10)	glm (10)
deepseek-flash deepseek-v4-flash	deepseek (10)	deepseek-flash (10)	glm (10)
glm glm-5.1	deepseek (10)	glm (10)	deepseek-flash (9)
kimi kimi-k2.6	deepseek (10)	glm (9)	minimax (8)
mimo mimo-v2.5-pro	deepseek-flash (10)	glm (10)	deepseek (9)
minimax minimax-m2.5	deepseek (10)	deepseek-flash (10)	glm (9)
qwen qwen3.6-plus	deepseek-flash (9)	deepseek (8)	kimi (8)
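
Each row above lists a judge's three highest spec scores. A minimal sketch of that selection, using the spec scores judge kimi gave in the detail tables; the tie-break by name is an assumption, and `top_three` is a hypothetical helper.

```python
def top_three(scores):
    """Return the three highest-scored implementations as (name, score) pairs."""
    # Sort by score descending; ties fall back to name order (an assumption).
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:3]

# Spec scores awarded by judge kimi, one per implementation.
kimi_spec = {
    "deepseek": 10.0, "deepseek-flash": 7.0, "glm": 9.0, "kimi": 7.0,
    "mimo": 7.0, "minimax": 8.0, "qwen": 7.0,
}
print(top_three(kimi_spec))
# [('deepseek', 10.0), ('glm', 9.0), ('minimax', 8.0)]
```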

inter-judge agreement

impl	min	max	range	stdev	judges
deepseek deepseek-v4-pro	8.0	10.0	2.0	0.79	deepseek, deepseek-flash, glm, kimi, mimo, minimax, qwen
deepseek-flash deepseek-v4-flash	7.0	10.0	3.0	1.11	deepseek, deepseek-flash, glm, kimi, mimo, minimax, qwen
glm glm-5.1	7.0	10.0	3.0	1.11	deepseek, deepseek-flash, glm, kimi, mimo, minimax, qwen
kimi kimi-k2.6	7.0	9.0	2.0	0.69	deepseek, deepseek-flash, glm, kimi, mimo, minimax, qwen
mimo mimo-v2.5-pro	5.0	8.0	3.0	1.00	deepseek, deepseek-flash, glm, kimi, mimo, minimax, qwen
minimax minimax-m2.5	6.0	8.0	2.0	0.79	deepseek, deepseek-flash, glm, kimi, mimo, minimax, qwen
qwen qwen3.6-plus	7.0	10.0	3.0	1.07	deepseek, deepseek-flash, glm, kimi, mimo, minimax, qwen
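
The agreement figures match the spec scores of all seven judges (self-review included), with stdev as the sample standard deviation. The sketch below reproduces the deepseek deepseek-v4-pro row under that assumption; `agreement` is a hypothetical helper name.

```python
from statistics import stdev

def agreement(scores):
    """Return (min, max, range, sample stdev rounded to 2 places)."""
    low, high = min(scores), max(scores)
    return low, high, high - low, round(stdev(scores), 2)

# Spec scores for deepseek deepseek-v4-pro from all seven judges.
deepseek_spec = [10.0, 10.0, 10.0, 10.0, 9.0, 10.0, 8.0]
print(agreement(deepseek_spec))  # (8.0, 10.0, 2.0, 0.79)
```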

per-implementation detail

deepseek deepseek-v4-pro

builds/deepseek/rounds/sandbox-2026-05-05-r3

judge	tier	build	spec	qual	verdict	note
deepseek	self	pass	10.0	16.0	ship-with-cleanup	Thorough timeout handling with cidfile-based container kill, but cidfile mech...
deepseek-flash	peer	pass	10.0	12.0	rewrite	Functionally correct but severely over-engineered with cidfile/tempfile mecha...
glm	peer	pass	10.0	13.0	ship-with-cleanup	Fully spec-compliant with correct character-based truncation, but over-engine...
kimi	peer	pass	10.0	14.0	ship-with-cleanup	The most spec-accurate of the bunch, slightly over-engineered around timeout ...
mimo	peer	pass	9.0	15.0	rewrite	Over-engineered cidfile cleanup for timeout handling; solid spec compliance b...
minimax	peer	pass	10.0	—	ship-with-cleanup	Excellent implementation with proper timeout handling and container cleanup v...
qwen	peer	pass	8.0	16.0	ship-with-cleanup	Robust CLI -- handling and good output format, but cidfile adds unnecessary c...

hidden tests: 9/9 passed

✓ test_exit_code_nonzero
✓ test_network_bridge
✓ test_network_default_isolated
✓ test_no_host_shell_injection
✓ test_output_format
✓ test_simple_echo
✓ test_timeout
✓ test_truncation
✓ test_workspace_mount

deepseek-flash deepseek-v4-flash

builds/deepseek-flash/rounds/sandbox-2026-05-05-r3

judge	tier	build	spec	qual	verdict	note
deepseek	peer	pass	10.0	16.0	ship-with-cleanup	Solid spec compliance with container kill on timeout, but UUID naming and con...
deepseek-flash	self	pass	10.0	16.0	ship-with-cleanup	Fully spec-compliant but over-engineered with unnecessary --name/--kill conta...
glm	peer	pass	9.0	14.0	ship-with-cleanup	Solid implementation with correct output format and good timeout handling, bu...
kimi	peer	pass	7.0	13.0	ship-with-cleanup	Solid structure but repeats the common truncation and stdout-mutation mistake...
mimo	peer	pass	10.0	18.0	ship-with-cleanup	Clean, correct implementation with full spec compliance; truncation loop is s...
minimax	peer	pass	10.0	17.0	ship-with-cleanup	Excellent implementation. All spec requirements met with proper output format...
qwen	peer	pass	9.0	16.0	ship-with-cleanup	Best output format handling of the batch; unnecessary uuid/container-name add...

hidden tests: 9/9 passed

✓ test_exit_code_nonzero
✓ test_network_bridge
✓ test_network_default_isolated
✓ test_no_host_shell_injection
✓ test_output_format
✓ test_simple_echo
✓ test_timeout
✓ test_truncation
✓ test_workspace_mount

glm glm-5.1

builds/glm/rounds/sandbox-2026-05-05-r3

judge	tier	build	spec	qual	verdict	note
deepseek	peer	pass	10.0	19.0	ship-with-cleanup	Flawless spec compliance with careful format construction and character-aware...
deepseek-flash	peer	pass	10.0	19.0	ship-with-cleanup	Spec-compliant, clean, and well-balanced implementation with correct truncati...
glm	self	pass	10.0	16.0	ship-with-cleanup	Clean, spec-compliant implementation with correct Unicode truncation and fait...
kimi	peer	pass	9.0	17.0	ship-with-cleanup	Almost perfect: excellent truncation logic, clean structure, one small format...
mimo	peer	pass	10.0	19.0	ship-with-cleanup	Best overall implementation: full spec compliance, correct output format, saf...
minimax	peer	pass	9.0	17.0	ship-with-cleanup	Solid implementation with proper resource limits and correct output formattin...
qwen	peer	pass	7.0	17.0	ship-with-cleanup	Good output format handling and clean structure, but missing -- separator sup...

hidden tests: 9/9 passed

✓ test_exit_code_nonzero
✓ test_network_bridge
✓ test_network_default_isolated
✓ test_no_host_shell_injection
✓ test_output_format
✓ test_simple_echo
✓ test_timeout
✓ test_truncation
✓ test_workspace_mount

kimi kimi-k2.6

builds/kimi/rounds/sandbox-2026-05-05-r3

judge	tier	build	spec	qual	verdict	note
deepseek	peer	pass	9.0	16.0	ship-with-cleanup	Solid argv handling and defaults, but stdout body formatting omits required n...
deepseek-flash	peer	pass	8.0	17.0	ship-with-cleanup	Solid implementation with good timeout behaviour but two format/truncation is...
glm	peer	pass	8.0	14.0	ship-with-cleanup	Compact with good timeout output capture, but missing newline before --- stde...
kimi	self	pass	7.0	14.0	ship-with-cleanup	Concise and readable, but output formatting and truncation both have the same...
mimo	peer	pass	8.0	16.0	rewrite	Minimal and clean but has output format bug on empty stdout and truncation th...
minimax	peer	pass	9.0	17.0	ship-with-cleanup	Solid implementation with proper resource limits and truncation. One minor ou...
qwen	peer	pass	8.0	15.0	ship-with-cleanup	Correct argv handling and resource caps, but output format breaks on non-newl...

hidden tests: 9/9 passed

✓ test_exit_code_nonzero
✓ test_network_bridge
✓ test_network_default_isolated
✓ test_no_host_shell_injection
✓ test_output_format
✓ test_simple_echo
✓ test_timeout
✓ test_truncation
✓ test_workspace_mount

mimo mimo-v2.5-pro

builds/mimo/rounds/sandbox-2026-05-05-r3

judge	tier	build	spec	qual	verdict	note
deepseek	peer	pass	8.0	16.0	ship-with-cleanup	Clean structure but two spec misses: format blank-line on empty stdout, and '...
deepseek-flash	peer	pass	7.0	16.0	rewrite	Core structure is OK but output format and truncation have real spec violatio...
glm	peer	pass	7.0	14.0	ship-with-cleanup	Compact but flawed: f-string output format produces extra newlines, truncatio...
kimi	peer	pass	7.0	15.0	ship-with-cleanup	Compact and readable, but output formatting is sloppy and truncation splits m...
mimo	self	pass	8.0	17.0	rewrite	Concise and readable but has format bugs (extra blank line on empty stdout/st...
minimax	peer	pass	7.0	16.0	rewrite	Good resource handling but output format adds unnecessary newlines, missing -...
qwen	peer	pass	5.0	16.0	rewrite	Minimal and readable but has the most spec violations: output format bug, no ...

hidden tests: 9/9 passed

✓ test_exit_code_nonzero
✓ test_network_bridge
✓ test_network_default_isolated
✓ test_no_host_shell_injection
✓ test_output_format
✓ test_simple_echo
✓ test_timeout
✓ test_truncation
✓ test_workspace_mount

minimax minimax-m2.5

builds/minimax/rounds/sandbox-2026-05-05-r3

judge	tier	build	spec	qual	verdict	note
deepseek	peer	pass	8.0	13.0	ship-with-cleanup	Two fundamental bugs (output format and byte-vs-character truncation) plus a ...
deepseek-flash	peer	pass	8.0	15.0	rewrite	Clean structure but two spec violations: missing \n after stdout body and cha...
glm	peer	pass	8.0	13.0	ship-with-cleanup	Functional core with Popen-based timeout handling, but output format bug (mis...
kimi	peer	pass	8.0	14.0	ship-with-cleanup	Clean argv-based invocation, but output formatting and byte-level truncation ...
mimo	peer	pass	7.0	14.0	rewrite	Decent Popen-based approach with good timeout cleanup, but truncation is byte...
minimax	self	pass	7.0	16.0	rewrite	Good structure but missing -- separator handling, char-based truncation can e...
qwen	peer	pass	6.0	15.0	rewrite	Character-count truncation is a real bug; -- separator handling via argparse ...

hidden tests: 9/9 passed

✓ test_exit_code_nonzero
✓ test_network_bridge
✓ test_network_default_isolated
✓ test_no_host_shell_injection
✓ test_output_format
✓ test_simple_echo
✓ test_timeout
✓ test_truncation
✓ test_workspace_mount

qwen qwen3.6-plus

builds/qwen/rounds/sandbox-2026-05-05-r3

judge	tier	build	spec	qual	verdict	note
deepseek	peer	pass	10.0	16.0	ship-with-cleanup	Clean, correct implementation; truncation approach and stderr trailing-newlin...
deepseek-flash	peer	pass	9.0	18.0	ship-with-cleanup	Clean, well-structured implementation with correct param handling; truncation...
glm	peer	pass	8.0	14.0	ship-with-cleanup	Well-structured but discards timeout output and exceeds the 50KB cap; stderr ...
kimi	peer	pass	7.0	13.0	ship-with-cleanup	Well-factored but over-cleans stdout/stderr, and byte-wise truncation can cor...
mimo	peer	pass	8.0	15.0	rewrite	Solid structure with good abspath handling, but output format bugs on empty b...
minimax	peer	pass	8.0	—	rewrite	Good structure but output formatting uses rstrip losing data, truncation exce...
qwen	self	pass	7.0	16.0	ship-with-cleanup	Cleanest code structure of the batch, but stderr stripping and head-truncatio...

hidden tests: 9/9 passed

✓ test_exit_code_nonzero
✓ test_network_bridge
✓ test_network_default_isolated
✓ test_no_host_shell_injection
✓ test_output_format
✓ test_simple_echo
✓ test_timeout
✓ test_truncation
✓ test_workspace_mount

cost & efficiency

impl	slug	loc	wall	tokens	cost	tests	$/test
deepseek-flash	deepseek-v4-flash	115	6m18s	472.8k	$0.010	9	$0.0014
minimax	minimax-m2.5	100	0m28s	196.1k	$0.010	9	$0.0015
kimi	kimi-k2.6	96	3m11s	161.1k	$0.020	9	$0.0024
qwen	qwen3.6-plus	103	1m26s	223.9k	$0.040	9	$0.0039
mimo	mimo-v2.5-pro	87	1m23s	209.5k	$0.070	9	$0.0079
deepseek	deepseek-v4-pro	140	5m59s	298.7k	$0.080	9	$0.0091
glm	glm-5.1	109	3m32s	209.4k	$0.140	9	$0.0161
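
The $/test column is cost divided by hidden tests passed. The cost column is rounded to two decimals, so the sketch below uses the run-3 cost from the runs table ($0.145 for glm-5.1), which reproduces the published figure. That choice of input is an assumption, and `cost_per_test` is a hypothetical helper name.

```python
def cost_per_test(cost_usd, tests_passed):
    """Return dollars spent per passing hidden test."""
    if tests_passed <= 0:
        raise ValueError("no passing tests to divide by")
    return cost_usd / tests_passed

# glm glm-5.1: run-3 cost from the runs table, 9 hidden tests passed.
print(f"${cost_per_test(0.145, 9):.4f}")  # $0.0161
```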
