Name: open-bench round 2026-05-08
Creator: open-bench
Published: 2026-05-08

scoreboard

total = peer-median spec + peer-median quality. Hidden tests gate the verdict.

impl	total	spec	qual	build	tests	verdict
01 mimo mimo-v2.5-pro	29.0	10.0	19.0	pass	12/12	ship-with-cleanup
02 kimi kimi-k2.6	24.0	10.0	14.0	pass	12/12	ship-with-cleanup
03 minimax minimax-m2.5	24.0	9.0	15.0	pass	12/12	ship-with-cleanup
04 qwen qwen3.6-plus	24.0	9.0	15.0	pass	12/12	ship-with-cleanup
05 deepseek deepseek-v4-pro	23.0	9.0	14.0	pass	12/12	ship-with-cleanup
06 deepseek-flash deepseek-v4-flash	21.0	8.0	13.0	pass	12/12	ship-with-cleanup
07 glm glm-5.1	21.0	9.0	12.0	pass	12/12	ship-with-cleanup
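The totals above are consistent with summing the median peer spec score and the median peer quality score from the per-implementation tables. A minimal sketch of that check for two rows, assuming self scores and failed judges are excluded (the names `peer_scores` and `total` are illustrative, not part of open-bench):

```python
from statistics import median

# Peer scores copied from the per-implementation tables.
# Self rows and judges whose build failed are left out (assumption).
peer_scores = {
    "mimo": {
        "spec": [10.0, 10.0, 10.0, 10.0, 10.0],
        "qual": [19.0, 19.0, 16.0, 19.0, 18.0],
    },
    "kimi": {
        "spec": [10.0, 10.0, 9.0, 9.0, 10.0, 10.0],
        "qual": [13.0, 15.0, 12.0, 15.0, 15.0, 13.0],
    },
}


def total(scores):
    """Sum of the peer-median spec and peer-median quality scores."""
    return median(scores["spec"]) + median(scores["qual"])


for impl, scores in peer_scores.items():
    print(impl, total(scores))  # mimo 29.0, kimi 24.0
```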

runs


deepseek deepseek-v4-pro	1	3m29s	$0.051	98.4k	—	✓
deepseek-flash deepseek-v4-flash	1	2m5s	$0.0046	102.7k	—	✓
glm glm-5.1	1	1m39s	$0.058	72.7k	—	✓
kimi kimi-k2.6	1	5m46s	$0.080	115.7k	—	✓
mimo mimo-v2.5-pro	1	49s	$0.034	80.7k	—	✓
minimax minimax-m2.5	1	20s	$0.0099	99.8k	—	✓
qwen qwen3.6-plus	1	53s	$0.022	115.7k	—	✓

self-bias check

Δ = self score − peer median. Positive Δ means the model overrated itself; negative Δ means it was humble.

impl	self spec	peer med spec	Δ spec	self qual	peer med qual	Δ qual
deepseek deepseek-v4-pro	10.0	9.0	+1.0	14.0	14.0	0.0
deepseek-flash deepseek-v4-flash	9.0	8.0	+1.0	15.0	13.0	+2.0
glm glm-5.1	10.0	9.0	+1.0	12.0	12.0	0.0
kimi kimi-k2.6	—	10.0	—	—	14.0	—
mimo mimo-v2.5-pro	9.0	10.0	-1.0	17.0	19.0	-2.0
minimax minimax-m2.5	10.0	9.0	+1.0	16.0	15.0	+1.0
qwen qwen3.6-plus	9.0	9.0	0.0	16.0	15.0	+1.0
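The Δ columns can be reproduced from the per-implementation tables. The sketch below does this for two rows; the helper name `self_bias` and the choice to return `None` when a self score is missing (as for kimi) are assumptions, not open-bench code:

```python
from statistics import median


def self_bias(self_score, peer_scores):
    """Return self minus peer median, or None when no self score exists."""
    if self_score is None:
        return None
    return self_score - median(peer_scores)


# (self score, peer scores) copied from the per-implementation tables.
deepseek_spec = (10.0, [9.0, 9.0, 8.0, 10.0, 10.0])
deepseek_qual = (14.0, [15.0, 12.0, 14.0, 14.0, 14.0])
flash_spec = (9.0, [8.0, 10.0, 7.0, 10.0, 8.0])
flash_qual = (15.0, [13.0, 12.0, 15.0, 16.0, 13.0])

print(self_bias(*deepseek_spec), self_bias(*deepseek_qual))  # 1.0 0.0
print(self_bias(*flash_spec), self_bias(*flash_qual))        # 1.0 2.0
```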

per-judge ranking

judge	1st	2nd	3rd
deepseek deepseek-v4-pro	deepseek (10)	kimi (10)	mimo (10)
deepseek-flash deepseek-v4-flash	glm (10)	kimi (10)	mimo (10)
glm glm-5.1	deepseek-flash (10)	glm (10)	mimo (10)
kimi kimi-k2.6	—	—	—
mimo mimo-v2.5-pro	glm (9)	kimi (9)	mimo (9)
minimax minimax-m2.5	deepseek (10)	deepseek-flash (10)	glm (10)
qwen qwen3.6-plus	deepseek (10)	kimi (10)	mimo (10)

inter-judge agreement

impl	min	max	range	stdev	judges
deepseek deepseek-v4-pro	8.0	10.0	2.0	0.82	deepseek, deepseek-flash, glm, mimo, minimax, qwen
deepseek-flash deepseek-v4-flash	7.0	10.0	3.0	1.21	deepseek, deepseek-flash, glm, mimo, minimax, qwen
glm glm-5.1	9.0	10.0	1.0	0.55	deepseek, deepseek-flash, glm, mimo, minimax, qwen
kimi kimi-k2.6	9.0	10.0	1.0	0.52	deepseek, deepseek-flash, glm, mimo, minimax, qwen
mimo mimo-v2.5-pro	9.0	10.0	1.0	0.41	deepseek, deepseek-flash, glm, mimo, minimax, qwen
minimax minimax-m2.5	8.0	10.0	2.0	0.82	deepseek, deepseek-flash, glm, mimo, minimax, qwen
qwen qwen3.6-plus	7.0	9.0	2.0	0.82	deepseek, deepseek-flash, glm, mimo, minimax, qwen
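The figures in this table match the spec scores in the per-implementation tables, with the self score included and the stdev computed as a sample standard deviation (n − 1). A short sketch under that reading; the function name `agreement` is illustrative:

```python
from statistics import stdev

# Spec scores per implementation, in judge order:
# deepseek, deepseek-flash, glm, mimo, minimax, qwen.
spec_scores = {
    "deepseek-flash": [8.0, 9.0, 10.0, 7.0, 10.0, 8.0],
    "mimo": [10.0, 10.0, 10.0, 9.0, 10.0, 10.0],
}


def agreement(scores):
    """Spread statistics for one implementation's spec scores."""
    lo, hi = min(scores), max(scores)
    return {
        "min": lo,
        "max": hi,
        "range": hi - lo,
        "stdev": round(stdev(scores), 2),  # sample stdev, n - 1
    }


for impl, scores in spec_scores.items():
    print(impl, agreement(scores))
```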

per-implementation detail

deepseek deepseek-v4-pro

builds/deepseek/rounds/atomic-write-2026-05-08

judge	tier	build	spec	qual	verdict	note
deepseek	self	pass	10.0	14.0	ship-with-cleanup	Lean and correct: minimal code that hits every spec requirement, but bare wit...
deepseek-flash	peer	pass	9.0	15.0	ship-with-cleanup	Minimal and mostly correct implementation, concise but lacking comments and w...
glm	peer	pass	9.0	12.0	ship-with-cleanup	Shortest and most direct implementation; passes all hard-fails but uses full ...
kimi	peer	fail	—	—	—	—
mimo	peer	pass	8.0	14.0	ship-with-cleanup	Most concise implementation with clean structure — except Exception instead o...
minimax	peer	pass	10.0	14.0	ship-with-cleanup	Minimal, correct implementation. No comments. May be missing explicit FileNot...
qwen	peer	pass	10.0	14.0	ship-with-cleanup	Fully spec-compliant with explicit error checks and correct fsync ordering, b...

hidden tests: 12/12 passed

✓test_bytes_basic_write
✓test_cli_stdin_to_path
✓test_concurrent_writers_no_corruption
✓test_missing_parent_raises_filenotfound
✓test_mode_applied_when_set
✓test_mode_preserved_when_none_and_target_exists
✓test_no_tmp_residue_on_open_failure
✓test_no_tmp_residue_on_success
✓test_path_is_directory_raises
✓test_replaces_existing
✓test_symlink_writes_to_target
✓test_text_basic_write
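The test names above and the judge notes (temp file in the same directory, fsync ordering, mode applied before replace, cleanup on failure) suggest the general shape of the task. A minimal POSIX-only sketch of that pattern follows. The function name `atomic_write`, its signature, and the temp-file prefix are hypothetical; this is not taken from the SPEC or from any judged implementation.

```python
import os
import stat
import tempfile


def atomic_write(path, data, mode=None):
    """Write data to path so readers see the old or the new file, never a mix."""
    if isinstance(data, str):
        data = data.encode("utf-8")  # assumption: text is written as UTF-8
    target = os.path.realpath(path)  # follow symlinks, replace the real file
    parent = os.path.dirname(target)
    if not os.path.isdir(parent):
        raise FileNotFoundError(f"parent directory does not exist: {parent}")
    if os.path.isdir(target):
        raise IsADirectoryError(f"target is a directory: {target}")
    if mode is None and os.path.exists(target):
        mode = stat.S_IMODE(os.stat(target).st_mode)  # keep existing mode

    # The temp file lives in the target directory so os.replace stays
    # on one filesystem and is therefore atomic.
    fd, tmp = tempfile.mkstemp(dir=parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # data reaches disk before the rename
            if mode is not None:
                os.fchmod(f.fileno(), mode)  # set mode before replace
        os.replace(tmp, target)
    except BaseException:
        try:
            os.unlink(tmp)  # leave no temp residue on failure
        except FileNotFoundError:
            pass
        raise

    # fsync the directory so the rename itself is durable.
    dir_fd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Calling `atomic_write("out.txt", "hello")` writes the text; passing `mode=0o644` sets the permissions on the temp file before it replaces the target.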

deepseek-flash deepseek-v4-flash

builds/deepseek-flash/rounds/atomic-write-2026-05-08

judge	tier	build	spec	qual	verdict	note
deepseek	peer	pass	8.0	13.0	ship-with-cleanup	Functional but bare: lacks explicit exception guards for SPEC-mandated error ...
deepseek-flash	self	pass	9.0	15.0	ship-with-cleanup	Minimal and correct implementation, hurt only by a missing IsADirectoryError ...
glm	peer	pass	10.0	12.0	ship-with-cleanup	Correct spec-compliant implementation with proper mode handling before replac...
kimi	peer	fail	—	—	—	—
mimo	peer	pass	7.0	15.0	ship-with-cleanup	Minimal, correct implementation with proper fsync and cleanup — missing IsADi...
minimax	peer	pass	10.0	16.0	ship-with-cleanup	Robust atomic write implementation with proper fsync, temp-in-same-dir, and m...
qwen	peer	pass	8.0	13.0	ship-with-cleanup	Solid atomic write with correct fsync ordering and thorough cleanup, but lack...

hidden tests: 12/12 passed

✓test_bytes_basic_write
✓test_cli_stdin_to_path
✓test_concurrent_writers_no_corruption
✓test_missing_parent_raises_filenotfound
✓test_mode_applied_when_set
✓test_mode_preserved_when_none_and_target_exists
✓test_no_tmp_residue_on_open_failure
✓test_no_tmp_residue_on_success
✓test_path_is_directory_raises
✓test_replaces_existing
✓test_symlink_writes_to_target
✓test_text_basic_write

glm glm-5.1

builds/glm/rounds/atomic-write-2026-05-08

judge	tier	build	spec	qual	verdict	note
deepseek	peer	pass	9.0	12.0	ship-with-cleanup	Functional but muddled: encoding-sentinel dispatch is fragile, nested excepti...
deepseek-flash	peer	pass	10.0	12.0	ship-with-cleanup	Functionally correct but has a latent double-close bug in the error handling ...
glm	self	pass	10.0	12.0	ship-with-cleanup	Solid spec-compliant implementation with correct mode handling and full durab...
kimi	peer	fail	—	—	—	—
mimo	peer	pass	9.0	14.0	ship-with-cleanup	Elegant encoding-parameter pattern avoids code duplication — solid error hand...
minimax	peer	pass	10.0	14.0	ship-with-cleanup	Functional implementation with correct fsync and mode handling. No explanator...
qwen	peer	pass	9.0	11.0	ship-with-cleanup	Spec-compliant except for missing FileNotFoundError, but the encoding-as-disc...

hidden tests: 12/12 passed

✓test_bytes_basic_write
✓test_cli_stdin_to_path
✓test_concurrent_writers_no_corruption
✓test_missing_parent_raises_filenotfound
✓test_mode_applied_when_set
✓test_mode_preserved_when_none_and_target_exists
✓test_no_tmp_residue_on_open_failure
✓test_no_tmp_residue_on_success
✓test_path_is_directory_raises
✓test_replaces_existing
✓test_symlink_writes_to_target
✓test_text_basic_write

kimi kimi-k2.6

builds/kimi/rounds/atomic-write-2026-05-08

judge	tier	build	spec	qual	verdict	note
deepseek	peer	pass	10.0	13.0	ship-with-cleanup	Best exception specificity with proper errno codes and strerror messages, but...
deepseek-flash	peer	pass	10.0	15.0	ship-with-cleanup	Correct implementation with good structure but lacks explanatory comments for...
glm	peer	pass	9.0	12.0	ship-with-cleanup	Professionally structured with errno-based exceptions and O_EXCL atomic creat...
kimi	self	fail	—	—	—	—
mimo	peer	pass	9.0	15.0	ship-with-cleanup	Robust implementation with explicit error checks and UUID-based temp files — ...
minimax	peer	pass	10.0	15.0	ship-with-cleanup	Solid implementation with proper errno-based exceptions and UUID-based unique...
qwen	peer	pass	10.0	13.0	ship-with-cleanup	Fully spec-compliant with good errno-based error messages and atomic O_EXCL f...

hidden tests: 12/12 passed

✓test_bytes_basic_write
✓test_cli_stdin_to_path
✓test_concurrent_writers_no_corruption
✓test_missing_parent_raises_filenotfound
✓test_mode_applied_when_set
✓test_mode_preserved_when_none_and_target_exists
✓test_no_tmp_residue_on_open_failure
✓test_no_tmp_residue_on_success
✓test_path_is_directory_raises
✓test_replaces_existing
✓test_symlink_writes_to_target
✓test_text_basic_write

mimo mimo-v2.5-pro

builds/mimo/rounds/atomic-write-2026-05-08

judge	tier	build	spec	qual	verdict	note
deepseek	peer	pass	10.0	19.0	ship-with-cleanup	Polished implementation: correct durability ordering, comprehensive error han...
deepseek-flash	peer	pass	10.0	19.0	ship-with-cleanup	Solid, well-structured implementation covering all spec requirements with cle...
glm	peer	pass	10.0	16.0	ship-with-cleanup	Best-in-class implementation: fchmod on fd before close for optimal atomicity...
kimi	peer	fail	—	—	—	—
mimo	self	pass	9.0	17.0	ship-with-cleanup	Clean, well-decomposed implementation with correct symlink handling and robus...
minimax	peer	pass	10.0	19.0	ship-with-cleanup	Exemplary implementation with clear documentation, proper symlink handling, a...
qwen	peer	pass	10.0	18.0	ship-with-cleanup	Best-structured implementation: clean helper decomposition, uses fchmod on op...

hidden tests: 12/12 passed

✓test_bytes_basic_write
✓test_cli_stdin_to_path
✓test_concurrent_writers_no_corruption
✓test_missing_parent_raises_filenotfound
✓test_mode_applied_when_set
✓test_mode_preserved_when_none_and_target_exists
✓test_no_tmp_residue_on_open_failure
✓test_no_tmp_residue_on_success
✓test_path_is_directory_raises
✓test_replaces_existing
✓test_symlink_writes_to_target
✓test_text_basic_write

minimax minimax-m2.5

builds/minimax/rounds/atomic-write-2026-05-08

judge	tier	build	spec	qual	verdict	note
deepseek	peer	pass	9.0	15.0	ship-with-cleanup	Well-structured with idiomatic CLI and good symlink handling; explicit mode a...
deepseek-flash	peer	pass	10.0	17.0	ship-with-cleanup	Very clean implementation with excellent structure and all spec items handled...
glm	peer	pass	9.0	13.0	ship-with-cleanup	Feature-complete with explicit error checks and symlink handling; marred by c...
kimi	peer	fail	—	—	—	—
mimo	peer	pass	8.0	13.0	ship-with-cleanup	Well-structured implementation with good helpers — bare except, unused return...
minimax	self	pass	10.0	16.0	rewrite	Well-structured implementation with good documentation but applies mode AFTER...
qwen	peer	pass	10.0	17.0	ship-with-cleanup	Well-structured with good helper decomposition and full spec compliance, but ...

hidden tests: 12/12 passed

✓test_bytes_basic_write
✓test_cli_stdin_to_path
✓test_concurrent_writers_no_corruption
✓test_missing_parent_raises_filenotfound
✓test_mode_applied_when_set
✓test_mode_preserved_when_none_and_target_exists
✓test_no_tmp_residue_on_open_failure
✓test_no_tmp_residue_on_success
✓test_path_is_directory_raises
✓test_replaces_existing
✓test_symlink_writes_to_target
✓test_text_basic_write

qwen qwen3.6-plus

builds/qwen/rounds/atomic-write-2026-05-08

judge	tier	build	spec	qual	verdict	note
deepseek	peer	pass	9.0	15.0	ship-with-cleanup	Clean, well-typed implementation with good decomposition; sole spec gap is mi...
deepseek-flash	peer	pass	9.0	16.0	ship-with-cleanup	Clean and correct implementation with good structure, missing only an IsADire...
glm	peer	pass	9.0	12.0	ship-with-cleanup	Correct mode handling with before-replace application; missing explicit IsADi...
kimi	peer	fail	—	—	—	—
mimo	peer	pass	7.0	14.0	ship-with-cleanup	Lean, correct implementation with good error handling and type checks — missi...
minimax	peer	pass	9.0	16.0	ship-with-cleanup	Clean, well-structured implementation with type hints. Missing explicit IsADi...
qwen	self	pass	9.0	16.0	ship-with-cleanup	Clean and minimal implementation with correct fsync ordering and thorough cle...

hidden tests: 12/12 passed

✓test_bytes_basic_write
✓test_cli_stdin_to_path
✓test_concurrent_writers_no_corruption
✓test_missing_parent_raises_filenotfound
✓test_mode_applied_when_set
✓test_mode_preserved_when_none_and_target_exists
✓test_no_tmp_residue_on_open_failure
✓test_no_tmp_residue_on_success
✓test_path_is_directory_raises
✓test_replaces_existing
✓test_symlink_writes_to_target
✓test_text_basic_write

cost & efficiency

impl	slug	loc	wall	tokens	cost	tests	$/test
deepseek-flash	deepseek-v4-flash	62	2m05s	102.7k	$0.0046	12	$0.0004
minimax	minimax-m2.5	105	0m20s	99.8k	$0.010	12	$0.0008
qwen	qwen3.6-plus	77	0m53s	115.7k	$0.020	12	$0.0018
mimo	mimo-v2.5-pro	93	0m49s	80.7k	$0.030	12	$0.0028
deepseek	deepseek-v4-pro	51	3m28s	98.4k	$0.050	12	$0.0043
glm	glm-5.1	68	1m38s	72.7k	$0.060	12	$0.0049
kimi	kimi-k2.6	72	5m46s	115.7k	$0.080	12	$0.0067
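The $/test column appears to be the run cost divided by the number of hidden tests passed. A small sketch of that arithmetic, using the costs listed in the runs section; the function name `cost_per_test` and the rounding to four decimals are assumptions:

```python
# Run costs in USD, copied from the runs section.
run_costs = {
    "deepseek-flash": 0.0046,
    "minimax": 0.0099,
    "kimi": 0.080,
}
TESTS_PASSED = 12  # every implementation passed 12/12 hidden tests


def cost_per_test(cost, tests=TESTS_PASSED):
    """Cost per passed hidden test, rounded to four decimals."""
    return round(cost / tests, 4)


for impl, cost in run_costs.items():
    print(f"{impl}\t${cost_per_test(cost):.4f}")
```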

atomic-write · May 8, 2026
