# dataset
2 rounds · 14 rows · MIT license. Rebuilt every deploy. Cite freely.
| column | type | description |
|---|---|---|
| round | string (YYYY-MM-DD) | Round date. |
| impl | string | Implementer slug (the model lineup name). |
| model_slug | string \| null | Provider/model identifier from meta.json (e.g. opencode/kimi-k2.6). |
| spec_peer | number \| null | Median spec score from peer judges, out of 15. |
| quality_peer | number \| null | Median quality score from peer judges, out of 15. |
| composite | number \| null | spec_peer + quality_peer, out of 30. Null when either component is missing. |
| passed_hard_fail | boolean | True if the impl passed the hard-fail gate (hidden tests + build). |
| tests | string | Hidden test pass count, e.g. "9/9". |
| verdict | string | Mode-of-judges recommendation: ship \| ship-with-cleanup \| rewrite \| reject. |
| samples | integer | Number of independent runs aggregated for this (round, impl) pair. |
| total_cost_usd | number | Sum of inference cost across samples, in USD. |
| total_tokens | integer | Sum of input + output + cache-read tokens across samples. |
| median_wall_seconds | number \| null | Median wall-clock seconds across samples. |
| median_loc | number \| null | Median lines of code in the submitted implementation. |
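As a minimal sketch of the composite rule above, the following recomputes `composite` from the two peer medians, returning null (None) when either component is missing. The row dict shape is an assumption for illustration; the real JSON layout may differ.

```python
# Recompute composite = spec_peer + quality_peer (/30), per the schema table.
# Returns None when either peer median is missing, matching the null rule.

def composite(spec_peer, quality_peer):
    """Sum of the two peer medians, or None if either is absent."""
    if spec_peer is None or quality_peer is None:
        return None
    return spec_peer + quality_peer

# Hypothetical row for illustration only; field names follow the table above.
row = {"round": "2026-01-05", "impl": "kimi-k2.6",
       "spec_peer": 12.0, "quality_peer": 11.5}
print(composite(row["spec_peer"], row["quality_peer"]))  # 23.5
```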
The dataset is released under the MIT license, matching the repository. Use it for anything; attribution appreciated, not required.
To cite the dataset:

```bibtex
@misc{openbench,
  author = {fole},
  title = {open-bench: weekly LLM coding battle royale},
  year = {2026},
  howpublished = {\url{https://openbenchmark.dev/dataset}},
  note = {schema_version=1}
}
```

The JSON file carries a meta.schema_version. Breaking changes bump it; old shapes stay reachable in git history under frontend/src/lib/dataset.ts.
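A hedged sketch of a version guard built on that note: refuse to parse the table if meta.schema_version is not the one this consumer understands. The top-level keys (`meta`, `schema_version`, `rows`) are assumptions about the JSON layout, not confirmed by the source.

```python
import json

# Schema version this consumer understands; bump when the upstream shape changes.
SUPPORTED_SCHEMA = 1

def load_rows(text):
    """Parse the dataset JSON, rejecting unknown schema versions up front."""
    data = json.loads(text)
    version = data.get("meta", {}).get("schema_version")
    if version != SUPPORTED_SCHEMA:
        raise ValueError(f"unsupported schema_version: {version}")
    return data.get("rows", [])

# Hypothetical payload for illustration; real files live at the dataset URL.
sample = '{"meta": {"schema_version": 1}, "rows": [{"impl": "kimi-k2.6"}]}'
print(load_rows(sample))  # [{'impl': 'kimi-k2.6'}]
```

Failing fast on an unexpected version is cheaper than debugging a silently mis-shaped table downstream.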
This dataset is the aggregated table. Per-run inputs/outputs (transcripts, diffs, hidden test outputs, judge prompts) live in the repo at builds/ and results/.