open-bench

A benchmark harness for coding LLMs. Drops a task spec into a fresh repo, drives every model through a real agent loop, runs a hidden test suite as the objective gate, has the models score each other, and commits the full artifact set.

github | methodology | dataset

what

open-bench is the engine. You give it a task — a SPEC, a hidden test suite, a lineup of models — and it produces a ranked review with everything needed to audit the result. It is not a leaderboard service or a vendor benchmark; it is the framework you point at the models you actually care about.

The flagship format running on top of it is Model Royale: a weekly elimination tournament across selected open-source coding models. Royale is one consumer of the engine, not the engine itself.

pipeline

  1. task A SPEC.md, a blank repo, hidden tests, a wall-clock budget. Tasks are versioned and live in bench/tasks/.
  2. run Each model in the lineup runs the task in an isolated worktree through a fixed agent loop. Cost, tokens, and wall-clock are captured.
  3. gate A hidden test suite runs against every output. Failing the gate means you cannot win the round, no matter what the judges think.
  4. judge Every model judges every other implementation under blinded labels. Self-bias is measured and excluded from the headline median.
  5. aggregate Scores, costs, transcripts, diffs, rubrics — all committed back into the repo. One ranked review per round.
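The judge step can be sketched as a plain scoring matrix: every model scores every implementation without seeing who wrote it, self-judgments are recorded for bias measurement but dropped from the headline median. The function and parameter names here are illustrative, not the harness's actual API.

```python
import statistics

def collect_judgments(outputs, judge):
    """Collect blinded peer judgments.

    `outputs` maps model name -> implementation; `judge(judge_model, impl)`
    returns a numeric score and never sees which model wrote `impl`.
    Self-judgments are kept separately so self-bias can be measured,
    then excluded from the headline median.
    """
    peer = {m: [] for m in outputs}
    self_scores = {}
    for j in outputs:                    # every model judges...
        for m, impl in outputs.items():  # ...every implementation
            score = judge(j, impl)
            if j == m:
                self_scores[m] = score   # measured, not counted
            else:
                peer[m].append(score)
    headline = {m: statistics.median(s) for m, s in peer.items()}
    self_bias = {m: self_scores[m] - headline[m] for m in outputs}
    return headline, self_bias
```

A model that scores its own work two points above what its peers gave it shows up as a `self_bias` of 2, without that score ever touching the headline number.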

principles

objective gate first
Hidden tests run before any judge sees the code. No vibe-check inflates a non-functional output.
peer scoring, blinded
Models judge each other; labels are stripped. Self-bias is measured, then removed.
cost as a first-class column
Cheapest-and-correct wins ties. Slow-and-correct loses them.
reproducible by default
Every transcript, diff, rubric, and score is committed. Re-run a round, audit a judgment, fork the lineup.
kind-agnostic
The harness ships with a code task kind today. New task kinds register through bench/scripts/_kinds/ without forking the runner.
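Taken together, the first three principles pin down a total order for a round. A minimal sketch, assuming a hypothetical `results` shape (the field names are illustrative, not the harness's schema):

```python
import statistics

def rank_round(results):
    """Rank one round under the principles above.

    `results` maps model name -> {"gate": bool, "scores": peer scores
    with self-judgments excluded, "cost": dollars}. Gate failures are
    unrankable; among passers, the higher headline median wins and
    lower cost breaks ties.
    """
    passers = [m for m, r in results.items() if r["gate"]]
    return sorted(
        passers,
        key=lambda m: (-statistics.median(results[m]["scores"]),
                       results[m]["cost"]),
    )

results = {
    "model-a": {"gate": True,  "scores": [8, 9, 7],    "cost": 0.42},
    "model-b": {"gate": True,  "scores": [9, 8, 8],    "cost": 0.19},
    "model-c": {"gate": False, "scores": [10, 10, 10], "cost": 0.05},
}
# model-c failed the gate and cannot win regardless of judge scores;
# model-a and model-b tie on median (8), so the cheaper run wins.
print(rank_round(results))  # ['model-b', 'model-a']
```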

running it

The shipping task kind is code: write code against a SPEC, gated by hidden tests, judged on spec coverage and quality. New kinds (eval-style, agentic, multi-step) plug in through the _kinds/ registry.
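A registry like this is often just a decorator over a name-to-class map. The sketch below assumes that shape; `register_kind`, `KINDS`, and the method names are illustrative, not the harness's actual API.

```python
KINDS = {}

def register_kind(name):
    """Register a task kind with the runner. A module dropped into
    bench/scripts/_kinds/ would call this at import time; the runner
    then looks kinds up by name instead of hard-coding them."""
    def wrap(cls):
        KINDS[name] = cls
        return cls
    return wrap

@register_kind("code")
class CodeKind:
    """The shipping kind: agent loop against a SPEC, hidden-test gate."""
    def run(self, model, spec):
        raise NotImplementedError  # drive the agent loop in a worktree
    def gate(self, output):
        raise NotImplementedError  # run the hidden test suite

def get_kind(name):
    if name not in KINDS:
        raise ValueError(f"unknown task kind: {name!r}")
    return KINDS[name]()
```

A new kind is then one new module that calls `register_kind("eval")` on its own class; the runner itself never changes.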

The reference consumer is Model Royale — weekly tournament, fixed lineup, elimination format. See the about page for full methodology, or jump straight to the round archive.

source & data