about

What open-bench is, why it exists, and how a round actually runs.

[01] what

open-bench is a benchmark harness for coding LLMs. It drops a task spec into a fresh repo, drives every model through a real agent loop, runs a hidden test suite as the objective gate, has the models score each other, and commits the full artifact set — transcripts, diffs, judge rubrics, scores — to the repository. One round, one ranked review.

The engine is provider-agnostic: tasks, runs, judging, and aggregation work without any specific agent harness. The bundled --auto driver shells out to opencode today; replacing it with Claude Code, Aider, or anything that can take a prompt and write to a worktree is a single-module swap. The manual flow needs no driver at all.
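The swap point can be sketched as a single interface. Everything here is illustrative, not the harness's actual API: `Driver`, `DriverResult`, `budget_s`, and the opencode flags are all assumptions.

```python
import subprocess
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class DriverResult:
    """What any driver hands back to the harness (illustrative shape)."""
    transcript: str  # full agent-loop transcript
    exit_code: int   # 0 if the agent finished cleanly


@runtime_checkable
class Driver(Protocol):
    """Anything that takes a prompt and writes into a worktree qualifies."""
    def run(self, prompt: str, worktree: str, budget_s: int) -> DriverResult: ...


class OpencodeDriver:
    """Bundled default, sketched: shell out to the opencode CLI.
    The exact CLI invocation is a guess; only the interface matters here."""
    def run(self, prompt: str, worktree: str, budget_s: int) -> DriverResult:
        proc = subprocess.run(
            ["opencode", "run", prompt],
            cwd=worktree, capture_output=True, text=True, timeout=budget_s,
        )
        return DriverResult(transcript=proc.stdout, exit_code=proc.returncode)
```

Swapping in another agent then means writing one more class with the same `run` signature.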

The flagship format on top of it is Model Royale: a weekly elimination tournament between selected open-source coding models. The harness is general; the tournament is one thing you can run on it.

[02] why

open-bench captures four numbers per implementation: hidden-test pass rate, blinded peer scores from the other models in the lineup, model wall-clock, and dollar cost. Vendor leaderboards report only the first, against synthetic problems, on the vendor's release schedule. open-bench reports all four, against tasks you author, every round, with the full transcript and diff committed alongside.
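The four numbers fit in one record per implementation. A sketch; the field names are hypothetical, not the committed artifact schema:

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ImplScore:
    """One row of a round's ranked review (hypothetical shape)."""
    model: str
    pass_rate: float          # hidden-test pass rate, 0.0..1.0
    peer_scores: list[float]  # blinded scores from the other models
    wall_clock_s: float       # model wall-clock, seconds
    cost_usd: float           # dollar cost of the run

    @property
    def peer_median(self) -> float:
        """Headline number: median of the blinded peer scores."""
        return median(self.peer_scores)
```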

[03] how

  1. Every model in the lineup gets the same task and an isolated git worktree.
  2. Each writes its implementation through an agent harness (opencode by default, swappable) under a fixed wall-clock budget.
  3. A hidden pytest battery runs against each output — the objective gate.
  4. Every model judges every implementation, blinded. Cost and tokens are captured.
  5. Scores are aggregated into one ranked review per round, plus a self-bias delta.
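The five steps compose into one loop. A minimal sketch with every heavy step injected as a callable; all names here are hypothetical, not the harness's API:

```python
from typing import Callable


def run_round(
    models: list[str],
    task: str,
    budget_s: int,
    implement: Callable[[str, str, int], str],  # model, task, budget -> diff
    gate: Callable[[str], float],               # diff -> hidden-test pass rate
    judge: Callable[[str, str], float],         # judge model, diff -> score
):
    # Steps 1-2: same task for everyone, isolated worktree, fixed budget each.
    impls = {m: implement(m, task, budget_s) for m in models}
    # Step 3: the hidden test battery gates every output before judging.
    gates = {m: gate(diff) for m, diff in impls.items()}
    # Step 4: every model scores every implementation; the judge sees only
    # the diff, never the author label, so scoring stays blinded.
    scores = {m: {j: judge(j, impls[m]) for j in models} for m in models}
    # Step 5: one ranked review per round (aggregation elided here).
    return gates, scores
```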

[04] principles

objective gate
Hidden tests run before any judge sees the code. No vibe-check inflating a non-functional output.
peer scoring
Models judge each other under blinded labels. Self-bias is measured, then excluded from the headline median.
cost & wall-clock
Tracked through the harness. Cheapest-and-correct wins ties; slow-and-correct loses them.
reproducible
Every transcript, diff, rubric, and score is committed. Re-run a round, audit a judgment, fork the lineup.
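The peer-scoring rule, self-score measured but excluded from the headline median, reduces to a few lines. A sketch under assumed names; the real aggregation lives in the harness:

```python
from statistics import median


def aggregate(scores: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """scores[author][judge] -> blinded score for author's implementation.
    Returns the headline peer median plus the self-bias delta per author."""
    review = {}
    for author, by_judge in scores.items():
        # The headline excludes the author's own vote...
        peers = [s for judge, s in by_judge.items() if judge != author]
        headline = median(peers)
        # ...but the vote is kept, to measure self-bias.
        self_score = by_judge.get(author)
        delta = self_score - headline if self_score is not None else 0.0
        review[author] = {"median": headline, "self_bias": delta}
    return review
```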

[05] more