about
What open-bench is, why it exists, and how a round actually runs.
open-bench is a benchmark harness for coding LLMs. It drops a task spec into a fresh repo, drives every model through a real agent loop, runs a hidden test suite as the objective gate, has the models score each other, and commits the full artifact set — transcripts, diffs, judge rubrics, scores — to the repository. One round, one ranked review.
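The round described above can be sketched as a loop: implement, gate on hidden tests, peer-review, rank. This is a minimal illustrative sketch, not open-bench's real API; every name (`run_round`, `RoundResult`, the callbacks) and the toy data are assumptions.

```python
# Hypothetical sketch of one open-bench round. All names and data are
# illustrative stand-ins, not the project's actual API.
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class RoundResult:
    model: str
    pass_rate: float                          # hidden-test pass rate (the objective gate)
    peer_scores: list = field(default_factory=list)

def run_round(models, run_agent, run_hidden_suite, score_blinded):
    """Drive every model through the loop, gate on hidden tests, then peer-review."""
    results = []
    for model in models:
        diff = run_agent(model)               # agent loop writes into a fresh worktree
        results.append(RoundResult(model, run_hidden_suite(diff)))
    for judge in models:                      # blinded peer scoring by the other models
        for r in results:
            if r.model != judge:
                r.peer_scores.append(score_blinded(judge, r))
    return sorted(results, key=lambda r: (r.pass_rate, mean(r.peer_scores)),
                  reverse=True)

# Toy stand-ins so the sketch runs end to end:
if __name__ == "__main__":
    fake_pass = {"model-a": 0.9, "model-b": 0.6}
    ranked = run_round(
        models=["model-a", "model-b"],
        run_agent=lambda m: m,                # pretend the "diff" is the model name
        run_hidden_suite=lambda d: fake_pass[d],
        score_blinded=lambda judge, r: r.pass_rate * 10,
    )
    print([r.model for r in ranked])
```

In the real harness each stage also emits artifacts (transcript, diff, rubric) that get committed; the sketch keeps only the ranking logic.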
The engine is provider-agnostic: tasks, runs, judging, and aggregation work without any specific agent harness. The bundled --auto driver shells out to opencode today; replacing it with Claude Code, Aider, or anything else that can take a prompt and write to a worktree is a single-module swap. The manual flow needs no driver at all.
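The single-module seam could look like a small interface with one method per driver. This is a hedged sketch: the `Driver` protocol, the class names, and the exact `opencode` command line are all assumptions, not the project's real code.

```python
# Hypothetical sketch of the driver seam. The protocol, class names, and the
# opencode invocation are illustrative assumptions, not open-bench's real API.
import subprocess
from typing import Protocol, runtime_checkable

@runtime_checkable
class Driver(Protocol):
    def run(self, prompt: str, worktree: str) -> None:
        """Take a prompt and write changes into the worktree."""

class OpencodeDriver:
    """Shells out to the opencode CLI (flags here are illustrative)."""
    def run(self, prompt: str, worktree: str) -> None:
        subprocess.run(["opencode", "run", prompt], cwd=worktree, check=True)

class ManualDriver:
    """Manual flow: a human edits the worktree, so there is nothing to drive."""
    def run(self, prompt: str, worktree: str) -> None:
        print(f"Apply '{prompt}' in {worktree} by hand, then re-run the gate.")
```

Swapping harnesses means writing one more class that satisfies `Driver`; the rest of the engine never sees which one is behind it.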
The flagship format on top of it is Model Royale: a weekly elimination tournament between selected open-source coding models. The harness is general; the tournament is one thing you can run on it.
open-bench captures four numbers per implementation: hidden-test pass rate, blinded peer scores from the other models in the lineup, model wall-clock time, and dollar cost. Vendor leaderboards report only the first, against synthetic problems, on the vendor's release schedule. open-bench reports all four, against tasks you author, every round, with the full transcript and diff committed alongside.
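The four numbers fit naturally into one record per implementation. A minimal sketch, assuming hypothetical field names and an illustrative ordering policy (objective gate first, peer mean as tiebreaker); open-bench's real schema and ranking rules may differ.

```python
# Hypothetical per-implementation scorecard; field names and the ranking
# policy are illustrative assumptions, not open-bench's real schema.
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class Scorecard:
    model: str
    pass_rate: float        # hidden-test pass rate, 0.0-1.0
    peer_scores: tuple      # blinded scores from the other models in the lineup
    wall_clock_s: float     # model wall-clock time for the run, seconds
    cost_usd: float         # dollar cost of the run

    @property
    def peer_mean(self) -> float:
        return mean(self.peer_scores)

def rank(cards):
    # Assumed policy: objective gate dominates, peer consensus breaks ties.
    return sorted(cards, key=lambda c: (c.pass_rate, c.peer_mean), reverse=True)
```

Wall-clock and cost stay out of the ranking key in this sketch; they are reported alongside so readers can weigh speed and price themselves.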