about
What open-bench is, why it exists, and how a round actually runs.
open-bench is a benchmark harness for coding LLMs. It drops a task spec into a fresh repo, drives every model through a real agent loop, runs a hidden test suite as the objective gate, and commits the full artifact set — transcripts, diffs, scores, costs — to the repository. One run, one reviewable record, everything auditable.
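As a rough illustration of that flow, here is a minimal Python sketch of one round. It is not open-bench's actual code: `RunRecord`, `run_round`, the `TASK.md` file name, and the `agent` and `hidden_tests` callables are all hypothetical names chosen for this example.

```python
import json
import tempfile
import time
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable, List, Tuple


@dataclass
class RunRecord:
    """Hypothetical artifact set for one model on one task."""
    model: str
    transcript: str
    diff: str
    tests_passed: bool
    cost_usd: float
    wall_clock_s: float


def run_round(
    task_spec: str,
    models: List[str],
    agent: Callable[[str, Path], Tuple[str, str, float]],
    hidden_tests: Callable[[Path], bool],
    out_dir: Path,
) -> List[RunRecord]:
    """Run every model on the same task and write one JSON record per model."""
    out_dir.mkdir(parents=True, exist_ok=True)
    records = []
    for model in models:
        # Each model starts from a fresh, empty working directory.
        with tempfile.TemporaryDirectory() as tmp:
            repo = Path(tmp)
            (repo / "TASK.md").write_text(task_spec)  # assumed file name
            start = time.perf_counter()
            transcript, diff, cost = agent(model, repo)
            elapsed = time.perf_counter() - start
            # The hidden suite runs only after the agent has finished.
            passed = hidden_tests(repo)
        record = RunRecord(model, transcript, diff, passed, cost, elapsed)
        (out_dir / f"{model}.json").write_text(json.dumps(asdict(record), indent=2))
        records.append(record)
    return records
```

In this sketch the agent never sees the test callable, which is what keeps the gate hidden, and the per-model JSON files stand in for the committed artifact set.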
The engine is provider-agnostic: tasks, runs, scoring, and aggregation work without any specific agent harness. The bundled --auto driver shells out to opencode today; replacing it with Claude Code, Aider, or anything that can take a prompt and write to a worktree is a single-module swap. Manual flow needs no driver at all.
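To make the single-module swap concrete, the sketch below shows one way such a driver boundary could look in Python. The `Driver` protocol, the `EchoDriver` stand-in, and the `drive` helper are hypothetical and are not taken from open-bench; a real driver would invoke an agent tool where `EchoDriver` simply writes a file.

```python
from pathlib import Path
from typing import Protocol


class Driver(Protocol):
    """Hypothetical contract: take a prompt, write into a worktree, return a transcript."""

    def run(self, prompt: str, worktree: Path) -> str:
        ...


class EchoDriver:
    """Stand-in driver for illustration; it calls no real agent."""

    def run(self, prompt: str, worktree: Path) -> str:
        # A real driver would hand the prompt to an agent tool at this point.
        (worktree / "solution.txt").write_text(prompt)
        return f"wrote solution.txt for prompt: {prompt!r}"


def drive(driver: Driver, prompt: str, worktree: Path) -> str:
    """The engine side depends only on the protocol, never on a concrete driver."""
    return driver.run(prompt, worktree)
```

Under this assumption, swapping agents means adding another class with the same `run` method, and the code that handles tasks, scoring, and aggregation stays untouched.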
What it produces is a set of standalone benchmarks — each task run across a model lineup, with the full receipts committed. It is not a leaderboard service and not a tournament: Model Royale, the first format that ran on it, has been retired. The artifacts are the product; the reader is the judge.
Vendor leaderboards give you one number, against synthetic problems, on the vendor's release schedule. open-bench gives you the receipts: the actual code each model wrote, the actual transcript, the actual cost — against tasks you author.
The objective signals are the hidden-test result, the dollar cost, and the wall-clock time. Peer review, in which models score each other, is kept as a softer second read, not a verdict. You don't take a ranking on faith; you open the diff and judge for yourself.
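One way to keep that separation visible in data is sketched below in Python. The `Result` class and its field names are hypothetical, and treating the gate as "every hidden test passes" is an assumption of this example, not a documented open-bench rule.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class Result:
    """Hypothetical per-model result: objective signals first, peer review apart."""
    model: str
    tests_passed: int
    tests_total: int
    cost_usd: float
    wall_clock_s: float
    peer_scores: List[float] = field(default_factory=list)

    @property
    def gate_passed(self) -> bool:
        # Assumption: the gate requires every hidden test to pass.
        return self.tests_total > 0 and self.tests_passed == self.tests_total

    @property
    def peer_mean(self) -> Optional[float]:
        # Softer second read; it never affects gate_passed.
        return mean(self.peer_scores) if self.peer_scores else None


def report_line(r: Result) -> str:
    """Format the objective signals, then the peer score as a separate column."""
    gate = "PASS" if r.gate_passed else "FAIL"
    peer = "n/a" if r.peer_mean is None else f"{r.peer_mean:.1f}"
    return (
        f"{r.model}: {gate} ({r.tests_passed}/{r.tests_total}), "
        f"${r.cost_usd:.2f}, {r.wall_clock_s:.0f}s | peer {peer}"
    )
```

Because `peer_mean` is computed independently of `gate_passed`, a high peer score cannot rescue a failed gate in this sketch, and a reader can always ignore the peer column entirely.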