about

What open-bench is, why it exists, and how a round actually runs.

[01] what

open-bench is a benchmark harness for coding LLMs. It drops a task spec into a fresh repo, drives every model through a real agent loop, runs a hidden test suite as the objective gate, and commits the full artifact set — transcripts, diffs, scores, costs — to the repository. One run, one reviewable record, everything auditable.

The engine is provider-agnostic: tasks, runs, scoring, and aggregation work without any specific agent harness. The bundled --auto driver shells out to opencode today; replacing it with Claude Code, Aider, or anything that can take a prompt and write to a worktree is a single-module swap. Manual flow needs no driver at all.
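To make the "single-module swap" concrete, here is a minimal sketch of what a driver contract could look like. The names `Driver`, `DriverResult`, `StubDriver`, and the `run` signature are hypothetical and are not taken from open-bench's code; the only thing assumed from the text above is that a driver takes a prompt and writes to a worktree.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class DriverResult:
    """Hypothetical return value of one agent run."""
    transcript: str   # whatever the agent printed or logged
    timed_out: bool   # True if the wall-clock budget ran out


class Driver(Protocol):
    """Hypothetical contract: take a prompt, write into a worktree."""

    def run(self, prompt: str, worktree: Path, budget_s: float) -> DriverResult: ...


class StubDriver:
    """Stand-in for illustration; a real driver would invoke an agent CLI here."""

    def run(self, prompt: str, worktree: Path, budget_s: float) -> DriverResult:
        # Write a placeholder file so the worktree has something to diff.
        (worktree / "solution.py").write_text("# generated for: " + prompt + "\n")
        return DriverResult(transcript="stub run for: " + prompt, timed_out=False)
```

Under this assumption, swapping harnesses means supplying another class with the same `run` method; nothing else in the engine would need to know which agent did the writing.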

What it produces is a set of standalone benchmarks — each task run across a model lineup, with the full receipts committed. It is not a leaderboard service and not a tournament: Model Royale, the first format that ran on it, has been retired. The artifacts are the product; the reader is the judge.

[02] why

Vendor leaderboards give you one number, against synthetic problems, on the vendor's release schedule. open-bench gives you the receipts: the actual code each model wrote, the actual transcript, the actual cost — against tasks you author.

The objective signals are the hidden-test result, the dollar cost, and the wall-clock time. Peer review, in which models score each other, is kept as a softer second read, not a verdict. You don't take a ranking on faith; you open the diff and judge for yourself.

[03] how

  1. Every model in the lineup gets the same task and an isolated git worktree.
  2. Each writes its implementation through an agent harness (opencode by default, swappable) under a fixed wall-clock budget.
  3. A hidden pytest battery runs against each output as the objective gate.
  4. The models also review each other blinded, as a secondary read; cost and tokens are captured.
  5. Everything (diffs, transcripts, scores, costs) is committed as one reviewable record per run.
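The first three steps can be sketched as a small loop. This is an illustration under stated assumptions, not open-bench's actual engine: `run_round`, `RunRecord`, `DriverFn`, and `GateFn` are hypothetical names, a temporary directory stands in for the git worktree, and the wall-clock budget, peer review, and commit steps are left out.

```python
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Hypothetical signatures: a driver writes code into a directory,
# a gate inspects that directory and reports pass/fail.
DriverFn = Callable[[str, str, Path], str]  # (model, task, workdir) -> transcript
GateFn = Callable[[Path], bool]             # workdir -> did the hidden tests pass?


@dataclass
class RunRecord:
    model: str
    passed: bool
    wall_clock_s: float
    transcript: str


def run_round(models: list[str], task: str, driver: DriverFn, gate: GateFn) -> list[RunRecord]:
    records = []
    for model in models:
        # A temporary directory stands in for an isolated git worktree.
        with tempfile.TemporaryDirectory() as tmp:
            workdir = Path(tmp)
            start = time.monotonic()
            transcript = driver(model, task, workdir)
            elapsed = time.monotonic() - start
            # The gate runs on the finished output, before any peer review.
            passed = gate(workdir)
        records.append(RunRecord(model, passed, elapsed, transcript))
    return records
```

Passing the driver and the gate in as plain callables keeps the loop independent of any particular agent harness or test runner, which is the property the steps above rely on.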

[04] principles

objective gate
Hidden tests run before any review. They decide pass/fail; no opinion inflates a non-functional output.
transparency over ranking
Every diff, transcript, score and cost is committed. The artifacts are the point: you read them and judge, rather than taking a number on faith.
peer review is a second read
Models do score each other, blinded, with self-bias measured. It is a soft signal, not a verdict — useful colour, not the headline.
cost & wall-clock
Tracked as first-class columns. What a result cost is part of the result.
reproducible
Re-run a benchmark, audit a score, fork the lineup. Nothing is hidden behind a service.
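
The text says self-bias is measured but not how. One plausible definition, offered purely as an assumption, is the score a model gave its own output minus the mean score that output received from the other reviewers. The function name `self_bias` and the `scores[reviewer][author]` layout are hypothetical.

```python
from statistics import mean


def self_bias(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """scores[reviewer][author] -> score.

    Returns, per model, the score it gave its own output minus the mean
    score that output got from the other reviewers. This is an assumed
    definition, not necessarily the one open-bench uses.
    """
    bias = {}
    for model in scores:
        own = scores[model].get(model)
        others = [scores[r][model] for r in scores if r != model and model in scores[r]]
        if own is None or not others:
            continue  # not enough data to compare
        bias[model] = own - mean(others)
    return bias
```

A positive value would mean the model rated its own work above what its peers gave it; a value near zero would suggest the blinding held.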

[05] more