open-bench
A benchmark harness for coding LLMs. Drops a task spec into a fresh repo, drives every model through a real agent loop, runs a hidden test suite as the objective gate, has the models score each other, and commits the full artifact set.
open-bench is the engine. You give it a task — a SPEC, a hidden test suite, a lineup of models — and it produces a ranked review with everything needed to audit the result. It is not a leaderboard service or a vendor benchmark; it is the framework you point at the models you actually care about.
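The flow described above (agent loop, hidden-test gate, peer scoring, ranking) can be sketched roughly as follows. This is an illustrative sketch, not open-bench's actual code: `rank_models`, `Submission`, and the three callables are hypothetical names, and the ranking rule (gate passers first, then mean peer score) is an assumption made for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class Submission:
    model: str
    passed_gate: bool          # did the hidden test suite pass?
    peer_scores: List[float]   # scores given by the *other* models


def rank_models(
    models: List[str],
    run_agent: Callable[[str], str],          # hypothetical: model -> solution
    hidden_tests: Callable[[str], bool],      # hypothetical: solution -> pass/fail
    peer_score: Callable[[str, str], float],  # hypothetical: (judge, solution) -> score
) -> List[Submission]:
    # 1. Drive every model through the agent loop to get a solution.
    solutions = {m: run_agent(m) for m in models}

    submissions = []
    for m in models:
        # 2. The hidden test suite is the objective gate.
        passed = hidden_tests(solutions[m])
        # 3. Models score each other; a model never judges itself.
        scores = [peer_score(j, solutions[m]) for j in models if j != m]
        submissions.append(Submission(m, passed, scores))

    # 4. Assumed ranking rule: gate passers first, then mean peer score.
    def key(s: Submission):
        return (s.passed_gate, mean(s.peer_scores) if s.peer_scores else 0.0)

    return sorted(submissions, key=key, reverse=True)
```

With stub callables, a model that fails the gate ranks below every passer regardless of how well its peers score it, which is the point of running the tests as an objective gate before any subjective judging counts.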
What it produces is a set of standalone benchmarks — each task run across the model lineup, full receipts committed. The first format on top of it was Model Royale, a weekly elimination tournament; it surfaced the methodology problems it was always going to, and is paused.
The engine is task-kind agnostic. The shipping kind is code: write code against a SPEC, gated by hidden tests, judged on spec coverage and quality. Tasks live in bench/tasks/. New kinds (eval-style, agentic, multi-step) plug in through the _kinds/ registry in bench/scripts/_kinds/ without forking the runner.
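One common way to build a kind registry like this is a name-to-handler mapping filled by a decorator, so the runner dispatches on a task's kind without knowing the kinds in advance. The sketch below assumes that shape. `KINDS`, `register_kind`, `run_task`, and the task dictionary fields are hypothetical and are not taken from the open-bench source.

```python
from typing import Callable, Dict

# Hypothetical registry: kind name -> handler that runs a task of that kind.
KINDS: Dict[str, Callable[[dict], dict]] = {}


def register_kind(name: str):
    """Decorator that adds a handler to the registry under `name`."""
    def decorator(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        if name in KINDS:
            raise ValueError(f"kind already registered: {name}")
        KINDS[name] = fn
        return fn
    return decorator


@register_kind("code")
def run_code_task(task: dict) -> dict:
    # Placeholder for the shipping kind: code gated by hidden tests.
    return {"task": task["name"], "kind": "code", "gate": "hidden-tests"}


def run_task(task: dict) -> dict:
    """The runner only looks up the kind; it needs no per-kind branches."""
    try:
        handler = KINDS[task["kind"]]
    except KeyError:
        raise ValueError(f"unknown task kind: {task['kind']}") from None
    return handler(task)
```

Under this assumption, adding a kind means adding one decorated function in its own module; `run_task` itself never changes, which is what "without forking the runner" amounts to in practice.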
See the about page for full methodology, browse the benchmarks for results with full artifacts, or read the writeups. Model Royale, the original tournament format, is retired here.