open-bench
A benchmark harness for coding LLMs. Drops a task spec into a fresh repo, drives every model through a real agent loop, runs a hidden test suite as the objective gate, has the models score each other, and commits the full artifact set.
open-bench is the engine. You give it a task — a SPEC, a hidden test suite, a lineup of models — and it produces a ranked review with everything needed to audit the result. It is not a leaderboard service or a vendor benchmark; it is the framework you point at the models you actually care about.
The flagship format running on top of it is Model Royale: a weekly elimination tournament across selected open-source coding models. Royale is one consumer of the engine, not the engine itself.
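The engine flow described above can be sketched roughly as follows. This is an illustrative outline only, assuming nothing about open-bench's actual API; every name here (`run_benchmark`, `run_hidden_tests`, the result shape) is hypothetical.

```python
import pathlib
import tempfile

def run_hidden_tests(repo: pathlib.Path) -> bool:
    # Stand-in for invoking the hidden test suite inside `repo`;
    # the real gate would run actual tests, not check for a file.
    return (repo / "solution.py").exists()

def run_benchmark(spec: str, models: list[str]) -> list[dict]:
    ranked = []
    for model in models:
        repo = pathlib.Path(tempfile.mkdtemp())  # fresh repo per model
        (repo / "SPEC.md").write_text(spec)      # drop the task spec in
        # ...an agent loop would drive `model` here, editing files in repo...
        passed = run_hidden_tests(repo)          # objective gate
        ranked.append({"model": model, "passed": passed})
    # ...peer scoring and artifact commits would follow...
    return sorted(ranked, key=lambda r: r["passed"], reverse=True)

ranked = run_benchmark("Implement a rate limiter.", ["model-a", "model-b"])
```

The point of the shape is the separation of concerns: the agent loop produces a repo, the hidden tests gate it objectively, and ranking happens only after every model has been through the same pipeline.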
The engine is task-kind agnostic. The shipping kind is code: write code against a SPEC, gated by hidden tests, judged on spec coverage and quality. Tasks live under bench/tasks/; new kinds (eval-style, agentic, multi-step) plug in through the bench/scripts/_kinds/ registry without forking the runner.
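A registry like _kinds/ is commonly a name-to-runner mapping plus a decorator. The sketch below is a minimal, hypothetical version of that pattern; the names and result shapes are illustrative assumptions, not the actual open-bench registry.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a kind name to its runner.
KINDS: Dict[str, Callable[[str], dict]] = {}

def register_kind(name: str):
    """Decorator that adds a task-kind runner to the registry."""
    def wrap(fn: Callable[[str], dict]):
        KINDS[name] = fn
        return fn
    return wrap

@register_kind("code")
def run_code_task(spec: str) -> dict:
    # The shipping kind: write code against a SPEC, gate on hidden tests.
    return {"kind": "code", "spec": spec, "gate": "hidden-tests"}

# A new kind plugs in the same way, without touching the runner:
@register_kind("agentic")
def run_agentic_task(spec: str) -> dict:
    return {"kind": "agentic", "spec": spec, "gate": "trajectory-review"}

result = KINDS["code"]("Implement a rate limiter per SPEC.md")
```

With this shape, the runner only ever looks up `KINDS[name]`, so adding a kind is a new file in the registry directory rather than a fork.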
For Model Royale's full methodology (weekly tournament, fixed lineup, elimination format), see the about page, or jump straight to the round archive.