Benchmarks

Every task in the open-bench corpus is listed here. Each one is a SPEC.md plus a hidden test suite: the contract that models are graded against. Tasks are versioned and live in bench/tasks/.

4 tasks

3 rounds

35 samples

$1.60 total spend

break-sandbox python

An adversarial pytest suite that attempts to escape a sandbox implementing the round-1 sandbox contract. Stdlib + pytest only.

rounds: 1
models: 7
samples: 7
tests pass: 100%
spend: $0.375
last run: May 14, 2026

atomic-write python

A single-file Python module providing crash-safe file writes that **either fully succeed or leave the previous contents intact** — no partial writes, no `.tmp` residue, no torn data visible to readers.

latest winner: mimo mimo-v2.5-pro, 29.0/30

rounds: 1
models: 7
samples: 7
tests pass: 100%
spend: $0.259
last run: May 8, 2026
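
For orientation, the guarantee in the atomic-write description is commonly met with a write-to-temp-then-rename pattern. The sketch below is not the task's reference solution or any model's submission; the function name `atomic_write` and the `.atomic-` temp prefix are assumptions made for illustration, and the task's SPEC.md may require a different interface.

```python
import os
import tempfile


def atomic_write(path, data: bytes) -> None:
    """Hypothetical helper: replace `path` with `data`, or leave it untouched."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem. Note that mkstemp creates it with mode 0o600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".atomic-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push file contents to disk before the rename
        # os.replace swaps the new file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, remove the temp file so no residue is left behind.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # On POSIX, also fsync the directory so the rename itself is durable.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A reader opening the file at any moment sees either the old bytes or the new bytes, because the only visible change is the rename.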

sandbox python

A Python module that wraps Podman (or Docker) to run commands inside ephemeral, network-isolated, resource-capped containers. Stdlib only.

latest winner: glm glm-5.1, 27.5/30

rounds: 1
models: 7
samples: 21
tests pass: 100%
spend: $0.968
last run: May 5, 2026
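
To make the sandbox description concrete, here is a minimal sketch of how a stdlib-only wrapper might assemble a container invocation. The function names, default limits, and the example image are assumptions for illustration, not the task's contract. The flags used (`--rm`, `--network=none`, `--memory`, `--cpus`, `--pids-limit`) are accepted by both `podman run` and `docker run`.

```python
import shutil
import subprocess


def build_run_argv(image, command, *, runtime="podman",
                   memory="256m", cpus="1", pids_limit=64):
    """Hypothetical helper: argv for an ephemeral, isolated, capped container."""
    return [
        runtime, "run",
        "--rm",                        # ephemeral: remove the container on exit
        "--network=none",              # no network access
        f"--memory={memory}",          # memory cap
        f"--cpus={cpus}",              # CPU cap
        f"--pids-limit={pids_limit}",  # cap on the number of processes
        image,
        *command,
    ]


def run_sandboxed(image, command, *, timeout=30.0):
    """Hypothetical helper: prefer Podman, fall back to Docker."""
    runtime = "podman" if shutil.which("podman") else "docker"
    argv = build_run_argv(image, command, runtime=runtime)
    # The timeout stops the client process; a real implementation may also
    # need to stop the container itself, which this sketch does not do.
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
```

A call such as `run_sandboxed("python:3.12-slim", ["python", "-c", "print(1)"])` would need a container runtime installed and the image available locally, since the network is disabled only inside the container and not for the pull.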

apply-edit python

A single-file Python module providing one operation: **search-replace patching of file contents**. This is the primitive that agent harnesses (Cursor, aider, Claude Code, etc.) use to translate model-emitted edits into actual file changes.

no runs yet — lineup not exercised against this task
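
As a rough illustration of the search-replace primitive that apply-edit describes, the sketch below patches a string only when the search text matches exactly once. The names `apply_edit` and `EditError`, and the decision to reject ambiguous matches, are assumptions; the task's SPEC.md may define different behavior.

```python
class EditError(ValueError):
    """Hypothetical error type: the edit could not be applied safely."""


def apply_edit(content: str, search: str, replace: str) -> str:
    """Replace the single occurrence of `search` in `content` with `replace`."""
    if not search:
        raise EditError("search text must not be empty")
    # str.count counts non-overlapping occurrences.
    count = content.count(search)
    if count == 0:
        raise EditError("search text not found")
    if count > 1:
        raise EditError(f"search text is ambiguous: {count} matches")
    return content.replace(search, replace, 1)
```

Rejecting zero or multiple matches, rather than guessing, keeps a model-emitted edit from landing in the wrong place. For example, `apply_edit("a = 1\nb = 2\n", "b = 2", "b = 3")` returns `"a = 1\nb = 3\n"`.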