benchmarks

Every task in the open-bench corpus. Each one is a SPEC.md plus a hidden test suite, which together form the contract that models are graded against. Tasks are versioned and live in bench/tasks/.

4 tasks
3 rounds
35 samples
$1.60 total spend
break-sandbox python

An adversarial pytest suite that attempts to escape a sandbox implementing the round-1 sandbox contract. Stdlib + pytest only.

rounds: 1
models: 7
samples: 7
tests pass: 100%
spend: $0.375
last run: May 14, 2026
apply-edit python

A single-file Python module providing one operation: **search-replace patching of file contents**. This is the primitive that agent harnesses (Cursor, aider, Claude Code, etc.) use to translate model-emitted edits into actual file changes.
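
To make the idea of search-replace patching concrete, here is a minimal sketch of what such a primitive might look like. The names `apply_edit` and `EditError` are hypothetical, and so is the rule that the search text must match exactly once. None of this is taken from the task's SPEC.md, which may define a different interface and different edge-case behavior.

```python
class EditError(ValueError):
    """Raised when a search-replace edit cannot be applied unambiguously."""


def apply_edit(content: str, search: str, replace: str) -> str:
    """Return `content` with the single occurrence of `search` replaced.

    Assumption (not from the SPEC): an edit is rejected unless the search
    text appears exactly once, so a patch never lands in the wrong place.
    """
    if not search:
        raise EditError("search text must be non-empty")

    # str.count reports non-overlapping occurrences of the search text.
    count = content.count(search)
    if count == 0:
        raise EditError("search text not found")
    if count > 1:
        raise EditError(f"search text is ambiguous ({count} matches)")

    # Exactly one match: replace it and leave everything else untouched.
    return content.replace(search, replace, 1)


original = "def greet():\n    print('hello')\n"
patched = apply_edit(original, "print('hello')", "print('hello, world')")
```

Rejecting zero or multiple matches, rather than silently replacing the first hit, is one common design choice for this kind of primitive: a harness can report the failure back to the model instead of writing a patch to the wrong location.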

no runs yet — lineup not exercised against this task