benchmarks

Every task in the open-bench corpus. Each one is a SPEC.md plus a hidden test suite, which together form the contract that models are graded against. Tasks are versioned and live in bench/tasks/.

3 tasks
2 rounds
28 samples
$1.23 total spend
apply-edit python

A single-file Python module providing one operation: **search-replace patching of file contents**. This is the primitive that agent harnesses (Cursor, aider, Claude Code, etc.) use to translate model-emitted edits into actual file changes.
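To make the operation concrete, here is a minimal sketch of what search-replace patching might look like. The task's SPEC.md is not reproduced on this page, so the function name `apply_edit`, the exactly-one-match rule, and the error behavior are all assumptions for illustration, not the graded contract.

```python
def apply_edit(content: str, search: str, replace: str) -> str:
    """Return `content` with the single occurrence of `search` replaced.

    Hypothetical signature and rules: the real SPEC.md may differ.
    """
    if not search:
        # Assumption: an empty search block is rejected rather than
        # treated as "insert at start of file".
        raise ValueError("search block must be non-empty")

    # str.count counts non-overlapping occurrences.
    matches = content.count(search)
    if matches == 0:
        raise ValueError("search block not found in content")
    if matches > 1:
        # Assumption: ambiguous edits are refused instead of patching
        # the first match silently.
        raise ValueError(f"search block is ambiguous ({matches} matches)")

    return content.replace(search, replace, 1)


original = "a = 1\nb = 2\n"
patched = apply_edit(original, "b = 2", "b = 3")
print(patched)  # prints "a = 1" then "b = 3"
```

Requiring a unique match is one plausible design choice: it forces the model to emit enough surrounding context to pin down the edit location, at the cost of rejecting edits that a more lenient patcher would apply.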

no runs yet — lineup not exercised against this task