An adversarial pytest suite that attempts to escape a sandbox implementing the round-1 sandbox contract. Stdlib + pytest only.
- rounds: 1
- models: 7
- samples: 7
- tests pass: 100%
- spend: $0.375
- last run: May 14, 2026
Every task in the open-bench corpus. Each one is a SPEC.md plus a hidden test suite, which together form the contract that models are graded against. Tasks are versioned and live in bench/tasks/.
An adversarial pytest suite that attempts to escape a sandbox implementing the round-1 sandbox contract. Stdlib + pytest only.
A single-file Python module providing crash-safe file writes that **either fully succeed or leave the previous contents intact** — no partial writes, no `.tmp` residue, no torn data visible to readers.
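The all-or-nothing guarantee described above is commonly achieved by writing to a temporary file in the same directory, flushing it to disk, and then renaming it over the target. The following is a minimal sketch of that general pattern, not the task's reference solution; the name `atomic_write` is hypothetical, and the actual SPEC.md may require more (for example, fsyncing the parent directory).

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Hypothetical helper: replace the contents of `path` with `data`.

    Readers see either the old contents or the new contents, never a mix.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the final rename stays on
    # one filesystem, which is what makes os.replace atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".atomic-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            # Force the bytes to stable storage before the rename.
            os.fsync(handle.fileno())
        # Atomically swap the new file into place.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, remove the temp file so no residue is left behind.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process dies before `os.replace`, the original file is untouched; a leftover temp file after a hard crash is one case this sketch does not handle.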
latest winner mimo mimo-v2.5-pro 29.0/30
A Python module that wraps Podman (or Docker) to run commands inside ephemeral, network-isolated, resource-capped containers. Stdlib only.
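As a rough illustration of what such a container wrapper might look like, the sketch below builds a `podman run` command line with standard isolation flags and executes it via `subprocess`. The function names, defaults, and limits are assumptions chosen for illustration; the task's SPEC.md defines the real interface.

```python
import subprocess
from typing import List, Sequence


def build_run_argv(
    image: str,
    command: Sequence[str],
    runtime: str = "podman",  # "docker" accepts the same flags used here
    memory: str = "256m",
    cpus: str = "1.0",
    pids_limit: int = 64,
) -> List[str]:
    """Hypothetical helper: assemble the argv for an isolated, capped run."""
    return [
        runtime, "run",
        "--rm",                        # ephemeral: remove the container on exit
        "--network=none",              # no network access
        f"--memory={memory}",          # memory cap
        f"--cpus={cpus}",              # CPU cap
        f"--pids-limit={pids_limit}",  # process-count cap
        image,
        *command,
    ]


def run_isolated(
    image: str, command: Sequence[str], timeout: float = 30.0
) -> subprocess.CompletedProcess:
    """Run the command in a container and capture its output as text."""
    argv = build_run_argv(image, command)
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
```

Keeping argv construction separate from execution makes the isolation flags easy to inspect without a container runtime installed.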
latest winner glm glm-5.1 27.5/30
A single-file Python module providing one operation: **search-replace patching of file contents**. This is the primitive that agent harnesses (Cursor, aider, Claude Code, etc.) use to translate model-emitted edits into actual file changes.
no runs yet — lineup not exercised against this task
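For the search-replace patching task, a minimal sketch of the core operation might look like the following. It assumes exact-match semantics and rejects missing or ambiguous search blocks; the names `apply_search_replace`, `patch_file`, and `PatchError` are hypothetical, and the real SPEC.md may specify different matching rules.

```python
from pathlib import Path


class PatchError(ValueError):
    """Raised when a search block cannot be applied unambiguously."""


def apply_search_replace(text: str, search: str, replace: str) -> str:
    """Replace the single exact occurrence of `search` in `text`."""
    if not search:
        raise PatchError("search block must not be empty")
    count = text.count(search)
    if count == 0:
        raise PatchError("search block not found")
    if count > 1:
        # Refuse to guess which occurrence the edit was meant for.
        raise PatchError(f"search block is ambiguous ({count} matches)")
    return text.replace(search, replace, 1)


def patch_file(path: str, search: str, replace: str) -> None:
    """Apply the patch to a file on disk (plain, non-atomic write)."""
    target = Path(path)
    original = target.read_text(encoding="utf-8")
    target.write_text(apply_search_replace(original, search, replace), encoding="utf-8")
```

Failing loudly on zero or multiple matches is one possible design choice: it prevents a model-emitted edit from silently landing in the wrong place.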