sandbox

A Python module that wraps Podman (or Docker) to run commands inside ephemeral, network-isolated, resource-capped containers. Stdlib only.

language: python
entrypoint: sandbox.py
test runner: pytest
kind: code

latest run

round: May 5, 2026
winner: glm glm-5.1
score: 27.5/30 (spec 9.5 · qual 18.0)
models: 7
samples: 21
tests: all pass
spend: $0.968

contract

must

  • Python 3.10+, stdlib only — no pip install, no new dependencies.
  • Use subprocess.run(argv, shell=False) (or equivalent argv-list form).
  • network parameter defaults to "none". Any deviation is a fail.
  • Every podman invocation includes --memory, --pids-limit, --cpus, --cap-drop=ALL, --security-opt=no-new-privileges.
  • Output truncated at 50,000 bytes total.
  • Output format must match the example in SPEC.md exactly:
    exit=<n>
    --- stdout ---
    <stdout>
    --- stderr ---
    <stderr>

out of scope

  • Persistent sandboxes (sandbox_create / sandbox_exec / sandbox_destroy)
  • Image allowlist / trusted registries
  • Concurrent sandbox pooling
  • tmpfs-only workspace mode
  • Mia harness integration files (tools/sandbox.py etc.)

full contract

The raw documents handed to every model and every judge. Read these as the source of truth.

SPEC · spec.md · what done looks like

sandbox.py — implementation spec (v0.1)

A Python module that wraps Podman (or Docker) to run commands inside ephemeral, network-isolated, resource-capped containers. Stdlib only.

Public function

def sandbox_run(
    command: str,
    workspace: str | None = None,   # host dir to bind r/w at /workspace; None = no mount
    image: str = "debian:stable-slim",
    timeout: int = 60,
    network: str = "none",          # "none" | "bridge"
    memory: str = "2g",
    pids: int = 512,
    cpus: float = 2.0,
) -> str:

Behaviour

  • Runs command inside an ephemeral container of image.
  • If workspace is a path, it is bind-mounted read-write at /workspace, and the container's working directory is set to /workspace. If workspace is None, no host directory is mounted.
  • Container is destroyed after the command exits (--rm).
  • Network: none (default) means no DNS, no outbound. bridge enables default networking.
  • Resource limits applied on every run: --memory, --pids-limit, --cpus, --cap-drop=ALL, --security-opt=no-new-privileges.
  • Subprocess invocation uses an argv list with shell=False. The command string is passed as an argument to sh -c inside the container. The host shell must never interpolate it.
  • Wall-clock timeout enforced via subprocess.run(timeout=...). On timeout, the container is terminated and the returned string indicates a timeout.
  • The formatted return string is truncated to 50,000 bytes total after construction (i.e. apply truncation to the final string with the headers in place, not to stdout/stderr separately and not via a proportional split). Slice the tail; do not split mid-byte across a multibyte sequence (decode first, then truncate by characters that re-encode within the cap). Truncation is silent (no error), but a clear marker like ... [truncated] may be appended.
  • First-call latency is allowed: --pull=missing is fine for v0.1.
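
The truncation rule above can be sketched as follows. This is a minimal illustration, not the reference implementation: the helper name `truncate_to_bytes`, the constant names, and the exact marker text are assumptions. It encodes the already-decoded string, cuts at the byte budget, and drops any incomplete trailing multibyte fragment, which gives the same result as truncating by characters that re-encode within the cap.

```python
CAP_BYTES = 50_000
MARKER = "... [truncated]"  # marker text is an assumption; the spec only says "like"


def truncate_to_bytes(text: str, cap: int = CAP_BYTES, marker: str = MARKER) -> str:
    """Return text whose UTF-8 encoding is at most `cap` bytes.

    Assumes `cap` is at least as large as the encoded marker.
    """
    raw = text.encode("utf-8")
    if len(raw) <= cap:
        return text
    # Reserve room for the marker so the final string still fits the cap.
    budget = max(cap - len(marker.encode("utf-8")), 0)
    # The byte cut may land inside a multibyte sequence; errors="ignore"
    # drops that incomplete trailing fragment instead of emitting U+FFFD.
    head = raw[:budget].decode("utf-8", errors="ignore")
    return head + marker
```

Applying this to the fully formatted string (headers included) keeps the cap a property of the final return value, as the spec requires.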

Return format

A single string:

exit=<n>
--- stdout ---
<stdout bytes, decoded>
--- stderr ---
<stderr bytes, decoded>

Where <n> is the container's exit code (or 124 on timeout, matching GNU timeout convention). Decoding errors are replaced (errors="replace").
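
A minimal sketch of the exit-code and decoding behaviour described above, assuming a hypothetical helper `run_capture` and an optional `on_timeout` hook (neither is named in the spec). `subprocess.run` kills the child process it started when the timeout expires; whether the container itself needs extra cleanup depends on the runtime, so the hook is left to the caller.

```python
import subprocess

TIMEOUT_EXIT = 124  # GNU timeout convention, as stated in the spec


def run_capture(argv, timeout, on_timeout=None):
    """Run argv without a host shell; return (exit_code, stdout_text, stderr_text)."""
    try:
        proc = subprocess.run(argv, shell=False, capture_output=True, timeout=timeout)
        code, out, err = proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired as exc:
        # The child started by subprocess.run has already been killed here.
        # Container cleanup, if needed, goes in the caller-supplied hook.
        if on_timeout is not None:
            on_timeout()
        # Partial output may be None when nothing was captured.
        code, out, err = TIMEOUT_EXIT, exc.stdout or b"", exc.stderr or b""
    # errors="replace" means binary output never raises UnicodeDecodeError.
    return (
        code,
        out.decode("utf-8", errors="replace"),
        err.decode("utf-8", errors="replace"),
    )
```

With this shape, a timed-out run and a normal run flow through the same formatting path; only the exit code differs.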

Format rules (normative):

  • Each header (exit=<n>, --- stdout ---, --- stderr ---) is on its own line, terminated by \n.
  • The stdout body, if non-empty, comes immediately after the --- stdout ---\n line and ends with exactly one \n before the --- stderr --- header.
  • The stderr body, if non-empty, comes immediately after the --- stderr ---\n line. It may or may not end with a trailing newline (preserve whatever the underlying stream produced).
  • If a body is empty, the next header (or end of string) follows directly after the previous header line — no blank line is inserted.
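
One way to read the format rules above, as a sketch. The function name `format_result` is hypothetical, and collapsing several trailing newlines in stdout down to one (and treating newline-only stdout as empty) is an interpretation of "exactly one \n", not something the spec spells out.

```python
def format_result(exit_code: int, stdout: str, stderr: str) -> str:
    """Build the three-header return string from already-decoded streams."""
    parts = [f"exit={exit_code}\n", "--- stdout ---\n"]
    body = stdout.rstrip("\n")
    if body:
        # Non-empty stdout ends with exactly one newline before the next header.
        parts.append(body + "\n")
    parts.append("--- stderr ---\n")
    # stderr is preserved as produced, with or without a trailing newline.
    parts.append(stderr)
    return "".join(parts)
```

For example, `format_result(0, "hi\n", "")` yields `"exit=0\n--- stdout ---\nhi\n--- stderr ---\n"`, and an empty stdout places the stderr header directly after the stdout header.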

Podman invocation (reference)

podman run --rm --pull=missing \
    --network=<network> \
    --memory=<memory> \
    --pids-limit=<pids> \
    --cpus=<cpus> \
    --cap-drop=ALL \
    --security-opt=no-new-privileges \
    [-v <workspace>:/workspace:rw -w /workspace] \
    <image> sh -c "<command>"

If podman is not on PATH, fall back to docker. If neither is present, raise a clear RuntimeError mentioning both.
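
The reference invocation and the podman-to-docker fallback could be assembled roughly like this. The helper names `find_runtime` and `build_argv`, and the injectable `which` parameter, are assumptions made for illustration; the flags are the ones listed in the reference above.

```python
import shutil


def find_runtime(which=shutil.which) -> str:
    """Prefer podman, fall back to docker, otherwise raise RuntimeError."""
    for name in ("podman", "docker"):
        if which(name):
            return name
    raise RuntimeError(
        "no container runtime found: neither podman nor docker is on PATH"
    )


def build_argv(runtime, command, workspace=None, image="debian:stable-slim",
               network="none", memory="2g", pids=512, cpus=2.0):
    """Return the argv list for one ephemeral, resource-capped run."""
    argv = [
        runtime, "run", "--rm", "--pull=missing",
        f"--network={network}",
        f"--memory={memory}",
        f"--pids-limit={pids}",
        f"--cpus={cpus}",
        "--cap-drop=ALL",
        "--security-opt=no-new-privileges",
    ]
    if workspace is not None:
        argv += ["-v", f"{workspace}:/workspace:rw", "-w", "/workspace"]
    # command stays a single argv element, so no host shell ever parses it;
    # only sh inside the container interprets the string.
    return argv + [image, "sh", "-c", command]
```

Because the list is passed to `subprocess.run` with `shell=False`, characters such as `;` or `$` in `command` reach the container's `sh -c` untouched.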

Standalone CLI

sandbox.py must be runnable directly:

python sandbox.py [--image IMAGE] [--timeout N] [--network none|bridge]
                  [--memory SIZE] [--pids N] [--cpus N] [--workspace DIR]
                  -- COMMAND [ARG ...]

  • Use argparse.
  • The -- separator divides flags from the command. Everything after -- is joined with a single space and passed as command.
  • Default workspace for the CLI is os.getcwd() (so the user's working dir is mounted by default when invoked from the shell).
  • The script prints the formatted output string to stdout.
  • Exit code matches the container's exit code (so the CLI is composable with shell pipelines). Timeout exits with code 124.
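
A sketch of the CLI parsing rules above. The function name `parse_cli` is hypothetical, and splitting the argument list at the first `--` before handing the flags to argparse is one possible approach among several; argparse can also treat `--` as a positional separator on its own.

```python
import argparse
import os


def parse_cli(argv):
    """Parse flags before the first "--"; join everything after it as the command."""
    parser = argparse.ArgumentParser(prog="sandbox.py")
    parser.add_argument("--image", default="debian:stable-slim")
    parser.add_argument("--timeout", type=int, default=60)
    parser.add_argument("--network", choices=["none", "bridge"], default="none")
    parser.add_argument("--memory", default="2g")
    parser.add_argument("--pids", type=int, default=512)
    parser.add_argument("--cpus", type=float, default=2.0)
    # CLI default differs from the library default: mount the current directory.
    parser.add_argument("--workspace", default=os.getcwd())

    if "--" not in argv:
        parser.error("expected '-- COMMAND [ARG ...]'")
    split = argv.index("--")
    args = parser.parse_args(argv[:split])
    words = argv[split + 1:]
    if not words:
        parser.error("COMMAND is required after '--'")
    # Everything after "--" is joined with a single space.
    args.command = " ".join(words)
    return args
```

A `__main__` block would then call this with `sys.argv[1:]`, print the formatted string, and exit with the container's exit code (124 on timeout).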

Example

Library use:

from sandbox import sandbox_run
print(sandbox_run("echo hi"))
# exit=0
# --- stdout ---
# hi
# --- stderr ---

CLI use:

$ python sandbox.py -- echo hi
exit=0
--- stdout ---
hi
--- stderr ---

$ echo $?
0

Out of scope (do not implement)

  • Persistent sandboxes (sandbox_create / sandbox_exec / sandbox_destroy)
  • Image allowlist / trusted registries
  • Concurrent sandbox pooling
  • tmpfs-only workspace mode
  • Mia harness integration files (tools/sandbox.py etc.)

PRMT · prompt.md · what the model reads

Task: implement sandbox.py

Read SPEC.md in this directory. Implement sandbox.py exactly per spec: the sandbox_run(...) function, and the standalone CLI entry point.

This task covers only sandbox.py. The Mia harness integration files (tools/sandbox.py, repl/tool_registry.py, permissions.py) are out of scope; do not create them.

Hard constraints

  • Python 3.10+, stdlib only — no pip install, no new dependencies.
  • Use subprocess.run(argv, shell=False) (or equivalent argv-list form). Never invoke a host shell to interpolate command.
  • network parameter defaults to "none". Any deviation is a fail.
  • Every podman invocation includes --memory, --pids-limit, --cpus, --cap-drop=ALL, --security-opt=no-new-privileges.
  • Output truncated at 50,000 bytes total.
  • Output format must match the example in SPEC.md exactly:
    exit=<n>
    --- stdout ---
    <stdout>
    --- stderr ---
    <stderr>

Deliverable

A single file sandbox.py at the repo root that:

  1. Exposes sandbox_run(command, workspace=None, image="debian:stable-slim", timeout=60, network="none", memory="2g", pids=512, cpus=2.0) returning the formatted output string.
  2. Has a __main__ block exposing the CLI shape from SPEC.md.

What to do when finished

  1. Run these two smoke checks and confirm both work:
    • python sandbox.py -- echo hi → exit 0, output matches the format example in SPEC.md exactly (exit=0, then --- stdout ---, then hi, then --- stderr ---).
    • python sandbox.py --timeout 2 -- sleep 30 → exit 124, no traceback.
  2. Print the final sandbox.py contents to confirm.
  3. State: "Done. Implementation in sandbox.py."

You may write your own scratch tests in model_tests/ if useful — they will not be graded, but they may appear in the diff.

What NOT to do

  • Do not modify PROMPT.md or SPEC.md.
  • Do not add a requirements.txt, pyproject.toml, Pipfile, or any dependency manifest.
  • Do not create a virtualenv.
  • Do not implement persistent sandboxes, image allowlists, or any other item under "Out of scope (do not implement)" in SPEC.md.
  • Do not split the implementation across multiple modules — one file.

RUBR · judge_rubric.md · how judges score

Judge rubric: sandbox task

Fill one copy per implementation, saved as output/<label>_rubric.md. Also write output/<label>_scores.json with the structured form (see JUDGE_PROMPT.md).

Implementation reviewed: <label> (e.g. A, B, C)
File: implementations/<label>.py

Hard-fail (any miss = fail run)

Cite line numbers when something fails.

  • [ ] sandbox.py exists at the expected location (provided as <label>.py)
  • [ ] Top-level sandbox_run(...) function with signature matching SPEC
  • [ ] Subprocess invocation uses argv list / shell=False (no host shell interpolation of command)
  • [ ] network param defaults to "none"
  • [ ] No external Python dependencies (stdlib only — import statements reference only stdlib modules)

Hard-fail result: pass / fail
If fail, reasons (with line refs):

Spec compliance — score 0–10

Award 1 point per item present and correct. Cite line numbers for the items you award and for the ones you don't.

  • [ ] All param defaults match SPEC (image, timeout, network, memory, pids, cpus)
  • [ ] Resource limits on every podman invocation (--memory, --pids-limit, --cpus, --cap-drop=ALL, --security-opt=no-new-privileges)
  • [ ] Workspace mount semantics correct (None = no mount; path = bind r/w at /workspace with -w /workspace)
  • [ ] Output format matches exit=<n>\n--- stdout ---\n<...>\n--- stderr ---\n<...>
  • [ ] Output truncated at 50,000 bytes total
  • [ ] Timeout returns exit code 124
  • [ ] Decoding uses errors="replace" (or equivalent — no UnicodeDecodeError on binary output)
  • [ ] CLI: argparse + -- separator works
  • [ ] CLI: default workspace = os.getcwd()
  • [ ] CLI: exit code matches inner container exit

Subtotal: __ / 10
Notes:

Hidden tests

Skip — reviewer fills from <model>/<task>-<date>/test-output.txt.

You did not see the test suite. The aggregator wires objective test results into the final review.

Code quality — score each 0–5

  • [ ] Clarity — naming, structure, function decomposition (0–5): __
  • [ ] Conciseness — no over-engineering, no unused branches (0–5): __
  • [ ] Error handling — proportional, fails loud at boundaries (0–5): __
  • [ ] Comments — only where the why is non-obvious (0–5): __

Subtotal: __ / 20

One-line summary

(One sentence — what stood out.)

Verdict

ship-with-cleanup / rewrite / unusable