full contract

The raw documents handed to every model and every judge. Read these as the source of truth.

SPEC spec.md what done looks like

sandbox.py — implementation spec (v0.1)

A Python module that wraps Podman (or Docker) to run commands inside ephemeral, network-isolated, resource-capped containers. Stdlib only.

Public function

def sandbox_run(
    command: str,
    workspace: str | None = None,   # host dir to bind r/w at /workspace; None = no mount
    image: str = "debian:stable-slim",
    timeout: int = 60,
    network: str = "none",          # "none" | "bridge"
    memory: str = "2g",
    pids: int = 512,
    cpus: float = 2.0,
) -> str:

Behaviour

Runs command inside an ephemeral container of image.
If workspace is a path, it is bind-mounted read-write at /workspace, and the container's working directory is set to /workspace. If workspace is None, no host directory is mounted.
Container is destroyed after the command exits (--rm).
Network: none (default) means no DNS, no outbound. bridge enables default networking.
Resource limits applied on every run: --memory, --pids-limit, --cpus, --cap-drop=ALL, --security-opt=no-new-privileges.
Subprocess invocation uses an argv list with shell=False. The command string is passed as an argument to sh -c inside the container. The host shell must never interpolate it.
Wall-clock timeout enforced via subprocess.run(timeout=...). On timeout, the container is terminated and the returned string indicates a timeout.
The formatted return string is truncated to 50,000 bytes total after construction (i.e. apply truncation to the final string with the headers in place, not to stdout/stderr separately and not via a proportional split). Slice the tail; do not split mid-byte across a multibyte sequence (decode first, then truncate by characters that re-encode within the cap). Truncation is silent (no error), but a clear marker like ... [truncated] may be appended.
First-call latency is allowed: --pull=missing is fine for v0.1.
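
The truncation rule above can be sketched as follows. This is a minimal illustration, not the reference implementation: the helper name truncate_output, the constants, and the UTF-8 encoding are assumptions, and it reads "slice the tail" as "keep the head, drop the tail, then append the marker".

```python
MAX_BYTES = 50_000            # cap from the spec
MARKER = "... [truncated]"    # optional marker suggested by the spec


def truncate_output(text: str, limit: int = MAX_BYTES, marker: str = MARKER) -> str:
    """Cap the final formatted string at `limit` bytes (hypothetical helper).

    Assumes UTF-8 and that `limit` is at least as large as the encoded marker.
    """
    data = text.encode("utf-8")
    if len(data) <= limit:
        return text
    # Reserve room for the marker so the result, marker included, fits the cap.
    budget = max(limit - len(marker.encode("utf-8")), 0)
    # errors="ignore" drops a partial multibyte sequence left at the cut point,
    # so the result never ends mid-character.
    head = data[:budget].decode("utf-8", errors="ignore")
    return head + marker
```

Applying the helper once to the fully formatted string, rather than to stdout and stderr separately, matches the "after construction" wording.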

Return format

A single string:

exit=<n>
--- stdout ---
<stdout bytes, decoded>
--- stderr ---
<stderr bytes, decoded>

Where <n> is the container's exit code (or 124 on timeout, matching GNU timeout convention). Decoding errors are replaced (errors="replace").
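
One way to map a timeout onto exit code 124 is sketched below. The helper name run_argv is hypothetical. The sketch only shows the subprocess side: depending on the runtime, the container may outlive the killed client process, so a real implementation might also need to stop or remove it explicitly.

```python
import subprocess

TIMEOUT_EXIT = 124  # GNU timeout convention, as required by the spec


def run_argv(argv: list[str], timeout: float) -> tuple[int, bytes, bytes]:
    """Run an argv list without a host shell; return (exit code, stdout, stderr)."""
    try:
        proc = subprocess.run(argv, shell=False, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child on timeout; partial output may be None.
        return TIMEOUT_EXIT, exc.stdout or b"", exc.stderr or b""
    return proc.returncode, proc.stdout, proc.stderr
```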

Format rules (normative):

Each header (exit=<n>, --- stdout ---, --- stderr ---) is on its own line, terminated by \n.
The stdout body, if non-empty, comes immediately after the --- stdout ---\n line and ends with exactly one \n before the --- stderr --- header.
The stderr body, if non-empty, comes immediately after the --- stderr ---\n line. It may or may not end with a trailing newline (preserve whatever the underlying stream produced).
If a body is empty, the next header (or end of string) follows directly after the previous header line — no blank line is inserted.
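
A minimal sketch of the format rules, for illustration only. The helper name format_result is hypothetical, UTF-8 is an assumed codec (the spec only requires errors="replace"), and collapsing trailing stdout newlines to exactly one is one reading of the second rule.

```python
def format_result(exit_code: int, stdout: bytes, stderr: bytes) -> str:
    """Build the exit/stdout/stderr string described by the format rules."""
    out = stdout.decode("utf-8", errors="replace")
    err = stderr.decode("utf-8", errors="replace")
    parts = [f"exit={exit_code}\n", "--- stdout ---\n"]
    if out:
        # Non-empty stdout ends with exactly one newline before the next header.
        parts.append(out.rstrip("\n") + "\n")
    parts.append("--- stderr ---\n")
    # stderr is preserved as produced, with or without a trailing newline.
    parts.append(err)
    return "".join(parts)
```

With empty bodies the headers follow each other directly, so format_result(0, b"", b"") yields three header lines and nothing else.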

Podman invocation (reference)

podman run --rm --pull=missing \
    --network=<network> \
    --memory=<memory> \
    --pids-limit=<pids> \
    --cpus=<cpus> \
    --cap-drop=ALL \
    --security-opt=no-new-privileges \
    [-v <workspace>:/workspace:rw -w /workspace] \
    <image> sh -c "<command>"

If podman is not on PATH, fall back to docker. If neither is present, raise a clear RuntimeError mentioning both.

Standalone CLI

sandbox.py must be runnable directly:

python sandbox.py [--image IMAGE] [--timeout N] [--network none|bridge]
                  [--memory SIZE] [--pids N] [--cpus N] [--workspace DIR]
                  -- COMMAND [ARG ...]

Use argparse.
The -- separator divides flags from the command. Everything after -- is joined with a single space and passed as command.
Default workspace for the CLI is os.getcwd() (so the user's working dir is mounted by default when invoked from the shell).
The script prints the formatted output string to stdout.
Exit code matches the container's exit code (so the CLI is composable with shell pipelines). Timeout exits with code 124.
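
A sketch of the CLI parsing described above. Splitting the argument list at the first -- by hand before calling argparse is a design choice of this sketch, not something the spec mandates, and parse_cli is a hypothetical name.

```python
import argparse
import os


def parse_cli(argv: list[str]) -> tuple[argparse.Namespace, str]:
    """Split flags from the command at the first "--" and parse the flags."""
    if "--" not in argv:
        raise SystemExit("usage: sandbox.py [options] -- COMMAND [ARG ...]")
    split = argv.index("--")
    flags, command_parts = argv[:split], argv[split + 1:]

    parser = argparse.ArgumentParser(prog="sandbox.py")
    parser.add_argument("--image", default="debian:stable-slim")
    parser.add_argument("--timeout", type=int, default=60)
    parser.add_argument("--network", choices=["none", "bridge"], default="none")
    parser.add_argument("--memory", default="2g")
    parser.add_argument("--pids", type=int, default=512)
    parser.add_argument("--cpus", type=float, default=2.0)
    # CLI default differs from the library default: mount the current directory.
    parser.add_argument("--workspace", default=os.getcwd())
    args = parser.parse_args(flags)

    # Everything after "--" is joined with a single space, per the spec.
    return args, " ".join(command_parts)
```

A __main__ block could then pass the parsed values to sandbox_run, print the result, and call sys.exit with the value parsed from the exit=<n> line.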

Example

Library use:

from sandbox import sandbox_run
print(sandbox_run("echo hi"))
# exit=0
# --- stdout ---
# hi
# --- stderr ---

CLI use:

$ python sandbox.py -- echo hi
exit=0
--- stdout ---
hi
--- stderr ---

$ echo $?
0

Out of scope (do not implement)

Persistent sandboxes (sandbox_create / sandbox_exec / sandbox_destroy)
Image allowlist / trusted registries
Concurrent sandbox pooling
tmpfs-only workspace mode
Mia harness integration files (tools/sandbox.py etc.)

PRMT prompt.md what the model reads

Task: implement `sandbox.py`

Read SPEC.md in this directory. Implement sandbox.py exactly per spec: the sandbox_run(...) function, and the standalone CLI entry point.

This task covers only sandbox.py. The mia harness integration files (tools/sandbox.py, repl/tool_registry.py, permissions.py) are out of scope — do not create them.

Hard constraints

Python 3.10+, stdlib only — no pip install, no new dependencies.
Use subprocess.run(argv, shell=False) (or equivalent argv-list form). Never invoke a host shell to interpolate command.
network parameter defaults to "none". Any deviation is a fail.
Every podman invocation includes --memory, --pids-limit, --cpus, --cap-drop=ALL, --security-opt=no-new-privileges.
Output truncated at 50,000 bytes total.

Output format must match the example in SPEC.md exactly:

exit=<n>
--- stdout ---
<stdout>
--- stderr ---
<stderr>

Deliverable

A single file sandbox.py at the repo root that:

Exposes sandbox_run(command, workspace=None, image="debian:stable-slim", timeout=60, network="none", memory="2g", pids=512, cpus=2.0) returning the formatted output string.
Has a __main__ block exposing the CLI shape from SPEC.md.

What to do when finished

Run these two smoke checks and confirm both work:
- python sandbox.py -- echo hi → exit 0, output matches the format example in SPEC.md exactly (exit=0, then --- stdout ---, then hi, then --- stderr ---).
- python sandbox.py --timeout 2 -- sleep 30 → exit 124, no traceback.
Print the final sandbox.py contents to confirm.
State: "Done. Implementation in sandbox.py."

You may write your own scratch tests in model_tests/ if useful — they will not be graded, but they may appear in the diff.

What NOT to do

Do not modify PROMPT.md or SPEC.md.
Do not add a requirements.txt, pyproject.toml, Pipfile, or any dependency manifest.
Do not create a virtualenv.
Do not implement persistent sandboxes, image allowlists, or any item under "Out of scope (do not implement)" in SPEC.md.
Do not split the implementation across multiple modules — one file.

RUBR judge_rubric.md how judges score

Judge rubric: sandbox task

Fill one copy per implementation, saved as output/<label>_rubric.md. Also write output/<label>_scores.json with the structured form (see JUDGE_PROMPT.md).

Implementation reviewed: <label> (e.g. A, B, C)
File: implementations/<label>.py

Hard-fail (any miss = fail run)

Cite line numbers when something fails.

[ ] sandbox.py exists at the expected location (provided as <label>.py)
[ ] Top-level sandbox_run(...) function with signature matching SPEC
[ ] Subprocess invocation uses argv list / shell=False (no host shell interpolation of command)
[ ] network param defaults to "none"
[ ] No external Python dependencies (stdlib only — import statements reference only stdlib modules)

Hard-fail result: pass / fail
If fail, reasons (with line refs):

Spec compliance — score 0–10

Award 1 point per item present and correct. Cite line numbers for the items you award and for the ones you don't.

[ ] All param defaults match SPEC (image, timeout, network, memory, pids, cpus)
[ ] Resource limits on every podman invocation (--memory, --pids-limit, --cpus, --cap-drop=ALL, --security-opt=no-new-privileges)
[ ] Workspace mount semantics correct (None = no mount; path = bind r/w at /workspace with -w /workspace)
[ ] Output format matches exit=<n>\n--- stdout ---\n<...>\n--- stderr ---\n<...>
[ ] 50KB output truncation
[ ] Timeout returns exit code 124
[ ] Decoding uses errors="replace" (or equivalent — no UnicodeDecodeError on binary output)
[ ] CLI: argparse + -- separator works
[ ] CLI: default workspace = os.getcwd()
[ ] CLI: exit code matches inner container exit

Subtotal: __ / 10
Notes:

Hidden tests

Skip — reviewer fills from <model>/<task>-<date>/test-output.txt.

You did not see the test suite. The aggregator wires objective test results into the final review.

Code quality — score each 0–5

[ ] Clarity — naming, structure, function decomposition (0–5): __
[ ] Conciseness — no over-engineering, no unused branches (0–5): __
[ ] Error handling — proportional, fails loud at boundaries (0–5): __
[ ] Comments — only where the why is non-obvious (0–5): __

Subtotal: __ / 20

One-line summary

(One sentence — what stood out.)

Verdict

ship-with-cleanup / rewrite / unusable
