sandbox
A Python module that wraps Podman (or Docker) to run commands inside ephemeral, network-isolated, resource-capped containers. Stdlib only.
The raw documents handed to every model and every judge. Read these as the source of truth.
def sandbox_run(
    command: str,
    workspace: str | None = None,  # host dir to bind r/w at /workspace; None = no mount
    image: str = "debian:stable-slim",
    timeout: int = 60,
    network: str = "none",  # "none" | "bridge"
    memory: str = "2g",
    pids: int = 512,
    cpus: float = 2.0,
) -> str:
- Runs command inside an ephemeral container of image.
- If workspace is a path, it is bind-mounted read-write at /workspace, and the container's working directory is set to /workspace. If workspace is None, no host directory is mounted.
- The container is removed on exit (--rm).
- Network: none (default) means no DNS, no outbound. bridge enables default networking.
- Resource limits: --memory, --pids-limit, --cpus, --cap-drop=ALL, --security-opt=no-new-privileges.
- The host invocation uses shell=False. The command string is passed as an argument to sh -c inside the container. The host shell must never interpolate it.
- Timeout is enforced via subprocess.run(timeout=...). On timeout, the container is terminated and the returned string indicates a timeout.
- ... [truncated] may be appended.
- --pull=missing is fine for v0.1.

A single string:
exit=<n>
--- stdout ---
<stdout bytes, decoded>
--- stderr ---
<stderr bytes, decoded>
Where <n> is the container's exit code (or 124 on timeout, matching
GNU timeout convention). Decoding errors are replaced (errors="replace").
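
The result string could be assembled along the following lines. This is a minimal sketch, not the reference implementation: the helper name `format_result` is hypothetical, UTF-8 is an assumed encoding (the spec only requires `errors="replace"`), and appending a newline to non-empty stdout that lacks one is one possible reading of the format rules.

```python
def format_result(exit_code: int, stdout: bytes, stderr: bytes) -> str:
    """Render exit code, stdout and stderr in the exit=/stdout/stderr layout."""
    # Assumption: UTF-8. Undecodable bytes become U+FFFD instead of raising.
    out = stdout.decode("utf-8", errors="replace")
    err = stderr.decode("utf-8", errors="replace")
    # Assumption: add a newline only when non-empty stdout lacks one, so the
    # stderr header always starts on its own line.
    if out and not out.endswith("\n"):
        out += "\n"
    # stderr is appended untouched, with or without a trailing newline.
    return f"exit={exit_code}\n--- stdout ---\n{out}--- stderr ---\n{err}"
```

With this reading, `format_result(0, b"hi\n", b"")` yields the same four lines as the `echo hi` example in this spec.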
Format rules (normative):
- Each header (exit=<n>, --- stdout ---, --- stderr ---) is on its own line, terminated by \n.
- The stdout section starts right after the --- stdout ---\n line and ends with exactly one \n before the --- stderr --- header.
- The stderr section starts right after the --- stderr ---\n line. It may or may not end with a trailing newline (preserve whatever the underlying stream produced).

podman run --rm --pull=missing \
--network=<network> \
--memory=<memory> \
--pids-limit=<pids> \
--cpus=<cpus> \
--cap-drop=ALL \
--security-opt=no-new-privileges \
[-v <workspace>:/workspace:rw -w /workspace] \
<image> sh -c "<command>"
If podman is not on PATH, fall back to docker. If neither is present,
raise a clear RuntimeError mentioning both.
sandbox.py must be runnable directly:
python sandbox.py [--image IMAGE] [--timeout N] [--network none|bridge]
[--memory SIZE] [--pids N] [--cpus N] [--workspace DIR]
-- COMMAND [ARG ...]
- Flags are parsed with argparse.
- The -- separator divides flags from the command. Everything after -- is joined with a single space and passed as command.
- The default workspace for the CLI is os.getcwd() (so the user's working dir is mounted by default when invoked from the shell).

Library use:
from sandbox import sandbox_run
print(sandbox_run("echo hi"))
# exit=0
# --- stdout ---
# hi
# --- stderr ---
CLI use:
$ python sandbox.py -- echo hi
exit=0
--- stdout ---
hi
--- stderr ---
$ echo $?
0
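
One way to parse the CLI shape described above is to split on the first `--` before handing the flags to argparse, so that options belonging to the command are never interpreted as sandbox flags. This is a sketch under assumptions: the function name `parse_cli` and the usage message are hypothetical, and the handling of a missing `--` is not specified by the spec.

```python
import argparse
import os


def parse_cli(argv: list[str]) -> argparse.Namespace:
    """Parse sandbox flags; everything after the first "--" is the command."""
    if "--" not in argv:
        # Assumption: a missing separator is treated as a usage error.
        raise SystemExit("usage: sandbox.py [flags] -- COMMAND [ARG ...]")
    split = argv.index("--")
    flags, command_parts = argv[:split], argv[split + 1:]

    parser = argparse.ArgumentParser(prog="sandbox.py")
    parser.add_argument("--image", default="debian:stable-slim")
    parser.add_argument("--timeout", type=int, default=60)
    parser.add_argument("--network", choices=["none", "bridge"], default="none")
    parser.add_argument("--memory", default="2g")
    parser.add_argument("--pids", type=int, default=512)
    parser.add_argument("--cpus", type=float, default=2.0)
    # CLI default differs from the library default (None): mount the cwd.
    parser.add_argument("--workspace", default=os.getcwd())

    args = parser.parse_args(flags)
    # Join with single spaces, as the spec requires.
    args.command = " ".join(command_parts)
    return args
```

For example, `parse_cli(["--timeout", "2", "--", "sleep", "30"])` gives `timeout == 2` and `command == "sleep 30"`.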
- (sandbox_create / sandbox_exec / sandbox_destroy)
- (tools/sandbox.py etc.)

sandbox.py

Read SPEC.md in this directory. Implement sandbox.py exactly per spec:
the sandbox_run(...) function, and the standalone CLI entry point.
This task covers only sandbox.py. The mia harness integration files
(tools/sandbox.py, repl/tool_registry.py, permissions.py) are out of
scope — do not create them.
- No pip install, no new dependencies.
- Use subprocess.run(argv, shell=False) (or equivalent argv-list form). Never invoke a host shell to interpolate command.
- The network parameter defaults to "none". Any deviation is a fail.
- Always pass --memory, --pids-limit, --cpus, --cap-drop=ALL, --security-opt=no-new-privileges.

Output format:

exit=<n>
--- stdout ---
<stdout>
--- stderr ---
<stderr>
A single file sandbox.py at the repo root that:
- Defines sandbox_run(command, workspace=None, image="debian:stable-slim", timeout=60, network="none", memory="2g", pids=512, cpus=2.0) returning the formatted output string.
- Has a __main__ block exposing the CLI shape from SPEC.md.

- python sandbox.py -- echo hi → exit 0, output matches the format example in SPEC.md exactly (exit=0, then --- stdout ---, then hi, then --- stderr ---).
- python sandbox.py --timeout 2 -- sleep 30 → exit 124, no traceback.
- sandbox.py contents to confirm.

You may write your own scratch tests in model_tests/ if useful; they will not be graded, but they may appear in the diff.
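
The timeout check above (exit 124, no traceback) implies catching `subprocess.TimeoutExpired` rather than letting it propagate. A minimal sketch follows, with a local Python child standing in for the container; `run_with_timeout`, `TIMEOUT_EXIT_CODE`, and `SLEEPER` are hypothetical names.

```python
import subprocess
import sys

TIMEOUT_EXIT_CODE = 124  # GNU timeout convention, as required by SPEC.md

# Stand-in for a slow container command, used only for demonstration.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(5)"]


def run_with_timeout(argv: list[str], timeout: float) -> tuple[int, bytes, bytes]:
    """Run argv without a host shell; map a timeout to exit code 124."""
    try:
        proc = subprocess.run(argv, shell=False, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising. Captured output
        # may be None, so fall back to empty bytes.
        return TIMEOUT_EXIT_CODE, exc.stdout or b"", exc.stderr or b""
    return proc.returncode, proc.stdout, proc.stderr
```

Note that killing the client process may not by itself stop the container. A fuller implementation might, as an assumption beyond this sketch, assign a container name and remove it explicitly after a timeout.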
- Do not modify PROMPT.md or SPEC.md.
- Do not add requirements.txt, pyproject.toml, Pipfile, or any dependency manifest.

Fill one copy per implementation, saved as output/<label>_rubric.md.
Also write output/<label>_scores.json with the structured form (see
JUDGE_PROMPT.md).
Implementation reviewed: <label> (e.g. A, B, C)
File: implementations/<label>.py
Cite line numbers when something fails.
- sandbox.py exists at the expected location (provided as <label>.py)
- Defines a sandbox_run(...) function with signature matching SPEC
- Host invocation uses shell=False (no host shell interpolation of command)
- network param defaults to "none"
- Stdlib only (import statements reference only stdlib modules)

Hard-fail result: pass / fail
If fail, reasons (with line refs):
Award 1 point per item present and correct. Cite line numbers for the items you award and for the ones you don't.
- Parameters (image, timeout, network, memory, pids, cpus)
- Resource and security flags (--memory, --pids-limit, --cpus, --cap-drop=ALL, --security-opt=no-new-privileges)
- Workspace (None = no mount; path = bind r/w at /workspace with -w /workspace)
- Output format: exit=<n>\n--- stdout ---\n<...>\n--- stderr ---\n<...>
- Decoding with errors="replace" (or equivalent; no UnicodeDecodeError on binary output)
- CLI: -- separator works
- CLI: default workspace is os.getcwd()

Subtotal: __ / 10
Notes:
Skip — reviewer fills from <model>/<task>-<date>/test-output.txt.
You did not see the test suite. The aggregator wires objective test results into the final review.
Subtotal: __ / 20
(One sentence — what stood out.)
ship-with-cleanup / rewrite / unusable