12 changes: 11 additions & 1 deletion README.md
@@ -1,5 +1,11 @@
# Snowl

[![CI](https://github.com/Qitor/snowl/actions/workflows/ci.yml/badge.svg)](https://github.com/Qitor/snowl/actions/workflows/ci.yml)
![Python](https://img.shields.io/badge/python-%3E%3D3.10-blue)
![Docker Sandbox](https://img.shields.io/badge/docker--sandbox-ready-2496ED)
![Benchmarks](https://img.shields.io/badge/benchmarks-20%2B-success)
![License](https://img.shields.io/badge/license-see%20repo-lightgrey)

[English](./README.md) | [简体中文](./README.zh-CN.md)

Snowl is an open-source safety evaluation framework for AI agents.
@@ -49,7 +55,11 @@ change.
- Built-in agent evaluator primitives for answer matching, function-call
matching, tool trace policy, canary leakage, workspace/state checks, command
checks, checkpoint scoring, rubric judging, and grouped metrics
- Local runtime orchestration for terminal and GUI-style benchmark tasks
- Phase-aware local runtime orchestration for terminal, GUI, sandbox, and
container-backed benchmark tasks
- Runtime-owned isolated workspaces with before/after snapshots, diff metadata,
and artifact collection hooks
- Runtime-owned container cleanup for compose and Docker container providers
- Provider-aware concurrency controls for OpenAI-compatible model clients
- Automatic live artifacts: `manifest.json`, `plan.json`, `events.jsonl`,
`runtime_state.json`, `outcomes.json`, `aggregate.json`, CSV exports, and
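The live artifacts listed above are plain JSON/JSONL, so a finished run can be inspected with a short script. A minimal sketch, assuming only the file name `events.jsonl` from the list above; the `event` field name and the demo directory are illustrative, not a documented schema:

```python
import json
import tempfile
from pathlib import Path


def summarize_events(run_dir: str) -> dict:
    """Count events per event type in a run's events.jsonl (field names assumed)."""
    counts: dict = {}
    path = Path(run_dir) / "events.jsonl"
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        kind = str(event.get("event", "unknown"))
        counts[kind] = counts.get(kind, 0) + 1
    return counts


# Demo against a synthetic run directory (not a real Snowl run):
tmp = Path(tempfile.mkdtemp())
(tmp / "events.jsonl").write_text(
    '{"event": "trial.start"}\n{"event": "trial.end"}\n{"event": "trial.start"}\n',
    encoding="utf-8",
)
print(summarize_events(str(tmp)))  # → {'trial.start': 2, 'trial.end': 1}
```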
22 changes: 12 additions & 10 deletions docs/architecture/runtime_and_scheduler.md
@@ -72,14 +72,16 @@ Current implementation details:

- Contract normalization lives in `snowl/runtime/container_contract.py`.
- Runtime registration and cleanup ownership live in `snowl/runtime/container_lifecycle.py`.
- Runtime-owned per-trial workspace materialization and snapshots live in `snowl/runtime/workspace.py`.
- `prepare_trial_phase()` injects both `__snowl_container_session` and `__snowl_runtime_container_spec` into agent context.
- When a sample declares workspace inputs, runtime injects `__snowl_workspace`, `workspace_dir`, and before/after snapshot metadata for scorers.
- TerminalBench and OSWorld example agents now treat a missing runtime-managed session as a runtime contract violation.
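The `__snowl_workspace` and `workspace_dir` key names above come from the runtime contract; everything else in this sketch — the snapshot field names, the flat file lists, and the helper itself — is a simplified stand-in for how a scorer might consume the injected before/after metadata:

```python
def changed_paths(context_metadata: dict) -> list:
    """Illustrative reader for runtime-injected workspace metadata.

    The "before_snapshot"/"after_snapshot" shapes here are assumptions for
    the sketch, not the real snapshot schema.
    """
    workspace = context_metadata.get("__snowl_workspace") or {}
    before = set((workspace.get("before_snapshot") or {}).get("files", []))
    after = set((workspace.get("after_snapshot") or {}).get("files", []))
    # Symmetric difference catches added/removed paths; a real diff would
    # also compare contents to catch modified files.
    return sorted(before ^ after)


meta = {
    "__snowl_workspace": {
        "workspace_dir": "/tmp/trial-0",
        "before_snapshot": {"files": ["src/app.py"]},
        "after_snapshot": {"files": ["src/app.py", "src/util.py"]},
    }
}
print(changed_paths(meta))  # → ['src/util.py']
```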

What this does not mean yet:

- runtime-owned containers are not warm-pooled by default
- `spec_hash` does not yet drive dispatch priority or reuse
- `max_container_slots` is still not a universal admission gate across every benchmark container path
- `max_container_slots` gates runtime-managed container prepare through `begin_prepare()`

## Planner / Eval / Runtime Relationship

@@ -133,10 +135,10 @@ The main eval loop in `snowl/eval.py` is the real runtime behavior for repo-level
5. For each trial:
- delegate one-trial side effects to internal `EvalTrialLifecycle`
- construct `TrialRequest`
- call `prepare_trial_phase(request)` under `scheduler.running_trial_slot()`
- call `execute_agent_phase(prepared)` under the same running-trial admission
- call `score_trial_phase(prepared, partial)` under `scheduler.scoring_slot()`
- call `finalize_trial_phase(prepared, outcome)` after scoring
- call `prepare_trial_phase(request)` under `scheduler.begin_prepare(...)`
- call `execute_agent_phase(prepared)` under `scheduler.begin_execute(...)`
- call `score_trial_phase(prepared, partial)` under `scheduler.begin_score(...)`
- call `finalize_trial_phase(prepared, outcome)` under `scheduler.begin_finalize(...)`
- record the recovery attempt
- schedule deferred auto retry if the outcome is retry-eligible
6. In the run `finally` path:
@@ -157,18 +159,18 @@ The main eval loop and `execute_trial()` are now aligned on phase order:
- score
- finalize

The remaining mismatch is not phase omission; it is phase admission depth. Prepare still happens while the trial is already holding `running_trial_slot()`, and finalize is still a helper call rather than a separately scheduled phase.
The remaining mismatch is dispatch depth, not phase admission. The outer eval loop still schedules whole trial coroutines, while the lifecycle admits prepare, execute, score, and finalize separately inside each coroutine.
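The dispatch-depth point can be sketched with plain `asyncio` primitives. This is not the real `ResourceScheduler` API — the semaphores stand in for `begin_prepare(...)` through `begin_finalize(...)`, and the sizes are arbitrary:

```python
import asyncio

# Per-phase admission: the outer loop launches whole-trial coroutines,
# while each coroutine acquires phase slots one phase at a time.
PREPARE = asyncio.Semaphore(2)   # stand-in for begin_prepare(...)
EXECUTE = asyncio.Semaphore(4)   # stand-in for begin_execute(...)
SCORE = asyncio.Semaphore(2)     # stand-in for begin_score(...)
FINALIZE = asyncio.Semaphore(4)  # stand-in for begin_finalize(...)


async def run_trial(trial_id: int, log: list) -> None:
    async with PREPARE:
        log.append(("prepare", trial_id))
    async with EXECUTE:
        log.append(("execute", trial_id))
    async with SCORE:
        log.append(("score", trial_id))
    async with FINALIZE:
        log.append(("finalize", trial_id))


async def main() -> list:
    log: list = []
    # Coroutine-level dispatch: all trials are launched up front; only the
    # phase semaphores bound per-phase concurrency.
    await asyncio.gather(*(run_trial(i, log) for i in range(3)))
    return log


log = asyncio.run(main())
assert len(log) == 12  # 3 trials x 4 phases
```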

## Known Contract Mismatches

These are confirmed mismatches between exposed runtime surfaces and the main eval-loop behavior.

- `prepare_trial_phase()` is a real helper, but the main eval loop still admits it under `scheduler.running_trial_slot()` semantics. Future scheduler work must not describe prepare as independently admitted today.
- `prepare_trial_phase()` is admitted through `scheduler.begin_prepare(...)`, but the outer dispatch loop is still not a dedicated prepare worker pool.
- Provider budgets are enforced most strongly at model-call time through `OpenAICompatibleChatClient.set_global_model_call_slot_resolver(...)` and `_acquire_model_slot()`. The dispatch loop does not currently choose the next trial based on provider headroom.
- Runtime-owned container cleanup is centralized for runtime-managed resources, but `max_container_slots` still does not serve as a universal dispatcher gate for every benchmark container prepare path.
- Runtime-owned container cleanup is centralized for runtime-managed resources, and `max_container_slots` gates `runtime_container.requires_container` provider prepare paths.
- `spec_hash` is computed from the normalized container contract and carried into trial payload/trace, but it does not drive dispatch priority, batching, locality-aware reuse, or warm-pool preference.
- `TaskExecutionPlan` and `TrialDescriptor` exist on `TrialRequest`, but `run_eval_with_components()` does not populate them for repo-level runs. Their presence is not proof of plan-aware scheduling.
- `begin_prepare()` and `begin_finalize()` exist on `ResourceScheduler`, but the main eval loop uses only `running_trial_slot()` and `scoring_slot()` directly.
- `TaskExecutionPlan` and `TrialDescriptor` are populated on `TrialRequest` for phase admission, but their presence is not proof of plan-aware dispatch ordering.
- `begin_prepare()`, `begin_execute()`, `begin_score()`, and `begin_finalize()` are all exercised by `EvalTrialLifecycle`; none of these scheduler phase APIs is dormant anymore.
- Benchmark/sample metadata may still carry raw provider startup fields such as compose paths or OSWorld settings for benchmark compatibility, but runtime ownership decisions must come from the normalized `runtime_container` contract, not from agent-side interpretation of those raw fields.

## Resource Budgets
29 changes: 15 additions & 14 deletions docs/current_state.md
Expand Up @@ -40,10 +40,12 @@ What works well today:
- Provider budgets are enforced for OpenAI-compatible model calls through `OpenAICompatibleChatClient` and `scheduler.provider_slot(...)`.
- TerminalBench and OSWorld now use a task-declared `runtime_container` contract that runtime resolves before agent execution.
- Runtime-owned benchmark container resources are registered, leased, released, and summarized by a shared lifecycle manager.
- `runtime_container` now supports provider-name-first v2 fields for workspace, init/start/check commands, network, env, mounts, artifacts, and resource limits while retaining legacy startup fields.
- Runtime-owned per-trial workspaces can materialize source files, inject `workspace_dir`, snapshot before/after files, and expose workspace diff metadata to scorers.
- Samples can restrict available tools with `metadata.tool_names` or `metadata.target_functions`; missing requested tools fail in prepare with a non-retryable validation error.
- Samples can also declare dynamic OpenAI-style tool schemas in `metadata.tool_schemas`; runtime converts them into `ToolSpec`s, merges them with project tools, and fails prepare on schema conflicts.
- Agent-oriented scorer primitives now cover normalized trace extraction, answer matching, function-call matching, trace policy, command checks, workspace diffs, canary leakage, state transitions, checkpoint aggregation, rubric judges, and grouped metrics.
- `compose_terminal` is available as a generic runtime container provider and can be selected through `runtime_container.provider_name`.
- `compose_terminal` and `docker_container` are available as generic runtime container providers and can be selected through `runtime_container.provider_name`.
- The `toolemu` built-in scorer is Snowl-native and no longer imports or executes an external evaluator runtime.
- Repo-level `run_eval()` now performs trial finalize and a run-end cleanup barrier before closing live event output.
- Deferred auto-retry and manual `snowl retry` both reuse a recovery ledger instead of inventing a separate retry system.
@@ -55,11 +57,11 @@ What works well today:
| Topic | Implemented now | Partially implemented / inconsistent | Planned / not yet real |
| --- | --- | --- | --- |
| Provider budgets | `provider_budgets` are real controls and model calls acquire `scheduler.provider_slot(...)` through `OpenAICompatibleChatClient`. | Dispatch does not prioritize by provider headroom, so trials can be admitted and then wait later on model-call slots. | Scheduler-visible provider-aware dispatch and richer provider backpressure policies. |
| Prepare phase | `prepare_trial_phase()` exists, resolves task-declared container contracts, and performs container/sandbox setup. | In main eval flow, prepare still runs while holding `running_trial_slot()` rather than through an independently admitted prepare queue. | Independently admitted prepare scheduling. |
| Score decoupling | Score is admitted separately under `scoring_slot()` and no longer uses the same slot as execution. | The split is still coarse; prepare and finalize are not independently scheduled in the main loop. | Fully phase-aware scheduling across prepare, execute, score, and finalize. |
| Finalize behavior | `finalize_trial_phase()` is now used in both `execute_trial()` and the repo-level eval loop. | Finalize is still a helper call, not a first-class scheduler-managed phase with its own admission policy. | Finalize as a normal, explicitly scheduled phase in repo-level evals. |
| Runtime-owned container lifecycle | TerminalBench and OSWorld runtime-created containers are registered with run/trial ownership, released at trial end, and covered by a run-end cleanup barrier. | The shared lifecycle model is currently implemented only for these benchmark provider paths; historical or future container-backed paths still need explicit adoption. | Broader generalized container ownership across every container-backed benchmark path. |
| Container slot enforcement | `max_container_slots` exists and is tracked in scheduler/profiling data. Sandbox runtimes can be wrapped with it. | It is not a universal admission gate across every benchmark container prepare path in the main eval loop. | One control plane that gates container-backed work consistently. |
| Prepare phase | `prepare_trial_phase()` resolves workspaces, task-declared container contracts, and sandbox setup under `begin_prepare()`. | The outer dispatch loop is still coroutine-based rather than a fully materialized prepare worker pool. | Queue-level prepare batching and locality-aware reuse. |
| Score decoupling | Score is admitted separately under `begin_score()` and no longer uses the same slot as execution. | The outer dispatch loop still bounds total in-flight coroutines coarsely. | Richer score queue prioritization. |
| Finalize behavior | `finalize_trial_phase()` is admitted through `begin_finalize()` and releases runtime-owned containers. | Finalize has phase stats but not a dedicated concurrency limit. | Dedicated finalize policies if teardown becomes a bottleneck. |
| Runtime-owned container lifecycle | TerminalBench, OSWorld, compose_terminal, and docker_container sessions are registered with run/trial ownership, released at trial end, and covered by a run-end cleanup barrier. | Warm reuse is intentionally absent by default. | Warm-pool reuse and broader provider-specific diagnostics. |
| Container slot enforcement | `max_container_slots` gates runtime-managed container prepare through `begin_prepare()` and sandbox runtimes through the scheduled sandbox wrapper. | Local non-container workspace prepare is not gated by container slots. | More detailed prepare resource classes. |
| `spec_hash` locality | Container providers compute `spec_hash` and trial payloads/traces can carry it. | Queue dispatch does not use it for batching, warm-locality, or reuse preference. | Locality-aware dispatch and stronger prepare reuse. |
| Phase-aware retry | Provider HTTP retry and deferred whole-trial auto retry are real. | Retry is still mostly whole-trial; prepare/score/finalize are not retried as distinct scheduled phases. | Phase-specific retry and recovery policies. |

@@ -85,11 +87,11 @@ The web monitor currently indexes runs from `.snowl/runs/` and uses:

These areas are real, but still coarse or inconsistent:

- `TaskExecutionPlan` and `TrialDescriptor` exist in `snowl/runtime/resource_scheduler.py`, but `run_eval_with_components()` does not yet populate or use them for smarter dispatch.
- The scheduler exposes prepare/execute/score/finalize APIs, but the main eval loop only uses execute and score admission directly.
- `TrialRequest.execution_plan` and `TrialRequest.trial_descriptor` exist, but repo-level eval code does not populate them.
- `TaskExecutionPlan` and `TrialDescriptor` are populated for repo-level trial lifecycle admission, but not yet used for smarter dispatch ordering.
- The eval trial lifecycle uses scheduler prepare/execute/score/finalize APIs; the outer loop still handles dispatch and retry queues.
- `TrialRequest.execution_plan` and `TrialRequest.trial_descriptor` are populated by `EvalTrialLifecycle`.
- `spec_hash` is computed from normalized container contracts, but the runtime does not yet use it for locality-aware dispatch, warm-pool reuse, or batching.
- `max_container_slots` is wired into sandbox wrapping and scheduler APIs, but not all container-provider prepare paths are centrally admitted through that budget yet.
- `max_container_slots` gates runtime-managed container-provider prepare paths selected by `runtime_container.requires_container`.
- The main dispatch loop is still close to FIFO: it drains `fresh_queue` in plan order, then consumes deferred retries when ready.
- Provider capacity is enforced at model-call admission time, not by a scheduler that prioritizes work based on provider headroom.
- Task/sample rows may still carry raw benchmark startup fields such as compose paths or OSWorld settings, but runtime ownership decisions should come from the normalized `runtime_container` contract.
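The near-FIFO dispatch described above amounts to something like the following. The queue name `fresh_queue` is borrowed from the text; the function, the `(ready_at, trial)` tuple shape, and the clock are illustrative:

```python
import heapq


def drain_dispatch(fresh_queue: list, deferred: list, now: int) -> list:
    """Drain fresh trials in plan order, then deferred retries whose
    ready-time has arrived. `deferred` holds (ready_at, trial) pairs."""
    order = list(fresh_queue)  # plan order, plain FIFO
    heapq.heapify(deferred)
    while deferred and deferred[0][0] <= now:
        order.append(heapq.heappop(deferred)[1])
    return order


print(drain_dispatch(["t1", "t2"], [(5, "t3"), (1, "t4")], now=3))
# → ['t1', 't2', 't4']  ("t3" is not ready until tick 5)
```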
@@ -131,7 +133,7 @@ The following show up in docs and scaffolding, but are not current runtime behavior:

- Scheduler-driven phase planning with explicit `TrialDescriptor` / `TaskExecutionPlan` inputs.
- Locality-aware dispatch using `spec_hash`.
- Broad prepare/finalize admission through `begin_prepare()` and `begin_finalize()`.
- Dedicated queue workers for prepare/finalize beyond the current per-trial phase admission.
- Benchmark container warm reuse or pooling by default.
- More sophisticated blocked-group/canary-first scheduling.
- Distributed or multi-machine execution.
@@ -140,8 +142,7 @@ The following show up in docs and scaffolding, but are not current runtime behavior:

- Treat `docs/runtime_scheduling*.md` as design notes, not source-of-truth behavior docs.
- Treat `run_eval()` as the runtime path that matters for end-to-end repo behavior.
- Do not assume `prepare_trial_phase()` or `finalize_trial_phase()` are independently scheduled just because helpers exist.
- Do not assume task/sample raw benchmark fields are the ownership contract; runtime now resolves `runtime_container` and agents must not use raw compose/OSWorld fields to decide whether to start containers.
- Do not assume `max_container_slots` fully governs every container-backed path yet.
- Do not assume `TaskExecutionPlan`, `TrialDescriptor`, or `spec_hash` are wired into dispatch just because the types exist.
- Do not assume `max_container_slots` applies to non-container local workspace materialization.
- Do not assume `TaskExecutionPlan`, `TrialDescriptor`, or `spec_hash` drive dispatch order yet.
- Do not assume multiple providers, distributed execution, or cross-run pooling exist just because the scheduler types look extensible.
10 changes: 10 additions & 0 deletions examples/sandbox-coding-smoke/README.md
@@ -0,0 +1,10 @@
# Sandbox Coding Smoke

Minimal project showing Snowl runtime-owned workspaces for coding-agent style
benchmarks. The task seeds a tiny repository, the agent edits the isolated
workspace, and the scorer checks the resulting file diff.

```bash
snowl eval examples/sandbox-coding-smoke/project.yml
```

25 changes: 25 additions & 0 deletions examples/sandbox-coding-smoke/agent.py
@@ -0,0 +1,25 @@
from pathlib import Path

from snowl.core import StopReason


class WorkspaceFixAgent:
    agent_id = "workspace_fix_agent"

    async def run(self, state, context, tools=None):
        _ = tools  # tools are unused; this agent only edits the workspace
        # Runtime injects the isolated workspace path into agent context.
        workspace = context.metadata.get("__snowl_workspace") or {}
        workspace_dir = Path(str(workspace.get("workspace_dir") or "."))
        target = workspace_dir / "src" / "app.py"
        target.write_text("def add(a, b):\n    return a + b\n", encoding="utf-8")
        state.output = {
            "message": {"role": "assistant", "content": "patched src/app.py"},
            "usage": {"input_tokens": 1, "output_tokens": 1, "total_tokens": 2},
            "trace_events": [{"event": "example.patch", "path": "src/app.py"}],
        }
        state.stop_reason = StopReason.COMPLETED
        return state


agent = WorkspaceFixAgent()

27 changes: 27 additions & 0 deletions examples/sandbox-coding-smoke/project.yml
@@ -0,0 +1,27 @@
project:
  name: sandbox-coding-smoke
  root_dir: .

provider:
  id: local
  kind: openai_compatible
  base_url: http://127.0.0.1:9/v1
  api_key: unused

agent_matrix:
  models:
    - id: local
      model: local/no-model

eval:
  benchmark: sandbox_coding_smoke
  code:
    base_dir: .
    task_module: ./task.py
    agent_module: ./agent.py
    scorer_module: ./scorer.py

runtime:
  max_running_trials: 1
  max_container_slots: 1
  max_scoring_tasks: 1
18 changes: 18 additions & 0 deletions examples/sandbox-coding-smoke/scorer.py
@@ -0,0 +1,18 @@
from snowl.core import Score
from snowl.scorer import workspace_diff


class WorkspaceSmokeScorer:
    scorer_id = "workspace_smoke"

    def score(self, task_result, trace, context):
        # Reuse the built-in workspace_diff primitive, then mirror its
        # result into the benchmark's primary "accuracy" metric.
        base = workspace_diff(metric_name="workspace_changed").score(task_result, trace, context)
        changed = base["workspace_changed"]
        return {
            "workspace_changed": changed,
            "accuracy": Score(changed.value, metadata=dict(changed.metadata)),
        }


scorer = WorkspaceSmokeScorer()

31 changes: 31 additions & 0 deletions examples/sandbox-coding-smoke/task.py
@@ -0,0 +1,31 @@
from snowl.core import EnvSpec, Task


def _samples():
    yield {
        "id": "fix-add",
        "input": "Fix src/app.py so add(1, 2) returns 3.",
        "metadata": {
            "workspace": {
                "enabled": True,
                "repo_files": {
                    "src/app.py": "def add(a, b):\n    return 0\n",
                    "tests/test_app.py": "from src.app import add\n\n\ndef test_add():\n    assert add(1, 2) == 3\n",
                },
            },
            "required_changed_paths": ["src/app.py"],
            "check_command": "python -m pytest -q",
        },
    }


task = Task(
    task_id="sandbox-coding-smoke",
    env_spec=EnvSpec(
        env_type="terminal",
        provided_ops=("process.run", "terminal.exec", "terminal.capture", "terminal.wait"),
    ),
    sample_iter_factory=_samples,
    metadata={
        "benchmark": "sandbox_coding_smoke",
        "primary_metric": "workspace_changed",
    },
)
