69 commits
da61fc8
feat(dashboard): add MAS visual debugger activity stack
cm2435 Apr 26, 2026
75d194c
fix(dashboard): align visual debugger styling with Claude design
cm2435 Apr 26, 2026
541035a
Refine run debugger and smoke test boundaries
cm2435 Apr 26, 2026
acf16e5
Merge main into visual debugger branch
cm2435 Apr 26, 2026
8589b83
Fix CI formatting and generated contracts
cm2435 Apr 26, 2026
070e30f
Fix type checks after main sync
cm2435 Apr 26, 2026
e9d92e5
Fix CI build and migration head
cm2435 Apr 26, 2026
8e54706
Add evaluation visibility and smoke coverage
cm2435 Apr 27, 2026
9bbde99
Make criterion rubric details first class
cm2435 Apr 27, 2026
1d4a9c4
Fix graph rubric glyph test id collision
cm2435 Apr 27, 2026
3fe6212
Improve research rubric rollout diagnostics
cm2435 Apr 27, 2026
ca3b720
Consolidate LLM context capture in builtins
cm2435 Apr 27, 2026
2b9788a
wip: cli fixes and refactors
cm2435 Apr 27, 2026
e361806
docs: plan core schema deduplication
cm2435 Apr 28, 2026
89b50b2
refactor: narrow public API surface
cm2435 Apr 28, 2026
4944832
refactor: improve research rubric evaluation
cm2435 Apr 28, 2026
b423097
chore: apply lint and test cleanup
cm2435 Apr 28, 2026
98e64e4
test: keep OpenRouter budget helper in real LLM tests
cm2435 Apr 28, 2026
05fe264
docs: update OpenRouter budget helper references
cm2435 Apr 28, 2026
63f7f07
Consolidate graph status conventions
cm2435 Apr 28, 2026
eedabec
Use graph status conventions in propagation
cm2435 Apr 28, 2026
9beec0f
Align propagation contract with blocked successors
cm2435 Apr 28, 2026
0d8facb
Use canonical evaluation criterion status
cm2435 Apr 28, 2026
c0f3eba
Unify graph mutation payload contracts
cm2435 Apr 28, 2026
3ac0cfc
Collapse duplicate task node projections
cm2435 Apr 28, 2026
b114390
test: add missing tool budget module
cm2435 Apr 28, 2026
3debc68
fix: emit assigned worker slug in task status events
cm2435 Apr 28, 2026
de7c73b
Centralize task cancellation causes
cm2435 Apr 28, 2026
7803087
Share typed context event payload schemas
cm2435 Apr 28, 2026
1f134ab
Guard generation to context event mapping
cm2435 Apr 28, 2026
060b927
Guard core schema source ownership
cm2435 Apr 28, 2026
88fbc6f
Align dashboard contracts with core schemas
cm2435 Apr 28, 2026
7256795
Use worker slug in dashboard task state
cm2435 Apr 28, 2026
3b0c838
docs: audit runtime service layout
cm2435 Apr 28, 2026
164b17d
Stream generation parts through context events
cm2435 Apr 28, 2026
187019e
docs: capture runtime cleanup plans
cm2435 Apr 28, 2026
14ee2b8
Refactor runtime debugging infrastructure
cm2435 Apr 28, 2026
cd21b0c
docs: trim cleanup plan trailing whitespace
cm2435 Apr 28, 2026
00e5569
Merge core schema cleanup into MAS debugger branch
cm2435 Apr 28, 2026
f629cbe
fix: trim schema trailing whitespace
cm2435 Apr 28, 2026
4875c94
fix: align field docs guard with context stream schema
cm2435 Apr 28, 2026
eae839c
fix: address CI integration type issues
cm2435 Apr 28, 2026
ab28db3
Merge main into MAS debugger branch
cm2435 Apr 28, 2026
1ec99e3
Consolidate recovered branch cleanup work
cm2435 Apr 28, 2026
0da06aa
docs: capture cleanup and layout plans
cm2435 Apr 29, 2026
323a1b2
refactor: split public benchmark API package
cm2435 Apr 29, 2026
aed3d1c
refactor: split public criterion and rubric APIs
cm2435 Apr 29, 2026
2dd0e12
refactor: split public worker API package
cm2435 Apr 29, 2026
4f18c87
refactor: introduce explicit component registry API
cm2435 Apr 29, 2026
8839ac2
refactor: move shared core utilities into core shared package
cm2435 Apr 29, 2026
34b84ce
refactor: move experiment domain models into domain package
cm2435 Apr 29, 2026
94b3a29
refactor: move generation context models into domain package
cm2435 Apr 29, 2026
b6c7c17
refactor: move experiment application services
cm2435 Apr 29, 2026
85dbd79
refactor: move task and graph application services
cm2435 Apr 29, 2026
1cba909
refactor: move evaluation application services
cm2435 Apr 29, 2026
857f0c5
refactor: move workflow application services
cm2435 Apr 29, 2026
0499ecc
refactor: move runtime jobs into application package
cm2435 Apr 29, 2026
8172aba
refactor: move read models out of runtime services
cm2435 Apr 29, 2026
297a883
refactor: move Inngest integration into infrastructure package
cm2435 Apr 29, 2026
984a946
refactor: move sandbox infrastructure package
cm2435 Apr 29, 2026
5a2fd4b
refactor: move tracing and dashboard infrastructure
cm2435 Apr 29, 2026
23cf32f
refactor: move FastAPI routes into rest api package
cm2435 Apr 29, 2026
c214063
refactor: update persistence imports for new core layout
cm2435 Apr 29, 2026
0133cd6
refactor: extract builtin worker factories and shared helpers
cm2435 Apr 29, 2026
5369bb9
refactor: register builtins through explicit registry hooks
cm2435 Apr 29, 2026
57dea6b
refactor: update builtin benchmarks and evaluators for new APIs
cm2435 Apr 29, 2026
5202186
refactor: update CLI experiment and benchmark flows
cm2435 Apr 29, 2026
db51579
test: move smoke fixtures into shared test fixtures
cm2435 Apr 29, 2026
22b7e62
test: update package tests and black-box harness layout
cm2435 Apr 29, 2026
4 changes: 2 additions & 2 deletions .github/workflows/e2e-benchmarks.yml
@@ -35,7 +35,7 @@ jobs:
env:
SMOKE_ENV: ${{ matrix.env }}
ENABLE_TEST_HARNESS: "1"
ERGON_STARTUP_PLUGINS: "ergon_core.test_support.smoke_fixtures:register_smoke_fixtures"
ERGON_STARTUP_PLUGINS: "ergon_builtins.registry:register_builtins,tests.fixtures.smoke_components:register_smoke_fixtures"
TEST_HARNESS_SECRET: ${{ secrets.TEST_HARNESS_SECRET || 'ci-test-harness' }}
E2B_API_KEY: ${{ secrets.E2B_API_KEY }}
GITHUB_PR_NUMBER: ${{ github.event.pull_request.number }}
@@ -74,7 +74,7 @@ jobs:
# Unified compose reads these as overrides (see docker-compose.yml).
POSTGRES_PASSWORD: ci_test
ENABLE_TEST_HARNESS: "1"
ERGON_STARTUP_PLUGINS: "ergon_core.test_support.smoke_fixtures:register_smoke_fixtures"
ERGON_STARTUP_PLUGINS: "ergon_builtins.registry:register_builtins,tests.fixtures.smoke_components:register_smoke_fixtures"
run: docker compose up -d --build --wait
timeout-minutes: 5

1 change: 1 addition & 0 deletions .gitignore
@@ -13,6 +13,7 @@ build/

# Environment
.env
.logfire/

# Databases
*.db
2 changes: 1 addition & 1 deletion Dockerfile
@@ -37,4 +37,4 @@ RUN cd ergon_cli && uv pip install --system -e "."

EXPOSE 9000

CMD ["uvicorn", "ergon_core.core.api.app:app", "--host", "0.0.0.0", "--port", "9000"]
CMD ["uvicorn", "ergon_core.core.rest_api.app:app", "--host", "0.0.0.0", "--port", "9000"]
4 changes: 2 additions & 2 deletions docker-compose.yml
@@ -84,7 +84,7 @@ services:
- INNGEST_API_BASE_URL=http://inngest-dev:8288
- ERGON_API_BASE_URL=http://api:9000
- ENABLE_TEST_HARNESS=${ENABLE_TEST_HARNESS:-1}
- ERGON_STARTUP_PLUGINS=${ERGON_STARTUP_PLUGINS-ergon_core.test_support.smoke_fixtures:register_smoke_fixtures}
- ERGON_STARTUP_PLUGINS=${ERGON_STARTUP_PLUGINS-ergon_builtins.registry:register_builtins,tests.fixtures.smoke_components:register_smoke_fixtures}
- TEST_HARNESS_SECRET=${TEST_HARNESS_SECRET:-local-dev}
- ERGON_BLOB_ROOT=/tmp/ergon-blob
- OTEL_TRACES_ENABLED=false
@@ -120,7 +120,7 @@ services:
postgres:
condition: service_healthy
command: >
uvicorn ergon_core.core.api.app:app
uvicorn ergon_core.core.rest_api.app:app
--host 0.0.0.0 --port 9000 --reload
--reload-dir /app/ergon_core
--reload-dir /app/ergon_builtins
31 changes: 17 additions & 14 deletions docs/architecture/03_providers.md
@@ -2,26 +2,28 @@

## 1. Purpose

The providers layer is Ergon's boundary between runtime code and external execution substrates. It owns four concerns: resolving `model_id` strings to `pydantic_ai.models.Model` instances, provisioning and tearing down E2B sandboxes via per-benchmark manager subclasses, surfacing sandbox state transitions as dashboard events, and publishing worker outputs as content-addressed blobs that evaluators can re-read. Everything that crosses the process boundary (LLM API, container runtime, blob storage) is routed through this layer so the runtime, workers, and evaluators stay substrate-agnostic.
The provider-style boundaries are Ergon's adapters between runtime code and external execution substrates. Model resolution lives in the generation registry, while sandbox infrastructure now lives under `ergon_core.core.sandbox` because it owns lifecycle, instrumentation, event emission, and artifact publishing rather than just a third-party provider adapter.

## 2. Core abstractions

| Name | Kind | Location | Freeze status | Owner |
| --- | --- | --- | --- | --- |
| `_BACKEND_REGISTRY` | module-level dict | `ergon_core/core/providers/generation/model_resolution.py` | Frozen shape; entries grow via registration. | Providers layer. |
| `resolve_model_target` | function | `ergon_core/core/providers/generation/model_resolution.py` | Public, frozen signature. Returns `ResolvedModel`. | Providers layer. |
| `BaseSandboxManager` | abstract class + singleton | `ergon_core/core/providers/sandbox/manager.py` | Shape stable; `event_sink` activation path in flux. | Providers layer. |
| `DefaultSandboxManager` | concrete class | `ergon_core/core/providers/sandbox/manager.py` | Frozen. | Providers layer. |
| `register_model_backend` | function | `ergon_core/core/providers/generation/model_resolution.py` | Public, frozen signature. | Providers layer; callers are backend modules executing at import time. |
| `BaseSandboxManager` | abstract class + singleton | `ergon_core/core/sandbox/manager.py` | Shape stable; `event_sink` activation path in flux. | Sandbox domain. |
| `DefaultSandboxManager` | concrete class | `ergon_core/core/sandbox/manager.py` | Frozen. | Sandbox domain. |
| `SWEBenchSandboxManager`, `MiniF2FSandboxManager`, `ResearchRubricsSandboxManager` | concrete subclasses | `ergon_builtins/` | Owned per benchmark; singletons. | Benchmark authors. |
| `SandboxEventSink` | `typing.Protocol` | `ergon_core/core/providers/sandbox/event_sink.py` | Frozen protocol; activation path in flux. | Providers layer. |
| `NoopSandboxEventSink`, `DashboardEmitterSandboxEventSink` | implementations | `ergon_core/core/providers/sandbox/event_sink.py` | Frozen. | Providers layer. |
| `SandboxResourcePublisher` | class | `ergon_core/core/providers/sandbox/resource_publisher.py` | Frozen API; storage backend swappable via `ERGON_BLOB_ROOT`. | Providers layer. |
| `SandboxEventSink` | `typing.Protocol` | `ergon_core/core/sandbox/event_sink.py` | Frozen protocol; activation path in flux. | Sandbox domain. |
| `NoopSandboxEventSink`, `DashboardEmitterSandboxEventSink` | implementations | `ergon_core/core/sandbox/event_sink.py` | Frozen. | Sandbox domain. |
| `SandboxResourcePublisher` | class | `ergon_core/core/sandbox/resource_publisher.py` | Frozen API; storage backend swappable via `ERGON_BLOB_ROOT`. | Sandbox domain. |
| `TransformersModel` | `pydantic_ai.models.Model` subclass | `ergon_builtins/ergon_builtins/models/transformers_backend.py` | Frozen. | ML team (TRL training loop callers). |

### 2.1 Model target resolution
### 2.1 Generation registry

`resolve_model_target` is the single dispatch point for model target strings. It splits the target on its first colon and returns a `ResolvedModel` wrapping a concrete `pydantic_ai.models.Model` instance. Unknown prefixes raise immediately instead of falling through to PydanticAI inference.
`_BACKEND_REGISTRY` is a prefix-keyed dispatch table of resolver callables. `resolve_model_target` splits the target on its first colon, dispatches to the resolver, and returns a `ResolvedModel` wrapping either a `pydantic_ai.models.Model` instance or a passthrough string. Unknown prefixes fall through to a passthrough `ResolvedModel` — PydanticAI's own `infer_model` is invoked on use. Backends mutate the registry at import time; the builtins pack registers all four in a single loop at `ergon_builtins/ergon_builtins/registry.py:81`.

The supported prefixes are `vllm:<base-url>[#<model>]`, `openai-compatible:<base-url>#<model>`, and cloud provider prefixes `openai:*` / `anthropic:*` / `google:*`. Cloud provider prefixes always route through OpenRouter via PydanticAI's OpenRouter provider; they do not call direct OpenAI, Anthropic, or Google APIs.
The four prefixes registered today are `vllm:*` (local vLLM server via PydanticAI's `OpenAIChatModel`), `openai:*` / `anthropic:*` / `google:*` (passthrough to `infer_model`), and `transformers:*` (custom `TransformersModel` for TRL-trained checkpoints not served over vLLM).
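
The registry-based dispatch described in this revision can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: `FakeModel` stands in for `pydantic_ai.models.Model`, the `model` field name on `ResolvedModel` and the `(prefix, resolver)` signature of `register_model_backend` are hypothetical, and resolvers are assumed to receive the full target string.

```python
from dataclasses import dataclass
from typing import Callable, Union


class FakeModel:
    """Stand-in for pydantic_ai.models.Model so the sketch runs without the library."""

    def __init__(self, name: str) -> None:
        self.name = name


@dataclass(frozen=True)
class ResolvedModel:
    # Hypothetical field: either a concrete model instance or the untouched target string.
    model: Union[FakeModel, str]


Resolver = Callable[[str], ResolvedModel]

# Prefix-keyed dispatch table; backends are expected to add entries at import time.
_BACKEND_REGISTRY: dict[str, Resolver] = {}


def register_model_backend(prefix: str, resolver: Resolver) -> None:
    # Assumed signature; the real function may differ.
    _BACKEND_REGISTRY[prefix] = resolver


def resolve_model_target(target: str) -> ResolvedModel:
    # Split on the first colon only, so the remainder may itself contain colons (URLs).
    prefix, _, _rest = target.partition(":")
    resolver = _BACKEND_REGISTRY.get(prefix)
    if resolver is None:
        # Unknown prefix: pass the string through for later inference.
        return ResolvedModel(model=target)
    return resolver(target)


def _resolve_vllm(target: str) -> ResolvedModel:
    # Everything after the first colon is treated as the server address here.
    return ResolvedModel(model=FakeModel(target.partition(":")[2]))


register_model_backend("vllm", _resolve_vllm)
```

Because the lookup is a plain dictionary read, a backend that registers after the first `resolve_model_target` call would silently take the passthrough path, which is presumably why registration order is listed as an invariant.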

Workers are expected to hold no hardcoded SDK client constructions (`AsyncOpenAI`, `anthropic.Client`, `genai.Client`). This is an invariant (Section 4), not a coincidence, and is currently honored — enforcement is grep discipline.

@@ -85,7 +87,7 @@ The decentralized shape means `ergon benchmark setup` iterates over whatever sub
Worker.execute()
|
+-> resolve_model_target(self.model) --> ResolvedModel
| (explicit prefix dispatch; cloud targets route via OpenRouter)
| (prefix dispatch; 4 backends + fallthrough to infer_model)
|
+-> ManagerClass() (singleton; returns cached instance)
| ManagerClass().create(sandbox_key=task_id, run_id=run_id, ...)
@@ -124,7 +126,7 @@ Movement of data across this diagram:
## 4. Invariants

1. **One entry point to LLM resolution.** Every model reference goes through `resolve_model_target`. Enforced by grep discipline and review; no runtime check.
2. **Cloud provider prefixes use OpenRouter.** `openai:*`, `anthropic:*`, and `google:*` model targets are OpenRouter-hosted targets. Direct cloud SDK model routing is intentionally outside the grammar.
2. **Backends register at import time.** `register_model_backend` must be called before any caller hits `resolve_model_target`. Enforced by the builtins pack running its registration loop at import, before any worker module imports.
3. **Singleton managers hold authoritative sandbox state.** A subclass's class-level state is the only source of truth for in-process reconnect. Enforced by `__new__` caching the instance and `get_sandbox` reading the class dict. Applies only within a single Python process; cross-process actors must use `terminate_by_sandbox_id` or provision their own sandbox.
4. **Sandbox lifecycle is per-task.** Enforced by `create` accepting `sandbox_key` and by the worker runtime persisting `sandbox_id` on the execution row.
5. **Sandbox lives across evaluator fan-out.** Teardown runs at the end of `check_evaluators`, not at worker completion, not in `finalize_success`. Enforced by the evaluator harness, not by the manager itself.
@@ -144,9 +146,10 @@ Movement of data across this diagram:

### 5.1 Add a new LLM backend

1. Add an explicit prefix branch in `resolve_model_target` and keep the constructor logic in a sibling module under `ergon_core/core/providers/generation/`.
2. Return a concrete `pydantic_ai.models.Model` instance wrapped in `ResolvedModel`.
3. Add an entry to `LLMProvider` and `PROVIDER_KEY_MAP` in `ergon_cli/onboarding/profile.py` so onboarding prompts for the key or server URL.
1. Write a resolver that maps `"myprefix:foo"` to a `pydantic_ai.models.Model` instance wrapped in `ResolvedModel`.
2. Register it in the builtins-pack registration loop so `register_model_backend` is called at import time.
3. Ensure the builtins pack is imported before any worker that references `myprefix:*` model ids.
4. Add an entry to `LLMProvider` and `PROVIDER_KEY_MAP` in `ergon_cli/onboarding/profile.py` so onboarding prompts for the key or server URL.

### 5.2 Add a new sandbox manager

64 changes: 38 additions & 26 deletions docs/architecture/07_testing.md
@@ -25,17 +25,17 @@ Path-based, not marker-based. The local gate and the CI workflow both dispatch b

Every PR runs three benchmark legs in parallel via `.github/workflows/e2e-benchmarks.yml`:

| Leg | Slot 1 | Slot 2 | Slot 3 |
|---|---|---|---|
| `researchrubrics` | happy | happy | **sad** — `l_2` forced FAIL |
| `minif2f` | happy | happy | happy |
| `swebench-verified` | happy | happy | happy |
| Leg | Slot 1 | Slot 2 |
|---|---|---|
| `researchrubrics` | happy | **sad** — `l_2` forced FAIL |
| `minif2f` | happy | **sad** — `l_2` forced FAIL |
| `swebench-verified` | happy | **sad** — `l_2` forced FAIL |

**9 top-level runs per PR; 80 leaf sandbox acquisitions** (8 happy × 9 leaves + 1 sad × 8 leaves — `l_3` never provisioned because its dependency failed).
**6 top-level runs per PR; 57 dynamic child sandbox acquisitions** (3 happy × 11 child tasks + 3 sad × 8 child tasks — `l_3` never provisions on sad runs because its dependency failed).
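
The revised totals can be re-derived with a few lines of arithmetic. The sketch below only restates the numbers given in this section; the constant names are illustrative and do not come from the repository.

```python
# Benchmark legs listed in the matrix table.
LEGS = ("researchrubrics", "minif2f", "swebench-verified")

# Each leg runs one happy slot and one sad slot.
SLOTS_PER_LEG = ("happy", "sad")

# Child tasks that acquire a sandbox per run:
#   happy: 9 direct children + nested l_2_a and l_2_b
#   sad:   9 direct children minus l_3, which never provisions
CHILD_TASKS = {"happy": 11, "sad": 8}

runs_per_pr = len(LEGS) * len(SLOTS_PER_LEG)
acquisitions_per_leg = sum(CHILD_TASKS[slot] for slot in SLOTS_PER_LEG)
acquisitions_per_pr = len(LEGS) * acquisitions_per_leg

print(runs_per_pr, acquisitions_per_leg, acquisitions_per_pr)
```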

### 3.1 Immutable 9-leaf DAG
### 3.1 Smoke DAG

Every smoke run — happy or sad — spawns exactly this graph:
Every smoke run starts with the same 9 direct children:

```
Diamond (4): Line (3): Singletons (2):
@@ -46,9 +46,18 @@ d_left d_right
d_join
```

Topology is enforced by `tests/e2e/_fixtures/smoke_base/worker_base.py::SmokeWorkerBase.execute` being decorated `@typing.final`. Subclasses supply the leaf slug via `leaf_slug` and (optionally) override `_spec_for(slug, deps, desc)` to route specific slugs elsewhere — the sad-path subclass uses this to route `l_2` to a failing leaf. They cannot change the DAG itself.
Happy-path runs route top-level `l_2` to `{env}-smoke-recursive-worker`, which plans a nested two-node line under `l_2`:

The single source of truth for topology is [`tests/e2e/_fixtures/smoke_base/constants.py`](../../tests/e2e/_fixtures/smoke_base/constants.py):
```text
l_2
└─ l_2_a → l_2_b
```

Top-level `l_3` depends on `l_2`, so the smoke proves dependency propagation waits for a non-leaf dynamic task before releasing downstream work. Sad-path runs route `l_2` to the failing leaf instead, so `l_3` remains blocked.
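
A toy model of this propagation rule, assuming a task is blocked whenever any dependency did not complete. The `propagate` helper and the `Status` enum are hypothetical, and treating the nested children of `l_2` as dependencies of `l_2` is a simplification made only for this sketch.

```python
from enum import Enum


class Status(str, Enum):
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    BLOCKED = "BLOCKED"


def propagate(order, deps, failing):
    """Walk tasks in dependency order and derive a status for each slug."""
    status = {}
    for slug in order:
        if any(status[dep] is not Status.COMPLETED for dep in deps.get(slug, ())):
            # A dependency failed or was itself blocked, so this task never starts.
            status[slug] = Status.BLOCKED
        elif slug in failing:
            status[slug] = Status.FAILED
        else:
            status[slug] = Status.COMPLETED
    return status


# Happy path: l_2 finishes only after its nested line l_2_a -> l_2_b, then l_3 runs.
happy = propagate(
    order=("l_2_a", "l_2_b", "l_2", "l_3"),
    deps={"l_2_b": ("l_2_a",), "l_2": ("l_2_b",), "l_3": ("l_2",)},
    failing=set(),
)

# Sad path: l_2 is a failing leaf with no nested children, so l_3 stays blocked.
sad = propagate(order=("l_2", "l_3"), deps={"l_3": ("l_2",)}, failing={"l_2"})
```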

Topology is enforced by `ergon_core/test_support/smoke_fixtures/smoke_base/worker_base.py::SmokeWorkerBase.execute` being decorated `@typing.final`. Subclasses supply the leaf slug via `leaf_slug` and override `_spec_for(slug, deps, desc)` only to route specific slugs elsewhere. They cannot change the direct-child DAG itself.
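
The pattern of a final `execute` with an overridable `_spec_for` hook might look like the sketch below. It is a simplified, synchronous illustration: the two-node topology, the dictionary spec shape, and the worker slugs are assumptions rather than the fixture's real definitions. Note that `typing.final` is checked by static type checkers such as mypy or pyright; Python does not reject an override at runtime.

```python
from typing import final


class SmokeWorkerBase:
    # Slug of the worker that runs each child task (illustrative value).
    leaf_slug = "smoke-leaf"

    # Illustrative two-node subset of the direct-child topology: (slug, deps).
    _TOPOLOGY = (("l_2", ()), ("l_3", ("l_2",)))

    @final
    def execute(self):
        # The graph shape is fixed here; subclasses only influence routing.
        return [
            self._spec_for(slug, deps, f"subtask {slug}")
            for slug, deps in self._TOPOLOGY
        ]

    def _spec_for(self, slug, deps, desc):
        return {"slug": slug, "deps": deps, "desc": desc, "worker": self.leaf_slug}


class SadPathSmokeWorker(SmokeWorkerBase):
    failing_leaf_slug = "smoke-leaf-failing"

    def _spec_for(self, slug, deps, desc):
        spec = super()._spec_for(slug, deps, desc)
        if slug == "l_2":
            # Route only l_2 to the failing leaf; slugs and deps stay untouched.
            spec["worker"] = self.failing_leaf_slug
        return spec


happy_specs = SmokeWorkerBase().execute()
sad_specs = SadPathSmokeWorker().execute()
```

Both runs produce the same slugs and dependency edges; only the worker assigned to `l_2` differs.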

The single source of truth for the direct-child topology is [`ergon_core/test_support/smoke_fixtures/smoke_base/constants.py`](../../ergon_core/ergon_core/test_support/smoke_fixtures/smoke_base/constants.py):

```python
EXPECTED_SUBTASK_SLUGS = (
@@ -60,29 +69,32 @@ EXPECTED_SUBTASK_SLUGS = (

### 3.2 Fixture residency — test-only, out of `ergon_builtins`

`ergon_builtins/` contains only production baselines (ReActWorker, TrainingStubWorker). All smoke workers, leaves, and criteria live under [`tests/e2e/_fixtures/`](../../tests/e2e/_fixtures/) and register into the process-level `WORKERS` / `EVALUATORS` dicts via an import side-effect in `tests/e2e/_fixtures/__init__.py`, which `tests/e2e/conftest.py` imports at session start.
`ergon_builtins/` contains only production baselines (ReActWorker, TrainingStubWorker). All smoke workers, leaves, and criteria live under [`tests/fixtures/smoke_components/`](../../tests/fixtures/smoke_components/) and register into the process-level core component registry through `register_smoke_fixtures()`.

11 registry rows total — none production:
19 registry rows total — none production:

| Slug | Kind |
|---|---|
| `{env}-smoke-worker` × 3 | Worker (parent) — inherits `SmokeWorkerBase` |
| `{env}-smoke-leaf` × 3 | Worker (leaf) — inherits `BaseSmokeLeafWorker` |
| `researchrubrics-sadpath-smoke-worker` | Worker (sad-path parent) |
| `researchrubrics-smoke-leaf-failing` | Worker (sad-path failing leaf) |
| `{env}-smoke-recursive-worker` × 3 | Worker (nested `l_2` parent) — inherits `RecursiveSmokeWorkerBase` |
| `{env}-sadpath-smoke-worker` × 3 | Worker (sad-path parent) |
| `{env}-smoke-leaf-failing` × 3 | Worker (sad-path failing leaf) |
| `{env}-smoke-criterion` × 3 | Criterion — inherits `SmokeCriterionBase` |
| `smoke-post-root-timing-criterion` | Criterion — second root evaluator used for timing assertions |

where `{env} ∈ {researchrubrics, minif2f, swebench}`.

### 3.3 Turn persistence

- Parent `SmokeWorkerBase.execute` yields **3** `GenerationTurn`s (planning → planned → awaiting) so incremental turn persistence is exercised on every run.
- Happy-path recursive `l_2` yields **3** `GenerationTurn`s.
- Each leaf `BaseSmokeLeafWorker.execute` yields **2** turns (attaching → done).
- Total per happy run: **1 × 3 + 9 × 2 = 21** `GenerationTurn` rows; driver asserts on this.
- Total per happy run: **3 + 3 + 10 × 2 = 26** `GenerationTurn` rows; driver asserts on this.

### 3.4 Inter-agent messaging

Each happy-path leaf calls `CommunicationService.save_message` once on the `smoke-completion` thread (first production caller of that service). 9 `ThreadMessage` rows per happy run, sequence_num 1..9 per thread. Sad-path `l_2` raises before reaching this call — 8 messages on a sad run, with `l_2` missing.
Each happy-path leaf calls `CommunicationService.save_message` once on the `smoke-completion` thread (first production caller of that service). The recursive `l_2` worker also sends one completion message after nested children finish. Happy runs emit 11 `ThreadMessage` rows (`9` direct slugs + `l_2_a`, `l_2_b`), sequence_num 1..11 per thread. Sad-path `l_2` raises before reaching this call and `l_3` blocks — 7 messages on a sad run, with `l_2` and `l_3` missing.

### 3.5 Sandbox-side checks

@@ -98,14 +110,14 @@ For each run in a cohort, the pytest driver asserts:

| Channel | What it checks |
|---|---|
| `RunGraphNode` | 10 nodes (1 root + 9 leaves); all COMPLETED (happy) or cascade pattern (sad); `sorted(slugs) == EXPECTED_SUBTASK_SLUGS` |
| `RunGraphEdge` | 6 expected dependency edges (diamond + line) |
| `RunResource` | ≥ 18 rows (9 outputs + 9 probes); all with non-empty `content_hash` |
| `GenerationTurn` | Exactly 21 rows per happy run (derived from `PARENT_TURN_COUNT + 9 × LEAF_TURN_COUNT`) |
| `ThreadMessage` (topic `smoke-completion`) | 9 messages per happy run / 8 per sad; `sequence_num` strictly 1..N |
| `RunGraphNode` | Happy: 12 nodes (1 root + 9 direct children + 2 nested children), all COMPLETED; sad: cascade pattern with `l_2` FAILED and `l_3` BLOCKED |
| `RunGraphEdge` | Expected dependency edges (diamond, top-level line, nested `l_2_a → l_2_b`) |
| `RunResource` | Happy: 20 rows (10 outputs + 10 probes); all with non-empty `content_hash` |
| `GenerationTurn` | Exactly 26 rows per happy run |
| `ThreadMessage` (topic `smoke-completion`) | 11 messages per happy run / 7 per sad; `sequence_num` strictly 1..N |
| Blob store round-trip | Re-read of one probe JSON is byte-stable + parses |
| Temporal ordering | `RunTaskExecution.started_at` of children ≥ `completed_at` of parents |
| `RunTaskEvaluation` | Exactly 1 row; score 1.0 (happy) / 0.0 (sad); failed slug named in sad feedback |
| `RunTaskEvaluation` | Happy: 2 root rows, both score 1.0 and created after root execution completion; sad: no successful final score |

Sad-path adds: partial artifact persisted (partial_*.md exists as RunResource), pre-failure WAL entry present, `l_3` status BLOCKED/CANCELLED per RFC `static-sibling-failure-semantics`.

@@ -153,18 +165,18 @@ Required `data-testid` attributes: `run-status`, `task-node-{slug}` (one per `EX
3. **Test stubs live in `tests/e2e/_fixtures/`, not `ergon_builtins/`.** Production registry (`ergon_builtins/registry_core.py`) contains only production baselines. Exception: `training_stub_worker.py` — it's a real RL-trajectory baseline, not test scaffolding; operators invoke it via CLI.
4. **Criteria reconnect via the CriterionRuntime DI container, never via `AsyncSandbox.connect` directly.** Enforced by code inspection; the anti-pattern previously fixed by `bugs/fixed/2026-04-18-swebench-criterion-spawns-sandbox.md`.
5. **Sandbox outlives the task until all criteria finish.** RFC `sandbox-lifetime-covers-criteria`. Smoke is the living regression test for this.
6. **Cohort parallelism exercised on every PR.** 3-run cohorts prove concurrent workflow submission and cohort aggregation at the scale smoke uses.
6. **Cohort parallelism exercised on every PR.** 2-run happy/sad cohorts prove concurrent workflow submission and cohort aggregation at the scale smoke uses.
7. **Partial work persists on FAILED leaves.** Sad-path `AlwaysFailSubworker` writes a file + runs a probe command, then raises. Driver asserts the partial artifact and pre-failure WAL entry survive.

## 9. Budget

| Measure | Value |
|---|---|
| Per matrix leg | 10-min job timeout; 5-min pytest timeout |
| Leaf-subtask sandbox acquisitions per leg | 26 or 27 (researchrubrics has 26 because the sad slot skips `l_3`) |
| Leaf-subtask sandbox acquisitions per PR | 80 across 3 sandbox images |
| Dynamic child sandbox acquisitions per leg | 19 (1 happy × 11 child tasks + 1 sad × 8 child tasks) |
| Dynamic child sandbox acquisitions per PR | 57 across 3 sandbox images |
| Parent-task sandbox per run | 1 (used by parent worker + attached to by the criterion). Not additional at evaluation time. |
| Parallel workflow runs per PR | 9 (3 legs × 3-run cohort) |
| Parallel workflow runs per PR | 6 (3 legs × 2-run cohort) |
| Warm wall-clock per leg | 1–3 min (post-Docker cache) |
| Cold wall-clock per leg | up to 5 min |
