DeepFlow-research · cm2435 · Apr 26, 2026
diff --git a/.github/workflows/e2e-benchmarks.yml b/.github/workflows/e2e-benchmarks.yml
@@ -35,7 +35,7 @@ jobs:
     env:
       SMOKE_ENV: ${{ matrix.env }}
       ENABLE_TEST_HARNESS: "1"
-      ERGON_STARTUP_PLUGINS: "ergon_core.test_support.smoke_fixtures:register_smoke_fixtures"
+      ERGON_STARTUP_PLUGINS: "ergon_builtins.registry:register_builtins,tests.fixtures.smoke_components:register_smoke_fixtures"
       TEST_HARNESS_SECRET: ${{ secrets.TEST_HARNESS_SECRET || 'ci-test-harness' }}
       E2B_API_KEY: ${{ secrets.E2B_API_KEY }}
       GITHUB_PR_NUMBER: ${{ github.event.pull_request.number }}
@@ -74,7 +74,7 @@ jobs:
           # Unified compose reads these as overrides (see docker-compose.yml).
           POSTGRES_PASSWORD: ci_test
           ENABLE_TEST_HARNESS: "1"
-          ERGON_STARTUP_PLUGINS: "ergon_core.test_support.smoke_fixtures:register_smoke_fixtures"
+          ERGON_STARTUP_PLUGINS: "ergon_builtins.registry:register_builtins,tests.fixtures.smoke_components:register_smoke_fixtures"
         run: docker compose up -d --build --wait
         timeout-minutes: 5
 

diff --git a/.gitignore b/.gitignore
@@ -13,6 +13,7 @@ build/
 
 # Environment
 .env
+.logfire/
 
 # Databases
 *.db

diff --git a/Dockerfile b/Dockerfile
@@ -37,4 +37,4 @@ RUN cd ergon_cli && uv pip install --system -e "."
 
 EXPOSE 9000
 
-CMD ["uvicorn", "ergon_core.core.api.app:app", "--host", "0.0.0.0", "--port", "9000"]
+CMD ["uvicorn", "ergon_core.core.rest_api.app:app", "--host", "0.0.0.0", "--port", "9000"]
diff --git a/docker-compose.yml b/docker-compose.yml
@@ -84,7 +84,7 @@ services:
       - INNGEST_API_BASE_URL=http://inngest-dev:8288
       - ERGON_API_BASE_URL=http://api:9000
       - ENABLE_TEST_HARNESS=${ENABLE_TEST_HARNESS:-1}
-      - ERGON_STARTUP_PLUGINS=${ERGON_STARTUP_PLUGINS-ergon_core.test_support.smoke_fixtures:register_smoke_fixtures}
+      - ERGON_STARTUP_PLUGINS=${ERGON_STARTUP_PLUGINS-ergon_builtins.registry:register_builtins,tests.fixtures.smoke_components:register_smoke_fixtures}
       - TEST_HARNESS_SECRET=${TEST_HARNESS_SECRET:-local-dev}
       - ERGON_BLOB_ROOT=/tmp/ergon-blob
       - OTEL_TRACES_ENABLED=false
@@ -120,7 +120,7 @@ services:
       postgres:
         condition: service_healthy
     command: >
-      uvicorn ergon_core.core.api.app:app
+      uvicorn ergon_core.core.rest_api.app:app
       --host 0.0.0.0 --port 9000 --reload
       --reload-dir /app/ergon_core
       --reload-dir /app/ergon_builtins

diff --git a/docs/architecture/03_providers.md b/docs/architecture/03_providers.md
@@ -2,26 +2,28 @@
 
 ## 1. Purpose
 
-The providers layer is Ergon's boundary between runtime code and external execution substrates. It owns four concerns: resolving `model_id` strings to `pydantic_ai.models.Model` instances, provisioning and tearing down E2B sandboxes via per-benchmark manager subclasses, surfacing sandbox state transitions as dashboard events, and publishing worker outputs as content-addressed blobs that evaluators can re-read. Everything that crosses the process boundary (LLM API, container runtime, blob storage) is routed through this layer so the runtime, workers, and evaluators stay substrate-agnostic.
+The provider-style boundaries are Ergon's adapters between runtime code and external execution substrates. Model resolution lives in the generation registry, while sandbox infrastructure now lives under `ergon_core.core.sandbox` because it owns lifecycle, instrumentation, event emission, and artifact publishing rather than just a third-party provider adapter.
 
 ## 2. Core abstractions
 
 | Name | Kind | Location | Freeze status | Owner |
 | --- | --- | --- | --- | --- |
+| `_BACKEND_REGISTRY` | module-level dict | `ergon_core/core/providers/generation/model_resolution.py` | Frozen shape; entries grow via registration. | Providers layer. |
 | `resolve_model_target` | function | `ergon_core/core/providers/generation/model_resolution.py` | Public, frozen signature. Returns `ResolvedModel`. | Providers layer. |
-| `BaseSandboxManager` | abstract class + singleton | `ergon_core/core/providers/sandbox/manager.py` | Shape stable; `event_sink` activation path in flux. | Providers layer. |
-| `DefaultSandboxManager` | concrete class | `ergon_core/core/providers/sandbox/manager.py` | Frozen. | Providers layer. |
+| `register_model_backend` | function | `ergon_core/core/providers/generation/model_resolution.py` | Public, frozen signature. | Providers layer; callers are backend modules executing at import time. |
+| `BaseSandboxManager` | abstract class + singleton | `ergon_core/core/sandbox/manager.py` | Shape stable; `event_sink` activation path in flux. | Sandbox domain. |
+| `DefaultSandboxManager` | concrete class | `ergon_core/core/sandbox/manager.py` | Frozen. | Sandbox domain. |
 | `SWEBenchSandboxManager`, `MiniF2FSandboxManager`, `ResearchRubricsSandboxManager` | concrete subclasses | `ergon_builtins/` | Owned per benchmark; singletons. | Benchmark authors. |
-| `SandboxEventSink` | `typing.Protocol` | `ergon_core/core/providers/sandbox/event_sink.py` | Frozen protocol; activation path in flux. | Providers layer. |
-| `NoopSandboxEventSink`, `DashboardEmitterSandboxEventSink` | implementations | `ergon_core/core/providers/sandbox/event_sink.py` | Frozen. | Providers layer. |
-| `SandboxResourcePublisher` | class | `ergon_core/core/providers/sandbox/resource_publisher.py` | Frozen API; storage backend swappable via `ERGON_BLOB_ROOT`. | Providers layer. |
+| `SandboxEventSink` | `typing.Protocol` | `ergon_core/core/sandbox/event_sink.py` | Frozen protocol; activation path in flux. | Sandbox domain. |
+| `NoopSandboxEventSink`, `DashboardEmitterSandboxEventSink` | implementations | `ergon_core/core/sandbox/event_sink.py` | Frozen. | Sandbox domain. |
+| `SandboxResourcePublisher` | class | `ergon_core/core/sandbox/resource_publisher.py` | Frozen API; storage backend swappable via `ERGON_BLOB_ROOT`. | Sandbox domain. |
 | `TransformersModel` | `pydantic_ai.models.Model` subclass | `ergon_builtins/ergon_builtins/models/transformers_backend.py` | Frozen. | ML team (TRL training loop callers). |
 
-### 2.1 Model target resolution
+### 2.1 Generation registry
 
-`resolve_model_target` is the single dispatch point for model target strings. It splits the target on its first colon and returns a `ResolvedModel` wrapping a concrete `pydantic_ai.models.Model` instance. Unknown prefixes raise immediately instead of falling through to PydanticAI inference.
+`_BACKEND_REGISTRY` is a prefix-keyed dispatch table of resolver callables. `resolve_model_target` splits the target on its first colon, dispatches to the resolver, and returns a `ResolvedModel` wrapping either a `pydantic_ai.models.Model` instance or a passthrough string. Unknown prefixes fall through to a passthrough `ResolvedModel` — PydanticAI's own `infer_model` is invoked on use. Backends mutate the registry at import time; the builtins pack registers all four in a single loop at `ergon_builtins/ergon_builtins/registry.py:81`.
 
-The supported prefixes are `vllm:<base-url>[#<model>]`, `openai-compatible:<base-url>#<model>`, and cloud provider prefixes `openai:*` / `anthropic:*` / `google:*`. Cloud provider prefixes always route through OpenRouter via PydanticAI's OpenRouter provider; they do not call direct OpenAI, Anthropic, or Google APIs.
+The four prefixes registered today are `vllm:*` (local vLLM server via PydanticAI's `OpenAIChatModel`), `openai:*` / `anthropic:*` / `google:*` (passthrough to `infer_model`), and `transformers:*` (custom `TransformersModel` for TRL-trained checkpoints not served over vLLM).
 
 Workers are expected to hold no hardcoded SDK client constructions (`AsyncOpenAI`, `anthropic.Client`, `genai.Client`). This is an invariant (Section 4), not a coincidence, and is currently honored — enforcement is grep discipline.
 
@@ -85,7 +87,7 @@ The decentralized shape means `ergon benchmark setup` iterates over whatever sub
 Worker.execute()
     |
     +-> resolve_model_target(self.model)  -->  ResolvedModel
-    |       (explicit prefix dispatch; cloud targets route via OpenRouter)
+    |       (prefix dispatch; 4 backends + fallthrough to infer_model)
     |
     +-> ManagerClass()                    (singleton; returns cached instance)
     |   ManagerClass().create(sandbox_key=task_id, run_id=run_id, ...)
@@ -124,7 +126,7 @@ Movement of data across this diagram:
 ## 4. Invariants
 
 1. **One entry point to LLM resolution.** Every model reference goes through `resolve_model_target`. Enforced by grep discipline and review; no runtime check.
-2. **Cloud provider prefixes use OpenRouter.** `openai:*`, `anthropic:*`, and `google:*` model targets are OpenRouter-hosted targets. Direct cloud SDK model routing is intentionally outside the grammar.
+2. **Backends register at import time.** `register_model_backend` must be called before any caller hits `resolve_model_target`. Enforced by the builtins pack running its registration loop at import, before any worker module imports.
 3. **Singleton managers hold authoritative sandbox state.** A subclass's class-level state is the only source of truth for in-process reconnect. Enforced by `__new__` caching the instance and `get_sandbox` reading the class dict. Applies only within a single Python process; cross-process actors must use `terminate_by_sandbox_id` or provision their own sandbox.
 4. **Sandbox lifecycle is per-task.** Enforced by `create` accepting `sandbox_key` and by the worker runtime persisting `sandbox_id` on the execution row.
 5. **Sandbox lives across evaluator fan-out.** Teardown runs at the end of `check_evaluators`, not at worker completion, not in `finalize_success`. Enforced by the evaluator harness, not by the manager itself.
@@ -144,9 +146,10 @@ Movement of data across this diagram:
 
 ### 5.1 Add a new LLM backend
 
-1. Add an explicit prefix branch in `resolve_model_target` and keep the constructor logic in a sibling module under `ergon_core/core/providers/generation/`.
-2. Return a concrete `pydantic_ai.models.Model` instance wrapped in `ResolvedModel`.
-3. Add an entry to `LLMProvider` and `PROVIDER_KEY_MAP` in `ergon_cli/onboarding/profile.py` so onboarding prompts for the key or server URL.
+1. Write a resolver that maps `"myprefix:foo"` to a `pydantic_ai.models.Model` instance wrapped in `ResolvedModel`.
+2. Register it in the builtins-pack registration loop so `register_model_backend` is called at import time.
+3. Ensure the builtins pack is imported before any worker that references `myprefix:*` model ids.
+4. Add an entry to `LLMProvider` and `PROVIDER_KEY_MAP` in `ergon_cli/onboarding/profile.py` so onboarding prompts for the key or server URL.
 
 ### 5.2 Add a new sandbox manager
 

diff --git a/docs/architecture/07_testing.md b/docs/architecture/07_testing.md
@@ -25,17 +25,17 @@ Path-based, not marker-based. The local gate and the CI workflow both dispatch b
 
 Every PR runs three benchmark legs in parallel via `.github/workflows/e2e-benchmarks.yml`:
 
-| Leg | Slot 1 | Slot 2 | Slot 3 |
-|---|---|---|---|
-| `researchrubrics` | happy | happy | **sad** — `l_2` forced FAIL |
-| `minif2f` | happy | happy | happy |
-| `swebench-verified` | happy | happy | happy |
+| Leg | Slot 1 | Slot 2 |
+|---|---|---|
+| `researchrubrics` | happy | **sad** — `l_2` forced FAIL |
+| `minif2f` | happy | **sad** — `l_2` forced FAIL |
+| `swebench-verified` | happy | **sad** — `l_2` forced FAIL |
 
-**9 top-level runs per PR; 80 leaf sandbox acquisitions** (8 happy × 9 leaves + 1 sad × 8 leaves — `l_3` never provisioned because its dependency failed).
+**6 top-level runs per PR; 57 dynamic child sandbox acquisitions** (3 happy × 11 child tasks + 3 sad × 8 child tasks — `l_3` never provisions on sad runs because its dependency failed).
 
-### 3.1 Immutable 9-leaf DAG
+### 3.1 Smoke DAG
 
-Every smoke run — happy or sad — spawns exactly this graph:
+Every smoke run starts with the same 9 direct children:
 
 ```
 Diamond (4):           Line (3):               Singletons (2):
@@ -46,9 +46,18 @@ d_left   d_right
     d_join
 ```
 
-Topology is enforced by `tests/e2e/_fixtures/smoke_base/worker_base.py::SmokeWorkerBase.execute` being decorated `@typing.final`. Subclasses supply the leaf slug via `leaf_slug` and (optionally) override `_spec_for(slug, deps, desc)` to route specific slugs elsewhere — the sad-path subclass uses this to route `l_2` to a failing leaf. They cannot change the DAG itself.
+Happy-path runs route top-level `l_2` to `{env}-smoke-recursive-worker`, which plans a nested two-node line under `l_2`:
 
-The single source of truth for topology is [`tests/e2e/_fixtures/smoke_base/constants.py`](../../tests/e2e/_fixtures/smoke_base/constants.py):
+```text
+l_2
+└─ l_2_a → l_2_b
+```
+
+Top-level `l_3` depends on `l_2`, so the smoke proves dependency propagation waits for a non-leaf dynamic task before releasing downstream work. Sad-path runs route `l_2` to the failing leaf instead, so `l_3` remains blocked.
+
+Topology is enforced by `ergon_core/test_support/smoke_fixtures/smoke_base/worker_base.py::SmokeWorkerBase.execute` being decorated `@typing.final`. Subclasses supply the leaf slug via `leaf_slug` and override `_spec_for(slug, deps, desc)` only to route specific slugs elsewhere. They cannot change the direct-child DAG itself.
+
+The single source of truth for the direct-child topology is [`ergon_core/test_support/smoke_fixtures/smoke_base/constants.py`](../../ergon_core/ergon_core/test_support/smoke_fixtures/smoke_base/constants.py):
 
 ```python
 EXPECTED_SUBTASK_SLUGS = (
@@ -60,29 +69,32 @@ EXPECTED_SUBTASK_SLUGS = (
 
 ### 3.2 Fixture residency — test-only, out of `ergon_builtins`
 
-`ergon_builtins/` contains only production baselines (ReActWorker, TrainingStubWorker). All smoke workers, leaves, and criteria live under [`tests/e2e/_fixtures/`](../../tests/e2e/_fixtures/) and register into the process-level `WORKERS` / `EVALUATORS` dicts via an import side-effect in `tests/e2e/_fixtures/__init__.py`, which `tests/e2e/conftest.py` imports at session start.
+`ergon_builtins/` contains only production baselines (ReActWorker, TrainingStubWorker). All smoke workers, leaves, and criteria live under [`tests/fixtures/smoke_components/`](../../tests/fixtures/smoke_components/) and register into the process-level core component registry through `register_smoke_fixtures()`.
 
-11 registry rows total — none production:
+19 registry rows total — none production:
 
 | Slug | Kind |
 |---|---|
 | `{env}-smoke-worker` × 3 | Worker (parent) — inherits `SmokeWorkerBase` |
 | `{env}-smoke-leaf` × 3 | Worker (leaf) — inherits `BaseSmokeLeafWorker` |
-| `researchrubrics-sadpath-smoke-worker` | Worker (sad-path parent) |
-| `researchrubrics-smoke-leaf-failing` | Worker (sad-path failing leaf) |
+| `{env}-smoke-recursive-worker` × 3 | Worker (nested `l_2` parent) — inherits `RecursiveSmokeWorkerBase` |
+| `{env}-sadpath-smoke-worker` × 3 | Worker (sad-path parent) |
+| `{env}-smoke-leaf-failing` × 3 | Worker (sad-path failing leaf) |
 | `{env}-smoke-criterion` × 3 | Criterion — inherits `SmokeCriterionBase` |
+| `smoke-post-root-timing-criterion` | Criterion — second root evaluator used for timing assertions |
 
 where `{env} ∈ {researchrubrics, minif2f, swebench}`.
 
 ### 3.3 Turn persistence
 
 - Parent `SmokeWorkerBase.execute` yields **3** `GenerationTurn`s (planning → planned → awaiting) so incremental turn persistence is exercised on every run.
+- Happy-path recursive `l_2` yields **3** `GenerationTurn`s.
 - Each leaf `BaseSmokeLeafWorker.execute` yields **2** turns (attaching → done).
-- Total per happy run: **1 × 3 + 9 × 2 = 21** `GenerationTurn` rows; driver asserts on this.
+- Total per happy run: **3 + 3 + 10 × 2 = 26** `GenerationTurn` rows; driver asserts on this.
 
 ### 3.4 Inter-agent messaging
 
-Each happy-path leaf calls `CommunicationService.save_message` once on the `smoke-completion` thread (first production caller of that service). 9 `ThreadMessage` rows per happy run, sequence_num 1..9 per thread. Sad-path `l_2` raises before reaching this call — 8 messages on a sad run, with `l_2` missing.
+Each happy-path leaf calls `CommunicationService.save_message` once on the `smoke-completion` thread (first production caller of that service). The recursive `l_2` worker also sends one completion message after nested children finish. Happy runs emit 11 `ThreadMessage` rows (`9` direct slugs + `l_2_a`, `l_2_b`), sequence_num 1..11 per thread. Sad-path `l_2` raises before reaching this call and `l_3` blocks — 7 messages on a sad run, with `l_2` and `l_3` missing.
 
 ### 3.5 Sandbox-side checks
 
@@ -98,14 +110,14 @@ For each run in a cohort, the pytest driver asserts:
 
 | Channel | What it checks |
 |---|---|
-| `RunGraphNode` | 10 nodes (1 root + 9 leaves); all COMPLETED (happy) or cascade pattern (sad); `sorted(slugs) == EXPECTED_SUBTASK_SLUGS` |
-| `RunGraphEdge` | 6 expected dependency edges (diamond + line) |
-| `RunResource` | ≥ 18 rows (9 outputs + 9 probes); all with non-empty `content_hash` |
-| `GenerationTurn` | Exactly 21 rows per happy run (derived from `PARENT_TURN_COUNT + 9 × LEAF_TURN_COUNT`) |
-| `ThreadMessage` (topic `smoke-completion`) | 9 messages per happy run / 8 per sad; `sequence_num` strictly 1..N |
+| `RunGraphNode` | Happy: 12 nodes (1 root + 9 direct children + 2 nested children), all COMPLETED; sad: cascade pattern with `l_2` FAILED and `l_3` BLOCKED |
+| `RunGraphEdge` | Expected dependency edges (diamond, top-level line, nested `l_2_a → l_2_b`) |
+| `RunResource` | Happy: 20 rows (10 outputs + 10 probes); all with non-empty `content_hash` |
+| `GenerationTurn` | Exactly 26 rows per happy run |
+| `ThreadMessage` (topic `smoke-completion`) | 11 messages per happy run / 7 per sad; `sequence_num` strictly 1..N |
 | Blob store round-trip | Re-read of one probe JSON is byte-stable + parses |
 | Temporal ordering | `RunTaskExecution.started_at` of children ≥ `completed_at` of parents |
-| `RunTaskEvaluation` | Exactly 1 row; score 1.0 (happy) / 0.0 (sad); failed slug named in sad feedback |
+| `RunTaskEvaluation` | Happy: 2 root rows, both score 1.0 and created after root execution completion; sad: no successful final score |
 
 Sad-path adds: partial artifact persisted (partial_*.md exists as RunResource), pre-failure WAL entry present, `l_3` status BLOCKED/CANCELLED per RFC `static-sibling-failure-semantics`.
 
@@ -153,18 +165,18 @@ Required `data-testid` attributes: `run-status`, `task-node-{slug}` (one per `EX
 3. **Test stubs live in `tests/e2e/_fixtures/`, not `ergon_builtins/`.** Production registry (`ergon_builtins/registry_core.py`) contains only production baselines. Exception: `training_stub_worker.py` — it's a real RL-trajectory baseline, not test scaffolding; operators invoke it via CLI.
 4. **Criteria reconnect via the CriterionRuntime DI container, never via `AsyncSandbox.connect` directly.** Enforced by code inspection; the anti-pattern previously fixed by `bugs/fixed/2026-04-18-swebench-criterion-spawns-sandbox.md`.
 5. **Sandbox outlives the task until all criteria finish.** RFC `sandbox-lifetime-covers-criteria`. Smoke is the living regression test for this.
-6. **Cohort parallelism exercised on every PR.** 3-run cohorts prove concurrent workflow submission and cohort aggregation at the scale smoke uses.
+6. **Cohort parallelism exercised on every PR.** 2-run happy/sad cohorts prove concurrent workflow submission and cohort aggregation at the scale smoke uses.
 7. **Partial work persists on FAILED leaves.** Sad-path `AlwaysFailSubworker` writes a file + runs a probe command, then raises. Driver asserts the partial artifact and pre-failure WAL entry survive.
 
 ## 9. Budget
 
 | Measure | Value |
 |---|---|
 | Per matrix leg | 10-min job timeout; 5-min pytest timeout |
-| Leaf-subtask sandbox acquisitions per leg | 26 or 27 (researchrubrics has 26 because the sad slot skips `l_3`) |
-| Leaf-subtask sandbox acquisitions per PR | 80 across 3 sandbox images |
+| Dynamic child sandbox acquisitions per leg | 19 (1 happy × 11 child tasks + 1 sad × 8 child tasks) |
+| Dynamic child sandbox acquisitions per PR | 57 across 3 sandbox images |
 | Parent-task sandbox per run | 1 (used by parent worker + attached to by the criterion). Not additional at evaluation time. |
-| Parallel workflow runs per PR | 9 (3 legs × 3-run cohort) |
+| Parallel workflow runs per PR | 6 (3 legs × 2-run cohort) |
 | Warm wall-clock per leg | 1–3 min (post-Docker cache) |
 | Cold wall-clock per leg | up to 5 min |