Merged
5 changes: 3 additions & 2 deletions README.md
@@ -42,8 +42,9 @@ cp .env.example .env
 # Start the stack (Postgres, API, Inngest, dashboard)
 docker compose up
 
-# Run a benchmark
-ergon benchmark run smoke_test
+# Define and run an experiment
+ergon experiment define smoke_test --worker training-stub --model stub:constant --limit 1
+ergon experiment run <experiment-id>
 ```
 
 ## Configuration
7 changes: 3 additions & 4 deletions ci/wait_for_stack.sh
@@ -26,9 +26,8 @@ check() {
 # Postgres via docker exec (host may not have pg_isready installed).
 check "postgres" "docker compose exec -T postgres pg_isready -U ergon > /dev/null 2>&1"
 check "inngest" "curl -sf http://localhost:8289/v1/events/test > /dev/null 2>&1"
-# The api has no / or /healthz route today; any HTTP response (including
-# 404) from uvicorn counts as "reachable". ``curl -s`` without ``-f``
-# returns 0 on any HTTP status; ``--connect-timeout 2`` keeps probes snappy.
-check "api" "curl -s -o /dev/null --connect-timeout 2 http://localhost:9000/ 2>/dev/null"
+# Wait for an application-level route so Uvicorn accepting a socket during
+# FastAPI lifespan startup does not race ahead of migrations/plugin setup.
+check "api" "curl -sf --connect-timeout 2 http://localhost:9000/health > /dev/null 2>&1"
 
 echo "stack up"
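The semantics of the new `api` probe — retry until an application-level route answers 2xx, so that migrations and plugin setup have actually finished — can be sketched in Python. The function name and defaults here are illustrative, not part of the repo:

```python
import time
import urllib.error
import urllib.request

def wait_for(url: str, attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll `url` until it answers 2xx, mirroring `curl -sf` in a retry loop."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if 200 <= resp.status < 300:
                    return True  # app-level route is up: lifespan finished
        except (urllib.error.URLError, OSError):
            pass  # connection refused or non-2xx: stack still starting
        time.sleep(delay)
    return False
```

A mere TCP accept would pass earlier than this; insisting on a 2xx from `/health` is what closes the race the comment describes.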
6 changes: 6 additions & 0 deletions docker-compose.yml
@@ -24,6 +24,9 @@
 # POSTGRES_PASSWORD=ergon \
 # TEST_HARNESS_SECRET=real-llm-secret \
 # OPENROUTER_API_KEY="$OPENROUTER_API_KEY" \
+# OPENAI_API_KEY="$OPENAI_API_KEY" \
+# EXA_API_KEY="$EXA_API_KEY" \
+# HF_API_KEY="$HF_API_KEY" \
 # docker compose up -d
 #
 # Observability stack (otel + jaeger) on demand:
@@ -88,6 +91,9 @@ services:
 - OTEL_SERVICE_NAME=ergon-core
 - E2B_API_KEY=${E2B_API_KEY:-}
 - OPENROUTER_API_KEY=${OPENROUTER_API_KEY:-}
+- OPENAI_API_KEY=${OPENAI_API_KEY:-}
+- EXA_API_KEY=${EXA_API_KEY:-}
+- HF_API_KEY=${HF_API_KEY:-}
 # Put /app on sys.path so editable source mounts resolve in the API
 # container while the smoke fixtures live in ergon_core.test_support.
 - PYTHONPATH=/app
2 changes: 1 addition & 1 deletion docs/architecture/02_runtime_lifecycle.md
@@ -156,7 +156,7 @@ A brief index of where runtime functions live. The architectural claims above st
 
 | Concern | File |
 | --- | --- |
-| Entry + init | `runtime/inngest/benchmark_run_start.py`, `runtime/inngest/start_workflow.py` |
+| Entry + init | `runtime/services/experiment_launch_service.py`, `runtime/inngest/start_workflow.py` |
 | Task orchestration | `runtime/inngest/execute_task.py` |
 | Task child steps | `runtime/inngest/sandbox_setup.py`, `runtime/inngest/worker_execute.py`, `runtime/inngest/persist_outputs.py` |
 | Propagation | `runtime/inngest/propagate_execution.py` |
19 changes: 8 additions & 11 deletions docs/architecture/03_providers.md
@@ -8,9 +8,7 @@ The providers layer is Ergon's boundary between runtime code and external execut
 
 | Name | Kind | Location | Freeze status | Owner |
 | --- | --- | --- | --- | --- |
-| `_BACKEND_REGISTRY` | module-level dict | `ergon_core/core/providers/generation/model_resolution.py` | Frozen shape; entries grow via registration. | Providers layer. |
 | `resolve_model_target` | function | `ergon_core/core/providers/generation/model_resolution.py` | Public, frozen signature. Returns `ResolvedModel`. | Providers layer. |
-| `register_model_backend` | function | `ergon_core/core/providers/generation/model_resolution.py` | Public, frozen signature. | Providers layer; callers are backend modules executing at import time. |
 | `BaseSandboxManager` | abstract class + singleton | `ergon_core/core/providers/sandbox/manager.py` | Shape stable; `event_sink` activation path in flux. | Providers layer. |
 | `DefaultSandboxManager` | concrete class | `ergon_core/core/providers/sandbox/manager.py` | Frozen. | Providers layer. |
 | `SWEBenchSandboxManager`, `MiniF2FSandboxManager`, `ResearchRubricsSandboxManager` | concrete subclasses | `ergon_builtins/` | Owned per benchmark; singletons. | Benchmark authors. |
@@ -19,11 +17,11 @@ The providers layer is Ergon's boundary between runtime code and external execut
 | `SandboxResourcePublisher` | class | `ergon_core/core/providers/sandbox/resource_publisher.py` | Frozen API; storage backend swappable via `ERGON_BLOB_ROOT`. | Providers layer. |
 | `TransformersModel` | `pydantic_ai.models.Model` subclass | `ergon_builtins/ergon_builtins/models/transformers_backend.py` | Frozen. | ML team (TRL training loop callers). |
 
-### 2.1 Generation registry
+### 2.1 Model target resolution
 
-`_BACKEND_REGISTRY` is a prefix-keyed dispatch table of resolver callables. `resolve_model_target` splits the target on its first colon, dispatches to the resolver, and returns a `ResolvedModel` wrapping either a `pydantic_ai.models.Model` instance or a passthrough string. Unknown prefixes fall through to a passthrough `ResolvedModel` — PydanticAI's own `infer_model` is invoked on use. Backends mutate the registry at import time; the builtins pack registers all four in a single loop at `ergon_builtins/ergon_builtins/registry.py:81`.
+`resolve_model_target` is the single dispatch point for model target strings. It splits the target on its first colon and returns a `ResolvedModel` wrapping a concrete `pydantic_ai.models.Model` instance. Unknown prefixes raise immediately instead of falling through to PydanticAI inference.
 
-The four prefixes registered today are `vllm:*` (local vLLM server via PydanticAI's `OpenAIChatModel`), `openai:*` / `anthropic:*` / `google:*` (passthrough to `infer_model`), and `transformers:*` (custom `TransformersModel` for TRL-trained checkpoints not served over vLLM).
+The supported prefixes are `vllm:<base-url>[#<model>]`, `openai-compatible:<base-url>#<model>`, and cloud provider prefixes `openai:*` / `anthropic:*` / `google:*`. Cloud provider prefixes always route through OpenRouter via PydanticAI's OpenRouter provider; they do not call direct OpenAI, Anthropic, or Google APIs.
 
 Workers are expected to hold no hardcoded SDK client constructions (`AsyncOpenAI`, `anthropic.Client`, `genai.Client`). This is an invariant (Section 4), not a coincidence, and is currently honored — enforcement is grep discipline.
 
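The first-colon split plus raise-on-unknown-prefix behavior in the rewritten prose can be sketched as follows. `ResolvedModel`'s real fields and the per-prefix constructor logic are simplified stand-ins here, not the actual `ergon_core` implementation:

```python
from dataclasses import dataclass

# Simplified stand-in for the real ResolvedModel in ergon_core.
@dataclass
class ResolvedModel:
    prefix: str
    rest: str

_KNOWN_PREFIXES = {"vllm", "openai-compatible", "openai", "anthropic", "google"}

def resolve_model_target(target: str) -> ResolvedModel:
    # Split on the first colon only, so "vllm:http://host:8000#m"
    # keeps the colons inside the base URL intact.
    prefix, sep, rest = target.partition(":")
    if not sep or prefix not in _KNOWN_PREFIXES:
        # Unknown prefixes raise immediately; no fallthrough to infer_model.
        raise ValueError(f"unknown model target prefix: {target!r}")
    return ResolvedModel(prefix=prefix, rest=rest)
```

The raise-on-unknown choice trades PydanticAI's permissive inference for an error surfaced at resolution time rather than at first model use.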
@@ -87,7 +85,7 @@ The decentralized shape means `ergon benchmark setup` iterates over whatever sub
 Worker.execute()
 |
 +-> resolve_model_target(self.model) --> ResolvedModel
-| (prefix dispatch; 4 backends + fallthrough to infer_model)
+| (explicit prefix dispatch; cloud targets route via OpenRouter)
 |
 +-> ManagerClass() (singleton; returns cached instance)
 | ManagerClass().create(sandbox_key=task_id, run_id=run_id, ...)
@@ -126,7 +124,7 @@ Movement of data across this diagram:
 ## 4. Invariants
 
 1. **One entry point to LLM resolution.** Every model reference goes through `resolve_model_target`. Enforced by grep discipline and review; no runtime check.
-2. **Backends register at import time.** `register_model_backend` must be called before any caller hits `resolve_model_target`. Enforced by the builtins pack running its registration loop at import, before any worker module imports.
+2. **Cloud provider prefixes use OpenRouter.** `openai:*`, `anthropic:*`, and `google:*` model targets are OpenRouter-hosted targets. Direct cloud SDK model routing is intentionally outside the grammar.
 3. **Singleton managers hold authoritative sandbox state.** A subclass's class-level state is the only source of truth for in-process reconnect. Enforced by `__new__` caching the instance and `get_sandbox` reading the class dict. Applies only within a single Python process; cross-process actors must use `terminate_by_sandbox_id` or provision their own sandbox.
 4. **Sandbox lifecycle is per-task.** Enforced by `create` accepting `sandbox_key` and by the worker runtime persisting `sandbox_id` on the execution row.
 5. **Sandbox lives across evaluator fan-out.** Teardown runs at the end of `check_evaluators`, not at worker completion, not in `finalize_success`. Enforced by the evaluator harness, not by the manager itself.
@@ -146,10 +144,9 @@ Movement of data across this diagram:
 
 ### 5.1 Add a new LLM backend
 
-1. Write a resolver that maps `"myprefix:foo"` to a `pydantic_ai.models.Model` instance wrapped in `ResolvedModel`.
-2. Register it in the builtins-pack registration loop so `register_model_backend` is called at import time.
-3. Ensure the builtins pack is imported before any worker that references `myprefix:*` model ids.
-4. Add an entry to `LLMProvider` and `PROVIDER_KEY_MAP` in `ergon_cli/onboarding/profile.py` so onboarding prompts for the key or server URL.
+1. Add an explicit prefix branch in `resolve_model_target` and keep the constructor logic in a sibling module under `ergon_core/core/providers/generation/`.
+2. Return a concrete `pydantic_ai.models.Model` instance wrapped in `ResolvedModel`.
+3. Add an entry to `LLMProvider` and `PROVIDER_KEY_MAP` in `ergon_cli/onboarding/profile.py` so onboarding prompts for the key or server URL.
 
 ### 5.2 Add a new sandbox manager
 
13 changes: 7 additions & 6 deletions docs/architecture/06_builtins.md
@@ -52,10 +52,11 @@ runnable — not a catalog of registered implementations.
 Rubric nesting is not supported and there are no plans to change that.
 - Third-party users primarily extend at the Criterion layer.
 
-- Model backend registry.
-  - Concrete LLM backends register via
-    `register_model_backend(prefix, resolver)` at import time.
-  - Freeze status: stable API; adding a backend is additive.
+- Model target resolution.
+  - Builtins do not register cloud model backends. Model target strings are
+    resolved centrally by `resolve_model_target` in `ergon_core`.
+  - Freeze status: stable API; adding a backend is additive inside the
+    providers layer.
 
 - ReAct toolkit composition.
   - There is one concrete ReAct worker class — `ReActWorker` (slug `react-v1`,
@@ -145,8 +146,8 @@ Benchmark loader → Task instances → Worker
 - **New worker.** Add under `ergon_builtins/workers/baselines/` if it is
   cross-benchmark; alongside the benchmark otherwise. The contract is which
   task schemas it supports.
-- **New model backend.** Call `register_model_backend(prefix, resolver)` at
-  import time; prefer short, stable prefixes.
+- **New model backend.** Add an explicit `resolve_model_target` branch in
+  `ergon_core/core/providers/generation/`; prefer short, stable prefixes.
 - **New Criterion.** Place in `ergon_builtins/evaluators/criteria/` if
   reusable, alongside the benchmark if benchmark-specific. This is the
   layer third-party users most often extend.