Merged
5 changes: 3 additions & 2 deletions README.md
@@ -42,8 +42,9 @@ cp .env.example .env
 # Start the stack (Postgres, API, Inngest, dashboard)
 docker compose up
 
-# Run a benchmark
-ergon benchmark run smoke_test
+# Define and run an experiment
+ergon experiment define smoke_test --worker training-stub --model stub:constant --limit 1
+ergon experiment run <experiment-id>
 ```
 
 ## Configuration
7 changes: 3 additions & 4 deletions ci/wait_for_stack.sh
@@ -26,9 +26,8 @@ check() {
 # Postgres via docker exec (host may not have pg_isready installed).
 check "postgres" "docker compose exec -T postgres pg_isready -U ergon > /dev/null 2>&1"
 check "inngest" "curl -sf http://localhost:8289/v1/events/test > /dev/null 2>&1"
-# The api has no / or /healthz route today; any HTTP response (including
-# 404) from uvicorn counts as "reachable". ``curl -s`` without ``-f``
-# returns 0 on any HTTP status; ``--connect-timeout 2`` keeps probes snappy.
-check "api" "curl -s -o /dev/null --connect-timeout 2 http://localhost:9000/ 2>/dev/null"
+# Wait for an application-level route so Uvicorn accepting a socket during
+# FastAPI lifespan startup does not race ahead of migrations/plugin setup.
+check "api" "curl -sf --connect-timeout 2 http://localhost:9000/health > /dev/null 2>&1"
 
 echo "stack up"
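The semantics of the new `api` probe — retry until an application-level route answers 2xx, so that migrations and plugin setup have actually finished — can be sketched in Python. The function name and defaults here are illustrative, not part of the repo:

```python
import time
import urllib.error
import urllib.request

def wait_for(url: str, attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll `url` until it answers 2xx, mirroring `curl -sf` in a retry loop."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if 200 <= resp.status < 300:
                    return True  # app-level route is up: lifespan finished
        except (urllib.error.URLError, OSError):
            pass  # connection refused or non-2xx: stack still starting
        time.sleep(delay)
    return False
```

A mere TCP accept would pass earlier than this; insisting on a 2xx from `/health` is what closes the race the comment describes.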
6 changes: 6 additions & 0 deletions docker-compose.yml
@@ -24,6 +24,9 @@
 # POSTGRES_PASSWORD=ergon \
 # TEST_HARNESS_SECRET=real-llm-secret \
 # OPENROUTER_API_KEY="$OPENROUTER_API_KEY" \
+# OPENAI_API_KEY="$OPENAI_API_KEY" \
+# EXA_API_KEY="$EXA_API_KEY" \
+# HF_API_KEY="$HF_API_KEY" \
 # docker compose up -d
 #
 # Observability stack (otel + jaeger) on demand:
@@ -88,6 +91,9 @@ services:
 - OTEL_SERVICE_NAME=ergon-core
 - E2B_API_KEY=${E2B_API_KEY:-}
 - OPENROUTER_API_KEY=${OPENROUTER_API_KEY:-}
+- OPENAI_API_KEY=${OPENAI_API_KEY:-}
+- EXA_API_KEY=${EXA_API_KEY:-}
+- HF_API_KEY=${HF_API_KEY:-}
 # Put /app on sys.path so editable source mounts resolve in the API
 # container while the smoke fixtures live in ergon_core.test_support.
 - PYTHONPATH=/app
2 changes: 1 addition & 1 deletion docs/architecture/02_runtime_lifecycle.md
@@ -156,7 +156,7 @@ A brief index of where runtime functions live. The architectural claims above st
 
 | Concern | File |
 | --- | --- |
-| Entry + init | `runtime/inngest/benchmark_run_start.py`, `runtime/inngest/start_workflow.py` |
+| Entry + init | `runtime/services/experiment_launch_service.py`, `runtime/inngest/start_workflow.py` |
 | Task orchestration | `runtime/inngest/execute_task.py` |
 | Task child steps | `runtime/inngest/sandbox_setup.py`, `runtime/inngest/worker_execute.py`, `runtime/inngest/persist_outputs.py` |
 | Propagation | `runtime/inngest/propagate_execution.py` |
19 changes: 8 additions & 11 deletions docs/architecture/03_providers.md
@@ -8,9 +8,7 @@ The providers layer is Ergon's boundary between runtime code and external execut
 
 | Name | Kind | Location | Freeze status | Owner |
 | --- | --- | --- | --- | --- |
-| `_BACKEND_REGISTRY` | module-level dict | `ergon_core/core/providers/generation/model_resolution.py` | Frozen shape; entries grow via registration. | Providers layer. |
 | `resolve_model_target` | function | `ergon_core/core/providers/generation/model_resolution.py` | Public, frozen signature. Returns `ResolvedModel`. | Providers layer. |
-| `register_model_backend` | function | `ergon_core/core/providers/generation/model_resolution.py` | Public, frozen signature. | Providers layer; callers are backend modules executing at import time. |
 | `BaseSandboxManager` | abstract class + singleton | `ergon_core/core/providers/sandbox/manager.py` | Shape stable; `event_sink` activation path in flux. | Providers layer. |
 | `DefaultSandboxManager` | concrete class | `ergon_core/core/providers/sandbox/manager.py` | Frozen. | Providers layer. |
 | `SWEBenchSandboxManager`, `MiniF2FSandboxManager`, `ResearchRubricsSandboxManager` | concrete subclasses | `ergon_builtins/` | Owned per benchmark; singletons. | Benchmark authors. |
@@ -19,11 +17,11 @@ The providers layer is Ergon's boundary between runtime code and external execut
 | `SandboxResourcePublisher` | class | `ergon_core/core/providers/sandbox/resource_publisher.py` | Frozen API; storage backend swappable via `ERGON_BLOB_ROOT`. | Providers layer. |
 | `TransformersModel` | `pydantic_ai.models.Model` subclass | `ergon_builtins/ergon_builtins/models/transformers_backend.py` | Frozen. | ML team (TRL training loop callers). |
 
-### 2.1 Generation registry
+### 2.1 Model target resolution
 
-`_BACKEND_REGISTRY` is a prefix-keyed dispatch table of resolver callables. `resolve_model_target` splits the target on its first colon, dispatches to the resolver, and returns a `ResolvedModel` wrapping either a `pydantic_ai.models.Model` instance or a passthrough string. Unknown prefixes fall through to a passthrough `ResolvedModel` — PydanticAI's own `infer_model` is invoked on use. Backends mutate the registry at import time; the builtins pack registers all four in a single loop at `ergon_builtins/ergon_builtins/registry.py:81`.
+`resolve_model_target` is the single dispatch point for model target strings. It splits the target on its first colon and returns a `ResolvedModel` wrapping a concrete `pydantic_ai.models.Model` instance. Unknown prefixes raise immediately instead of falling through to PydanticAI inference.
 
-The four prefixes registered today are `vllm:*` (local vLLM server via PydanticAI's `OpenAIChatModel`), `openai:*` / `anthropic:*` / `google:*` (passthrough to `infer_model`), and `transformers:*` (custom `TransformersModel` for TRL-trained checkpoints not served over vLLM).
+The supported prefixes are `vllm:<base-url>[#<model>]`, `openai-compatible:<base-url>#<model>`, and cloud provider prefixes `openai:*` / `anthropic:*` / `google:*`. Cloud provider prefixes always route through OpenRouter via PydanticAI's OpenRouter provider; they do not call direct OpenAI, Anthropic, or Google APIs.
 
 Workers are expected to hold no hardcoded SDK client constructions (`AsyncOpenAI`, `anthropic.Client`, `genai.Client`). This is an invariant (Section 4), not a coincidence, and is currently honored — enforcement is grep discipline.
 
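The first-colon split plus raise-on-unknown-prefix behavior in the rewritten prose can be sketched as follows. `ResolvedModel`'s real fields and the per-prefix constructor logic are simplified stand-ins here, not the actual `ergon_core` implementation:

```python
from dataclasses import dataclass

# Simplified stand-in for the real ResolvedModel in ergon_core.
@dataclass
class ResolvedModel:
    prefix: str
    rest: str

_KNOWN_PREFIXES = {"vllm", "openai-compatible", "openai", "anthropic", "google"}

def resolve_model_target(target: str) -> ResolvedModel:
    # Split on the first colon only, so "vllm:http://host:8000#m"
    # keeps the colons inside the base URL intact.
    prefix, sep, rest = target.partition(":")
    if not sep or prefix not in _KNOWN_PREFIXES:
        # Unknown prefixes raise immediately; no fallthrough to infer_model.
        raise ValueError(f"unknown model target prefix: {target!r}")
    return ResolvedModel(prefix=prefix, rest=rest)
```

The raise-on-unknown choice trades PydanticAI's permissive inference for an error surfaced at resolution time rather than at first model use.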
@@ -87,7 +85,7 @@ The decentralized shape means `ergon benchmark setup` iterates over whatever sub
 Worker.execute()
 |
 +-> resolve_model_target(self.model) --> ResolvedModel
-| (prefix dispatch; 4 backends + fallthrough to infer_model)
+| (explicit prefix dispatch; cloud targets route via OpenRouter)
 |
 +-> ManagerClass() (singleton; returns cached instance)
 | ManagerClass().create(sandbox_key=task_id, run_id=run_id, ...)
@@ -126,7 +124,7 @@ Movement of data across this diagram:
 ## 4. Invariants
 
 1. **One entry point to LLM resolution.** Every model reference goes through `resolve_model_target`. Enforced by grep discipline and review; no runtime check.
-2. **Backends register at import time.** `register_model_backend` must be called before any caller hits `resolve_model_target`. Enforced by the builtins pack running its registration loop at import, before any worker module imports.
+2. **Cloud provider prefixes use OpenRouter.** `openai:*`, `anthropic:*`, and `google:*` model targets are OpenRouter-hosted targets. Direct cloud SDK model routing is intentionally outside the grammar.
 3. **Singleton managers hold authoritative sandbox state.** A subclass's class-level state is the only source of truth for in-process reconnect. Enforced by `__new__` caching the instance and `get_sandbox` reading the class dict. Applies only within a single Python process; cross-process actors must use `terminate_by_sandbox_id` or provision their own sandbox.
 4. **Sandbox lifecycle is per-task.** Enforced by `create` accepting `sandbox_key` and by the worker runtime persisting `sandbox_id` on the execution row.
 5. **Sandbox lives across evaluator fan-out.** Teardown runs at the end of `check_evaluators`, not at worker completion, not in `finalize_success`. Enforced by the evaluator harness, not by the manager itself.
@@ -146,10 +144,9 @@ Movement of data across this diagram:
 
 ### 5.1 Add a new LLM backend
 
-1. Write a resolver that maps `"myprefix:foo"` to a `pydantic_ai.models.Model` instance wrapped in `ResolvedModel`.
-2. Register it in the builtins-pack registration loop so `register_model_backend` is called at import time.
-3. Ensure the builtins pack is imported before any worker that references `myprefix:*` model ids.
-4. Add an entry to `LLMProvider` and `PROVIDER_KEY_MAP` in `ergon_cli/onboarding/profile.py` so onboarding prompts for the key or server URL.
+1. Add an explicit prefix branch in `resolve_model_target` and keep the constructor logic in a sibling module under `ergon_core/core/providers/generation/`.
+2. Return a concrete `pydantic_ai.models.Model` instance wrapped in `ResolvedModel`.
+3. Add an entry to `LLMProvider` and `PROVIDER_KEY_MAP` in `ergon_cli/onboarding/profile.py` so onboarding prompts for the key or server URL.
 
 ### 5.2 Add a new sandbox manager
 
13 changes: 7 additions & 6 deletions docs/architecture/06_builtins.md
@@ -52,10 +52,11 @@ runnable — not a catalog of registered implementations.
 Rubric nesting is not supported and there are no plans to change that.
 - Third-party users primarily extend at the Criterion layer.
 
-- Model backend registry.
-  - Concrete LLM backends register via
-    `register_model_backend(prefix, resolver)` at import time.
-  - Freeze status: stable API; adding a backend is additive.
+- Model target resolution.
+  - Builtins do not register cloud model backends. Model target strings are
+    resolved centrally by `resolve_model_target` in `ergon_core`.
+  - Freeze status: stable API; adding a backend is additive inside the
+    providers layer.
 
 - ReAct toolkit composition.
   - There is one concrete ReAct worker class — `ReActWorker` (slug `react-v1`,
@@ -145,8 +146,8 @@ Benchmark loader → Task instances → Worker
 - **New worker.** Add under `ergon_builtins/workers/baselines/` if it is
   cross-benchmark; alongside the benchmark otherwise. The contract is which
   task schemas it supports.
-- **New model backend.** Call `register_model_backend(prefix, resolver)` at
-  import time; prefer short, stable prefixes.
+- **New model backend.** Add an explicit `resolve_model_target` branch in
+  `ergon_core/core/providers/generation/`; prefer short, stable prefixes.
 - **New Criterion.** Place in `ergon_builtins/evaluators/criteria/` if
   reusable, alongside the benchmark if benchmark-specific. This is the
   layer third-party users most often extend.