diff --git a/docker-compose.yml b/docker-compose.yml index 153cc2be..2adb82bf 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -24,6 +24,9 @@ # POSTGRES_PASSWORD=ergon \ # TEST_HARNESS_SECRET=real-llm-secret \ # OPENROUTER_API_KEY="$OPENROUTER_API_KEY" \ +# OPENAI_API_KEY="$OPENAI_API_KEY" \ +# EXA_API_KEY="$EXA_API_KEY" \ +# HF_API_KEY="$HF_API_KEY" \ # docker compose up -d # # Observability stack (otel + jaeger) on demand: @@ -88,6 +91,9 @@ services: - OTEL_SERVICE_NAME=ergon-core - E2B_API_KEY=${E2B_API_KEY:-} - OPENROUTER_API_KEY=${OPENROUTER_API_KEY:-} + - OPENAI_API_KEY=${OPENAI_API_KEY:-} + - EXA_API_KEY=${EXA_API_KEY:-} + - HF_API_KEY=${HF_API_KEY:-} # Put /app on sys.path so editable source mounts resolve in the API # container while the smoke fixtures live in ergon_core.test_support. - PYTHONPATH=/app diff --git a/docs/architecture/03_providers.md b/docs/architecture/03_providers.md index 7a957547..89d9a90a 100644 --- a/docs/architecture/03_providers.md +++ b/docs/architecture/03_providers.md @@ -8,9 +8,7 @@ The providers layer is Ergon's boundary between runtime code and external execut | Name | Kind | Location | Freeze status | Owner | | --- | --- | --- | --- | --- | -| `_BACKEND_REGISTRY` | module-level dict | `ergon_core/core/providers/generation/model_resolution.py` | Frozen shape; entries grow via registration. | Providers layer. | | `resolve_model_target` | function | `ergon_core/core/providers/generation/model_resolution.py` | Public, frozen signature. Returns `ResolvedModel`. | Providers layer. | -| `register_model_backend` | function | `ergon_core/core/providers/generation/model_resolution.py` | Public, frozen signature. | Providers layer; callers are backend modules executing at import time. | | `BaseSandboxManager` | abstract class + singleton | `ergon_core/core/providers/sandbox/manager.py` | Shape stable; `event_sink` activation path in flux. | Providers layer. 
| | `DefaultSandboxManager` | concrete class | `ergon_core/core/providers/sandbox/manager.py` | Frozen. | Providers layer. | | `SWEBenchSandboxManager`, `MiniF2FSandboxManager`, `ResearchRubricsSandboxManager` | concrete subclasses | `ergon_builtins/` | Owned per benchmark; singletons. | Benchmark authors. | @@ -19,11 +17,11 @@ The providers layer is Ergon's boundary between runtime code and external execut | `SandboxResourcePublisher` | class | `ergon_core/core/providers/sandbox/resource_publisher.py` | Frozen API; storage backend swappable via `ERGON_BLOB_ROOT`. | Providers layer. | | `TransformersModel` | `pydantic_ai.models.Model` subclass | `ergon_builtins/ergon_builtins/models/transformers_backend.py` | Frozen. | ML team (TRL training loop callers). | -### 2.1 Generation registry +### 2.1 Model target resolution -`_BACKEND_REGISTRY` is a prefix-keyed dispatch table of resolver callables. `resolve_model_target` splits the target on its first colon, dispatches to the resolver, and returns a `ResolvedModel` wrapping either a `pydantic_ai.models.Model` instance or a passthrough string. Unknown prefixes fall through to a passthrough `ResolvedModel` — PydanticAI's own `infer_model` is invoked on use. Backends mutate the registry at import time; the builtins pack registers all four in a single loop at `ergon_builtins/ergon_builtins/registry.py:81`. +`resolve_model_target` is the single dispatch point for model target strings. It splits the target on its first colon and returns a `ResolvedModel` wrapping a concrete `pydantic_ai.models.Model` instance. Unknown prefixes raise immediately instead of falling through to PydanticAI inference. -The four prefixes registered today are `vllm:*` (local vLLM server via PydanticAI's `OpenAIChatModel`), `openai:*` / `anthropic:*` / `google:*` (passthrough to `infer_model`), and `transformers:*` (custom `TransformersModel` for TRL-trained checkpoints not served over vLLM). 
+The supported prefixes are `vllm:*`, `openai-compatible:*`, and the cloud provider prefixes `openai:*` / `anthropic:*` / `google:*`. Cloud provider prefixes always route through OpenRouter via PydanticAI's OpenRouter provider; they do not call the direct OpenAI, Anthropic, or Google APIs. Workers are expected to contain no hardcoded SDK client constructions (`AsyncOpenAI`, `anthropic.Client`, `genai.Client`). This is an invariant (Section 4), not a coincidence, and it is currently honored; enforcement is grep discipline. @@ -87,7 +85,7 @@ The decentralized shape means `ergon benchmark setup` iterates over whatever sub Worker.execute() | +-> resolve_model_target(self.model) --> ResolvedModel - | (prefix dispatch; 4 backends + fallthrough to infer_model) + | (explicit prefix dispatch; cloud targets route via OpenRouter) | +-> ManagerClass() (singleton; returns cached instance) | ManagerClass().create(sandbox_key=task_id, run_id=run_id, ...) @@ -126,7 +124,7 @@ Movement of data across this diagram: ## 4. Invariants 1. **One entry point to LLM resolution.** Every model reference goes through `resolve_model_target`. Enforced by grep discipline and review; no runtime check. -2. **Backends register at import time.** `register_model_backend` must be called before any caller hits `resolve_model_target`. Enforced by the builtins pack running its registration loop at import, before any worker module imports. +2. **Cloud provider prefixes use OpenRouter.** `openai:*`, `anthropic:*`, and `google:*` model targets are OpenRouter-hosted. Direct cloud SDK model routing is intentionally outside the grammar. 3. **Singleton managers hold authoritative sandbox state.** A subclass's class-level state is the only source of truth for in-process reconnect. Enforced by `__new__` caching the instance and `get_sandbox` reading the class dict. Applies only within a single Python process; cross-process actors must use `terminate_by_sandbox_id` or provision their own sandbox. 4.
**Sandbox lifecycle is per-task.** Enforced by `create` accepting `sandbox_key` and by the worker runtime persisting `sandbox_id` on the execution row. 5. **Sandbox lives across evaluator fan-out.** Teardown runs at the end of `check_evaluators`, not at worker completion, not in `finalize_success`. Enforced by the evaluator harness, not by the manager itself. @@ -146,10 +144,9 @@ Movement of data across this diagram: ### 5.1 Add a new LLM backend -1. Write a resolver that maps `"myprefix:foo"` to a `pydantic_ai.models.Model` instance wrapped in `ResolvedModel`. -2. Register it in the builtins-pack registration loop so `register_model_backend` is called at import time. -3. Ensure the builtins pack is imported before any worker that references `myprefix:*` model ids. -4. Add an entry to `LLMProvider` and `PROVIDER_KEY_MAP` in `ergon_cli/onboarding/profile.py` so onboarding prompts for the key or server URL. +1. Add an explicit prefix branch in `resolve_model_target` and keep the constructor logic in a sibling module under `ergon_core/core/providers/generation/`. +2. Return a concrete `pydantic_ai.models.Model` instance wrapped in `ResolvedModel`. +3. Add an entry to `LLMProvider` and `PROVIDER_KEY_MAP` in `ergon_cli/onboarding/profile.py` so onboarding prompts for the key or server URL. ### 5.2 Add a new sandbox manager diff --git a/docs/architecture/06_builtins.md b/docs/architecture/06_builtins.md index b7c082fe..43073cba 100644 --- a/docs/architecture/06_builtins.md +++ b/docs/architecture/06_builtins.md @@ -52,10 +52,11 @@ runnable — not a catalog of registered implementations. Rubric nesting is not supported and there are no plans to change that. - Third-party users primarily extend at the Criterion layer. -- Model backend registry. - - Concrete LLM backends register via - `register_model_backend(prefix, resolver)` at import time. - - Freeze status: stable API; adding a backend is additive. +- Model target resolution. 
+ - Builtins do not register cloud model backends. Model target strings are + resolved centrally by `resolve_model_target` in `ergon_core`. + - Freeze status: stable API; adding a backend is additive inside the + providers layer. - ReAct toolkit composition. - There is one concrete ReAct worker class — `ReActWorker` (slug `react-v1`, @@ -145,8 +146,8 @@ Benchmark loader → Task instances → Worker - **New worker.** Add under `ergon_builtins/workers/baselines/` if it is cross-benchmark; alongside the benchmark otherwise. The contract is which task schemas it supports. -- **New model backend.** Call `register_model_backend(prefix, resolver)` at - import time; prefer short, stable prefixes. +- **New model backend.** Add an explicit `resolve_model_target` branch in + `ergon_core/core/providers/generation/`; prefer short, stable prefixes. - **New Criterion.** Place in `ergon_builtins/evaluators/criteria/` if reusable, alongside the benchmark if benchmark-specific. This is the layer third-party users most often extend. diff --git a/docs/experiments/rq1-cli-specialism/changelog.md b/docs/experiments/rq1-cli-specialism/changelog.md new file mode 100644 index 00000000..11cb5f91 --- /dev/null +++ b/docs/experiments/rq1-cli-specialism/changelog.md @@ -0,0 +1,297 @@ +# RQ1 CLI Specialism Overnight Changelog + +## Goal + +Use the PR #39 workflow-CLI ResearchRubrics agent to produce rollout-card artifacts that support RQ1: returns remain a useful guardrail, but rollout cards preserve richer delegation and role-specialism behaviour that scalar returns discard. 
+ +## 2026-04-26 23:30 UTC+1 - Preflight + +- Worktree: `/Users/charliemasters/Desktop/synced_vm_002/ergon/.worktrees/feature/finish-agent-workflow-cli` +- Branch: `feature/finish-agent-workflow-cli` +- PR: https://github.com/DeepFlow-research/ergon/pull/39 +- Commit at start: `ae7a0a8 Finish agent workflow CLI task editing` +- PR checks: all current checks passing by `gh pr checks 39`: + - `Integration tests (Python)`: pass + - `Lint + type-check (Frontend)`: pass + - `Lint + type-check (Python)`: pass + - `Unit tests (Python)`: pass + - `smoke [minif2f]`: pass + - `smoke [researchrubrics]`: pass + - `smoke [swebench-verified]`: pass +- Local `.env`: not present in the PR worktree. Real-LLM commands source `/Users/charliemasters/Desktop/synced_vm_002/ergon/.env` without copying it. +- Required keys after sourcing main `.env`: `OPENROUTER_API_KEY`, `EXA_API_KEY`, and `E2B_API_KEY` are set. +- Local services: + - `docker compose ps` in the worktree showed no compose-owned services. + - `http://127.0.0.1:3001/` responded. + - `http://127.0.0.1:9000/` responded with HTTP 404, which still indicates a process is listening; harness fixture treats connection success as stack-up. + +## Run Log + +Runs append below. Each entry should include command, env knobs, rollout artifact path, run ID, terminal status, score notes, graph/subtask notes, and prompt/config changes. + +## 2026-04-26 23:36 UTC+1 - Preflight Smoke Blocker + +- Command: + - `ERGON_REAL_LLM=1 ERGON_REAL_LLM_BUDGET_USD=50 uv run pytest tests/real_llm/benchmarks/test_smoke_stub.py -v -s --assume-stack-up` +- Result: + - Failed during test collection before any benchmark/model spend. +- Root cause: + - `telemetry.models` imports `ergon_core.api.json_types`, which executes `ergon_core.api.__init__`. + - `ergon_core.api.__init__` eagerly imported `RunResourceView` from `api.run_resource`. + - `api.run_resource` imports `RunResourceKind` from `telemetry.models` while `telemetry.models` is partially initialized. 
- Fix:
  - Added `tests/unit/runtime/test_import_boundaries.py` as a regression test.
  - Changed `ergon_core/ergon_core/api/__init__.py` to lazily expose `RunResourceKind` and `RunResourceView` via `__getattr__`.
- Verification:
  - `uv run pytest tests/unit/runtime/test_import_boundaries.py -q` -> `1 passed`
  - `uv run ruff format ergon_core/ergon_core/api/__init__.py tests/unit/runtime/test_import_boundaries.py && uv run ruff check ergon_core/ergon_core/api/__init__.py tests/unit/runtime/test_import_boundaries.py` -> `All checks passed`
- Commit:
  - `e23c276 Fix run resource API import boundary`

## 2026-04-26 23:45 UTC+1 - Stack Rebuild

- Rebuilt the shared `ergon` compose project from the PR #39 worktree:
  - `COMPOSE_PROJECT_NAME=ergon docker compose up -d --build --wait`
- Reason:
  - The running stack was built before PR #39, so the API/Inngest runtime might not recognize `researchrubrics-workflow-cli-react`.
- Result:
  - `ergon-api-1`, `ergon-dashboard-1`, `ergon-inngest-dev-1`, and `ergon-postgres-1` are running.
  - API root returns HTTP 404 but the process is reachable; the real-LLM fixture only requires connection success.

## 2026-04-26 23:47 UTC+1 - Baseline Workflow-CLI Batch 1

- Intent:
  - Run 5 ResearchRubrics samples with the current PR #39 workflow-CLI prompt.
- Command:
  - `ERGON_REAL_LLM=1 ERGON_REAL_LLM_BUDGET_USD=50 ERGON_REAL_LLM_WORKER=researchrubrics-workflow-cli-react ERGON_REAL_LLM_LIMIT=5 uv run pytest tests/real_llm/benchmarks/test_researchrubrics.py -v -s --assume-stack-up`
- Status:
  - Failed after creating run `2626bae9-b058-4b1b-9803-8e6186468023`.
- Failure:
  - Harness endpoint `GET /api/test/read/run/2626bae9-b058-4b1b-9803-8e6186468023/state` returned HTTP 500.
  - Local DB/API inspection showed `psycopg2.errors.UndefinedColumn: column run_resources.copied_from_resource_id does not exist`.
- Root cause:
  - The long-lived local Postgres DB was stamped at Alembic head `0a1b2c3d4e5f`, but was missing the effect of the already-existing migration `a2b3c4d5e6f7_add_copied_from_resource_id.py`. This is local schema drift, not a missing migration in the branch.
- Local repair:
  - Applied idempotent local DDL:
    - `ALTER TABLE run_resources ADD COLUMN IF NOT EXISTS copied_from_resource_id UUID NULL`
    - `CREATE INDEX IF NOT EXISTS ix_run_resources_copied_from_resource_id ON run_resources (copied_from_resource_id)`
    - Added the FK constraint `fk_run_resources_copied_from_resource_id_run_resources` where it was absent.
  - Verification: information schema now reports one `copied_from_resource_id` column.
- Post-repair canary:
  - `ERGON_REAL_LLM=1 ERGON_REAL_LLM_BUDGET_USD=50 uv run pytest tests/real_llm/benchmarks/test_smoke_stub.py -v -s --assume-stack-up`
  - Result: `1 passed` in 27.15s.

## 2026-04-26 23:45 UTC+1 - Baseline Workflow-CLI Batch 1b

- Intent:
  - Retry 5 ResearchRubrics samples after local schema repair.
- Command:
  - `ERGON_REAL_LLM=1 ERGON_REAL_LLM_BUDGET_USD=50 ERGON_REAL_LLM_WORKER=researchrubrics-workflow-cli-react ERGON_REAL_LLM_LIMIT=5 uv run pytest tests/real_llm/benchmarks/test_researchrubrics.py -v -s --assume-stack-up`
- Status:
  - Passed, but not useful for headline RQ1 evidence.
- Rollout:
  - Directory: `tests/real_llm/.rollouts/20260426T224530Z-3caf7e5c-e09f-47a8-8afb-58fd2693b761/`
  - Run ID: `3caf7e5c-e09f-47a8-8afb-58fd2693b761`
  - Wall clock: 235.6s
  - Budget: $0.477609
- Findings:
  - The hardcoded `researchrubrics` benchmark loaded only 2 private/default smoke rows: `smoke-001`, `smoke-002`.
  - Graph had 2 root nodes, 0 edges, 0 child subtasks, 2 resources, 1 evaluation.
  - Worker did call `workflow inspect task-tree` once per task, but did not spawn/coordinate specialist subtasks.
  - Evaluator returned score 0.0 because the API container did not have `OPENAI_API_KEY`.
+- Fixes after analysis: + - `ResearchRubricsBenchmark._payload_from_row` now accepts vanilla dataset rows with `prompt` when `ablated_prompt` is absent. + - `tests/real_llm/benchmarks/test_researchrubrics.py` now honors `ERGON_REAL_LLM_BENCHMARK`, defaulting to `researchrubrics`. + - `docker-compose.yml` now passes `OPENAI_API_KEY`, `EXA_API_KEY`, and `HF_API_KEY` to the API container alongside the existing E2B/OpenRouter keys. + - Focused tests: `uv run pytest tests/unit/state/test_research_rubrics_benchmark.py -q` -> `10 passed`. + - Vanilla load check: `ResearchRubricsVanillaBenchmark(limit=5)` -> 5 rows loaded. + - Stack rebuilt with exported env; API container verified all provider keys present. + +## 2026-04-27 00:00 UTC+1 - Vanilla 5-Sample Workflow-CLI Batch 1 + +- Intent: + - Run the actual 5-row ScaleAI ResearchRubrics benchmark after enabling vanilla rows and backend evaluator env. +- Command: + - `ERGON_REAL_LLM=1 ERGON_REAL_LLM_BUDGET_USD=50 ERGON_REAL_LLM_BENCHMARK=researchrubrics-vanilla ERGON_REAL_LLM_WORKER=researchrubrics-workflow-cli-react ERGON_REAL_LLM_LIMIT=5 uv run pytest tests/real_llm/benchmarks/test_researchrubrics.py -v -s --assume-stack-up` +- Status: + - Run reached terminal `failed`, but pytest timed out waiting for resources/evaluations because all tasks failed before report persistence. +- Rollout: + - Directory: `tests/real_llm/.rollouts/20260426T230154Z-ab57a0df-2a6d-4174-95f5-87185f717707/` + - Run ID: `ab57a0df-2a6d-4174-95f5-87185f717707` + - Row counts: 5 graph nodes, 0 graph edges, 25 mutations, 121 context events, 10 sandbox events, 0 resources, 0 evaluations. +- Findings: + - This was the intended 5 real-row ScaleAI benchmark: five sample IDs were created. + - Behavior was rich but not successful: 116 tool calls total, including 111 `exa_search` and 5 `workflow inspect task-tree`. + - No task called `write_report_draft` or `final_result`; all failed with generic `Worker execution failed`. 
+ - Failure mode appears to be search-budget exhaustion / max-iteration behavior on large vanilla rubrics, not missing provider keys. + - No child subtasks: the workflow tool was available but graph editing was not manager-capable, and the prompt only suggested inspection/resource-copying. +- Core/harness fixes: + - `_wait_for_post_terminal_artifacts` now returns for terminal `failed`/`cancelled` runs with no running executions, so failed-before-output rollouts still dump artifacts. + - `_require_keys` now includes `openai_api_key`. + - Broke a context-event import cycle by storing context `turn_logprobs` as open JSON payloads instead of importing `TokenLogprob` from `ergon_core.api.generation`. + - Added import-boundary coverage for context models. + - Tests: `uv run pytest tests/unit/runtime/test_import_boundaries.py tests/unit/state/test_research_rubrics_benchmark.py -q` -> `12 passed`. + +## 2026-04-27 00:04 UTC+1 - Prompt Hillclimb Variant 1 + +- Prompt/tool changes: + - Workflow-CLI ReAct worker now passes `manager_capable=True` to `make_workflow_cli_tool`. + - Prompt asks level-0 tasks to create exactly three specialist child tasks before research: + - source scout + - rubric compliance checker + - synthesis reviewer + - Prompt tells non-root tasks not to create recursive children. + - Prompt caps own work to at most 6 `exa_search` calls before writing `final_output/report.md`. +- Verification: + - `uv run pytest tests/unit/state/test_research_rubrics_workers.py tests/unit/state/test_workflow_cli_tool.py -q` -> `10 passed`. + - API restarted; provider keys still present in container. 
+- Next run command: + - `ERGON_REAL_LLM=1 ERGON_REAL_LLM_BUDGET_USD=50 ERGON_REAL_LLM_BENCHMARK=researchrubrics-vanilla ERGON_REAL_LLM_WORKER=researchrubrics-workflow-cli-react ERGON_REAL_LLM_LIMIT=5 uv run pytest tests/real_llm/benchmarks/test_researchrubrics.py -v -s --assume-stack-up` +- Diagnostic run: + - Run ID: `4d3721d0-aacb-4f04-bea9-9217c0549f9e` + - Stopped pytest manually after confirming it was polluted by the async workflow bridge bug. + - Positive signal: at least one root task attempted the desired `workflow manage add-task` specialist pattern before searching: + - source scout + - rubric compliance checker + - synthesis reviewer + - Bug found: agent-side workflow manage commands called the sync CLI bridge, which used `asyncio.run()` inside an already-running event loop. API log showed `RuntimeWarning: coroutine '_handle_manage' was never awaited`. +- Core fix: + - Added `execute_workflow_command_async(...)` in `ergon_cli.commands.workflow`. + - `execute_workflow_command(...)` now remains a sync wrapper for CLI callers. + - `make_workflow_cli_tool(...)` now awaits the async executor. + - Tests: `uv run pytest tests/unit/cli/test_workflow_cli.py tests/unit/state/test_workflow_cli_tool.py -q` -> `10 passed`. + +## 2026-04-27 00:13 UTC+1 - Prompt Hillclimb Variant 1b + +- Intent: + - Re-run Variant 1 with the fixed async workflow bridge. +- Status: + - Cancelled after diagnostic success and provider failures. +- Diagnostic result: + - Run ID: `9a83787a-dac2-45a1-9d3f-823f65984716` + - Early poll showed 20 graph nodes: 5 roots + 15 level-1 specialist children. + - Each root created source scout, rubric compliance, and synthesis reviewer children. + - This is the desired RQ1 graph-specialism signal. + - However, several roots failed on provider/schema errors (`finish_reason=None`) before reports/evaluations landed; remaining children were pending/blocked. + - Cancelled run via `uv run ergon run cancel 9a83787a-dac2-45a1-9d3f-823f65984716`. 
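The async-bridge fix above can be sketched as follows. The function names mirror `execute_workflow_command_async` / `execute_workflow_command` from `ergon_cli.commands.workflow`, but the bodies are hypothetical stand-ins, not the real implementations:

```python
import asyncio

async def execute_workflow_command_async(command: str) -> str:
    """Async executor; stands in for the real dispatch into the workflow service."""
    await asyncio.sleep(0)  # placeholder for real awaitable work
    return f"ok: {command}"

def execute_workflow_command(command: str) -> str:
    """Sync wrapper for CLI callers only.

    asyncio.run() starts a fresh event loop, so calling this from code that
    is already running under an event loop raises RuntimeError and leaves
    the coroutine never awaited -- the failure seen in the API log.
    """
    return asyncio.run(execute_workflow_command_async(command))

async def workflow_tool(command: str) -> str:
    """Agent-side tool path: await the async executor directly."""
    return await execute_workflow_command_async(command)
```

CLI entry points keep calling the sync wrapper; the agent tool awaits the async executor, so no nested `asyncio.run()` occurs under the agent's live event loop.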
+ +## 2026-04-27 00:22 UTC+1 - Prompt Hillclimb Variant 1c + +- Intent: + - Keep the specialist-subtask prompt, but switch from OpenRouter Sonnet to direct OpenAI to avoid the `finish_reason=None` OpenRouter/PydanticAI failure. +- Command: + - `ERGON_REAL_LLM=1 ERGON_REAL_LLM_BUDGET_USD=50 ERGON_REAL_LLM_MODEL=openai:gpt-4o-mini ERGON_REAL_LLM_BENCHMARK=researchrubrics-vanilla ERGON_REAL_LLM_WORKER=researchrubrics-workflow-cli-react ERGON_REAL_LLM_LIMIT=5 uv run pytest tests/real_llm/benchmarks/test_researchrubrics.py -v -s --assume-stack-up` +- Status: + - Cancelled after partial artifact dump. +- Rollout: + - Directory: `tests/real_llm/.rollouts/20260426T232803Z-3b258073-ab38-4a22-ac18-766c27d8aa1e/` + - Run ID: `3b258073-ab38-4a22-ac18-766c27d8aa1e` + - Row counts: 11 graph nodes, 37 mutations, 90 context events, 3 resources, 2 evaluations. +- Findings: + - Direct OpenAI avoided the OpenRouter `finish_reason=None` issue. + - Two root tasks completed and produced evaluations: + - score `0.11382113821138211`, passed `true` + - score `0.014084507042253521`, passed `false` + - Three roots failed before final output; two failed roots created specialist children, which were blocked by parent failure. + - This is a partial "rich behavior vs return" data point: returns are low/partial, but rollout-card structure exposes role-specialist decomposition not captured by scalar return. + +## 2026-04-27 00:29 UTC+1 - Prompt Hillclimb Variant 1d + +- Intent: + - Same specialist prompt, direct OpenAI, stronger model (`openai:gpt-4o`) to improve returns while preserving graph-specialism signal. 
+- Command: + - `ERGON_REAL_LLM=1 ERGON_REAL_LLM_BUDGET_USD=50 ERGON_REAL_LLM_MODEL=openai:gpt-4o ERGON_REAL_LLM_BENCHMARK=researchrubrics-vanilla ERGON_REAL_LLM_WORKER=researchrubrics-workflow-cli-react ERGON_REAL_LLM_LIMIT=5 uv run pytest tests/real_llm/benchmarks/test_researchrubrics.py -v -s --assume-stack-up` +- Status: + - Cancelled after partial artifact dump because dynamic child tasks remained pending. +- Rollout: + - Directory: `tests/real_llm/.rollouts/20260426T233740Z-356b7189-229b-4ef4-849c-f3c87964feb4/` + - Run ID: `356b7189-229b-4ef4-849c-f3c87964feb4` + - Row counts: 20 graph nodes, 43 mutations, 62 context events, 5 resources, 4 evaluations. +- Findings: + - Best evidence so far: 5 roots, 15 specialist children, 4/5 root reports completed, 4 evaluations landed. + - Scores: + - `0.1267605633802817`, passed `true` + - `0.11382113821138211`, passed `true` + - `0.07142857142857142`, passed `false` + - `0.0`, passed `false` + - Dynamic children remained `pending` rather than being scheduled after creation. +- Core fix: + - `WorkflowService.add_task` now emits `task/ready` for the created dynamic node after commit. + - Added an injectable task-ready dispatcher and unit test coverage. + - Tests: `uv run pytest tests/unit/runtime/test_workflow_service.py tests/unit/cli/test_workflow_cli.py tests/unit/state/test_workflow_cli_tool.py -q` -> `22 passed`. + +## 2026-04-27 00:38 UTC+1 - Prompt Hillclimb Variant 1e + +- Intent: + - Same GPT-4o specialist prompt, now with dynamic child scheduling fixed. +- Command: + - `ERGON_REAL_LLM=1 ERGON_REAL_LLM_BUDGET_USD=50 ERGON_REAL_LLM_MODEL=openai:gpt-4o ERGON_REAL_LLM_BENCHMARK=researchrubrics-vanilla ERGON_REAL_LLM_WORKER=researchrubrics-workflow-cli-react ERGON_REAL_LLM_LIMIT=5 uv run pytest tests/real_llm/benchmarks/test_researchrubrics.py -v -s --assume-stack-up` +- Status: + - Pytest artifact-dump wrapper passed, but run terminal status is `failed`. 
- Rollout:
  - Directory: `tests/real_llm/.rollouts/20260426T233920Z-0700b668-a640-49f2-80f9-a5c87bc160a9/`
  - Run ID: `0700b668-a640-49f2-80f9-a5c87bc160a9`
  - Row counts: 20 graph nodes, 70 mutations, 257 context events, 5 resources, 5 evaluations, 20 executions.
  - Run summary: `final_score=0.7134802212615627`, `normalized_score=0.14269604425231255`, `evaluators_count=5`.
- Findings:
  - Scheduling fix worked: Inngest logs show dynamic `task/ready` events for child `node_id`s and `task-execute` initialized for those children.
  - Graph-specialism signal preserved: 5 roots and 15 specialist children.
  - Returns improved versus prior failed/partial variants: all 5 root tasks completed and all 5 root evaluations landed; 1/5 evaluations passed.
  - Remaining failure mode: most specialist children started and then failed generically (`Worker execution failed`), causing the overall run to fail even though root reports/evaluations landed.
  - Root cause for child recursion: the prompt told agents to inspect `task-tree`; child agents can see other level-0 roots in that output and at least one child incorrectly called `manage add-task`.
- Prompt fix for next run:
  - Delegation decision now uses only `workflow("inspect task-workspace --format json")` and `task_workspace.task.level`.
  - Prompt explicitly says to ignore level-0 tasks shown elsewhere in task-tree.
  - Non-root specialist children are told not to call `workflow("manage add-task")`, to use at most 2 workflow inspections and 3 `exa_search` calls, and to write `final_output/report.md`.
- Verification:
  - Red test first: `uv run pytest tests/unit/state/test_research_rubrics_workers.py::TestResearcherWorker::test_workflow_cli_prompt_uses_current_task_level_for_delegation -q` failed on the missing `task-workspace --format json` instruction.
  - Green tests: `uv run pytest tests/unit/state/test_research_rubrics_workers.py tests/unit/state/test_workflow_cli_tool.py -q` -> `11 passed`.
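The Variant 1d scheduling fix (emit `task/ready` only after commit, via an injectable dispatcher) can be sketched as below. Class and field names are illustrative, not the real `WorkflowService` API:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class WorkflowServiceSketch:
    # Injectable dispatcher: production wires this to the Inngest
    # task/ready event sender; unit tests inject a recorder.
    dispatch_task_ready: Optional[Callable[[str], None]] = None
    committed_nodes: list = field(default_factory=list)

    def add_task(self, slug: str) -> str:
        node_id = f"node-{slug}"
        # ... write the dynamic node via the graph repository ...
        self.committed_nodes.append(node_id)  # stands in for the commit
        # Emit task/ready only after commit so the executor never races an
        # uncommitted node; without this, dynamic children stay pending
        # forever (the Variant 1d failure mode).
        if self.dispatch_task_ready is not None:
            self.dispatch_task_ready(node_id)
        return node_id

# Test-style usage: inject a recording dispatcher.
ready_events = []
svc = WorkflowServiceSketch(dispatch_task_ready=ready_events.append)
svc.add_task("source-scout")
```

The default `None` dispatcher keeps unit tests hermetic while production injects the real event sender.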
+ +## 2026-04-27 00:55 UTC+1 - Prompt Hillclimb Variant 1f + +- Intent: + - Same GPT-4o specialist prompt, but with delegation keyed to the current task workspace rather than global task-tree rows. +- Command: + - `ERGON_REAL_LLM=1 ERGON_REAL_LLM_BUDGET_USD=50 ERGON_REAL_LLM_MODEL=openai:gpt-4o ERGON_REAL_LLM_BENCHMARK=researchrubrics-vanilla ERGON_REAL_LLM_WORKER=researchrubrics-workflow-cli-react ERGON_REAL_LLM_LIMIT=5 uv run pytest tests/real_llm/benchmarks/test_researchrubrics.py -v -s --assume-stack-up` +- Status: + - Pytest artifact-dump wrapper passed, but run terminal status is `failed`. +- Rollout: + - Directory: `tests/real_llm/.rollouts/20260426T234424Z-7fc055f5-03c3-4cab-8117-04e844696482/` + - Run ID: `7fc055f5-03c3-4cab-8117-04e844696482` + - Row counts: 20 graph nodes, 70 mutations, 235 context events, 5 resources, 5 evaluations, 20 executions. + - Run summary: `final_score=0.7597894539417135`, `normalized_score=0.1519578907883427`, `evaluators_count=5`. +- Findings: + - Best overnight evidence so far. + - Graph-specialism signal: 5 roots created exactly 15 specialist children; `manage add-task` appears exactly 15 times and no recursive child creation was observed. + - Return guardrail: all 5 root tasks completed, all 5 root evaluations landed, and the aggregate normalized score improved slightly over 1e (`0.1519578907883427` vs `0.14269604425231255`). + - Specialist execution improved but remains noisy: 5/15 children completed, 10/15 failed with generic `Worker execution failed`, so the run-level status is still `failed`. + - This supports the RQ1 story: the scalar terminal status is poor, but the rollout card exposes a stable specialist-delegation pattern, role-specific child descriptions, root report completion, and recoverable child-worker behavior. 
+- Backend harness endpoint check: + - `GET http://127.0.0.1:9000/api/test/read/run/7fc055f5-03c3-4cab-8117-04e844696482/state` returned HTTP 200 with `status=failed`, `graph_nodes=20`, `mutations=70`, `evaluations=5`, `executions=20`, `resource_count=5`, `context_event_count=235`. + - The same path on dashboard port `3001` returned 404; the harness route is backend API, not dashboard. + +## Morning Handoff Notes + +- Best variant: Prompt Hillclimb Variant 1f. +- Best artifact path: `tests/real_llm/.rollouts/20260426T234424Z-7fc055f5-03c3-4cab-8117-04e844696482/` +- Candidate RQ1 headline evidence: + - Returns/status alone: run is `failed`, but all 5 root tasks completed and all 5 evaluations landed. + - Rollout-card structure: 5 root tasks, exactly 15 specialist child tasks, 70 graph mutations, 235 context events. + - Specialism behavior: root tasks consistently decomposed into source-scout, rubric-checker/compliance, and synthesis-reviewer roles. + - Cross-community analysis hook: this single rollout card supports post-hoc role-diversity / worker-specialism measurements that are invisible in terminal status or scalar return. +- Main residual issue: + - Dynamic specialist children now schedule and some complete, but child failures still propagate run failure. Next core-code direction would be either (a) make advisory child tasks non-fatal for parent benchmark return, or (b) harden child-worker prompting/tooling so specialist children reliably write `final_output/report.md`. + +## 2026-04-27 10:05 UTC+1 - Model Resolution Refactor + +- Intent: + - Make Ergon cloud model targets (`openai:*`, `anthropic:*`, `google:*`) route through OpenRouter instead of direct provider APIs or PydanticAI fallback inference. +- Changes: + - Upgraded `pydantic-ai` from `0.7.2` to `0.8.1`, the latest resolvable version in this environment. + - Centralized dispatch in `ergon_core/core/providers/generation/model_resolution.py`. 
+ - Added `openrouter.py` for OpenRouter-hosted cloud targets and `openai_compatible.py` for `vllm:` plus `openai-compatible:` endpoint targets. + - Removed the builtins model-backend registration path and the old `cloud_passthrough.py` / `vllm_backend.py` modules. +- Note: + - The installed PydanticAI version exposes `OpenRouterProvider` but not `OpenRouterModel`; the implementation uses `OpenAIChatModel(..., provider=OpenRouterProvider(...))`, which gives the desired OpenRouter routing semantics. + + diff --git a/docs/real-llm-rollout-harness.md b/docs/real-llm-rollout-harness.md index ad49aead..d28cf49f 100644 --- a/docs/real-llm-rollout-harness.md +++ b/docs/real-llm-rollout-harness.md @@ -195,7 +195,7 @@ async def test_researchrubrics_rollout( "--evaluator", "research-rubric", "--model", os.environ.get( "ERGON_REAL_LLM_MODEL", - "openrouter:anthropic/claude-sonnet-4.6", + "anthropic:claude-sonnet-4.6", ), "--limit", "1", ], @@ -215,16 +215,13 @@ async def test_researchrubrics_rollout( ## Spike results -**1. OpenRouter model routing — works out of the box.** -`pydantic_ai.models.infer_model("openrouter:anthropic/claude-sonnet-4.6")` -resolves to an `OpenAIModel` backed by -`pydantic_ai.providers.openrouter.OpenRouterProvider`. The only -requirement is `OPENROUTER_API_KEY` in the process env, which -`settings.py:82-83` already exports from `settings.openrouter_api_key`. -`resolve_model_target`'s fallback branch passes `openrouter:*` strings -straight through to pydantic-ai. **No backend registration needed.** An -optional one-line `"openrouter": resolve_cloud` entry in -`MODEL_BACKENDS` is nice-to-have for symmetry, not required. +**1. OpenRouter model routing.** +`resolve_model_target("anthropic:claude-sonnet-4.6")` resolves to a +PydanticAI chat model backed by +`pydantic_ai.providers.openrouter.OpenRouterProvider`. 
Cloud provider +prefixes (`openai:`, `anthropic:`, `google:`) are OpenRouter-hosted in +Ergon; use `OPENROUTER_API_KEY` in the process env and do not route +through direct provider APIs. **2. Exa inside the sandbox — confirmed not wired.** The plumbing exists but nothing populates it: @@ -342,10 +339,9 @@ Files: `write_manifest`, `_rollout_dir`. 2. `tests/real_llm/benchmarks/test_researchrubrics.py` — the 30-line trigger above. -3. *(optional)* `ergon_builtins/registry_core.py` — one-line - `"openrouter": resolve_cloud` entry in `MODEL_BACKENDS` for - symmetry with the other cloud prefixes. Not required — pydantic-ai - handles the prefix natively via the fallback branch. +3. Model targets resolve centrally in `resolve_model_target`; use + provider-prefixed targets such as `anthropic:claude-sonnet-4.6`. + Cloud provider prefixes route through OpenRouter. Estimated effort: **half a day** on top of the pre-work PR. diff --git a/docs/superpowers/plans/2026-04-26-finish-agent-workflow-cli.md b/docs/superpowers/plans/2026-04-26-finish-agent-workflow-cli.md new file mode 100644 index 00000000..23ad8eef --- /dev/null +++ b/docs/superpowers/plans/2026-04-26-finish-agent-workflow-cli.md @@ -0,0 +1,100 @@ +# Finish Agent Workflow CLI Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Finish `ergon workflow` as an agent-facing CLI for task editing and resource copying in one PR off `main`. + +**Architecture:** Extend the already-merged V1 instead of replacing it. Keep scoped reads and mutation policy in `WorkflowService`, command parsing/rendering in `ergon_cli.commands.workflow`, and model-facing scope injection in `workflow_cli_tool`. All commands stay current-run/current-node scoped unless an injected manager-capable context explicitly permits broader graph edits. 
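A minimal sketch of the scoping rule in the Architecture note above: leaf scope by default, graph edits only under a manager-capable context. The command spellings, prefix list, and error messages here are assumptions; the real checks live in `workflow_cli_tool`:

```python
# Hypothetical graph-edit command prefixes; real spellings may differ.
GRAPH_EDIT_PREFIXES = (
    "manage add-task",
    "manage add-edge",
    "manage update-task-description",
    "manage restart-task",
    "manage abandon-task",
)

def check_workflow_command(command: str, manager_capable: bool) -> None:
    """Reject commands the injected scope does not permit."""
    # Multiline commands are rejected outright (Task 3 behavior).
    if "\n" in command:
        raise ValueError("multiline commands are rejected")
    # Leaf agents may inspect and materialize visible resources; only a
    # manager-capable context may edit the graph.
    if command.startswith(GRAPH_EDIT_PREFIXES) and not manager_capable:
        raise PermissionError("graph edits require a manager-capable context")
```

Under this sketch, `inspect ...` and `manage materialize-resource ...` pass for leaf agents, while `manage add-task` requires `manager_capable=True`.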
+ +**Tech Stack:** Python, argparse, SQLModel, existing `WorkflowGraphRepository`, existing run graph tables, pydantic DTOs, pytest. + +--- + +## Current Baseline + +Already merged: + +- `ergon workflow ...` top-level command. +- `WorkflowService` with task/resource inspection and `materialize_resource`. +- `workflow(command)` pydantic-ai wrapper with injected run/node/execution/sandbox scope. +- ResearchRubrics workflow worker registration. + +## Implementation Tasks + +### Task 1: Real Task Editing Commands + +**Files:** +- Modify: `ergon_core/ergon_core/core/runtime/services/workflow_dto.py` +- Modify: `ergon_core/ergon_core/core/runtime/services/workflow_service.py` +- Modify: `ergon_cli/ergon_cli/commands/workflow.py` +- Test: `tests/unit/runtime/test_workflow_service.py` +- Test: `tests/unit/cli/test_workflow_cli.py` + +- [ ] Add a `WorkflowMutationRef` DTO with `action`, `dry_run`, `node`, `edge`, `message`, and `suggested_commands`. +- [ ] Add service methods for `add_task`, `add_edge`, `update_task_description`, `restart_task`, and `abandon_task`. +- [ ] Use `WorkflowGraphRepository` for graph writes and mutation logging. +- [ ] Keep `--dry-run` behavior identical to real command validation but without writes. +- [ ] Add CLI parser arguments for task slug, description, worker, source/target, and status fields. +- [ ] Add text and JSON renderers for mutation results. +- [ ] Verify with focused unit tests before moving on. + +### Task 2: Resource Copying Completion + +**Files:** +- Modify: `ergon_core/ergon_core/core/runtime/services/workflow_dto.py` +- Modify: `ergon_core/ergon_core/core/runtime/services/workflow_service.py` +- Modify: `ergon_cli/ergon_cli/commands/workflow.py` +- Test: `tests/unit/runtime/test_workflow_service.py` +- Test: `tests/unit/cli/test_workflow_cli.py` + +- [ ] Add `inspect resource-location`. +- [ ] Add `inspect task-workspace`. 
+- [ ] Harden `materialize-resource` destination handling: reject absolute paths, `..`, and paths outside `/workspace`. +- [ ] Preserve source resource bytes and row unchanged. +- [ ] Ensure copied resource rows use `RunResourceKind.IMPORT`, `copied_from_resource_id`, and metadata with source resource, source task, and sandbox destination. +- [ ] Add JSON/text outputs for resource location and task workspace. +- [ ] Verify with unit tests and one integration-style sandbox-manager-injected test. + +### Task 3: Agent Wrapper Permissions + +**Files:** +- Modify: `ergon_builtins/ergon_builtins/tools/workflow_cli_tool.py` +- Test: `tests/unit/state/test_workflow_cli_tool.py` + +- [ ] Add a permission mode to the wrapper: leaf agents can inspect and materialize visible resources; manager-capable agents can use graph edit commands. +- [ ] Reject user-supplied scope/context flags before command execution. +- [ ] Reject multiline commands. +- [ ] Return structured, model-readable failure strings instead of leaking tracebacks. +- [ ] Verify wrapper tests for allowed inspect, allowed materialize, denied graph edit, and allowed manager graph edit. + +### Task 4: Acceptance Coverage + +**Files:** +- Modify: existing smoke fixture workers only as needed. +- Modify: existing E2E assertions only as needed. +- Test: focused unit tests plus existing smoke tests. + +- [ ] Ensure one deterministic no-LLM smoke path calls `workflow("inspect task-tree")`. +- [ ] Ensure one deterministic no-LLM smoke path calls `workflow("inspect resource-list --scope input")`. +- [ ] Ensure one deterministic no-LLM smoke path dry-runs `manage materialize-resource`. +- [ ] Keep real-LLM rollout optional, using `researchrubrics-workflow-cli-react`. +- [ ] Run focused workflow tests, Python unit tests touched by runtime changes, frontend contract generation if schemas change, and CI-fast-compatible checks. 
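+The destination hardening in Task 2 (reject absolute paths, `..`, and anything escaping `/workspace`) can be sketched as follows; the function name and error messages are illustrative assumptions, not the `WorkflowService` implementation:

```python
from pathlib import PurePosixPath

# Validate an agent-supplied materialize-resource destination. Only the
# "/workspace" root comes from the task text above; the rest is a sketch.
def validate_destination(dest: str) -> str:
    path = PurePosixPath(dest)
    if path.is_absolute():
        raise ValueError("destination must be relative to /workspace")
    if ".." in path.parts:
        raise ValueError("destination must not contain '..'")
    resolved = PurePosixPath("/workspace") / path
    # Belt and braces: the joined path must still live under /workspace.
    if resolved.parts[:2] != ("/", "workspace"):
        raise ValueError("destination escapes /workspace")
    return str(resolved)


print(validate_destination("final_output/report.md"))
# → /workspace/final_output/report.md
```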
+ +## Verification Commands + +Run incrementally: + +```bash +uv run pytest tests/unit/runtime/test_workflow_service.py -v +uv run pytest tests/unit/cli/test_workflow_cli.py -v +uv run pytest tests/unit/state/test_workflow_cli_tool.py -v +``` + +Before PR: + +```bash +uv run pytest tests/unit/runtime/test_workflow_service.py tests/unit/cli/test_workflow_cli.py tests/unit/state/test_workflow_cli_tool.py -v +uv run pytest tests/unit/runtime tests/unit/cli tests/unit/state -q +pnpm --dir ergon-dashboard run typecheck +``` + diff --git a/ergon_builtins/AGENTS.md b/ergon_builtins/AGENTS.md index cfb169a3..98174ada 100644 --- a/ergon_builtins/AGENTS.md +++ b/ergon_builtins/AGENTS.md @@ -108,16 +108,16 @@ and is instantiated directly by `researchrubrics-researcher`; it is not in --- -## Model backends (`MODEL_BACKENDS` in registry_core.py) +## Model targets (`resolve_model_target`) | prefix | file | notes | |---|---|---| -| `vllm:` | `models/vllm_backend.py` | Points at a running vLLM server; supports logprobs. | -| `openai:`, `anthropic:`, `google:` | `models/cloud_passthrough.py` | Passes through to pydantic-ai's provider. No logprobs. | -| *(no prefix)* | fallthrough | Handed to pydantic-ai's `infer_model` — may pick a default or fail. | +| `vllm:[#]` | `ergon_core/core/providers/generation/openai_compatible.py` | Points at a running vLLM server; supports logprobs. | +| `openai-compatible:#` | `ergon_core/core/providers/generation/openai_compatible.py` | Generic OpenAI-compatible endpoints such as Ollama. | +| `openai:`, `anthropic:`, `google:` | `ergon_core/core/providers/generation/openrouter.py` | Always routed through OpenRouter, not direct cloud APIs. | Default when `--model` is omitted: `openai:gpt-4o` -(`ergon_core/core/providers/generation/model_resolution.py:57`). +(`ergon_core/core/providers/generation/model_resolution.py`). 
--- diff --git a/ergon_builtins/ergon_builtins/benchmarks/researchrubrics/benchmark.py b/ergon_builtins/ergon_builtins/benchmarks/researchrubrics/benchmark.py index 6d343710..b9b11107 100644 --- a/ergon_builtins/ergon_builtins/benchmarks/researchrubrics/benchmark.py +++ b/ergon_builtins/ergon_builtins/benchmarks/researchrubrics/benchmark.py @@ -107,10 +107,11 @@ def _payload_from_row( row: Mapping[str, Any], # slopcop: ignore[no-typing-any] ) -> ResearchRubricsTaskPayload: """Convert one raw HuggingFace row into the benchmark payload schema.""" + ablated_prompt = row.get("ablated_prompt") or row["prompt"] return ResearchRubricsTaskPayload( sample_id=row["sample_id"], domain=str(row.get("domain", "")), - ablated_prompt=row["ablated_prompt"], + ablated_prompt=ablated_prompt, rubrics=[ RubricCriterion( criterion=r["criterion"], diff --git a/ergon_builtins/ergon_builtins/models/cloud_passthrough.py b/ergon_builtins/ergon_builtins/models/cloud_passthrough.py deleted file mode 100644 index e7620a1d..00000000 --- a/ergon_builtins/ergon_builtins/models/cloud_passthrough.py +++ /dev/null @@ -1,14 +0,0 @@ -"""Cloud passthrough: resolves ``openai:``, ``anthropic:``, etc. 
by passing through to PydanticAI.""" - -from ergon_core.core.providers.generation.model_resolution import ResolvedModel - - -def resolve_cloud( - target: str, - *, - model_name: str | None = None, - policy_version: str | None = None, - api_key: str | None = None, -) -> ResolvedModel: - """Pass cloud model targets through to PydanticAI's infer_model.""" - return ResolvedModel(model=target, supports_logprobs=False) diff --git a/ergon_builtins/ergon_builtins/models/vllm_backend.py b/ergon_builtins/ergon_builtins/models/vllm_backend.py deleted file mode 100644 index 0488f5a1..00000000 --- a/ergon_builtins/ergon_builtins/models/vllm_backend.py +++ /dev/null @@ -1,58 +0,0 @@ -"""vLLM backend: resolves ``vllm:http://...`` targets to OpenAI-compatible PydanticAI models.""" - -import json -import logging -import urllib.error -import urllib.request - -from ergon_core.core.providers.generation.model_resolution import ResolvedModel -from pydantic_ai.models.openai import OpenAIModel as OpenAIChatModel -from pydantic_ai.providers.openai import OpenAIProvider - -logger = logging.getLogger(__name__) - - -def resolve_vllm( - target: str, - *, - model_name: str | None = None, - policy_version: str | None = None, - api_key: str | None = None, -) -> ResolvedModel: - """Resolve a ``vllm:http://...`` target to a PydanticAI model.""" - endpoint = target[5:].rstrip("/") - resolved_name = model_name or _discover_model_name(endpoint) - provider = OpenAIProvider( - base_url=f"{endpoint}/v1", - api_key=api_key or "not-needed", - ) - model = OpenAIChatModel(model_name=resolved_name, provider=provider) - logger.info( - "Resolved vLLM model: endpoint=%s model_name=%s policy_version=%s", - endpoint, - resolved_name, - policy_version, - ) - return ResolvedModel(model=model, policy_version=policy_version, supports_logprobs=True) - - -def _discover_model_name(endpoint: str) -> str: - """Query ``/v1/models`` to discover the served model name.""" - url = f"{endpoint}/v1/models" - try: - with 
urllib.request.urlopen(url, timeout=5) as resp: - body = json.loads(resp.read()) - models = body.get("data", []) - if models: - name = models[0].get("id", "default") - logger.info("Discovered vLLM model name: %s", name) - return name - except ( - urllib.error.HTTPError, - urllib.error.URLError, - TimeoutError, - OSError, - json.JSONDecodeError, - ): - logger.warning("Could not discover vLLM model name from %s, using 'default'", url) - return "default" diff --git a/ergon_builtins/ergon_builtins/registry.py b/ergon_builtins/ergon_builtins/registry.py index aa340e2f..f91f9ddb 100644 --- a/ergon_builtins/ergon_builtins/registry.py +++ b/ergon_builtins/ergon_builtins/registry.py @@ -8,10 +8,6 @@ import structlog from ergon_core.api import Benchmark, Evaluator, Worker -from ergon_core.core.providers.generation.model_resolution import ( - ResolvedModel, - register_model_backend, -) from ergon_core.core.providers.sandbox.manager import BaseSandboxManager from ergon_builtins.registry_core import ( @@ -20,9 +16,6 @@ from ergon_builtins.registry_core import ( EVALUATORS as _core_evaluators, ) -from ergon_builtins.registry_core import ( - MODEL_BACKENDS as _core_model_backends, -) from ergon_builtins.registry_core import ( SANDBOX_MANAGERS as _core_sandbox_managers, ) @@ -42,19 +35,6 @@ EVALUATORS: dict[str, type[Evaluator]] = {**_core_evaluators} SANDBOX_MANAGERS: dict[str, type[BaseSandboxManager]] = {**_core_sandbox_managers} -_model_backends: dict[str, Callable[..., ResolvedModel]] = {**_core_model_backends} - -# -- Capability: local-models ---------------------------------------------- - -try: - from ergon_builtins.registry_local_models import ( - MODEL_BACKENDS as _local_model_backends, - ) - - _model_backends.update(_local_model_backends) -except ImportError: - log.info("ergon-builtins[local-models] not installed; local transformers inference unavailable") - # -- Capability: data ------------------------------------------------------ try: @@ -80,11 +60,6 @@ 
"ergon-builtins[data] not installed; gdpeval and researchrubrics benchmarks unavailable" ) -# -- Register model backends ----------------------------------------------- - -for prefix, resolver in _model_backends.items(): - register_model_backend(prefix, resolver) - # -- Install hints for slugs that require optional capabilities ------------- INSTALL_HINTS: dict[str, str] = { diff --git a/ergon_builtins/ergon_builtins/registry_core.py b/ergon_builtins/ergon_builtins/registry_core.py index 7be86868..67ea3697 100644 --- a/ergon_builtins/ergon_builtins/registry_core.py +++ b/ergon_builtins/ergon_builtins/registry_core.py @@ -10,7 +10,6 @@ from uuid import UUID from ergon_core.api import Benchmark, Evaluator, Worker -from ergon_core.core.providers.generation.model_resolution import ResolvedModel from ergon_core.core.providers.sandbox.manager import BaseSandboxManager from ergon_builtins.benchmarks.gdpeval.rubric import StagedRubric @@ -25,8 +24,6 @@ ) from ergon_builtins.benchmarks.swebench_verified.toolkit import SWEBenchToolkit from ergon_builtins.evaluators.rubrics.swebench_rubric import SWEBenchRubric -from ergon_builtins.models.cloud_passthrough import resolve_cloud -from ergon_builtins.models.vllm_backend import resolve_vllm from ergon_builtins.workers.baselines.react_prompts import ( MINIF2F_SYSTEM_PROMPT, SWEBENCH_SYSTEM_PROMPT, @@ -178,10 +175,3 @@ def _swebench_react( "minif2f": Path(__file__).parent / "benchmarks/minif2f/sandbox", "swebench-verified": Path(__file__).parent / "benchmarks/swebench_verified/sandbox", } - -MODEL_BACKENDS: dict[str, Callable[..., ResolvedModel]] = { - "vllm": resolve_vllm, - "openai": resolve_cloud, - "anthropic": resolve_cloud, - "google": resolve_cloud, -} diff --git a/ergon_builtins/ergon_builtins/registry_local_models.py b/ergon_builtins/ergon_builtins/registry_local_models.py deleted file mode 100644 index e45abd5d..00000000 --- a/ergon_builtins/ergon_builtins/registry_local_models.py +++ /dev/null @@ -1,16 +0,0 @@ 
-"""Components that require the [local-models] capability (torch + transformers). - -Eager, fully-typed imports. This module will fail to import if torch/ -transformers/outlines are not installed — that's by design. The composition -layer in registry.py handles the ImportError gracefully. -""" - -from collections.abc import Callable - -from ergon_core.core.providers.generation.model_resolution import ResolvedModel - -from ergon_builtins.models.transformers_backend import resolve_transformers - -MODEL_BACKENDS: dict[str, Callable[..., ResolvedModel]] = { - "transformers": resolve_transformers, -} diff --git a/ergon_builtins/ergon_builtins/tools/workflow_cli_tool.py b/ergon_builtins/ergon_builtins/tools/workflow_cli_tool.py index 0b540b6c..f15a0984 100644 --- a/ergon_builtins/ergon_builtins/tools/workflow_cli_tool.py +++ b/ergon_builtins/ergon_builtins/tools/workflow_cli_tool.py @@ -1,20 +1,29 @@ from collections.abc import Awaitable, Callable +import shlex from typing import Protocol from uuid import UUID from ergon_cli.commands.workflow import ( WorkflowCommandContext, WorkflowCommandOutput, - execute_workflow_command, + execute_workflow_command_async, ) from ergon_core.api.worker_context import WorkerContext from ergon_core.core.persistence.shared.db import get_session from ergon_core.core.runtime.services.workflow_service import WorkflowService from sqlmodel import Session +_MANAGER_ONLY_ACTIONS = { + "add-task", + "add-edge", + "update-task-description", + "restart-task", + "abandon-task", +} + class WorkflowCommandExecutor(Protocol): - def __call__( + async def __call__( self, command: str, *, @@ -29,9 +38,10 @@ def make_workflow_cli_tool( worker_context: WorkerContext, sandbox_task_key: UUID, benchmark_type: str, - execute_command: WorkflowCommandExecutor = execute_workflow_command, + execute_command: WorkflowCommandExecutor = execute_workflow_command_async, session_factory: Callable[[], Session] = get_session, service_factory: Callable[[], WorkflowService] = 
WorkflowService, + manager_capable: bool = False, ) -> Callable[[str], Awaitable[str]]: """Build an agent-facing ``workflow(command)`` callable. @@ -44,19 +54,25 @@ async def workflow(command: str) -> str: """Inspect workflow topology/resources or dry-run workflow management commands.""" if worker_context.node_id is None: raise ValueError("workflow tool requires WorkerContext.node_id") + denial = _denial_reason(command, manager_capable=manager_capable) + if denial is not None: + return f"workflow denied: {denial}" - output = execute_command( - command, - context=WorkflowCommandContext( - run_id=worker_context.run_id, - node_id=worker_context.node_id, - execution_id=worker_context.execution_id, - sandbox_task_key=sandbox_task_key, - benchmark_type=benchmark_type, - ), - session_factory=session_factory, - service=service_factory(), - ) + try: + output = await execute_command( + command, + context=WorkflowCommandContext( + run_id=worker_context.run_id, + node_id=worker_context.node_id, + execution_id=worker_context.execution_id, + sandbox_task_key=sandbox_task_key, + benchmark_type=benchmark_type, + ), + session_factory=session_factory, + service=service_factory(), + ) + except Exception as exc: # slopcop: ignore[no-broad-except] + return f"workflow failed: {type(exc).__name__}: {exc}" if output.exit_code != 0: detail = output.stderr or output.stdout return f"workflow exited {output.exit_code}: {detail}".strip() @@ -65,3 +81,16 @@ async def workflow(command: str) -> str: return output.stdout return workflow + + +def _denial_reason(command: str, *, manager_capable: bool) -> str | None: + if "\n" in command or "\r" in command: + return "multiline commands are not allowed" + try: + argv = shlex.split(command) + except ValueError as exc: + return f"could not parse command: {exc}" + if len(argv) >= 3 and argv[0] == "manage" and argv[1] in _MANAGER_ONLY_ACTIONS: + if not manager_capable: + return f"{argv[1]} requires a manager-capable workflow tool" + return None diff --git 
a/ergon_builtins/ergon_builtins/workers/baselines/training_stub_worker.py b/ergon_builtins/ergon_builtins/workers/baselines/training_stub_worker.py index 459351c8..37ec781e 100644 --- a/ergon_builtins/ergon_builtins/workers/baselines/training_stub_worker.py +++ b/ergon_builtins/ergon_builtins/workers/baselines/training_stub_worker.py @@ -21,11 +21,11 @@ ModelResponsePart, TextPart, ThinkingPart, - TokenLogprob, ToolCallPart, ToolReturnPart, UserPromptPart, ) +from ergon_core.core.providers.generation.types import TokenLogprob class TrainingStubWorker(Worker): diff --git a/ergon_builtins/ergon_builtins/workers/research_rubrics/workflow_cli_react_worker.py b/ergon_builtins/ergon_builtins/workers/research_rubrics/workflow_cli_react_worker.py index 302a9bde..37259000 100644 --- a/ergon_builtins/ergon_builtins/workers/research_rubrics/workflow_cli_react_worker.py +++ b/ergon_builtins/ergon_builtins/workers/research_rubrics/workflow_cli_react_worker.py @@ -45,14 +45,39 @@ "- workflow: Inspect current-run task topology and resources\n\n" "Write your final report to 'final_output/report.md' using write_report_draft. " "Include a # Findings section and a ## Sources section with citations.\n\n" + "Hard operating budget: use at most 6 exa_search calls for your own work. " + "After that, write the report from the evidence you have. Prefer targeted " + "queries over broad exploration.\n\n" "Use workflow(command) to inspect this run before " "deciding what context is missing. Useful commands include: " - "`inspect task-tree`, `inspect resource-list --scope input`, " + "`inspect task-workspace --format json`, `inspect task-tree`, " + "`inspect resource-list --scope input`, " "`inspect resource-list --scope visible --limit 20`, " + "`inspect resource-location --resource-id `, " "`inspect next-actions`, and " "`manage materialize-resource --resource-id --dry-run`. " "Use `--format json` when you need stable IDs. 
Resource copies are snapshots: " - "materialized files become resources owned by this task, not edits to the source." + "materialized files become resources owned by this task, not edits to the source.\n\n" + "First call `workflow(\"inspect task-workspace --format json\")`. Use only " + "`task_workspace.task.level` from that response to decide whether this current " + "task may delegate. Ignore level-0 tasks shown elsewhere in task-tree. If " + "`task_workspace.task.level` is exactly 0, create exactly three specialist " + "child tasks before researching: " + "(1) a source scout for finding citations, " + "(2) a rubric compliance checker for mapping requirements to an outline, and " + "(3) a synthesis reviewer for risks, gaps, and counterclaims. " + "Use `workflow(\"manage add-task --task-slug --worker worker " + "--description ''\")` for each child. " + "Give each child a role-specific description that includes the original task " + "goal and asks for a concise markdown report in `final_output/report.md`. " + "Then continue your own report; do not wait for child results unless visible " + "resources are already available.\n\n" + "If your current `task_workspace.task.level` is not 0, you are already a " + "specialist child. You must do only your assigned specialist work; do not call " + "`workflow(\"manage add-task` under any " + "circumstances. Do not inspect the workflow repeatedly. Use at most 2 " + "workflow inspections and at most 3 exa_search calls, then write your " + "specialist markdown report to `final_output/report.md`." 
) @@ -121,6 +146,7 @@ async def publisher_sync() -> list[RunResourceView]: worker_context=context, sandbox_task_key=self.task_id, benchmark_type="researchrubrics", + manager_capable=True, ) self.tools = [*rr_toolkit.build_tools(), *graph_toolkit.build_tools(), workflow_tool] diff --git a/ergon_cli/ergon_cli/commands/workflow.py b/ergon_cli/ergon_cli/commands/workflow.py index 27a32d9f..21ec9559 100644 --- a/ergon_cli/ergon_cli/commands/workflow.py +++ b/ergon_cli/ergon_cli/commands/workflow.py @@ -8,6 +8,7 @@ from ergon_core.api.json_types import JsonObject from ergon_core.core.persistence.shared.db import get_session from ergon_core.core.runtime.services.workflow_service import WorkflowService +from ergon_core.core.runtime.services.workflow_dto import WorkflowMutationRef from pydantic import BaseModel from sqlmodel import Session from collections.abc import Callable @@ -59,10 +60,17 @@ def build_workflow_parser() -> argparse.ArgumentParser: resource_content.add_argument("--max-bytes", type=int, default=100_000) resource_content.add_argument("--format", choices=["text", "json"], default="text") + resource_location = inspect_sub.add_parser("resource-location") + resource_location.add_argument("--resource-id", required=True) + resource_location.add_argument("--format", choices=["text", "json"], default="text") + task_tree = inspect_sub.add_parser("task-tree") task_tree.add_argument("--format", choices=["text", "json"], default="text") task_tree.add_argument("--parent-node-id", default=None) + task_workspace = inspect_sub.add_parser("task-workspace") + task_workspace.add_argument("--format", choices=["text", "json"], default="text") + dependencies = inspect_sub.add_parser("task-dependencies") dependencies.add_argument( "--direction", choices=["upstream", "downstream", "both"], default="both" @@ -81,8 +89,32 @@ def build_workflow_parser() -> argparse.ArgumentParser: materialize.add_argument("--dry-run", action="store_true") materialize.add_argument("--format", 
choices=["text", "json"], default="text") - for action in ("add-task", "add-edge", "restart-task", "abandon-task"): + add_task = manage_sub.add_parser("add-task") + add_task.add_argument("--task-slug", required=True) + add_task.add_argument("--description", required=True) + add_task.add_argument("--worker", required=True) + add_task.add_argument("--parent-node-id", default=None) + add_task.add_argument("--dry-run", action="store_true") + add_task.add_argument("--format", choices=["text", "json"], default="text") + add_task.add_argument("--reason", default=None) + + add_edge = manage_sub.add_parser("add-edge") + add_edge.add_argument("--source-task-slug", required=True) + add_edge.add_argument("--target-task-slug", required=True) + add_edge.add_argument("--dry-run", action="store_true") + add_edge.add_argument("--format", choices=["text", "json"], default="text") + add_edge.add_argument("--reason", default=None) + + update_description = manage_sub.add_parser("update-task-description") + update_description.add_argument("--task-slug", required=True) + update_description.add_argument("--description", required=True) + update_description.add_argument("--dry-run", action="store_true") + update_description.add_argument("--format", choices=["text", "json"], default="text") + update_description.add_argument("--reason", default=None) + + for action in ("restart-task", "abandon-task"): parser_for_action = manage_sub.add_parser(action) + parser_for_action.add_argument("--task-slug", required=True) parser_for_action.add_argument("--dry-run", action="store_true") parser_for_action.add_argument("--format", choices=["text", "json"], default="text") parser_for_action.add_argument("--reason", default=None) @@ -96,6 +128,23 @@ def execute_workflow_command( context: WorkflowCommandContext, session_factory: Callable[[], Session], service: WorkflowService, +) -> WorkflowCommandOutput: + return asyncio.run( # slopcop: ignore[no-async-from-sync] -- CLI sync bridge + 
execute_workflow_command_async( + command, + context=context, + session_factory=session_factory, + service=service, + ) + ) + + +async def execute_workflow_command_async( + command: str, + *, + context: WorkflowCommandContext, + session_factory: Callable[[], Session], + service: WorkflowService, ) -> WorkflowCommandOutput: argv = shlex.split(command) _reject_context_flags(argv) @@ -105,9 +154,7 @@ def execute_workflow_command( if args.group == "inspect": return _handle_inspect(args, context=context, session=session, service=service) if args.group == "manage": - return asyncio.run( # slopcop: ignore[no-async-from-sync] -- CLI/tool sync bridge - _handle_manage(args, context=context, session=session, service=service) - ) + return await _handle_manage(args, context=context, session=session, service=service) finally: _close_session(session) raise ValueError(f"unsupported workflow command group: {args.group}") @@ -184,6 +231,22 @@ def _handle_inspect( if args.format == "json": return _format_output({"content": content.decode(errors="replace")}, [], "json") return WorkflowCommandOutput(stdout=content.decode(errors="replace")) + if args.action == "resource-location": + location = service.get_resource_location( + session, + run_id=context.run_id, + resource_id=UUID(args.resource_id), + ) + return _format_output( + {"resource_location": _dump(location)}, + text_lines=[ + f"resource {location.resource.name}", + f"producer={location.producer_task_slug or '-'}", + f"local={location.local_file_path}", + f"default_sandbox_path={location.default_sandbox_path}", + ], + output_format=args.format, + ) if args.action == "task-tree": parent = UUID(args.parent_node_id) if args.parent_node_id else None tasks = service.list_tasks(session, run_id=context.run_id, parent_node_id=parent) @@ -195,6 +258,28 @@ def _handle_inspect( ], output_format=args.format, ) + if args.action == "task-workspace": + workspace = service.get_task_workspace( + session, + run_id=context.run_id, + 
node_id=context.node_id, + ) + lines = [ + f"task {workspace.task.task_slug} status={workspace.task.status}", + ] + if workspace.latest_execution is not None: + lines.append( + "execution " + f"{workspace.latest_execution.execution_id} " + f"status={workspace.latest_execution.status}" + ) + lines.extend(f"own: {resource.name}" for resource in workspace.own_resources) + lines.extend(f"input: {resource.name}" for resource in workspace.input_resources) + return _format_output( + {"task_workspace": _dump(workspace)}, + text_lines=lines, + output_format=args.format, + ) if args.action == "task-dependencies": deps = service.list_dependencies( session, @@ -249,18 +334,52 @@ async def _handle_manage( text_lines=[f"{result.source_resource_id} -> {result.sandbox_path}"], output_format=args.format, ) - try: - dry_run = args.dry_run - except AttributeError: - dry_run = False - if dry_run: - payload: JsonObject = { - "action": args.action, - "dry_run": True, - "message": "Graph lifecycle command validated; no changes applied.", - } - return _format_output(payload, [str(payload["message"])], args.format) - raise ValueError(f"{args.action} requires --dry-run in workflow CLI v1") + if args.action == "add-task": + result = await service.add_task( + session, + run_id=context.run_id, + parent_node_id=UUID(args.parent_node_id) if args.parent_node_id else context.node_id, + task_slug=args.task_slug, + description=args.description, + assigned_worker_slug=args.worker, + dry_run=args.dry_run, + ) + return _mutation_output(result, args.format) + if args.action == "add-edge": + result = await service.add_edge( + session, + run_id=context.run_id, + source_task_slug=args.source_task_slug, + target_task_slug=args.target_task_slug, + dry_run=args.dry_run, + ) + return _mutation_output(result, args.format) + if args.action == "update-task-description": + result = await service.update_task_description( + session, + run_id=context.run_id, + task_slug=args.task_slug, + description=args.description, 
+            dry_run=args.dry_run,
+        )
+        return _mutation_output(result, args.format)
+    if args.action == "restart-task":
+        result = await service.restart_task(
+            session,
+            run_id=context.run_id,
+            task_slug=args.task_slug,
+            dry_run=args.dry_run,
+        )
+        return _mutation_output(result, args.format)
+    if args.action == "abandon-task":
+        result = await service.abandon_task(
+            session,
+            run_id=context.run_id,
+            task_slug=args.task_slug,
+            dry_run=args.dry_run,
+        )
+        return _mutation_output(result, args.format)
+    raise ValueError(f"unsupported manage action: {args.action}")
 
 
 def _format_output(
@@ -273,6 +392,11 @@ def _format_output(
     return WorkflowCommandOutput(stdout="\n".join(text_lines))
 
 
+def _mutation_output(result: WorkflowMutationRef, output_format: str) -> WorkflowCommandOutput:
+    payload: JsonObject = {"mutation": _dump(result)}
+    return _format_output(payload, [result.message], output_format)
+
+
 def _dump(value: BaseModel | JsonObject) -> JsonObject:
     if isinstance(value, BaseModel):
         return cast(JsonObject, value.model_dump(mode="json"))
diff --git a/ergon_core/ergon_core/api/__init__.py b/ergon_core/ergon_core/api/__init__.py
index fe6ad7f0..61aa0602 100644
--- a/ergon_core/ergon_core/api/__init__.py
+++ b/ergon_core/ergon_core/api/__init__.py
@@ -1,5 +1,7 @@
 """Object-first Ergon public API surface."""
 
+from typing import TYPE_CHECKING
+
 from ergon_core.api.benchmark import Benchmark
 from ergon_core.api.benchmark_deps import BenchmarkDeps
 from ergon_core.api.criterion import Criterion
@@ -10,13 +12,15 @@
 from ergon_core.api.experiment import Experiment
 from ergon_core.api.handles import ExperimentRunHandle, PersistedExperimentDefinition
 from ergon_core.api.results import CriterionResult, TaskEvaluationResult, WorkerOutput
-from ergon_core.api.run_resource import RunResourceKind, RunResourceView
 from ergon_core.api.task_types import BenchmarkTask, EmptyTaskPayload
 from ergon_core.api.types import Tool
 from ergon_core.api.worker import Worker
 from ergon_core.api.worker_context import WorkerContext
 from ergon_core.api.worker_spec import WorkerSpec
 
+if TYPE_CHECKING:
+    from ergon_core.api.run_resource import RunResourceKind, RunResourceView
+
 __all__ = [
     "Benchmark",
     "BenchmarkDeps",
@@ -44,3 +48,13 @@
     "WorkerOutput",
     "WorkerSpec",
 ]
+
+
+def __getattr__(name: str) -> object:
+    if name in {"RunResourceKind", "RunResourceView"}:
+        from ergon_core.api.run_resource import RunResourceKind, RunResourceView
+
+        globals()["RunResourceKind"] = RunResourceKind
+        globals()["RunResourceView"] = RunResourceView
+        return globals()[name]
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
diff --git a/ergon_core/ergon_core/api/generation.py b/ergon_core/ergon_core/api/generation.py
index 449f24ed..c54f2618 100644
--- a/ergon_core/ergon_core/api/generation.py
+++ b/ergon_core/ergon_core/api/generation.py
@@ -15,19 +15,10 @@
 from typing import Annotated, Any, Literal
 
 from ergon_core.api.json_types import JsonObject
+from ergon_core.core.providers.generation.types import TokenLogprob
 from pydantic import BaseModel, Field
 
 
-class TokenLogprob(BaseModel):
-    """Per-token log probability from the serving backend."""
-
-    model_config = {"frozen": True}
-
-    token: str
-    logprob: float
-    top_logprobs: list[JsonObject] = Field(default_factory=list)
-
-
 # ---------------------------------------------------------------------------
 # Request parts (ModelRequest input — what went INTO the model)
 # ---------------------------------------------------------------------------
diff --git a/ergon_core/ergon_core/core/persistence/context/event_payloads.py b/ergon_core/ergon_core/core/persistence/context/event_payloads.py
index db19d6b1..b2f58bd7 100644
--- a/ergon_core/ergon_core/core/persistence/context/event_payloads.py
+++ b/ergon_core/ergon_core/core/persistence/context/event_payloads.py
@@ -7,7 +7,7 @@
 
 from typing import Annotated, Any, Literal
 
-from ergon_core.api.generation import TokenLogprob
+from ergon_core.core.providers.generation.types import TokenLogprob
 from pydantic import BaseModel, Field
 
 # Exported type alias — use everywhere event_type is stored as a string field.
@@ -49,7 +49,7 @@ class ToolCallPayload(BaseModel):
     args: dict[str, Any]  # slopcop: ignore[no-typing-any]
     turn_id: str  # links events from the same generation call
     turn_token_ids: list[int] | None = None  # None if another event in this turn holds them
-    turn_logprobs: list[TokenLogprob] | None = None  # None if another event in this turn holds them
+    turn_logprobs: list[TokenLogprob] | None = None
 
 
 class ToolResultPayload(BaseModel):
diff --git a/ergon_core/ergon_core/core/providers/generation/__init__.py b/ergon_core/ergon_core/core/providers/generation/__init__.py
index 585bef15..5a166577 100644
--- a/ergon_core/ergon_core/core/providers/generation/__init__.py
+++ b/ergon_core/ergon_core/core/providers/generation/__init__.py
@@ -2,8 +2,7 @@
 
 from ergon_core.core.providers.generation.model_resolution import (
     ResolvedModel,
-    register_model_backend,
     resolve_model_target,
 )
 
-__all__ = ["ResolvedModel", "register_model_backend", "resolve_model_target"]
+__all__ = ["ResolvedModel", "resolve_model_target"]
diff --git a/ergon_core/ergon_core/core/providers/generation/model_resolution.py b/ergon_core/ergon_core/core/providers/generation/model_resolution.py
index 93f312df..bdd59f18 100644
--- a/ergon_core/ergon_core/core/providers/generation/model_resolution.py
+++ b/ergon_core/ergon_core/core/providers/generation/model_resolution.py
@@ -1,20 +1,8 @@
-"""Prefix-based model target resolution.
-
-Dispatches ``model_target`` strings to the appropriate backend based on
-their prefix (``vllm:``, ``transformers:``, ``openai:``, etc.).
-
-Concrete backend implementations live in ``ergon_builtins.models``.
-This module owns the contract (``ResolvedModel``) and the dispatch logic.
-"""
-
-import logging
-from typing import Callable
+"""Prefix-based model target resolution."""
 
 import pydantic_ai.models
 from pydantic import BaseModel
 
-logger = logging.getLogger(__name__)
-
 
 class ResolvedModel(BaseModel):
     """A resolved model target with backend metadata.
@@ -32,16 +20,6 @@ class ResolvedModel(BaseModel):
     supports_logprobs: bool = False
 
 
-# Backend resolver registry: prefix -> callable
-# Populated by ergon_builtins.registry at import time.
-_BACKEND_REGISTRY: dict[str, Callable[..., ResolvedModel]] = {}
-
-
-def register_model_backend(prefix: str, resolver: Callable[..., ResolvedModel]) -> None:
-    """Register a model backend resolver for a given prefix."""
-    _BACKEND_REGISTRY[prefix] = resolver
-
-
 def resolve_model_target(
     model_target: str | None,
     *,
@@ -51,20 +29,42 @@ def resolve_model_target(
 ) -> ResolvedModel:
     """Resolve a ``model_target`` string to a PydanticAI-compatible model.
 
-    Dispatches by prefix to registered backends. Unrecognised prefixes
-    are passed through to PydanticAI's ``infer_model``.
+    Cloud provider targets (``openai:*``, ``anthropic:*``, ``google:*``)
+    intentionally resolve to OpenRouter-hosted models. Direct cloud provider
+    API access is not part of Ergon's model-target grammar.
     """
+    target = model_target or "openai:gpt-4o"
+    prefix = target.split(":", 1)[0] if ":" in target else ""
+
+    if prefix == "vllm":
+        from ergon_core.core.providers.generation.openai_compatible import resolve_vllm
+
+        return resolve_vllm(
+            target, model_name=model_name, policy_version=policy_version, api_key=api_key
+        )
+
+    if prefix == "openai-compatible":
+        from ergon_core.core.providers.generation.openai_compatible import (
+            resolve_openai_compatible,
+        )
+
+        return resolve_openai_compatible(
+            target, model_name=model_name, policy_version=policy_version, api_key=api_key
+        )
+
+    if prefix in {"openai", "anthropic", "google"}:
+        from ergon_core.core.providers.generation.openrouter import resolve_cloud_via_openrouter
+
+        return resolve_cloud_via_openrouter(
+            target, model_name=model_name, policy_version=policy_version, api_key=api_key
+        )
 
-    prefix = target.split(":")[0] if ":" in target else ""
+    if prefix == "openrouter":
+        from ergon_core.core.providers.generation.openrouter import resolve_openrouter_alias
 
-    resolver = _BACKEND_REGISTRY.get(prefix)
-    if resolver is not None:
-        return resolver(
-            target,
-            model_name=model_name,
-            policy_version=policy_version,
-            api_key=api_key,
+        return resolve_openrouter_alias(
+            target, model_name=model_name, policy_version=policy_version, api_key=api_key
         )
 
-    return ResolvedModel(model=target, supports_logprobs=False)
+    raise ValueError(f"Unsupported model target: {target!r}")
diff --git a/ergon_core/ergon_core/core/providers/generation/openai_compatible.py b/ergon_core/ergon_core/core/providers/generation/openai_compatible.py
new file mode 100644
index 00000000..8a4b8727
--- /dev/null
+++ b/ergon_core/ergon_core/core/providers/generation/openai_compatible.py
@@ -0,0 +1,100 @@
+"""OpenAI-compatible endpoint resolution for local and custom model servers."""
+
+import json
+import logging
+import urllib.error
+import urllib.request
+
+from pydantic_ai.models.openai import OpenAIChatModel
+from pydantic_ai.providers.openai import OpenAIProvider
+
+from ergon_core.core.providers.generation.model_resolution import ResolvedModel
+
+logger = logging.getLogger(__name__)
+
+
+def resolve_openai_compatible(
+    target: str,
+    *,
+    model_name: str | None = None,
+    policy_version: str | None = None,
+    api_key: str | None = None,
+) -> ResolvedModel:
+    """Resolve ``openai-compatible:<base-url>#<model-name>`` targets."""
+
+    base_url, parsed_model_name = _split_endpoint_target(
+        target,
+        prefix="openai-compatible:",
+        require_model_name=True,
+    )
+    resolved_name = model_name or parsed_model_name
+    if resolved_name is None:
+        raise ValueError("openai-compatible target requires a model name")
+    provider = OpenAIProvider(base_url=base_url.rstrip("/"), api_key=api_key or "not-needed")
+    model = OpenAIChatModel(model_name=resolved_name, provider=provider)
+    return ResolvedModel(model=model, policy_version=policy_version, supports_logprobs=False)
+
+
+def resolve_vllm(
+    target: str,
+    *,
+    model_name: str | None = None,
+    policy_version: str | None = None,
+    api_key: str | None = None,
+) -> ResolvedModel:
+    """Resolve ``vllm:<endpoint>[#<model-name>]`` targets."""
+
+    endpoint, parsed_model_name = _split_endpoint_target(
+        target,
+        prefix="vllm:",
+        require_model_name=False,
+    )
+    endpoint = endpoint.rstrip("/")
+    resolved_name = model_name or parsed_model_name or _discover_model_name(endpoint)
+    provider = OpenAIProvider(base_url=f"{endpoint}/v1", api_key=api_key or "not-needed")
+    model = OpenAIChatModel(model_name=resolved_name, provider=provider)
+    logger.info(
+        "Resolved vLLM model: endpoint=%s model_name=%s policy_version=%s",
+        endpoint,
+        resolved_name,
+        policy_version,
+    )
+    return ResolvedModel(model=model, policy_version=policy_version, supports_logprobs=True)
+
+
+def _split_endpoint_target(
+    target: str,
+    *,
+    prefix: str,
+    require_model_name: bool,
+) -> tuple[str, str | None]:
+    body = target.removeprefix(prefix)
+    endpoint, separator, model_name = body.partition("#")
+    if not endpoint:
+        raise ValueError(f"{prefix} target requires a base URL")
+    if require_model_name and not (separator and model_name):
+        raise ValueError(f"{prefix}<base-url>#<model-name> target requires a model name")
+    return endpoint, model_name or None
+
+
+def _discover_model_name(endpoint: str) -> str:
+    """Query ``/v1/models`` to discover the served model name."""
+
+    url = f"{endpoint}/v1/models"
+    try:
+        with urllib.request.urlopen(url, timeout=5) as resp:
+            body = json.loads(resp.read())
+            models = body.get("data", [])
+            if models:
+                name = models[0].get("id", "default")
+                logger.info("Discovered vLLM model name: %s", name)
+                return name
+    except (
+        urllib.error.HTTPError,
+        urllib.error.URLError,
+        TimeoutError,
+        OSError,
+        json.JSONDecodeError,
+    ):
+        logger.warning("Could not discover vLLM model name from %s, using 'default'", url)
+    return "default"
diff --git a/ergon_core/ergon_core/core/providers/generation/openrouter.py b/ergon_core/ergon_core/core/providers/generation/openrouter.py
new file mode 100644
index 00000000..34f8946f
--- /dev/null
+++ b/ergon_core/ergon_core/core/providers/generation/openrouter.py
@@ -0,0 +1,55 @@
+"""OpenRouter-hosted cloud model resolution."""
+
+from pydantic_ai.models.openai import OpenAIChatModel
+from pydantic_ai.providers.openrouter import OpenRouterProvider
+
+from ergon_core.core.providers.generation.model_resolution import ResolvedModel
+from ergon_core.core.settings import settings
+
+CLOUD_PROVIDER_PREFIXES = frozenset({"openai", "anthropic", "google"})
+
+
+def resolve_cloud_via_openrouter(
+    target: str,
+    *,
+    model_name: str | None = None,
+    policy_version: str | None = None,
+    api_key: str | None = None,
+) -> ResolvedModel:
+    """Resolve ``openai:*``, ``anthropic:*``, and ``google:*`` through OpenRouter."""
+
+    provider_prefix, separator, provider_model_name = target.partition(":")
+    if not separator or not provider_model_name:
+        raise ValueError(f"Unsupported model target: {target!r}")
+    if provider_prefix not in CLOUD_PROVIDER_PREFIXES:
+        raise ValueError(f"Unsupported cloud provider target: {target!r}")
+
+    openrouter_model_name = model_name or f"{provider_prefix}/{provider_model_name}"
+    provider = _openrouter_provider(api_key)
+    model = OpenAIChatModel(model_name=openrouter_model_name, provider=provider)
+    return ResolvedModel(model=model, policy_version=policy_version, supports_logprobs=False)
+
+
+def resolve_openrouter_alias(
+    target: str,
+    *,
+    model_name: str | None = None,
+    policy_version: str | None = None,
+    api_key: str | None = None,
+) -> ResolvedModel:
+    """Resolve legacy ``openrouter:<provider>/<model-name>`` targets through OpenRouter."""
+
+    provider_model_name = target.removeprefix("openrouter:")
+    if not provider_model_name:
+        raise ValueError("openrouter:<provider>/<model-name> target requires a model name")
+
+    provider = _openrouter_provider(api_key)
+    model = OpenAIChatModel(model_name=model_name or provider_model_name, provider=provider)
+    return ResolvedModel(model=model, policy_version=policy_version, supports_logprobs=False)
+
+
+def _openrouter_provider(api_key: str | None) -> OpenRouterProvider:
+    resolved_api_key = api_key or settings.openrouter_api_key
+    if resolved_api_key:
+        return OpenRouterProvider(api_key=resolved_api_key)
+    return OpenRouterProvider()
diff --git a/ergon_core/ergon_core/core/providers/generation/pydantic_ai_format.py b/ergon_core/ergon_core/core/providers/generation/pydantic_ai_format.py
index 243dc2b7..744440b6 100644
--- a/ergon_core/ergon_core/core/providers/generation/pydantic_ai_format.py
+++ b/ergon_core/ergon_core/core/providers/generation/pydantic_ai_format.py
@@ -15,8 +15,8 @@
 rather than re-implementing the parsing.
 """
 
-from ergon_core.api.generation import TokenLogprob
 from ergon_core.api.json_types import JsonObject
+from ergon_core.core.providers.generation.types import TokenLogprob
 
 
 def extract_logprobs(
diff --git a/ergon_core/ergon_core/core/providers/generation/types.py b/ergon_core/ergon_core/core/providers/generation/types.py
new file mode 100644
index 00000000..cf206095
--- /dev/null
+++ b/ergon_core/ergon_core/core/providers/generation/types.py
@@ -0,0 +1,17 @@
+"""Shared generation provider value types."""
+
+from pydantic import BaseModel, Field
+
+type JsonScalar = str | int | float | bool | None
+type JsonValue = JsonScalar | list[JsonValue] | dict[str, JsonValue]
+type JsonObject = dict[str, JsonValue]
+
+
+class TokenLogprob(BaseModel):
+    """Per-token log probability from the serving backend."""
+
+    model_config = {"frozen": True}
+
+    token: str
+    logprob: float
+    top_logprobs: list[JsonObject] = Field(default_factory=list)
diff --git a/ergon_core/ergon_core/core/runtime/services/workflow_dto.py b/ergon_core/ergon_core/core/runtime/services/workflow_dto.py
index d5b45aaa..33dd2ac3 100644
--- a/ergon_core/ergon_core/core/runtime/services/workflow_dto.py
+++ b/ergon_core/ergon_core/core/runtime/services/workflow_dto.py
@@ -13,6 +13,7 @@ class WorkflowTaskRef(BaseModel):
     level: int
     parent_node_id: UUID | None = None
     assigned_worker_slug: str | None = None
+    description: str | None = None
 
 
 class WorkflowExecutionRef(BaseModel):
@@ -82,3 +83,32 @@ class WorkflowMaterializedResourceRef(BaseModel):
     sandbox_path: str
     dry_run: bool = False
     source_mutated: bool = False
+
+
+class WorkflowMutationRef(BaseModel):
+    model_config = {"frozen": True}
+
+    action: str
+    dry_run: bool
+    node: WorkflowTaskRef | None = None
+    edge: WorkflowDependencyRef | None = None
+    message: str
+    suggested_commands: list[str] = Field(default_factory=list)
+
+
+class WorkflowResourceLocationRef(BaseModel):
+    model_config = {"frozen": True}
+
+    resource: WorkflowResourceRef
+    producer_task_slug: str | None = None
+    local_file_path: str
+    default_sandbox_path: str
+
+
+class WorkflowTaskWorkspaceRef(BaseModel):
+    model_config = {"frozen": True}
+
+    task: WorkflowTaskRef
+    latest_execution: WorkflowExecutionRef | None = None
+    own_resources: list[WorkflowResourceRef] = Field(default_factory=list)
+    input_resources: list[WorkflowResourceRef] = Field(default_factory=list)
diff --git a/ergon_core/ergon_core/core/runtime/services/workflow_service.py b/ergon_core/ergon_core/core/runtime/services/workflow_service.py
index a9aaff6b..cbe0a6ca 100644
--- a/ergon_core/ergon_core/core/runtime/services/workflow_service.py
+++ b/ergon_core/ergon_core/core/runtime/services/workflow_service.py
@@ -1,23 +1,33 @@
-from collections.abc import Callable
+from collections.abc import Awaitable, Callable
 from pathlib import PurePosixPath
 from typing import Literal
-from uuid import UUID
+from uuid import UUID, uuid4
 
+import inngest
 from ergon_core.core.persistence.graph.models import RunGraphEdge, RunGraphNode
 from ergon_core.core.persistence.shared.enums import TaskExecutionStatus
 from ergon_core.core.persistence.telemetry.models import (
+    RunRecord,
     RunResource,
     RunResourceKind,
     RunTaskExecution,
 )
 from ergon_core.core.providers.sandbox.manager import BaseSandboxManager, DefaultSandboxManager
+from ergon_core.core.runtime.events.task_events import TaskReadyEvent
+from ergon_core.core.runtime.inngest_client import inngest_client
+from ergon_core.core.runtime.services.graph_dto import GraphEdgeDto, GraphNodeDto, MutationMeta
+from ergon_core.core.runtime.services.graph_repository import WorkflowGraphRepository
 from ergon_core.core.runtime.services.workflow_dto import (
     WorkflowBlockerRef,
     WorkflowDependencyRef,
+    WorkflowExecutionRef,
     WorkflowMaterializedResourceRef,
+    WorkflowMutationRef,
     WorkflowNextActionRef,
+    WorkflowResourceLocationRef,
     WorkflowResourceRef,
     WorkflowTaskRef,
+    WorkflowTaskWorkspaceRef,
 )
 from sqlmodel import Session, col, select
@@ -37,8 +47,12 @@ def __init__(
         self,
         *,
         sandbox_manager_factory: Callable[[str], BaseSandboxManager] | None = None,
+        graph_repository: WorkflowGraphRepository | None = None,
+        task_ready_dispatcher: Callable[[UUID, UUID, UUID], Awaitable[None]] | None = None,
     ) -> None:
         self._sandbox_manager_factory = sandbox_manager_factory or self._sandbox_manager_for
+        self._graph_repo = graph_repository or WorkflowGraphRepository()
+        self._task_ready_dispatcher = task_ready_dispatcher or self._dispatch_task_ready
 
     def list_tasks(
         self,
@@ -156,6 +170,60 @@ def read_resource_bytes(
         with open(resource.file_path, "rb") as handle:
             return handle.read(max_bytes)
 
+    def get_resource_location(
+        self,
+        session: Session,
+        *,
+        run_id: UUID,
+        resource_id: UUID,
+    ) -> WorkflowResourceLocationRef:
+        resource = self._resource_in_run(session, run_id=run_id, resource_id=resource_id)
+        producer = self._producer_node_for_resource(session, resource)
+        copied_name = self._copy_name(resource.name)
+        default_path = self._sandbox_destination(
+            destination=None,
+            producer_slug=producer.task_slug if producer is not None else "unknown",
+            copied_name=copied_name,
+        )
+        return WorkflowResourceLocationRef(
+            resource=self._resource_ref(session, resource),
+            producer_task_slug=producer.task_slug if producer is not None else None,
+            local_file_path=resource.file_path,
+            default_sandbox_path=default_path,
+        )
+
+    def get_task_workspace(
+        self,
+        session: Session,
+        *,
+        run_id: UUID,
+        node_id: UUID,
+    ) -> WorkflowTaskWorkspaceRef:
+        node = self._resolve_node(session, run_id=run_id, node_id=node_id, task_slug=None)
+        latest = self.get_latest_execution(session, node_id=node_id)
+        own_resources: list[WorkflowResourceRef] = []
+        if latest is not None:
+            own_rows = list(
+                session.exec(
+                    select(RunResource)
+                    .where(RunResource.run_id == run_id)
+                    .where(RunResource.task_execution_id == latest.id),
+                ).all(),
+            )
+            own_rows.sort(key=lambda resource: (resource.created_at, resource.id), reverse=True)
+            own_resources = [self._resource_ref(session, resource) for resource in own_rows]
+        return WorkflowTaskWorkspaceRef(
+            task=self._task_ref(node),
+            latest_execution=self._execution_ref(latest) if latest is not None else None,
+            own_resources=own_resources,
+            input_resources=self.list_resources(
+                session,
+                run_id=run_id,
+                node_id=node_id,
+                scope="input",
+            ),
+        )
+
     def get_task_blockers(
         self,
         session: Session,
@@ -203,6 +271,215 @@ def get_next_actions(
             )
         ]
 
+    async def add_task(
+        self,
+        session: Session,
+        *,
+        run_id: UUID,
+        parent_node_id: UUID,
+        task_slug: str,
+        description: str,
+        assigned_worker_slug: str,
+        dry_run: bool,
+    ) -> WorkflowMutationRef:
+        parent = self._resolve_node(
+            session,
+            run_id=run_id,
+            node_id=parent_node_id,
+            task_slug=None,
+        )
+        node_ref = WorkflowTaskRef(
+            node_id=uuid4(),
+            task_slug=task_slug,
+            status=TaskExecutionStatus.PENDING.value,
+            level=parent.level + 1,
+            parent_node_id=parent.id,
+            assigned_worker_slug=assigned_worker_slug,
+            description=description,
+        )
+        if dry_run:
+            return WorkflowMutationRef(
+                action="add-task",
+                dry_run=True,
+                node=node_ref,
+                message=f"Would add task {task_slug}",
+            )
+
+        created = await self._graph_repo.add_node(
+            session,
+            run_id,
+            task_slug=task_slug,
+            instance_key=parent.instance_key,
+            description=description,
+            status=TaskExecutionStatus.PENDING.value,
+            assigned_worker_slug=assigned_worker_slug,
+            parent_node_id=parent.id,
+            level=parent.level + 1,
+            meta=self._meta("add-task"),
+        )
+        session.commit()
+        definition_id = self._resolve_definition_id(session, run_id)
+        await self._task_ready_dispatcher(run_id, definition_id, created.id)
+        return WorkflowMutationRef(
+            action="add-task",
+            dry_run=False,
+            node=self._task_ref_from_graph(created),
+            message=f"Added task {task_slug}",
+        )
+
+    async def add_edge(
+        self,
+        session: Session,
+        *,
+        run_id: UUID,
+        source_task_slug: str,
+        target_task_slug: str,
+        dry_run: bool,
+    ) -> WorkflowMutationRef:
+        source = self._resolve_node(
+            session,
+            run_id=run_id,
+            node_id=None,
+            task_slug=source_task_slug,
+        )
+        target = self._resolve_node(
+            session,
+            run_id=run_id,
+            node_id=None,
+            task_slug=target_task_slug,
+        )
+        edge_ref = WorkflowDependencyRef(
+            edge_id=uuid4(),
+            edge_status="pending",
+            source=self._task_ref(source),
+            target=self._task_ref(target),
+        )
+        if dry_run:
+            return WorkflowMutationRef(
+                action="add-edge",
+                dry_run=True,
+                edge=edge_ref,
+                message=f"Would add dependency {source_task_slug} -> {target_task_slug}",
+            )
+
+        created = await self._graph_repo.add_edge(
+            session,
+            run_id,
+            source_node_id=source.id,
+            target_node_id=target.id,
+            status="pending",
+            meta=self._meta("add-edge"),
+        )
+        session.commit()
+        return WorkflowMutationRef(
+            action="add-edge",
+            dry_run=False,
+            edge=self._dependency_ref_from_graph(session, run_id, created),
+            message=f"Added dependency {source_task_slug} -> {target_task_slug}",
+        )
+
+    async def update_task_description(
+        self,
+        session: Session,
+        *,
+        run_id: UUID,
+        task_slug: str,
+        description: str,
+        dry_run: bool,
+    ) -> WorkflowMutationRef:
+        node = self._resolve_node(session, run_id=run_id, node_id=None, task_slug=task_slug)
+        if dry_run:
+            return WorkflowMutationRef(
+                action="update-task-description",
+                dry_run=True,
+                node=self._task_ref(node).model_copy(update={"description": description}),
+                message=f"Would update description for {task_slug}",
+            )
+
+        updated = await self._graph_repo.update_node_field(
+            session,
+            run_id=run_id,
+            node_id=node.id,
+            field="description",
+            value=description,
+            meta=self._meta("update-task-description"),
+        )
+        session.commit()
+        return WorkflowMutationRef(
+            action="update-task-description",
+            dry_run=False,
+            node=self._task_ref_from_graph(updated),
+            message=f"Updated description for {task_slug}",
+        )
+
+    async def restart_task(
+        self,
+        session: Session,
+        *,
+        run_id: UUID,
+        task_slug: str,
+        dry_run: bool,
+    ) -> WorkflowMutationRef:
+        return await self._set_task_status(
+            session,
+            run_id=run_id,
+            task_slug=task_slug,
+            action="restart-task",
+            status=TaskExecutionStatus.PENDING.value,
+            dry_run=dry_run,
+        )
+
+    async def abandon_task(
+        self,
+        session: Session,
+        *,
+        run_id: UUID,
+        task_slug: str,
+        dry_run: bool,
+    ) -> WorkflowMutationRef:
+        return await self._set_task_status(
+            session,
+            run_id=run_id,
+            task_slug=task_slug,
+            action="abandon-task",
+            status=TaskExecutionStatus.CANCELLED.value,
+            dry_run=dry_run,
+        )
+
+    async def _set_task_status(
+        self,
+        session: Session,
+        *,
+        run_id: UUID,
+        task_slug: str,
+        action: str,
+        status: str,
+        dry_run: bool,
+    ) -> WorkflowMutationRef:
+        node = self._resolve_node(session, run_id=run_id, node_id=None, task_slug=task_slug)
+        if dry_run:
+            return WorkflowMutationRef(
+                action=action,
+                dry_run=True,
+                node=self._task_ref(node).model_copy(update={"status": status}),
+                message=f"Would set {task_slug} to {status}",
+            )
+        await self._graph_repo.update_node_status(
+            session,
+            run_id=run_id,
+            node_id=node.id,
+            new_status=status,
+            meta=self._meta(action),
+        )
+        session.commit()
+        refreshed = self._resolve_node(session, run_id=run_id, node_id=None, task_slug=task_slug)
+        return WorkflowMutationRef(
+            action=action,
+            dry_run=False,
+            node=self._task_ref(refreshed),
+            message=f"Set {task_slug} to {status}",
+        )
+
     async def materialize_resource(  # slopcop: ignore[max-function-params] -- mirrors CLI scope fields
         self,
         session: Session,
@@ -278,6 +555,57 @@ def _task_ref(node: RunGraphNode) -> WorkflowTaskRef:
             level=node.level,
             parent_node_id=node.parent_node_id,
             assigned_worker_slug=node.assigned_worker_slug,
+            description=node.description,
         )
+
+    @staticmethod
+    def _task_ref_from_graph(node: GraphNodeDto) -> WorkflowTaskRef:
+        return WorkflowTaskRef(
+            node_id=node.id,
+            task_slug=node.task_slug,
+            status=node.status,
+            level=node.level,
+            parent_node_id=node.parent_node_id,
+            assigned_worker_slug=node.assigned_worker_slug,
+            description=node.description,
+        )
+
+    def _dependency_ref_from_graph(
+        self,
+        session: Session,
+        run_id: UUID,
+        edge: GraphEdgeDto,
+    ) -> WorkflowDependencyRef:
+        nodes = self._nodes_by_id(session, run_id)
+        return WorkflowDependencyRef(
+            edge_id=edge.id,
+            edge_status=edge.status,
+            source=self._task_ref(nodes[edge.source_node_id]),
+            target=self._task_ref(nodes[edge.target_node_id]),
+        )
+
+    @staticmethod
+    def _meta(action: str) -> MutationMeta:
+        return MutationMeta(actor="workflow-cli", reason=action)
+
+    def _resolve_definition_id(self, session: Session, run_id: UUID) -> UUID:
+        run = session.get(RunRecord, run_id)
+        if run is None:
+            raise ValueError(f"run {run_id} not found")
+        return run.experiment_definition_id
+
+    async def _dispatch_task_ready(self, run_id: UUID, definition_id: UUID, node_id: UUID) -> None:
+        event = TaskReadyEvent(
+            run_id=run_id,
+            definition_id=definition_id,
+            task_id=None,
+            node_id=node_id,
+        )
+        await inngest_client.send(
+            inngest.Event(
+                name=TaskReadyEvent.name,
+                data=event.model_dump(mode="json"),
+            )
+        )
 
     def _resource_ref(self, session: Session, resource: RunResource) -> WorkflowResourceRef:
@@ -298,6 +626,15 @@ def _resource_ref(self, session: Session, resource: RunResource) -> WorkflowReso
             created_at=resource.created_at,
         )
 
+    @staticmethod
+    def _execution_ref(execution: RunTaskExecution) -> WorkflowExecutionRef:
+        return WorkflowExecutionRef(
+            execution_id=execution.id,
+            status=execution.status,
+            attempt_number=execution.attempt_number,
+            final_assistant_message=execution.final_assistant_message,
+        )
+
     @staticmethod
     def _resource_in_run(session: Session, *, run_id: UUID, resource_id: UUID) -> RunResource:
         resource = session.get(RunResource, resource_id)
diff --git a/ergon_core/pyproject.toml b/ergon_core/pyproject.toml
index 19b4002a..107b52fd 100644
--- a/ergon_core/pyproject.toml
+++ b/ergon_core/pyproject.toml
@@ -15,7 +15,7 @@ dependencies = [
     "uvicorn>=0.24.0",
    "e2b-code-interpreter",
    "openai",
-    "pydantic-ai",
+    "pydantic-ai>=0.8.1",
    "litellm",
    "opentelemetry-api",
    "opentelemetry-sdk",
diff --git a/tests/real_llm/benchmarks/test_researchrubrics.py b/tests/real_llm/benchmarks/test_researchrubrics.py
index f8a842b1..c535853a 100644
--- a/tests/real_llm/benchmarks/test_researchrubrics.py
+++ b/tests/real_llm/benchmarks/test_researchrubrics.py
@@ -37,6 +37,7 @@
     RunRecord,
     RunResource,
     RunTaskEvaluation,
+    RunTaskExecution,
 )
 from ergon_core.core.providers.generation.openrouter_budget import OpenRouterBudget
 from ergon_core.core.settings import settings
@@ -52,9 +53,9 @@
 pytestmark = [pytest.mark.real_llm, pytest.mark.asyncio]
 
-# Default to Sonnet 4.6 via OpenRouter. Override with ERGON_REAL_LLM_MODEL
-# to roll out against a different model without editing the test.
-_DEFAULT_MODEL = "openrouter:anthropic/claude-sonnet-4.6"
+# Cloud provider prefixes resolve through OpenRouter. Override with
+# ERGON_REAL_LLM_MODEL to roll out against a different model.
+_DEFAULT_MODEL = "anthropic:claude-sonnet-4.6"
 
 # Wall-clock caps. Real-LLM + real-sandbox rollouts are slow; keep
 # these generous enough to absorb E2B startup + Exa retries but bounded
@@ -99,6 +100,7 @@ def _wait_for_post_terminal_artifacts(run_id: UUID) -> None:
     deadline = time.monotonic() + _POST_TERMINAL_ARTIFACT_TIMEOUT_SECONDS
     while time.monotonic() < deadline:
         with get_session() as session:
+            run = session.get(RunRecord, run_id)
             resources = len(
                 list(session.exec(select(RunResource).where(RunResource.run_id == run_id)).all())
             )
@@ -109,8 +111,22 @@ def _wait_for_post_terminal_artifacts(run_id: UUID) -> None:
                     ).all()
                 )
             )
+            executions = list(
+                session.exec(select(RunTaskExecution).where(RunTaskExecution.run_id == run_id)).all()
+            )
             if resources > 0 and evaluations > 0:
                 return
+            run_status = str(getattr(run.status, "value", run.status)).lower() if run else ""
+            running_executions = {
+                "pending",
+                "running",
+                "executing",
+            }
+            if run_status in {"failed", "cancelled"} and not any(
+                str(getattr(execution.status, "value", execution.status)).lower() in running_executions
+                for execution in executions
+            ):
+                return
         time.sleep(2)
@@ -128,7 +144,7 @@ async def test_researchrubrics_rollout(
     state inside the time budget.
     """
     model = os.environ.get("ERGON_REAL_LLM_MODEL", _DEFAULT_MODEL)
-    benchmark = "researchrubrics"
+    benchmark = os.environ.get("ERGON_REAL_LLM_BENCHMARK", "researchrubrics")
     worker = os.environ.get("ERGON_REAL_LLM_WORKER", "researchrubrics-researcher")
     evaluator = "research-rubric"
     limit = os.environ.get("ERGON_REAL_LLM_LIMIT", "1")
diff --git a/tests/unit/cli/test_workflow_cli.py b/tests/unit/cli/test_workflow_cli.py
index 4c587413..1c6cf652 100644
--- a/tests/unit/cli/test_workflow_cli.py
+++ b/tests/unit/cli/test_workflow_cli.py
@@ -1,11 +1,18 @@
 import json
-from dataclasses import dataclass
 from datetime import UTC, datetime
 from uuid import uuid4
 
 import pytest
 from ergon_cli.commands.workflow import WorkflowCommandContext, execute_workflow_command
-from ergon_core.core.runtime.services.workflow_dto import WorkflowResourceRef
+from ergon_core.core.runtime.services.workflow_dto import (
+    WorkflowExecutionRef,
+    WorkflowMutationRef,
+    WorkflowResourceLocationRef,
+    WorkflowResourceRef,
+    WorkflowTaskRef,
+    WorkflowTaskWorkspaceRef,
+)
+from pydantic import BaseModel
 
 
 class _Session:
@@ -13,12 +20,14 @@ def close(self) -> None:
         pass
 
 
-@dataclass
-class _Service:
-    resource: WorkflowResourceRef
+class _Service(BaseModel):
+    model_config = {"arbitrary_types_allowed": True}
+
+    resource: WorkflowResourceRef | None
 
     def list_resources(self, session, *, run_id, node_id, scope, kind=None, max_depth=3, limit=50):
         assert isinstance(session, _Session)
+        assert self.resource is not None
         assert run_id == self.resource.run_id
         assert node_id == self.resource.node_id
         assert scope == "visible"
@@ -56,7 +65,7 @@ def test_resource_list_json_uses_injected_context() -> None:
             benchmark_type="researchrubrics",
         ),
         session_factory=_Session,
-        service=_Service(resource),
+        service=_Service(resource=resource),
     )
 
     payload = json.loads(output.stdout)
@@ -66,6 +75,188 @@
     assert payload["resources"][0]["task_slug"] == "research"
 
 
+def test_manage_add_task_json_plumbs_cli_arguments_to_service() -> None:
+    expected_run_id = uuid4()
+    expected_parent_node_id = uuid4()
+    created_node_id = uuid4()
+
+    class Service:
+        async def add_task(
+            self,
+            session,
+            *,
+            run_id,
+            parent_node_id,
+            task_slug,
+            description,
+            assigned_worker_slug,
+            dry_run,
+        ):
+            assert isinstance(session, _Session)
+            assert run_id == expected_run_id
+            assert parent_node_id == expected_parent_node_id
+            assert task_slug == "new_leaf"
+            assert description == "New leaf"
+            assert assigned_worker_slug == "researchrubrics-researcher"
+            assert dry_run is True
+            return WorkflowMutationRef(
+                action="add-task",
+                dry_run=True,
+                node=WorkflowTaskRef(
+                    node_id=created_node_id,
+                    task_slug="new_leaf",
+                    status="pending",
+                    level=2,
+                    parent_node_id=expected_parent_node_id,
+                    assigned_worker_slug="researchrubrics-researcher",
+                    description="New leaf",
+                ),
+                message="Would add task new_leaf",
+            )
+
+    output = execute_workflow_command(
+        "manage add-task --task-slug new_leaf --description 'New leaf' "
+        "--worker researchrubrics-researcher "
+        f"--parent-node-id {expected_parent_node_id} --dry-run --format json",
+        context=WorkflowCommandContext(
+            run_id=expected_run_id,
+            node_id=expected_parent_node_id,
+            execution_id=uuid4(),
+            sandbox_task_key=uuid4(),
+            benchmark_type="researchrubrics",
+        ),
+        session_factory=_Session,
+        service=Service(),
+    )
+
+    payload = json.loads(output.stdout)
+    assert payload["mutation"]["action"] == "add-task"
+    assert payload["mutation"]["node"]["task_slug"] == "new_leaf"
+    assert payload["mutation"]["dry_run"] is True
+
+
+def test_resource_location_json_uses_injected_run_scope() -> None:
+    run_id = uuid4()
+    node_id = uuid4()
+    resource_id = uuid4()
+    resource = WorkflowResourceRef(
+        resource_id=resource_id,
+        run_id=run_id,
+        task_execution_id=uuid4(),
+        node_id=node_id,
+        task_slug="producer",
+        kind="report",
+        name="paper.txt",
+        mime_type="text/plain",
+        size_bytes=12,
+        file_path="/tmp/paper.txt",
+        content_hash="sha256:abc",
+        copied_from_resource_id=None,
+        created_at=datetime(2026, 4, 26, tzinfo=UTC),
+    )
+
+    class Service:
+        def get_resource_location(self, session, *, run_id, resource_id):
+            assert isinstance(session, _Session)
+            assert run_id == resource.run_id
+            assert resource_id == resource.resource_id
+            return WorkflowResourceLocationRef(
+                resource=resource,
+                producer_task_slug="producer",
+                local_file_path="/tmp/paper.txt",
+                default_sandbox_path="/workspace/imported/producer/paper (copy).txt",
+            )
+
+    output = execute_workflow_command(
+        f"inspect resource-location --resource-id {resource_id} --format json",
+        context=WorkflowCommandContext(
+            run_id=run_id,
+            node_id=node_id,
+            execution_id=uuid4(),
+            sandbox_task_key=uuid4(),
+            benchmark_type="researchrubrics",
+        ),
+        session_factory=_Session,
+        service=Service(),
+    )
+
+    payload = json.loads(output.stdout)
+    assert payload["resource_location"]["producer_task_slug"] == "producer"
+    assert payload["resource_location"]["default_sandbox_path"].startswith("/workspace/imported")
+
+
+def test_task_workspace_text_lists_own_and_input_resources() -> None:
+    run_id = uuid4()
+    node_id = uuid4()
+    execution_id = uuid4()
+
+    class Service:
+        def get_task_workspace(self, session, *, run_id, node_id):
+            assert isinstance(session, _Session)
+            return WorkflowTaskWorkspaceRef(
+                task=WorkflowTaskRef(
+                    node_id=node_id,
+                    task_slug="current",
+                    status="running",
+                    level=1,
+                    description="Current",
+                ),
+                latest_execution=WorkflowExecutionRef(
+                    execution_id=execution_id,
+                    status="running",
+                    attempt_number=1,
+                    final_assistant_message=None,
+                ),
+                own_resources=[
+                    WorkflowResourceRef(
+                        resource_id=uuid4(),
+                        run_id=run_id,
+                        task_execution_id=execution_id,
+                        node_id=node_id,
+                        task_slug="current",
+                        kind="report",
+                        name="own.txt",
+                        mime_type="text/plain",
+                        size_bytes=3,
+                        file_path="/tmp/own.txt",
+                        created_at=datetime(2026, 4, 26, tzinfo=UTC),
+                    )
+                ],
+                input_resources=[
+                    WorkflowResourceRef(
+                        resource_id=uuid4(),
+                        run_id=run_id,
+                        task_execution_id=uuid4(),
+                        node_id=uuid4(),
+                        task_slug="upstream",
+                        kind="report",
+                        name="input.txt",
+                        mime_type="text/plain",
+                        size_bytes=5,
+                        file_path="/tmp/input.txt",
+                        created_at=datetime(2026, 4, 26, tzinfo=UTC),
+                    )
+                ],
+            )
+
+    output = execute_workflow_command(
+        "inspect task-workspace",
+        context=WorkflowCommandContext(
+            run_id=run_id,
+            node_id=node_id,
+            execution_id=execution_id,
+            sandbox_task_key=uuid4(),
+            benchmark_type="researchrubrics",
+        ),
+        session_factory=_Session,
+        service=Service(),
+    )
+
+    assert "task current status=running" in output.stdout
+    assert "own: own.txt" in output.stdout
+    assert "input: input.txt" in output.stdout
+
+
 def test_agent_command_rejects_user_supplied_context_flags() -> None:
     with pytest.raises(ValueError, match="scope/context flags are injected"):
         execute_workflow_command(
diff --git a/tests/unit/providers/test_model_resolution.py b/tests/unit/providers/test_model_resolution.py
new file mode 100644
index 00000000..6b17227f
--- /dev/null
+++ b/tests/unit/providers/test_model_resolution.py
@@ -0,0 +1,47 @@
+import pytest
+
+from ergon_core.core.providers.generation.model_resolution import resolve_model_target
+
+
+def test_cloud_provider_targets_resolve_to_openrouter_provider() -> None:
+    from pydantic_ai.models.openai import OpenAIChatModel
+    from pydantic_ai.providers.openrouter import OpenRouterProvider
+
+    resolved = resolve_model_target("openai:gpt-4o", api_key="test-openrouter-key")
+
+    assert isinstance(resolved.model, OpenAIChatModel)
+    assert isinstance(resolved.model._provider, OpenRouterProvider)
+    assert resolved.model.model_name == "openai/gpt-4o"
+    assert resolved.model.system == "openrouter"
+    assert resolved.supports_logprobs is False
+
+
+def test_anthropic_target_resolves_to_openrouter_namespace() -> None:
+    from pydantic_ai.models.openai import OpenAIChatModel
+
+    resolved = resolve_model_target(
+        "anthropic:claude-sonnet-4.6", api_key="test-openrouter-key"
+    )
+
+    assert isinstance(resolved.model, OpenAIChatModel)
+    assert resolved.model.model_name == "anthropic/claude-sonnet-4.6"
+
+
+def test_vllm_endpoint_target_resolves_to_openai_compatible_model() -> None:
+    from pydantic_ai.models.openai import OpenAIChatModel
+
+    resolved = resolve_model_target("vllm:http://localhost:8000#served-model")
+
+    assert isinstance(resolved.model, OpenAIChatModel)
+    assert resolved.model.model_name == "served-model"
+    assert resolved.supports_logprobs is True
+
+
+def test_openai_compatible_target_requires_model_name() -> None:
+    with pytest.raises(ValueError, match="model name"):
+        resolve_model_target("openai-compatible:http://localhost:11434/v1")
+
+
+def test_unknown_model_target_prefix_is_rejected() -> None:
+    with pytest.raises(ValueError, match="Unsupported model target"):
+        resolve_model_target("mystery:model")
diff --git a/tests/unit/runtime/test_import_boundaries.py b/tests/unit/runtime/test_import_boundaries.py
new file mode 100644
index 00000000..edf3245b
--- /dev/null
+++ b/tests/unit/runtime/test_import_boundaries.py
@@ -0,0 +1,25 @@
+def test_telemetry_models_import_before_run_resource_api() -> None:
+    from ergon_core.core.persistence.telemetry.models import RunResource
+
+    from ergon_core.api.run_resource import RunResourceView
+
+    assert RunResource.__tablename__ == "run_resources"
+    assert RunResourceView.__name__ == "RunResourceView"
+
+
+def test_context_models_import_without_worker_cycle() -> None:
+    from ergon_core.core.persistence.context.models import RunContextEvent
+
+    assert RunContextEvent.__tablename__ == "run_context_events"
+
+
+def test_context_event_payloads_use_shared_logprob_type_without_api_cycle() -> None:
+    from typing import get_args
+
+    from 
ergon_core.core.persistence.context.event_payloads import ToolCallPayload + from ergon_core.core.providers.generation.types import TokenLogprob + + annotation_args = get_args(ToolCallPayload.model_fields["turn_logprobs"].annotation) + list_annotation = next(arg for arg in annotation_args if get_args(arg)) + + assert get_args(list_annotation) == (TokenLogprob,) diff --git a/tests/unit/runtime/test_workflow_service.py b/tests/unit/runtime/test_workflow_service.py index 69e5a73a..c5d25bed 100644 --- a/tests/unit/runtime/test_workflow_service.py +++ b/tests/unit/runtime/test_workflow_service.py @@ -31,6 +31,7 @@ def _node( *, run_id: UUID, slug: str, + description: str | None = None, status: str = "completed", parent_node_id: UUID | None = None, level: int = 0, @@ -39,7 +40,7 @@ def _node( run_id=run_id, instance_key="instance", task_slug=slug, - description=f"Task {slug}", + description=description or f"Task {slug}", status=status, assigned_worker_slug="worker", parent_node_id=parent_node_id, @@ -47,6 +48,15 @@ def _node( ) +def _edge(*, run_id: UUID, source_node_id: UUID, target_node_id: UUID) -> RunGraphEdge: + return RunGraphEdge( + run_id=run_id, + source_node_id=source_node_id, + target_node_id=target_node_id, + status="satisfied", + ) + + def _execution( *, run_id: UUID, @@ -301,3 +311,271 @@ async def test_materialize_resource_dry_run_keeps_copy_name_for_explicit_destina assert result.sandbox_path == "/workspace/selected/paper (copy).pdf" assert result.copied_resource_id is None + + +def test_resource_location_describes_producer_and_workspace_destination(tmp_path: Path) -> None: + session = _session() + run_id = _run(session) + producer = _node(run_id=run_id, slug="producer") + session.add(producer) + session.flush() + producer_exec = _execution(run_id=run_id, node_id=producer.id) + session.add(producer_exec) + session.flush() + source = _resource( + run_id=run_id, + execution_id=producer_exec.id, + name="paper.pdf", + path=tmp_path / "paper.pdf", + 
content=b"paper", + ) + session.add(source) + session.commit() + + location = WorkflowService().get_resource_location( + session, + run_id=run_id, + resource_id=source.id, + ) + + assert location.resource.resource_id == source.id + assert location.producer_task_slug == "producer" + assert location.default_sandbox_path == "/workspace/imported/producer/paper (copy).pdf" + assert location.local_file_path == source.file_path + + +def test_task_workspace_reports_latest_execution_and_resources(tmp_path: Path) -> None: + session = _session() + run_id = _run(session) + current = _node(run_id=run_id, slug="current", status="running") + upstream = _node(run_id=run_id, slug="upstream") + session.add_all([current, upstream]) + session.flush() + current_exec = _execution( + run_id=run_id, + node_id=current.id, + status=TaskExecutionStatus.RUNNING, + ) + upstream_exec = _execution(run_id=run_id, node_id=upstream.id) + session.add_all([current_exec, upstream_exec]) + session.flush() + session.add(_edge(run_id=run_id, source_node_id=upstream.id, target_node_id=current.id)) + session.add_all( + [ + _resource( + run_id=run_id, + execution_id=current_exec.id, + name="own.txt", + path=tmp_path / "own.txt", + content=b"own", + ), + _resource( + run_id=run_id, + execution_id=upstream_exec.id, + name="input.txt", + path=tmp_path / "input.txt", + content=b"input", + ), + ] + ) + session.commit() + + workspace = WorkflowService().get_task_workspace( + session, + run_id=run_id, + node_id=current.id, + ) + + assert workspace.task.task_slug == "current" + assert workspace.latest_execution is not None + assert workspace.latest_execution.execution_id == current_exec.id + assert [resource.name for resource in workspace.own_resources] == ["own.txt"] + assert [resource.name for resource in workspace.input_resources] == ["input.txt"] + + +@pytest.mark.asyncio +async def test_materialize_resource_rejects_parent_directory_destination( + tmp_path: Path, +) -> None: + session = _session() + run_id = 
_run(session) + producer = _node(run_id=run_id, slug="producer") + consumer = _node(run_id=run_id, slug="consumer") + session.add_all([producer, consumer]) + session.flush() + producer_exec = _execution(run_id=run_id, node_id=producer.id) + consumer_exec = _execution( + run_id=run_id, + node_id=consumer.id, + status=TaskExecutionStatus.RUNNING, + ) + session.add_all([producer_exec, consumer_exec]) + session.flush() + source = _resource( + run_id=run_id, + execution_id=producer_exec.id, + name="paper.pdf", + path=tmp_path / "paper.pdf", + content=b"paper", + ) + session.add(source) + session.commit() + + with pytest.raises(ValueError, match="destination must stay inside /workspace"): + await WorkflowService().materialize_resource( + session, + run_id=run_id, + current_node_id=consumer.id, + current_execution_id=consumer_exec.id, + sandbox_task_key=consumer.id, + benchmark_type="test", + resource_id=source.id, + destination="../escape/paper.pdf", + dry_run=True, + ) + + +@pytest.mark.asyncio +async def test_add_task_dry_run_does_not_write_node() -> None: + session = _session() + run_id = _run(session) + parent = _node(run_id=run_id, slug="parent", level=1) + session.add(parent) + session.commit() + + result = await WorkflowService().add_task( + session, + run_id=run_id, + parent_node_id=parent.id, + task_slug="child", + description="Child task", + assigned_worker_slug="worker", + dry_run=True, + ) + + nodes = session.exec(select(RunGraphNode).where(RunGraphNode.run_id == run_id)).all() + assert len(nodes) == 1 + assert result.action == "add-task" + assert result.dry_run is True + assert result.node is not None + assert result.node.task_slug == "child" + assert result.node.parent_node_id == parent.id + assert result.node.level == 2 + + +@pytest.mark.asyncio +async def test_add_task_writes_node_and_mutation() -> None: + session = _session() + run_id = _run(session) + parent = _node(run_id=run_id, slug="parent", level=1) + session.add(parent) + session.commit() + 
dispatched = [] + + async def dispatch_task_ready(run_id, definition_id, node_id): + dispatched.append((run_id, definition_id, node_id)) + + result = await WorkflowService(task_ready_dispatcher=dispatch_task_ready).add_task( + session, + run_id=run_id, + parent_node_id=parent.id, + task_slug="child", + description="Child task", + assigned_worker_slug="worker", + dry_run=False, + ) + + assert result.dry_run is False + assert result.node is not None + child = session.get(RunGraphNode, result.node.node_id) + assert child is not None + assert child.task_slug == "child" + assert child.description == "Child task" + assert child.parent_node_id == parent.id + assert child.level == 2 + assert child.status == TaskExecutionStatus.PENDING.value + run = session.get(RunRecord, run_id) + assert run is not None + assert dispatched == [(run_id, run.experiment_definition_id, child.id)] + + +@pytest.mark.asyncio +async def test_add_edge_writes_dependency_between_slugs() -> None: + session = _session() + run_id = _run(session) + source = _node(run_id=run_id, slug="source") + target = _node(run_id=run_id, slug="target") + session.add_all([source, target]) + session.commit() + + result = await WorkflowService().add_edge( + session, + run_id=run_id, + source_task_slug="source", + target_task_slug="target", + dry_run=False, + ) + + assert result.action == "add-edge" + assert result.edge is not None + edge = session.get(RunGraphEdge, result.edge.edge_id) + assert edge is not None + assert edge.source_node_id == source.id + assert edge.target_node_id == target.id + assert edge.status == "pending" + + +@pytest.mark.asyncio +async def test_update_task_description_changes_only_description() -> None: + session = _session() + run_id = _run(session) + node = _node(run_id=run_id, slug="target", description="Old") + session.add(node) + session.commit() + + result = await WorkflowService().update_task_description( + session, + run_id=run_id, + task_slug="target", + description="New description", + 
dry_run=False, + ) + + refreshed = session.get(RunGraphNode, node.id) + assert refreshed is not None + assert refreshed.description == "New description" + assert refreshed.task_slug == "target" + assert result.node is not None + assert result.node.description == "New description" + + +@pytest.mark.asyncio +async def test_restart_and_abandon_task_update_node_status() -> None: + session = _session() + run_id = _run(session) + failed = _node(run_id=run_id, slug="failed", status="failed") + running = _node(run_id=run_id, slug="running", status="running") + session.add_all([failed, running]) + session.commit() + + restarted = await WorkflowService().restart_task( + session, + run_id=run_id, + task_slug="failed", + dry_run=False, + ) + abandoned = await WorkflowService().abandon_task( + session, + run_id=run_id, + task_slug="running", + dry_run=False, + ) + + failed_row = session.get(RunGraphNode, failed.id) + running_row = session.get(RunGraphNode, running.id) + assert failed_row is not None + assert running_row is not None + assert failed_row.status == TaskExecutionStatus.PENDING.value + assert running_row.status == TaskExecutionStatus.CANCELLED.value + assert restarted.action == "restart-task" + assert abandoned.action == "abandon-task" diff --git a/tests/unit/state/test_research_rubrics_benchmark.py b/tests/unit/state/test_research_rubrics_benchmark.py index d56502c3..83ec4933 100644 --- a/tests/unit/state/test_research_rubrics_benchmark.py +++ b/tests/unit/state/test_research_rubrics_benchmark.py @@ -82,6 +82,40 @@ def __getitem__(self, idx): ) ] + def test_load_rows_accepts_vanilla_prompt_field(self, monkeypatch: pytest.MonkeyPatch): + class FakeTrainDataset: + def __len__(self): + return 1 + + def __getitem__(self, idx): + assert idx == 0 + return { + "sample_id": "vanilla-sample", + "domain": "planning", + "prompt": "Plan a day in Washington DC.", + "rubrics": [ + {"criterion": "Includes a timed itinerary.", "axis": "quality", "weight": 5.0}, + ], + } + + 
monkeypatch.setattr( + "ergon_builtins.benchmarks.researchrubrics.benchmark.load_dataset", + lambda *args, **kwargs: {"train": FakeTrainDataset()}, + ) + + rows = ResearchRubricsBenchmark(dataset_name="ScaleAI/researchrubrics")._load_rows() + + assert rows == [ + ResearchRubricsTaskPayload( + sample_id="vanilla-sample", + domain="planning", + ablated_prompt="Plan a day in Washington DC.", + rubrics=[ + {"criterion": "Includes a timed itinerary.", "axis": "quality", "weight": 5.0} + ], + ) + ] + class TestResearchRubricsRubric: """Verify task-payload-driven rubric construction.""" diff --git a/tests/unit/state/test_research_rubrics_workers.py b/tests/unit/state/test_research_rubrics_workers.py index 65e31f7d..be78d1c6 100644 --- a/tests/unit/state/test_research_rubrics_workers.py +++ b/tests/unit/state/test_research_rubrics_workers.py @@ -14,6 +14,7 @@ ResearchRubricsResearcherWorker, ) from ergon_builtins.workers.research_rubrics.workflow_cli_react_worker import ( + _WORKFLOW_PROMPT, ResearchRubricsWorkflowCliReActWorker, ) from ergon_builtins.benchmarks.researchrubrics.toolkit_types import ( @@ -157,6 +158,12 @@ async def test_workflow_cli_worker_adds_workflow_tool(self): assert worker.type_slug == "researchrubrics-workflow-cli-react" assert "workflow" in tool_names + def test_workflow_cli_prompt_uses_current_task_level_for_delegation(self): + assert "inspect task-workspace --format json" in _WORKFLOW_PROMPT + assert "task_workspace.task.level is exactly 0" in _WORKFLOW_PROMPT + assert "Ignore level-0 tasks shown elsewhere in task-tree" in _WORKFLOW_PROMPT + assert "do not call `workflow(\"manage add-task" in _WORKFLOW_PROMPT + @pytest.mark.asyncio async def test_report_write_uses_manager_public_file_api(self): task_id = uuid4() diff --git a/tests/unit/state/test_workflow_cli_tool.py b/tests/unit/state/test_workflow_cli_tool.py index a52f7a6b..85883e76 100644 --- a/tests/unit/state/test_workflow_cli_tool.py +++ b/tests/unit/state/test_workflow_cli_tool.py @@ -7,16 
+7,17 @@ @pytest.mark.asyncio async def test_workflow_tool_injects_worker_context() -> None: + task_key = uuid4() context = WorkerContext( run_id=uuid4(), - task_id=uuid4(), + task_id=task_key, execution_id=uuid4(), sandbox_id="sandbox", node_id=uuid4(), ) seen = {} - def execute(command, *, context, session_factory, service): + async def execute(command, *, context, session_factory, service): seen["command"] = command seen["context"] = context @@ -29,7 +30,7 @@ class Output: workflow = make_workflow_cli_tool( worker_context=context, - sandbox_task_key=context.task_id, + sandbox_task_key=task_key, benchmark_type="researchrubrics", execute_command=execute, ) @@ -39,21 +40,22 @@ class Output: assert seen["context"].run_id == context.run_id assert seen["context"].node_id == context.node_id assert seen["context"].execution_id == context.execution_id - assert seen["context"].sandbox_task_key == context.task_id + assert seen["context"].sandbox_task_key == task_key assert seen["context"].benchmark_type == "researchrubrics" @pytest.mark.asyncio async def test_workflow_tool_reports_nonzero_exit() -> None: + task_key = uuid4() context = WorkerContext( run_id=uuid4(), - task_id=uuid4(), + task_id=task_key, execution_id=uuid4(), sandbox_id="sandbox", node_id=uuid4(), ) - def execute(command, *, context, session_factory, service): + async def execute(command, *, context, session_factory, service): class Output: stdout = "" stderr = "bad command" @@ -63,9 +65,97 @@ class Output: workflow = make_workflow_cli_tool( worker_context=context, - sandbox_task_key=context.task_id, + sandbox_task_key=task_key, benchmark_type="researchrubrics", execute_command=execute, ) assert await workflow("inspect nope") == "workflow exited 2: bad command" + + +@pytest.mark.asyncio +async def test_leaf_workflow_tool_rejects_graph_edit_commands() -> None: + task_key = uuid4() + context = WorkerContext( + run_id=uuid4(), + task_id=task_key, + execution_id=uuid4(), + sandbox_id="sandbox", + 
node_id=uuid4(), + ) + + async def execute(command, *, context, session_factory, service): + raise AssertionError("denied commands must not reach executor") + + workflow = make_workflow_cli_tool( + worker_context=context, + sandbox_task_key=task_key, + benchmark_type="researchrubrics", + execute_command=execute, + ) + + result = await workflow("manage add-task --task-slug child --description Child --worker worker") + + assert result.startswith("workflow denied:") + assert "manager-capable" in result + + +@pytest.mark.asyncio +async def test_manager_workflow_tool_allows_graph_edit_commands() -> None: + task_key = uuid4() + context = WorkerContext( + run_id=uuid4(), + task_id=task_key, + execution_id=uuid4(), + sandbox_id="sandbox", + node_id=uuid4(), + ) + seen = {} + + async def execute(command, *, context, session_factory, service): + seen["command"] = command + + class Output: + stdout = "ok" + stderr = "" + exit_code = 0 + + return Output() + + workflow = make_workflow_cli_tool( + worker_context=context, + sandbox_task_key=task_key, + benchmark_type="researchrubrics", + execute_command=execute, + manager_capable=True, + ) + + assert await workflow("manage restart-task --task-slug child --dry-run") == "ok" + assert seen["command"] == "manage restart-task --task-slug child --dry-run" + + +@pytest.mark.asyncio +async def test_workflow_tool_rejects_multiline_commands() -> None: + task_key = uuid4() + context = WorkerContext( + run_id=uuid4(), + task_id=task_key, + execution_id=uuid4(), + sandbox_id="sandbox", + node_id=uuid4(), + ) + + async def execute(command, *, context, session_factory, service): + raise AssertionError("multiline commands must not reach executor") + + workflow = make_workflow_cli_tool( + worker_context=context, + sandbox_task_key=task_key, + benchmark_type="researchrubrics", + execute_command=execute, + manager_capable=True, + ) + + assert await workflow("inspect task-tree\ninspect next-actions") == ( + "workflow denied: multiline commands are 
not allowed" + ) diff --git a/uv.lock b/uv.lock index d7bfcd87..9f248d29 100644 --- a/uv.lock +++ b/uv.lock @@ -1128,7 +1128,7 @@ requires-dist = [ { name = "outlines", marker = "extra == 'dev'" }, { name = "psycopg2-binary", specifier = ">=2.9.9" }, { name = "pydantic", specifier = ">=2.5.0" }, - { name = "pydantic-ai" }, + { name = "pydantic-ai", specifier = ">=0.8.1" }, { name = "pydantic-settings", specifier = ">=2.1.0" }, { name = "sqlmodel", specifier = ">=0.0.14" }, { name = "structlog", specifier = ">=23.2.0" }, @@ -1489,6 +1489,19 @@ http = [ { name = "aiohttp" }, ] +[[package]] +name = "genai-prices" +version = "0.0.57" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "httpx" }, + { name = "pydantic" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/be/30/11f3d683cf3b1d9612475ad8bfffe3423ce9f50fc617733109033e73a038/genai_prices-0.0.57.tar.gz", hash = "sha256:6e101e9c53975557ceffa237b0995787d81fe75aac12410f2898504188bcad89", size = 66555, upload-time = "2026-04-21T13:42:52.554Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/a9/fe/d0095040c120d97cb63d055224ecd4e913dc5655315c203c8e83bf13aa86/genai_prices-0.0.57-py3-none-any.whl", hash = "sha256:14e50fb69cdc5a06ddb2a6df5a7fe06741b9e44304ce3f1728f56abdf1856cca", size = 69654, upload-time = "2026-04-21T13:42:51.236Z" }, +] + [[package]] name = "genson" version = "1.3.0" @@ -2635,14 +2648,14 @@ wheels = [ [[package]] name = "nexus-rpc" -version = "1.4.0" +version = "1.1.0" source = { registry = "https://pypi.org/simple" } dependencies = [ { name = "typing-extensions" }, ] -sdist = { url = "https://files.pythonhosted.org/packages/35/d5/cd1ffb202b76ebc1b33c1332a3416e55a39929006982adc2b1eb069aaa9b/nexus_rpc-1.4.0.tar.gz", hash = "sha256:3b8b373d4865671789cc43623e3dc0bcbf192562e40e13727e17f1c149050fba", size = 82367, upload-time = "2026-02-25T22:01:34.053Z" } +sdist = { url = 
"https://files.pythonhosted.org/packages/ef/66/540687556bd28cf1ec370cc6881456203dfddb9dab047b8979c6865b5984/nexus_rpc-1.1.0.tar.gz", hash = "sha256:d65ad6a2f54f14e53ebe39ee30555eaeb894102437125733fb13034a04a44553", size = 77383, upload-time = "2025-07-07T19:03:58.368Z" } wheels = [ - { url = "https://files.pythonhosted.org/packages/11/52/6327a5f4fda01207205038a106a99848a41c83e933cd23ea2cab3d2ebc6c/nexus_rpc-1.4.0-py3-none-any.whl", hash = "sha256:14c953d3519113f8ccec533a9efdb6b10c28afef75d11cdd6d422640c40b3a49", size = 29645, upload-time = "2026-02-25T22:01:33.122Z" }, + { url = "https://files.pythonhosted.org/packages/bf/2f/9e9d0dcaa4c6ffa22b7aa31069a8a264c753ff8027b36af602cce038c92f/nexus_rpc-1.1.0-py3-none-any.whl", hash = "sha256:d1b007af2aba186a27e736f8eaae39c03aed05b488084ff6c3d1785c9ba2ad38", size = 27743, upload-time = "2025-07-07T19:03:57.556Z" }, ] [[package]] @@ -3312,17 +3325,16 @@ wheels = [ [[package]] name = "protobuf" -version = "6.33.6" +version = "5.29.6" source = { registry = "https://pypi.org/simple" } -sdist = { url = "https://files.pythonhosted.org/packages/66/70/e908e9c5e52ef7c3a6c7902c9dfbb34c7e29c25d2f81ade3856445fd5c94/protobuf-6.33.6.tar.gz", hash = "sha256:a6768d25248312c297558af96a9f9c929e8c4cee0659cb07e780731095f38135", size = 444531, upload-time = "2026-03-18T19:05:00.988Z" } +sdist = { url = "https://files.pythonhosted.org/packages/7e/57/394a763c103e0edf87f0938dafcd918d53b4c011dfc5c8ae80f3b0452dbb/protobuf-5.29.6.tar.gz", hash = "sha256:da9ee6a5424b6b30fd5e45c5ea663aef540ca95f9ad99d1e887e819cdf9b8723", size = 425623, upload-time = "2026-02-04T22:54:40.584Z" } wheels = [ - { url = "https://files.pythonhosted.org/packages/fc/9f/2f509339e89cfa6f6a4c4ff50438db9ca488dec341f7e454adad60150b00/protobuf-6.33.6-cp310-abi3-win32.whl", hash = "sha256:7d29d9b65f8afef196f8334e80d6bc1d5d4adedb449971fefd3723824e6e77d3", size = 425739, upload-time = "2026-03-18T19:04:48.373Z" }, - { url = 
"https://files.pythonhosted.org/packages/76/5d/683efcd4798e0030c1bab27374fd13a89f7c2515fb1f3123efdfaa5eab57/protobuf-6.33.6-cp310-abi3-win_amd64.whl", hash = "sha256:0cd27b587afca21b7cfa59a74dcbd48a50f0a6400cfb59391340ad729d91d326", size = 437089, upload-time = "2026-03-18T19:04:50.381Z" }, - { url = "https://files.pythonhosted.org/packages/5c/01/a3c3ed5cd186f39e7880f8303cc51385a198a81469d53d0fdecf1f64d929/protobuf-6.33.6-cp39-abi3-macosx_10_9_universal2.whl", hash = "sha256:9720e6961b251bde64edfdab7d500725a2af5280f3f4c87e57c0208376aa8c3a", size = 427737, upload-time = "2026-03-18T19:04:51.866Z" }, - { url = "https://files.pythonhosted.org/packages/ee/90/b3c01fdec7d2f627b3a6884243ba328c1217ed2d978def5c12dc50d328a3/protobuf-6.33.6-cp39-abi3-manylinux2014_aarch64.whl", hash = "sha256:e2afbae9b8e1825e3529f88d514754e094278bb95eadc0e199751cdd9a2e82a2", size = 324610, upload-time = "2026-03-18T19:04:53.096Z" }, - { url = "https://files.pythonhosted.org/packages/9b/ca/25afc144934014700c52e05103c2421997482d561f3101ff352e1292fb81/protobuf-6.33.6-cp39-abi3-manylinux2014_s390x.whl", hash = "sha256:c96c37eec15086b79762ed265d59ab204dabc53056e3443e702d2681f4b39ce3", size = 339381, upload-time = "2026-03-18T19:04:54.616Z" }, - { url = "https://files.pythonhosted.org/packages/16/92/d1e32e3e0d894fe00b15ce28ad4944ab692713f2e7f0a99787405e43533a/protobuf-6.33.6-cp39-abi3-manylinux2014_x86_64.whl", hash = "sha256:e9db7e292e0ab79dd108d7f1a94fe31601ce1ee3f7b79e0692043423020b0593", size = 323436, upload-time = "2026-03-18T19:04:55.768Z" }, - { url = "https://files.pythonhosted.org/packages/c4/72/02445137af02769918a93807b2b7890047c32bfb9f90371cbc12688819eb/protobuf-6.33.6-py3-none-any.whl", hash = "sha256:77179e006c476e69bf8e8ce866640091ec42e1beb80b213c3900006ecfba6901", size = 170656, upload-time = "2026-03-18T19:04:59.826Z" }, + { url = "https://files.pythonhosted.org/packages/d4/88/9ee58ff7863c479d6f8346686d4636dd4c415b0cbeed7a6a7d0617639c2a/protobuf-5.29.6-cp310-abi3-win32.whl", hash = 
"sha256:62e8a3114992c7c647bce37dcc93647575fc52d50e48de30c6fcb28a6a291eb1", size = 423357, upload-time = "2026-02-04T22:54:25.805Z" }, + { url = "https://files.pythonhosted.org/packages/1c/66/2dc736a4d576847134fb6d80bd995c569b13cdc7b815d669050bf0ce2d2c/protobuf-5.29.6-cp310-abi3-win_amd64.whl", hash = "sha256:7e6ad413275be172f67fdee0f43484b6de5a904cc1c3ea9804cb6fe2ff366eda", size = 435175, upload-time = "2026-02-04T22:54:28.592Z" }, + { url = "https://files.pythonhosted.org/packages/06/db/49b05966fd208ae3f44dcd33837b6243b4915c57561d730a43f881f24dea/protobuf-5.29.6-cp38-abi3-macosx_10_9_universal2.whl", hash = "sha256:b5a169e664b4057183a34bdc424540e86eea47560f3c123a0d64de4e137f9269", size = 418619, upload-time = "2026-02-04T22:54:30.266Z" }, + { url = "https://files.pythonhosted.org/packages/b7/d7/48cbf6b0c3c39761e47a99cb483405f0fde2be22cf00d71ef316ce52b458/protobuf-5.29.6-cp38-abi3-manylinux2014_aarch64.whl", hash = "sha256:a8866b2cff111f0f863c1b3b9e7572dc7eaea23a7fae27f6fc613304046483e6", size = 320284, upload-time = "2026-02-04T22:54:31.782Z" }, + { url = "https://files.pythonhosted.org/packages/e3/dd/cadd6ec43069247d91f6345fa7a0d2858bef6af366dbd7ba8f05d2c77d3b/protobuf-5.29.6-cp38-abi3-manylinux2014_x86_64.whl", hash = "sha256:e3387f44798ac1106af0233c04fb8abf543772ff241169946f698b3a9a3d3ab9", size = 320478, upload-time = "2026-02-04T22:54:32.909Z" }, + { url = "https://files.pythonhosted.org/packages/5a/cb/e3065b447186cb70aa65acc70c86baf482d82bf75625bf5a2c4f6919c6a3/protobuf-5.29.6-py3-none-any.whl", hash = "sha256:6b9edb641441b2da9fa8f428760fc136a49cf97a52076010cf22a2ff73438a86", size = 173126, upload-time = "2026-02-04T22:54:39.462Z" }, ] [[package]] @@ -3605,22 +3617,23 @@ email = [ [[package]] name = "pydantic-ai" -version = "0.7.2" +version = "0.8.1" source = { registry = "https://pypi.org/simple" } dependencies = [ { name = "pydantic-ai-slim", extra = ["ag-ui", "anthropic", "bedrock", "cli", "cohere", "evals", "google", "groq", "huggingface", "mcp", 
"mistral", "openai", "retries", "temporal", "vertexai"] }, ] -sdist = { url = "https://files.pythonhosted.org/packages/6f/d0/ca0dbea87aa677192fa4b663532bd37ae8273e883c55b661b786dbb52731/pydantic_ai-0.7.2.tar.gz", hash = "sha256:d215c323741d47ff13c6b48aa75aedfb8b6b5f9da553af709675c3078a4be4fc", size = 43763306, upload-time = "2025-08-14T22:59:58.912Z" } +sdist = { url = "https://files.pythonhosted.org/packages/56/d7/fcc18ce80008e888404a3615f973aa3f39b98384d61b03621144c9f4c2d4/pydantic_ai-0.8.1.tar.gz", hash = "sha256:05974382082ee4f3706909d06bdfcc5e95f39e29230cc4d00e47429080099844", size = 43772581, upload-time = "2025-08-29T14:46:23.201Z" } wheels = [ - { url = "https://files.pythonhosted.org/packages/a3/77/402a278b9694cdfaeb5bf0ed4e0fee447de624aa67126ddcce8d98dc6062/pydantic_ai-0.7.2-py3-none-any.whl", hash = "sha256:a6e5d0994aa87385a05fdfdad7fda1fd14576f623635e4000883c4c7856eba13", size = 10188, upload-time = "2025-08-14T22:59:50.653Z" }, + { url = "https://files.pythonhosted.org/packages/f9/04/802b8cf834dffcda8baabb3b76c549243694a83346c3f54e47a3a4d519fb/pydantic_ai-0.8.1-py3-none-any.whl", hash = "sha256:5fa923097132aa69b4d6a310b462dc091009c7b87705edf4443d37b887d5ef9a", size = 10188, upload-time = "2025-08-29T14:46:11.137Z" }, ] [[package]] name = "pydantic-ai-slim" -version = "0.7.2" +version = "0.8.1" source = { registry = "https://pypi.org/simple" } dependencies = [ { name = "eval-type-backport" }, + { name = "genai-prices" }, { name = "griffe" }, { name = "httpx" }, { name = "opentelemetry-api" }, @@ -3628,9 +3641,9 @@ dependencies = [ { name = "pydantic-graph" }, { name = "typing-inspection" }, ] -sdist = { url = "https://files.pythonhosted.org/packages/96/39/87500c5e038296fe1becf62ac24f7e62dd5a1fb7fe63a9e29c58a2898b1a/pydantic_ai_slim-0.7.2.tar.gz", hash = "sha256:636ca32c8928048ba1173963aab6b7eb33b71174bbc371ad3f2096fee4c48dfe", size = 211787, upload-time = "2025-08-14T23:00:02.67Z" } +sdist = { url = 
"https://files.pythonhosted.org/packages/a2/91/08137459b3745900501b3bd11852ced6c81b7ce6e628696d75b09bb786c5/pydantic_ai_slim-0.8.1.tar.gz", hash = "sha256:12ef3dcbe5e1dad195d5e256746ef960f6e59aeddda1a55bdd553ee375ff53ae", size = 218906, upload-time = "2025-08-29T14:46:27.517Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/ea/93/fc3723a7cde4a8edb2d060fb8abeba22270ae61984796ab653fdd05baca0/pydantic_ai_slim-0.7.2-py3-none-any.whl", hash = "sha256:f5749d63bf4c2deac45371874df30d1d76a1572ce9467f6505926ecb835da583", size = 289755, upload-time = "2025-08-14T22:59:53.346Z" },
+    { url = "https://files.pythonhosted.org/packages/11/ce/8dbadd04f578d02a9825a46e931005743fe223736296f30b55846c084fab/pydantic_ai_slim-0.8.1-py3-none-any.whl", hash = "sha256:fc7edc141b21fe42bc54a2d92c1127f8a75160c5e57a168dba154d3f4adb963f", size = 297821, upload-time = "2025-08-29T14:46:14.647Z" },
 ]

 [package.optional-dependencies]
@@ -3647,6 +3660,7 @@ bedrock = [
 cli = [
     { name = "argcomplete" },
     { name = "prompt-toolkit" },
+    { name = "pyperclip" },
     { name = "rich" },
 ]
 cohere = [
@@ -3739,7 +3753,7 @@ wheels = [

 [[package]]
 name = "pydantic-evals"
-version = "0.7.2"
+version = "0.8.1"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
     { name = "anyio" },
@@ -3749,9 +3763,9 @@ dependencies = [
     { name = "pyyaml" },
     { name = "rich" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/32/b7/005b1b23b96abf2bce880a4c10496c00f8ebd67690f6888e576269059f54/pydantic_evals-0.7.2.tar.gz", hash = "sha256:0cf7adee67b8a12ea0b41e5162c7256ae0f6a237acb1eea161a74ed6cf61615a", size = 44086, upload-time = "2025-08-14T23:00:03.606Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/6c/9d/460a1f2c9f5f263e9d8e9661acbd654ccc81ad3373ea43048d914091a817/pydantic_evals-0.8.1.tar.gz", hash = "sha256:c398a623c31c19ce70e346ad75654fcb1517c3f6a821461f64fe5cbbe0813023", size = 43933, upload-time = "2025-08-29T14:46:28.903Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/7c/6f/3b844991fc1223f9c3b201f222397b0d115e236389bd90ced406ebc478ea/pydantic_evals-0.7.2-py3-none-any.whl", hash = "sha256:c7497d89659c35fbcaefbeb6f457ae09d62e36e161c4b25a462808178b7cfa92", size = 52753, upload-time = "2025-08-14T22:59:55.018Z" },
+    { url = "https://files.pythonhosted.org/packages/6f/f9/1d21c4687167c4fa76fd3b1ed47f9bc2d38fd94cbacd9aa3f19e82e59830/pydantic_evals-0.8.1-py3-none-any.whl", hash = "sha256:6c76333b1d79632f619eb58a24ac656e9f402c47c75ad750ba0230d7f5514344", size = 52602, upload-time = "2025-08-29T14:46:16.602Z" },
 ]

 [[package]]
@@ -3774,7 +3788,7 @@ pycountry = [

 [[package]]
 name = "pydantic-graph"
-version = "0.7.2"
+version = "0.8.1"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
     { name = "httpx" },
@@ -3782,9 +3796,9 @@ dependencies = [
     { name = "pydantic" },
     { name = "typing-inspection" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/cf/a9/8a918b4dc2cd55775d854e076823fa9b60a390e4fbec5283916346556754/pydantic_graph-0.7.2.tar.gz", hash = "sha256:f90e4ec6f02b899bf6f88cc026dafa119ea5041ab4c62ba81497717c003a946e", size = 21804, upload-time = "2025-08-14T23:00:04.834Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/bd/97/b35b7cb82d9f1bb6d5c6d21bba54f6196a3a5f593373f3a9c163a3821fd7/pydantic_graph-0.8.1.tar.gz", hash = "sha256:c61675a05c74f661d4ff38d04b74bd652c1e0959467801986f2f85dc7585410d", size = 21675, upload-time = "2025-08-29T14:46:29.839Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/12/d7/639c69dda9e4b4cf376c9f45e5eae96721f2dc2f2dc618fb63142876dce4/pydantic_graph-0.7.2-py3-none-any.whl", hash = "sha256:b6189500a465ce1bce4bbc65ac5871149af8e0f81a15d54540d3dfc0cc9b2502", size = 27392, upload-time = "2025-08-14T22:59:56.564Z" },
+    { url = "https://files.pythonhosted.org/packages/3d/e3/5908643b049bb2384d143885725cbeb0f53707d418357d4d1ac8d2c82629/pydantic_graph-0.8.1-py3-none-any.whl", hash = "sha256:f1dd5db0fe22f4e3323c04c65e2f0013846decc312b3efc3196666764556b765", size = 27239, upload-time = "2025-08-29T14:46:18.317Z" },
 ]

 [[package]]
@@ -3871,6 +3885,15 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/29/7d/5945b5af29534641820d3bd7b00962abbbdfee84ec7e19f0d5b3175f9a31/pynacl-1.6.2-cp38-abi3-win_arm64.whl", hash = "sha256:834a43af110f743a754448463e8fd61259cd4ab5bbedcf70f9dabad1d28a394c", size = 184801, upload-time = "2026-01-01T17:32:36.309Z" },
 ]

+[[package]]
+name = "pyperclip"
+version = "1.11.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/e8/52/d87eba7cb129b81563019d1679026e7a112ef76855d6159d24754dbd2a51/pyperclip-1.11.0.tar.gz", hash = "sha256:244035963e4428530d9e3a6101a1ef97209c6825edab1567beac148ccc1db1b6", size = 12185, upload-time = "2025-09-26T14:40:37.245Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/df/80/fc9d01d5ed37ba4c42ca2b55b4339ae6e200b456be3a1aaddf4a9fa99b8c/pyperclip-1.11.0-py3-none-any.whl", hash = "sha256:299403e9ff44581cb9ba2ffeed69c7aa96a008622ad0c46cb575ca75b5b84273", size = 11063, upload-time = "2025-09-26T14:40:36.069Z" },
+]
+
 [[package]]
 name = "pytest"
 version = "9.0.3"
@@ -4935,7 +4958,7 @@ wheels = [

 [[package]]
 name = "temporalio"
-version = "1.25.0"
+version = "1.16.0"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
     { name = "nexus-rpc" },
@@ -4943,13 +4966,12 @@ dependencies = [
     { name = "types-protobuf" },
     { name = "typing-extensions" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/de/9c/3782bab0bf11a40b550147c19a5d1a476c17405391751982408902d9f138/temporalio-1.25.0.tar.gz", hash = "sha256:a3bbec1dcc904f674402cfa4faae480fda490b1c53ea5440c1f1996c562016fb", size = 2152534, upload-time = "2026-04-08T18:53:55.388Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/f3/32/375ab75d0ebb468cf9c8abbc450a03d3a8c66401fc320b338bd8c00d36b4/temporalio-1.16.0.tar.gz", hash = "sha256:dd926f3e30626fd4edf5e0ce596b75ecb5bbe0e4a0281e545ac91b5577967c91", size = 1733873, upload-time = "2025-08-21T22:12:50.879Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/19/e3/5676dd10d1164b6d6ca8752314054097b89c5da931e936af402a7b15236c/temporalio-1.25.0-cp310-abi3-macosx_10_12_x86_64.whl", hash = "sha256:6dc1bc8e1773b1a833d86a7ede2dd90ef4e031ced5b748b59e7f09a5bf9b327d", size = 13943906, upload-time = "2026-04-08T18:53:30.022Z" },
-    { url = "https://files.pythonhosted.org/packages/89/50/7cbf7f845973be986ec165348f72f7a409750842a04d554965a39be5cb4f/temporalio-1.25.0-cp310-abi3-macosx_11_0_arm64.whl", hash = "sha256:3c8fdcf79ea5ae8ae2cf6f48072e4a86c3e0f4778f6a8a066c6ff1d336587db4", size = 13298719, upload-time = "2026-04-08T18:53:35.95Z" },
-    { url = "https://files.pythonhosted.org/packages/d2/31/d474bab8535552add6ed289911bf1ffae5d7071823ece1069842190fcaed/temporalio-1.25.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:141f37aaafd7d090ba5c8776e4e9bc60df1fbc64b9f50c8f00e905a436588ddc", size = 13555435, upload-time = "2026-04-08T18:53:41.36Z" },
-    { url = "https://files.pythonhosted.org/packages/2a/c8/e7dc053d6107bf2a037a3c9fe7b86639a25dcb888bde0e1ca366901ee47f/temporalio-1.25.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ff7ca5bb80264976477d4dc7a839b3d22af8577ae92306526a061481db49bf92", size = 14052050, upload-time = "2026-04-08T18:53:46.44Z" },
-    { url = "https://files.pythonhosted.org/packages/08/70/9340ed3a578321cbc153041d34834bb1ec3f1f3e3d9cded47cd1b7c3e403/temporalio-1.25.0-cp310-abi3-win_amd64.whl", hash = "sha256:9411534279a2e64847231b6059c214bff4d57cfd1532bd09f333d0b1603daa7f", size = 14299684, upload-time = "2026-04-08T18:53:52.482Z" },
+    { url = "https://files.pythonhosted.org/packages/e0/36/12bb7234c83ddca4b8b032c8f1a9e07a03067c6ed6d2ddb39c770a4c87c6/temporalio-1.16.0-cp39-abi3-macosx_11_0_arm64.whl", hash = "sha256:547c0853310350d3e5b5b9c806246cbf2feb523f685b05bf14ec1b0ece8a7bb6", size = 12540769, upload-time = "2025-08-21T22:11:24.551Z" },
+    { url = "https://files.pythonhosted.org/packages/3c/16/a7d402435b8f994979abfeffd3f5ffcaaeada467ac16438e61c51c9f7abe/temporalio-1.16.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5b05bb0d06025645aed6f936615311a6774eb8dc66280f32a810aac2283e1258", size = 12968631, upload-time = "2025-08-21T22:11:48.375Z" },
+    { url = "https://files.pythonhosted.org/packages/11/6f/16663eef877b61faa5fd917b3a63497416ec4319195af75f6169a1594479/temporalio-1.16.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0a08aed4e0f6c2b6bfc779b714e91dfe8c8491a0ddb4c4370627bb07f9bddcfd", size = 13164612, upload-time = "2025-08-21T22:12:16.366Z" },
+    { url = "https://files.pythonhosted.org/packages/af/0e/8c6704ca7033aa09dc084f285d70481d758972cc341adc3c84d5f82f7b01/temporalio-1.16.0-cp39-abi3-win_amd64.whl", hash = "sha256:7c190362b0d7254f1f93fb71456063e7b299ac85a89f6227758af82c6a5aa65b", size = 13177058, upload-time = "2025-08-21T22:12:44.239Z" },
 ]

 [[package]]