Conversation
Devin Review found 1 new potential issue.
⚠️ 1 issue in files not directly in the diff
⚠️ Python contract path (BaseAgentContract.to_request()) does not propagate ingest_cmd (src/valkyrie/contract.py:100-111)
The PR adds ingest_cmd to both AgentContract (YAML schema at src/valkyrie/schemas.py:162) and AgentContractRequest (at services/tracker/src/tracker/database/models.py:79), and the YAML parser correctly forwards it (src/valkyrie/cli/bundler.py:164). However, BaseAgentContract.to_request() at src/valkyrie/contract.py:100-111 was not updated to include ingest_cmd in the constructed AgentContractRequest. This means Python-based contracts (.py contract files) will always produce ingest_cmd=None, silently disabling Docent ingestion for any agent using the Python contract format. The BaseAgentContract class also lacks an ingest_cmd property, so there's no way for a Python contract subclass to even specify the value.
View 14 additional findings in Devin Review.
Force-pushed 2a6c26b to ca21591.
Force-pushed df1acac to 0fd0427.
Force-pushed 77bb5e2 to 13c4f14.
For PR #309 review. Captures the problem, data flow, components, design decisions (ingest_cmd in contract, sandbox runs script, tracker uploads, fire-and-forget), failure handling, and open follow-ups (post-enqueue visibility, rerun dedup, non-model-library agents, per-job collection knob). Marked temporary — supersede by docs/CONTRACTS.md once shipped.
Force-pushed 968a4ec to 9364e0a.
Force-pushed 9364e0a to 771cc16.
JarettForzano
left a comment
Solid PR: functionality correctly separated and the changes are documented; I only had light style comments. Could we get this rolling for the terminus2 agent too?
- Move IngestError from tracker.docent_ingest to tracker.exceptions to match the existing pattern (TrackerServiceError subclasses live there).
- Convert TaskIngestContext from a `@dataclass` to a pydantic BaseModel with a field_validator for agent_output_path that coerces None to ''. Lets the call site in process_task pass contract.final_output directly instead of inlining the str(...) if/else. DocentJobContext stays a stdlib `@dataclass` because its 'client' field is a live SDK object that pydantic can't meaningfully validate.
- Extract the per-job docent resolution out of process_benchmark into _resolve_docent_for_job(); the call site shrinks from 23 lines to 5.
- Add a log_output: Callable | None parameter to ingest_task so per-task ingest progress and failure messages are mirrored to the per-task CloudWatch stream (not just the worker logger).
- Capture docent failures to Sentry from both resolve_docent_job_context and ingest_task's broad except, so dashboards surface them.
- Fix docstring: the rerun flag is --task-ids (Valkyrie), not --rerun-task-ids.
- Drop 'from __future__ import annotations': not needed; pydantic prefers evaluable annotations and there are no forward references.
- docs: split the Docent ingestion section out of CONTRACTS.md into a dedicated docs/DOCENT.md so the integration can grow examples without bloating CONTRACTS. CONTRACTS.md now has a one-line pointer under a new 'Integrations' section.
- Note in get_or_create_collection's docstring that the SDK doesn't expose a name-based collection lookup, so we list-and-filter by design.
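The TaskIngestContext conversion can be sketched as follows (field names other than `agent_output_path` are illustrative, not the real model):

```python
from pydantic import BaseModel, field_validator


class TaskIngestContext(BaseModel):
    """Sketch of the pydantic version; only the None-to-'' coercion on
    agent_output_path is taken from the commit message."""
    task_id: str
    agent_output_path: str = ""

    @field_validator("agent_output_path", mode="before")
    @classmethod
    def _coerce_none_to_empty(cls, value: object) -> str:
        # Lets process_task pass contract.final_output (possibly None)
        # directly, instead of inlining a str(...) if/else at the call site.
        return "" if value is None else str(value)
```

The `mode="before"` validator runs before type validation, so passing `None` into a `str` field is accepted and coerced rather than rejected.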
Adds an optional `ingest_cmd` field (deserialized from `ingest_cmd:` in contract.yaml) so agents can opt in to Docent ingestion. The tracker runs this command in the sandbox after a task completes. Plumbs the field through Bundler, the contract DB model, and bumps contract version metadata.
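For illustration, an opt-in could look like this in contract.yaml (only `ingest_cmd` is the new field; the other key is a placeholder, not the real schema):

```yaml
name: docent-finance-agent        # placeholder key, illustrative only
ingest_cmd: model-library-ingest  # the new optional field; omit to skip ingestion
```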
Adds tracker.docent_ingest, the tracker-side half of the Docent ingestion pipeline. Responsibilities: - Resolve a per-job DocentClient + Docent collection (one collection per benchmark, get-or-create). - Run the agent-supplied `ingest_cmd` in the sandbox under a 300s timeout, from the agent bundle dir, with two env vars: the agent output path and the path the script must write its AgentRun JSON to. - Read the JSON back via base64 + AgentRun.model_validate; surface decode/JSON/schema/exit-code failures as IngestError. - Merge tracker-owned metadata (job_id, task_id, benchmark, agent, eval_result, ingested_at) onto the AgentRun, preserving any other metadata the script set; tracker keys always win. - Upload fire-and-forget. The DOCENT_API_KEY never enters the sandbox.
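The metadata-merge precedence rule can be sketched like so (the function shape is an assumption; the real implementation lives in tracker.docent_ingest):

```python
from datetime import datetime, timezone
from typing import Any


def merge_tracker_metadata(
    script_metadata: dict[str, Any],
    *,
    job_id: str,
    task_id: str,
    benchmark: str,
    agent: str,
    eval_result: dict[str, Any],
) -> dict[str, Any]:
    """Preserve any metadata the ingest script set, but tracker-owned
    keys always win (the later dict wins in a merge)."""
    tracker_owned = {
        "job_id": job_id,
        "task_id": task_id,
        "benchmark": benchmark,
        "agent": agent,
        "eval_result": eval_result,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    return {**script_metadata, **tracker_owned}
```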
Calls into tracker.docent_ingest from the benchmark/task pipeline: `process_benchmark` resolves the DocentJobContext once per benchmark (skipped silently if no API key); `process_task` calls ingest_task after the task is committed FINISHED, so any post-finish failure can never flip the task back to ERROR.
- docent-python is a new tracker runtime dep. - Relaxes the boto3 pin so docent-python (which transitively pins a newer boto3) can coexist; uses a uv override to keep root resolution stable. - Makefile change makes the worker's stdout visible during `make dev` so ingest logs surface during local iteration.
Documents the optional `ingest_cmd` contract field, the two env vars the script receives (`VALKYRIE_AGENT_OUTPUT_PATH`, `VALKYRIE_AGENT_RUN_OUT_PATH`), and the tracker-owned metadata that gets merged onto every AgentRun.
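A minimal custom ingest script, as a sketch: only the two env var names come from the docs; the dict shape is a placeholder, not the real `docent.data_models.AgentRun` schema, which the real script must emit.

```python
import json
import os


def main() -> None:
    # Both paths are provided by the tracker when it runs ingest_cmd.
    agent_output_path = os.environ["VALKYRIE_AGENT_OUTPUT_PATH"]
    run_out_path = os.environ["VALKYRIE_AGENT_RUN_OUT_PATH"]

    with open(agent_output_path) as f:
        agent_output = f.read()

    # Placeholder shape: a real script converts agent_output into JSON
    # that validates as docent.data_models.AgentRun.
    agent_run = {
        "transcripts": [
            {"messages": [{"role": "assistant", "content": agent_output}]}
        ],
        "metadata": {"source": "custom-ingest-sketch"},
    }

    with open(run_out_path, "w") as f:
        json.dump(agent_run, f)
```

Any tracker-owned metadata keys the script sets here would be overwritten by the tracker on upload, per the merge rule above.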
- Drop the parenthetical '(one paragraph)' header.
- Trim hedge phrasing throughout ('we want / expects it to / etc').
- Drop the false claim that IngestError has subtypes (it's one class).
- Drop the stale '~190 lines' magic number.
- Drop the failure-table row pointing at a removed open question.
- Update intro to 'move to Notion before merge' (per reviewer).
The Valkyrie Docent Ingestion design notes have moved to Notion. Removing the in-repo design doc and linking to the Notion page from docs/DOCENT.md. Notion: https://www.notion.so/vals-ai/Valkyrie-Docent-Ingestion-3528a877dd8d80a2a728e86bf70dd265
- Fold test_injects_required_metadata into test_tracker_owned_keys_always_win. The two tests overlapped — both confirmed tracker-owned keys land on the AgentRun. Combined into one test that checks every required key plus the precedence rule.
- Trim the metadata block in TestIngestTask.test_happy_path to a single 'job_id is set' check. The full merge logic is exercised by TestMergeTrackerMetadata; here we only need to confirm the wrapper invokes it before uploading.
Two new Devin findings:
1. `except Exception` doesn't catch `asyncio.CancelledError` (a BaseException subclass since Python 3.8). On worker shutdown the cancellation would propagate past ingest_task, sail through process_task's inner `except Exception`, and flip a FINISHED task to ERROR via the outer commit_task_error path — exactly what the design says must not happen.
2. The `_surface` calls invoking the `log_output` callback sat outside the try/except, so a failure inside `log_output` (e.g. asyncio.QueueFull) would similarly escape.

Fix: catch `BaseException` for the main pipeline; extract a `_safe_surface` helper that internally swallows and logs any exception from `log_output`. The eval result is already committed, so preserving FINISHED state under cancellation is the canonical behaviour even if it means cancellation doesn't propagate further from this call.
Mirror what platform does in p/docent-tracing/backend/app/docent/client.py: right after `create_collection`, call `share_collection_with_organization` with the Vals Docent org id at admin permission. Without this every new benchmark collection (one per benchmark name, on first run) is owned only by the API key's user — teammates can't see or edit it without manual sharing in the Docent UI. Sharing is creation-only: existing collections that we look up by name already came from a prior run, so they're either already shared (newly created via this code path) or were created manually before this change landed and need a one-time backfill — re-sharing every job would be wasted API calls.
Force-pushed 1f19e5d to e22d2ac.
Replaces the hardcoded _VALS_DOCENT_ORG_ID constant in docent_ingest.py with a per-tracker configuration field, plumbed through the same X-Harness-* header path as docent_api_key_secret_name. When `docent_share_with_org_id` is set on HarnessConfig, newly-created collections are auto-shared with that org at admin permission. When unset, ingestion still works but collections aren't auto-shared (they must be shared manually in Docent's UI).
- HarnessConfig gains an optional docent_share_with_org_id field.
- get_or_create_collection / resolve_docent_job_context take an optional share_with_org_id kwarg; sharing is a no-op when None.
- fetch_harness_config (server) and TrackerService (CLI) plumb the field through.
- Tests updated: split the create-and-share path from create-only.
Drops the docent_share_with_org_id config knob added in 312be07 in favor of `client.get_my_organizations()` — cleaner, no config wiring needed, and it naturally degrades to a no-op when the key is personal (zero orgs).
- get_or_create_collection: after create_collection, iterate the orgs the authenticated user belongs to and share each at admin permission.
- HarnessConfig: remove docent_share_with_org_id; only docent_api_key_secret_name is needed.
- CLI / fetch_harness_config: drop the field plumbing.
- Tests: split into 'shares with every org' (multi-org case) and 'no orgs → no share calls' (personal key case).
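Sketched against a duck-typed client — the method names (`create_collection`, `get_my_organizations`, `share_collection_with_organization`) come from the commits above, but their exact signatures and return shapes are assumptions about the Docent SDK:

```python
from typing import Any


def get_or_create_collection(client: Any, name: str) -> str:
    """Find a collection by name (the SDK has no name-based lookup, so we
    list and filter), creating and org-sharing it on first use."""
    for collection in client.list_collections():
        if collection["name"] == name:
            # Pre-existing collections were already shared at creation
            # (or need a one-time manual backfill); don't re-share per job.
            return collection["id"]

    collection_id = client.create_collection(name=name)
    # Personal keys return zero orgs, so this loop degrades to a no-op.
    for org in client.get_my_organizations():
        client.share_collection_with_organization(
            collection_id, org["id"], permission="admin"
        )
    return collection_id
```

Sharing only on the create path keeps the per-job cost at zero for existing collections, matching the "creation-only" rationale above.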
…AME to ConfigValue - Remove Notion design notes link - Replace verbose HarnessConfig paragraph with 'valk config set' example - Link model-library-ingest docs in the model-library-based agents section - Add DOCENT_API_KEY_SECRET_NAME to ConfigValue enum so the new command works Co-Authored-By: omar@vals.ai <omar.almatov@berkeley.edu>
Suggested change under "### Model-library-based agents":

```diff
-Use the stock `model-library-ingest` CLI, which reads `Agent.run`'s on-disk output:
+Use the stock [`model-library-ingest`](https://github.com/vals-ai/model-proxy/blob/dev/docs/docent.md#post-hoc-ingestion) CLI, which reads `Agent.run`'s on-disk output:
```
This is a private repository, you should only be linking public things here
JarettForzano
left a comment
Approved but resolve last comment
Summary
Add optional per-task ingestion of agent runs into Transluce Docent. When an agent contract sets `ingest_cmd`, the tracker runs that script inside the same Daytona sandbox that just ran the agent (after `evaluate_instance` returns), validates the produced `AgentRun` JSON, merges tracker-owned metadata, and uploads to a per-benchmark Docent collection. Ingestion failures are non-fatal — the task reaches `FINISHED` regardless.

Related PRs
- `model-library-ingest`: the default ingest CLI that model-library-based agents can use verbatim in their `ingest_cmd`. No per-agent Python needed; the CLI reads `Agent.run`'s on-disk output and produces the AgentRun JSON this tracker expects.
- A demo agent (`docent-finance-agent`) wiring the contract field + `model-library-ingest` together. Already pushed to Valkyrie; reviewers don't need to push anything.

Changes
- `AgentContract.ingest_cmd: str | None` — new optional contract YAML field
- `HarnessConfig.docent_api_key_secret_name: str | None` — new tracker-level config; the CLI forwards it as `X-Harness-Docent-Api-Key-Secret-Name`
- `services/tracker/src/tracker/docent_ingest.py` — new module: `resolve_docent_job_context`, `run_ingest_script`, `merge_tracker_metadata`, `upload_agent_run`, `ingest_task` (all failures caught and logged, never raised upward)
- `process_benchmark` resolves the Docent API key and finds-or-creates the per-benchmark collection at job start; `process_task` runs `ingest_task` after the `EvaluationResult` row is committed
- `docs/CONTRACTS.md` — the new field described with its env-var interface

How to test locally
Start the tracker locally with AWS creds in your shell so it can fetch the Docent API key from Secrets Manager:
Kick off a benchmark against the already-pushed `docent-finance-agent`, with the Docent API key wired in:

In the worker logs (`make logs` or `docker compose logs -f tracker-worker`) note:
- `docent ingestion enabled; collection_id=<uuid>` once at job start
- `docent ingest complete (run id=<uuid>)` after each task, tagged `[request_id=... benchmark_id=... task_id=...]`

Open the collection in Docent and filter by `job_id` to see this run's tasks:
https://docent.transluce.org/dashboard/ec6505b0-fa73-4d8d-9970-a2931e108142/agent_run
Each AgentRun's metadata has `job_id`, `task_id`, `benchmark`, `agent`, `eval_result`, and `ingested_at` — so you can isolate a specific run by job_id, see the question (message 0), the model's full transcript, and the grading attached.

Agents that don't set `ingest_cmd` are unaffected — no contract change, no runtime cost.

Type of Change
Testing
- `services/tracker/tests/unit/test_docent_ingest.py` covers the happy path, script non-zero exit, missing output file, invalid JSON, schema mismatch, and upload failure
- End to end: ran a job against an agent that sets `ingest_cmd`; confirmed `docent ingestion enabled; collection_id=...` at job start and `docent ingest complete (run id=...)` per task in the worker logs, and that all 10 runs appeared in the `finance` Docent collection with tracker-merged metadata (`job_id`, `task_id`, `benchmark`, `agent`, `eval_result`, `ingested_at`)
Follow-ups
Enabling Docent ingestion for an agent requires setting `ingest_cmd` in its contract. Model-library-based agents can use `model-library-ingest` directly (from vals-ai/model-proxy#709). Other agents need to ship a script that converts their output into a `docent.data_models.AgentRun` JSON and writes it to `$VALKYRIE_AGENT_RUN_OUT_PATH`.