
feat: add docent ingestion #309

Open

oalmatov wants to merge 16 commits into dev from oa/docent-ingestion

Conversation

Contributor

@oalmatov oalmatov commented Apr 22, 2026

Summary

Add optional per-task ingestion of agent runs into Transluce Docent. When an agent contract sets ingest_cmd, the tracker runs that script inside the same Daytona sandbox that just ran the agent (after evaluate_instance returns), validates the produced AgentRun JSON, merges tracker-owned metadata, and uploads to a per-benchmark Docent collection. Ingestion failures are non-fatal — the task reaches FINISHED regardless.

Related PRs

  • vals-ai/model-proxy#709 — ships model-library-ingest, the default ingest CLI that model-library-based agents can use verbatim in their ingest_cmd. No per-agent Python needed; the CLI reads Agent.run's on-disk output and produces the AgentRun JSON this tracker expects.
  • Add valkyrie contract finance-agent#45 — reference agent (docent-finance-agent) wiring the contract field + model-library-ingest together. Already pushed to Valkyrie; reviewers don't need to push anything.

Changes

  • AgentContract.ingest_cmd: str | None — new optional contract YAML field
  • HarnessConfig.docent_api_key_secret_name: str | None — new tracker-level config; the CLI forwards it as X-Harness-Docent-Api-Key-Secret-Name
  • services/tracker/src/tracker/docent_ingest.py — new module: resolve_docent_job_context, run_ingest_script, merge_tracker_metadata, upload_agent_run, ingest_task (all failures caught and logged, never raised upward)
  • process_benchmark resolves the Docent API key and finds-or-creates the per-benchmark collection at job start; process_task runs ingest_task after the EvaluationResult row is committed
  • Docs: docs/CONTRACTS.md (new field described with env-var interface)
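The metadata-merge step listed above can be sketched roughly as follows. This is a simplified sketch, not the real implementation: only the function name and the tracker-owned keys come from this PR description, and the real module operates on a docent AgentRun rather than a plain dict.

```python
from datetime import datetime, timezone

# Keys the tracker always controls; any script-set values for these are overwritten.
TRACKER_OWNED_KEYS = ("job_id", "task_id", "benchmark", "agent", "eval_result", "ingested_at")

def merge_tracker_metadata(script_metadata: dict, *, job_id: str, task_id: str,
                           benchmark: str, agent: str, eval_result: dict) -> dict:
    """Merge tracker-owned metadata onto what the ingest script produced.

    Keys the script set are preserved; tracker-owned keys always win.
    """
    merged = dict(script_metadata)  # keep script-provided keys
    merged.update(
        job_id=job_id,
        task_id=task_id,
        benchmark=benchmark,
        agent=agent,
        eval_result=eval_result,
        ingested_at=datetime.now(timezone.utc).isoformat(),
    )
    return merged
```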

How to test locally

  1. Start the tracker locally with AWS creds in your shell so it can fetch the Docent API key from Secrets Manager:

    make tracker-service
  2. Kick off a benchmark against the already-pushed docent-finance-agent, with the Docent API key wired in:

    valkyrie benchmark run --agent docent-finance-agent --benchmark fab \
      --docent-api-key-secret-name devEvalInfraDocentKey
  3. In the worker logs (make logs or docker compose logs -f tracker-worker) note:

    • docent ingestion enabled; collection_id=<uuid> once at job start
    • docent ingest complete (run id=<uuid>) after each task
    • The job_id printed at the top of every log line in [request_id=... benchmark_id=... task_id=...]
  4. Open the collection in Docent and filter by job_id to see this run's tasks:
    https://docent.transluce.org/dashboard/ec6505b0-fa73-4d8d-9970-a2931e108142/agent_run

    Each AgentRun's metadata has job_id, task_id, benchmark, agent, eval_result, ingested_at — so you can isolate a specific run by job_id, see the question (message 0), the model's full transcript, and the grading attached.

Agents that don't set ingest_cmd are unaffected — no contract change, no runtime cost.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Refactor
  • Docs / config

Testing

  • Added/updated unit tests — services/tracker/tests/unit/test_docent_ingest.py covers happy path, script non-zero exit, missing output file, invalid JSON, schema mismatch, and upload failure
  • Manually tested — ran the finance-agent against a 10-task slice on the dev tracker with a working ingest_cmd; confirmed docent ingestion enabled; collection_id=... at job start, docent ingest complete (run id=...) per task in worker logs, and that all 10 runs appeared in the finance Docent collection with tracker-merged metadata (job_id, task_id, benchmark, agent, eval_result, ingested_at)

Checklist

  • Self-reviewed the diff
  • No debug/dead code left in
  • Docs updated if needed

Follow-ups

Enabling Docent ingestion for an agent requires setting ingest_cmd in its contract. Model-library-based agents can use model-library-ingest directly (from vals-ai/model-proxy#709). Other agents need to ship a script that converts their output into a docent.data_models.AgentRun JSON and writes it to $VALKYRIE_AGENT_RUN_OUT_PATH.
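A minimal ingest script for a non-model-library agent might look like the sketch below. The AgentRun payload here is deliberately simplified and is an assumption; a real script should construct a docent.data_models.AgentRun and serialize that, since the tracker validates the JSON against the AgentRun schema. Only the two environment variable names come from this PR.

```python
import json
import os

def write_agent_run(agent_output_path: str, out_path: str) -> None:
    """Convert the agent's on-disk output into a (simplified) AgentRun JSON.

    Sketch only: shows the env-var plumbing the tracker expects, not the
    real AgentRun shape.
    """
    with open(agent_output_path) as f:
        raw = f.read()
    agent_run = {
        "transcripts": [{"messages": [{"role": "assistant", "content": raw}]}],
        "metadata": {},  # tracker-owned keys are merged in later by the tracker
    }
    with open(out_path, "w") as f:
        json.dump(agent_run, f)

if __name__ == "__main__":
    # The tracker sets both env vars before running ingest_cmd in the sandbox.
    write_agent_run(
        os.environ["VALKYRIE_AGENT_OUTPUT_PATH"],
        os.environ["VALKYRIE_AGENT_RUN_OUT_PATH"],
    )
```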



@assert-app

assert-app Bot commented Apr 22, 2026

Review on Assert →


Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 new potential issue.

⚠️ 1 issue in files not directly in the diff

⚠️ Python contract path (BaseAgentContract.to_request()) does not propagate ingest_cmd (src/valkyrie/contract.py:100-111)

The PR adds ingest_cmd to both AgentContract (YAML schema at src/valkyrie/schemas.py:162) and AgentContractRequest (at services/tracker/src/tracker/database/models.py:79), and the YAML parser correctly forwards it (src/valkyrie/cli/bundler.py:164). However, BaseAgentContract.to_request() at src/valkyrie/contract.py:100-111 was not updated to include ingest_cmd in the constructed AgentContractRequest. This means Python-based contracts (.py contract files) will always produce ingest_cmd=None, silently disabling Docent ingestion for any agent using the Python contract format. The BaseAgentContract class also lacks an ingest_cmd property, so there's no way for a Python contract subclass to even specify the value.

View 14 additional findings in Devin Review.



@oalmatov oalmatov force-pushed the oa/docent-ingestion branch from 2a6c26b to ca21591 on April 24, 2026 23:34
@vals-ai vals-ai deleted a comment from devin-ai-integration Bot Apr 24, 2026
@oalmatov oalmatov force-pushed the oa/docent-ingestion branch from df1acac to 0fd0427 on April 25, 2026 00:17
@oalmatov oalmatov force-pushed the oa/docent-ingestion branch from 77bb5e2 to 13c4f14 on April 27, 2026 17:21
oalmatov added a commit that referenced this pull request Apr 27, 2026
For PR #309 review. Captures the problem, data flow, components,
design decisions (ingest_cmd in contract, sandbox runs script,
tracker uploads, fire-and-forget), failure handling, and open
follow-ups (post-enqueue visibility, rerun dedup, non-model-library
agents, per-job collection knob).

Marked temporary — supersede by docs/CONTRACTS.md once shipped.
@oalmatov oalmatov force-pushed the oa/docent-ingestion branch from 968a4ec to 9364e0a on April 27, 2026 17:48
@oalmatov oalmatov force-pushed the oa/docent-ingestion branch from 9364e0a to 771cc16 on April 27, 2026 17:52
Collaborator

@JarettForzano JarettForzano left a comment
Solid PR: functionality correctly separated and changes documented; I only had light style comments. Could we get this rolling for the terminus2 agent too?

Comment thread services/tracker/src/tracker/utils.py Outdated
Comment thread services/tracker/src/tracker/utils.py Outdated
Comment thread services/tracker/src/tracker/utils.py Outdated
Comment thread services/tracker/src/tracker/utils.py Outdated
Comment thread services/tracker/src/tracker/utils.py Outdated
Comment thread services/tracker/src/tracker/docent_ingest.py
Comment thread services/tracker/src/tracker/docent_ingest.py Outdated
Comment thread services/tracker/src/tracker/docent_ingest.py
Comment thread docs/CONTRACTS.md
Comment thread docs/DESIGN_DOCENT_INGESTION.md Outdated
oalmatov added a commit that referenced this pull request Apr 30, 2026
- Move IngestError from tracker.docent_ingest to tracker.exceptions to match
  the existing pattern (TrackerServiceError subclasses live there).
- Convert TaskIngestContext from @dataclass to pydantic BaseModel with a
  field_validator for agent_output_path that coerces None to ''. Lets the
  call site in process_task pass contract.final_output directly instead of
  inlining the str(...) if/else. DocentJobContext stays a stdlib @dataclass
  because its 'client' field is a live SDK object that pydantic can't
  meaningfully validate.
- Extract the per-job docent resolution out of process_benchmark into
  _resolve_docent_for_job(); call site shrinks from 23 lines to 5.
- Add log_output: Callable | None parameter to ingest_task so per-task
  ingest progress + failure messages are mirrored to the per-task
  CloudWatch stream (not just the worker logger).
- Capture docent failures to Sentry from both resolve_docent_job_context
  and ingest_task's broad-except, so dashboards surface them.
- Fix docstring: rerun flag is --task-ids (Valkyrie), not --rerun-task-ids.
- Drop 'from __future__ import annotations' — not needed; pydantic prefers
  evaluable annotations and we don't have forward references.
- docs: split the Docent ingestion section out of CONTRACTS.md into a
  dedicated docs/DOCENT.md so the integration can grow examples without
  bloating CONTRACTS. CONTRACTS.md now has a one-line pointer under a new
  'Integrations' section.
- Note in get_or_create_collection's docstring that the SDK doesn't expose
  a name-based collection lookup, so we list-and-filter by design.

oalmatov added 10 commits April 29, 2026 21:21
Adds an optional `ingest_cmd` field (deserialized from `ingest_cmd:`
in contract.yaml) so agents can opt in to Docent ingestion. The tracker
runs this command in the sandbox after a task completes.

Plumbs the field through Bundler, the contract DB model, and bumps
contract version metadata.
Adds tracker.docent_ingest, the tracker-side half of the Docent
ingestion pipeline. Responsibilities:

- Resolve a per-job DocentClient + Docent collection (one collection
  per benchmark, get-or-create).
- Run the agent-supplied `ingest_cmd` in the sandbox under a 300s
  timeout, from the agent bundle dir, with two env vars: the agent
  output path and the path the script must write its AgentRun JSON to.
- Read the JSON back via base64 + AgentRun.model_validate; surface
  decode/JSON/schema/exit-code failures as IngestError.
- Merge tracker-owned metadata (job_id, task_id, benchmark, agent,
  eval_result, ingested_at) onto the AgentRun, preserving any other
  metadata the script set; tracker keys always win.
- Upload fire-and-forget. The DOCENT_API_KEY never enters the sandbox.
Calls into tracker.docent_ingest from the benchmark/task pipeline:
`process_benchmark` resolves the DocentJobContext once per benchmark
(skipped silently if no API key); `process_task` calls ingest_task
after the task is committed FINISHED, so any post-finish failure can
never flip the task back to ERROR.
- docent-python is a new tracker runtime dep.
- Relaxes the boto3 pin so docent-python (which transitively pins a
  newer boto3) can coexist; uses a uv override to keep root resolution
  stable.
- Makefile change makes the worker's stdout visible during `make dev`
  so ingest logs surface during local iteration.
Documents the optional `ingest_cmd` contract field, the two env vars
the script receives (`VALKYRIE_AGENT_OUTPUT_PATH`,
`VALKYRIE_AGENT_RUN_OUT_PATH`), and the tracker-owned metadata that
gets merged onto every AgentRun.
For PR #309 review. Captures the problem, data flow, components,
design decisions (ingest_cmd in contract, sandbox runs script,
tracker uploads, fire-and-forget), failure handling, and open
follow-ups (post-enqueue visibility, rerun dedup, non-model-library
agents, per-job collection knob).

Marked temporary — supersede by docs/CONTRACTS.md once shipped.
- Move IngestError from tracker.docent_ingest to tracker.exceptions to match
  the existing pattern (TrackerServiceError subclasses live there).
- Convert TaskIngestContext from @dataclass to pydantic BaseModel with a
  field_validator for agent_output_path that coerces None to ''. Lets the
  call site in process_task pass contract.final_output directly instead of
  inlining the str(...) if/else. DocentJobContext stays a stdlib @dataclass
  because its 'client' field is a live SDK object that pydantic can't
  meaningfully validate.
- Extract the per-job docent resolution out of process_benchmark into
  _resolve_docent_for_job(); call site shrinks from 23 lines to 5.
- Add log_output: Callable | None parameter to ingest_task so per-task
  ingest progress + failure messages are mirrored to the per-task
  CloudWatch stream (not just the worker logger).
- Capture docent failures to Sentry from both resolve_docent_job_context
  and ingest_task's broad-except, so dashboards surface them.
- Fix docstring: rerun flag is --task-ids (Valkyrie), not --rerun-task-ids.
- Drop 'from __future__ import annotations' — not needed; pydantic prefers
  evaluable annotations and we don't have forward references.
- docs: split the Docent ingestion section out of CONTRACTS.md into a
  dedicated docs/DOCENT.md so the integration can grow examples without
  bloating CONTRACTS. CONTRACTS.md now has a one-line pointer under a new
  'Integrations' section.
- Note in get_or_create_collection's docstring that the SDK doesn't expose
  a name-based collection lookup, so we list-and-filter by design.
- Drop the parenthetical '(one paragraph)' header.
- Trim hedge phrasing throughout ('we want / expects it to / etc').
- Drop the false claim that IngestError has subtypes (it's one class).
- Drop the stale '~190 lines' magic number.
- Drop the failure-table row pointing at a removed open question.
- Update intro to 'move to Notion before merge' (per reviewer).
The Valkyrie Docent Ingestion design notes have moved to Notion. Removing
the in-repo design doc and linking to the Notion page from docs/DOCENT.md.

Notion: https://www.notion.so/vals-ai/Valkyrie-Docent-Ingestion-3528a877dd8d80a2a728e86bf70dd265
- Fold test_injects_required_metadata into test_tracker_owned_keys_always_win.
  The two tests overlapped — both confirmed tracker-owned keys land on the
  AgentRun. Combined into one test that checks every required key plus the
  precedence rule.
- Trim the metadata block in TestIngestTask.test_happy_path to a single
  'job_id is set' check. The full merge logic is exercised by
  TestMergeTrackerMetadata; here we only need to confirm the wrapper invokes
  it before uploading.
Two new Devin findings:

1. `except Exception` doesn't catch `asyncio.CancelledError` (BaseException
   in Python 3.12). On worker shutdown the cancellation would propagate past
   ingest_task, sail through process_task's inner `except Exception`, and
   flip a FINISHED task to ERROR via the outer commit_task_error path —
   exactly what the design says must not happen.

2. The `_surface` calls invoking the `log_output` callback sat outside
   the try/except, so a failure inside `log_output` (e.g. asyncio.QueueFull)
   would similarly escape.

Fix: catch `BaseException` for the main pipeline; extract a `_safe_surface`
helper that internally swallows + logs any exception from `log_output`. The
eval result is already committed, so preserving FINISHED state under
cancellation is the canonical behaviour even if it means cancellation
doesn't propagate further from this call.
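The shape of that fix can be sketched as below. This is a sketch with assumed signatures: run_pipeline stands in for the real ingest steps, and the real ingest_task takes a task context rather than a callable.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("tracker.docent_ingest")

def _safe_surface(log_output: Optional[Callable[[str], None]], message: str) -> None:
    """Mirror a message to the per-task stream, swallowing callback failures."""
    if log_output is None:
        return
    try:
        log_output(message)
    except Exception:
        # A broken callback (e.g. a full queue) must not escape ingestion.
        logger.exception("log_output callback failed")

def ingest_task(run_pipeline: Callable[[], None],
                log_output: Optional[Callable[[str], None]] = None) -> None:
    """Fire-and-forget ingestion: nothing may escape and flip a FINISHED task."""
    try:
        run_pipeline()
        _safe_surface(log_output, "docent ingest complete")
    except BaseException:  # also catches asyncio.CancelledError on shutdown
        logger.exception("docent ingest failed")
        _safe_surface(log_output, "docent ingest failed")
```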
Mirror what platform does in p/docent-tracing/backend/app/docent/client.py:
right after `create_collection`, call `share_collection_with_organization`
with the Vals Docent org id at admin permission. Without this every new
benchmark collection (one per benchmark name, on first run) is owned only
by the API key's user — teammates can't see or edit it without manual
sharing in the Docent UI.

Sharing is creation-only: existing collections that we look up by name
already came from a prior run, so they're either already shared (newly
created via this code path) or were created manually before this change
landed and need a one-time backfill — re-sharing every job would be
wasted API calls.
@oalmatov oalmatov force-pushed the oa/docent-ingestion branch from 1f19e5d to e22d2ac on April 30, 2026 05:22
Replaces the hardcoded _VALS_DOCENT_ORG_ID constant in docent_ingest.py
with a per-tracker configuration field, plumbed through the same
X-Harness-* header path as docent_api_key_secret_name.

When `docent_share_with_org_id` is set on HarnessConfig, newly-created
collections are auto-shared with that org at admin permission. When
unset, ingestion still works but collections aren't auto-shared (must
be shared manually in Docent's UI).

- HarnessConfig gains optional docent_share_with_org_id field.
- get_or_create_collection / resolve_docent_job_context take an
  optional share_with_org_id kwarg; sharing is a no-op when None.
- fetch_harness_config (server) and TrackerService (CLI) plumb the
  field through.
- Tests updated: split create-and-share path from create-only.

… API key's user

Drops the docent_share_with_org_id config knob added in 312be07 in favor
of `client.get_my_organizations()` — cleaner, no config wiring needed,
and naturally degrades to a no-op when the key is personal (zero orgs).

- get_or_create_collection: after create_collection, iterate the orgs
  the authenticated user belongs to and share each at admin permission.
- HarnessConfig: remove docent_share_with_org_id; only docent_api_key_secret_name
  is needed.
- CLI / fetch_harness_config: drop the field plumbing.
- Tests: split into 'shares with every org' (multi-org case) and 'no orgs
  → no share calls' (personal key case).
Comment thread docs/DOCENT.md Outdated
Comment thread docs/DOCENT.md
Comment thread docs/DOCENT.md
…AME to ConfigValue

- Remove Notion design notes link
- Replace verbose HarnessConfig paragraph with 'valk config set' example
- Link model-library-ingest docs in the model-library-based agents section
- Add DOCENT_API_KEY_SECRET_NAME to ConfigValue enum so the new command works

Co-Authored-By: omar@vals.ai <omar.almatov@berkeley.edu>
Comment thread docs/DOCENT.md
### Model-library-based agents

Use the stock `model-library-ingest` CLI, which reads `Agent.run`'s on-disk output:
Use the stock [`model-library-ingest`](https://github.com/vals-ai/model-proxy/blob/dev/docs/docent.md#post-hoc-ingestion) CLI, which reads `Agent.run`'s on-disk output:
Collaborator
This is a private repository, you should only be linking public things here

Collaborator

@JarettForzano JarettForzano left a comment
Approved, but resolve the last comment.
