
feat: add docent ingestion #309

Open

oalmatov wants to merge 16 commits into dev from oa/docent-ingestion

Conversation

Contributor

@oalmatov oalmatov commented Apr 22, 2026

Summary

Add optional per-task ingestion of agent runs into Transluce Docent. When an agent contract sets ingest_cmd, the tracker runs that script inside the same Daytona sandbox that just ran the agent (after evaluate_instance returns), validates the produced AgentRun JSON, merges tracker-owned metadata, and uploads to a per-benchmark Docent collection. Ingestion failures are non-fatal — the task reaches FINISHED regardless.

Related PRs

  • vals-ai/model-proxy#709 — ships model-library-ingest, the default ingest CLI that model-library-based agents can use verbatim in their ingest_cmd. No per-agent Python needed; the CLI reads Agent.run's on-disk output and produces the AgentRun JSON this tracker expects.
  • Add valkyrie contract finance-agent#45 — reference agent (docent-finance-agent) wiring the contract field + model-library-ingest together. Already pushed to Valkyrie; reviewers don't need to push anything.

Changes

  • AgentContract.ingest_cmd: str | None — new optional contract YAML field
  • HarnessConfig.docent_api_key_secret_name: str | None — new tracker-level config; the CLI forwards it as X-Harness-Docent-Api-Key-Secret-Name
  • services/tracker/src/tracker/docent_ingest.py — new module: resolve_docent_job_context, run_ingest_script, merge_tracker_metadata, upload_agent_run, ingest_task (all failures caught and logged, never raised upward)
  • process_benchmark resolves the Docent API key and finds-or-creates the per-benchmark collection at job start; process_task runs ingest_task after the EvaluationResult row is committed
  • Docs: docs/CONTRACTS.md (new field described with env-var interface)
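The metadata-merge step listed above can be sketched roughly as follows. This is a simplified sketch, not the real implementation: only the function name and the tracker-owned keys come from this PR description, and the real module operates on a docent AgentRun rather than a plain dict.

```python
from datetime import datetime, timezone

# Keys the tracker always controls; any script-set values for these are overwritten.
TRACKER_OWNED_KEYS = ("job_id", "task_id", "benchmark", "agent", "eval_result", "ingested_at")

def merge_tracker_metadata(script_metadata: dict, *, job_id: str, task_id: str,
                           benchmark: str, agent: str, eval_result: dict) -> dict:
    """Merge tracker-owned metadata onto what the ingest script produced.

    Keys the script set are preserved; tracker-owned keys always win.
    """
    merged = dict(script_metadata)  # keep script-provided keys
    merged.update(
        job_id=job_id,
        task_id=task_id,
        benchmark=benchmark,
        agent=agent,
        eval_result=eval_result,
        ingested_at=datetime.now(timezone.utc).isoformat(),
    )
    return merged
```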

How to test locally

  1. Start the tracker locally with AWS creds in your shell so it can fetch the Docent API key from Secrets Manager:

    make tracker-service
  2. Kick off a benchmark against the already-pushed docent-finance-agent, with the Docent API key wired in:

    valkyrie benchmark run --agent docent-finance-agent --benchmark fab \
      --docent-api-key-secret-name devEvalInfraDocentKey
  3. In the worker logs (make logs or docker compose logs -f tracker-worker) note:

    • docent ingestion enabled; collection_id=<uuid> once at job start
    • docent ingest complete (run id=<uuid>) after each task
    • The job_id printed at the top of every log line in [request_id=... benchmark_id=... task_id=...]
  4. Open the collection in Docent and filter by job_id to see this run's tasks:
    https://docent.transluce.org/dashboard/ec6505b0-fa73-4d8d-9970-a2931e108142/agent_run

    Each AgentRun's metadata has job_id, task_id, benchmark, agent, eval_result, ingested_at — so you can isolate a specific run by job_id, see the question (message 0), the model's full transcript, and the grading attached.

Agents that don't set ingest_cmd are unaffected — no contract change, no runtime cost.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Refactor
  • Docs / config

Testing

  • Added/updated unit tests — services/tracker/tests/unit/test_docent_ingest.py covers happy path, script non-zero exit, missing output file, invalid JSON, schema mismatch, and upload failure
  • Manually tested — ran the finance-agent against a 10-task slice on the dev tracker with a working ingest_cmd; confirmed docent ingestion enabled; collection_id=... at job start, docent ingest complete (run id=...) per task in worker logs, and that all 10 runs appeared in the finance Docent collection with tracker-merged metadata (job_id, task_id, benchmark, agent, eval_result, ingested_at)

Checklist

  • Self-reviewed the diff
  • No debug/dead code left in
  • Docs updated if needed

Follow-ups

Enabling Docent ingestion for an agent requires setting ingest_cmd in its contract. Model-library-based agents can use model-library-ingest directly (from vals-ai/model-proxy#709). Other agents need to ship a script that converts their output into a docent.data_models.AgentRun JSON and writes it to $VALKYRIE_AGENT_RUN_OUT_PATH.
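A minimal ingest script for a non-model-library agent might look like the sketch below. The AgentRun payload here is deliberately simplified and is an assumption; a real script should construct a docent.data_models.AgentRun and serialize that, since the tracker validates the JSON against the AgentRun schema. Only the two environment variable names come from this PR.

```python
import json
import os

def write_agent_run(agent_output_path: str, out_path: str) -> None:
    """Convert the agent's on-disk output into a (simplified) AgentRun JSON.

    Sketch only: shows the env-var plumbing the tracker expects, not the
    real AgentRun shape.
    """
    with open(agent_output_path) as f:
        raw = f.read()
    agent_run = {
        "transcripts": [{"messages": [{"role": "assistant", "content": raw}]}],
        "metadata": {},  # tracker-owned keys are merged in later by the tracker
    }
    with open(out_path, "w") as f:
        json.dump(agent_run, f)

if __name__ == "__main__":
    # The tracker sets both env vars before running ingest_cmd in the sandbox.
    write_agent_run(
        os.environ["VALKYRIE_AGENT_OUTPUT_PATH"],
        os.environ["VALKYRIE_AGENT_RUN_OUT_PATH"],
    )
```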



@assert-app

assert-app Bot commented Apr 22, 2026

Review on Assert →


Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 new potential issue.

⚠️ 1 issue in files not directly in the diff

⚠️ Python contract path (BaseAgentContract.to_request()) does not propagate ingest_cmd (src/valkyrie/contract.py:100-111)

The PR adds ingest_cmd to both AgentContract (YAML schema at src/valkyrie/schemas.py:162) and AgentContractRequest (at services/tracker/src/tracker/database/models.py:79), and the YAML parser correctly forwards it (src/valkyrie/cli/bundler.py:164). However, BaseAgentContract.to_request() at src/valkyrie/contract.py:100-111 was not updated to include ingest_cmd in the constructed AgentContractRequest. This means Python-based contracts (.py contract files) will always produce ingest_cmd=None, silently disabling Docent ingestion for any agent using the Python contract format. The BaseAgentContract class also lacks an ingest_cmd property, so there's no way for a Python contract subclass to even specify the value.

View 14 additional findings in Devin Review.



@oalmatov oalmatov force-pushed the oa/docent-ingestion branch from 2a6c26b to ca21591 on April 24, 2026 23:34
@vals-ai vals-ai deleted a comment from devin-ai-integration Bot Apr 24, 2026
@oalmatov oalmatov force-pushed the oa/docent-ingestion branch from df1acac to 0fd0427 on April 25, 2026 00:17
@oalmatov oalmatov force-pushed the oa/docent-ingestion branch from 77bb5e2 to 13c4f14 on April 27, 2026 17:21
oalmatov added a commit that referenced this pull request Apr 27, 2026
For PR #309 review. Captures the problem, data flow, components,
design decisions (ingest_cmd in contract, sandbox runs script,
tracker uploads, fire-and-forget), failure handling, and open
follow-ups (post-enqueue visibility, rerun dedup, non-model-library
agents, per-job collection knob).

Marked temporary — supersede by docs/CONTRACTS.md once shipped.
@oalmatov oalmatov force-pushed the oa/docent-ingestion branch from 968a4ec to 9364e0a on April 27, 2026 17:48
@oalmatov oalmatov force-pushed the oa/docent-ingestion branch from 9364e0a to 771cc16 on April 27, 2026 17:52
Collaborator

@JarettForzano JarettForzano left a comment
Solid PR: functionality correctly separated and changes documented; I only had light style comments. Could we get this rolling for the terminus2 agent too?

Comment thread services/tracker/src/tracker/utils.py Outdated
Comment thread services/tracker/src/tracker/utils.py Outdated
Comment thread services/tracker/src/tracker/utils.py Outdated
Comment thread services/tracker/src/tracker/utils.py Outdated
Comment thread services/tracker/src/tracker/utils.py Outdated
Comment thread services/tracker/src/tracker/docent_ingest.py
Comment thread services/tracker/src/tracker/docent_ingest.py Outdated
Comment thread services/tracker/src/tracker/docent_ingest.py
Comment thread docs/CONTRACTS.md
Comment thread docs/DESIGN_DOCENT_INGESTION.md Outdated
oalmatov added a commit that referenced this pull request Apr 30, 2026
- Move IngestError from tracker.docent_ingest to tracker.exceptions to match
  the existing pattern (TrackerServiceError subclasses live there).
- Convert TaskIngestContext from @dataclass to pydantic BaseModel with a
  field_validator for agent_output_path that coerces None to ''. Lets the
  call site in process_task pass contract.final_output directly instead of
  inlining the str(...) if/else. DocentJobContext stays a stdlib @dataclass
  because its 'client' field is a live SDK object that pydantic can't
  meaningfully validate.
- Extract the per-job docent resolution out of process_benchmark into
  _resolve_docent_for_job(); call site shrinks from 23 lines to 5.
- Add log_output: Callable | None parameter to ingest_task so per-task
  ingest progress + failure messages are mirrored to the per-task
  CloudWatch stream (not just the worker logger).
- Capture docent failures to Sentry from both resolve_docent_job_context
  and ingest_task's broad-except, so dashboards surface them.
- Fix docstring: rerun flag is --task-ids (Valkyrie), not --rerun-task-ids.
- Drop 'from __future__ import annotations' — not needed; pydantic prefers
  evaluable annotations and we don't have forward references.
- docs: split the Docent ingestion section out of CONTRACTS.md into a
  dedicated docs/DOCENT.md so the integration can grow examples without
  bloating CONTRACTS. CONTRACTS.md now has a one-line pointer under a new
  'Integrations' section.
- Note in get_or_create_collection's docstring that the SDK doesn't expose
  a name-based collection lookup, so we list-and-filter by design.

oalmatov added 10 commits April 29, 2026 21:21
Adds an optional `ingest_cmd` field (deserialized from `ingest_cmd:`
in contract.yaml) so agents can opt in to Docent ingestion. The tracker
runs this command in the sandbox after a task completes.

Plumbs the field through Bundler, the contract DB model, and bumps
contract version metadata.
Adds tracker.docent_ingest, the tracker-side half of the Docent
ingestion pipeline. Responsibilities:

- Resolve a per-job DocentClient + Docent collection (one collection
  per benchmark, get-or-create).
- Run the agent-supplied `ingest_cmd` in the sandbox under a 300s
  timeout, from the agent bundle dir, with two env vars: the agent
  output path and the path the script must write its AgentRun JSON to.
- Read the JSON back via base64 + AgentRun.model_validate; surface
  decode/JSON/schema/exit-code failures as IngestError.
- Merge tracker-owned metadata (job_id, task_id, benchmark, agent,
  eval_result, ingested_at) onto the AgentRun, preserving any other
  metadata the script set; tracker keys always win.
- Upload fire-and-forget. The DOCENT_API_KEY never enters the sandbox.
Calls into tracker.docent_ingest from the benchmark/task pipeline:
`process_benchmark` resolves the DocentJobContext once per benchmark
(skipped silently if no API key); `process_task` calls ingest_task
after the task is committed FINISHED, so any post-finish failure can
never flip the task back to ERROR.
- docent-python is a new tracker runtime dep.
- Relaxes the boto3 pin so docent-python (which transitively pins a
  newer boto3) can coexist; uses a uv override to keep root resolution
  stable.
- Makefile change makes the worker's stdout visible during `make dev`
  so ingest logs surface during local iteration.
Documents the optional `ingest_cmd` contract field, the two env vars
the script receives (`VALKYRIE_AGENT_OUTPUT_PATH`,
`VALKYRIE_AGENT_RUN_OUT_PATH`), and the tracker-owned metadata that
gets merged onto every AgentRun.
For PR #309 review. Captures the problem, data flow, components,
design decisions (ingest_cmd in contract, sandbox runs script,
tracker uploads, fire-and-forget), failure handling, and open
follow-ups (post-enqueue visibility, rerun dedup, non-model-library
agents, per-job collection knob).

Marked temporary — supersede by docs/CONTRACTS.md once shipped.
- Move IngestError from tracker.docent_ingest to tracker.exceptions to match
  the existing pattern (TrackerServiceError subclasses live there).
- Convert TaskIngestContext from @dataclass to pydantic BaseModel with a
  field_validator for agent_output_path that coerces None to ''. Lets the
  call site in process_task pass contract.final_output directly instead of
  inlining the str(...) if/else. DocentJobContext stays a stdlib @dataclass
  because its 'client' field is a live SDK object that pydantic can't
  meaningfully validate.
- Extract the per-job docent resolution out of process_benchmark into
  _resolve_docent_for_job(); call site shrinks from 23 lines to 5.
- Add log_output: Callable | None parameter to ingest_task so per-task
  ingest progress + failure messages are mirrored to the per-task
  CloudWatch stream (not just the worker logger).
- Capture docent failures to Sentry from both resolve_docent_job_context
  and ingest_task's broad-except, so dashboards surface them.
- Fix docstring: rerun flag is --task-ids (Valkyrie), not --rerun-task-ids.
- Drop 'from __future__ import annotations' — not needed; pydantic prefers
  evaluable annotations and we don't have forward references.
- docs: split the Docent ingestion section out of CONTRACTS.md into a
  dedicated docs/DOCENT.md so the integration can grow examples without
  bloating CONTRACTS. CONTRACTS.md now has a one-line pointer under a new
  'Integrations' section.
- Note in get_or_create_collection's docstring that the SDK doesn't expose
  a name-based collection lookup, so we list-and-filter by design.
- Drop the parenthetical '(one paragraph)' header.
- Trim hedge phrasing throughout ('we want / expects it to / etc').
- Drop the false claim that IngestError has subtypes (it's one class).
- Drop the stale '~190 lines' magic number.
- Drop the failure-table row pointing at a removed open question.
- Update intro to 'move to Notion before merge' (per reviewer).
The Valkyrie Docent Ingestion design notes have moved to Notion. Removing
the in-repo design doc and linking to the Notion page from docs/DOCENT.md.

Notion: https://www.notion.so/vals-ai/Valkyrie-Docent-Ingestion-3528a877dd8d80a2a728e86bf70dd265
- Fold test_injects_required_metadata into test_tracker_owned_keys_always_win.
  The two tests overlapped — both confirmed tracker-owned keys land on the
  AgentRun. Combined into one test that checks every required key plus the
  precedence rule.
- Trim the metadata block in TestIngestTask.test_happy_path to a single
  'job_id is set' check. The full merge logic is exercised by
  TestMergeTrackerMetadata; here we only need to confirm the wrapper invokes
  it before uploading.
Two new Devin findings:

1. `except Exception` doesn't catch `asyncio.CancelledError` (BaseException
   in Python 3.12). On worker shutdown the cancellation would propagate past
   ingest_task, sail through process_task's inner `except Exception`, and
   flip a FINISHED task to ERROR via the outer commit_task_error path —
   exactly what the design says must not happen.

2. The `_surface` calls invoking the `log_output` callback sat outside
   the try/except, so a failure inside `log_output` (e.g. asyncio.QueueFull)
   would similarly escape.

Fix: catch `BaseException` for the main pipeline; extract a `_safe_surface`
helper that internally swallows + logs any exception from `log_output`. The
eval result is already committed, so preserving FINISHED state under
cancellation is the canonical behaviour even if it means cancellation
doesn't propagate further from this call.
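The shape of that fix can be sketched as below. This is a sketch with assumed signatures: run_pipeline stands in for the real ingest steps, and the real ingest_task takes a task context rather than a callable.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("tracker.docent_ingest")

def _safe_surface(log_output: Optional[Callable[[str], None]], message: str) -> None:
    """Mirror a message to the per-task stream, swallowing callback failures."""
    if log_output is None:
        return
    try:
        log_output(message)
    except Exception:
        # A broken callback (e.g. a full queue) must not escape ingestion.
        logger.exception("log_output callback failed")

def ingest_task(run_pipeline: Callable[[], None],
                log_output: Optional[Callable[[str], None]] = None) -> None:
    """Fire-and-forget ingestion: nothing may escape and flip a FINISHED task."""
    try:
        run_pipeline()
        _safe_surface(log_output, "docent ingest complete")
    except BaseException:  # also catches asyncio.CancelledError on shutdown
        logger.exception("docent ingest failed")
        _safe_surface(log_output, "docent ingest failed")
```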
Mirror what platform does in p/docent-tracing/backend/app/docent/client.py:
right after `create_collection`, call `share_collection_with_organization`
with the Vals Docent org id at admin permission. Without this every new
benchmark collection (one per benchmark name, on first run) is owned only
by the API key's user — teammates can't see or edit it without manual
sharing in the Docent UI.

Sharing is creation-only: existing collections that we look up by name
already came from a prior run, so they're either already shared (newly
created via this code path) or were created manually before this change
landed and need a one-time backfill — re-sharing every job would be
wasted API calls.
@oalmatov oalmatov force-pushed the oa/docent-ingestion branch from 1f19e5d to e22d2ac on April 30, 2026 05:22
Replaces the hardcoded _VALS_DOCENT_ORG_ID constant in docent_ingest.py
with a per-tracker configuration field, plumbed through the same
X-Harness-* header path as docent_api_key_secret_name.

When `docent_share_with_org_id` is set on HarnessConfig, newly-created
collections are auto-shared with that org at admin permission. When
unset, ingestion still works but collections aren't auto-shared (must
be shared manually in Docent's UI).

- HarnessConfig gains optional docent_share_with_org_id field.
- get_or_create_collection / resolve_docent_job_context take an
  optional share_with_org_id kwarg; sharing is a no-op when None.
- fetch_harness_config (server) and TrackerService (CLI) plumb the
  field through.
- Tests updated: split create-and-share path from create-only.

… API key's user

Drops the docent_share_with_org_id config knob added in 312be07 in favor
of `client.get_my_organizations()` — cleaner, no config wiring needed,
and naturally degrades to a no-op when the key is personal (zero orgs).

- get_or_create_collection: after create_collection, iterate the orgs
  the authenticated user belongs to and share each at admin permission.
- HarnessConfig: remove docent_share_with_org_id; only docent_api_key_secret_name
  is needed.
- CLI / fetch_harness_config: drop the field plumbing.
- Tests: split into 'shares with every org' (multi-org case) and 'no orgs
  → no share calls' (personal key case).
Comment thread docs/DOCENT.md Outdated
Comment thread docs/DOCENT.md
Comment thread docs/DOCENT.md
…AME to ConfigValue

- Remove Notion design notes link
- Replace verbose HarnessConfig paragraph with 'valk config set' example
- Link model-library-ingest docs in the model-library-based agents section
- Add DOCENT_API_KEY_SECRET_NAME to ConfigValue enum so the new command works

Co-Authored-By: omar@vals.ai <omar.almatov@berkeley.edu>
Comment thread docs/DOCENT.md
### Model-library-based agents

Use the stock `model-library-ingest` CLI, which reads `Agent.run`'s on-disk output:
Use the stock [`model-library-ingest`](https://github.com/vals-ai/model-proxy/blob/dev/docs/docent.md#post-hoc-ingestion) CLI, which reads `Agent.run`'s on-disk output:
Collaborator
This is a private repository, you should only be linking public things here

Collaborator

@JarettForzano JarettForzano left a comment
Approved, but resolve the last comment.
