
feat: capture per-trial LLM token usage via TrajectoryProxy #189

@EYH0602

Summary

Wire up the existing TrajectoryProxy (benchflow/trajectories/proxy.py) so that every trial automatically captures its LLM token usage (input/output tokens) from API responses. Today, acp_trajectory.jsonl records ACP session events (tool calls, messages, thoughts) but no API-level usage data.

Motivation

  • Cost tracking: users need to know how many tokens each trial consumed to estimate experiment costs.
  • Infrastructure already exists: TrajectoryProxy handles both streaming SSE and regular responses, reconstructs usage for both Anthropic and OpenAI formats, and exposes total_input_tokens / total_output_tokens properties on the Trajectory model.
  • Downstream consumers expect it: analysis scripts already read agent_result.n_input_tokens / n_output_tokens from result.json — they're just never populated.

Proposed Implementation

1. Start the proxy in Trial.connect() / Trial.connect_as()

Before the agent launches, spin up a TrajectoryProxy targeting the real LLM endpoint:

```python
from benchflow.trajectories.proxy import TrajectoryProxy

proxy = TrajectoryProxy(target=real_api_base, session_id=self._trial_name)
await proxy.start()
```

2. Route agent LLM traffic through the proxy

Inject the proxy URL into the agent's environment so all LLM calls go through it:

```python
agent_env["OPENAI_BASE_URL"] = proxy.base_url      # for OpenAI-compatible agents
agent_env["ANTHROPIC_BASE_URL"] = proxy.base_url   # for Anthropic agents
```

3. Stop the proxy and collect usage in Trial.disconnect()

```python
await proxy.stop()
traj = proxy.trajectory
self._n_input_tokens += traj.total_input_tokens
self._n_output_tokens += traj.total_output_tokens
```

4. Write token counts into result.json

Add n_input_tokens and n_output_tokens to the result dict in SDK._build_result().
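A sketch of what that could look like — `build_result` and the surrounding keys are illustrative, not the real `SDK._build_result()` schema:

```python
from types import SimpleNamespace

def build_result(trial) -> dict:
    # Hypothetical sketch: the surrounding keys and Trial attributes
    # are illustrative, not the actual SDK schema.
    return {
        "trial_name": trial.name,
        "status": trial.status,
        # New optional fields: default to 0 so consumers always see
        # integers, even when the proxy captured nothing.
        "n_input_tokens": getattr(trial, "_n_input_tokens", 0),
        "n_output_tokens": getattr(trial, "_n_output_tokens", 0),
    }

trial = SimpleNamespace(name="trial-0", status="completed",
                        _n_input_tokens=1200, _n_output_tokens=340)
print(build_result(trial)["n_input_tokens"])  # 1200
```

Defaulting to 0 (rather than omitting the keys) keeps the change backward-compatible for consumers that index the fields unconditionally.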

5. (Optional) Save raw LLM trajectory

Write llm_trajectory.jsonl alongside acp_trajectory.jsonl for detailed per-exchange analysis:

```
trajectory/
├── acp_trajectory.jsonl      # ACP session events (existing)
└── llm_trajectory.jsonl      # Raw LLM API exchanges with usage (new)
```
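For downstream analysis, summing the per-exchange usage back out of llm_trajectory.jsonl is straightforward. A sketch that assumes each record carries a "usage" dict — the actual schema is whatever TrajectoryProxy writes:

```python
import json

# Hypothetical llm_trajectory.jsonl records with an assumed "usage" shape.
lines = [
    '{"model": "gpt-4o", "usage": {"input_tokens": 120, "output_tokens": 34}}',
    '{"model": "gpt-4o", "usage": {"input_tokens": 98, "output_tokens": 51}}',
]

def sum_usage(jsonl_lines):
    """Sum input/output tokens across raw LLM exchanges."""
    totals = {"input_tokens": 0, "output_tokens": 0}
    for line in jsonl_lines:
        usage = json.loads(line).get("usage") or {}
        totals["input_tokens"] += usage.get("input_tokens", 0)
        totals["output_tokens"] += usage.get("output_tokens", 0)
    return totals

print(sum_usage(lines))  # {'input_tokens': 218, 'output_tokens': 85}
```

This also gives a cheap cross-check for the acceptance criterion that the result.json counts match the sum of the raw usage fields.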

Key Properties

  • Zero adapter changes — all agents benefit automatically since the proxy sits between the agent and the LLM API.
  • Provider-agnostic — both OpenAI and Anthropic streaming/non-streaming formats are already handled by _reconstruct_response().
  • Backward-compatible — existing result.json gains two new optional fields; no schema break.

Acceptance Criteria

  • result.json includes n_input_tokens and n_output_tokens for every completed trial
  • Token counts match the sum of usage fields from actual API responses
  • Works for both OpenAI (gpt-*) and Anthropic (claude-*) models
  • Proxy adds < 50ms overhead per API call
  • summary.json aggregates total tokens across all trials in a job
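The summary.json aggregation in the last criterion can be a simple sum over the per-trial result dicts, treating results from before this change (which lack the fields) as zero. A hedged sketch — `aggregate_tokens` is hypothetical:

```python
def aggregate_tokens(trial_results):
    """Sum per-trial token counts for summary.json; results written
    before this change lack the fields and count as zero."""
    return {
        "total_input_tokens": sum(r.get("n_input_tokens", 0) for r in trial_results),
        "total_output_tokens": sum(r.get("n_output_tokens", 0) for r in trial_results),
    }

results = [
    {"n_input_tokens": 1200, "n_output_tokens": 340},
    {"n_input_tokens": 980, "n_output_tokens": 510},
    {},  # trial from an older run: no usage fields
]
print(aggregate_tokens(results))
# {'total_input_tokens': 2180, 'total_output_tokens': 850}
```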
