Summary
Wire up the existing `TrajectoryProxy` (`benchflow/trajectories/proxy.py`) so every trial automatically captures per-trial LLM token usage (input/output tokens) from API responses. Today, `acp_trajectory.jsonl` records ACP session events (tool calls, messages, thoughts) but no API-level usage data.
Motivation
- Cost tracking: users need to know how many tokens each trial consumed to estimate experiment costs.
- Infrastructure already exists: `TrajectoryProxy` handles both streaming SSE and regular responses, reconstructs `usage` for both Anthropic and OpenAI formats, and exposes `total_input_tokens` / `total_output_tokens` properties on the `Trajectory` model.
- Downstream consumers expect it: analysis scripts already read `agent_result.n_input_tokens` / `n_output_tokens` from `result.json` — they're just never populated.
Proposed Implementation
1. Start the proxy in `Trial.connect()` / `Trial.connect_as()`
Before the agent launches, spin up a `TrajectoryProxy` targeting the real LLM endpoint:
```python
from benchflow.trajectories.proxy import TrajectoryProxy

proxy = TrajectoryProxy(target=real_api_base, session_id=self._trial_name)
await proxy.start()
```
2. Route agent LLM traffic through the proxy
Inject the proxy URL into the agent's environment so all LLM calls go through it:
```python
agent_env["OPENAI_BASE_URL"] = proxy.base_url     # for OpenAI-compatible agents
agent_env["ANTHROPIC_BASE_URL"] = proxy.base_url  # for Anthropic agents
```
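As a sketch, the routing can be done by copying the parent environment and overriding the base-URL variables before launching the agent process (`build_agent_env` is a hypothetical helper; the exact hook inside `Trial.connect()` may differ):

```python
import os


def build_agent_env(proxy_base_url: str, parent_env=None) -> dict:
    """Copy the parent environment and point LLM clients at the proxy.

    Hypothetical helper -- the real wiring lives wherever Trial launches
    the agent subprocess.
    """
    env = dict(os.environ if parent_env is None else parent_env)
    env["OPENAI_BASE_URL"] = proxy_base_url      # OpenAI-compatible SDKs
    env["ANTHROPIC_BASE_URL"] = proxy_base_url   # Anthropic SDK
    return env


agent_env = build_agent_env("http://127.0.0.1:8899", parent_env={"PATH": "/usr/bin"})
```

Copying rather than mutating `os.environ` keeps the override scoped to the one agent subprocess, so concurrent trials can each point at their own proxy.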
3. Stop the proxy and collect usage in `Trial.disconnect()`
```python
await proxy.stop()
traj = proxy.trajectory
self._n_input_tokens += traj.total_input_tokens
self._n_output_tokens += traj.total_output_tokens
```
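The totals consumed here come from `Trajectory`'s aggregate properties. A minimal model of that behavior (the record shape is an assumption; the real `Trajectory` lives in `benchflow/trajectories`):

```python
from dataclasses import dataclass, field


@dataclass
class Exchange:
    """One proxied LLM API call with its reported usage (assumed shape)."""
    input_tokens: int
    output_tokens: int


@dataclass
class Trajectory:
    """Accumulates exchanges; totals are derived, never stored separately."""
    exchanges: list = field(default_factory=list)

    @property
    def total_input_tokens(self) -> int:
        return sum(e.input_tokens for e in self.exchanges)

    @property
    def total_output_tokens(self) -> int:
        return sum(e.output_tokens for e in self.exchanges)


traj = Trajectory([Exchange(120, 35), Exchange(80, 15)])
```

Deriving totals from the exchange list means the counts can never drift out of sync with the recorded trajectory.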
4. Write token counts into `result.json`
Add `n_input_tokens` and `n_output_tokens` to the result dict in `SDK._build_result()`.
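A sketch of the extension (the method internals are assumed; only the field names come from what the analysis scripts already read):

```python
def build_result(base: dict, trajectory_totals) -> dict:
    """Merge per-trial token totals into the result dict.

    Hypothetical standalone version of the SDK._build_result() change;
    trajectory_totals is an (input, output) pair.
    """
    n_in, n_out = trajectory_totals
    result = dict(base)
    result["n_input_tokens"] = n_in
    result["n_output_tokens"] = n_out
    return result


result = build_result({"status": "completed"}, (200, 50))
```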
5. (Optional) Save raw LLM trajectory
Write `llm_trajectory.jsonl` alongside `acp_trajectory.jsonl` for detailed per-exchange analysis:
```
trajectory/
├── acp_trajectory.jsonl   # ACP session events (existing)
└── llm_trajectory.jsonl   # Raw LLM API exchanges with usage (new)
```
Key Properties
- Zero adapter changes — all agents benefit automatically since the proxy sits between the agent and the LLM API.
- Provider-agnostic — both OpenAI and Anthropic streaming/non-streaming formats are already handled by `_reconstruct_response()`.
- Backward-compatible — existing `result.json` gains two new optional fields; no schema break.
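The provider-agnostic claim rests on normalizing the two `usage` schemas. The field names below are the public ones from the OpenAI and Anthropic response formats; the normalization itself is a sketch of the idea, not the actual code of `_reconstruct_response()`:

```python
def extract_usage(response: dict):
    """Return (input_tokens, output_tokens) from either provider's usage block.

    OpenAI:    usage.prompt_tokens / usage.completion_tokens
    Anthropic: usage.input_tokens / usage.output_tokens
    """
    usage = response.get("usage") or {}
    n_in = usage.get("prompt_tokens", usage.get("input_tokens", 0))
    n_out = usage.get("completion_tokens", usage.get("output_tokens", 0))
    return n_in, n_out


openai_usage = extract_usage({"usage": {"prompt_tokens": 9, "completion_tokens": 12}})
anthropic_usage = extract_usage({"usage": {"input_tokens": 9, "output_tokens": 12}})
```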
Acceptance Criteria
- `result.json` includes `n_input_tokens` and `n_output_tokens` for every completed trial
- Token counts are reconstructed from `usage` fields in actual API responses
- Works with both OpenAI (`gpt-*`) and Anthropic (`claude-*`) models
- `summary.json` aggregates total tokens across all trials in a job
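The job-level aggregation for `summary.json` can be sketched as a fold over trial results (function name hypothetical):

```python
def aggregate_tokens(trial_results: list) -> dict:
    """Sum per-trial token counts; trials without counts contribute zero."""
    return {
        "total_input_tokens": sum(r.get("n_input_tokens", 0) for r in trial_results),
        "total_output_tokens": sum(r.get("n_output_tokens", 0) for r in trial_results),
    }


totals = aggregate_tokens([
    {"n_input_tokens": 200, "n_output_tokens": 50},
    {"n_input_tokens": 100, "n_output_tokens": 25},
    {},  # e.g. a trial that errored before any LLM call
])
```

Defaulting missing fields to zero keeps the aggregation backward-compatible with result files written before this change.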