Summary
Wire up the existing `TrajectoryProxy` (`benchflow/trajectories/proxy.py`) so every trial automatically captures per-trial LLM token usage (input/output tokens) from API responses. Today, `acp_trajectory.jsonl` records ACP session events (tool calls, messages, thoughts) but no API-level usage data.
Motivation
- Cost tracking: users need to know how many tokens each trial consumed to estimate experiment costs.
- Infrastructure already exists: `TrajectoryProxy` handles both streaming SSE and regular responses, reconstructs `usage` for both Anthropic and OpenAI formats, and exposes `total_input_tokens` / `total_output_tokens` properties on the `Trajectory` model.
- Downstream consumers expect it: analysis scripts already read `agent_result.n_input_tokens` / `n_output_tokens` from `result.json` — they're just never populated.
Proposed Implementation
1. Start the proxy in `Trial.connect()` / `Trial.connect_as()`
Before the agent launches, spin up a `TrajectoryProxy` targeting the real LLM endpoint:
```python
from benchflow.trajectories.proxy import TrajectoryProxy

proxy = TrajectoryProxy(target=real_api_base, session_id=self._trial_name)
await proxy.start()
```
2. Route agent LLM traffic through the proxy
Inject the proxy URL into the agent's environment so all LLM calls go through it:
```python
agent_env["OPENAI_BASE_URL"] = proxy.base_url     # for OpenAI-compatible agents
agent_env["ANTHROPIC_BASE_URL"] = proxy.base_url  # for Anthropic agents
```
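As a sketch, the routing can be done by copying the parent environment and overriding the base-URL variables before launching the agent process (`build_agent_env` is a hypothetical helper; the exact hook inside `Trial.connect()` may differ):

```python
import os


def build_agent_env(proxy_base_url: str, parent_env=None) -> dict:
    """Copy the parent environment and point LLM clients at the proxy.

    Hypothetical helper -- the real wiring lives wherever Trial launches
    the agent subprocess.
    """
    env = dict(os.environ if parent_env is None else parent_env)
    env["OPENAI_BASE_URL"] = proxy_base_url      # OpenAI-compatible SDKs
    env["ANTHROPIC_BASE_URL"] = proxy_base_url   # Anthropic SDK
    return env


agent_env = build_agent_env("http://127.0.0.1:8899", parent_env={"PATH": "/usr/bin"})
```

Copying rather than mutating `os.environ` keeps the override scoped to the one agent subprocess, so concurrent trials can each point at their own proxy.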
3. Stop the proxy and collect usage in `Trial.disconnect()`
```python
await proxy.stop()
traj = proxy.trajectory
self._n_input_tokens += traj.total_input_tokens
self._n_output_tokens += traj.total_output_tokens
```
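The totals consumed here come from `Trajectory`'s aggregate properties. A minimal model of that behavior (the record shape is an assumption; the real `Trajectory` lives in `benchflow/trajectories`):

```python
from dataclasses import dataclass, field


@dataclass
class Exchange:
    """One proxied LLM API call with its reported usage (assumed shape)."""
    input_tokens: int
    output_tokens: int


@dataclass
class Trajectory:
    """Accumulates exchanges; totals are derived, never stored separately."""
    exchanges: list = field(default_factory=list)

    @property
    def total_input_tokens(self) -> int:
        return sum(e.input_tokens for e in self.exchanges)

    @property
    def total_output_tokens(self) -> int:
        return sum(e.output_tokens for e in self.exchanges)


traj = Trajectory([Exchange(120, 35), Exchange(80, 15)])
```

Deriving totals from the exchange list means the counts can never drift out of sync with the recorded trajectory.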
4. Write token counts into `result.json`
Add `n_input_tokens` and `n_output_tokens` to the result dict in `SDK._build_result()`.
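A sketch of the extension (the method internals are assumed; only the field names come from what the analysis scripts already read):

```python
def build_result(base: dict, trajectory_totals) -> dict:
    """Merge per-trial token totals into the result dict.

    Hypothetical standalone version of the SDK._build_result() change;
    trajectory_totals is an (input, output) pair.
    """
    n_in, n_out = trajectory_totals
    result = dict(base)
    result["n_input_tokens"] = n_in
    result["n_output_tokens"] = n_out
    return result


result = build_result({"status": "completed"}, (200, 50))
```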
5. (Optional) Save raw LLM trajectory
Write `llm_trajectory.jsonl` alongside `acp_trajectory.jsonl` for detailed per-exchange analysis:
```
trajectory/
├── acp_trajectory.jsonl   # ACP session events (existing)
└── llm_trajectory.jsonl   # Raw LLM API exchanges with usage (new)
```
Key Properties
- Zero adapter changes — all agents benefit automatically since the proxy sits between the agent and the LLM API.
- Provider-agnostic — both OpenAI and Anthropic streaming/non-streaming formats are already handled by `_reconstruct_response()`.
- Backward-compatible — existing `result.json` gains two new optional fields; no schema break.
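The provider-agnostic claim rests on normalizing the two `usage` schemas. The field names below are the public ones from the OpenAI and Anthropic response formats; the normalization itself is a sketch of the idea, not the actual code of `_reconstruct_response()`:

```python
def extract_usage(response: dict):
    """Return (input_tokens, output_tokens) from either provider's usage block.

    OpenAI:    usage.prompt_tokens / usage.completion_tokens
    Anthropic: usage.input_tokens / usage.output_tokens
    """
    usage = response.get("usage") or {}
    n_in = usage.get("prompt_tokens", usage.get("input_tokens", 0))
    n_out = usage.get("completion_tokens", usage.get("output_tokens", 0))
    return n_in, n_out


openai_usage = extract_usage({"usage": {"prompt_tokens": 9, "completion_tokens": 12}})
anthropic_usage = extract_usage({"usage": {"input_tokens": 9, "output_tokens": 12}})
```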
Acceptance Criteria
- `result.json` includes `n_input_tokens` and `n_output_tokens` for every completed trial
- Token counts are reconstructed from `usage` fields in actual API responses
- Works with both OpenAI (`gpt-*`) and Anthropic (`claude-*`) models
- `summary.json` aggregates total tokens across all trials in a job
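The job-level aggregation for `summary.json` can be sketched as a fold over trial results (function name hypothetical):

```python
def aggregate_tokens(trial_results: list) -> dict:
    """Sum per-trial token counts; trials without counts contribute zero."""
    return {
        "total_input_tokens": sum(r.get("n_input_tokens", 0) for r in trial_results),
        "total_output_tokens": sum(r.get("n_output_tokens", 0) for r in trial_results),
    }


totals = aggregate_tokens([
    {"n_input_tokens": 200, "n_output_tokens": 50},
    {"n_input_tokens": 100, "n_output_tokens": 25},
    {},  # e.g. a trial that errored before any LLM call
])
```

Defaulting missing fields to zero keeps the aggregation backward-compatible with result files written before this change.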