Summary
Benchflow's current openhands integration uses the OpenHands CLI ACP path (openhands acp ...). This works for install, launch, and ACP execution, but it does not reliably expose the token/cost metrics that the older Harbor + OpenHands SDK runner path provided.
This becomes a problem if Benchflow is expected to replace Harbor in the OpenHands evaluation harness while preserving metrics such as:
- agent_result.n_input_tokens
- agent_result.n_cache_tokens
- agent_result.n_output_tokens
- agent_result.cost_usd
Current behavior
Benchflow can run OpenHands through ACP successfully, but the ACP path does not provide a stable source of cumulative LLM metrics.
What we can get reliably from the current ACP session:
- agent name / version from initialize
- tool calls
- message / thought / tool trajectory
- stop reason
- timing
What we cannot rely on the ACP path to provide today:
- cumulative prompt tokens
- cumulative completion tokens
- cumulative cache tokens
- cumulative cost in USD
Benchflow now has result schema slots for these metrics, but under the current OpenHands CLI ACP integration they will usually remain null unless some future trajectory source includes usage/cost.
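As a sketch of what "remain null" means in practice (the field names are taken from this issue; the dataclass itself is illustrative and is not Benchflow's actual schema code):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative only: mirrors the agent_result fields named in this issue.
# Under the current CLI ACP integration there is no reliable source for
# these values, so they default to None unless a future trajectory
# source supplies usage/cost.
@dataclass
class AgentResult:
    n_input_tokens: Optional[int] = None
    n_cache_tokens: Optional[int] = None
    n_output_tokens: Optional[int] = None
    cost_usd: Optional[float] = None

result = AgentResult()  # an ACP run today: every metric slot stays None
```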
Why this matters
The older Harbor/OpenHands SDK-runner path could collect these values directly inside the agent process via:
- llm.metrics.accumulated_token_usage
- llm.metrics.accumulated_cost
That means Harbor output could populate richer trial results, while the current Benchflow ACP integration cannot provide equivalent metrics with the same reliability.
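The attribute paths above suggest an in-process collection step roughly like the following. This is a hypothetical sketch: the two accumulated-metrics attributes come from this issue, but the field names on the usage object (prompt_tokens, cache_read_tokens, completion_tokens) and the helper itself are assumptions, not verified SDK API.

```python
from types import SimpleNamespace


def collect_agent_metrics(llm) -> dict:
    """Hypothetical sketch: map SDK-accumulated metrics onto the
    agent_result fields this issue asks Benchflow to populate.
    The sub-field names on the usage object are assumptions."""
    usage = llm.metrics.accumulated_token_usage
    return {
        "n_input_tokens": usage.prompt_tokens,
        "n_cache_tokens": usage.cache_read_tokens,
        "n_output_tokens": usage.completion_tokens,
        "cost_usd": llm.metrics.accumulated_cost,
    }


# Stand-in object with invented values, just to exercise the mapping.
fake_llm = SimpleNamespace(
    metrics=SimpleNamespace(
        accumulated_token_usage=SimpleNamespace(
            prompt_tokens=1200, cache_read_tokens=300, completion_tokens=450
        ),
        accumulated_cost=0.0123,
    )
)
metrics = collect_agent_metrics(fake_llm)
```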
Expected behavior
Benchflow should have a supported way to run OpenHands that can emit stable metrics comparable to the old Harbor/OpenHands SDK-runner results.
At minimum, OpenHands runs in Benchflow should be able to populate:
"agent_result": {
"n_input_tokens": ...,
"n_cache_tokens": ...,
"n_output_tokens": ...,
"cost_usd": ...
}
Suggested direction
- Add an OpenHands SDK runner execution path in Benchflow, and collect metrics directly from the SDK runtime.
- Or add an equivalent telemetry/metrics bridge for the CLI ACP path, if OpenHands CLI can emit cumulative usage/cost in a supported machine-readable way.
From an accuracy and maintenance perspective, the SDK-runner path appears preferable, since it reads metrics from the same execution layer that is actually calling the LLM.
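For the bridge alternative, the shape would be roughly the following. Note the heavy caveat: OpenHands CLI does not currently emit such events, and the "usage" JSON-lines event format here is invented purely for illustration of what a supported machine-readable stream could look like.

```python
import json


def accumulate_usage(lines) -> dict:
    """Hypothetical telemetry bridge for the ACP path: fold per-call
    usage events into cumulative agent_result-style totals. The
    'usage' event shape is NOT something the CLI emits today; it is
    an invented example format."""
    totals = {
        "n_input_tokens": 0,
        "n_cache_tokens": 0,
        "n_output_tokens": 0,
        "cost_usd": 0.0,
    }
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip non-JSON log output
        if not isinstance(event, dict) or event.get("type") != "usage":
            continue
        totals["n_input_tokens"] += event.get("input_tokens", 0)
        totals["n_cache_tokens"] += event.get("cache_tokens", 0)
        totals["n_output_tokens"] += event.get("output_tokens", 0)
        totals["cost_usd"] += event.get("cost_usd", 0.0)
    return totals


# Invented sample stream mixing usage events with ordinary log lines.
sample = [
    '{"type": "usage", "input_tokens": 800, "output_tokens": 200, "cost_usd": 0.004}',
    "plain log line",
    '{"type": "usage", "input_tokens": 400, "cache_tokens": 100, "output_tokens": 50, "cost_usd": 0.002}',
]
totals = accumulate_usage(sample)
```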
Notes
- This issue is specifically about metrics parity, not basic OpenHands ACP functionality.
- OpenHands CLI ACP works for task execution, but not for reliable token/cost accounting.
- If Benchflow is meant to replace Harbor inside the OpenHands evaluation harness, this gap blocks lossless migration of result metadata.
Related local findings
- Current Benchflow openhands agent launches via openhands acp --always-approve --override-with-envs
- ACP initialize provides agentInfo but not cumulative token/cost usage
- Older Harbor/OpenHands SDK runner gathered metrics from llm.metrics.accumulated_token_usage and llm.metrics.accumulated_cost
- Benchflow result schema has been extended locally to include agent_info, agent_result, verifier_result, and exception_info, but the ACP path still lacks a reliable source for OpenHands token/cost values