
OpenHands integration: ACP CLI path does not expose stable token/cost metrics; need SDK-runner or equivalent metrics bridge #183

@AmyTao


Summary

Benchflow's current OpenHands integration uses the OpenHands CLI ACP path (openhands acp ...). This works for install, launch, and ACP execution, but it does not reliably expose the token/cost metrics that the older Harbor + OpenHands SDK-runner path provided.

This becomes a problem if Benchflow is expected to replace Harbor in the OpenHands evaluation harness while preserving metrics such as:

  • agent_result.n_input_tokens
  • agent_result.n_cache_tokens
  • agent_result.n_output_tokens
  • agent_result.cost_usd

Current behavior

Benchflow can run OpenHands through ACP successfully, but the ACP path does not provide a stable source of cumulative LLM metrics.

What we can get reliably from the current ACP session:

  • agent name / version from initialize
  • tool calls
  • message / thought / tool trajectory
  • stop reason
  • timing

What we cannot rely on the ACP path to provide today:

  • cumulative prompt tokens
  • cumulative completion tokens
  • cumulative cache tokens
  • cumulative cost in USD

Benchflow now has result schema slots for these metrics, but under the current OpenHands CLI ACP integration they will usually remain null unless some future trajectory source includes usage/cost.
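To illustrate why the slots stay null, here is a minimal sketch of a defensive extractor over an ACP trajectory. The trajectory shape and the "usage" key are assumptions for illustration, not a documented ACP schema or actual Benchflow code:

```python
# Illustrative only: pull usage/cost out of an ACP trajectory dict if a
# future trajectory source ever includes it. The key names are hypothetical.

def extract_agent_result(trajectory: dict) -> dict:
    usage = trajectory.get("usage") or {}
    return {
        "n_input_tokens": usage.get("input_tokens"),
        "n_cache_tokens": usage.get("cache_tokens"),
        "n_output_tokens": usage.get("output_tokens"),
        "cost_usd": usage.get("cost_usd"),
    }

# Under the current CLI ACP integration the trajectory carries no usage
# block, so every field stays None (serialized as null):
print(extract_agent_result({"stop_reason": "end_turn"}))
```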

Why this matters

The older Harbor/OpenHands SDK-runner path could collect these values directly inside the agent process via:

  • llm.metrics.accumulated_token_usage
  • llm.metrics.accumulated_cost

That means Harbor output could populate richer trial results, while the current Benchflow ACP integration cannot provide equivalent metrics with the same reliability.
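As a sketch of what the SDK-runner side collection looks like: only the two attribute names above (accumulated_token_usage, accumulated_cost) come from the older Harbor runner; the stand-in classes and field names here are mocked for illustration and are not the actual OpenHands SDK types:

```python
from dataclasses import dataclass

# Stand-ins for the SDK's LLM metrics object. Only the attribute names
# accumulated_token_usage / accumulated_cost are taken from the Harbor
# runner; everything else is a hypothetical shape for this sketch.

@dataclass
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int
    cache_read_tokens: int

@dataclass
class Metrics:
    accumulated_token_usage: TokenUsage
    accumulated_cost: float

def collect_agent_result(metrics: Metrics) -> dict:
    # Read cumulative values from the same layer that called the LLM.
    usage = metrics.accumulated_token_usage
    return {
        "n_input_tokens": usage.prompt_tokens,
        "n_cache_tokens": usage.cache_read_tokens,
        "n_output_tokens": usage.completion_tokens,
        "cost_usd": metrics.accumulated_cost,
    }

print(collect_agent_result(Metrics(TokenUsage(1200, 300, 450), 0.0217)))
```

The point of the sketch is that the SDK runner holds these counters in-process, so no lossy serialization boundary sits between the LLM calls and the trial result.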

Expected behavior

Benchflow should have a supported way to run OpenHands that can emit stable metrics comparable to the old Harbor/OpenHands SDK-runner results.

At minimum, OpenHands runs in Benchflow should be able to populate:

"agent_result": {
  "n_input_tokens": ...,
  "n_cache_tokens": ...,
  "n_output_tokens": ...,
  "cost_usd": ...
}

Suggested direction

  • Add an OpenHands SDK-runner execution path in Benchflow and collect metrics directly from the SDK runtime.
  • Alternatively, add an equivalent telemetry/metrics bridge for the CLI ACP path, if the OpenHands CLI can emit cumulative usage/cost in a supported, machine-readable way.

From an accuracy and maintenance perspective, the SDK-runner path appears preferable, since it reads metrics from the same execution layer that is actually calling the LLM.
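One way the two options could converge is a file-based metrics handoff: the agent process (SDK runner or a future CLI bridge) writes cumulative metrics to a well-known file, and Benchflow merges them into the trial result. The file name and schema below are proposals for this sketch, not an existing contract:

```python
import json
import os
import tempfile

# Proposed (hypothetical) handoff file written by the agent process at exit.
METRICS_FILE = "agent_metrics.json"

def write_metrics(run_dir: str, metrics: dict) -> None:
    """Agent side: dump cumulative usage/cost before the process exits."""
    with open(os.path.join(run_dir, METRICS_FILE), "w") as f:
        json.dump(metrics, f)

def read_metrics(run_dir: str):
    """Benchflow side: merge metrics if present, else fall back to null."""
    path = os.path.join(run_dir, METRICS_FILE)
    if not os.path.exists(path):
        return None  # e.g. a plain ACP run that emitted no metrics
    with open(path) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as d:
    write_metrics(d, {"n_input_tokens": 1200, "cost_usd": 0.02})
    print(read_metrics(d))
```

A handoff like this keeps the ACP execution path unchanged while still letting runs that do have an SDK-side writer populate agent_result.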

Notes

  • This issue is specifically about metrics parity, not basic OpenHands ACP functionality.
  • OpenHands CLI ACP works for task execution, but not for reliable token/cost accounting.
  • If Benchflow is meant to replace Harbor inside the OpenHands evaluation harness, this gap blocks lossless migration of result metadata.

Related local findings

  • The current Benchflow OpenHands agent launches via openhands acp --always-approve --override-with-envs.
  • ACP initialize provides agentInfo but not cumulative token/cost usage.
  • The older Harbor/OpenHands SDK runner gathered metrics from llm.metrics.accumulated_token_usage and llm.metrics.accumulated_cost.
  • The Benchflow result schema has been extended locally to include agent_info, agent_result, verifier_result, and exception_info, but the ACP path still lacks a reliable source for OpenHands token/cost values.
