Summary
Benchflow's current openhands integration uses the OpenHands CLI ACP path (openhands acp ...). This works for install, launch, and ACP execution, but it does not reliably expose the token/cost metrics that the older Harbor + OpenHands SDK runner path provided.
This becomes a problem if Benchflow is expected to replace Harbor in the OpenHands evaluation harness while preserving metrics such as:
- agent_result.n_input_tokens
- agent_result.n_cache_tokens
- agent_result.n_output_tokens
- agent_result.cost_usd
Current behavior
Benchflow can run OpenHands through ACP successfully, but the ACP path does not provide a stable source of cumulative LLM metrics.
What we can get reliably from the current ACP session:
- agent name / version from initialize
- tool calls
- message / thought / tool trajectory
- stop reason
- timing
What we cannot rely on the ACP path to provide today:
- cumulative prompt tokens
- cumulative completion tokens
- cumulative cache tokens
- cumulative cost in USD
Benchflow now has result schema slots for these metrics, but under the current OpenHands CLI ACP integration they will usually remain null unless some future trajectory source includes usage/cost.
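As a sketch of what "remain null" means in practice (the field names are taken from this issue; the dataclass itself is illustrative and is not Benchflow's actual schema code):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative only: mirrors the agent_result fields named in this issue.
# Under the current CLI ACP integration there is no reliable source for
# these values, so they default to None unless a future trajectory
# source supplies usage/cost.
@dataclass
class AgentResult:
    n_input_tokens: Optional[int] = None
    n_cache_tokens: Optional[int] = None
    n_output_tokens: Optional[int] = None
    cost_usd: Optional[float] = None

result = AgentResult()  # an ACP run today: every metric slot stays None
```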
Why this matters
The older Harbor/OpenHands SDK-runner path could collect these values directly inside the agent process via:
- llm.metrics.accumulated_token_usage
- llm.metrics.accumulated_cost
That means Harbor output could populate richer trial results, while the current Benchflow ACP integration cannot provide equivalent metrics with the same reliability.
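The attribute paths above suggest an in-process collection step roughly like the following. This is a hypothetical sketch: the two accumulated-metrics attributes come from this issue, but the field names on the usage object (prompt_tokens, cache_read_tokens, completion_tokens) and the helper itself are assumptions, not verified SDK API.

```python
from types import SimpleNamespace


def collect_agent_metrics(llm) -> dict:
    """Hypothetical sketch: map SDK-accumulated metrics onto the
    agent_result fields this issue asks Benchflow to populate.
    The sub-field names on the usage object are assumptions."""
    usage = llm.metrics.accumulated_token_usage
    return {
        "n_input_tokens": usage.prompt_tokens,
        "n_cache_tokens": usage.cache_read_tokens,
        "n_output_tokens": usage.completion_tokens,
        "cost_usd": llm.metrics.accumulated_cost,
    }


# Stand-in object with invented values, just to exercise the mapping.
fake_llm = SimpleNamespace(
    metrics=SimpleNamespace(
        accumulated_token_usage=SimpleNamespace(
            prompt_tokens=1200, cache_read_tokens=300, completion_tokens=450
        ),
        accumulated_cost=0.0123,
    )
)
metrics = collect_agent_metrics(fake_llm)
```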
Expected behavior
Benchflow should have a supported way to run OpenHands that can emit stable metrics comparable to the old Harbor/OpenHands SDK-runner results.
At minimum, OpenHands runs in Benchflow should be able to populate:
"agent_result": {
"n_input_tokens": ...,
"n_cache_tokens": ...,
"n_output_tokens": ...,
"cost_usd": ...
}
Suggested direction
- Add an OpenHands SDK runner execution path in Benchflow, and collect metrics directly from the SDK runtime.
- Or add an equivalent telemetry/metrics bridge for the CLI ACP path, if OpenHands CLI can emit cumulative usage/cost in a supported machine-readable way.
From an accuracy and maintenance perspective, the SDK-runner path appears preferable, since it reads metrics from the same execution layer that is actually calling the LLM.
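For the bridge alternative, the shape would be roughly the following. Note the heavy caveat: OpenHands CLI does not currently emit such events, and the "usage" JSON-lines event format here is invented purely for illustration of what a supported machine-readable stream could look like.

```python
import json


def accumulate_usage(lines) -> dict:
    """Hypothetical telemetry bridge for the ACP path: fold per-call
    usage events into cumulative agent_result-style totals. The
    'usage' event shape is NOT something the CLI emits today; it is
    an invented example format."""
    totals = {
        "n_input_tokens": 0,
        "n_cache_tokens": 0,
        "n_output_tokens": 0,
        "cost_usd": 0.0,
    }
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip non-JSON log output
        if not isinstance(event, dict) or event.get("type") != "usage":
            continue
        totals["n_input_tokens"] += event.get("input_tokens", 0)
        totals["n_cache_tokens"] += event.get("cache_tokens", 0)
        totals["n_output_tokens"] += event.get("output_tokens", 0)
        totals["cost_usd"] += event.get("cost_usd", 0.0)
    return totals


# Invented sample stream mixing usage events with ordinary log lines.
sample = [
    '{"type": "usage", "input_tokens": 800, "output_tokens": 200, "cost_usd": 0.004}',
    "plain log line",
    '{"type": "usage", "input_tokens": 400, "cache_tokens": 100, "output_tokens": 50, "cost_usd": 0.002}',
]
totals = accumulate_usage(sample)
```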
Notes
- This issue is specifically about metrics parity, not basic OpenHands ACP functionality.
- OpenHands CLI ACP works for task execution, but not for reliable token/cost accounting.
- If Benchflow is meant to replace Harbor inside the OpenHands evaluation harness, this gap blocks lossless migration of result metadata.
Related local findings
- Current Benchflow openhands agent launches via openhands acp --always-approve --override-with-envs
- ACP initialize provides agentInfo but not cumulative token/cost usage
- Older Harbor/OpenHands SDK runner gathered metrics from llm.metrics.accumulated_token_usage and llm.metrics.accumulated_cost
- Benchflow result schema has been extended locally to include agent_info, agent_result, verifier_result, and exception_info, but the ACP path still lacks a reliable source for OpenHands token/cost values