Three observability and execution model improvements
1. Cost / token tracking per step
Problem: Flows that call LLMs (via claude --print or API actions) have zero visibility into token spend. In a 90+ flow codebase with 25+ LLM-calling flows, an orchestrator flow might invoke 8 sub-flows containing ~40 LLM calls total. When a parallel branch fails at minute 15, the tokens burned on that branch are unquantifiable.
We've built defensive "gate" steps (e.g., check that research output is non-empty before running expensive analysis) specifically to prevent token waste — but this is guesswork without actual cost data.
Proposed solution:
```json
{
  "id": "runAnalysis",
  "type": "bash",
  "bash": { "command": "claude --print ..." },
  "metrics": {
    "track": true,
    "costLabel": "company-analysis"
  }
}
```
After flow execution, output includes:
```json
{
  "_metrics": {
    "totalDurationMs": 145000,
    "steps": {
      "runAnalysis": {
        "durationMs": 12000,
        "inputTokens": 8234,
        "outputTokens": 3421,
        "estimatedCost": "$0.037"
      }
    },
    "totalEstimatedCost": "$1.24"
  }
}
```
For non-LLM steps, duration alone is valuable. For LLM steps, token counts enable optimization (e.g., "this prompt is 80K tokens — should we truncate?").
Impact: Enables data-driven pipeline optimization. Currently every LLM-heavy workflow is a cost black box.
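The aggregation itself is straightforward arithmetic once token counts are captured. A minimal Python sketch of building the proposed `_metrics` envelope — the per-token prices here are illustrative placeholders, not real rates, and the function names are hypothetical:

```python
# Illustrative per-token prices; real rates depend on the model and provider.
PRICES = {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000}

def step_metrics(duration_ms, input_tokens=None, output_tokens=None):
    """Build the per-step record sketched in the proposal above."""
    m = {"durationMs": duration_ms}
    if input_tokens is not None and output_tokens is not None:
        cost = input_tokens * PRICES["input"] + output_tokens * PRICES["output"]
        m.update(inputTokens=input_tokens, outputTokens=output_tokens,
                 estimatedCost=f"${cost:.3f}")
    return m

def aggregate(steps):
    """Roll per-step records up into the _metrics envelope."""
    total_ms = sum(s["durationMs"] for s in steps.values())
    total_cost = sum(float(s["estimatedCost"][1:])
                     for s in steps.values() if "estimatedCost" in s)
    return {"totalDurationMs": total_ms,
            "steps": steps,
            "totalEstimatedCost": f"${total_cost:.2f}"}
```

Non-LLM steps simply omit the token fields and contribute duration only, which matches the mixed output shown above.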
2. Step duration metrics (a lightweight version of the above)
Even without token tracking, knowing per-step wall-clock time would be valuable. A --metrics flag on one flow execute could append timing data to each step's output:
one --agent flow execute my-flow --metrics -i param=value
Output includes _stepTimings: { "step1": 1234, "step2": 5678, ... }.
This is implementable without any LLM-specific logic and would help identify bottlenecks in any flow.
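The timing side needs no LLM awareness at all. A minimal sketch of the idea, assuming the runner can wrap each step in a context manager (`timed` is an illustrative name, not an existing API):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(step_id, timings):
    """Record a step's wall-clock duration, in ms, into `timings`."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[step_id] = int((time.monotonic() - start) * 1000)

# Usage: wrap each step's execution, then attach `timings` as _stepTimings.
timings = {}
with timed("step1", timings):
    time.sleep(0.01)  # stand-in for actually running the step
```

Using a monotonic clock avoids skew from system time adjustments during long-running flows.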
3. Event-driven flow triggers
Problem: The relay system handles inbound webhooks, but there's no declarative way to say "when event X occurs, run flow Y." Currently we:
- Poll via scheduled bash scripts (the inbox-poll flow)
- Chain flows manually via bash (one flow execute inside another flow)
- Use relay + passthrough for simple webhook→action forwarding
A native trigger system would enable reactive pipelines:
```json
{
  "key": "auto-process-deal",
  "trigger": {
    "type": "webhook",
    "path": "/deal-created",
    "filter": "$.body.source === 'email'"
  },
  "flow": "deal-process",
  "inputMapping": {
    "companyName": "$.body.company",
    "founderEmail": "$.body.email"
  }
}
```
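The inputMapping above uses a JSONPath-like `$.a.b` syntax. A minimal sketch of how a trigger runner might resolve those paths against the incoming event — a real implementation would use a proper JSONPath library, and the filter expression would need its own evaluator:

```python
def resolve(path, event):
    """Walk a `$.a.b` path (the syntax from the example above) through a dict."""
    node = event
    for key in path.lstrip("$.").split("."):
        node = node[key]
    return node

def map_inputs(mapping, event):
    """Build the flow's inputs from an inputMapping and the trigger event."""
    return {name: resolve(path, event) for name, path in mapping.items()}
```

With the webhook payload as `event`, this yields exactly the named inputs the flow expects, so the triggered flow is indistinguishable from one invoked manually with -i flags.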
Or schedule-based:
```json
{
  "trigger": {
    "type": "schedule",
    "cron": "0 9 * * 1-5"
  },
  "flow": "daily-digest",
  "inputMapping": {
    "date": "$.trigger.scheduledDate"
  }
}
```
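A schedule trigger runner needs to decide whether a given tick matches the cron expression. A minimal matcher for the five-field syntax used above, handling only `*`, single values, and ranges like `1-5` (a production scheduler would also need lists and step values):

```python
from datetime import datetime

def field_matches(spec, value):
    """Match one cron field: '*', a single number, or a 'lo-hi' range."""
    if spec == "*":
        return True
    if "-" in spec:
        lo, hi = map(int, spec.split("-"))
        return lo <= value <= hi
    return int(spec) == value

def cron_matches(expr, dt):
    """Check a datetime against a five-field cron expression."""
    minute, hour, dom, month, dow = expr.split()
    return (field_matches(minute, dt.minute)
            and field_matches(hour, dt.hour)
            and field_matches(dom, dt.day)
            and field_matches(month, dt.month)
            # cron day-of-week is 0=Sunday..6=Saturday
            and field_matches(dow, dt.isoweekday() % 7))
```

So "0 9 * * 1-5" fires at 09:00 on weekdays, which is the daily-digest case above.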
Impact: Eliminates polling flows and manual chaining. Enables reactive architectures where flows respond to events rather than being manually invoked.
Relationship to existing issues