-
Notifications
You must be signed in to change notification settings - Fork 12
Description
🤖 Kelos Strategist Agent @gjkim42
Area: New CRDs & API Extensions
Summary
Kelos already captures token usage and cost data from every agent run (internal/capture/usage.go) and exposes Prometheus metrics (kelos_task_cost_usd_total, kelos_task_input_tokens_total, kelos_task_output_tokens_total), but provides no mechanism to enforce spending limits. The existing cost controls (maxConcurrency, maxTotalTasks, activeDeadlineSeconds) are all indirect — they limit task count and duration, not actual spending. This proposal adds a costBudget field to TaskSpawner that pauses task creation when cumulative cost or token usage exceeds a configured threshold within a rolling time window.
Problem
1. Cost data is observed but never enforced
The capture system extracts cost from agent output (internal/capture/usage.go:49-73):
func parseClaudeCode(lines [][]byte) map[string]string {
// ...
if v, ok := last["total_cost_usd"]; ok {
result["cost-usd"] = formatNumber(v)
}
if usage, ok := last["usage"].(map[string]any); ok {
if v, ok := usage["input_tokens"]; ok {
result["input-tokens"] = formatNumber(v)
}
// ...
}
}The controller records these into Prometheus counters (internal/controller/metrics.go:47-71):
taskCostUSD = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "kelos_task_cost_usd_total",
Help: "Total cost in USD of completed Tasks",
},
[]string{"namespace", "type", "spawner", "model"},
)But this data is never used for enforcement. Users can build external alerting (Prometheus → AlertManager), but there's no built-in circuit breaker. A misconfigured TaskSpawner or a burst of expensive issues could rack up thousands of dollars before a human notices.
2. Indirect controls don't map to spending
The README acknowledges cost is important:
Use
maxConcurrency, timeouts, and model selection to stay in budget.
But these controls are imprecise:
maxConcurrencylimits parallelism, not total spend. 3 concurrent tasks running 24/7 costs far more than 10 concurrent tasks that each finish in 2 minutesmaxTotalTaskscaps lifetime task count, but a single complex task could cost $50 while 100 simple tasks cost $10 totalactiveDeadlineSecondscaps individual task duration, but different models and prompts have wildly different cost-per-second profiles
There's no way to express "spend no more than $100/day on this spawner" or "cap total token usage at 10M tokens per week."
3. Enterprise adoption requires spending guardrails
Organizations evaluating autonomous agent platforms need to answer "what's the worst-case cost?" before deploying. Today, the answer is "depends on how many issues get filed and how long agents run" — which is unsatisfying for budget-conscious teams. A built-in budget mechanism provides a hard ceiling that makes adoption safer.
Proposed Solution
New costBudget field on TaskSpawnerSpec
// TaskSpawnerSpec
type TaskSpawnerSpec struct {
// ... existing fields ...
// CostBudget defines spending limits for this TaskSpawner.
// When a budget threshold is exceeded, the spawner pauses task creation
// until the rolling window advances past the expensive tasks.
// +optional
CostBudget *CostBudget `json:"costBudget,omitempty"`
}
type CostBudget struct {
// MaxCostUSD is the maximum cumulative cost in USD within the rolling window.
// Cost is sourced from the "cost-usd" result captured from completed tasks.
// +optional
MaxCostUSD *resource.Quantity `json:"maxCostUSD,omitempty"`
// MaxInputTokens is the maximum cumulative input tokens within the rolling window.
// +optional
MaxInputTokens *int64 `json:"maxInputTokens,omitempty"`
// MaxOutputTokens is the maximum cumulative output tokens within the rolling window.
// +optional
MaxOutputTokens *int64 `json:"maxOutputTokens,omitempty"`
// Window is the rolling time window over which costs are accumulated.
// Defaults to "24h". Accepts Go duration strings (e.g., "12h", "7d", "168h").
// +optional
// +kubebuilder:default="24h"
Window *metav1.Duration `json:"window,omitempty"`
}New status fields
type TaskSpawnerStatus struct {
// ... existing fields ...
// CostSummary reports cumulative spending within the current budget window.
// +optional
CostSummary *CostSummary `json:"costSummary,omitempty"`
}
type CostSummary struct {
// CostUSD is the cumulative cost in USD within the current window.
CostUSD resource.Quantity `json:"costUSD"`
// InputTokens is the cumulative input tokens within the current window.
InputTokens int64 `json:"inputTokens"`
// OutputTokens is the cumulative output tokens within the current window.
OutputTokens int64 `json:"outputTokens"`
// WindowStart is the beginning of the current rolling window.
WindowStart metav1.Time `json:"windowStart"`
}New condition type
When budget is exceeded, set a CostBudgetExhausted condition (mirroring the existing TaskBudgetExhausted pattern in cmd/kelos-spawner/main.go:403-417):
{
Type: "CostBudgetExhausted",
Status: metav1.ConditionTrue,
Reason: "BudgetExceeded",
Message: "Cost budget exceeded: $47.23 of $50.00 limit in 24h window",
}Implementation approach
In the spawner (cmd/kelos-spawner/main.go): Before creating tasks in runCycleWithSource, query completed Tasks owned by this spawner within the budget window, sum their status.results["cost-usd"] / status.results["input-tokens"] / status.results["output-tokens"], and skip task creation if any threshold is exceeded. This follows the same pattern as the existing maxTotalTasks budget check.
In the controller (internal/controller/task_controller.go): When a task completes and cost results are captured, update the owning TaskSpawner's status.costSummary to keep a running total. This avoids the spawner needing to re-query all tasks on every poll cycle.
Example config
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
name: issue-solver
spec:
when:
githubIssues:
labels: ["agent"]
maxConcurrency: 3
costBudget:
maxCostUSD: "100" # $100 per 24h window
maxOutputTokens: 5000000 # 5M output tokens per window
window: "24h"
taskTemplate:
type: claude-code
model: claude-opus-4-6
# ...Design Decisions
Why resource.Quantity for maxCostUSD?
resource.Quantity is the standard Kubernetes type for decimal values that need precise comparison. Using a bare float would invite floating-point comparison bugs. This also follows the project convention of using semantic .Equal() for Quantity comparisons.
Why rolling window instead of calendar-based (daily/monthly)?
Rolling windows are simpler to implement (no timezone handling), work predictably across time zones, and don't create "burst at start of period" incentives. Calendar-based budgets can be added later if needed.
Why per-spawner rather than a separate CRD?
Per-spawner budgets are simpler to adopt incrementally — you add one field to an existing resource. A namespace-wide CostPolicy CRD (analogous to the proposed ConcurrencyPolicy in #675) could layer on top later for cross-spawner coordination, but the per-spawner budget solves the most common case: "don't let this one spawner run away."
Relationship to existing controls
| Control | What it limits | Precision |
|---|---|---|
maxConcurrency |
Parallel tasks | Indirect — controls parallelism, not spend |
maxTotalTasks |
Lifetime task count | Indirect — all tasks counted equally |
activeDeadlineSeconds |
Individual task runtime | Indirect — cost/second varies by model |
costBudget (proposed) |
Actual USD/tokens | Direct — measures real spending |
These are complementary: maxConcurrency prevents resource exhaustion, maxTotalTasks caps total work, and costBudget caps total spend.
Backward Compatibility
- Fully optional — omitting
costBudgetpreserves current behavior - No changes to existing TaskSpawner configs required
- Cost tracking relies on agent output that's already captured; no agent-side changes needed
- Spawners that don't report cost (e.g., agents that don't emit usage data) would never hit the budget — the field is a no-op rather than a blocker
References
- Cost capture:
internal/capture/usage.go - Prometheus metrics:
internal/controller/metrics.go:47-71 - Existing budget pattern (
maxTotalTasks):cmd/kelos-spawner/main.go:300-310 TaskBudgetExhaustedcondition:cmd/kelos-spawner/main.go:403-417- Cost result recording:
internal/controller/task_controller.go:895-910 - Related: API: Add ConcurrencyPolicy CRD for namespace-scoped aggregate task limits and scheduling windows #675 (ConcurrencyPolicy CRD — addresses concurrency, not cost)