🤖 Kelos Strategist Agent @gjkim42
Area: Integration Opportunities
Summary
Add OpenTelemetry (OTEL) distributed tracing to the Kelos controller and spawner, enabling operators to visualize and debug task pipeline execution as end-to-end traces in standard backends (Jaeger, Grafana Tempo, Datadog). While Kelos already has solid Prometheus metrics (internal/controller/metrics.go — task counts, duration histograms, cost/token counters), it has zero distributed tracing support. Metrics answer "how many?" and "how long on average?" but cannot answer "why did this specific pipeline fail at step 3?" or "which spawner discovery cycle created this task?" Tracing fills this gap.
Problem
1. Multi-step pipelines are opaque
Task pipelines using dependsOn chains (e.g., examples/07-task-pipeline/) create implicit causal relationships:
TaskSpawner discovery → Task "scaffold" → Task "write-tests" → Task "open-pr"
Today, debugging a failure in this chain requires:
- Manually correlating timestamps across `kelos get tasks`
- Reading individual pod logs (`kelos logs`)
- Guessing which spawner discovery cycle triggered the pipeline
- No way to see the full execution timeline in one view
2. Spawner-to-task causality is lost
When a TaskSpawner discovers a work item and creates a Task, there is no trace linking the discovery event to the resulting Task. The spawner (cmd/kelos-spawner/main.go) creates Task resources via the Kubernetes API, but no context is propagated. If 9 spawners are running simultaneously (as in the self-development setup), correlating which spawner created which task requires label-based filtering, not causal tracing.
3. Controller reconciliation loops are invisible
The TaskReconciler performs multiple operations per reconciliation:
- Dependency resolution (`checkDependencies`)
- Branch lock acquisition (`BranchLocker`)
- Prompt template resolution (`resolvePromptTemplate`)
- Job creation (`JobBuilder.BuildJob`)
- Output capture from pod logs
When a reconciliation takes unexpectedly long or fails, there is no breakdown of where time was spent. Prometheus histograms (kelos_task_duration_seconds) show end-to-end duration but not internal step timing.
4. No cross-task trace context
Task pipelines have no shared trace ID. Each task is an independent unit. When write-tests depends on scaffold, there is no way to:
- View the full pipeline as a single trace
- See the wait time between dependency completion and downstream task start
- Correlate failures across dependent tasks
Proposed Design
Trace Hierarchy
```
Trace: "spawner/{spawnerName}/discovery/{workItemID}"
├── Span: "spawner.discover" (poll cycle)
│   ├── Span: "spawner.create_task" (per work item)
│   │   └── Link: → task/{taskName}
│   └── Span: "spawner.create_task"
│       └── Link: → task/{taskName}
│
├── Trace: "task/{taskName}" (linked from spawner)
│   ├── Span: "task.reconcile"
│   │   ├── Span: "task.check_dependencies"
│   │   ├── Span: "task.acquire_branch_lock"
│   │   ├── Span: "task.resolve_prompt"
│   │   ├── Span: "task.build_job"
│   │   └── Span: "task.create_job"
│   ├── Span: "task.running" (long span covering agent execution)
│   ├── Span: "task.capture_outputs"
│   └── Span: "task.complete"
│
└── Trace: "task/{dependentTaskName}" (linked via dependsOn)
    ├── Span: "task.waiting" (blocked on dependency)
    └── Span: "task.reconcile" ...
```
Trace Context Propagation
- **Spawner → Task**: Store trace context in Task annotations (`kelos.dev/traceparent`, `kelos.dev/tracestate`). The spawner creates a trace for each discovery cycle and child spans for each Task creation.
- **Task → Dependent Task**: When `resolvePromptTemplate` checks dependencies, create a span link from the dependent task's trace to each dependency's trace. This preserves the causal relationship without forcing a single trace (pipelines can be long-running).
- **Controller → Job/Pod**: Inject a `TRACEPARENT` env var into agent pods via `JobBuilder.BuildJob()`. Agents that support OTEL can optionally continue the trace, enabling end-to-end visibility into what the agent did.
Span Attributes
| Attribute | Source | Example |
|---|---|---|
| `kelos.task.name` | Task metadata | `scaffold` |
| `kelos.task.type` | `spec.type` | `claude-code` |
| `kelos.task.model` | `spec.model` | `opus` |
| `kelos.task.phase` | `status.phase` | `Succeeded` |
| `kelos.task.spawner` | Label `kelos.dev/taskspawner` | `kelos-workers` |
| `kelos.task.cost_usd` | `status.results["total_cost"]` | `0.42` |
| `kelos.task.tokens.input` | `status.results["input_tokens"]` | `15000` |
| `kelos.task.tokens.output` | `status.results["output_tokens"]` | `3200` |
| `kelos.task.branch` | `spec.branch` | `feature/auth` |
| `kelos.spawner.name` | TaskSpawner metadata | `kelos-workers` |
| `kelos.spawner.source` | Source type | `githubIssues` |
| `kelos.spawner.work_item.id` | Work item identifier | `issue-42` |
Implementation Touchpoints
| File | Change |
|---|---|
| `go.mod` | Add `go.opentelemetry.io/otel`, `go.opentelemetry.io/otel/sdk`, `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc` |
| `cmd/kelos-controller/main.go` | Initialize OTEL TracerProvider with OTLP exporter; respect `OTEL_EXPORTER_OTLP_ENDPOINT` env var |
| `cmd/kelos-spawner/main.go` | Initialize TracerProvider; create spans for discovery cycles and task creation; store traceparent in Task annotations |
| `internal/controller/task_controller.go` | Read traceparent from annotations; create spans for reconciliation steps; inject `TRACEPARENT` into job env |
| `internal/controller/taskspawner_controller.go` | Create spans for spawner reconciliation |
| `internal/controller/job_builder.go` | Add `TRACEPARENT` env var to container spec when trace context is present |
| Helm chart `values.yaml` | Add `tracing.enabled`, `tracing.endpoint`, `tracing.samplingRate` configuration |
Configuration
```yaml
# Helm values.yaml
tracing:
  enabled: false      # opt-in, no overhead when disabled
  endpoint: ""        # OTLP endpoint, e.g., "otel-collector:4317"
  samplingRate: 1.0   # 0.0-1.0, trace sampling ratio
```

When `tracing.enabled` is false (the default), no OTEL SDK is initialized and no spans are created — zero performance impact for users who don't need tracing.
Why This Matters
Production debugging
The self-development deployment runs 9 spawners creating tasks that interact through GitHub (one spawner's PR triggers another spawner's review). Without tracing, debugging cross-spawner interactions requires manual timestamp correlation across potentially dozens of concurrent tasks.
Pipeline failure diagnosis
Example 07 shows a 3-step pipeline. With tracing, a failed pipeline shows up as a single trace with the failing span highlighted, giving immediate visibility into whether the failure was in dependency resolution, branch locking, agent execution, or output capture.
Cost attribution per workflow
Existing cost metrics (kelos_task_cost_usd_total) are aggregated. Traces would show cost per individual pipeline execution, enabling per-workflow cost analysis.
Complements existing metrics
This proposal works alongside, not replacing, the existing Prometheus metrics in internal/controller/metrics.go. Metrics provide aggregate views (dashboards, alerting); traces provide per-execution debugging. Together they provide full observability:
| Question | Metrics | Tracing |
|---|---|---|
| How many tasks failed this hour? | ✅ `kelos_task_completed_total{phase="Failed"}` | ❌ |
| Why did task X fail? | ❌ | ✅ Trace shows exact step |
| What's the p95 task duration? | ✅ `kelos_task_duration_seconds` | ❌ |
| Where did task X spend its time? | ❌ | ✅ Span breakdown |
| Which spawner created task X? | ⚠️ Label filtering only | ✅ Causal link |
Backward Compatibility
- Fully opt-in via Helm values (`tracing.enabled: false` by default)
- No new CRDs or API changes
- Zero overhead when disabled (no OTEL SDK initialization)
- Trace annotations on Tasks are informational and don't affect behavior
Alternatives Considered
- **Structured logging with correlation IDs**: Simpler but less powerful — no timing visualization, no causal links, requires custom log parsing. Tracing provides this and more via standard tooling.
- **Kubernetes Events only**: Already used for lifecycle events, but events are ephemeral (default 1h TTL), lack timing precision, and don't support parent-child relationships. Good for alerting, insufficient for debugging.
- **Custom trace format**: Would require custom UI/tooling. OTEL is the industry standard with broad backend support (Jaeger, Tempo, Datadog, Honeycomb, etc.).
/kind feature