
feat: enable Kubernetes system traces (API server + kubelet) #689

@jomcgi

Description


Summary

Enable Kubernetes system traces for the kube-apiserver and kubelet to get control plane observability in SigNoz. This closes the remaining visibility gap in our current observability stack: we already have application-level traces (OTel SDKs) and service mesh traces (Linkerd), but no insight into Kubernetes internals.

What we'd see in SigNoz

  • API server spans: request lifecycle (authn → authz → admission webhooks → etcd), webhook latency (Kyverno, Linkerd), etcd read/write performance
  • Kubelet spans: CRI calls to containerd, pod sync routines, garbage collection, gRPC to container runtime
  • Distributed context: API server propagates W3C trace context to webhooks, so Kyverno policy evaluation appears as child spans of the originating API request
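The distributed-context point relies on the W3C traceparent header, which the API server attaches to outgoing webhook requests. Its shape is fixed by the Trace Context spec (the IDs below are the spec's example values, not real ones from our cluster):

```shell
# W3C trace context header format: version-traceid-parentid-flags
printf 'traceparent: 00-%s-%s-01\n' \
  4bf92f3577b34da6a3ce929d0e0e4736 00f067aa0ba902b7
# prints: traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```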

Prerequisites (already met)

| Requirement | Status |
| --- | --- |
| otelAgent DaemonSet on all nodes (incl. control plane) | 4/4 nodes |
| hostPort: 4317 bound (OTLP gRPC on localhost) | Confirmed |
| OTLP gRPC receiver configured in otelAgent | otlp.protocols.grpc |
| Traces pipeline wired to SigNoz | otlp → k8sattributes → batch → signoz-otel-collector |
| Tolerations for control plane | operator: Exists |
| K8s version supports stable kubelet tracing | v1.35.0 (stable since v1.34) |

No changes needed on the SigNoz/collector side.
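For reference, the traces pipeline from the prerequisites table corresponds to otelAgent collector config along these lines (a sketch, not copied from the cluster; the exporter endpoint and namespace are assumptions based on a typical SigNoz k8s-infra install):

```yaml
# Sketch of the relevant otelAgent config (already in place; names are illustrative)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # exposed on each node via hostPort 4317
processors:
  k8sattributes: {}              # enrich spans with pod/namespace metadata
  batch: {}
exporters:
  otlp:
    endpoint: signoz-otel-collector.signoz.svc:4317  # assumed in-cluster gateway
    tls:
      insecure: true             # assumption: plaintext in-cluster OTLP
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp]
```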

Implementation

1. Create tracing config files on each server node

The kube-apiserver reads a TracingConfiguration from the file passed via tracing-config-file:

# /etc/rancher/k3s/tracing.yaml (kube-apiserver)
apiVersion: apiserver.config.k8s.io/v1beta1
kind: TracingConfiguration
endpoint: localhost:4317
samplingRatePerMillion: 1000  # 0.1% — conservative starting point

The kubelet has no tracing-config-file flag; its tracing settings live under the tracing field of KubeletConfiguration, so it needs a separate file (file name is our choice):

# /etc/rancher/k3s/kubelet-tracing.yaml (kubelet)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
tracing:
  endpoint: localhost:4317
  samplingRatePerMillion: 1000  # 0.1%, matching the apiserver rate

2. Update k3s server config on each server node

# /etc/rancher/k3s/config.yaml (append to existing config)
kube-apiserver-arg:
  - "tracing-config-file=/etc/rancher/k3s/tracing.yaml"
kubelet-arg:
  - "config=/etc/rancher/k3s/kubelet-tracing.yaml"

Note: kubelet command-line flags that k3s sets still take precedence over values in this config file, so only the tracing stanza takes effect from it.

3. Rolling restart of k3s

Restart one server node at a time to avoid losing etcd quorum:

# On each server node (node-1, node-2, node-3), one at a time:
sudo systemctl restart k3s
# Wait for the node to report Ready before moving to the next, e.g.:
kubectl wait --for=condition=Ready node/node-1 --timeout=300s

node-4 (worker) only needs the kubelet tracing config plus a restart of k3s-agent, if desired.

4. Verify

# Check traces are flowing
kubectl logs -n signoz -l app.kubernetes.io/component=otel-agent --tail=20 | grep -i trace

# Look for apiserver/kubelet spans in SigNoz UI
# Services should appear as "kube-apiserver" and "kubelet"

Follow-up

  • Monitor otelAgent resource usage after enabling — the 0.1% sampling rate should have negligible impact
  • Consider bumping samplingRatePerMillion if more coverage is needed for debugging
  • Optionally add SigNoz dashboard for control plane latency metrics derived from traces
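When tuning, samplingRatePerMillion converts to a percentage as rate / 1,000,000 × 100; a quick sanity check for the starting value:

```shell
# Convert samplingRatePerMillion to a percentage (1000 → 0.1%)
rate=1000
awk -v r="$rate" 'BEGIN { printf "%.4f%%\n", r / 1000000 * 100 }'
# prints: 0.1000%
```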
