
Grafana Agent components unhealthy because of k8s API server timeout during pod startup #7053

@ishaanmanaktalia

Description


What's wrong?

We are running Grafana Agent StatefulSet pods with Horizontal Pod Autoscaling (HPA) enabled in our Kubernetes cluster.
When a new Grafana Agent pod is launched during horizontal autoscaling, we sometimes see the prometheus.operator.servicemonitors, prometheus.operator.podmonitors, and prometheus.operator.probes components become unhealthy because of an API server timeout error during pod initialization. After this, the Grafana Agent pod stays in the Running state and keeps showing these three components as unhealthy, without retrying the connection to the API server.

Here is how it looks in the Grafana Agent UI: [screenshots of the unhealthy components]

On checking, we did not notice any issue with the Kubernetes API server itself, and the other Grafana Agent pods in the same StatefulSet were running with all of their components (prometheus.operator.servicemonitors, prometheus.operator.podmonitors, prometheus.operator.probes, and others) healthy. Only the newly launched pod started by the HPA kept showing these components as unhealthy from the moment it started.

The issue occurs intermittently rather than on every scale-up, and it is reported by the following Prometheus metric expression:
sum(agent_component_controller_running_components{health_type!="healthy"}) > 0
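For reference, this expression can be wrapped in a Prometheus Operator alerting rule. A minimal sketch, assuming the prometheus-operator CRDs are installed; the rule name, namespace, duration, and labels below are placeholders, not part of our actual setup:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: grafana-agent-component-health   # placeholder name
  namespace: monitoring                  # placeholder namespace
spec:
  groups:
    - name: grafana-agent
      rules:
        - alert: GrafanaAgentComponentUnhealthy
          # Same expression as above: fires while any component reports a
          # health_type other than "healthy".
          expr: sum(agent_component_controller_running_components{health_type!="healthy"}) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: A Grafana Agent pod has components that are not healthy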

What is expected:
When the prometheus.operator.servicemonitors, prometheus.operator.podmonitors, and prometheus.operator.probes components hit an API server timeout during initialization, Grafana Agent should keep retrying the connection so that the components become healthy again on their own, without a manual restart or deletion of the pod.
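Until such a retry exists, one possible interim workaround is to delay agent startup until the in-cluster API server is reachable, for example with an init container on the agent pod. This is only a sketch of the idea: the image and endpoint are illustrative, and whether the grafana-agent Helm chart exposes a way to add init containers depends on the chart version, so treat it as a pod-spec fragment:

initContainers:
  - name: wait-for-apiserver
    image: curlimages/curl:8.8.0        # illustrative image/tag
    command:
      - sh
      - -c
      - |
        # Any HTTP response (even 401/403) means the API server is reachable;
        # the failure in the logs below is a TCP dial timeout, so plain
        # reachability is all we need to wait for before the agent starts.
        until curl -ksS -o /dev/null https://kubernetes.default.svc/healthz; do
          echo "waiting for kube-apiserver..."
          sleep 2
        done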

Steps to reproduce

Grafana Agent Helm chart version 0.31.0, app version v0.39.0.
Helm values are included in the Configuration section below.

Environment:

Infrastructure: Kubernetes
Deployment tool: Helm

System information

No response

Software version

Grafana Agent v0.39.0

Configuration

Helm values.yaml:
nameOverride: grafana-agent
crds:
  create: false
image:
  tag: v0.39.0
service:
  enabled: true
controller:
  type: 'statefulset'
  replicas: 4
  autoscaling:
    enabled: true
    targetMemoryUtilizationPercentage: 50
    minReplicas: 4
    maxReplicas: 20
agent:
  resources:
    requests:
      cpu: "4"
      memory: "20Gi"
    limits:
      cpu: "4"
      memory: "20Gi"
  mode: 'flow'
  clustering:
    enabled: true
  configMap:
    content: |
      prometheus.remote_write "mimir" {
        endpoint {
          url = "https://mimir-url.abcxyz/api/v1/push"
          headers = {
            "X-Scope-OrgID" = "tenantid",
          }
        }
      }

      /*
      Service Monitors
      */
      prometheus.operator.servicemonitors "discover_servicemonitors" {
        forward_to = [prometheus.remote_write.mimir.receiver]
        selector {
          match_expression {
            key      = "app.kubernetes.io/part-of"
            operator = "NotIn"
            values   = ["prometheus-operator"]
          }
          match_expression {
            key      = "app.kubernetes.io/instance"
            operator = "NotIn"
            values   = ["prom-op"]
          }
        }
        clustering {
          enabled = true
        }
      }

      /*
      Pod Monitors
      */
      prometheus.operator.podmonitors "discover_podmonitors" {
        forward_to = [prometheus.remote_write.mimir.receiver]
        scrape {
          default_scrape_interval = "30s"
        }
        clustering {
          enabled = true
        }
      }

      /*
      Probes
      */
      prometheus.operator.probes "discover_probes" {
        forward_to = [prometheus.remote_write.mimir.receiver]
        scrape {
          default_scrape_interval = "30s"
        }
        clustering {
          enabled = true
        }
      }

Logs

ts=2024-09-23T05:04:54.362676449Z level=info msg="now listening for http traffic" service=http addr=0.0.0.0:80
ts=2024-09-23T05:04:54.362152043Z level=info msg="Using pod service account via in-cluster config" component=prometheus.operator.servicemonitors.discover_servicemonitors
ts=2024-09-23T05:04:54.361663857Z level=info msg="scheduling loaded components and services"
ts=2024-09-23T05:04:54.362133503Z level=info msg="Using pod service account via in-cluster config" component=prometheus.operator.probes.discover_probes
ts=2024-09-23T05:04:54.362076197Z level=info msg="Using pod service account via in-cluster config" component=prometheus.operator.podmonitors.discover_podmonitors
ts=2024-09-23T05:04:54.361499105Z level=info msg="finished complete graph evaluation" controller_id="" trace_id=eaf937c20f85f3ce18dd408efb23c4ae duration=22.012674ms
ts=2024-09-23T05:04:54.361405421Z level=info msg="applying non-TLS config to HTTP server" service=http
ts=2024-09-23T05:05:24.363526504Z level=error msg="error running crd manager" component=prometheus.operator.podmonitors.discover_podmonitors err="could not create RESTMapper from config: Get "https://172.20.0.1:443/api": dial tcp 172.20.0.1:443: i/o timeout"
ts=2024-09-23T05:05:24.363546239Z level=info msg="scrape manager stopped" component=prometheus.operator.probes.discover_probes
ts=2024-09-23T05:05:24.363568264Z level=info msg="scrape manager stopped" component=prometheus.operator.podmonitors.discover_podmonitors
ts=2024-09-23T05:05:24.363491071Z level=error msg="error running crd manager" component=prometheus.operator.probes.discover_probes err="could not create RESTMapper from config: Get "https://172.20.0.1:443/api": dial tcp 172.20.0.1:443: i/o timeout"
ts=2024-09-23T05:05:24.363597696Z level=info msg="scrape manager stopped" component=prometheus.operator.servicemonitors.discover_servicemonitors
ts=2024-09-23T05:05:24.363558031Z level=error msg="error running crd manager" component=prometheus.operator.servicemonitors.discover_servicemonitors err="could not create RESTMapper from config: Get "https://172.20.0.1:443/api": dial tcp 172.20.0.1:443: i/o timeout"
ts=2024-09-23T05:05:44.36759843Z level=info msg="peers changed" new_peers=grafana-agent-5
ts=2024-09-23T05:05:44.367431093Z level=info msg="starting cluster node" peers="" advertise_addr=10.123.123.30:80


Labels

bug (Something isn't working), needs-attention (An issue or PR has been sitting around and needs attention)
