
Grafana Agent components unhealthy because of k8s API server timeout during pod startup #7053

@ishaanmanaktalia

Description


What's wrong?

We are running Grafana Agent StatefulSet pods with Horizontal Pod Autoscaling (HPA) enabled in our Kubernetes cluster.
When a new Grafana Agent pod is launched during horizontal autoscaling, we sometimes see the prometheus.operator.servicemonitors, prometheus.operator.podmonitors, and prometheus.operator.probes components become unhealthy because of an API server timeout error during pod initialization. After this, the Grafana Agent pod stays in the Running state and keeps showing these three components as unhealthy, without retrying the connection to the API server.

Here is how it looks in the Grafana Agent UI: [screenshots of the unhealthy components]

On checking, we did not notice any issue with the Kubernetes API server itself, and the other Grafana Agent pods in the same StatefulSet were running with all of their components (prometheus.operator.servicemonitors, prometheus.operator.podmonitors, prometheus.operator.probes, and others) healthy. Only the newly launched pod started by the HPA kept showing these components as unhealthy from the moment it started.

The issue occurs intermittently rather than on every scale-up, and it is reported by the following Prometheus metric expression:
sum(agent_component_controller_running_components{health_type!="healthy"}) > 0
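For reference, this expression can be wrapped in a Prometheus Operator alerting rule. A minimal sketch, assuming the prometheus-operator CRDs are installed; the rule name, namespace, duration, and labels below are placeholders, not part of our actual setup:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: grafana-agent-component-health   # placeholder name
  namespace: monitoring                  # placeholder namespace
spec:
  groups:
    - name: grafana-agent
      rules:
        - alert: GrafanaAgentComponentUnhealthy
          # Same expression as above: fires while any component reports a
          # health_type other than "healthy".
          expr: sum(agent_component_controller_running_components{health_type!="healthy"}) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: A Grafana Agent pod has components that are not healthy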

What is expected:
When the prometheus.operator.servicemonitors, prometheus.operator.podmonitors, and prometheus.operator.probes components hit an API server timeout during initialization, Grafana Agent should keep retrying the connection so that the components become healthy again on their own, without a manual restart or deletion of the pod.
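Until such a retry exists, one possible interim workaround is to delay agent startup until the in-cluster API server is reachable, for example with an init container on the agent pod. This is only a sketch of the idea: the image and endpoint are illustrative, and whether the grafana-agent Helm chart exposes a way to add init containers depends on the chart version, so treat it as a pod-spec fragment:

initContainers:
  - name: wait-for-apiserver
    image: curlimages/curl:8.8.0        # illustrative image/tag
    command:
      - sh
      - -c
      - |
        # Any HTTP response (even 401/403) means the API server is reachable;
        # the failure in the logs below is a TCP dial timeout, so plain
        # reachability is all we need to wait for before the agent starts.
        until curl -ksS -o /dev/null https://kubernetes.default.svc/healthz; do
          echo "waiting for kube-apiserver..."
          sleep 2
        done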

Steps to reproduce

Grafana Agent Helm chart version 0.31.0, app version v0.39.0.
Helm values are included in the Configuration section below.

Environment:

Infrastructure: Kubernetes
Deployment tool: Helm

System information

No response

Software version

Grafana Agent v0.39.0

Configuration

Helm values.yaml:
nameOverride: grafana-agent
crds:
  create: false
image:
  tag: v0.39.0
service:
  enabled: true
controller:
  type: 'statefulset'
  replicas: 4
  autoscaling:
    enabled: true
    targetMemoryUtilizationPercentage: 50
    minReplicas: 4
    maxReplicas: 20
agent:
  resources:
    requests:
      cpu: "4"
      memory: "20Gi"
    limits:
      cpu: "4"
      memory: "20Gi"
  mode: 'flow'
  clustering:
    enabled: true
  configMap:
    content: |
      prometheus.remote_write "mimir" {
        endpoint {
          url = "https://mimir-url.abcxyz/api/v1/push"
          headers = {
            "X-Scope-OrgID" = "tenantid",
          }
        }
      }

      /*
      Service Monitors
      */
      prometheus.operator.servicemonitors "discover_servicemonitors" {
        forward_to = [prometheus.remote_write.mimir.receiver]
        selector {
          match_expression {
            key      = "app.kubernetes.io/part-of"
            operator = "NotIn"
            values   = ["prometheus-operator"]
          }
          match_expression {
            key      = "app.kubernetes.io/instance"
            operator = "NotIn"
            values   = ["prom-op"]
          }
        }
        clustering {
          enabled = true
        }
      }

      /*
      Pod Monitors
      */
      prometheus.operator.podmonitors "discover_podmonitors" {
        forward_to = [prometheus.remote_write.mimir.receiver]
        scrape {
          default_scrape_interval = "30s"
        }
        clustering {
          enabled = true
        }
      }

      /*
      Probes
      */
      prometheus.operator.probes "discover_probes" {
        forward_to = [prometheus.remote_write.mimir.receiver]
        scrape {
          default_scrape_interval = "30s"
        }
        clustering {
          enabled = true
        }
      }

Logs

ts=2024-09-23T05:04:54.362676449Z level=info msg="now listening for http traffic" service=http addr=0.0.0.0:80
ts=2024-09-23T05:04:54.362152043Z level=info msg="Using pod service account via in-cluster config" component=prometheus.operator.servicemonitors.discover_servicemonitors
ts=2024-09-23T05:04:54.361663857Z level=info msg="scheduling loaded components and services"
ts=2024-09-23T05:04:54.362133503Z level=info msg="Using pod service account via in-cluster config" component=prometheus.operator.probes.discover_probes
ts=2024-09-23T05:04:54.362076197Z level=info msg="Using pod service account via in-cluster config" component=prometheus.operator.podmonitors.discover_podmonitors
ts=2024-09-23T05:04:54.361499105Z level=info msg="finished complete graph evaluation" controller_id="" trace_id=eaf937c20f85f3ce18dd408efb23c4ae duration=22.012674ms
ts=2024-09-23T05:04:54.361405421Z level=info msg="applying non-TLS config to HTTP server" service=http
ts=2024-09-23T05:05:24.363526504Z level=error msg="error running crd manager" component=prometheus.operator.podmonitors.discover_podmonitors err="could not create RESTMapper from config: Get "https://172.20.0.1:443/api": dial tcp 172.20.0.1:443: i/o timeout"
ts=2024-09-23T05:05:24.363546239Z level=info msg="scrape manager stopped" component=prometheus.operator.probes.discover_probes
ts=2024-09-23T05:05:24.363568264Z level=info msg="scrape manager stopped" component=prometheus.operator.podmonitors.discover_podmonitors
ts=2024-09-23T05:05:24.363491071Z level=error msg="error running crd manager" component=prometheus.operator.probes.discover_probes err="could not create RESTMapper from config: Get "https://172.20.0.1:443/api": dial tcp 172.20.0.1:443: i/o timeout"
ts=2024-09-23T05:05:24.363597696Z level=info msg="scrape manager stopped" component=prometheus.operator.servicemonitors.discover_servicemonitors
ts=2024-09-23T05:05:24.363558031Z level=error msg="error running crd manager" component=prometheus.operator.servicemonitors.discover_servicemonitors err="could not create RESTMapper from config: Get "https://172.20.0.1:443/api": dial tcp 172.20.0.1:443: i/o timeout"
ts=2024-09-23T05:05:44.36759843Z level=info msg="peers changed" new_peers=grafana-agent-5
ts=2024-09-23T05:05:44.367431093Z level=info msg="starting cluster node" peers="" advertise_addr=10.123.123.30:80


Labels

bug (Something isn't working), needs-attention (An issue or PR has been sitting around and needs attention)
