
Use informers for pod labels and UIDs to fix client-go throttling #552

Open
Li357 wants to merge 3 commits into NVIDIA:main from Li357:main

Conversation

Li357 commented Sep 2, 2025

Fixes #551

The existing pod label code doesn't actually cache anything: it still hits the API server on every scrape (which can be as frequent as every 1s), and those requests get throttled by client-go's client-side rate limiting. An informer is a better fit here because labels change relatively rarely and UIDs never change.
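
For context, here is a minimal sketch of the informer approach using client-go's shared informer factory. It is not the code in this PR; newPodLabelCache and podLabelsAndUID are hypothetical names used only to illustrate reading labels and UIDs from the informer's local cache instead of calling the API server on each scrape.

// Minimal sketch of the informer approach, not the exact code in this PR.
// newPodLabelCache and podLabelsAndUID are hypothetical names.
package podcache

import (
    "context"
    "fmt"
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    corelisters "k8s.io/client-go/listers/core/v1"
)

// newPodLabelCache starts a shared pod informer and returns a lister backed by
// the informer's local cache, so scrapes never call the API server directly.
func newPodLabelCache(ctx context.Context, client kubernetes.Interface) (corelisters.PodLister, error) {
    factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
    podInformer := factory.Core().V1().Pods()
    lister := podInformer.Lister()

    factory.Start(ctx.Done())
    // Block until the initial pod list is in the cache before serving scrapes.
    for _, ok := range factory.WaitForCacheSync(ctx.Done()) {
        if !ok {
            return nil, fmt.Errorf("pod informer cache failed to sync")
        }
    }
    return lister, nil
}

// podLabelsAndUID reads labels and the UID from the local cache; the informer's
// watch keeps the cache up to date when labels change.
func podLabelsAndUID(lister corelisters.PodLister, namespace, name string) (map[string]string, string, error) {
    pod, err := lister.Pods(namespace).Get(name)
    if err != nil {
        return nil, "", err
    }
    return pod.Labels, string(pod.UID), nil
}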

glowkey (Collaborator) commented Sep 3, 2025

The following failure occurs when running 'make test-main':

--- FAIL: TestProcessPodMapper_WithLabels (0.04s)
    kubernetes_test.go:566:
        Error Trace:    /opt/exporter/dcgm-exporter-dev/internal/pkg/transformation/kubernetes_test.go:566
        Error:          Not equal:
                        expected: 1
                        actual  : 0
        Test:           TestProcessPodMapper_WithLabels
        Messages:       Expected 1 labels for pod gpu-pod-0, but got 0

Li357 (Author) commented Sep 4, 2025

Tests should be passing now.

@glowkey it seems the mocked clientset is injected after the informer is created:

podMapper := NewPodMapper(&appconfig.Config{
    KubernetesEnablePodLabels: true,
    KubernetesGPUIdType:       appconfig.GPUUID,
    PodResourcesKubeletSocket: socketPath,
})
// Inject the fake clientset
podMapper.Client = clientset

This wasn't a problem before because, on subsequent scrapes after NewPodMapper created the instance, the code would use the mocked clientset. I think it makes more sense to allow passing an explicit Client to the constructor for testing purposes, e.g. having NewPodMapper take the client as an argument (see the sketch below). Let me know if you'd like me to refactor it that way.

I also fixed the map allocation in the constructor. Not sure how my manual tests were working without that...
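
A rough sketch of the suggested constructor change, with trimmed stand-ins for appconfig.Config and PodMapper; the optional client parameter is the proposal, not the current dcgm-exporter API:

// Sketch of the proposed constructor change; Config and PodMapper are trimmed
// stand-ins for the real appconfig.Config and transformation.PodMapper.
package transformation

import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

type Config struct {
    KubernetesEnablePodLabels bool
}

type PodMapper struct {
    Config *Config
    Client kubernetes.Interface
}

// NewPodMapper accepts an optional client. Production callers pass nil and get
// the in-cluster clientset; tests pass fake.NewSimpleClientset(...) so the
// informer is created against the fake from the start.
func NewPodMapper(c *Config, client kubernetes.Interface) (*PodMapper, error) {
    if client == nil {
        restCfg, err := rest.InClusterConfig()
        if err != nil {
            return nil, err
        }
        client, err = kubernetes.NewForConfig(restCfg)
        if err != nil {
            return nil, err
        }
    }
    // The pod informer would be started here against whichever client was
    // supplied, before any scrape runs.
    return &PodMapper{Config: c, Client: client}, nil
}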

Li357 (Author) commented Sep 17, 2025

@glowkey any updates here? I'm maintaining my own fork for this right now

glowkey (Collaborator) commented Sep 19, 2025

Just to understand your use-case a little more, what is the reason for the query interval being set to 1000ms? How many GPUs are being queried and about how many pods? Thanks!

Li357 (Author) commented Sep 29, 2025

We are using DCGM profiling metrics (DCGM_FI_PROF_...) and exporting them every 1s, which I understand is fairly frequent, but DCGM supports intervals down to 100ms.

Even with just one or two pods with 4 or 8 GPUs each, we see a lot of metrics being dropped because the Kubernetes client is rate limited client-side. Regardless of whether scraping every 1s is excessive, IMO it's excessive for dcgm-exporter to query the API server on demand on every metric scrape when labels rarely change and UIDs never change.
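
For illustration only (not dcgm-exporter code): client-go applies a client-side token-bucket limiter to every request, and when rest.Config.QPS and Burst are left at zero it falls back to small defaults (around 5 QPS with a burst of 10), so one Get per pod on a 1s scrape interval blows through the budget quickly. Raising the limits is just a workaround; the informer removes the per-scrape requests entirely.

// Illustration only: the client-side rate-limit knobs behind the throttling
// described above, not dcgm-exporter code.
package k8sclient

import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// newClientWithHigherLimits raises the client-side token-bucket limits.
// When QPS and Burst are left at zero, client-go falls back to its defaults
// (around 5 QPS with a burst of 10), which per-scrape pod Gets at a 1s
// interval exceed almost immediately.
func newClientWithHigherLimits() (*kubernetes.Clientset, error) {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        return nil, err
    }
    cfg.QPS = 50
    cfg.Burst = 100
    return kubernetes.NewForConfig(cfg)
}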

Li357 (Author) commented Oct 16, 2025

@glowkey Can I please get a review here?

glowkey (Collaborator) commented Oct 16, 2025

Very sorry for the delay. I plan to try to reproduce the issue to validate this PR, but I haven't had time to do that yet.



Successfully merging this pull request may close these issues.

"Couldn't get pod metadata" and client-go throttling when using Kubernetes pod labels
