
Use informers for pod labels and UIDs to fix client-go throttling #552

Open
Li357 wants to merge 3 commits into NVIDIA:main from Li357:main

Conversation

Li357 commented Sep 2, 2025

Fixes #551

The existing pod label code doesn't actually cache anything: it still hits the API server on every scrape (which can be as frequent as every 1s), and those requests get throttled by client-go's client-side rate limiting. An informer is a better fit here because labels change relatively rarely and UIDs never change.
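
For context, here is a minimal sketch of the informer approach using client-go's shared informer factory. It is not the code in this PR; newPodLabelCache and podLabelsAndUID are hypothetical names used only to illustrate reading labels and UIDs from the informer's local cache instead of calling the API server on each scrape.

// Minimal sketch of the informer approach, not the exact code in this PR.
// newPodLabelCache and podLabelsAndUID are hypothetical names.
package podcache

import (
    "context"
    "fmt"
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    corelisters "k8s.io/client-go/listers/core/v1"
)

// newPodLabelCache starts a shared pod informer and returns a lister backed by
// the informer's local cache, so scrapes never call the API server directly.
func newPodLabelCache(ctx context.Context, client kubernetes.Interface) (corelisters.PodLister, error) {
    factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
    podInformer := factory.Core().V1().Pods()
    lister := podInformer.Lister()

    factory.Start(ctx.Done())
    // Block until the initial pod list is in the cache before serving scrapes.
    for _, ok := range factory.WaitForCacheSync(ctx.Done()) {
        if !ok {
            return nil, fmt.Errorf("pod informer cache failed to sync")
        }
    }
    return lister, nil
}

// podLabelsAndUID reads labels and the UID from the local cache; the informer's
// watch keeps the cache up to date when labels change.
func podLabelsAndUID(lister corelisters.PodLister, namespace, name string) (map[string]string, string, error) {
    pod, err := lister.Pods(namespace).Get(name)
    if err != nil {
        return nil, "", err
    }
    return pod.Labels, string(pod.UID), nil
}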

glowkey (Collaborator) commented Sep 3, 2025

The following failure occurs when running 'make test-main':

--- FAIL: TestProcessPodMapper_WithLabels (0.04s)
    kubernetes_test.go:566:
        Error Trace:    /opt/exporter/dcgm-exporter-dev/internal/pkg/transformation/kubernetes_test.go:566
        Error:          Not equal:
                        expected: 1
                        actual  : 0
        Test:           TestProcessPodMapper_WithLabels
        Messages:       Expected 1 labels for pod gpu-pod-0, but got 0

Li357 (Author) commented Sep 4, 2025

Tests should be passing now.

@glowkey it seems the mocked clientset is injected after the informer is created:

podMapper := NewPodMapper(&appconfig.Config{
    KubernetesEnablePodLabels: true,
    KubernetesGPUIdType:       appconfig.GPUUID,
    PodResourcesKubeletSocket: socketPath,
})
// Inject the fake clientset
podMapper.Client = clientset

This wasn't a problem before because, on subsequent scrapes after NewPodMapper created the instance, the code would use the mocked clientset. I think it makes more sense to allow passing an explicit Client to the constructor for testing purposes, e.g. having NewPodMapper take the client as an argument (see the sketch below). Let me know if you'd like me to refactor it that way.

I also fixed the map allocation in the constructor. Not sure how my manual tests were working without that...
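
A rough sketch of the suggested constructor change, with trimmed stand-ins for appconfig.Config and PodMapper; the optional client parameter is the proposal, not the current dcgm-exporter API:

// Sketch of the proposed constructor change; Config and PodMapper are trimmed
// stand-ins for the real appconfig.Config and transformation.PodMapper.
package transformation

import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

type Config struct {
    KubernetesEnablePodLabels bool
}

type PodMapper struct {
    Config *Config
    Client kubernetes.Interface
}

// NewPodMapper accepts an optional client. Production callers pass nil and get
// the in-cluster clientset; tests pass fake.NewSimpleClientset(...) so the
// informer is created against the fake from the start.
func NewPodMapper(c *Config, client kubernetes.Interface) (*PodMapper, error) {
    if client == nil {
        restCfg, err := rest.InClusterConfig()
        if err != nil {
            return nil, err
        }
        client, err = kubernetes.NewForConfig(restCfg)
        if err != nil {
            return nil, err
        }
    }
    // The pod informer would be started here against whichever client was
    // supplied, before any scrape runs.
    return &PodMapper{Config: c, Client: client}, nil
}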

Li357 (Author) commented Sep 17, 2025

@glowkey any updates here? I'm maintaining my own fork for this right now

glowkey (Collaborator) commented Sep 19, 2025

Just to understand your use-case a little more, what is the reason for the query interval being set to 1000ms? How many GPUs are being queried and about how many pods? Thanks!

Li357 (Author) commented Sep 29, 2025

We are using DCGM profiling metrics (DCGM_FI_PROF_...) and exporting them every 1s, which I understand is fairly frequent, but DCGM supports intervals down to 100ms.

Even with just one or two pods with 4 or 8 GPUs each, we see a lot of metrics being dropped because the Kubernetes client is rate limited client-side. Regardless of whether scraping every 1s is excessive, IMO it's excessive for dcgm-exporter to query the API server on demand on every metric scrape when labels rarely change and UIDs never change.
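
For illustration only (not dcgm-exporter code): client-go applies a client-side token-bucket limiter to every request, and when rest.Config.QPS and Burst are left at zero it falls back to small defaults (around 5 QPS with a burst of 10), so one Get per pod on a 1s scrape interval blows through the budget quickly. Raising the limits is just a workaround; the informer removes the per-scrape requests entirely.

// Illustration only: the client-side rate-limit knobs behind the throttling
// described above, not dcgm-exporter code.
package k8sclient

import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// newClientWithHigherLimits raises the client-side token-bucket limits.
// When QPS and Burst are left at zero, client-go falls back to its defaults
// (around 5 QPS with a burst of 10), which per-scrape pod Gets at a 1s
// interval exceed almost immediately.
func newClientWithHigherLimits() (*kubernetes.Clientset, error) {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        return nil, err
    }
    cfg.QPS = 50
    cfg.Burst = 100
    return kubernetes.NewForConfig(cfg)
}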

Li357 (Author) commented Oct 16, 2025

@glowkey Can I please get a review here?

glowkey (Collaborator) commented Oct 16, 2025

Very sorry for the delay. I plan to try to reproduce the issue to validate this PR, but I haven't had time to do that yet.



Successfully merging this pull request may close these issues.

"Couldn't get pod metadata" and client-go throttling when using Kubernetes pod labels
