Use informers for pod labels and UIDs to fix client-go throttling #552
Li357 wants to merge 3 commits into NVIDIA:main from
|
The following failure occurs when running 'make test-main': --- FAIL: TestProcessPodMapper_WithLabels (0.04s) |
|
Tests should be passing now. @glowkey, it seems the mocked clientset is injected after the informer is created (dcgm-exporter/internal/pkg/transformation/kubernetes_test.go, lines 501 to 508 in 4ecf9b6). That wasn't a problem before because on each subsequent scrape after NewPodMapper creates the instance, it would use the mocked clientset. I think it makes more sense to allow passing an explicit [...]. Also fixed allocating the map in the constructor. Don't know how my manual tests were working without that... |
|
@glowkey any updates here? I'm maintaining my own fork for this right now |
|
Just to understand your use-case a little more, what is the reason for the query interval being set to 1000ms? How many GPUs are being queried and about how many pods? Thanks! |
|
We are using DCGM profile (DCGM_FI_PROF_...) metrics and exporting them every 1s, which I understand is pretty frequent, but DCGM supports intervals down to 100ms. Even with just one or two pods with 4 or 8 GPUs each, we see a ton of metrics being dropped because the Kubernetes client code rate limits. IMO, regardless of whether scraping every 1s is excessive, it's pretty excessive for dcgm-exporter to query the apiserver on demand for every metric scrape when labels rarely change and UIDs never change. |
|
@glowkey Can I please get a review here? |
|
Very sorry for the delay. I plan on trying to reproduce the issue to validate this MR but I have not had time to do that yet. |
Fixes #551
The existing pod label code doesn't do any real caching. It still hits the API server on every scrape (which can happen as often as every 1s), and those requests are then throttled by client-go's client-side rate limiting. It makes sense to use an informer here because labels change relatively rarely and UIDs never change.
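To illustrate the idea, here is a minimal, standard-library-only sketch of the pattern an informer provides: event handlers keep a local store up to date, and each scrape reads from that store instead of calling the API server. This is not the code in this PR. The names `podInfo`, `podCache`, `onAddOrUpdate`, `onDelete`, and `lookup` are hypothetical, and a real implementation would presumably register handlers on a client-go shared informer rather than call them by hand.

```go
package main

import (
	"fmt"
	"sync"
)

// podInfo holds the per-pod metadata attached to metrics.
// Hypothetical type for this sketch.
type podInfo struct {
	UID    string
	Labels map[string]string
}

// podCache stands in for an informer's local store: event handlers write
// to it, and scrapes read from it without contacting the API server.
type podCache struct {
	mu   sync.RWMutex
	pods map[string]podInfo // keyed by "namespace/name"
}

// newPodCache allocates the map up front; writing to a nil map panics.
func newPodCache() *podCache {
	return &podCache{pods: make(map[string]podInfo)}
}

func cacheKey(namespace, name string) string {
	return namespace + "/" + name
}

// onAddOrUpdate plays the role of an informer's add/update handler.
func (c *podCache) onAddOrUpdate(namespace, name string, info podInfo) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pods[cacheKey(namespace, name)] = info
}

// onDelete plays the role of an informer's delete handler.
func (c *podCache) onDelete(namespace, name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.pods, cacheKey(namespace, name))
}

// lookup is what a scrape would call; it only reads in-process memory.
func (c *podCache) lookup(namespace, name string) (podInfo, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	info, ok := c.pods[cacheKey(namespace, name)]
	return info, ok
}

func main() {
	cache := newPodCache()

	// One event populates the cache (in practice, delivered by a watch).
	cache.onAddOrUpdate("default", "trainer-0", podInfo{
		UID:    "uid-123",
		Labels: map[string]string{"app": "trainer"},
	})

	// Any number of scrapes can now be served without API requests.
	for i := 0; i < 3; i++ {
		if info, ok := cache.lookup("default", "trainer-0"); ok {
			fmt.Println(info.UID, info.Labels["app"])
		}
	}
}
```

With this shape, the number of API server requests depends on how often pods change rather than on the scrape interval, which is why a 1s scrape interval no longer runs into the client-side rate limiter.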