
Conversation

yanhaoluo666

Issue

Enabling high-frequency GPU metrics currently requires customers to make changes in two places: the CloudWatch agent config and the dcgmexporter config. This is complicated and hard for add-on customers to do.

Description of changes:

Add a helper function that checks for the presence of accelerated_compute_gpu_metrics_collection_interval; if it is present and its value is less than 60, add DCGM_EXPORTER_INTERVAL: 1000 to the dcgmexporter env.
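For reference, a minimal sketch of the kind of template helper this describes. The agent-config key path and the comparison against 60 follow the hunk quoted further down in this conversation; the helper name, the .Values path used to reach the agent config, and the int cast are illustrative assumptions, not the exact chart code.

```yaml
{{/* Hypothetical helper; names and the .Values path are illustrative, not the exact chart code. */}}
{{- define "dcgm-exporter.highFrequencyEnv" -}}
{{- $intervalFound := false -}}
{{- $intervalValue := 60 -}}
{{- /* assumed location of the embedded CloudWatch agent config */ -}}
{{- $agentConfig := .Values.agent.config -}}
{{- with $agentConfig -}}
{{- with .logs -}}
{{- with .metrics_collected -}}
{{- with .kubernetes -}}
{{- if hasKey . "accelerated_compute_gpu_metrics_collection_interval" -}}
{{- $intervalFound = true -}}
{{- $intervalValue = .accelerated_compute_gpu_metrics_collection_interval -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- end -}}
{{- /* cast before comparing, since numbers parsed from values.yaml may arrive as floats */ -}}
{{- if and $intervalFound (lt ($intervalValue | int) 60) }}
- name: DCGM_EXPORTER_INTERVAL
  value: "1000"
{{- end -}}
{{- end -}}
```

Such a helper could then be consumed under the dcgmexporter container's env: block via include piped through nindent at the appropriate depth; where exactly it is wired in depends on the chart's pod template.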

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing

  1. Use the default helm-chart configs and deploy dcgmexporter; confirmed DCGM_EXPORTER_INTERVAL: 1000 was not in its env.
  2. Add accelerated_compute_gpu_metrics_collection_interval: 1 to values.yaml and deploy the pod; confirmed DCGM_EXPORTER_INTERVAL was present in its env (see the values.yaml sketch after this list).
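For step 2, the override used was of this shape. A minimal sketch: the nested interval key comes from this PR, while the enclosing agent/config nesting is an assumption about the chart's values layout and may be named differently.

```yaml
# Hypothetical values.yaml override for the high-frequency test case;
# only accelerated_compute_gpu_metrics_collection_interval is taken from the PR,
# the surrounding agent.config nesting is assumed.
agent:
  config:
    logs:
      metrics_collected:
        kubernetes:
          accelerated_compute_gpu_metrics_collection_interval: 1
```

After redeploying, the container env can be inspected with a command such as kubectl get pod <dcgmexporter-pod> -o jsonpath='{.spec.containers[*].env}' to confirm whether DCGM_EXPORTER_INTERVAL is present.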

@yanhaoluo666 force-pushed the feature/gpu-metrics-high-sampling branch from 360d772 to ee036c9 on October 14, 2025 17:58
{{- $intervalValue = $agentConfig.logs.metrics_collected.kubernetes.accelerated_compute_gpu_metrics_collection_interval -}}
{{- end -}}
{{- end -}}
{{- if and $intervalFound (lt $intervalValue 60) -}}
Review comment on the hunk above:

Let's await the results from internal testing before we push this change to 1 sec; we want to ideally limit any impact to the underlying hardware.

yanhaoluo666 (author): Yes, will also connect with agent folks to get their thoughts.

…etrics_collection_interval is present and less than 60
@yanhaoluo666 force-pushed the feature/gpu-metrics-high-sampling branch from ee036c9 to 20d1be5 on October 17, 2025 11:39