[dashboard] Rework dashboard (MIG support, Grafana deprecations, Hostname)#355
frittentheke wants to merge 1 commit into NVIDIA:main
…name)

* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_FB_FREE instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (~ PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)

Fixes: NVIDIA#353, NVIDIA#236

Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
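To illustrate the kind of aggregated query the first and third bullets describe, here is a minimal sketch in Python that only assembles a PromQL expression string. It is not taken from the dashboard JSON in this PR: the helper name `build_query`, the Grafana variables `$Hostname` and `$gpu`, the `gpu` grouping label, and the choice of `sum` / `avg` as aggregation are all assumptions made for illustration.

```python
def build_query(metric, agg="sum", group_by=("Hostname", "gpu"),
                hostname_var="$Hostname", gpu_var="$gpu"):
    """Return a PromQL expression that aggregates `metric` per host and GPU.

    Aggregating by (Hostname, gpu) collapses the per-MIG-instance series
    of one physical GPU into a single series and drops the pod-specific
    `instance` label. Label and variable names here are assumptions.
    """
    # Label matchers select the hosts and GPUs picked in the dashboard.
    selector = f'{metric}{{Hostname=~"{hostname_var}", gpu=~"{gpu_var}"}}'
    # "<agg> by (<labels>) (<selector>)" is standard PromQL aggregation syntax.
    return f'{agg} by ({", ".join(group_by)}) ({selector})'


if __name__ == "__main__":
    # Free framebuffer memory, summed over the MIG subdevices of each GPU.
    print(build_query("DCGM_FI_DEV_FB_FREE"))
    # Graphics engine activity, averaged per GPU (aggregation is an assumption).
    print(build_query("DCGM_FI_PROF_GR_ENGINE_ACTIVE", agg="avg"))
```

Which aggregation is appropriate depends on the metric, so the operator is left as a parameter rather than fixed.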
26f0677 to a52c9c0
Is there any news on this PR in terms of merging?
^^ @rohit-arora-dev @glowkey ?
I just took a moment to test these changes on an 8-GPU system with MIG enabled, and unfortunately the panels were empty. I'm far from a Grafana expert, so it's hard for me to tell what was going wrong. I did confirm that without the changes the panels displayed the expected data.
Thanks for looking at my PR, @glowkey.
For the GPU utilization, I think the old way may be better: the prof module is proprietary and only supported on a small number of GPUs / configurations. The other changes seem great, though. Regarding #380, it looks like most of the Prometheus metrics are not really reliable enough to be used as the sole source of labels in all the different situations the exporter is used in.
Thanks @frittentheke! I've tried it and it is definitely a big improvement. I can see all MIG subdevices, and changing to hostnames is much more intuitive.
Running into various issues with the dashboard (see #353), I started reworking the existing board.
This PR combines all my cleanups and fixes. It also includes the changes from PR #240 by @Levi080513.
Fixes: #353, #236