[dashboard] Rework dashboard (MIG support, Grafana deprecations, Hostname) #355

Open
frittentheke wants to merge 1 commit into NVIDIA:main from frittentheke:dashboardRework_353

frittentheke commented Jul 8, 2024

After running into various issues with the dashboard (see #353), I started reworking the existing board.
This PR combines all my cleanups and fixes. It also includes the changes from PR #240 by @Levi080513.

Fixes: #353, #236

…name)

* Use PromQL aggregations to take MIG subdevices into account (see NVIDIA#353)
* Update all panels to use Timeseries panels (instead of deprecated Graph)
* Switch from instance to Hostname to select individual systems to avoid
  duplicated timeseries for Kubernetes daemonsets and their Pod names
* Use DCGM_FI_DEV_FB_FREE instead of DCGM_FI_DEV_GPU_TEMP to also cover vGPU (~ PR NVIDIA#240)
* Use DCGM_FI_PROF_GR_ENGINE_ACTIVE to determine utilization to cover MIG (and vGPU)

Fixes: NVIDIA#353, NVIDIA#236

Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de>
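
The first bullet of the commit message (aggregating over MIG subdevices) could look roughly like the query below. This is only a sketch, not the query from the PR: it assumes the exporter attaches `Hostname`, `gpu`, and `GPU_I_ID` labels to the series, that `DCGM_FI_DEV_FB_USED` is enabled in the exporter's metrics configuration, and that the dashboard defines a `$Hostname` template variable.

```promql
# Sketch under assumed labels and an assumed $Hostname variable
# (not taken from the PR).
# With MIG enabled, each MIG instance is expected to show up as its own
# series, distinguished by the GPU_I_ID label. Summing by (Hostname, gpu)
# collapses those series into one per physical GPU.
sum by (Hostname, gpu) (
  DCGM_FI_DEV_FB_USED{Hostname=~"$Hostname"}
)
```

Filtering on `Hostname` rather than `instance` follows the third bullet: the series stay tied to the node instead of to the Pod name of the Kubernetes daemonset.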

SohamG commented Feb 20, 2025

Is there any news on this PR in terms of merging?

frittentheke (Author) commented

^^ @rohit-arora-dev @glowkey ?

glowkey (Collaborator) commented Mar 5, 2025

I just took a moment to test these changes on an 8-GPU system with MIG enabled, and unfortunately the panels were empty. I'm far from a Grafana expert, so it's hard for me to know what went wrong. I did confirm that without the changes the panels displayed the expected data.

frittentheke (Author) commented Mar 6, 2025

> I just took a moment to test these changes on an 8 GPU system with MIG enabled and unfortunately the panels were empty. I'm far from a Grafana expert so it's hard for me to know what was going wrong. I did confirm that without the changes the panels displayed the expected data.

Thanks for looking at my PR, @glowkey.
Some more details on which graph is empty, and which PromQL query fails to return data, would be great.

kristiangronas commented

For the GPU Util panel I think the old way may be better: the prof module is proprietary and only supported on a small number of GPUs and configurations.

The other changes seem great, though.

See #380: it looks like most of the prom metrics are not reliable enough to be used as the sole source of labels in all the different situations the exporter is used in.

ermitovski commented

Thanks @frittentheke! I've tried it and it is definitely a big improvement. I can see all MIG subdevices, and selecting by hostname is much more intuitive.

Development

Successfully merging this pull request may close these issues.

Duplicated, missing or wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values

5 participants