
GPU ID always returns 0 when getting per-GPU process info via dcgm.GetProcessInfo() #64

Description

@berkaroad

I ran benchmarks on 2 GPUs and compared the output of ./processInfo -pid 203639 against nvidia-smi.

The 'GPU ID' values from ./processInfo -pid 203639 are GPU-0, GPU-0, but nvidia-smi shows the process running on GPU-0 and GPU-1.

python3 ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
        --forward_only \
        --batch_size=16 \
        --model=resnet50  \
        --num_gpus=2 \
        --num_batches=500000 \
        --num_warmup_batches=10 \
        --data_name=imagenet \
        --allow_growth=True
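
For context, the call path follows the samples/processInfo flow. Below is a minimal sketch of it (error handling trimmed; the ProcessInfo field names are as I recall them from the package and may differ slightly):

package main

import (
	"flag"
	"log"
	"time"

	"github.com/NVIDIA/go-dcgm/pkg/dcgm"
)

func main() {
	pid := flag.Uint("pid", 0, "PID of the process to watch")
	flag.Parse()

	// Start an embedded DCGM hostengine in-process.
	cleanup, err := dcgm.Init(dcgm.Embedded)
	if err != nil {
		log.Fatal(err)
	}
	defer cleanup()

	// Enable DCGM process watches, then give them time to collect samples.
	group, err := dcgm.WatchPidFields()
	if err != nil {
		log.Fatal(err)
	}
	time.Sleep(3 * time.Second)

	// One ProcessInfo entry is returned per GPU the process ran on.
	infos, err := dcgm.GetProcessInfo(group, *pid)
	if err != nil {
		log.Fatal(err)
	}
	for _, info := range infos {
		// info.GPU is 0 for both entries here, although nvidia-smi
		// shows the process on GPU 0 and GPU 1.
		log.Printf("GPU ID: %d  PID: %d", info.GPU, info.PID)
	}
}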
root@k8s-node1:~/go-dcgm/samples/processInfo# ./processInfo -pid 203639
2024/04/07 11:51:51 Enabling DCGM watches to start collecting process stats. This may take a few seconds....
----------------------------------------------------------------------
GPU ID			     : 0
----------Execution Stats---------------------------------------------
PID                          : 203639
Name                         : tf_cnn_benchmar
Start Time                   : 2024-04-03 20:29:37 +0800 CST
End Time                     : Running
----------Performance Stats-------------------------------------------
Energy Consumed (Joules)     : 0
Max GPU Memory Used (bytes)  : 5453643776
Avg SM Clock (MHz)           : 1590
Avg Memory Clock (MHz)       : 5000
Avg SM Utilization (%)       : 21
Avg Memory Utilization (%)   : 16
Avg PCIe Rx Bandwidth (MB)   : 9223372036854775792
Avg PCIe Tx Bandwidth (MB)   : 9223372036854775792
----------Event Stats-------------------------------------------------
Single Bit ECC Errors        : N/A
Double Bit ECC Errors        : N/A
Critical XID Errors          : 0
----------Slowdown Stats----------------------------------------------
Due to - Power (%)           : 0
       - Thermal (%)         : 0
       - Reliability (%)     : 9223372036854775792
       - Board Limit (%)     : 9223372036854775792
       - Low Utilization (%) : 9223372036854775792
       - Sync Boost (%)      : 0
----------Process Utilization-----------------------------------------
Avg SM Utilization (%)       : 48
Avg Memory Utilization (%)   : 38
----------------------------------------------------------------------
----------------------------------------------------------------------
GPU ID			     : 0
----------Execution Stats---------------------------------------------
PID                          : 203639
Name                         : tf_cnn_benchmar
Start Time                   : 2024-04-03 20:29:37 +0800 CST
End Time                     : Running
----------Performance Stats-------------------------------------------
Energy Consumed (Joules)     : 0
Max GPU Memory Used (bytes)  : 227540992
Avg SM Clock (MHz)           : 585
Avg Memory Clock (MHz)       : 5000
Avg SM Utilization (%)       : N/A
Avg Memory Utilization (%)   : N/A
Avg PCIe Rx Bandwidth (MB)   : 9223372036854775792
Avg PCIe Tx Bandwidth (MB)   : 9223372036854775792
----------Event Stats-------------------------------------------------
Single Bit ECC Errors        : N/A
Double Bit ECC Errors        : N/A
Critical XID Errors          : 0
----------Slowdown Stats----------------------------------------------
Due to - Power (%)           : 0
       - Thermal (%)         : 0
       - Reliability (%)     : 9223372036854775792
       - Board Limit (%)     : 9223372036854775792
       - Low Utilization (%) : 9223372036854775792
       - Sync Boost (%)      : 0
----------Process Utilization-----------------------------------------
Avg SM Utilization (%)       : 0
Avg Memory Utilization (%)   : 0
----------------------------------------------------------------------
root@k8s-node1:~/go-dcgm/samples/processInfo# nvidia-smi 
Sun Apr  7 11:52:05 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:07.0 Off |                    0 |
| N/A   64C    P0    71W /  70W |   5204MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:08.0 Off |                    0 |
| N/A   43C    P0    27W /  70W |    220MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    203639      C   python3                          5201MiB |
|    1   N/A  N/A    203639      C   python3                           217MiB |
+-----------------------------------------------------------------------------+
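
Note that the second stats block above clearly corresponds to the process on GPU 1: its Max GPU Memory Used (227540992 bytes = 217 MiB) matches nvidia-smi's entry for GPU 1, just as the first block's 5453643776 bytes = 5201 MiB matches GPU 0. Yet both blocks report GPU ID: 0. (The 9223372036854775792 readings appear to be DCGM's INT64 blank sentinel, 0x7ffffffffffffff0, i.e. "no data", rather than real values.)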
