Skip to content

feat: allow watch pid with pre-created group.#60

Open
rootfs wants to merge 1 commit intoNVIDIA:mainfrom
rootfs:main
Open

feat: allow watch pid with pre-created group.#60
rootfs wants to merge 1 commit intoNVIDIA:mainfrom
rootfs:main

Conversation

@rootfs
Copy link

@rootfs rootfs commented Feb 23, 2024

Fix #59

The existing WatchPidFieldsEx API always creates device group. If the MIG is enabled, the device group doesn't capture the MIG device and may fail to watch processes there.

The PR allows a pre-created device to pass into the WatchPidFieldsWithGroup API.

I am also adding a sample test file here, but for certain reason I am not able to run it successfully:

# go run samples/processInfo/watchPidWithGroup/main.go 
2024/02/23 16:24:11 Created device group "dg-20240223162411"
2024/02/23 16:24:11 found 2 supported devices
2024/02/23 16:24:11 Adding device 0 to group "dg-20240223162411"
2024/02/23 16:24:11 Adding device 1 to group "dg-20240223162411"
2024/02/23 16:24:11 Adding GPU ID[0] MIG Device[0] GPU Device Index[0] Instance ID[13]: err <nil>
2024/02/23 16:24:11 Adding GPU ID[0] MIG Device[1] GPU Device Index[0] Instance ID[5]: err <nil>
2024/02/23 16:24:11 Adding GPU ID[0] MIG Device[2] GPU Device Index[0] Instance ID[1]: err <nil>
2024/02/23 16:24:12 DCGM pid watcher set up successfully
2024/02/23 16:24:12 DCGM initialized successfully
2024/02/23 16:24:17 failed to get process info: Bad parameter passed to function
2024/02/23 16:24:22 failed to get process info: Bad parameter passed to function

In fact, the existing sample also failed with the same error:

# go run samples/processInfo/main.go 
2024/02/23 16:23:17 Enabling DCGM watches to start collecting process stats. This may take a few seconds....
2024/02/23 16:23:20 Bad parameter passed to function
panic: Bad parameter passed to function

Signed-off-by: Huamin Chen <hchen@redhat.com>
@rohit-arora-dev
Copy link
Collaborator

Fix #59

The existing WatchPidFieldsEx API always creates device group. If the MIG is enabled, the device group doesn't capture the MIG device and may fail to watch processes there.

The PR allows a pre-created device to pass into the WatchPidFieldsWithGroup API.

I am also adding a sample test file here, but for certain reason I am not able to run it successfully:

# go run samples/processInfo/watchPidWithGroup/main.go 
2024/02/23 16:24:11 Created device group "dg-20240223162411"
2024/02/23 16:24:11 found 2 supported devices
2024/02/23 16:24:11 Adding device 0 to group "dg-20240223162411"
2024/02/23 16:24:11 Adding device 1 to group "dg-20240223162411"
2024/02/23 16:24:11 Adding GPU ID[0] MIG Device[0] GPU Device Index[0] Instance ID[13]: err <nil>
2024/02/23 16:24:11 Adding GPU ID[0] MIG Device[1] GPU Device Index[0] Instance ID[5]: err <nil>
2024/02/23 16:24:11 Adding GPU ID[0] MIG Device[2] GPU Device Index[0] Instance ID[1]: err <nil>
2024/02/23 16:24:12 DCGM pid watcher set up successfully
2024/02/23 16:24:12 DCGM initialized successfully
2024/02/23 16:24:17 failed to get process info: Bad parameter passed to function
2024/02/23 16:24:22 failed to get process info: Bad parameter passed to function

In fact, the existing sample also failed with the same error:

# go run samples/processInfo/main.go 
2024/02/23 16:23:17 Enabling DCGM watches to start collecting process stats. This may take a few seconds....
2024/02/23 16:23:20 Bad parameter passed to function
panic: Bad parameter passed to function

@rootfs

Thank you for your contribution.

The reason sample test is failing because the go run argument for existing sample in samples/processInfo/main.go and new sample in samples/processInfo/watchPidWithGroup/main.go expects PID to be supplied e.g. go run main.go --pid 1234. Please add a similar note like this in your sample as well.

Please share output with a successful run on a MIG device and suggested changes.

Comment on lines +139 to +141
for _, gpu := range pidInfo {
log.Printf("gpu %d, process info: %+v\n", gpu.GPU, gpu)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider using template for output same as samples/processInfo/main.go for consistency.

// create a tick to watch the process every 5 seconds
ticker := time.NewTicker(5 * time.Second)
defer ticker.Stop()
for range ticker.C {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider running it once or twice but not every 5 seconds.

if err != nil {
return fmt.Errorf("failed to find supported devices: %v", err)
}
log.Printf("found %d supported devices\n", len(supportedDeviceIndices))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:
s/found/Found

as it is not an error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

provide more flexible WatchPidFields API

2 participants