feat: Add mock servers for DGX H100, H200, and B200 systems #163
fabiendupont wants to merge 6 commits into NVIDIA:main
Conversation
Pull Request Overview
This PR adds comprehensive mock implementations for three additional NVIDIA GPU architectures (H100, H200, and B200) to support multi-architecture testing without requiring physical hardware.
- Implements complete mock servers for DGX H100, H200, and B200 systems with 8-GPU configurations
- Provides architecture-specific MIG profiles with realistic memory allocations and compute capabilities
- Adds fabric management integration and enhanced topology features for newer GPU architectures
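To make the testing use case above concrete, here is a hedged sketch of how a consumer's unit test might exercise one of the new mocks. It assumes the dgxh100 package exposes a New() constructor analogous to the existing dgxa100.New() and wires Init/Shutdown/DeviceGetCount the same way; the test name and expectations are illustrative only.

```go
package example_test

import (
	"testing"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
	"github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxh100" // package added by this PR
)

// TestMockH100DeviceCount exercises the H100 mock the way a consumer would,
// without requiring physical hardware.
func TestMockH100DeviceCount(t *testing.T) {
	var driver nvml.Interface = dgxh100.New()

	if ret := driver.Init(); ret != nvml.SUCCESS {
		t.Fatalf("Init failed: %v", ret)
	}
	defer driver.Shutdown()

	count, ret := driver.DeviceGetCount()
	if ret != nvml.SUCCESS {
		t.Fatalf("DeviceGetCount failed: %v", ret)
	}
	if count != 8 {
		t.Fatalf("expected 8 GPUs, got %d", count)
	}
}
```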
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Summary per file:
| File | Description |
|---|---|
| pkg/nvml/mock/dgxh100/dgxh100.go | Main mock server implementation for H100 with Hopper architecture (80GB memory, 9.0 compute capability) |
| pkg/nvml/mock/dgxh100/mig-profile.go | H100-specific MIG profiles with enhanced memory allocations compared to A100 |
| pkg/nvml/mock/dgxh200/dgxh200.go | H200 mock server with enhanced memory (141GB) and improved fabric capabilities |
| pkg/nvml/mock/dgxh200/mig-profile.go | H200 MIG profiles with 76% more memory than H100 profiles |
| pkg/nvml/mock/dgxb200/dgxb200.go | B200 mock server with Blackwell architecture (192GB memory, 10.0 compute capability) |
| pkg/nvml/mock/dgxb200/mig-profile.go | B200 MIG profiles with massive memory allocations for next-generation AI workloads |
Force-pushed from e892f1c to 9252b27.
ArangoGutierrez left a comment:
First pass looks good, I'll run some manual tests and provide feedback
Force-pushed from 9252b27 to 697c1d4.
pkg/nvml/mock/dgxb200/dgxb200.go (outdated):

    type Server struct {
        mock.Interface
        mock.ExtendedInterface
        Devices [8]nvml.Device
Don't the GB200 nodes have only 4 devices? Or is each addressable as 2?
The DGX B200 user guide lists "GPU: 8 x NVIDIA B200 GPUs that provide 1,440 GB total GPU memory", as read in https://docs.nvidia.com/dgx/dgxb200-user-guide/dgxb200-user-guide.pdf
I think the difference here is that this is a B200 and not a GB200 -- the latter has 4.
Ah yeah, but here (this PR) is a B200
pkg/nvml/mock/dgxb200/dgxb200.go (outdated):

        return device
    }

    func (d *Device) GetUUID() (string, nvml.Return) {
Why do we not set the mock functions as we do for the a100?
Agreed, @fabiendupont. Let's do as in https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/mock/dgxa100/dgxa100.go#L195 and https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/mock/dgxa100/dgxa100.go#L136 for all three of B200, H100, and H200.
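For reference, a minimal sketch of the dgxa100-style pattern being referenced, assuming the new packages embed the same generated mock device type and store the same fields (UUID, Name, MemoryInfo) as the A100 mock; hook and helper names below follow the A100 implementation and are assumed to carry over.

```go
// Sketch of the dgxa100-style approach: wire the generated mock hooks so
// NVML calls on the mock Device return the values stored on the struct.
func (d *Device) setMockFuncs() {
	d.GetUUIDFunc = func() (string, nvml.Return) {
		return d.UUID, nvml.SUCCESS
	}
	d.GetNameFunc = func() (string, nvml.Return) {
		return d.Name, nvml.SUCCESS
	}
	d.GetMemoryInfoFunc = func() (nvml.Memory, nvml.Return) {
		return d.MemoryInfo, nvml.SUCCESS
	}
}
```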
pkg/nvml/mock/dgxb200/dgxb200.go (outdated):

        CudaDriverVersion int
    }

    type Device struct {
Question: why do we need a new type? Are the properties not expected to be the same across different devices?
I think @fabiendupont's intent was to create a folder for each GPU type (given that the current A100 implementation is a folder named dgxa100). If we are going to add more GPU types, then we should rename that folder to a generic name; that way we don't need to duplicate this much code, just a file per GPU type and MIG profile, reusing a core set of defined variables.
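Hypothetically, that could look like the sketch below: one shared Config type plus one small file per GPU model. The Config fields and the gpus package here are illustrative only, not the API that later lands in this PR.

```go
// Package gpus would hold one small definition per GPU model, consumed by a
// generic mock-server factory. All names and fields below are illustrative.
package gpus

type Config struct {
	Name                  string
	Architecture          string
	MemoryBytes           uint64
	CudaComputeCapability string
}

var H100_SXM5_80GB = Config{
	Name:                  "NVIDIA H100 80GB HBM3",
	Architecture:          "Hopper",
	MemoryBytes:           80 * 1024 * 1024 * 1024,
	CudaComputeCapability: "9.0",
}
```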
Establish baseline tests for the original A100 implementation before refactoring to shared architecture. This enables proper TDD approach where we can detect regressions during refactoring. Tests cover:
- Server creation and basic properties
- Device handling and indexing (8 devices)
- Device properties (name, architecture, memory, etc.)
- Device access by UUID and PCI bus ID
- MIG mode operations
- MIG profile configurations and access
- GPU instance lifecycle and placements
- Compute instance lifecycle
- Init/shutdown behavior
- Multiple device uniqueness
- A100-specific characteristics

All tests pass with the existing implementation at fd3e42f.

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Pull Request Overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
pkg/nvml/mock/dgxh200/dgxh200.go (outdated):

    MigMode: nvml.DEVICE_MIG_ENABLE,
    GpuInstances: make(map[*GpuInstance]struct{}),
    GpuInstanceCounter: 0,
    MemoryInfo: nvml.Memory{Total: 151397597184}, // 141GB for H200 (vs 80GB for H100)
The hard-coded value is correct (151397597184 bytes is exactly 141 * 1024^3, i.e. 141GB), but it would be clearer to write it as a calculation.

Suggested change:

    -MemoryInfo: nvml.Memory{Total: 151397597184}, // 141GB for H200 (vs 80GB for H100)
    +MemoryInfo: nvml.Memory{Total: 141 * 1024 * 1024 * 1024}, // 141GB for H200 (vs 80GB for H100)
pkg/nvml/mock/dgxb200/dgxb200.go (outdated):

    MigMode: nvml.DEVICE_MIG_ENABLE,
    GpuInstances: make(map[*GpuInstance]struct{}),
    GpuInstanceCounter: 0,
    MemoryInfo: nvml.Memory{Total: 206158430208}, // 192GB for B200 (massive memory)
The value is correct (206158430208 bytes is exactly 192 * 1024^3), but it should be written as a calculation for consistency.

Suggested change:

    -MemoryInfo: nvml.Memory{Total: 206158430208}, // 192GB for B200 (massive memory)
    +MemoryInfo: nvml.Memory{Total: 192 * 1024 * 1024 * 1024}, // 192GB for B200 (massive memory)
pkg/nvml/mock/dgxh100/dgxh100.go (outdated):

    MigMode: nvml.DEVICE_MIG_ENABLE,
    GpuInstances: make(map[*GpuInstance]struct{}),
    GpuInstanceCounter: 0,
    MemoryInfo: nvml.Memory{Total: 85899345920}, // 80GB for H100
The hard-coded value is correct (85899345920 bytes is exactly 80 * 1024^3, i.e. 80GB), but it would be clearer to write it as a calculation.

Suggested change:

    -MemoryInfo: nvml.Memory{Total: 85899345920}, // 80GB for H100
    +MemoryInfo: nvml.Memory{Total: 80 * 1024 * 1024 * 1024}, // 80GB for H100
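One way to keep all three totals consistent is a shared byte-size constant. A minimal sketch, assuming nvml.Memory.Total is a byte count; the GiB constant is illustrative and not something defined in this PR.

```go
package mock

import "github.com/NVIDIA/go-nvml/pkg/nvml"

// GiB is an illustrative helper so each mock can express its memory size as
// a calculation rather than a raw literal.
const GiB uint64 = 1024 * 1024 * 1024

var (
	h100Memory = nvml.Memory{Total: 80 * GiB}  // 85899345920 bytes
	h200Memory = nvml.Memory{Total: 141 * GiB} // 151397597184 bytes
	b200Memory = nvml.Memory{Total: 192 * GiB} // 206158430208 bytes
)
```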
- Implement shared factory system in pkg/nvml/mock/shared/ to eliminate code duplication
- Add comprehensive A30 GPU configurations with MIG profiles (56 SMs, 1/2/4-slice support)
- Refactor dgxa100 to use shared factory while maintaining backward compatibility
- Create modular GPU configurations in shared/gpus/ for A100 and A30 families
- Add comprehensive documentation covering architecture and usage examples
- Maintain thread safety and proper NVML return codes
- Support all A100 variants (SXM4 40GB/80GB, PCIe 40GB/80GB) and A30 PCIe 24GB

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Implements DGX H100 and H200 GPU mocks following the established shared factory pattern for consistency with existing A100/A30 implementations.
- Add H100 SXM5 80GB configuration with complete MIG profile support
- Add H200 SXM5 141GB configuration with complete MIG profile support
- Implement dgxh100 and dgxh200 packages using shared factory pattern
- Include all 7 MIG profiles (standard, REV1 media, REV2 double memory)
- Maintain full backward compatibility with legacy globals and type aliases
- Use NVIDIA-spec compliant memory allocations and SM distributions

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Implements DGX B200 mock following the established shared factory pattern:
- Add B200 SXM5 180GB GPU configuration with Blackwell architecture
- Comprehensive MIG profiles matching NVIDIA specifications:
  * Memory allocations: 23GB, 45GB, 90GB, 180GB per NVIDIA MIG User Guide
  * REV1 (media extensions) and REV2 (expanded memory) profiles
  * Full P2P support in MIG mode (IsP2pSupported: 1)
  * 144 SMs total with 18 SMs per slice
- Complete DGX B200 server implementation with 8 GPUs
- Driver version 560.28.03, NVML 12.560.28.03, CUDA 12060
- Comprehensive test suite covering server, device, and MIG operations
- Backward compatible legacy global variables (MIGProfiles, MIGPlacements)

Memory values corrected from initial 192GB to 180GB based on official NVIDIA MIG User Guide specifications.

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Force-pushed from 697c1d4 to 51f3c75.
Added mock tests, to have basic regression testing.
Pull Request Overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
    },
    nvml.GPU_INSTANCE_PROFILE_1_SLICE_REV1: {
        nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE: {
            Id: nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE_REV1,
The compute instance profile ID should be nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE not nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE_REV1. The REV1 suffix applies to GPU instance profiles, not compute instance profiles.
Suggested change:

    -Id: nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE_REV1,
    +Id: nvml.COMPUTE_INSTANCE_PROFILE_1_SLICE,
…ntation

Updates the mock framework documentation to include all GPU generations:
- Add H100, H200, and B200 to architecture diagram and file structure
- Document all GPU specifications:
  * H100 SXM5 80GB (Hopper, 132 SMs, CUDA 9.0, P2P MIG support)
  * H200 SXM5 141GB (Hopper, 132 SMs, CUDA 9.0, P2P MIG support)
  * B200 SXM5 180GB (Blackwell, 144 SMs, CUDA 10.0, P2P MIG support)
- Add complete server model documentation with driver versions
- Expand MIG support section with detailed SM allocations and P2P capabilities
- Update usage examples to include all four GPU generations
- Add comprehensive testing instructions for all mock implementations
- Update backward compatibility section to reflect all generations

The documentation now accurately reflects the complete shared factory implementation with comprehensive coverage of Ampere, Hopper, and Blackwell GPU architectures.

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Force-pushed from 51f3c75 to 30e5dae.
elezar left a comment:
Some quick thoughts on this. I will make a more critical pass later this week.
    require.Implements(t, (*nvml.Interface)(nil), server)
    require.Implements(t, (*nvml.ExtendedInterface)(nil), server)
We can test these at compile time with something like:

    var _ nvml.Interface = (*server)(nil)
    var _ nvml.ExtendedInterface = (*server)(nil)
    Add new configurations to the appropriate file in `shared/gpus/`:
    ```go
    var A100_PCIE_24GB = shared.Config{
nit: Do we want a more go-like name?
    ```
    pkg/nvml/mock/
    ├── shared/
Should shared be internal? Do we expect that users interact with this package?
Moved it to internal. Users still have an external API through package-specific wrappers.
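A hedged sketch of what such a package-specific wrapper could look like after the move. The internal import paths are assumptions about the final layout; the NewServerFromConfig/ServerConfig names and the H100_SXM5_80GB config follow the snippet shown later in this review.

```go
// Package dgxh100 stays the public entry point; the factory lives under
// internal/ so users never import the shared implementation directly.
package dgxh100

import (
	"github.com/NVIDIA/go-nvml/pkg/nvml/mock/internal/shared"      // assumed path after the move
	"github.com/NVIDIA/go-nvml/pkg/nvml/mock/internal/shared/gpus" // assumed path after the move
)

// New returns a mock DGX H100 server built from the shared factory.
func New() *shared.Server {
	return shared.NewServerFromConfig(shared.ServerConfig{
		Config:   gpus.H100_SXM5_80GB,
		GPUCount: 8,
	})
}
```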
        PciDeviceId: config.PciDeviceId,
        MIGProfiles: config.MIGProfiles,
    }
    device.SetMockFuncs()
Note: Since a device is only functional after setting these mocks, should we be exposing it as a public concrete type?
pkg/nvml/mock/dgxh100/dgxh100.go (outdated):

    type Server = shared.Server
    type Device = shared.Device
    type GpuInstance = shared.GpuInstance
    type ComputeInstance = shared.ComputeInstance
    type CudaComputeCapability = shared.CudaComputeCapability
Since this is a new package, we probably don't need backward compatible types.
    func New() *Server {
        return shared.NewServerFromConfig(shared.ServerConfig{
            Config: gpus.H100_SXM5_80GB,
            GPUCount: 8,
Question: Instead of specifying the count, why not pass in a list of devices? This will also make it simpler to handle heterogenous configs.
Great point. Modified to passing a list and added test for heterogeneous GPUs.
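As a rough illustration of the change described here, a minimal sketch of a variadic constructor. The NewServerWithGPUs name and the gpus.A100_* configs come from the follow-up commit message; Config, the Devices field handling, and newDeviceFromConfig are hypothetical.

```go
// NewServerWithGPUs builds a mock server from an explicit, possibly
// heterogeneous, list of GPU configurations instead of a fixed count.
func NewServerWithGPUs(configs ...Config) *Server {
	server := &Server{
		Devices: make([]nvml.Device, 0, len(configs)),
	}
	for i, cfg := range configs {
		// newDeviceFromConfig is a hypothetical helper that builds one mock
		// device (UUID, PCI bus ID, memory, ...) from its configuration.
		server.Devices = append(server.Devices, newDeviceFromConfig(cfg, i))
	}
	return server
}
```

Usage, per the commit message: `NewServerWithGPUs(gpus.A100_SXM4_40GB, gpus.A100_SXM4_80GB)`.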
This commit addresses feedback from PR review #3356107365 by @elezar, implementing architectural improvements and code quality enhancements.
- Added compile-time interface assertions to all test files
- `var _ nvml.Interface = (*shared.Server)(nil)` pattern
- Catches interface compliance issues at compile time vs runtime
- Added missing trailing newline to dgxa100_test.go
- Removed unnecessary type aliases from new packages (dgxh100, dgxh200, dgxb200)
- New packages now use shared.Server directly instead of local aliases
- Moved pkg/nvml/mock/shared → pkg/nvml/mock/internal/shared
- Hides implementation details from external users
- Public API remains unchanged through package-specific wrappers
- Reduces field duplication (Name, Brand, Architecture, etc.)
- Clearer relationship between configuration and device
- Fields now accessed via d.Config.Name, d.Config.Brand, etc.
- Removed 5+ duplicate fields from Device struct
- Changed `Devices [8]nvml.Device` → `Devices []nvml.Device`
- Removes hardcoded 8-device limit
- Enables flexible GPU counts and heterogeneous configurations
- `const GBtoMB = 1024` for clear memory calculations
- Improves readability: `24 * GBtoMB` vs `24576`
- Implemented `NewServerWithGPUs(...Config)` with variadic parameters
- Supports mixed GPU types in single server
- Example: `NewServerWithGPUs(gpus.A100_SXM4_40GB, gpus.A100_SXM4_80GB)`
- Added comprehensive test: TestHeterogeneousGPUs
- Added deprecation comments to backward-compatible type aliases
- Deprecated single-GPU functions in favor of flexible alternatives
- Maintains backward compatibility while guiding users to better APIs
- All existing tests pass (dgxa100, dgxh100, dgxh200, dgxb200)
- Added TestHeterogeneousGPUs demonstrating mixed GPU configurations
- No breaking changes to public API

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
elezar left a comment:
Sorry for the delay here. I am revisiting this because I want to add a specific mock for something that I'm working on.
Looking at the organisation, I think the SPECIFIC GPU implementations are useful -- as is the scaffolding to create a server with a specific list of GPUs or a COUNT of a single GPU type.
In my case, I want to mock a Tegra-based system which includes an iGPU and a dGPU. Here I would imagine instantiating this as:

    NewServer(WithGPUs(gpus.IGX_THOR, gpus.RTX_PRO_6000))

Let's assume that I don't care about the driver version and/or the CUDA version in this case.
With this in mind, I have some proposed changes under https://github.com/elezar/go-nvml/tree/feat/add-dgxh100-dgxh200-dgxb200-mocks
I also have a branch: https://github.com/elezar/go-nvml/tree/add-tegra-mocks that adds a tegra (Thor) device.
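For illustration only, a hedged sketch of the functional-options shape that call implies. WithGPUs, gpus.IGX_THOR, and gpus.RTX_PRO_6000 come from the comment above; the Option type, Config, and the gpuConfigs field are hypothetical and not taken from either branch.

```go
// Option mutates a mock Server during construction.
type Option func(*Server)

// WithGPUs supplies an explicit, possibly heterogeneous, list of GPU configs.
func WithGPUs(configs ...Config) Option {
	return func(s *Server) {
		s.gpuConfigs = append(s.gpuConfigs, configs...)
	}
}

// NewServer applies the options and leaves driver/CUDA versions at defaults
// when the caller does not care about them.
func NewServer(opts ...Option) *Server {
	s := &Server{}
	for _, opt := range opts {
		opt(s)
	}
	return s
}
```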
This commit adds comprehensive mock implementations for three additional NVIDIA GPU architectures to support multi-architecture testing:
Each mock server provides:
Key Features:
This enables comprehensive testing across multiple GPU architectures without requiring physical hardware.