This suite of scripts is designed to perform stress testing and functional checks on GPUs within a containerized environment, typically managed by Slurm. It includes node-local hardware tests (GPU burn, NCCL bandwidth) and a multi-node distributed PyTorch training test.
The suite consists of the following files:

- `Dockerfile`: Defines the container image with the necessary dependencies, including PyTorch, CUDA, NCCL, and the tools (`git`, `make`, `gcc`) required by the test scripts.
- `requirements.txt`: Lists Python dependencies (e.g., `torch`, `torchvision`).
- `src/gpu_probe/runner.py`: Python script that orchestrates the node-local tests. It executes `run_nccl.sh` and `run_gpu_burn.sh`.
- `src/gpu_probe/train.py`: A PyTorch script that runs a simple distributed training routine (CIFAR10 dataset with a ResNet50 model) using `DistributedDataParallel` (DDP) to verify multi-GPU/multi-node training functionality.
- `run_nccl.sh`: A shell script that clones the NVIDIA `nccl-tests` repository, builds it, and runs `all_reduce_perf` to measure inter-GPU communication bandwidth on a node (see the sketch after this list).
- `run_gpu_burn.sh`: A shell script that clones the `gpu-burn` utility, builds it, and runs it to stress-test the GPUs for stability and thermal performance. The default duration is 30 seconds.
- `entrypoint.sh`: The main entrypoint for the Docker container, which executes `src/gpu_probe/runner.py`.
- `submit_gpu_probe.sbatch`: An example Slurm batch script that demonstrates how to run the GPU probe suite. It includes steps for:
  - Running the node-local probes (`runner.py`) on the first allocated node.
  - Running the distributed training test (`train.py`) across all allocated nodes and GPUs using `torchrun`.
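A minimal sketch of the flow that `run_nccl.sh` implements, assuming the stock `nccl-tests` build and the `/tmp/nccl.txt` output path described below; the exact flags and bandwidth parsing in the real script may differ:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the run_nccl.sh flow: clone, build, run all_reduce_perf on 1 GPU.
set -euo pipefail

# Clone and build the NVIDIA nccl-tests suite (uses the git/make/gcc shipped in the image).
git clone https://github.com/NVIDIA/nccl-tests.git /tmp/nccl-tests
make -C /tmp/nccl-tests -j"$(nproc)"

# Run all_reduce_perf on a single GPU, sweeping message sizes from 8 B to 128 MB,
# and keep a copy of the raw output for later parsing.
/tmp/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1 | tee /tmp/nccl.txt

# Report the average bus bandwidth line from the output.
grep "Avg bus bandwidth" /tmp/nccl.txt
```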
Running the suite requires:

- Docker: For building the container image.
- Slurm Workload Manager: For submitting and managing the job on a cluster.
- Pyxis/Enroot (or a similar Slurm container integration): For running Docker images with Slurm (`srun --container-image=...`); see the quick check after this list.
- NVIDIA GPUs: Accessible to the Slurm cluster nodes.
- Container Registry (optional, if not using local image files such as `.sqsh`): To host the built Docker image.
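As a quick sanity check that the Pyxis/Enroot integration and GPU access work, something like the following can be run once the image is available; the image name and GPU count are placeholders:

```bash
# Verify that Slurm can launch the container and see a GPU (image name is a placeholder).
srun --nodes=1 --gres=gpu:1 \
     --container-image=your-registry/your-repo/gpu-probe:latest \
     nvidia-smi
```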
To set up and run the probe:

- Navigate to the `gpu-probe` directory.
- Build the Docker image:

  `docker build -t your-registry/your-repo/gpu-probe:latest -f Dockerfile .`

  Replace `your-registry/your-repo/gpu-probe:latest` with your desired image name and tag.
- Push Image (if using a registry): If you're using a central container registry, push the built image:

  `docker push your-registry/your-repo/gpu-probe:latest`
- Update sbatch Script: Edit `submit_gpu_probe.sbatch`:
  - Set the `DOCKER_IMAGE` variable to the correct path of your image in the registry (e.g., `DOCKER_IMAGE=your-registry/your-repo/gpu-probe:latest`).
  - If you are using a locally converted Enroot image (`.sqsh` file), set `DOCKER_IMAGE` to the absolute path of the `.sqsh` file on the cluster nodes (e.g., `DOCKER_IMAGE=/path/to/shared/gpu-probe.sqsh`); see the conversion sketch after these steps.
  - Adjust the Slurm parameters (`--nodes`, `--gres=gpu:X`, `--mem`, `--time`, `--output`) as needed for your cluster and testing requirements.
  - Modify `TRAIN_ARGS` if you need different training parameters or a different shared `--data_path` for the CIFAR10 dataset. Ensure the chosen `--data_path` is on a shared filesystem accessible by all nodes at the same path.
- Submit the Slurm Job:

  `sbatch submit_gpu_probe.sbatch`
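If the cluster nodes cannot pull directly from your registry, the image can be converted to a local Enroot squashfs file. A sketch, assuming Enroot is installed and using placeholder registry and path names:

```bash
# Import the Docker image into an Enroot .sqsh file on a shared filesystem
# (registry, repo, and output path are placeholders; Enroot separates the
# registry from the image path with '#').
enroot import -o /path/to/shared/gpu-probe.sqsh \
    "docker://your-registry#your-repo/gpu-probe:latest"
```

The resulting path is what `DOCKER_IMAGE` should point to when not using a registry.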
The `submit_gpu_probe.sbatch` script orchestrates the following (a simplified skeleton appears after this list):
- Node-Local Probe (on the first node):
  - `srun` launches the container on the first allocated node.
  - The container's `entrypoint.sh` executes `python -m gpu_probe.runner --test`.
  - `runner.py` then executes:
    - `run_nccl.sh`: Clones, builds, and runs `nccl-tests` (`all_reduce_perf` on 1 GPU by default). The output and the parsed bandwidth are logged.
    - `run_gpu_burn.sh`: Clones, builds, and runs `gpu-burn`.
  - The exit code of `runner.py` (`PROBE_RC`) indicates success (0) or failure (1) of these local tests.
- Multi-Node Distributed Training:
  - `srun` launches `torchrun` across all allocated nodes and GPUs.
  - `torchrun` executes `src/gpu_probe/train.py` for a DDP training session.
  - The script downloads CIFAR10 to the specified `--data_path` (the master rank downloads, the others wait) and trains a ResNet50 model for a few epochs/batches.
  - The exit code of this step (`TRAIN_RC`) indicates success or failure.
- Final Result:
  - The sbatch script checks `PROBE_RC` and `TRAIN_RC` to determine the overall success or failure of the probe.
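A simplified skeleton of this orchestration, assuming Pyxis-style `--container-image` support; the image name, node/GPU counts, rendezvous port, and `TRAIN_ARGS` contents are placeholders, and the real `submit_gpu_probe.sbatch` may differ in detail:

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --output=/root/gpu_probe_%j.log

DOCKER_IMAGE=your-registry/your-repo/gpu-probe:latest   # or /path/to/shared/gpu-probe.sqsh
TRAIN_ARGS="--data_path /shared/datasets/cifar10"       # hypothetical training arguments

# Step 1: node-local probe on the first allocated node (same command the entrypoint runs).
srun --nodes=1 --ntasks=1 --container-image="$DOCKER_IMAGE" \
     python -m gpu_probe.runner --test
PROBE_RC=$?

# Step 2: DDP training across all allocated nodes and GPUs via torchrun (one launcher per node).
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
srun --ntasks-per-node=1 --container-image="$DOCKER_IMAGE" \
     torchrun --nnodes="$SLURM_NNODES" --nproc_per_node=8 \
              --rdzv_backend=c10d --rdzv_endpoint="$MASTER_ADDR:29500" \
              src/gpu_probe/train.py $TRAIN_ARGS
TRAIN_RC=$?

# Final result: both steps must succeed.
if [ "$PROBE_RC" -eq 0 ] && [ "$TRAIN_RC" -eq 0 ]; then
    echo "GPU probe PASSED"
else
    echo "GPU probe FAILED (PROBE_RC=$PROBE_RC TRAIN_RC=$TRAIN_RC)"
    exit 1
fi
```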
Output from a run ends up in the following places:

- Slurm Output: The main log file is specified by `--output` in `submit_gpu_probe.sbatch` (default: `/root/gpu_probe_%j.log`). This captures stdout/stderr from the sbatch script itself and from the `srun` commands.
- Script Logs: `runner.py` and `train.py` use Python's `logging` module, which prints to stdout/stderr and is therefore captured in the Slurm output file.
- NCCL Test Output: `run_nccl.sh` saves the raw output of `all_reduce_perf` to `/tmp/nccl.txt` inside the container and also prints this content to stdout (see the inspection example after this list).
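For example, once a job has finished, the combined Slurm log can be inspected directly; the job ID below is a placeholder and the path follows the default `--output` pattern:

```bash
# Tail the main Slurm log for a finished job (12345 is a placeholder job ID).
tail -n 50 /root/gpu_probe_12345.log

# The NCCL result also lands in this log, since run_nccl.sh prints /tmp/nccl.txt to stdout;
# all_reduce_perf normally reports a line containing "Avg bus bandwidth".
grep "Avg bus bandwidth" /root/gpu_probe_12345.log
```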
Common customizations:

- GPU Burn Duration: Modify the `gpu_burn 30` command in `run_gpu_burn.sh` (30 is the duration in seconds); see the example after this list.
- NCCL Test Parameters: Adjust the `all_reduce_perf` arguments in `run_nccl.sh` (e.g., `-b`, `-e`, `-g`). The bandwidth check in `run_nccl.sh` is currently commented out for diagnostic purposes but can be re-enabled and its threshold adjusted.
- Training Parameters: Modify `TRAIN_ARGS` in `submit_gpu_probe.sbatch` to change the epochs, batch count, learning rate, or data path for `train.py`.
- Container Base Image: The `Dockerfile` uses `nvcr.io/nvidia/pytorch:24.07-py3`. This can be changed if a different PyTorch/CUDA version is needed.
- Python Dependencies: Add packages to `requirements.txt` and rebuild the Docker image.
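For instance, to run a longer burn and exercise all GPUs on a node in the NCCL sweep, the relevant commands in the two scripts could be changed along these lines; the 60-second duration, the 1 GB upper bound, and the 8-GPU count are illustrative values:

```bash
# In run_gpu_burn.sh: stress the GPUs for 60 seconds instead of the default 30.
./gpu_burn 60

# In run_nccl.sh: sweep message sizes from 8 B to 1 GB across 8 GPUs on the node.
./build/all_reduce_perf -b 8 -e 1G -f 2 -g 8
```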