Open
Description
--packed flag causes cudaErrorIllegalAddress in NCCL watchdog during multi-GPU distributed training via mp.spawn
Environment
- gsplat (a8d88d3)
- CUDA 13.0
- PyTorch 2.x with NCCL 2.27.7
- 4x NVIDIA GPUs (AWS g5.12xlarge, 4x A10G)
- Python 3.12
- Launched via python simple_trainer.py default --packed ... --world_size 4 (matching basic_4gpus.sh)
Description
Similar issue: #845
When running simple_trainer.py with --packed and world_size > 1, training crashes at step 0 with a cudaErrorIllegalAddress in the NCCL process group watchdog thread across all ranks simultaneously.
Without --packed, multi-GPU training runs correctly with the same configuration.
Steps to Reproduce
python simple_trainer.py default \
--packed \
--world_size 4 \
--data-dir /path/to/dataset \
--result-dir /path/to/output \
--disable_viewer \
--max_steps 15000
Error
[rank0]:[E] ProcessGroupNCCL.cpp:2057] [PG ID 0 PG GUID 0(default_pg) Rank 0]
Process group watchdog thread terminated with exception:
CUDA error: an illegal memory access was encountered
Exception raised from c10_cuda_check_implementation at
/pytorch/c10/cuda/CUDAException.cpp:44
terminate called after throwing an instance of 'c10::DistBackendError'
what(): CUDA error: an illegal memory access was encountered
All 4 ranks (0-3) crash simultaneously at the first training step with the same error.
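Because CUDA errors are reported asynchronously, the watchdog thread may only be where the failure surfaces, not where it originates. A minimal sketch of a debug relaunch follows, assuming the same command line as in the reproduction steps. The helper names build_debug_env and build_command are hypothetical, and max_steps=1 is an assumption based on the crash occurring at step 0. The environment variables CUDA_LAUNCH_BLOCKING, NCCL_DEBUG and TORCH_DISTRIBUTED_DEBUG are standard CUDA, NCCL and PyTorch settings.

```python
import os
import sys


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA and verbose NCCL logging."""
    env = dict(os.environ if base_env is None else base_env)
    # Make kernel launches synchronous so the error is raised at the faulting call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Print NCCL initialization and collective-level diagnostics.
    env["NCCL_DEBUG"] = "INFO"
    # Ask torch.distributed for extra consistency checks and logging.
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    return env


def build_command(data_dir, result_dir, world_size=4, max_steps=1):
    """Rebuild the command from the reproduction steps (hypothetical helper).

    max_steps defaults to 1 here as an assumption: the crash is reported at step 0.
    """
    return [
        sys.executable, "simple_trainer.py", "default",
        "--packed",
        "--world_size", str(world_size),
        "--data-dir", data_dir,
        "--result-dir", result_dir,
        "--disable_viewer",
        "--max_steps", str(max_steps),
    ]
```

Passing build_command("/path/to/dataset", "/path/to/output") and env=build_debug_env() to subprocess.run should produce a stack trace closer to the kernel that actually faults, which would help confirm or rule out the collective as the source.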
Expected Behavior
Multi-GPU training with --packed should work as demonstrated in examples/benchmarks/basic_4gpus.sh.
Additional Context
- Removing --packed resolves the crash: training completes successfully across all 4 ranks, but the speed-up is negligible
- The crash occurs at step 0 before any meaningful training, suggesting the issue is in the initial NCCL all_reduce over the packed sparse tensors produced by the first rasterization pass
- Hypothesis: packed mode produces variable-length sparse tensors per rank whose memory layout is incompatible with how NCCL accesses GPU memory during all_reduce, triggering an illegal memory access in the NCCL watchdog
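One way to probe the variable-length hypothesis is to compare the shape each rank holds just before the collective. The sketch below is only an illustration: find_shape_mismatches is a hypothetical helper, not part of gsplat, and it takes plain tuples so that it runs without GPUs. In a real run, the per-rank shapes could be collected with torch.distributed.all_gather_object before the suspected call.

```python
from collections import Counter


def find_shape_mismatches(shapes_by_rank):
    """Return {rank: shape} for ranks whose shape differs from the most common shape.

    shapes_by_rank is a list indexed by rank, e.g. the result of gathering
    tuple(tensor.shape) from every rank (assumed to be collected elsewhere).
    """
    if not shapes_by_rank:
        return {}
    normalized = [tuple(shape) for shape in shapes_by_rank]
    # Treat the most frequent shape as the reference layout.
    reference, _ = Counter(normalized).most_common(1)[0]
    return {
        rank: shape
        for rank, shape in enumerate(normalized)
        if shape != reference
    }


# Example: rank 2 holds a different number of packed elements than the others.
mismatches = find_shape_mismatches([(100, 3), (100, 3), (97, 3), (100, 3)])
print(mismatches)
```

An empty result would suggest that mismatched lengths are not the cause, and that the illegal access originates in an earlier kernel whose error only surfaces in the watchdog.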