
--packed flag causes cudaErrorIllegalAddress in NCCL watchdog during multi-GPU distributed training via mp.spawn #910

@Eecornwell


Environment

  • gsplat (a8d88d3)
  • CUDA 13.0
  • PyTorch 2.x with NCCL 2.27.7
  • 4× NVIDIA A10G GPUs (AWS g5.12xlarge)
  • Python 3.12
  • Launched via python simple_trainer.py default --packed ... --world_size 4 (matching basic_4gpus.sh)

Description

Similar issue: #845
When running simple_trainer.py with --packed and world_size > 1, training crashes at step 0 with a cudaErrorIllegalAddress in the NCCL process group watchdog thread across all ranks simultaneously.

Without --packed, multi-GPU training runs correctly with the same configuration.

Steps to Reproduce

python simple_trainer.py default \
    --packed \
    --world_size 4 \
    --data-dir /path/to/dataset \
    --result-dir /path/to/output \
    --disable_viewer \
    --max_steps 15000

Error

[rank0]:[E] ProcessGroupNCCL.cpp:2057] [PG ID 0 PG GUID 0(default_pg) Rank 0]
Process group watchdog thread terminated with exception:
CUDA error: an illegal memory access was encountered

Exception raised from c10_cuda_check_implementation at
/pytorch/c10/cuda/CUDAException.cpp:44

terminate called after throwing an instance of 'c10::DistBackendError'
  what(): CUDA error: an illegal memory access was encountered

All 4 ranks (0-3) crash simultaneously at the first training step with the same error.
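Because the error surfaces asynchronously in the watchdog thread, the reported stack does not point at the faulting kernel. A sketch of how the crash could be localized, using standard PyTorch/CUDA debugging env vars (not gsplat-specific; the compute-sanitizer step assumes the CUDA toolkit is on PATH):

```shell
# Serialize kernel launches so the illegal access is reported at the
# offending call site instead of in the NCCL watchdog thread.
export CUDA_LAUNCH_BLOCKING=1
# Print C++ stack traces alongside the Python traceback.
export TORCH_SHOW_CPP_STACKTRACES=1

python simple_trainer.py default --packed --world_size 4 \
    --data-dir /path/to/dataset --result-dir /path/to/output \
    --disable_viewer --max_steps 1

# Alternatively, run under compute-sanitizer to pinpoint the bad access:
# compute-sanitizer --tool memcheck python simple_trainer.py default --packed ...
```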

Expected Behavior

Multi-GPU training with --packed should work as demonstrated in examples/benchmarks/basic_4gpus.sh.

Additional Context

  • Removing --packed resolves the crash; training completes successfully across all 4 ranks, but the speed-up is negligible
  • The crash occurs at step 0 before any meaningful training, suggesting the issue is in the initial NCCL all_reduce over the packed sparse tensors produced by the first rasterization pass
  • Hypothesis: packed mode produces sparse tensors whose lengths vary per rank, while NCCL all_reduce expects every rank to contribute the same element count; a size mismatch would make NCCL read past the end of the shorter buffers, triggering the illegal memory access seen in the watchdog
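If the hypothesis holds, one workaround would be to pad each rank's packed buffer to the global maximum length before the collective (in a real run, an all_gather of the per-rank lengths would supply that maximum). A minimal CPU-only sketch of the padding logic, with plain Python lists standing in for the per-rank packed tensors (the function names are illustrative, not gsplat's API):

```python
def pad_to_max(per_rank_values, pad_value=0.0):
    """Pad each rank's variable-length buffer to the global max length.

    NCCL all_reduce requires every rank to contribute the same element
    count; padding to a common length satisfies that invariant.
    """
    max_len = max(len(v) for v in per_rank_values)
    return [v + [pad_value] * (max_len - len(v)) for v in per_rank_values]


def all_reduce_sum(per_rank_values):
    """Element-wise sum across ranks (what all_reduce with SUM computes)."""
    padded = pad_to_max(per_rank_values)
    return [sum(col) for col in zip(*padded)]


# Packed mode: rank 0 produced 3 elements, rank 1 produced 5.
ranks = [[1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0, 1.0]]
print(all_reduce_sum(ranks))  # [2.0, 3.0, 4.0, 1.0, 1.0]
```

Zero-padding is safe for a SUM reduction since the pad elements contribute nothing; other reduce ops would need a different pad value.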
