Open
Labels
bug: Something isn't working
Description
I successfully ran a reduced version of the magnetised shock problem on CUDA: on 2 nodes (8 GPUs) it uses about 4 GB of RAM (according to Slurm) and runs for about 7 minutes. However, running the same problem on MI300A (2 nodes, 8 GPUs) shows severe memory leaks (> 500 GB), on both single and multiple nodes, which crash the run. I attach the err and out files together with shock.txt.
I used the cray-hdf5-parallel/1.14.3.1 and rocm-6.2.2 modules with
export HSA_OVERRIDE_GFX_VERSION=9.4.2; export MPICH_GPU_SUPPORT_ENABLED=1
cmake -B build -D pgen=shock -D mpi=ON -D CMAKE_CXX_COMPILER=hipcc -D CMAKE_C_COMPILER=hipcc -D Kokkos_ENABLE_HIP=ON -D Kokkos_ARCH_AMD_GFX942_APU=ON