[BUG] Memory leaks and crashes on AMD MI300A APU #145

@Tissot11

I could successfully run a reduced version of the magnetised shock problem on CUDA; according to Slurm it takes about 4 GB of RAM on 2 nodes (8 GPUs) and lasts about 7 minutes. However, running the same problem on MI300A (2 nodes, 8 GPUs) shows severe memory leaks (> 500 GB), on both single and multiple nodes, which crash the run. I attach the err and out files together with shock.txt. I used

cray-hdf5-parallel/1.14.3.1 rocm-6.2.2 modules

with

export HSA_OVERRIDE_GFX_VERSION=9.4.2; export MPICH_GPU_SUPPORT_ENABLED=1

cmake -B build -D pgen=shock -D mpi=ON -D CMAKE_CXX_COMPILER=hipcc -D CMAKE_C_COMPILER=hipcc -D Kokkos_ENABLE_HIP=ON -D Kokkos_ARCH_AMD_GFX942_APU=ON

errEntity.txt
outEntity.txt

shock.txt
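
To put numbers on the memory growth described above, the resident memory of one rank could be sampled while the job runs. The following is only a minimal sketch, assuming a Linux node where `/proc/<pid>/status` is readable; the helper names `parse_vmrss_kb` and `sample_rss` are hypothetical and not part of the code being reported on. It tracks host-side resident memory as the kernel reports it, which may not match exactly what Slurm accounts for.

```python
import re
import time
from pathlib import Path


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def sample_rss(pid, interval_s=5.0, samples=10):
    """Yield (elapsed seconds, RSS in GiB) for a running process.

    Stops early if the process disappears (for example, after a crash).
    """
    start = time.time()
    for _ in range(samples):
        try:
            text = Path(f"/proc/{pid}/status").read_text()
        except FileNotFoundError:
            break  # process has exited
        rss_kb = parse_vmrss_kb(text)
        if rss_kb is not None:
            # kB -> GiB
            yield time.time() - start, rss_kb / 1024**2
        time.sleep(interval_s)
```

For example, `for t, gib in sample_rss(12345): print(f"{t:.0f} s  {gib:.2f} GiB")` would print one line per sample, where `12345` stands in for the PID of one MPI rank. A steadily rising value on MI300A, compared with a flat one on the CUDA run, would help localise the leak.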

Labels: bug (Something isn't working)