Fixes to support Aurora GPU #363
Thanks @overfelt! I'm still testing this out on Aurora. I'll let you know how it goes. But at least it builds successfully now!

My jobs have sat in the queue for days and days. I feel like I must be doing something wrong. I'll have to look into it when I get back at the end of next week.
amametjanov left a comment
Checked with oneapi-ifxgpu and PARMETIS_ROOT=/lus/flare/projects/E3SM_Dec/soft/polaris/aurora/spack/dev_polaris_0_10_0_oneapi-ifxgpu_mpich/var/spack/environments/dev_polaris_0_10_0_oneapi-ifxgpu_mpich/.spack-env/view .
Build fails with develop, but works with this branch merged-in locally.

I haven't been able to get through the queue with my tests on Aurora. I'll try again this Friday.

Xylar, please check if you can start an interactive 1-node job with

That's without GPUs, right? I had no problem with that earlier. It was the jobs with GPUs that never ran.

I was doing the equivalent of:

All compute nodes on Aurora have 12 GPUs (6 cards x 2 tiles): accessible by default --
51384cc to 39dba35

@amametjanov, thanks, that's super helpful! Now, I made it through the queue and I'm seeing: I'm basically trying to mimic E3SM flags to the best of my ability. Can you tell me what I might have set up wrong here?

That mpiexec line looks correct, but I don't know why you're getting a segfault. Maybe OOM; worth trying with all 12 tasks on-node (or more tasks/nodes at 12 ppn if the input mesh is large), i.e.

This is a tiny mesh. 4 GPUs is likely very generous. I'll play around with it then and see what might help.

@xylar If you were testing this PR in a polaris environment, see E3SM-Project/mache#370.
SYCL on Aurora doesn't allow recursive functions on the GPU.
This is required in kokkos functions to support SYCL on Aurora. This merge also fixes some lint.
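For readers unfamiliar with the recursion restriction mentioned above, here is a minimal sketch of a loop-only insertion sort that sorts one array by key and carries a second array of values along with it. The name insertionSortByKey and its signature are hypothetical and are not taken from the Omega source; the real sort_by_key() in the horizontal operators may differ in interface and would need the appropriate Kokkos device annotations.

```cpp
#include <cstddef>

// Hypothetical helper for illustration only (not the Omega implementation).
// Sorts keys[0..n) in ascending order, in place, and applies the same
// permutation to vals[0..n). It uses only loops, so there is no recursion.
template <typename K, typename V>
void insertionSortByKey(K *keys, V *vals, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        // Pull out the pair to be placed into the sorted prefix [0, i).
        K key = keys[i];
        V val = vals[i];
        std::size_t j = i;
        // Shift strictly larger keys one slot to the right. Using '>' rather
        // than '>=' keeps equal keys in their original order (stable sort).
        while (j > 0 && keys[j - 1] > key) {
            keys[j] = keys[j - 1];
            vals[j] = vals[j - 1];
            --j;
        }
        keys[j] = key;
        vals[j] = val;
    }
}
```

Insertion sort is quadratic in the worst case, so a sketch like this is only reasonable for short lists, which is the assumption made here.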
39dba35 to 03877c2
Testing
CTest unit tests:
The errors look like:
These are as @mwarusz reported in #368.
Polaris

More testing
Polaris

This brings in E3SM-Project/Omega#363.
This merge fixes various issues that were preventing GPU builds on Aurora:
- sort_by_key() method in horizontal operators changed to use non-recursive insertion sort. SYCL on Aurora doesn't allow recursive functions on the GPU.
- printf() changed to kokkos::abort() calls in two methods in horizontal operators. This is required in kokkos functions to support SYCL on Aurora.
Checklist
Testing with the following:
have been run on and indicate that are all passing.
has passed, using the Polaris e3sm_submodules/Omega baseline
-p for both the baseline (Polaris e3sm_submodules/Omega) and the PR build