Fixes to support Aurora GPU #363

Merged
xylar merged 3 commits into E3SM-Project:develop from xylar:omega/fix-aurora-gpu
Mar 27, 2026

@xylar xylar commented Mar 16, 2026

This merge fixes various issues that were preventing GPU builds on Aurora:

  • Switches the sort_by_key() method in horizontal operators to use a non-recursive insertion sort. SYCL on Aurora doesn't allow recursive functions on the GPU.
  • Switches from printf() to Kokkos::abort() calls in two methods in horizontal operators. This is required in Kokkos functions to support SYCL on Aurora.
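
For readers unfamiliar with the approach, a non-recursive sort by key can be sketched in plain C++ as below. This is a minimal illustration only, not the code in this PR: the name insertionSortByKey, the std::vector arguments, and the host-side assert are assumptions made for the example, and the actual sort_by_key() signature in Omega is not shown in this conversation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a sort-by-key helper (not the Omega routine).
// Sorts `keys` in ascending order and applies the same permutation to
// `values`. Only loops are used, so there is no recursion anywhere.
template <typename Key, typename Value>
void insertionSortByKey(std::vector<Key> &keys, std::vector<Value> &values) {
    assert(keys.size() == values.size());
    for (std::size_t i = 1; i < keys.size(); ++i) {
        Key k = keys[i];
        Value v = values[i];
        std::size_t j = i;
        // Shift every larger key (and its value) one slot to the right.
        while (j > 0 && keys[j - 1] > k) {
            keys[j] = keys[j - 1];
            values[j] = values[j - 1];
            --j;
        }
        // Drop the saved pair into the gap that opened up.
        keys[j] = k;
        values[j] = v;
    }
}
```

Insertion sort is quadratic in the worst case, so it suits short lists; because the comparison is strict, entries with equal keys keep their original relative order.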

Checklist

  • Linting
  • Building
    • CMake build does not produce any new warnings from changes in this PR
  • Testing
    • Add a comment to the PR titled Testing with the following:
      • Which machines CTest unit tests have been run on, and indicate that they are all passing.
      • The Polaris omega_pr test suite has passed, using the Polaris e3sm_submodules/Omega baseline
      • Document machine(s), compiler(s), and the build path(s) used for -p for both the baseline (Polaris e3sm_submodules/Omega) and the PR build
      • Indicate "All tests passed" or document failing tests

@xylar xylar requested a review from overfelt March 16, 2026 14:40
@xylar xylar self-assigned this Mar 16, 2026
@xylar xylar added the bug Something isn't working label Mar 16, 2026
@xylar
Author

xylar commented Mar 16, 2026

Thanks @overfelt! I'm still testing this out on Aurora. I'll let you know how it goes. But at least it builds successfully now!

@xylar
Author

xylar commented Mar 20, 2026

My jobs have sat in the queue for days and days. I feel like I must be doing something wrong. I'll have to look into it when I get back at the end of next week.

Member

@amametjanov amametjanov left a comment

Checked with oneapi-ifxgpu and PARMETIS_ROOT=/lus/flare/projects/E3SM_Dec/soft/polaris/aurora/spack/dev_polaris_0_10_0_oneapi-ifxgpu_mpich/var/spack/environments/dev_polaris_0_10_0_oneapi-ifxgpu_mpich/.spack-env/view .

Build fails with develop, but works with this branch merged-in locally.

@xylar
Author

xylar commented Mar 25, 2026

I haven't been able to get through the queue with my tests on Aurora. I'll try again this Friday.

@amametjanov
Member

Xylar, please check if you can start an interactive 1-node job with

qsub -q debug -l walltime=01:00:00 -A E3SM_Dec -l select=1,filesystems=home:flare -I

@xylar
Author

xylar commented Mar 25, 2026

That's without GPUs, right? I had no problem with that earlier. It was the job with GPUs that never ran.

@xylar
Author

xylar commented Mar 25, 2026

I was doing the equivalent of:

qsub -q debug -l walltime=01:00:00 -A E3SM_Dec -l select=1::ngpus=12,filesystems=home:flare -I

@amametjanov
Member

All compute nodes on Aurora have 12 GPUs (6 cards x 2 tiles), and they are accessible by default, so ::ngpus=12 isn't needed.

@xylar xylar force-pushed the omega/fix-aurora-gpu branch from 51384cc to 39dba35 Compare March 25, 2026 21:33
@xylar
Author

xylar commented Mar 25, 2026

@amametjanov, thanks, that's super helpful!

Now, I made it through the queue and I'm seeing:

mpiexec --label -n 4 --ppn 4 --depth 1 --cpu-bind list:1-8:9-16:17-24:25-32:33-40:41-48:53-60:61-68:69-76:77-84:85-92:93-100 --gpu-bind list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1 --mem-bind list:0:0:0:0:0:0:1:1:1:1:1:1 ./omega.exe
x4311c0s6b0n0.hsn.cm.aurora.alcf.anl.gov: rank 3 died from signal 11

I'm basically trying to mimic E3SM flags to the best of my ability. Can you tell me what I might have set up wrong here?

@amametjanov
Member

That mpiexec line looks correct, but I don't know why you're getting a segfault. It may be OOM, so it's worth trying with all 12 tasks on-node (or more tasks/nodes at 12 ppn if the input mesh is large), i.e.

mpiexec --label -n 12 --ppn 12 --depth 1 \
  --cpu-bind list:1-8:9-16:17-24:25-32:33-40:41-48:53-60:61-68:69-76:77-84:85-92:93-100 \
  --gpu-bind list:0.0:0.1:1.0:1.1:2.0:2.1:3.0:3.1:4.0:4.1:5.0:5.1 \
  --mem-bind list:0:0:0:0:0:0:1:1:1:1:1:1 ./omega.exe
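
The --gpu-bind list in the command above appears to follow a card.tile pattern, consistent with the "6 cards x 2 tiles" layout mentioned earlier in this thread. As an illustration only, such a list could be generated rather than typed by hand. The helper name gpuBindList is hypothetical and is not part of E3SM, Polaris, or mpiexec.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds a "list:<card>.<tile>:..." string with one
// entry per GPU tile, ordered by card and then by tile within the card.
std::string gpuBindList(int cards, int tilesPerCard) {
    assert(cards >= 0 && tilesPerCard >= 0);
    std::string out = "list";
    for (int card = 0; card < cards; ++card) {
        for (int tile = 0; tile < tilesPerCard; ++tile) {
            out += ":" + std::to_string(card) + "." + std::to_string(tile);
        }
    }
    return out;
}
```

Calling gpuBindList(6, 2) reproduces the twelve-entry list passed to --gpu-bind in the command above.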

@xylar
Author

xylar commented Mar 26, 2026

This is a tiny mesh. 4 GPUs is likely very generous.

I'll play around with it then and see what might help.

@mwarusz
Member

mwarusz commented Mar 26, 2026

@xylar If you were testing this PR in a polaris environment see E3SM-Project/mache#370.

xylar added 3 commits March 27, 2026 13:16
SYCL on Aurora doesn't allow recursive functions on the GPU.
This is required in Kokkos functions to support SYCL on Aurora.

This merge also fixes some lint.
@xylar xylar force-pushed the omega/fix-aurora-gpu branch from 39dba35 to 03877c2 Compare March 27, 2026 13:16
@xylar
Author

xylar commented Mar 27, 2026

Testing

CTest unit tests:

  • Machine: aurora
  • Compiler: oneapi-ifxgpu
  • Build type: Release
  • Failures (10 of 39):
    • HORZOPERATORS_PLANE_TEST
    • HORZOPERATORS_SPHERE_TEST
    • AUXVARS_PLANE_TEST
    • AUXVARS_SPHERE_TEST
    • TEND_PLANE_TEST
    • TEND_PLANE_SINGLE_PRECISION_TEST
    • TEND_SPHERE_TEST
    • TIMESTEPPER_TEST
    • EOS_TEST
    • VERTMIX_TEST
  • Log: /lus/flare/projects/E3SM_Dec/xylar/polaris_0.10/aurora/test_20260327/omega-pr-1.0.0a2-oneapi-ifxgpu/build/ctests.log

The errors look like:

      Start 36: EOS_TEST
36/39 Test #36: EOS_TEST ...................................***Failed    2.40 sec
x4311c6s3b0n0.hsn.cm.aurora.alcf.anl.gov 0: terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Calls to sycl::queue::submit cannot be nested. Command group function objects should use the sycl::handler API instead.
x4311c6s3b0n0.hsn.cm.aurora.alcf.anl.gov 1: terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Calls to sycl::queue::submit cannot be nested. Command group function objects should use the sycl::handler API instead.
x4311c6s3b0n0.hsn.cm.aurora.alcf.anl.gov: rank 1 died from signal 6

These match the failures @mwarusz reported in #368.

Polaris omega_pr suite

  • PR build: /lus/flare/projects/E3SM_Dec/xylar/polaris_0.10/aurora/test_20260327/omega-pr-1.0.0a2-oneapi-ifxgpu/build
  • PR workdir: /lus/flare/projects/E3SM_Dec/xylar/polaris_0.10/aurora/test_20260327/omega-pr-1.0.0a2-oneapi-ifxgpu
  • Machine: aurora
  • Compiler: oneapi-ifx
  • Build type: Release
  • Logs: /lus/flare/projects/E3SM_Dec/xylar/polaris_0.10/aurora/test_20260327/omega-pr-1.0.0a2-oneapi-ifxgpu/polaris_omega_pr.*8404925
  • Result:
    • Failures (1 of 9):
      • ocean/spherical/icos/cosine_bell/decomp

The test failed because we allocated 1 node but the test requires 24 GPUs (2 nodes). I'll try to figure out what's wrong on the Polaris side.

@xylar
Author

xylar commented Mar 27, 2026

More testing

Polaris omega_pr suite

I was able to run the omega_pr suite with some fixes in E3SM-Project/polaris#509:

  • PR build: /lus/flare/projects/E3SM_Dec/xylar/polaris_0.10/aurora/test_20260327/omega-pr-1.0.0a2-oneapi-ifxgpu2/build
  • PR workdir: /lus/flare/projects/E3SM_Dec/xylar/polaris_0.10/aurora/test_20260327/omega-pr-1.0.0a2-oneapi-ifxgpu2
  • Machine: aurora
  • Compiler: oneapi-ifx
  • Build type: Release
  • Log: /lus/flare/projects/E3SM_Dec/xylar/polaris_0.10/aurora/test_20260327/omega-pr-1.0.0a2-oneapi-ifxgpu/polaris_omega_pr.*8405052
  • Result: All tests passed

@xylar xylar merged commit c409e9d into E3SM-Project:develop Mar 27, 2026
1 check passed
@xylar xylar deleted the omega/fix-aurora-gpu branch March 27, 2026 16:08
xylar added a commit to xylar/polaris that referenced this pull request Mar 27, 2026
xylar added a commit to xylar/polaris that referenced this pull request Mar 30, 2026