Refactor fill_halo_regions!: Reduce boundary conditions slowdown #4706

Merged

simone-silvestri merged 126 commits into main from ss/fix-bc-slowdown on Sep 2, 2025

Conversation

@simone-silvestri
Collaborator

simone-silvestri commented Aug 8, 2025

A lot of the slowdown in the fill_halo_regions! function comes from the fact that, every time we need to fill the halos, we allocate arrays and permute them to figure out the order in which the boundary conditions should be executed. This procedure is riddled with type instabilities that lead to a lot of allocations, which eventually need to be garbage collected.

This information, however, is known at model construction time, so it is relatively easy to store it in the FieldBoundaryConditions and reuse it whenever we need it.

This PR does exactly that, avoiding the need for permute_boundary_conditions.
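
As a rough illustration of the idea (a minimal sketch with hypothetical names, not the PR's actual implementation), the ordering work moves from every fill call to a one-time construction step:

# Construction time: do the (potentially type-unstable) ordering work once.
struct PreorderedBoundaryConditions{B}
    ordered_bcs :: B  # e.g. ((bottom, top), (south, north), (west, east))
end

preorder(bcs::NamedTuple) = PreorderedBoundaryConditions(
    ((bcs.bottom, bcs.top), (bcs.south, bcs.north), (bcs.west, bcs.east)))

# Fill time: iterate the stored, already-ordered tuple; no allocation,
# no permutation.
function fill_all_halos!(data, pbcs::PreorderedBoundaryConditions)
    for (left_bc, right_bc) in pbcs.ordered_bcs
        # ... fill the two halos governed by (left_bc, right_bc) ...
    end
    return nothing
end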

Edit: This PR is actually becoming quite a refactor. The objective is to store preconfigured kernels in the field so they can be used to fill the halo regions. As part of this PR:

  • the kwarg fill_boundary_normal_velocities is changed to fill_open_bcs
  • tupled fill halo regions is removed (we need to test thoroughly if this leads to slowdown somewhere).
  • we add types that define the boundaries to be filled (WestAndEast, West, East, South, ...); see the sketch after this list
  • the boundaries are split in the case of a MultiRegionCommunication in the same direction as another BC, following what we do for distributed BCs. This simplifies the code but requires a double pass for the moment, which might not be optimal. However, MultiRegion is not really used much, so optimization has low priority there (the cubed sphere is not affected).
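
A toy illustration of the boundary types mentioned above (West, East, and WestAndEast come from the PR text; the periodic halo-copying below is a stand-in for the real kernels):

abstract type FillSide end
struct West        <: FillSide end
struct East        <: FillSide end
struct WestAndEast <: FillSide end

# Toy periodic halo fills on a 1D array with halo width H; dispatching on the
# side type selects the work at compile time, with no runtime bookkeeping.
fill_halo!(c, ::West, H)        = (c[1:H] .= c[end-2H+1:end-H]; nothing)
fill_halo!(c, ::East, H)        = (c[end-H+1:end] .= c[H+1:2H]; nothing)
fill_halo!(c, ::WestAndEast, H) = (fill_halo!(c, West(), H); fill_halo!(c, East(), H))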

This should lead to a cleaner and more straightforward halo-filling algorithm, as well as slimmer code.

@simone-silvestri
Collaborator Author

A field in this PR looks something like this:

julia> c = CenterField(grid)
1×1×10 Field{Center, Center, Center} on RectilinearGrid on CPU
├── grid: 1×1×10 RectilinearGrid{Float64, Periodic, Flat, Bounded} on CPU with 1×0×3 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 3×1×16 OffsetArray(::Array{Float64, 3}, 0:2, 1:1, -2:13) with eltype Float64 with indices 0:2×1:1×-2:13
    └── max=0.0, min=0.0, mean=0.0

julia> c.boundary_conditions.kernels
3-element Vector{Function}:
 fill_bottom_and_top_halo! (generic function with 5 methods)
 fill_south_and_north_halo! (generic function with 11 methods)
 fill_west_and_east_halo! (generic function with 8 methods)

julia> c.boundary_conditions.ordered_bcs
((FluxBoundaryCondition: Nothing, FluxBoundaryCondition: Nothing), (nothing, nothing), (PeriodicBoundaryCondition, PeriodicBoundaryCondition))
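
A minimal sketch (not the PR's exact implementation) of how the stored kernels and ordered boundary conditions could be consumed together at fill time:

function fill_halo_regions_sketch!(data, kernels, ordered_bcs, args...)
    for (kernel!, (left_bc, right_bc)) in zip(kernels, ordered_bcs)
        kernel! === nothing && continue   # Flat directions store no kernel
        kernel!(data, left_bc, right_bc, args...)
    end
    return nothing
end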

navidcy added the performance 🏍️ (So we can get the wrong answer even faster) and boundary conditions 🏓 labels on Aug 8, 2025
simone-silvestri and others added 2 commits August 26, 2025 11:49
Co-authored-by: Navid C. Constantinou <navidcy@users.noreply.github.com>
@simone-silvestri
Collaborator Author

An example of the information ordered and stored in the field boundary conditions:

julia> grid = RectilinearGrid(size = (12, 1), x = (0, 1), z = (0,1), topology = (Periodic, Flat, Bounded))
12×1×1 RectilinearGrid{Float64, Periodic, Flat, Bounded} on CPU with 3×0×1 halo
├── Periodic x  [0.0, 1.0) regularly spaced with Δx=0.0833333
├── Flat y
└── Bounded  z  [0.0, 1.0] regularly spaced with Δz=1.0

julia> c = CenterField(grid)
12×1×1 Field{Center, Center, Center} on RectilinearGrid on CPU
├── grid: 12×1×1 RectilinearGrid{Float64, Periodic, Flat, Bounded} on CPU with 3×0×1 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 18×1×3 OffsetArray(::Array{Float64, 3}, -2:15, 1:1, 0:2) with eltype Float64 with indices -2:15×1:1×0:2
    └── max=0.0, min=0.0, mean=0.0

julia> c.boundary_conditions
Oceananigans.FieldBoundaryConditions, with boundary conditions
├── west: PeriodicBoundaryCondition
├── east: PeriodicBoundaryCondition
├── south: Nothing
├── north: Nothing
├── bottom: FluxBoundaryCondition: Nothing
├── top: FluxBoundaryCondition: Nothing
└── immersed: FluxBoundaryCondition: Nothing

julia> c.boundary_conditions.ordered_bcs
((FluxBoundaryCondition: Nothing, FluxBoundaryCondition: Nothing), (nothing, nothing), (PeriodicBoundaryCondition, PeriodicBoundaryCondition))

julia> c.boundary_conditions.kernels
(KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!), nothing, KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(1, 3)}, Oceananigans.Utils.OffsetStaticSize{(1:1, 1:3)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!))

julia> c.boundary_conditions.kernels[1]
KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!)

julia> c.boundary_conditions.kernels[2]   # returns nothing: the y-direction is Flat

julia> c.boundary_conditions.kernels[3]
KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(1, 3)}, Oceananigans.Utils.OffsetStaticSize{(1:1, 1:3)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!)

@glwagner
Member

Should boundary_conditions.kernels be a named tuple so we know which kernel fills which boundary?

@simone-silvestri
Collaborator Author

simone-silvestri commented Aug 28, 2025

Ok, the new example of stored info is:

julia> grid = RectilinearGrid(size = (12, 2), x = (0, 1), z = (0, 1), topology = (Periodic, Flat, Bounded));

julia> c = CenterField(grid);

julia> c.boundary_conditions.ordered_bcs
(bottom_and_top = (FluxBoundaryCondition: Nothing, FluxBoundaryCondition: Nothing), south_and_north = (nothing, nothing), west_and_east = (PeriodicBoundaryCondition, PeriodicBoundaryCondition))

julia> c.boundary_conditions.kernels
(bottom_and_top = KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!), south_and_north = nothing, west_and_east = KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(1, 6)}, Oceananigans.Utils.OffsetStaticSize{(1:1, 1:6)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!))
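
With the NamedTuple layout, each entry can also be retrieved by name, for example (continuing from the field above):

c.boundary_conditions.kernels.west_and_east       # the periodic x-direction kernel
c.boundary_conditions.ordered_bcs.bottom_and_top  # the (bottom, top) flux pair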

@glwagner
Member

(quoting the stored-info example above)

looks great!

window_boundary_conditions(::Colon, left, right) = left, right

# The only thing we need
Adapt.adapt_structure(to, fbcs::FieldBoundaryConditions) = (kernels = fbcs.kernels, ordered_bcs = Adapt.adapt(to, fbcs.ordered_bcs))
Member

why do you need the kernels on GPU?

Collaborator Author

I was playing with making it possible to call fill_halo_regions! with adapted input here:

function fill_halo_regions!(field::Field, positional_args...; kwargs...)
    arch = architecture(field.grid)
    args = (field.data,
            field.boundary_conditions,
            field.indices,
            instantiated_location(field),
            field.grid,
            positional_args...)

    # Manually convert args... to be passed to the fill_halo_regions! function.
    GC.@preserve args begin
        converted_args = convert_to_device(arch, args)
        fill_halo_regions!(converted_args...; kwargs...)
    end

    return nothing
end

so that we do not need to adapt the grid a bunch of times, which I suspect is the reason for the high launch latency.
I can always adapt only the grid.
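
For reference, a minimal sketch of the "adapt only the grid" alternative mentioned above (reusing the helper names from the snippet; not a tested implementation):

function fill_halo_regions_grid_only!(field::Field, positional_args...; kwargs...)
    arch = architecture(field.grid)
    grid = convert_to_device(arch, field.grid)  # adapt the grid once, here
    return fill_halo_regions!(field.data, field.boundary_conditions, field.indices,
                              instantiated_location(field), grid, positional_args...;
                              kwargs...)
end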

@simone-silvestri
Collaborator Author

This should be ready to merge; I think all tests pass except for the Reactant ones.

@navidcy
Member

navidcy commented Sep 2, 2025

@simone-silvestri could you resolve the conflicts that came after #4687?
the doctest should be fixed once we merge main back here ;)

simone-silvestri referenced this pull request Sep 2, 2025
… and use Tuples in `boundary_mass_fluxes` (#4687)

* convert vectors to tuples

* remove warning

* matching_scheme -> scheme

* revert some undesirable changes

* mograte to new interface

* cleanup

* navid's suggestions

* bump minor version

* remove FlatExtrapolation

* migrate validation to new interface

* slightly better formatting

* Update src/Models/NonhydrostaticModels/boundary_mass_fluxes.jl

* rename perturbation advection file

* oops, typo

* change file name to perturbation_advection.jl

* Update src/Models/NonhydrostaticModels/boundary_mass_fluxes.jl

Co-authored-by: Simone Silvestri <silvestri.simone0@gmail.com>

* Update cubed_sphere_grid.jl

* fix doctest

---------

Co-authored-by: Simone Silvestri <silvestri.simone0@gmail.com>
Co-authored-by: Gregory L. Wagner <wagner.greg@gmail.com>
Co-authored-by: Navid C. Constantinou <navidcy@users.noreply.github.com>
@simone-silvestri
Collaborator Author

Tests pass; I would need an approval to merge.

@tomchor
Member

tomchor commented Sep 2, 2025

Seems like tests pass 🎉

@simone-silvestri maybe I missed this, but can we expect a significant speedup in simulations after this PR is merged?

@simone-silvestri
Collaborator Author

Yeah, generally for smaller models on GPU (for example, a one-degree ocean model), and more so for models that involve a lot of halo fills. For sea ice dynamics, for example, we see a speedup of a factor of 5.

@glwagner
Member

glwagner commented Sep 2, 2025

It might be nice to report a benchmark for some small problem in this PR, for the record.

@simone-silvestri
Collaborator Author

On it.

@simone-silvestri
Collaborator Author

simone-silvestri commented Sep 2, 2025

This is a benchmark of a fairly small hydrostatic model (64 × 32 × 8) initialized with a divergent velocity field (the benchmark_hydrostatic_model.jl file, revamped a bit), run on my Mac M1:
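
A minimal sketch of such a harness (default model settings and BenchmarkTools assumed; this is not the actual benchmark_hydrostatic_model.jl script):

using Oceananigans, BenchmarkTools

grid  = RectilinearGrid(size=(64, 32, 8), extent=(1, 1, 1))
model = HydrostaticFreeSurfaceModel(; grid)

time_step!(model, 1)          # warm up: compile before timing
@btime time_step!($model, 1)  # report time and allocations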

Branch main: [benchmark results image]

This branch: [benchmark results image]

In particular, allocations are down quite a bit. In terms of performance, there is an improvement across the board, though more for some free surfaces than others.

simone-silvestri merged commit 667bbd0 into main on Sep 2, 2025
67 of 70 checks passed
simone-silvestri deleted the ss/fix-bc-slowdown branch on September 2, 2025 at 18:25