Refactor fill_halo_regions!: Reduce boundary conditions slowdown #4706

Merged

simone-silvestri merged 126 commits into main from ss/fix-bc-slowdown on Sep 2, 2025

Conversation

@simone-silvestri
Collaborator

simone-silvestri commented Aug 8, 2025

A lot of the slowdown in the fill_halo_regions! function comes from the fact that, every time we need to fill the halos, we allocate arrays and permute them to figure out the order in which the boundary conditions should be executed. This procedure is riddled with type instabilities that lead to a lot of allocations, which eventually need to be garbage collected.

This information, however, is known at model construction time, so it is relatively easy to store it in the FieldBoundaryConditions and reuse it whenever we need it.

This PR does exactly that, avoiding the need for permute_boundary_conditions.
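
As a rough illustration of the idea (a minimal sketch with hypothetical names, not the PR's actual implementation), the ordering work moves from every fill call to a one-time construction step:

# Construction time: do the (potentially type-unstable) ordering work once.
struct PreorderedBoundaryConditions{B}
    ordered_bcs :: B  # e.g. ((bottom, top), (south, north), (west, east))
end

preorder(bcs::NamedTuple) = PreorderedBoundaryConditions(
    ((bcs.bottom, bcs.top), (bcs.south, bcs.north), (bcs.west, bcs.east)))

# Fill time: iterate the stored, already-ordered tuple; no allocation,
# no permutation.
function fill_all_halos!(data, pbcs::PreorderedBoundaryConditions)
    for (left_bc, right_bc) in pbcs.ordered_bcs
        # ... fill the two halos governed by (left_bc, right_bc) ...
    end
    return nothing
end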

Edit: This PR is actually becoming quite a refactor. The objective is to store preconfigured kernels in the field so they can be used to fill the halo regions. As part of this PR:

  • the kwarg fill_boundary_normal_velocities is changed to fill_open_bcs
  • tupled fill halo regions is removed (we need to test thoroughly if this leads to slowdown somewhere).
  • we add types that define the boundaries to be filled (WestAndEast, West, East, South, ...); see the sketch after this list
  • the boundaries are split in the case of a MultiRegionCommunication in the same direction as another BC, following what we do for distributed BCs. This simplifies the code but requires a double pass for the moment, which might not be optimal. However, MultiRegion is not really used much, so optimization has low priority there (the cubed sphere is not affected).
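
A toy illustration of the boundary types mentioned above (West, East, and WestAndEast come from the PR text; the periodic halo-copying below is a stand-in for the real kernels):

abstract type FillSide end
struct West        <: FillSide end
struct East        <: FillSide end
struct WestAndEast <: FillSide end

# Toy periodic halo fills on a 1D array with halo width H; dispatching on the
# side type selects the work at compile time, with no runtime bookkeeping.
fill_halo!(c, ::West, H)        = (c[1:H] .= c[end-2H+1:end-H]; nothing)
fill_halo!(c, ::East, H)        = (c[end-H+1:end] .= c[H+1:2H]; nothing)
fill_halo!(c, ::WestAndEast, H) = (fill_halo!(c, West(), H); fill_halo!(c, East(), H))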

This should lead to a cleaner and more straightforward halo-filling algorithm, as well as slimmer code.

@simone-silvestri
Collaborator Author

A field in this PR looks something like this:

julia> c = CenterField(grid)
1×1×10 Field{Center, Center, Center} on RectilinearGrid on CPU
├── grid: 1×1×10 RectilinearGrid{Float64, Periodic, Flat, Bounded} on CPU with 1×0×3 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 3×1×16 OffsetArray(::Array{Float64, 3}, 0:2, 1:1, -2:13) with eltype Float64 with indices 0:2×1:1×-2:13
    └── max=0.0, min=0.0, mean=0.0

julia> c.boundary_conditions.kernels
3-element Vector{Function}:
 fill_bottom_and_top_halo! (generic function with 5 methods)
 fill_south_and_north_halo! (generic function with 11 methods)
 fill_west_and_east_halo! (generic function with 8 methods)

julia> c.boundary_conditions.ordered_bcs
((FluxBoundaryCondition: Nothing, FluxBoundaryCondition: Nothing), (nothing, nothing), (PeriodicBoundaryCondition, PeriodicBoundaryCondition))
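
A minimal sketch (not the PR's exact implementation) of how the stored kernels and ordered boundary conditions could be consumed together at fill time:

function fill_halo_regions_sketch!(data, kernels, ordered_bcs, args...)
    for (kernel!, (left_bc, right_bc)) in zip(kernels, ordered_bcs)
        kernel! === nothing && continue   # Flat directions store no kernel
        kernel!(data, left_bc, right_bc, args...)
    end
    return nothing
end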

navidcy added the performance 🏍️ (So we can get the wrong answer even faster) and boundary conditions 🏓 labels on Aug 8, 2025
simone-silvestri and others added 2 commits August 26, 2025 11:49
Co-authored-by: Navid C. Constantinou <navidcy@users.noreply.github.com>
@simone-silvestri
Collaborator Author

An example of the information ordered and stored in the field boundary conditions:

julia> grid = RectilinearGrid(size = (12, 1), x = (0, 1), z = (0,1), topology = (Periodic, Flat, Bounded))
12×1×1 RectilinearGrid{Float64, Periodic, Flat, Bounded} on CPU with 3×0×1 halo
├── Periodic x  [0.0, 1.0) regularly spaced with Δx=0.0833333
├── Flat y
└── Bounded  z  [0.0, 1.0] regularly spaced with Δz=1.0

julia> c = CenterField(grid)
12×1×1 Field{Center, Center, Center} on RectilinearGrid on CPU
├── grid: 12×1×1 RectilinearGrid{Float64, Periodic, Flat, Bounded} on CPU with 3×0×1 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 18×1×3 OffsetArray(::Array{Float64, 3}, -2:15, 1:1, 0:2) with eltype Float64 with indices -2:15×1:1×0:2
    └── max=0.0, min=0.0, mean=0.0

julia> c.boundary_conditions
Oceananigans.FieldBoundaryConditions, with boundary conditions
├── west: PeriodicBoundaryCondition
├── east: PeriodicBoundaryCondition
├── south: Nothing
├── north: Nothing
├── bottom: FluxBoundaryCondition: Nothing
├── top: FluxBoundaryCondition: Nothing
└── immersed: FluxBoundaryCondition: Nothing

julia> c.boundary_conditions.ordered_bcs
((FluxBoundaryCondition: Nothing, FluxBoundaryCondition: Nothing), (nothing, nothing), (PeriodicBoundaryCondition, PeriodicBoundaryCondition))

julia> c.boundary_conditions.kernels
(KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!), nothing, KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(1, 3)}, Oceananigans.Utils.OffsetStaticSize{(1:1, 1:3)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!))

julia> c.boundary_conditions.kernels[1]
KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!)

julia> c.boundary_conditions.kernels[2]   # returns nothing: the y-direction is Flat

julia> c.boundary_conditions.kernels[3]
KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(1, 3)}, Oceananigans.Utils.OffsetStaticSize{(1:1, 1:3)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!)

@glwagner
Member

Should boundary_conditions.kernels be a named tuple so we know which kernel fills which boundary?

@simone-silvestri
Collaborator Author

simone-silvestri commented Aug 28, 2025

Ok, the new example of stored info is:

julia> grid = RectilinearGrid(size = (12, 2), x = (0, 1), z = (0, 1), topology = (Periodic, Flat, Bounded));

julia> c = CenterField(grid);

julia> c.boundary_conditions.ordered_bcs
(bottom_and_top = (FluxBoundaryCondition: Nothing, FluxBoundaryCondition: Nothing), south_and_north = (nothing, nothing), west_and_east = (PeriodicBoundaryCondition, PeriodicBoundaryCondition))

julia> c.boundary_conditions.kernels
(bottom_and_top = KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!), south_and_north = nothing, west_and_east = KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(1, 6)}, Oceananigans.Utils.OffsetStaticSize{(1:1, 1:6)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!))
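
With the NamedTuple layout, each entry can also be retrieved by name, for example (continuing from the field above):

c.boundary_conditions.kernels.west_and_east       # the periodic x-direction kernel
c.boundary_conditions.ordered_bcs.bottom_and_top  # the (bottom, top) flux pair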

@glwagner
Member

(quoting the stored-info example above)

looks great!

window_boundary_conditions(::Colon, left, right) = left, right

# The only thing we need
Adapt.adapt_structure(to, fbcs::FieldBoundaryConditions) = (kernels = fbcs.kernels, ordered_bcs = Adapt.adapt(to, fbcs.ordered_bcs))
Member

why do you need the kernels on GPU?

Collaborator Author

I was playing with making it possible to call fill_halo_regions! with adapted input here:

function fill_halo_regions!(field::Field, positional_args...; kwargs...)
    arch = architecture(field.grid)
    args = (field.data,
            field.boundary_conditions,
            field.indices,
            instantiated_location(field),
            field.grid,
            positional_args...)

    # Manually convert args... to be passed to the fill_halo_regions! function.
    GC.@preserve args begin
        converted_args = convert_to_device(arch, args)
        fill_halo_regions!(converted_args...; kwargs...)
    end

    return nothing
end

so that we do not need to adapt the grid a bunch of times, which I suspect is the reason for the high launch latency.
I can always adapt only the grid.
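
For reference, a minimal sketch of the "adapt only the grid" alternative mentioned above (reusing the helper names from the snippet; not a tested implementation):

function fill_halo_regions_grid_only!(field::Field, positional_args...; kwargs...)
    arch = architecture(field.grid)
    grid = convert_to_device(arch, field.grid)  # adapt the grid once, here
    return fill_halo_regions!(field.data, field.boundary_conditions, field.indices,
                              instantiated_location(field), grid, positional_args...;
                              kwargs...)
end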

@simone-silvestri
Collaborator Author

This should be ready to merge; I think all tests pass except for the Reactant ones.

@navidcy
Member

navidcy commented Sep 2, 2025

@simone-silvestri could you resolve the conflicts that came after #4687?
the doctest should be fixed once we merge main back here ;)

simone-silvestri referenced this pull request Sep 2, 2025
… and use Tuples in `boundary_mass_fluxes` (#4687)

* convert vectors to tuples

* remove warning

* matching_scheme -> scheme

* revert some undesirable changes

* mograte to new interface

* cleanup

* navid's suggestions

* bump minor version

* remove FlatExtrapolation

* migrate validation to new interface

* slightly better formatting

* Update src/Models/NonhydrostaticModels/boundary_mass_fluxes.jl

* rename perturbation advection file

* oops, typo

* change file name to perturbation_advection.jl

* Update src/Models/NonhydrostaticModels/boundary_mass_fluxes.jl

Co-authored-by: Simone Silvestri <silvestri.simone0@gmail.com>

* Update cubed_sphere_grid.jl

* fix doctest

---------

Co-authored-by: Simone Silvestri <silvestri.simone0@gmail.com>
Co-authored-by: Gregory L. Wagner <wagner.greg@gmail.com>
Co-authored-by: Navid C. Constantinou <navidcy@users.noreply.github.com>
@simone-silvestri
Collaborator Author

Tests pass; I would need an approval to merge.

@tomchor
Member

tomchor commented Sep 2, 2025

Seems like tests pass 🎉

@simone-silvestri maybe I missed this, but can we expect a significant speedup in simulations after this PR is merged?

@simone-silvestri
Collaborator Author

Yeah, generally for smaller models on GPU (for example, a one-degree ocean model), and more so for models that involve a lot of halo fills. For sea ice dynamics, for example, we see a speedup of a factor of 5.

@glwagner
Member

glwagner commented Sep 2, 2025

It might be nice to report a benchmark for some small problem in this PR, for the record.

@simone-silvestri
Collaborator Author

On it.

@simone-silvestri
Collaborator Author

simone-silvestri commented Sep 2, 2025

This is a benchmark of a fairly small hydrostatic model (64 × 32 × 8) initialized with a divergent velocity field (the benchmark_hydrostatic_model.jl file, revamped a bit), run on my Mac M1:
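
A minimal sketch of such a harness (default model settings and BenchmarkTools assumed; this is not the actual benchmark_hydrostatic_model.jl script):

using Oceananigans, BenchmarkTools

grid  = RectilinearGrid(size=(64, 32, 8), extent=(1, 1, 1))
model = HydrostaticFreeSurfaceModel(; grid)

time_step!(model, 1)          # warm up: compile before timing
@btime time_step!($model, 1)  # report time and allocations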

Branch main: [benchmark results image]

This branch: [benchmark results image]

In particular, allocations are down quite a bit. In terms of performance, there is an improvement across the board, though more for some free surfaces than others.

simone-silvestri merged commit 667bbd0 into main on Sep 2, 2025
67 of 70 checks passed
simone-silvestri deleted the ss/fix-bc-slowdown branch on September 2, 2025 at 18:25