Refactor fill_halo_regions!: Reduce boundary conditions slowdown#4706
Refactor fill_halo_regions!: Reduce boundary conditions slowdown#4706simone-silvestri merged 126 commits intomainfrom
fill_halo_regions!: Reduce boundary conditions slowdown#4706Conversation
…nto ss/catke-for-rk3
|
a field in this PR looks something like this julia> c = CenterField(grid)
1×1×10 Field{Center, Center, Center} on RectilinearGrid on CPU
├── grid: 1×1×10 RectilinearGrid{Float64, Periodic, Flat, Bounded} on CPU with 1×0×3 halo
├── boundary conditions: FieldBoundaryConditions
│ └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 3×1×16 OffsetArray(::Array{Float64, 3}, 0:2, 1:1, -2:13) with eltype Float64 with indices 0:2×1:1×-2:13
└── max=0.0, min=0.0, mean=0.0
julia> c.boundary_conditions.kernels
3-element Vector{Function}:
fill_bottom_and_top_halo! (generic function with 5 methods)
fill_south_and_north_halo! (generic function with 11 methods)
fill_west_and_east_halo! (generic function with 8 methods)
julia> c.boundary_conditions.ordered_bcs
((FluxBoundaryCondition: Nothing, FluxBoundaryCondition: Nothing), (nothing, nothing), (PeriodicBoundaryCondition, PeriodicBoundaryCondition)) |
Co-authored-by: Navid C. Constantinou <navidcy@users.noreply.github.com>
|
an example of the information ordered and stored in the field boundary condition is julia> grid = RectilinearGrid(size = (12, 1), x = (0, 1), z = (0,1), topology = (Periodic, Flat, Bounded))
12×1×1 RectilinearGrid{Float64, Periodic, Flat, Bounded} on CPU with 3×0×1 halo
├── Periodic x ∈ [0.0, 1.0) regularly spaced with Δx=0.0833333
├── Flat y
└── Bounded z ∈ [0.0, 1.0] regularly spaced with Δz=1.0
julia> c = CenterField(grid)
12×1×1 Field{Center, Center, Center} on RectilinearGrid on CPU
├── grid: 12×1×1 RectilinearGrid{Float64, Periodic, Flat, Bounded} on CPU with 3×0×1 halo
├── boundary conditions: FieldBoundaryConditions
│ └── west: Periodic, east: Periodic, south: Nothing, north: Nothing, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 18×1×3 OffsetArray(::Array{Float64, 3}, -2:15, 1:1, 0:2) with eltype Float64 with indices -2:15×1:1×0:2
└── max=0.0, min=0.0, mean=0.0
julia> c.boundary_conditions
Oceananigans.FieldBoundaryConditions, with boundary conditions
├── west: PeriodicBoundaryCondition
├── east: PeriodicBoundaryCondition
├── south: Nothing
├── north: Nothing
├── bottom: FluxBoundaryCondition: Nothing
├── top: FluxBoundaryCondition: Nothing
└── immersed: FluxBoundaryCondition: Nothing
julia> c.boundary_conditions.ordered_bcs
((FluxBoundaryCondition: Nothing, FluxBoundaryCondition: Nothing), (nothing, nothing), (PeriodicBoundaryCondition, PeriodicBoundaryCondition))
julia> c.boundary_conditions.kernels
(KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!), nothing, KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(1, 3)}, Oceananigans.Utils.OffsetStaticSize{(1:1, 1:3)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!))
julia> c.boundary_conditions.kernels[1]
KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!)
julia> c.boundary_conditions.kernels[2]
julia> c.boundary_conditions.kernels[3]
KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(1, 3)}, Oceananigans.Utils.OffsetStaticSize{(1:1, 1:3)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!) |
|
Should |
|
Ok, the new example of stored info is julia> grid = RectilinearGrid(size = (12, 2), x = (0, 1), z = (0, 1), topology = (Periodic, Flat, Bounded));
julia> c = CenterField(grid);
julia> c.boundary_conditions.ordered_bcs
(bottom_and_top = (FluxBoundaryCondition: Nothing, FluxBoundaryCondition: Nothing), south_and_north = (nothing, nothing), west_and_east = (PeriodicBoundaryCondition, PeriodicBoundaryCondition))
julia> c.boundary_conditions.kernels
(bottom_and_top = KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, KernelAbstractions.NDIteration.StaticSize{(12, 1)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_bottom_and_top_halo!), south_and_north = nothing, west_and_east = KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.StaticSize{(1, 6)}, Oceananigans.Utils.OffsetStaticSize{(1:1, 1:6)}, typeof(Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!)}(KernelAbstractions.CPU(false), Oceananigans.BoundaryConditions.cpu__fill_periodic_west_and_east_halo!)) |
looks great! |
| window_boundary_conditions(::Colon, left, right) = left, right | ||
|
|
||
| # The only thing we need | ||
| Adapt.adapt_structure(to, fbcs::FieldBoundaryConditions) = (kernels = bcs.kernels, Adapt.adapt(to, fbcs.ordered_bcs)) |
There was a problem hiding this comment.
why do you need the kernels on GPU?
There was a problem hiding this comment.
I was playing with making fill_halo_regions! possible to call with adapted input here
Oceananigans.jl/src/Fields/field.jl
Lines 809 to 827 in 7436103
so that we do not need to adapt the grid a bunch of times, which is, I suspect, the reason for high launch latency.
I can always only adapt the grid.
|
This should be ready to merge, I think all tests pass except for the reactant ones |
|
@simone-silvestri could you resolve the conflicts that came after #4687? |
… and use Tuples in `boundary_mass_fluxes` (#4687) * convert vectors to tuples * remove warning * matching_scheme -> scheme * revert some undesirable changes * mograte to new interface * cleanup * navid's suggestions * bump minor version * remove FlatExtrapolation * migrate validation to new interface * slightly better formatting * Update src/Models/NonhydrostaticModels/boundary_mass_fluxes.jl * rename perturbation advection file * oops, typo * change file name to perturbation_advection.jl * Update src/Models/NonhydrostaticModels/boundary_mass_fluxes.jl Co-authored-by: Simone Silvestri <silvestri.simone0@gmail.com> * Update cubed_sphere_grid.jl * fix doctest --------- Co-authored-by: Simone Silvestri <silvestri.simone0@gmail.com> Co-authored-by: Gregory L. Wagner <wagner.greg@gmail.com> Co-authored-by: Navid C. Constantinou <navidcy@users.noreply.github.com>
|
Tests pass, I would need an approval to merge |
|
Seems like tests pass 🎉 @simone-silvestri maybe I missed this, but can we expect a significant speedup in simulations after this PR is merged? |
|
Yeah, generally for smaller models on GPU (for example for a one degree ocean model), and more so for models that involve a lot of fill halos. for sea ice dynamics for example we have a speed up of a factor of 5 |
|
it might be nice to report a benchmark for some small problem in this PR for record purposes |
|
On it. |


A lot of slowdown in the
fill_halo_regions!function comes from the fact that every time we need to fill the halos, we are allocating arrays and permuting them to figure out what is the order of execution of the boundary conditions. This procedure is filled with type instability that leads to a lot of allocation, which eventually needs to be cleaned up.This information, however, is known at model construction, so it is relatively easy to store this info in the
FieldBoundaryConditionsand use it when we need it.This PR does exactly this, avoiding the need for
permute_boundary_conditionsEdit: This PR is actually becoming quite a refactor. The objective of this PR is to have preconfigured kernels stored in the field to be able to fill halo regions, as part of this PR
fill_boundary_normal_velocitiesis changed tofill_open_bcsWestAndEast,West,East,South, ...)MultiRegionCommunicationin the same direction as another BC, following what we do for distributed BC. This simplifies the code, but requires a double pass for the moment, which might not be optimal. However,MultiRegionis not really used much, so optimization has a low priority there (the cubed sphere is not affected).This should lead to a cleaner and more straightforward halo regions filling algorithm, as well as slimmer code.