Skip to content

(0.97.4) Speed up reductions and remove inference problems in Field constructor#4668

Merged
navidcy merged 51 commits intomainfrom
ss/speed-up-reductvions
Jul 26, 2025
Merged

(0.97.4) Speed up reductions and remove inference problems in Field constructor#4668
navidcy merged 51 commits intomainfrom
ss/speed-up-reductvions

Conversation

@simone-silvestri
Copy link
Copy Markdown
Collaborator

Apparently, map reduces performance; this is the cost of some simple reductions on main and on this branch:

on main

julia> grid = RectilinearGrid(size=(8, 8, 8), extent=(1, 1, 1), halo=(1, 1, 1));

julia> u = XFaceField(grid);

julia>= Field(Average(u));

julia> bmean_baseline = @benchmark mean(interior(u))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min  max):  9.000 μs  52.875 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     9.209 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.365 μs ±  1.489 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▇ █ ▄
  ▂▁▂▁▃▁█▁█▁█▁█▁█▁▆▁▄▁▃▁▃▁▃▁▃▁▂▂▁▂▁▂▁▂▁▂▁▂▁▂▁▂▁▂▁▂▁▂▁▂▁▂▁▂▁▂ ▃
  9 μs           Histogram: frequency by time        10.2 μs <

 Memory estimate: 1.92 KiB, allocs estimate: 58.

julia> bmean = @benchmark mean(u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min  max):  161.083 μs   2.163 ms  ┊ GC (min  max): 0.00%  89.16%
 Time  (median):     163.208 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   165.978 μs ± 56.813 μs  ┊ GC (mean ± σ):  0.98% ±  2.64%

   ▄▆███▇▅▄▄▄▅▃▃▂▁▁▁▁                                          ▂
  ███████████████████████▇▇▇▇▇▇▇▇▆▆▆▆▅▆▅▄▅▃▅▅▅▅▃▄▄▅▃▄▅▅▅▄▂▃▅▄▄ █
  161 μs        Histogram: log(frequency) by time       185 μs <

 Memory estimate: 65.73 KiB, allocs estimate: 732.

julia> bavg = @benchmark compute!(ū)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min  max):  161.084 μs   2.494 ms  ┊ GC (min  max): 0.00%  87.99%
 Time  (median):     163.583 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   167.401 μs ± 76.304 μs  ┊ GC (mean ± σ):  1.58% ±  3.23%

     ▂▇█▇▂
  ▁▂▅█████▇▆▄▄▄▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  161 μs          Histogram: frequency by time          185 μs <

 Memory estimate: 89.63 KiB, allocs estimate: 711.

julia> bsum_baseline = @benchmark sum(interior(u))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min  max):  18.125 μs  115.083 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     18.417 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.673 μs ±   2.207 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▇██▆▅▅▄▂▁▁                                                  ▂
  ███████████▇▇▆█▇▆▇▇▇▆▇▇▇▇▇▆▅▅▄▄▅▄▄▄▁▃▄▁▃▁▁▃▁▁▁▁▁▃▃▄▁▃▁▃▃▃▄▃▃ █
  18.1 μs       Histogram: log(frequency) by time      23.8 μs <

 Memory estimate: 1.92 KiB, allocs estimate: 58.

On this PR

julia> bmean_baseline = @benchmark mean(interior(u))
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min  max):  1.733 μs   4.021 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.762 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.765 μs ± 42.596 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▅  █  █  ▃  ▁
  ▂▁▂▁▁▃▁▁▅▁▁█▁▁█▁▁█▁▁█▁▁█▁▁█▁▆▁▁▄▁▁▄▁▁▃▁▁▃▁▁▃▁▁▃▁▁▂▁▁▂▁▁▂▁▂ ▃
  1.73 μs        Histogram: frequency by time        1.82 μs <

 Memory estimate: 192 bytes, allocs estimate: 5.

julia> bmean = @benchmark mean(u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min  max):  93.209 μs   2.457 ms  ┊ GC (min  max): 0.00%  93.81%
 Time  (median):     94.958 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   97.578 μs ± 61.457 μs  ┊ GC (mean ± σ):  1.74% ±  2.65%

      ▂█▇▇▂▁
  ▁▂▃▇██████▅▄▂▂▂▃▃▃▃▃▃▂▂▂▂▁▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  93.2 μs         Histogram: frequency by time         107 μs <

 Memory estimate: 58.80 KiB, allocs estimate: 520.

julia> bavg = @benchmark compute!(ū)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min  max):  109.542 μs   2.555 ms  ┊ GC (min  max): 0.00%  89.15%
 Time  (median):     112.583 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   117.427 μs ± 77.285 μs  ┊ GC (mean ± σ):  2.18% ±  3.18%

   ▅▇█▇▆▆▅▅▄▄▃▃▃▃▂▃▂▂▁▁▁▁                                      ▂
  ██████████████████████████████▆█▇▇▇▆▆▆▆▄▆▆▆▅▅▆▅▅▆▅▄▅▅▅▅▃▅▁▅▄ █
  110 μs        Histogram: log(frequency) by time       149 μs <

 Memory estimate: 84.43 KiB, allocs estimate: 552.

julia> bsum_baseline = @benchmark sum(interior(u))
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min  max):  1.771 μs   3.967 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.792 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.798 μs ± 50.658 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▃  █  █  ▄  ▁
  ▂▁▁▃▁▁▅▁▁█▁▁█▁▁█▁▁█▁▁█▁▁▆▁▁▅▁▁▄▁▁▄▁▁▄▁▁▃▁▁▃▁▁▂▁▁▂▁▁▂▁▁▂▁▁▂ ▃
  1.77 μs        Histogram: frequency by time        1.85 μs <

 Memory estimate: 192 bytes, allocs estimate: 5.

julia>

There are still differences between the reduction of interior and reduction of fields that can be ascribed completely to the Field constructor. I ll try to solve also that issue.

@glwagner
Copy link
Copy Markdown
Member

Nice work. Perhaps, there is a type inference issue (the usual culprit)...

@simone-silvestri
Copy link
Copy Markdown
Collaborator Author

simone-silvestri commented Jul 24, 2025

Yeah, the Field and the FieldBoundaryConditions constructors seem to have some inference problems related to the location.
image

@simone-silvestri
Copy link
Copy Markdown
Collaborator Author

simone-silvestri commented Jul 24, 2025

Indeed, there is quite a difference between allocating vs non-allocating reductions:

julia> ur = Field{Nothing, Nothing, Nothing}(grid);

julia> @benchmark sum(interior(u))
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min  max):  1.779 μs   4.162 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.817 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.824 μs ± 64.381 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                   ▃  ▇ █  ▇ ▃
  ▂▁▂▁▁▂▁▂▁▁▃▁▄▁▁▇▁█▁▁█▁█▁▁█▁█▁▁█▁▆▁▁▅▁▄▁▁▃▁▃▁▁▃▁▂▁▁▂▁▂▁▁▂▁▂ ▃
  1.78 μs        Histogram: frequency by time        1.88 μs <

 Memory estimate: 192 bytes, allocs estimate: 5.

julia> @benchmark sum(u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min  max):  46.875 μs   2.416 ms  ┊ GC (min  max): 0.00%  96.24%
 Time  (median):     47.958 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.826 μs ± 40.719 μs  ┊ GC (mean ± σ):  1.42% ±  1.67%

          ▁▄▇ ██▆▆▄ ▂
  ▁▁▂▂▃▂▅▇█████████▆██▆▅▅▂▃▃▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  46.9 μs         Histogram: frequency by time        51.4 μs <

 Memory estimate: 27.88 KiB, allocs estimate: 253.

julia> @benchmark sum!(ur, u)
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min  max):  1.492 μs  263.913 μs  ┊ GC (min  max): 0.00%  98.28%
 Time  (median):     1.546 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.620 μs ±   2.629 μs  ┊ GC (mean ± σ):  1.60% ±  0.98%

   ▃▅█▂▁   ▁
  ▃█████▄▅▆█▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂ ▃
  1.49 μs         Histogram: frequency by time        2.11 μs <

 Memory estimate: 1.30 KiB, allocs estimate: 5.

So I guess the culprit is not the condition_operand

@simone-silvestri
Copy link
Copy Markdown
Collaborator Author

simone-silvestri commented Jul 24, 2025

Apparently, constructors have problems in inference when passing types instead of the instantiated. Working with instantiated locations rather than types speeds up considerably the computations.

This is the most recent result:

julia> @benchmark mean(u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min  max):  11.542 μs  698.083 μs  ┊ GC (min  max): 0.00%  95.18%
 Time  (median):     12.166 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.474 μs ±   9.701 μs  ┊ GC (mean ± σ):  1.05% ±  1.33%

    ▁▄▃▄▅▆▆▇█▆▆▆▆▄▃▃▂▁▁▂ ▂▃▂▂▂▃▂▁▃▁▁▁▁  ▁                      ▂
  ▅▇███████████████████████████████████▇█▆▇▆▇▆▆▇▇▇▇█▇▅▇▅▆▅▅▅▄▅ █
  11.5 μs       Histogram: log(frequency) by time      14.8 μs <

 Memory estimate: 10.69 KiB, allocs estimate: 65.

so now mean(u) is basically as fast as mean(interior(u)) was on main. However, it is slower than mean(interior(u)) on this branch...

@simone-silvestri
Copy link
Copy Markdown
Collaborator Author

simone-silvestri commented Jul 24, 2025

This might become a lengthy PR because it changes the constructors so it might take a while to make sure everything is consistent, however, I feel like this PR might also improve the time for building models, because it removes quite some inference problems in the field constructor and boundary conditions constructors

@simone-silvestri simone-silvestri changed the title Speed up reductions Speed up reductions and remove inference problems in Field constructor Jul 24, 2025
@simone-silvestri simone-silvestri added performance 🏍️ So we can get the wrong answer even faster benchmark performance runs preconfigured benchamarks and spits out timing labels Jul 24, 2025
@simone-silvestri
Copy link
Copy Markdown
Collaborator Author

I think all tests are fixed now, this should be ready to review.

This PR changes the field constructor from Field((Center, Center, Center), grid)
to Field((Center(), Center(), Center()), grid)

Note that the Field{Center, Center, Center}(grid) constructor is still valid and encouraged.

Comment thread src/OrthogonalSphericalShellGrids/distributed_tripolar_grid.jl
Comment thread src/OrthogonalSphericalShellGrids/distributed_tripolar_grid.jl
Comment thread Project.toml
Comment thread src/OrthogonalSphericalShellGrids/distributed_tripolar_grid.jl
Comment thread src/OrthogonalSphericalShellGrids/distributed_tripolar_grid.jl Outdated
@navidcy navidcy changed the title Speed up reductions and remove inference problems in Field constructor (0.97.4) Speed up reductions and remove inference problems in Field constructor Jul 26, 2025
@navidcy
Copy link
Copy Markdown
Member

navidcy commented Jul 26, 2025

Some of the test in the test_boundary_condition_integration.jl fail, in particular those with immersed boundary conditions.

Not sure if this is related to #4670?

Look at this:

using Oceananigans
underlying_grid = RectilinearGrid(topology=(Bounded, Periodic, Bounded), size = (2, 2, 2), x = (0, 0.3), y = (0, 0.4), z=(-0.5, 0.5))
grid = ImmersedBoundaryGrid(underlying_grid, GridFittedBottom((x, y) -> 0))
boundary_conditions = (; :c => FieldBoundaryConditions(immersed = BoundaryCondition(Flux(), π)))
model = NonhydrostaticModel(; grid, boundary_conditions, tracers=:c)
model.tracers.c.boundary_conditions

on #main

Oceananigans.FieldBoundaryConditions, with boundary conditions
├── west: FluxBoundaryCondition: Nothing
├── east: FluxBoundaryCondition: Nothing
├── south: PeriodicBoundaryCondition
├── north: PeriodicBoundaryCondition
├── bottom: FluxBoundaryCondition: Nothing
├── top: FluxBoundaryCondition: Nothing
└── immersed: ImmersedBoundaryCondition with west=Flux, east=Flux, south=Flux, north=Flux, bottom=Flux, top=Flux

on this PR

Oceananigans.FieldBoundaryConditions, with boundary conditions
├── west: FluxBoundaryCondition: Nothing
├── east: FluxBoundaryCondition: Nothing
├── south: PeriodicBoundaryCondition
├── north: PeriodicBoundaryCondition
├── bottom: FluxBoundaryCondition: Nothing
├── top: FluxBoundaryCondition: Nothing
└── immersed: ImmersedBoundaryCondition with west=Nothing, east=Nothing, south=Nothing, north=Nothing, bottom=Nothing, top=Nothing

Comment thread src/ImmersedBoundaries/immersed_boundary_condition.jl
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

benchmark performance runs preconfigured benchamarks and spits out timing performance 🏍️ So we can get the wrong answer even faster

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants