CPU fastpath: lower per-step overhead (Boris, Yee update, filters)#72

Open
rogeriojorge wants to merge 2 commits into uwplasma:main from rogeriojorge:codex/cpu-5x-fastpath

@rogeriojorge

Summary

This PR adds a CPU-focused fast path (still pure Python/JAX; no C++/numba) aimed at reducing per-step overhead while keeping the physics the same.

Core changes are confined to a small set of runtime-critical files:

  • PyPIC3D/boris.py: vectorized Boris push; cheaper field interpolation; shape-factor treated as static where possible.
  • PyPIC3D/solvers/first_order_yee.py: removes per-step pad(..., mode="wrap") + slice; uses direct periodic roll stencils.
  • PyPIC3D/utils.py: replaces conv-based bilinear_filter/digital_filter with roll-based equivalents (same kernels).
  • PyPIC3D/J.py: reduces current-deposition overhead and fuses component handling.
  • PyPIC3D/evolve.py: enables buffer donation for (particles, fields).
  • PyPIC3D/initialization.py: ensures E/B/J are tuples (stable PyTree); adds enable_x64 + scan_chunk config keys.
  • PyPIC3D/__main__.py: optional chunked stepping; lazy-imports VTK diagnostics; reads enable_x64 from config.
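
The roll-based replacements listed above for the Yee update and the filters can be illustrated in one dimension. This is only a sketch under stated assumptions: NumPy stands in for jax.numpy (roll and pad are called the same way there), the function names are hypothetical, and the 1-2-1 kernel is an illustrative choice rather than the kernel PyPIC3D necessarily uses.

```python
import numpy as np  # stand-in for jax.numpy in this sketch

def forward_diff_pad(f, dx):
    """Periodic forward difference via wrap-padding followed by slicing."""
    fp = np.pad(f, 1, mode="wrap")       # one ghost cell on each side
    return (fp[2:] - fp[1:-1]) / dx      # f[i+1] - f[i]

def forward_diff_roll(f, dx):
    """Same stencil, using a periodic roll instead of pad + slice."""
    return (np.roll(f, -1) - f) / dx

def smooth_conv(f):
    """1-2-1 smoothing as a convolution over a wrap-padded array."""
    kernel = np.array([0.25, 0.5, 0.25])  # assumed kernel, for illustration
    return np.convolve(np.pad(f, 1, mode="wrap"), kernel, mode="valid")

def smooth_roll(f):
    """Same 1-2-1 smoothing written as a weighted sum of rolled copies."""
    return 0.25 * np.roll(f, 1) + 0.5 * f + 0.25 * np.roll(f, -1)

rng = np.random.default_rng(0)
f = rng.standard_normal(16)

# The stencil results match exactly; the filter results match to rounding.
print(np.array_equal(forward_diff_pad(f, 0.1), forward_diff_roll(f, 0.1)))
print(np.allclose(smooth_conv(f), smooth_roll(f)))
```

Both formulations compute the same periodic result; the roll form simply avoids materializing a padded copy of the field on every step.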

Benchmarks (CPU)

Benchmark artifacts (plots + raw numbers + perfetto traces) are stored on a separate branch/commit so they do not need to be merged into main:

  • Branch: rogeriojorge:codex/cpu-5x-fastpath-artifacts
  • Commit: 8dca5d0

Runtime (steady-state s/step)

[Figure: runtime_s_per_step]

[Figure: speedup]

Raw numbers:

origin/main: {"s_per_step": 0.0003897660415386781, "steps": 2000, "x64": true, "seed": 0}
fastpath:   {"s_per_step": 0.0002724121874780394, "steps": 2000, "x64": true, "seed": 0}
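
For reference, the speedup implied by these two records can be computed directly. The snippet below is a small sketch that parses the two JSON payloads quoted above; the variable names are illustrative.

```python
import json

# Records copied from the raw numbers above.
origin_main = json.loads(
    '{"s_per_step": 0.0003897660415386781, "steps": 2000, "x64": true, "seed": 0}'
)
fastpath = json.loads(
    '{"s_per_step": 0.0002724121874780394, "steps": 2000, "x64": true, "seed": 0}'
)

# Ratio of per-step times, and the absolute saving in microseconds.
speedup = origin_main["s_per_step"] / fastpath["s_per_step"]
saved_us = (origin_main["s_per_step"] - fastpath["s_per_step"]) * 1e6

print(f"speedup: {speedup:.2f}x, saved per step: {saved_us:.0f} us")
# prints "speedup: 1.43x, saved per step: 117 us"
```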

Accuracy (two-stream)

Electric field energy:
[Figure: two_stream_electric_energy]

Relative energy error:
[Figure: two_stream_energy_error]

Raw traces (.npz):

  • docs/benchmarks/results/two_stream_origin_main_energy.npz
  • docs/benchmarks/results/two_stream_fastpath_energy.npz

Profiling (Perfetto)

Perfetto traces (.trace.json.gz):

  • docs/benchmarks/profiles/two_stream_origin_main.trace.json.gz
  • docs/benchmarks/profiles/two_stream_fastpath.trace.json.gz

Repro

  • Config: demos/two_stream/two_stream.toml (plotting disabled for timing)
  • Seed: np.random.seed(0)
  • CPU, enable_x64=true

@rogeriojorge

For now, this is too large a change for too little gain. I'll do one final pass; if that doesn't change things, I'll close this PR.

@rogeriojorge

Update (commit 5184a38): added opt-in fast modes and more CPU reductions.

  • New simulation_parameters.fast_mode: off|fp32|aggressive|extreme.
    • aggressive: x64 off + shape_factor=1 + J filter none + forces all outputs off + compiles full-run when possible.
    • extreme: same as aggressive + forces relativistic=false (explicitly a physics approximation).
  • compute_energy now uses the numerically-stable relativistic KE formula m c^2 (gamma-1) so fp32 fast modes don’t produce negative kinetic energies.
  • Fixed 2D stacked-field interpolation broadcasting (multi-component interpolation now expands weights).
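
To illustrate why the form of the kinetic-energy expression matters in fp32, here is a minimal sketch. It is not the code in compute_energy: the function names are hypothetical, NumPy stands in for jax.numpy, and the rearrangement used, gamma - 1 = gamma^2 beta^2 / (gamma + 1), is one common stable way to evaluate m c^2 (gamma - 1), assumed here for illustration.

```python
import numpy as np  # stand-in for jax.numpy in this sketch

C = 299_792_458.0  # speed of light in m/s

def lorentz_gamma(v):
    """Lorentz factor from speed, computed in the dtype of v."""
    beta2 = (v / C) ** 2
    return 1.0 / np.sqrt(1.0 - beta2)

def ke_naive(m, v):
    """m c^2 (gamma - 1) evaluated literally; cancels badly when gamma ~ 1."""
    return m * C**2 * (lorentz_gamma(v) - 1.0)

def ke_stable(m, v):
    """Algebraically equal form m gamma^2 v^2 / (gamma + 1), no subtraction."""
    gamma = lorentz_gamma(v)
    return m * gamma**2 * v**2 / (gamma + 1.0)

# A slow particle in single precision: beta^2 is far below fp32 resolution.
v32 = np.asarray([1.0e3], dtype=np.float32)
m = 1.0  # unit mass, chosen only to keep the numbers readable

naive = ke_naive(m, v32)    # gamma rounds to exactly 1, so this collapses to 0
stable = ke_stable(m, v32)  # stays close to the classical 0.5 * m * v**2
print(naive, stable)
```

In this example the literal form loses the energy entirely, while the rearranged form recovers the classical value; when gamma is obtained by other routes, the same cancellation can plausibly leave a small negative result.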

Two-stream long bench (CPU, t_wind=5.413e-7, Nx=100,Ny=1,Nz=1, plotting disabled):

  • origin/main: 7.787e-4 s/step (from CLI output)
  • fast_mode=aggressive: 2.434e-4 s/step (~3.2×)
  • fast_mode=extreme: 2.162e-4 s/step (~3.6×)

I’m still not seeing a physics-preserving path to a true 5×+ on this workload without a much larger refactor (e.g., monolithic particle state like JAX-in-Cell) or additional physics approximations (electrostatic / frozen species / non-relativistic).
