CPU fastpath: lower per-step overhead (Boris, Yee update, filters) by rogeriojorge · Pull Request #72 · uwplasma/PyPIC3D

rogeriojorge · 2026-03-14T20:15:48Z

Summary

This PR adds a CPU-focused fast path (still pure Python/JAX; no C++/numba) aimed at reducing per-step overhead while keeping the physics the same.

Core changes are confined to a small set of runtime-critical files:

PyPIC3D/boris.py: vectorized Boris push; cheaper field interpolation; shape-factor treated as static where possible.
PyPIC3D/solvers/first_order_yee.py: removes per-step pad(..., mode="wrap") + slice; uses direct periodic roll stencils.
PyPIC3D/utils.py: replaces conv-based bilinear_filter/digital_filter with roll-based equivalents (same kernels).
PyPIC3D/J.py: reduces current-deposition overhead and fuses component handling.
PyPIC3D/evolve.py: enables buffer donation for (particles, fields).
PyPIC3D/initialization.py: ensures E/B/J are tuples (stable PyTree); adds enable_x64 + scan_chunk config keys.
PyPIC3D/__main__.py: optional chunked stepping; lazy-imports VTK diagnostics; reads enable_x64 from config.
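The stencil and filter changes listed above rest on one identity: on a periodic grid, a wrap-pad followed by slicing (or a convolution) gives the same result as summing rolled copies of the array. The sketch below checks that identity in 1D. It is an illustration only: the function names, the forward-difference stencil, and the 1-2-1 kernel are assumptions chosen for the example, not code taken from first_order_yee.py or utils.py.

```python
import jax.numpy as jnp


def forward_diff_pad(f, dx):
    """Periodic forward difference via explicit wrap-padding and slicing."""
    fp = jnp.pad(f, (0, 1), mode="wrap")  # append f[0] after the last cell
    return (fp[1:] - fp[:-1]) / dx


def forward_diff_roll(f, dx):
    """Same stencil using a periodic roll; no padded temporary is built."""
    return (jnp.roll(f, -1) - f) / dx


def smooth_conv(f):
    """1-2-1 smoothing via wrap-pad plus convolution (assumed kernel)."""
    kernel = jnp.array([0.25, 0.5, 0.25])
    fp = jnp.pad(f, (1, 1), mode="wrap")  # one ghost cell on each side
    return jnp.convolve(fp, kernel, mode="valid")


def smooth_roll(f):
    """Same 1-2-1 kernel written as a weighted sum of rolled copies."""
    return 0.25 * jnp.roll(f, 1) + 0.5 * f + 0.25 * jnp.roll(f, -1)


# Periodic test signal on 64 cells.
x = jnp.linspace(0.0, 2.0 * jnp.pi, 64, endpoint=False)
f = jnp.sin(x) + 0.3 * jnp.cos(3.0 * x)
dx = float(x[1] - x[0])

# Largest pointwise disagreement between the two formulations.
diff_err = float(jnp.max(jnp.abs(forward_diff_pad(f, dx) - forward_diff_roll(f, dx))))
smooth_err = float(jnp.max(jnp.abs(smooth_conv(f) - smooth_roll(f))))
```

Both error values should sit at floating-point round-off, which is why the roll-based versions can replace the padded ones without changing the physics.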

Benchmarks (CPU)

Benchmark artifacts (plots + raw numbers + perfetto traces) are stored on a separate branch/commit so they do not need to be merged into main:

Branch: rogeriojorge:codex/cpu-5x-fastpath-artifacts
Commit: 8dca5d0

Runtime (steady-state s/step)

Raw numbers:

origin/main: {"s_per_step": 0.0003897660415386781, "steps": 2000, "x64": true, "seed": 0}
fastpath:   {"s_per_step": 0.0002724121874780394, "steps": 2000, "x64": true, "seed": 0}
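For reference, the speedup implied by these two records can be computed directly. The snippet below only does arithmetic on the numbers quoted above; the variable names are illustrative.

```python
import json

# Raw records copied verbatim from the benchmark output above.
baseline = json.loads(
    '{"s_per_step": 0.0003897660415386781, "steps": 2000, "x64": true, "seed": 0}'
)
fastpath = json.loads(
    '{"s_per_step": 0.0002724121874780394, "steps": 2000, "x64": true, "seed": 0}'
)

# Ratio of per-step times and absolute saving in microseconds.
speedup = baseline["s_per_step"] / fastpath["s_per_step"]
saved_us = (baseline["s_per_step"] - fastpath["s_per_step"]) * 1e6

print(f"speedup: {speedup:.2f}x, saved per step: {saved_us:.0f} us")
```

This works out to roughly a 1.43× speedup, or about 117 µs saved per step.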

Accuracy (two-stream)

Electric field energy:

Relative energy error:

Raw traces (.npz):

docs/benchmarks/results/two_stream_origin_main_energy.npz
docs/benchmarks/results/two_stream_fastpath_energy.npz

Profiling (Perfetto)

Perfetto traces (.trace.json.gz):

docs/benchmarks/profiles/two_stream_origin_main.trace.json.gz
docs/benchmarks/profiles/two_stream_fastpath.trace.json.gz

Repro

Config: demos/two_stream/two_stream.toml (plotting disabled for timing)
Seed: np.random.seed(0)
CPU, enable_x64=true
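A steady-state s/step figure is usually obtained by excluding compilation from the timed region and waiting for asynchronous dispatch to finish. The sketch below shows one way to do that under the settings listed above. The `step` function is a hypothetical stand-in, not the PyPIC3D update, and this is not necessarily how the numbers in this PR were collected.

```python
import time

import jax
import jax.numpy as jnp
import numpy as np

jax.config.update("jax_enable_x64", True)  # mirrors enable_x64=true
np.random.seed(0)  # mirrors the seed in the repro notes


@jax.jit
def step(state):
    """Hypothetical stand-in for one simulation step."""
    return state + 1e-3 * jnp.roll(state, 1)


def steady_state_s_per_step(state, n_steps):
    """Average wall time per step, with JIT compilation excluded."""
    state = step(state).block_until_ready()  # warm-up call triggers compilation
    t0 = time.perf_counter()
    for _ in range(n_steps):
        state = step(state)
    state.block_until_ready()  # wait for queued work before stopping the clock
    return (time.perf_counter() - t0) / n_steps


state0 = jnp.asarray(np.random.rand(100))
s_per_step = steady_state_s_per_step(state0, n_steps=2000)
```

Without the warm-up call and the final `block_until_ready()`, the measurement would mix compile time and dispatch latency into the per-step figure.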

rogeriojorge · 2026-03-14T20:28:45Z

For now, this is too large a change for too little gain. I'm doing a final pass; otherwise I'll close this PR.

rogeriojorge · 2026-03-14T21:06:23Z

Update (commit 5184a38): added opt-in fast modes and more CPU reductions.

New simulation_parameters.fast_mode: off|fp32|aggressive|extreme.
- aggressive: x64 off + shape_factor=1 + J filter none + forces all outputs off + compiles full-run when possible.
- extreme: same as aggressive + forces relativistic=false (explicitly a physics approximation).
compute_energy now uses the numerically-stable relativistic KE formula m c^2 (gamma-1) so fp32 fast modes don’t produce negative kinetic energies.
Fixed 2D stacked-field interpolation broadcasting (multi-component interpolation now expands weights).
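To illustrate why the kinetic-energy formula matters in fp32: evaluating m c^2 (gamma-1) literally subtracts two nearly equal numbers when |v| << c, so the result is dominated by rounding. One common reformulation uses gamma - 1 = gamma^2 beta^2 / (gamma + 1), which avoids the subtraction. The sketch below compares the two in single precision. The exact expression used in compute_energy is not shown in this thread, so the function names and this particular form are assumptions.

```python
import jax.numpy as jnp

C = 299_792_458.0  # speed of light in m/s


def ke_naive(m, v):
    """m c^2 (gamma - 1) evaluated literally; cancels badly when |v| << c."""
    beta2 = jnp.sum(v * v, axis=-1) / C**2
    gamma = 1.0 / jnp.sqrt(1.0 - beta2)
    return m * C**2 * (gamma - 1.0)


def ke_stable(m, v):
    """Same quantity via gamma - 1 = gamma^2 beta^2 / (gamma + 1)."""
    beta2 = jnp.sum(v * v, axis=-1) / C**2
    gamma = 1.0 / jnp.sqrt(1.0 - beta2)
    return m * C**2 * gamma**2 * beta2 / (gamma + 1.0)


# One slow electron (|v| = 1e4 m/s) stored in single precision.
m_e = 9.109e-31
v = jnp.array([[1.0e4, 0.0, 0.0]], dtype=jnp.float32)

naive = float(ke_naive(m_e, v)[0])
stable = float(ke_stable(m_e, v)[0])
classical = 0.5 * m_e * 1.0e4**2  # non-relativistic reference value
```

For this slow particle the stable form should agree closely with the classical value 0.5 m v^2, while the literal form loses essentially all of its significant digits.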

Two-stream long bench (CPU, t_wind=5.413e-7, Nx=100,Ny=1,Nz=1, plotting disabled):

origin/main: 7.787e-4 s/step (from CLI output)
fast_mode=aggressive: 2.434e-4 s/step (~3.2×)
fast_mode=extreme: 2.162e-4 s/step (~3.6×)

I’m still not seeing a physics-preserving path to a true 5×+ on this workload without a much larger refactor (e.g., monolithic particle state like JAX-in-Cell) or additional physics approximations (electrostatic / frozen species / non-relativistic).

CPU fastpath: cheaper stencils, fewer allocations

c85ec0b

Aggressive CPU fast modes: fewer ops, opt-in extreme

5184a38

rogeriojorge wants to merge 2 commits into uwplasma:main from rogeriojorge:codex/cpu-5x-fastpath