Add Additional Examples by smartalecH · Pull Request #22 · facebookresearch/Khronos.jl

smartalecH · 2026-02-09T04:07:31Z

No description provided.

…amples

Add detailed timing to prepare_simulation! (per phase: init_geometry, init_boundaries, add_sources, init_fields, init_monitors) and to run() (total steps, elapsed time, and MVoxels/s both overall and after a 50-step warmup).

Augment all 6 example scripts to run twice with fresh simulation objects: once cold (includes JIT compilation) and once warm, to separate algorithmic cost from compilation overhead. Plotting is done only once, at the end.
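The cold/warm split and the two throughput figures can be sketched in plain Julia. This is a minimal illustration, not the PR's implementation: `timed_run`, `make_stepper`, and the `warmup` keyword are hypothetical names, and the toy stepper stands in for a real simulation object.

```julia
# Hypothetical helper: time `nsteps` calls of `step!` and report throughput
# in MVoxels/s, both overall and excluding the first `warmup` steps.
function timed_run(step!, nvoxels::Integer, nsteps::Integer; warmup::Integer = 50)
    @assert nsteps > warmup "need more steps than the warmup window"
    t_start = time_ns()
    t_after_warmup = t_start
    for n in 1:nsteps
        step!()
        if n == warmup
            t_after_warmup = time_ns()
        end
    end
    t_end = time_ns()
    elapsed = (t_end - t_start) / 1e9
    elapsed_warm = (t_end - t_after_warmup) / 1e9
    return (
        steps = nsteps,
        elapsed = elapsed,
        mvoxels_overall = nvoxels * nsteps / elapsed / 1e6,
        mvoxels_after_warmup = nvoxels * (nsteps - warmup) / elapsed_warm / 1e6,
    )
end

# Toy stand-in for a simulation: each "step" updates every voxel once.
function make_stepper(n)
    field = zeros(Float32, n)
    return () -> (field .+= 1f0; nothing)
end

nvoxels = 64^3
cold = timed_run(make_stepper(nvoxels), nvoxels, 200)  # first call includes JIT compilation
warm = timed_run(make_stepper(nvoxels), nvoxels, 200)  # fresh object, already-compiled code
println("cold: ", round(cold.mvoxels_overall; digits = 1), " MVoxels/s overall")
println("warm: ", round(warm.mvoxels_overall; digits = 1), " MVoxels/s overall")
```

Building a fresh stepper for the second run mirrors the "fresh simulation objects" approach: the warm run reuses compiled code but not state left over from the cold run.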

Two new documents in docs/:

- ROADMAP.md: Three-layer architecture (Frontend, Graph Compiler, Engine Backend) organizing 32+ feature gaps identified by comparing Khronos against Meep, Tidy3D, and fdtdx. Each feature specifies priority tier (P0-P3), implementation guidance, code references, and scope estimates.
- EXAMPLES.md: 49 proposed examples mapped against 33 Meep examples, 199 Tidy3D notebooks, and 5 fdtdx examples. Includes 13 scaling and performance benchmarks (9 buildable now), cross-solver coverage matrix, and phased implementation plan aligned with the roadmap.

run_benchmark was measuring CPU-side kernel launch time instead of actual GPU execution time, because kernel launches return before the GPU has finished the work. This inflated reported MCells/s by up to 120x at large grid sizes. Add KernelAbstractions.synchronize() after warmup and after the measurement loop.
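The fix amounts to draining the device queue before each clock read. A backend-agnostic sketch follows; `measure_mcells` and its `launch!` and `sync` arguments are hypothetical names, not the signature of run_benchmark. On a GPU backend, `sync` would wrap the KernelAbstractions synchronization call named in the commit; here a no-op is passed so the snippet runs on the CPU alone.

```julia
# Hypothetical helper: measure throughput in MCells/s for an asynchronous stepper.
# `launch!` enqueues one time step; `sync` blocks until all enqueued work is done.
function measure_mcells(launch!, sync, ncells::Integer;
                        warmup::Integer = 10, steps::Integer = 100)
    for _ in 1:warmup
        launch!()
    end
    sync()                      # warmup work must finish before the clock starts
    t0 = time_ns()
    for _ in 1:steps
        launch!()
    end
    sync()                      # without this, only launch overhead is timed
    elapsed = (time_ns() - t0) / 1e9
    return ncells * steps / elapsed / 1e6
end

# CPU stand-in: the work is synchronous, so `sync` is a no-op.
cells = zeros(Float64, 128^2)
rate = measure_mcells(() -> (cells .*= 0.5; nothing), () -> nothing, length(cells))
println("throughput: ", round(rate; digits = 1), " MCells/s")
```

Passing `sync` as an argument keeps the timing logic independent of the backend, so the same loop can be checked on the CPU before it is trusted on a GPU.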

- cw_steady_state.jl: CW source steady-state field validation
- fresnel_reflectance.jl: Fresnel reflectance via reflected-field subtraction
- anisotropic_slab.jl: Polarization-dependent Fabry-Perot transmission
- gaussian_beam_waist.jl: Gaussian beam waist measurement
- waveguide_2d_te.jl: 2D TE waveguide confinement
- throughput_vs_size.jl: Single-GPU throughput scaling with grid size
- throughput_vs_complexity.jl: Throughput vs physics complexity
- precision_comparison.jl: Float32 vs Float64 accuracy comparison

- bandwidth_analysis.jl: Actual vs reported throughput (sync bug analysis)
- bandwidth_deepdive.jl: Per-kernel bandwidth, stencil stride, OffsetArray overhead
- composability_test.jl: Isolate dispatch vs OffsetArray vs fusion overhead
- kernel_profiling.jl: Per-kernel timing breakdown with CUDA events
- profile_kernel.jl: Nsight-compatible profiling wrapper

@inline

Removed ::Union{AbstractArray,Nothing} type annotations from step_curl! and update_field! @kernel signatures in Timestep.jl. These annotations prevented GPUCompiler from specializing per call-site concrete types, forcing a single generic kernel with runtime dispatch. Added @inline to all helper functions (generic_curl!, update_field_generic, scale_by_half, get_σ, get_σD, get_m_inv) to ensure full inlining on GPU.

Results on A100 80GB (256³, Float64, PML):

- Full step: 363 → 1747 MCells/s (4.8x)
- Memory bandwidth: 262 → 1258 GB/s (13% → 62% of peak)
- step_curl!: 512 → 1661 GB/s (81% peak)
- update_field! (no source): 148 → 1365 GB/s (67% peak)

Updated ROADMAP.md with performance benchmarks across all examples and a prioritized list of remaining optimization opportunities.
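The pattern this commit describes, unannotated arguments combined with `@inline` helpers, can be illustrated on the CPU in plain Julia. This is a sketch under assumptions: the helper bodies, the convention that `nothing` means "no conductivity array", and `update_cell` are all hypothetical and are not copied from Timestep.jl. Only the helper name `get_σ` comes from the commit message.

```julia
# Hypothetical helpers: an absent conductivity array is passed as `nothing`.
# With one method per concrete type, the choice is made by dispatch at compile time.
@inline get_σ(::Nothing, i) = 0.0
@inline get_σ(σ::AbstractArray, i) = @inbounds σ[i]

# No type annotations on the arguments: each call site compiles its own
# specialization for the concrete types it actually passes.
@inline function update_cell(field, curl, σ, i, dt)
    s = get_σ(σ, i)
    # Illustrative lossy update; reduces to field[i] + dt * curl[i] when s == 0.
    return ((1 - s * dt / 2) * field[i] + dt * curl[i]) / (1 + s * dt / 2)
end

field = [1.0, 2.0]
curl = [0.5, 0.5]
println(update_cell(field, curl, nothing, 1, 0.1))      # path without conductivity
println(update_cell(field, curl, [2.0, 2.0], 1, 0.1))   # path with conductivity
```

Whether a given annotation blocks specialization inside a GPU kernel depends on the compiler stack, so the sketch only shows the shape of the code after the change, not the measured effect.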

github-actions

Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit

JuliaFormatter