Skip to content

Add Additional Examples#22

Open
smartalecH wants to merge 7 commits intomainfrom
examples
Open

Add Additional Examples#22
smartalecH wants to merge 7 commits intomainfrom
examples

Conversation

@smartalecH
Copy link
Copy Markdown
Contributor

No description provided.

…amples

Add detailed timing to prepare_simulation! (per-phase: init_geometry,
init_boundaries, add_sources, init_fields, init_monitors) and to run()
(total steps, elapsed time, MVoxels/s overall and after 50-step warmup).

Augment all 6 example scripts to run twice with fresh simulation objects:
once cold (includes JIT compilation) and once warm, to separate algorithmic
cost from compilation overhead. Plotting is done only once at the end.
Two new documents in docs/:

- ROADMAP.md: Three-layer architecture (Frontend, Graph Compiler, Engine
  Backend) organizing 32+ feature gaps identified by comparing Khronos
  against Meep, Tidy3D, and fdtdx. Each feature specifies priority tier
  (P0-P3), implementation guidance, code references, and scope estimates.

- EXAMPLES.md: 49 proposed examples mapped against 33 Meep examples,
  199 Tidy3D notebooks, and 5 fdtdx examples. Includes 13 scaling and
  performance benchmarks (9 buildable now), cross-solver coverage matrix,
  and phased implementation plan aligned with the roadmap.
run_benchmark was measuring CPU kernel launch time instead of actual
GPU execution time, inflating reported MCells/s by up to 120x at
large grid sizes. Add KernelAbstractions.synchronize() after warmup
and after the measurement loop.
- cw_steady_state.jl: CW source steady-state field validation
- fresnel_reflectance.jl: Fresnel reflectance via reflected-field subtraction
- anisotropic_slab.jl: Polarization-dependent Fabry-Perot transmission
- gaussian_beam_waist.jl: Gaussian beam waist measurement
- waveguide_2d_te.jl: 2D TE waveguide confinement
- throughput_vs_size.jl: Single-GPU throughput scaling with grid size
- throughput_vs_complexity.jl: Throughput vs physics complexity
- precision_comparison.jl: Float32 vs Float64 accuracy comparison
- bandwidth_analysis.jl: Actual vs reported throughput (sync bug analysis)
- bandwidth_deepdive.jl: Per-kernel bandwidth, stencil stride, OffsetArray overhead
- composability_test.jl: Isolate dispatch vs OffsetArray vs fusion overhead
- kernel_profiling.jl: Per-kernel timing breakdown with CUDA events
- profile_kernel.jl: Nsight-compatible profiling wrapper
Removed ::Union{AbstractArray,Nothing} type annotations from step_curl!
and update_field! @kernel signatures in Timestep.jl. These annotations
prevented GPUCompiler from specializing per call-site concrete types,
forcing a single generic kernel with runtime dispatch. Added @inline to
all helper functions (generic_curl!, update_field_generic, scale_by_half,
get_σ, get_σD, get_m_inv) to ensure full inlining on GPU.

Results on A100 80GB (256³, Float64, PML):
- Full step: 363 → 1747 MCells/s (4.8x)
- Memory bandwidth: 262 → 1258 GB/s (13% → 62% of peak)
- step_curl!: 512 → 1661 GB/s (81% peak)
- update_field! (no source): 148 → 1365 GB/s (67% peak)

Updated ROADMAP.md with performance benchmarks across all examples and
a prioritized list of remaining optimization opportunities.
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 9, 2026
Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit

JuliaFormatter

[JuliaFormatter] reported by reviewdog 🐶

results[N] = (mcells_s=mcells_s, reported=reported_rate,
working_set_mb=working_set_mb, ms_per_step=ms_per_step)


[JuliaFormatter] reported by reviewdog 🐶

println("$(N)^3: $(round(mcells_s, digits=0)) MCells/s (actual) vs $(round(reported_rate, digits=0)) MCells/s (run_benchmark) | working set: $(round(working_set_mb, digits=0)) MB | $(round(ms_per_step, digits=2)) ms/step")


[JuliaFormatter] reported by reviewdog 🐶

inflation = round(r.reported / r.mcells_s, digits=1)
println(" $(rpad(N, 6)) $(rpad(round(r.mcells_s, digits=0), 18)) $(rpad(round(r.reported, digits=0), 20)) $(rpad(string(inflation)*"x", 13)) $(round(r.working_set_mb, digits=0)) MB")


[JuliaFormatter] reported by reviewdog 🐶

f = Figure(size=(900, 500))
ax = Axis(f[1, 1],


[JuliaFormatter] reported by reviewdog 🐶

title = "FDTD Throughput: Actual (synced) vs Reported (no sync)",


[JuliaFormatter] reported by reviewdog 🐶

scatterlines!(ax, Float64.(Ns), actual, label="Actual (CUDA events)", color=:blue, markersize=10, linewidth=2)
scatterlines!(ax, Float64.(Ns), reported, label="Reported (run_benchmark)", color=:red, markersize=10, linewidth=2, linestyle=:dash)
axislegend(ax, position=:lt)


[JuliaFormatter] reported by reviewdog 🐶

const Nx = N; const Ny = N; const Nz = N


[JuliaFormatter] reported by reviewdog 🐶

function cuda_time_events(f, n_reps; n_warmup=10)
for _ in 1:n_warmup; f(); end


[JuliaFormatter] reported by reviewdog 🐶

for _ in 1:n_reps; f(); end


[JuliaFormatter] reported by reviewdog 🐶

println("cudaMemcpy (1R+1W): $(round(bw_memcpy, digits=1)) GB/s ($(round(t_memcpy*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

println("Broadcast SAXPY: $(round(bw_bcast, digits=1)) GB/s ($(round(t_bcast*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

println("Broadcast 5-sum: $(round(bw_5sum, digits=1)) GB/s ($(round(t_5sum*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

println("Broadcast 15-array: $(round(bw_15arr, digits=1)) GB/s ($(round(t_15arr*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

t_ka_copy = cuda_time_events(() -> ka_copy_k(B_ka, A_ka, ndrange=n_elements), 200)


[JuliaFormatter] reported by reviewdog 🐶

println("KA Copy (1R+1W): $(round(bw_ka_copy, digits=1)) GB/s ($(round(t_ka_copy*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

t_ka_saxpy = cuda_time_events(() -> ka_saxpy_k(C_ka, A_ka, B_ka, 2.0, ndrange=n_elements), 200)


[JuliaFormatter] reported by reviewdog 🐶

println("KA SAXPY (2R+1W): $(round(bw_ka_saxpy, digits=1)) GB/s ($(round(t_ka_saxpy*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

println("KA copy overhead vs memcpy: $(round((t_ka_copy - t_memcpy) / t_memcpy * 100, digits=1))%")
println("KA copy overhead vs broadcast: $(round((t_ka_copy - t_bcast) / t_bcast * 100, digits=1))% (different op, just timing ref)")


[JuliaFormatter] reported by reviewdog 🐶

t_3d_plain = cuda_time_events(() -> ka_3d_plain(B_3d, A_3d, ndrange=(Nx, Ny, Nz)), 200)


[JuliaFormatter] reported by reviewdog 🐶

println("KA 3D plain (ix+1): $(round(bw_3d_plain, digits=1)) GB/s ($(round(t_3d_plain*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

t_3d_off = cuda_time_events(() -> ka_3d_off(B_off, A_off, ndrange=(Nx, Ny, Nz)), 200)


[JuliaFormatter] reported by reviewdog 🐶

println("KA 3D OffsetArray+CI: $(round(bw_3d_off, digits=1)) GB/s ($(round(t_3d_off*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

println("\nOffsetArray+CI overhead: $(round((t_3d_off - t_3d_plain) / t_3d_plain * 100, digits=1))%")


[JuliaFormatter] reported by reviewdog 🐶

Kx = (src1[ix+1, iy+1, iz+2] - src1[ix+1, iy+1, iz+1]) -
(src2[ix+1, iy+2, iz+1] - src2[ix+1, iy+1, iz+1])


[JuliaFormatter] reported by reviewdog 🐶

t_sx = cuda_time_events(() -> k_sx(dst_3d, src_3d, ndrange=(Nx,Ny,Nz)), 200)
t_sy = cuda_time_events(() -> k_sy(dst_3d, src_3d, ndrange=(Nx,Ny,Nz)), 200)
t_sz = cuda_time_events(() -> k_sz(dst_3d, src_3d, ndrange=(Nx,Ny,Nz)), 200)


[JuliaFormatter] reported by reviewdog 🐶

println("Stencil d/dx (stride-1): $(round(bytes_stencil/t_sx/1e9, digits=1)) GB/s ($(round(t_sx*1e6, digits=1)) μs)")
println("Stencil d/dy (stride-$(Nx+2)): $(round(bytes_stencil/t_sy/1e9, digits=1)) GB/s ($(round(t_sy*1e6, digits=1)) μs)")
println("Stencil d/dz (stride-$((Nx+2)*(Ny+2))): $(round(bytes_stencil/t_sz/1e9, digits=1)) GB/s ($(round(t_sz*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

t_curl_v = cuda_time_events(() -> k_curl(dst_3d, src_3d, src2_3d, ndrange=(Nx,Ny,Nz)), 200)
println("Full curl (4R+1W, 2 arrays): $(round(bytes_curl/t_curl_v/1e9, digits=1)) GB/s ($(round(t_curl_v*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

A[ix+1, iy+1, iz+1] = A[ix+1, iy+1, iz+1] +
(1.0 + sigma) * W[ix+1, iy+1, iz+1] - (1.0 - sigma) * W_old


[JuliaFormatter] reported by reviewdog 🐶

Ax, Ay, Az, Tx, Ty, Tz, Wx, Wy, Wz,
sigma::Float64, m_inv::Float64)


[JuliaFormatter] reported by reviewdog 🐶

Ax[ix+1, iy+1, iz+1] = Ax[ix+1, iy+1, iz+1] +
(1.0 + sigma) * Wx[ix+1, iy+1, iz+1] - (1.0 - sigma) * W_old


[JuliaFormatter] reported by reviewdog 🐶

Ay[ix+1, iy+1, iz+1] = Ay[ix+1, iy+1, iz+1] +
(1.0 + sigma) * Wy[ix+1, iy+1, iz+1] - (1.0 - sigma) * W_old


[JuliaFormatter] reported by reviewdog 🐶

Az[ix+1, iy+1, iz+1] = Az[ix+1, iy+1, iz+1] +
(1.0 + sigma) * Wz[ix+1, iy+1, iz+1] - (1.0 - sigma) * W_old


[JuliaFormatter] reported by reviewdog 🐶

@kernel function ka_curl_3comp!(Tx, Ty, Tz, Ax, Ay, Az, dt::Float64, dx::Float64, dy::Float64, dz::Float64)


[JuliaFormatter] reported by reviewdog 🐶

inv_dx = 1.0 / dx; inv_dy = 1.0 / dy; inv_dz = 1.0 / dz


[JuliaFormatter] reported by reviewdog 🐶

Kx = dt * (inv_dz * (Ay[ix+1, iy+1, iz+2] - Ay[ix+1, iy+1, iz+1]) -
inv_dy * (Az[ix+1, iy+2, iz+1] - Az[ix+1, iy+1, iz+1]))


[JuliaFormatter] reported by reviewdog 🐶

Ky = dt * (inv_dx * (Az[ix+2, iy+1, iz+1] - Az[ix+1, iy+1, iz+1]) -
inv_dz * (Ax[ix+1, iy+1, iz+2] - Ax[ix+1, iy+1, iz+1]))


[JuliaFormatter] reported by reviewdog 🐶

Kz = dt * (inv_dy * (Ax[ix+1, iy+2, iz+1] - Ax[ix+1, iy+1, iz+1]) -
inv_dx * (Ay[ix+2, iy+1, iz+1] - Ay[ix+1, iy+1, iz+1]))


[JuliaFormatter] reported by reviewdog 🐶

test_arrays = [CUDA.rand(Float64, Nx+2, Ny+2, Nz+2) for _ in 1:12]


[JuliaFormatter] reported by reviewdog 🐶

t_upd1 = cuda_time_events(() -> k_upd1(test_arrays[1], test_arrays[2], test_arrays[3],
0.1, 1.0, ndrange=(Nx,Ny,Nz)), 200)
println("KA 1-comp update (3R+2W): $(round(bytes_1c/t_upd1/1e9, digits=1)) GB/s ($(round(t_upd1*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

t_upd3 = cuda_time_events(() -> k_upd3(
test_arrays[1], test_arrays[2], test_arrays[3],
test_arrays[4], test_arrays[5], test_arrays[6],
test_arrays[7], test_arrays[8], test_arrays[9],
0.1, 1.0, ndrange=(Nx,Ny,Nz)), 200)
println("KA 3-comp update (9R+6W): $(round(bytes_3c/t_upd3/1e9, digits=1)) GB/s ($(round(t_upd3*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

t_curl3 = cuda_time_events(() -> k_curl3(
test_arrays[1], test_arrays[2], test_arrays[3],
test_arrays[4], test_arrays[5], test_arrays[6],
0.5, 0.1, 0.1, 0.1, ndrange=(Nx,Ny,Nz)), 200)
println("KA 3-comp curl (9R+3RMW): $(round(bytes_curl3/t_curl3/1e9, digits=1)) GB/s ($(round(t_curl3*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

time_profile = Khronos.ContinuousWaveSource(fcen=1.0),


[JuliaFormatter] reported by reviewdog 🐶

size = [0.0, 0.0, 0.0],


[JuliaFormatter] reported by reviewdog 🐶

cell_size = [cell_val, cell_val, cell_val],


[JuliaFormatter] reported by reviewdog 🐶

resolution = 10,
sources = sources,
boundaries = [[dpml, dpml], [dpml, dpml], [dpml, dpml]],


[JuliaFormatter] reported by reviewdog 🐶

for _ in 1:30; Khronos.step!(sim); end


[JuliaFormatter] reported by reviewdog 🐶

println(" $(rpad("Kernel", 25)) $(rpad("Time (μs)", 12)) $(rpad("BW @240B/vx", 14)) $(rpad("BW @120B/vx", 14))")


[JuliaFormatter] reported by reviewdog 🐶

println(" $(rpad(name, 25)) $(rpad(round(t*1e6, digits=1), 12)) $(bytes_vx==240 ? rpad(string(round(bw, digits=1))*" GB/s", 14) : rpad("-", 14)) $(bytes_vx==120 ? rpad(string(round(bw, digits=1))*" GB/s", 14) : rpad("-", 14))")


[JuliaFormatter] reported by reviewdog 🐶

println(" $(rpad("KA 3-comp curl (no PML)", 30)) $(round(bytes_curl3/t_curl3/1e9, digits=1))")
println(" $(rpad("KA 3-comp update (no PML)", 30)) $(round(bytes_3c/t_upd3/1e9, digits=1))")


[JuliaFormatter] reported by reviewdog 🐶

println(" $(rpad("Khronos update_field!", 30)) $(round(voxels*120/t_upd_H/1e9, digits=1))")


[JuliaFormatter] reported by reviewdog 🐶

("fHx", sim.fields.fHx), ("fHy", sim.fields.fHy), ("fHz", sim.fields.fHz),
("fBx", sim.fields.fBx), ("fBy", sim.fields.fBy), ("fBz", sim.fields.fBz),
("fWBx", sim.fields.fWBx), ("fWBy", sim.fields.fWBy), ("fWBz", sim.fields.fWBz),
("fPBx", sim.fields.fPBx), ("fPBy", sim.fields.fPBy), ("fPBz", sim.fields.fPBz),
("fSBx", sim.fields.fSBx), ("fSBy", sim.fields.fSBy), ("fSBz", sim.fields.fSBz),


[JuliaFormatter] reported by reviewdog 🐶

("μ_inv_x", sim.geometry_data.μ_inv_x), ("μ_inv_y", sim.geometry_data.μ_inv_y), ("μ_inv_z", sim.geometry_data.μ_inv_z),
("σBx", sim.boundary_data.σBx), ("σBy", sim.boundary_data.σBy), ("σBz", sim.boundary_data.σBz),


[JuliaFormatter] reported by reviewdog 🐶

("fEx", sim.fields.fEx), ("fEy", sim.fields.fEy), ("fEz", sim.fields.fEz),
("fDx", sim.fields.fDx), ("fDy", sim.fields.fDy), ("fDz", sim.fields.fDz),
("fWDx", sim.fields.fWDx), ("fWDy", sim.fields.fWDy), ("fWDz", sim.fields.fWDz),
("fPDx", sim.fields.fPDx), ("fPDy", sim.fields.fPDy), ("fPDz", sim.fields.fPDz),
("fSDx", sim.fields.fSDx), ("fSDy", sim.fields.fSDy), ("fSDz", sim.fields.fSDz),


[JuliaFormatter] reported by reviewdog 🐶

("ε_inv_x", sim.geometry_data.ε_inv_x), ("ε_inv_y", sim.geometry_data.ε_inv_y), ("ε_inv_z", sim.geometry_data.ε_inv_z),
("σDx", sim.boundary_data.σDx), ("σDy", sim.boundary_data.σDy), ("σDz", sim.boundary_data.σDz),


[JuliaFormatter] reported by reviewdog 🐶

("fCBx", sim.fields.fCBx), ("fCBy", sim.fields.fCBy), ("fCBz", sim.fields.fCBz),
("fUBx", sim.fields.fUBx), ("fUBy", sim.fields.fUBy), ("fUBz", sim.fields.fUBz),
("geo σBx", sim.geometry_data.σBx), ("geo σBy", sim.geometry_data.σBy), ("geo σBz", sim.geometry_data.σBz),
("bnd σBx", sim.boundary_data.σBx), ("bnd σBy", sim.boundary_data.σBy), ("bnd σBz", sim.boundary_data.σBz),


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶

const Nx = N; const Ny = N; const Nz = N
function cuda_time_events(f, n_reps; n_warmup=10)
for _ in 1:n_warmup; f(); end


[JuliaFormatter] reported by reviewdog 🐶

for _ in 1:n_reps; f(); end


[JuliaFormatter] reported by reviewdog 🐶

Ax::Union{AbstractArray,Nothing}, Ay::Union{AbstractArray,Nothing}, Az::Union{AbstractArray,Nothing},
Tx::Union{AbstractArray,Nothing}, Ty::Union{AbstractArray,Nothing}, Tz::Union{AbstractArray,Nothing},
Wx::Union{AbstractArray,Nothing}, Wy::Union{AbstractArray,Nothing}, Wz::Union{AbstractArray,Nothing},
Px::Union{AbstractArray,Nothing}, Py::Union{AbstractArray,Nothing}, Pz::Union{AbstractArray,Nothing},
Sx::Union{AbstractArray,Nothing}, Sy::Union{AbstractArray,Nothing}, Sz::Union{AbstractArray,Nothing},


[JuliaFormatter] reported by reviewdog 🐶

m_inv_x::Union{AbstractArray,Nothing}, m_inv_y::Union{AbstractArray,Nothing}, m_inv_z::Union{AbstractArray,Nothing},
σx::Union{AbstractArray,Nothing}, σy::Union{AbstractArray,Nothing}, σz::Union{AbstractArray,Nothing},


[JuliaFormatter] reported by reviewdog 🐶

update_field_generic_test(Ax, Tx, Wx, Px, Sx,
get_m_inv_test(m_inv, m_inv_x, idx_array), get_σ_test(σx, ix), idx_array)
update_field_generic_test(Ay, Ty, Wy, Py, Sy,
get_m_inv_test(m_inv, m_inv_y, idx_array), get_σ_test(σy, iy), idx_array)
update_field_generic_test(Az, Tz, Wz, Pz, Sz,
get_m_inv_test(m_inv, m_inv_z, idx_array), get_σ_test(σz, iz), idx_array)


[JuliaFormatter] reported by reviewdog 🐶

Ax, Ay, Az, # fields to update (E or H)
Tx, Ty, Tz, # timestepped fields (D or B)
Wx, Wy, Wz, # auxiliary PML fields


[JuliaFormatter] reported by reviewdog 🐶

σx, σy, σz, # PML conductivity (1D arrays)


[JuliaFormatter] reported by reviewdog 🐶

sx = σx[2*ix-1]; sy = σy[2*iy-1]; sz = σz[2*iz-1]


[JuliaFormatter] reported by reviewdog 🐶

Ax, Ay, Az,
Tx, Ty, Tz,
Wx, Wy, Wz,


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶

sx = σx[2*ix-1]; sy = σy[2*iy-1]; sz = σz[2*iz-1]


[JuliaFormatter] reported by reviewdog 🐶

Ax[ix+1, iy+1, iz+1] = Ax[ix+1, iy+1, iz+1] +
(1.0 + sx) * Wx[ix+1, iy+1, iz+1] - (1.0 - sx) * W_old


[JuliaFormatter] reported by reviewdog 🐶

Ay[ix+1, iy+1, iz+1] = Ay[ix+1, iy+1, iz+1] +
(1.0 + sy) * Wy[ix+1, iy+1, iz+1] - (1.0 - sy) * W_old


[JuliaFormatter] reported by reviewdog 🐶

Az[ix+1, iy+1, iz+1] = Az[ix+1, iy+1, iz+1] +
(1.0 + sz) * Wz[ix+1, iy+1, iz+1] - (1.0 - sz) * W_old


[JuliaFormatter] reported by reviewdog 🐶

off_Ax = make_off(); off_Ay = make_off(); off_Az = make_off()
off_Tx = make_off(); off_Ty = make_off(); off_Tz = make_off()
off_Wx = make_off(); off_Wy = make_off(); off_Wz = make_off()
pln_Ax = make_plain(); pln_Ay = make_plain(); pln_Az = make_plain()
pln_Tx = make_plain(); pln_Ty = make_plain(); pln_Tz = make_plain()
pln_Wx = make_plain(); pln_Wy = make_plain(); pln_Wz = make_plain()


[JuliaFormatter] reported by reviewdog 🐶

t_composed = cuda_time_events(() -> k_composed(
off_Ax, off_Ay, off_Az,
off_Tx, off_Ty, off_Tz,
off_Wx, off_Wy, off_Wz,
nothing, nothing, nothing, # P
nothing, nothing, nothing, # S
1.0, # m_inv (scalar)
nothing, nothing, nothing, # m_inv_x/y/z
σx_off, σy_off, σz_off, # σ
ndrange=(Nx, Ny, Nz)), n_rep)


[JuliaFormatter] reported by reviewdog 🐶

println("Composed (Khronos-style, OffsetArray+Nothing): $(round(bw_composed, digits=1)) GB/s ($(round(t_composed*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

t_flat = cuda_time_events(() -> k_flat(
off_Ax, off_Ay, off_Az,
off_Tx, off_Ty, off_Tz,
off_Wx, off_Wy, off_Wz,
1.0, σx_1d, σy_1d, σz_1d,
ndrange=(Nx, Ny, Nz)), n_rep)


[JuliaFormatter] reported by reviewdog 🐶

println("Flattened (OffsetArray, no dispatch): $(round(bw_flat, digits=1)) GB/s ($(round(t_flat*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

t_bare = cuda_time_events(() -> k_bare(
pln_Ax, pln_Ay, pln_Az,
pln_Tx, pln_Ty, pln_Tz,
pln_Wx, pln_Wy, pln_Wz,
1.0, σx_1d, σy_1d, σz_1d,
ndrange=(Nx, Ny, Nz)), n_rep)


[JuliaFormatter] reported by reviewdog 🐶

println("Bare (plain CuArray, ix+1 indexing): $(round(bw_bare, digits=1)) GB/s ($(round(t_bare*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

cell_size = [cell_val, cell_val, cell_val],


[JuliaFormatter] reported by reviewdog 🐶

resolution = 10,
sources = [Khronos.UniformSource(
time_profile = Khronos.ContinuousWaveSource(fcen=1.0),
component = Khronos.Ez(),
center = [0.0, 0.0, 0.0], size = [0.0, 0.0, 0.0])],
boundaries = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],


[JuliaFormatter] reported by reviewdog 🐶

for _ in 1:30; Khronos.step!(sim); end


[JuliaFormatter] reported by reviewdog 🐶

println("Actual Khronos update_H_from_B!: $(round(bw_khronos, digits=1)) GB/s ($(round(t_khronos*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

println(" Bare → Flattened+OffsetArray: $(round((t_flat/t_bare - 1)*100, digits=1))% (OffsetArray + CartesianIndex cost)")
println(" Flattened → Composed: $(round((t_composed/t_flat - 1)*100, digits=1))% (nested dispatch + Nothing args cost)")
println(" Composed → Khronos: $(round((t_khronos/t_composed - 1)*100, digits=1))% (remaining Khronos-specific overhead)")
println(" Total: Bare → Khronos: $(round((t_khronos/t_bare - 1)*100, digits=1))% ($(round(t_khronos/t_bare, digits=2))x slowdown)")


[JuliaFormatter] reported by reviewdog 🐶

typeof(off_Ax), typeof(off_Ay), typeof(off_Az),
typeof(off_Tx), typeof(off_Ty), typeof(off_Tz),
typeof(off_Wx), typeof(off_Wy), typeof(off_Wz),
Nothing, Nothing, Nothing,
Nothing, Nothing, Nothing,


[JuliaFormatter] reported by reviewdog 🐶

Nothing, Nothing, Nothing,
typeof(σx_off), typeof(σy_off), typeof(σz_off),


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶

dt::Float64, dx::Float64, dy::Float64, dz::Float64, m_inv::Float64,


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶

inv_dx = 1.0 / dx; inv_dy = 1.0 / dy; inv_dz = 1.0 / dz


[JuliaFormatter] reported by reviewdog 🐶

Kx = dt * (inv_dz * (Ey[ix+1, iy+1, iz+2] - Ey[ix+1, iy+1, iz+1]) -
inv_dy * (Ez[ix+1, iy+2, iz+1] - Ez[ix+1, iy+1, iz+1]))


[JuliaFormatter] reported by reviewdog 🐶

Hx[ix+1, iy+1, iz+1] = Hx[ix+1, iy+1, iz+1] +
(1.0 + sx) * Wx[ix+1, iy+1, iz+1] - (1.0 - sx) * W_old_x


[JuliaFormatter] reported by reviewdog 🐶

Ky = dt * (inv_dx * (Ez[ix+2, iy+1, iz+1] - Ez[ix+1, iy+1, iz+1]) -
inv_dz * (Ex[ix+1, iy+1, iz+2] - Ex[ix+1, iy+1, iz+1]))


[JuliaFormatter] reported by reviewdog 🐶

Hy[ix+1, iy+1, iz+1] = Hy[ix+1, iy+1, iz+1] +
(1.0 + sy) * Wy[ix+1, iy+1, iz+1] - (1.0 - sy) * W_old_y


[JuliaFormatter] reported by reviewdog 🐶

Kz = dt * (inv_dy * (Ex[ix+1, iy+2, iz+1] - Ex[ix+1, iy+1, iz+1]) -
inv_dx * (Ey[ix+2, iy+1, iz+1] - Ey[ix+1, iy+1, iz+1]))


[JuliaFormatter] reported by reviewdog 🐶

Hz[ix+1, iy+1, iz+1] = Hz[ix+1, iy+1, iz+1] +
(1.0 + sz) * Wz[ix+1, iy+1, iz+1] - (1.0 - sz) * W_old_z


[JuliaFormatter] reported by reviewdog 🐶

fH = [make_plain() for _ in 1:3]
fE = [make_plain() for _ in 1:3]
fB = [make_plain() for _ in 1:3]
fW = [make_plain() for _ in 1:3]
fU = [make_plain() for _ in 1:3]


[JuliaFormatter] reported by reviewdog 🐶

t_fused = cuda_time_events(() -> k_fused(
fH[1], fH[2], fH[3],
fE[1], fE[2], fE[3],
fB[1], fB[2], fB[3],
fW[1], fW[2], fW[3],
fU[1], fU[2], fU[3],
0.5, 0.1, 0.1, 0.1, 1.0,
σx_1d, σy_1d, σz_1d,
ndrange=(Nx, Ny, Nz)), n_rep)


[JuliaFormatter] reported by reviewdog 🐶

@kernel function unfused_curl!(Bx, By, Bz, Ex, Ey, Ez, Ux, Uy, Uz,
dt::Float64, dx::Float64, dy::Float64, dz::Float64, σx, σy, σz)


[JuliaFormatter] reported by reviewdog 🐶

inv_dx = 1.0/dx; inv_dy = 1.0/dy; inv_dz = 1.0/dz
Kx = dt * (inv_dz * (Ey[ix+1,iy+1,iz+2] - Ey[ix+1,iy+1,iz+1]) -
inv_dy * (Ez[ix+1,iy+2,iz+1] - Ez[ix+1,iy+1,iz+1]))
U_old = Ux[ix+1,iy+1,iz+1]
Ux[ix+1,iy+1,iz+1] = Ux[ix+1,iy+1,iz+1] + Kx
Bx[ix+1,iy+1,iz+1] = Bx[ix+1,iy+1,iz+1] + Ux[ix+1,iy+1,iz+1] - U_old
Ky = dt * (inv_dx * (Ez[ix+2,iy+1,iz+1] - Ez[ix+1,iy+1,iz+1]) -
inv_dz * (Ex[ix+1,iy+1,iz+2] - Ex[ix+1,iy+1,iz+1]))
U_old = Uy[ix+1,iy+1,iz+1]
Uy[ix+1,iy+1,iz+1] = Uy[ix+1,iy+1,iz+1] + Ky
By[ix+1,iy+1,iz+1] = By[ix+1,iy+1,iz+1] + Uy[ix+1,iy+1,iz+1] - U_old
Kz = dt * (inv_dy * (Ex[ix+1,iy+2,iz+1] - Ex[ix+1,iy+1,iz+1]) -
inv_dx * (Ey[ix+2,iy+1,iz+1] - Ey[ix+1,iy+1,iz+1]))
U_old = Uz[ix+1,iy+1,iz+1]
Uz[ix+1,iy+1,iz+1] = Uz[ix+1,iy+1,iz+1] + Kz
Bz[ix+1,iy+1,iz+1] = Bz[ix+1,iy+1,iz+1] + Uz[ix+1,iy+1,iz+1] - U_old


[JuliaFormatter] reported by reviewdog 🐶

@kernel function unfused_update!(Hx, Hy, Hz, Bx, By, Bz, Wx, Wy, Wz,
m_inv::Float64, σx, σy, σz)


[JuliaFormatter] reported by reviewdog 🐶

sx = σx[2*ix-1]; sy = σy[2*iy-1]; sz = σz[2*iz-1]


[JuliaFormatter] reported by reviewdog 🐶

W_old = Wx[ix+1,iy+1,iz+1]
Wx[ix+1,iy+1,iz+1] = m_inv * Bx[ix+1,iy+1,iz+1]
Hx[ix+1,iy+1,iz+1] = Hx[ix+1,iy+1,iz+1] + (1+sx)*Wx[ix+1,iy+1,iz+1] - (1-sx)*W_old


[JuliaFormatter] reported by reviewdog 🐶

W_old = Wy[ix+1,iy+1,iz+1]
Wy[ix+1,iy+1,iz+1] = m_inv * By[ix+1,iy+1,iz+1]
Hy[ix+1,iy+1,iz+1] = Hy[ix+1,iy+1,iz+1] + (1+sy)*Wy[ix+1,iy+1,iz+1] - (1-sy)*W_old


[JuliaFormatter] reported by reviewdog 🐶

W_old = Wz[ix+1,iy+1,iz+1]
Wz[ix+1,iy+1,iz+1] = m_inv * Bz[ix+1,iy+1,iz+1]
Hz[ix+1,iy+1,iz+1] = Hz[ix+1,iy+1,iz+1] + (1+sz)*Wz[ix+1,iy+1,iz+1] - (1-sz)*W_old


[JuliaFormatter] reported by reviewdog 🐶

k_ucurl(fB[1], fB[2], fB[3], fE[1], fE[2], fE[3],
fU[1], fU[2], fU[3], 0.5, 0.1, 0.1, 0.1, σx_1d, σy_1d, σz_1d,
ndrange=(Nx, Ny, Nz))
k_uupd(fH[1], fH[2], fH[3], fB[1], fB[2], fB[3],
fW[1], fW[2], fW[3], 1.0, σx_1d, σy_1d, σz_1d,
ndrange=(Nx, Ny, Nz))


[JuliaFormatter] reported by reviewdog 🐶

println("Unfused (2 kernels, plain arrays): $(round(bw_unfused, digits=1)) GB/s ($(round(t_unfused*1e6, digits=1)) μs)")
println("Fused (1 kernel, plain arrays): $(round(bw_fused, digits=1)) GB/s ($(round(t_fused*1e6, digits=1)) μs)")


[JuliaFormatter] reported by reviewdog 🐶

println(" $(rpad(name, 50)) $(rpad(round(bw, digits=0), 8)) $(rpad(round(t*1e6, digits=1), 10)) $(round(t/t_ref, digits=2))x")


[JuliaFormatter] reported by reviewdog 🐶

resolution = 15 # pixels/μm (increase for accuracy)
cell_xyz = 14.0 # domain size (μm)
dpml = 1.0 # PML thickness (μm)
fcen = 1.0 # source frequency (1/μm → λ = 1 μm)
t_end = 25.0 # simulation time (enough CW cycles for steady state)
n_probes = 8 # number of radial probe points


[JuliaFormatter] reported by reviewdog 🐶

function main(; resolution=resolution, cell_xyz=cell_xyz, dpml=dpml,
fcen=fcen, t_end=t_end, n_probes=n_probes)


[JuliaFormatter] reported by reviewdog 🐶

r_probes = collect(range(r_min, r_max, length=n_probes))


[JuliaFormatter] reported by reviewdog 🐶

time_profile = Khronos.ContinuousWaveSource(fcen=fcen),


[JuliaFormatter] reported by reviewdog 🐶

size = [0.0, 0.0, 0.0],


[JuliaFormatter] reported by reviewdog 🐶

size = [0.0, 0.0, 0.0],


[JuliaFormatter] reported by reviewdog 🐶

)
for r in r_probes


[JuliaFormatter] reported by reviewdog 🐶

cell_size = [cell_xyz, cell_xyz, cell_xyz],


[JuliaFormatter] reported by reviewdog 🐶

resolution = resolution,
sources = sources,
boundaries = [[dpml, dpml], [dpml, dpml], [dpml, dpml]],
monitors = monitors,


[JuliaFormatter] reported by reviewdog 🐶

Khronos.run(sim, until=t_end)


[JuliaFormatter] reported by reviewdog 🐶

for i in 1:n_probes


[JuliaFormatter] reported by reviewdog 🐶

println(" Probes: $n_probes points, r ∈ [$(r_probes[1]), $(round(r_probes[end],digits=1))] μm")


[JuliaFormatter] reported by reviewdog 🐶

for i in 1:n_probes
println(" $(rpad(round(r_probes[i],digits=2), 11))$(rpad(round(amplitudes[i],digits=6), 13))$(rpad(round(predicted[i],digits=6), 13))$(round(rel_errors[i],digits=4))")


[JuliaFormatter] reported by reviewdog 🐶

println(" Status: $(mean_rel_err < 0.15 ? "PASS" : "FAIL") (threshold=0.15)")


[JuliaFormatter] reported by reviewdog 🐶

f = Figure(size=(900, 400))


[JuliaFormatter] reported by reviewdog 🐶

xs = range(-cell_xyz/2, cell_xyz/2, length=size(ez_slice,1))
ys = range(-cell_xyz/2, cell_xyz/2, length=size(ez_slice,2))


[JuliaFormatter] reported by reviewdog 🐶

ax1 = Axis(f[1, 1], xlabel="x (μm)", ylabel="y (μm)",
title="CW Dipole Ez (XY plane)", aspect=DataAspect())
heatmap!(ax1, collect(xs), collect(ys), ez_slice,
colormap=:bluesreds, colorrange=(-vmax, vmax))


[JuliaFormatter] reported by reviewdog 🐶

ax2 = Axis(f[1, 2], xlabel="r (μm)", ylabel="|Ez| (DFT amplitude)",
title="Amplitude Decay: |Ez| vs A/r", yscale=log10, xscale=log10)
scatter!(ax2, r_probes, amplitudes, label="Khronos DFT", color=:blue, markersize=8)
r_dense = range(r_probes[1], r_probes[end], length=100)
lines!(ax2, collect(r_dense), A_fit ./ collect(r_dense),
label="A/r fit", color=:red, linewidth=2)
axislegend(ax2, position=:rt)


[JuliaFormatter] reported by reviewdog 🐶

n_slab = 3.5 # refractive index of dielectric half-space
wvl_min = 0.9 # shortest wavelength (μm)
wvl_max = 1.6 # longest wavelength (μm)
nfreq = 50 # number of frequency points
dpml = 1.0 # PML thickness (μm)


[JuliaFormatter] reported by reviewdog 🐶

function main(; resolution=resolution, n_slab=n_slab, wvl_min=wvl_min,
wvl_max=wvl_max, nfreq=nfreq, dpml=dpml)


[JuliaFormatter] reported by reviewdog 🐶

λ = range(wvl_min, wvl_max, length=nfreq)


[JuliaFormatter] reported by reviewdog 🐶

sz = 2 * dpml + 8.0


[JuliaFormatter] reported by reviewdog 🐶

function build_sim(; with_slab=false)


[JuliaFormatter] reported by reviewdog 🐶

time_profile = Khronos.GaussianPulseSource(fcen=fcen, fwidth=fwidth),


[JuliaFormatter] reported by reviewdog 🐶

size = [Inf, Inf, 0.0],


[JuliaFormatter] reported by reviewdog 🐶

size = [0.4, 0.4, 0.0],


[JuliaFormatter] reported by reviewdog 🐶

cell_size = [sxy, sxy, sz],


[JuliaFormatter] reported by reviewdog 🐶

resolution = resolution,
geometry = geometry,
sources = sources,
boundaries = [[dpml, dpml], [dpml, dpml], [dpml, dpml]],
monitors = [refl_monitor],


[JuliaFormatter] reported by reviewdog 🐶

sim_norm = build_sim(with_slab=false)
Khronos.run(sim_norm,


[JuliaFormatter] reported by reviewdog 🐶

tolerance=1e-11, minimum_runtime=20.0, maximum_runtime=500.0))


[JuliaFormatter] reported by reviewdog 🐶

sim_slab = build_sim(with_slab=true)
Khronos.run(sim_slab,


[JuliaFormatter] reported by reviewdog 🐶

tolerance=1e-11, minimum_runtime=20.0, maximum_runtime=500.0))


[JuliaFormatter] reported by reviewdog 🐶

P_inc = sum(abs.(inc_fields).^2, dims=[1,2])[1,1,1,:]
P_refl = sum(abs.(refl_fields).^2, dims=[1,2])[1,1,1,:]
R_all = P_refl ./ P_inc


[JuliaFormatter] reported by reviewdog 🐶

rms_err = sqrt(sum((R_khronos .- R_analytic).^2) / length(R_khronos))


[JuliaFormatter] reported by reviewdog 🐶

f = Figure(size=(900, 400))
ax1 = Axis(f[1, 1], xlabel="Wavelength (μm)", ylabel="Reflectance",
title="Fresnel Reflectance: air → n=$n_slab")
lines!(ax1, λ_good, R_analytic, label="Analytic R=|(n-1)/(n+1)|²",
color=:red, linewidth=2)
scatterlines!(ax1, λ_good, R_khronos, label="Khronos",
color=:blue, markersize=4)
axislegend(ax1, position=:rt)
ax2 = Axis(f[1, 2], xlabel="Wavelength (μm)", ylabel="|R_khronos - R_analytic|",
title="Absolute Error", yscale=log10)
scatterlines!(ax2, λ_good, abs.(R_khronos .- R_analytic) .+ 1e-16,
color=:black, markersize=3)


[JuliaFormatter] reported by reviewdog 🐶

resolution = 20 # pixels/μm
λ = 1.0 # wavelength (μm)


[JuliaFormatter] reported by reviewdog 🐶

dpml = 1.0 # PML thickness
n_monitors = 7 # number of z-planes to measure waist


[JuliaFormatter] reported by reviewdog 🐶

function main(; resolution=resolution, λ=λ, waist_radius=waist_radius,
dpml=dpml, n_monitors=n_monitors)


[JuliaFormatter] reported by reviewdog 🐶

sz = 2*dpml + 2*z_R + 4.0


[JuliaFormatter] reported by reviewdog 🐶

z_monitor = range(-z_R, z_R, length=n_monitors)


[JuliaFormatter] reported by reviewdog 🐶

time_profile = Khronos.GaussianPulseSource(fcen=fcen, fwidth=0.15*fcen),
center = [0.0, 0.0, src_z],
size = [sxy, sxy, 0.0],


[JuliaFormatter] reported by reviewdog 🐶

k_vector = [0.0, 0.0, 1.0],


[JuliaFormatter] reported by reviewdog 🐶

size = [monitor_xy_size, 0.0, 0.0], # 1D line along x


[JuliaFormatter] reported by reviewdog 🐶

)
for z in z_monitor


[JuliaFormatter] reported by reviewdog 🐶

cell_size = [sxy, sxy, sz],


[JuliaFormatter] reported by reviewdog 🐶

resolution = resolution,
sources = sources,
boundaries = [[dpml, dpml], [dpml, dpml], [dpml, dpml]],
monitors = monitors,


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶

tolerance=1e-7, minimum_runtime=0.0, maximum_runtime=300.0))


[JuliaFormatter] reported by reviewdog 🐶

intensity = abs.(dft_data).^2


[JuliaFormatter] reported by reviewdog 🐶

xs = range(-monitor_xy_size/2, monitor_xy_size/2, length=nx)


[JuliaFormatter] reported by reviewdog 🐶

x_var = sum((xs .- x_mean).^2 .* intensity) / total


[JuliaFormatter] reported by reviewdog 🐶

println(" $(round(z, digits=2))\t\t$(round(measured_waists[i], digits=3))\t\t$(round(w_analytic[i], digits=3))\t\t$(round(rel_errors[i], digits=3))")


[JuliaFormatter] reported by reviewdog 🐶

f = Figure(size=(900, 400))
ax1 = Axis(f[1, 1], xlabel="z (μm)", ylabel="Beam waist w(z) (μm)",
title="Gaussian Beam Waist: w₀=$(waist_radius) μm")
z_dense = range(-z_R, z_R, length=200)
w_dense = waist_radius .* sqrt.(1 .+ (collect(z_dense) ./ z_R).^2)
lines!(ax1, collect(z_dense), w_dense, label="Analytic w(z)=w₀√(1+(z/z_R)²)",
color=:red, linewidth=2)
scatter!(ax1, collect(z_monitor)[valid], measured_waists[valid],
label="Khronos (second moment)", color=:blue, markersize=8)
axislegend(ax1, position=:ct)


[JuliaFormatter] reported by reviewdog 🐶

intensity_mid = abs.(dft_mid).^2


[JuliaFormatter] reported by reviewdog 🐶

xs = range(-monitor_xy_size/2, monitor_xy_size/2, length=nx)


[JuliaFormatter] reported by reviewdog 🐶

ax2 = Axis(f[1, 2], xlabel="x (μm)", ylabel="|Ex|² (a.u.)",
title="Beam profile at z=$(round(collect(z_monitor)[mid_idx], digits=1)) μm")
lines!(ax2, collect(xs), intensity_mid ./ maximum(intensity_mid), color=:blue,
label="Khronos", linewidth=1.5)


[JuliaFormatter] reported by reviewdog 🐶

gauss_fit = exp.(-2 .* collect(xs).^2 ./ w_at_z^2)
lines!(ax2, collect(xs), gauss_fit, color=:red, linestyle=:dash,
label="Gaussian (w=$(round(w_at_z, digits=2)))", linewidth=1.5)
axislegend(ax2, position=:rt)


[JuliaFormatter] reported by reviewdog 🐶

time_profile = Khronos.ContinuousWaveSource(fcen=1.0),


[JuliaFormatter] reported by reviewdog 🐶

size = [0.0, 0.0, 0.0],


[JuliaFormatter] reported by reviewdog 🐶

cell_size = [cell_val, cell_val, cell_val],


[JuliaFormatter] reported by reviewdog 🐶

resolution = 10,
sources = sources,
boundaries = [[dpml, dpml], [dpml, dpml], [dpml, dpml]],


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶

for _ in 1:n_reps


[JuliaFormatter] reported by reviewdog 🐶

t_inc = @elapsed for _ in 1:n_measure; Khronos.increment_timestep!(sim); end


[JuliaFormatter] reported by reviewdog 🐶

("step_B_from_E!", t_step_B),
("update_H_from_B!", t_update_H),
("update_H_monitors!", t_H_mon),


[JuliaFormatter] reported by reviewdog 🐶

("step_D_from_H!", t_step_D),
("update_E_from_D!", t_update_E),
("update_E_monitors!", t_E_mon),
("increment_timestep!", t_inc),


[JuliaFormatter] reported by reviewdog 🐶

println(" $(rpad(name, 30)) $(rpad(round(t * 1e6, digits=1), 12)) $(rpad(round(pct, digits=1), 10))%")


[JuliaFormatter] reported by reviewdog 🐶

println(" $(rpad("Sum of parts", 30)) $(rpad(round(total_parts * 1e6, digits=1), 12)) $(round(total_parts / t_full_step * 100, digits=1))%")
println(" $(rpad("Full step! (measured)", 30)) $(rpad(round(t_full_step * 1e6, digits=1), 12)) 100.0%")


[JuliaFormatter] reported by reviewdog 🐶

println("\n Unaccounted overhead: $(round(overhead_pct, digits=1))% (Julia dispatch, kernel object creation, etc.)")


[JuliaFormatter] reported by reviewdog 🐶

println(" update_field!: $(bytes_update_field) bytes/voxel × 2 = $(2 * bytes_update_field)")


[JuliaFormatter] reported by reviewdog 🐶

println(" $(rpad("step_B_from_E!", 25)) $(rpad(round(bw_step_B, digits=1), 10)) $(round(bw_step_B/peak_bw*100, digits=1))%")
println(" $(rpad("update_H_from_B!", 25)) $(rpad(round(bw_update_H, digits=1), 10)) $(round(bw_update_H/peak_bw*100, digits=1))%")
println(" $(rpad("step_D_from_H!", 25)) $(rpad(round(bw_step_D, digits=1), 10)) $(round(bw_step_D/peak_bw*100, digits=1))%")
println(" $(rpad("update_E_from_D!", 25)) $(rpad(round(bw_update_E, digits=1), 10)) $(round(bw_update_E/peak_bw*100, digits=1))%")


[JuliaFormatter] reported by reviewdog 🐶

println(" $(rpad("Full step (aggregate)", 25)) $(rpad(round(bw_total, digits=1), 10)) $(round(bw_total/peak_bw*100, digits=1))%")


[JuliaFormatter] reported by reviewdog 🐶

t_empty = cuda_time(() -> empty_k(ndrange=(N, N, N)), n_measure)


[JuliaFormatter] reported by reviewdog 🐶

println("Est. launch overhead per step: $(round(n_total_kernels * t_empty * 1e6, digits=1)) μs ($(round(n_total_kernels * t_empty / t_full_step * 100, digits=1))%)")


[JuliaFormatter] reported by reviewdog 🐶

println(" $(rpad("Workgroup", 12)) $(rpad("Time (μs)", 12)) $(rpad("Rel. perf", 10))")


[JuliaFormatter] reported by reviewdog 🐶

println(" $(rpad("default", 12)) $(rpad(round(t_default * 1e6, digits=1), 12)) 1.00x")


[JuliaFormatter] reported by reviewdog 🐶

kern = Khronos.step_curl!(CUDABackend(), workgroupsize=ws)
t_ws = cuda_time(() -> kern(
sim.fields.fEx, sim.fields.fEy, sim.fields.fEz,
sim.fields.fBx, sim.fields.fBy, sim.fields.fBz,
sim.fields.fCBx, sim.fields.fCBy, sim.fields.fCBz,
sim.fields.fUBx, sim.fields.fUBy, sim.fields.fUBz,
sim.geometry_data.σBx, sim.geometry_data.σBy, sim.geometry_data.σBz,
sim.boundary_data.σBx, sim.boundary_data.σBy, sim.boundary_data.σBz,
sim.Δt, sim.Δx, sim.Δy, sim.Δz, 1,
ndrange = (sim.Nx, sim.Ny, sim.Nz),
), 50)


[JuliaFormatter] reported by reviewdog 🐶

println(" $(rpad(ws, 12)) $(rpad(round(t_ws * 1e6, digits=1), 12)) $(round(speedup, digits=2))x")


[JuliaFormatter] reported by reviewdog 🐶

println(" $(rpad("Workgroup", 20)) $(rpad("Time (μs)", 12)) $(rpad("Rel. perf", 10))")


[JuliaFormatter] reported by reviewdog 🐶

for (wx, wy, wz) in [(8,8,4), (8,8,8), (4,4,16), (16,16,1), (32,8,1), (8,4,8), (4,8,8)]


[JuliaFormatter] reported by reviewdog 🐶

kern = Khronos.step_curl!(CUDABackend(), workgroupsize=(wx, wy, wz))
t_ws = cuda_time(() -> kern(
sim.fields.fEx, sim.fields.fEy, sim.fields.fEz,
sim.fields.fBx, sim.fields.fBy, sim.fields.fBz,
sim.fields.fCBx, sim.fields.fCBy, sim.fields.fCBz,
sim.fields.fUBx, sim.fields.fUBy, sim.fields.fUBz,
sim.geometry_data.σBx, sim.geometry_data.σBy, sim.geometry_data.σBz,
sim.boundary_data.σBx, sim.boundary_data.σBy, sim.boundary_data.σBz,
sim.Δt, sim.Δx, sim.Δy, sim.Δz, 1,
ndrange = (sim.Nx, sim.Ny, sim.Nz),
), 50)


[JuliaFormatter] reported by reviewdog 🐶

println(" $(rpad("($wx,$wy,$wz)", 20)) $(rpad(round(t_ws * 1e6, digits=1), 12)) $(round(speedup, digits=2))x")


[JuliaFormatter] reported by reviewdog 🐶

println("Single array size: $(round(array_size_bytes / 1024^2, digits=1)) MB (($(N+2))^3 × 8 bytes)")


[JuliaFormatter] reported by reviewdog 🐶

println("step_curl! touches ~15 3D arrays = $(round(15 * array_size_bytes / 1024^2, digits=0)) MB")


[JuliaFormatter] reported by reviewdog 🐶

println("update_field! touches ~9 3D arrays = $(round(9 * array_size_bytes / 1024^2, digits=0)) MB")


[JuliaFormatter] reported by reviewdog 🐶

d_slab = 0.5 # slab thickness (μm)
n_slab = 3.5 # refractive index
nfreq = 60 # frequency points
n_steps = 100 # benchmark steps for throughput measurement


[JuliaFormatter] reported by reviewdog 🐶

t = ((1 + rho) * (1 - rho) .* exp.(-2im * π * n * d ./ λ)) ./


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶

function run_precision_test(precision::DataType; resolution=resolution,
d_slab=d_slab, n_slab=n_slab, nfreq=nfreq)


[JuliaFormatter] reported by reviewdog 🐶

λ = range(1.0, 2.0, length=nfreq)


[JuliaFormatter] reported by reviewdog 🐶

sz = 2*dpml + 4*0.5


[JuliaFormatter] reported by reviewdog 🐶

function build_sim(; with_slab=false)


[JuliaFormatter] reported by reviewdog 🐶

time_profile = Khronos.GaussianPulseSource(fcen=1/1.5, fwidth=0.4),


[JuliaFormatter] reported by reviewdog 🐶

size = [Inf, Inf, 0.0],


[JuliaFormatter] reported by reviewdog 🐶

geometry = with_slab ?
[Khronos.Object(Cuboid([0,0,0], [sxy, sxy, d_slab]),
Khronos.Material=n_slab^2))] : []


[JuliaFormatter] reported by reviewdog 🐶

size = [0.4, 0.4, 0.0],


[JuliaFormatter] reported by reviewdog 🐶

cell_size = [sxy, sxy, sz],


[JuliaFormatter] reported by reviewdog 🐶

resolution = resolution,
geometry = geometry,
sources = sources,
boundaries = [[dpml, dpml], [dpml, dpml], [dpml, dpml]],
monitors = monitors,


[JuliaFormatter] reported by reviewdog 🐶

sum(abs.(sim.monitors[1].monitor_data.fields).^2, dims=[1,2])[1,1,1,:]


[JuliaFormatter] reported by reviewdog 🐶

sim_norm = build_sim(with_slab=false)
Khronos.run(sim_norm,


[JuliaFormatter] reported by reviewdog 🐶

tolerance=1e-9, minimum_runtime=0.0, maximum_runtime=300.0))


[JuliaFormatter] reported by reviewdog 🐶

sim_slab = build_sim(with_slab=true)
Khronos.run(sim_slab,


[JuliaFormatter] reported by reviewdog 🐶

tolerance=1e-9, minimum_runtime=0.0, maximum_runtime=300.0))


[JuliaFormatter] reported by reviewdog 🐶

sim_bench = build_sim(with_slab=true)


[JuliaFormatter] reported by reviewdog 🐶

λ = collect(range(1.0, 2.0, length=nfreq))


[JuliaFormatter] reported by reviewdog 🐶

println(" Float64 $(rpad(round(rate_f64,digits=1), 12))$(rpad(round(err_f64,digits=6), 17))1.00×")
println(" Float32 $(rpad(round(rate_f32,digits=1), 12))$(rpad(round(err_f32,digits=6), 17))$(round(rate_f32/rate_f64, digits=2))×")


[JuliaFormatter] reported by reviewdog 🐶

f = Figure(size=(900, 400))
ax1 = Axis(f[1, 1], xlabel="Wavelength (μm)", ylabel="Transmission",
title="Slab Transmission: Float32 vs Float64")
lines!(ax1, λ, T_analytic, label="Analytic", color=:black, linewidth=2)
scatterlines!(ax1, λ, T_f64, label="Float64", color=:blue, markersize=4)
scatterlines!(ax1, λ, T_f32, label="Float32", color=:red, markersize=4)
axislegend(ax1, position=:rb)
ax2 = Axis(f[1, 2], xlabel="Precision", ylabel="Value",
title="Performance Comparison",
xticks=([1, 2], ["Float64", "Float32"]))
barplot!(ax2, [1, 2], [rate_f64, rate_f32], color=[:blue, :red],
bar_labels=:y, label_formatter=x -> "$(round(x, digits=0)) MC/s")


[JuliaFormatter] reported by reviewdog 🐶

time_profile = Khronos.ContinuousWaveSource(fcen=1.0),


[JuliaFormatter] reported by reviewdog 🐶

size = [0.0, 0.0, 0.0],


[JuliaFormatter] reported by reviewdog 🐶

cell_size = [cell_val, cell_val, cell_val],


[JuliaFormatter] reported by reviewdog 🐶

resolution = 10,
sources = sources,
boundaries = [[dpml, dpml], [dpml, dpml], [dpml, dpml]],


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶

N = 128 # grid dimension (N³ voxels) — increase for larger GPUs
res = 10 # resolution (pixels/μm)
n_steps = 100 # timesteps per measurement
dpml = 1.0 # PML thickness


[JuliaFormatter] reported by reviewdog 🐶

function main(; N=N, res=res, n_steps=n_steps, dpml=dpml)


[JuliaFormatter] reported by reviewdog 🐶

time_profile = Khronos.ContinuousWaveSource(fcen=1.0),


[JuliaFormatter] reported by reviewdog 🐶

size = [0.0, 0.0, 0.0],


[JuliaFormatter] reported by reviewdog 🐶

configs = OrderedDict{String, NamedTuple}()


[JuliaFormatter] reported by reviewdog 🐶

sources = [base_source],


[JuliaFormatter] reported by reviewdog 🐶

geometry = nothing,
monitors = nothing,


[JuliaFormatter] reported by reviewdog 🐶

sources = [base_source],


[JuliaFormatter] reported by reviewdog 🐶

geometry = nothing,
monitors = nothing,


[JuliaFormatter] reported by reviewdog 🐶

sources = [base_source],


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶

Khronos.Material=3.5),


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶

mats = [Khronos.Material=e) for e in [1.5, 2.5, 3.5, 2.0, 1.8, 3.0]]


[JuliaFormatter] reported by reviewdog 🐶

Cuboid([0.0, 0.0, -cell_val/2 + (i-0.5)*layer_thick],
[cell_val, cell_val, layer_thick]),


[JuliaFormatter] reported by reviewdog 🐶

)
for i in 1:6


[JuliaFormatter] reported by reviewdog 🐶

sources = [base_source],


[JuliaFormatter] reported by reviewdog 🐶

geometry = layers,
monitors = nothing,


[JuliaFormatter] reported by reviewdog 🐶

sources = [base_source],


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶

Khronos.Material=3.5),


[JuliaFormatter] reported by reviewdog 🐶


[JuliaFormatter] reported by reviewdog 🐶

size = [cell_val/2, cell_val/2, 0.0],
frequencies = collect(range(0.5, 1.5, length=10)),


[JuliaFormatter] reported by reviewdog 🐶

results = OrderedDict{String, Float64}()


[JuliaFormatter] reported by reviewdog 🐶

kwargs = Dict{Symbol, Any}(
:cell_size => [cell_val, cell_val, cell_val],


[JuliaFormatter] reported by reviewdog 🐶

:resolution => res,
:sources => cfg.sources,


[JuliaFormatter] reported by reviewdog 🐶

rel = round(rate / baseline, digits=2)


[JuliaFormatter] reported by reviewdog 🐶

f = Figure(size=(800, 500))
ax = Axis(f[1, 1],


[JuliaFormatter] reported by reviewdog 🐶

title = "FDTD Throughput vs. Physics Complexity ($(N)^3 grid)",


[JuliaFormatter] reported by reviewdog 🐶

barplot!(ax, 1:length(results), collect(values(results)),
color=:steelblue, bar_labels=:y,
label_formatter=x -> "$(round(x, digits=0))")


[JuliaFormatter] reported by reviewdog 🐶

n_steps = 100 # timesteps per measurement


[JuliaFormatter] reported by reviewdog 🐶

function main(; grid_sizes=grid_sizes, n_steps=n_steps)


[JuliaFormatter] reported by reviewdog 🐶

results = Dict{Int, Float64}()


[JuliaFormatter] reported by reviewdog 🐶

time_profile = Khronos.ContinuousWaveSource(fcen=1.0),


[JuliaFormatter] reported by reviewdog 🐶

size = [0.0, 0.0, 0.0],


[JuliaFormatter] reported by reviewdog 🐶

cell_size = [cell_size_val, cell_size_val, cell_size_val],


[JuliaFormatter] reported by reviewdog 🐶

resolution = 10,
sources = sources,
boundaries = [[dpml, dpml], [dpml, dpml], [dpml, dpml]],


[JuliaFormatter] reported by reviewdog 🐶

f = Figure(size=(800, 500))
ax = Axis(f[1, 1],


[JuliaFormatter] reported by reviewdog 🐶

title = "Single-GPU FDTD Throughput vs. Problem Size",


[JuliaFormatter] reported by reviewdog 🐶

scatterlines!(ax, Float64.(voxels), rates,
color=:blue, markersize=10, linewidth=2,
label="Khronos (CUDA, Float64)")
axislegend(ax, position=:rb)


[JuliaFormatter] reported by reviewdog 🐶

n_core = 3.4 # core refractive index (e.g. Si)
n_clad = 1.44 # cladding refractive index (e.g. SiO₂)
wg_width = 0.5 # waveguide width (μm)
cell_x = 8.0 # propagation length (μm)
cell_y = 4.0 # transverse extent (μm)
dpml = 0.5 # PML thickness
fcen = 0.65 # source frequency (1/μm) → λ ≈ 1.55 μm
t_end = 30.0 # simulation time


[JuliaFormatter] reported by reviewdog 🐶

function main(; resolution=resolution, n_core=n_core, n_clad=n_clad,
wg_width=wg_width, cell_x=cell_x, cell_y=cell_y,
dpml=dpml, fcen=fcen, t_end=t_end)


[JuliaFormatter] reported by reviewdog 🐶

time_profile = Khronos.ContinuousWaveSource(fcen=fcen),


[JuliaFormatter] reported by reviewdog 🐶

size = [0.0, wg_width * 0.8, 0.0],


[JuliaFormatter] reported by reviewdog 🐶

size = [0.0, mon_y_size, 0.0],


[JuliaFormatter] reported by reviewdog 🐶

size = [0.0, mon_y_size, 0.0],


[JuliaFormatter] reported by reviewdog 🐶

cell_size = [cell_x, cell_y, dz],


[JuliaFormatter] reported by reviewdog 🐶

resolution = resolution,
geometry = geometry,
sources = sources,
boundaries = [[dpml, dpml], [dpml, dpml], [0.0, 0.0]],
monitors = monitors,


[JuliaFormatter] reported by reviewdog 🐶

Khronos.run(sim, until=t_end)


[JuliaFormatter] reported by reviewdog 🐶

profile_far = Array(abs.(sim.monitors[2].monitor_data.fields[1, :, 1, 1]))


[JuliaFormatter] reported by reviewdog 🐶

ys_mon = range(-mon_y_size/2, mon_y_size/2, length=ny_mon)


[JuliaFormatter] reported by reviewdog 🐶

power_in_core = sum(profile_far[core_mask].^2)
power_total = sum(profile_far.^2)
confinement = power_total > 0 ? power_in_core / power_total : 0.0


[JuliaFormatter] reported by reviewdog 🐶

f = Figure(size=(900, 500))


[JuliaFormatter] reported by reviewdog 🐶

ax1 = Axis(f[1, 1:2], xlabel="x (μm)", ylabel="y (μm)",
title="Ez field (TE waveguide, XY plane)", aspect=DataAspect())
xs_2d = range(-cell_x/2, cell_x/2, length=size(ez,1))
ys_2d = range(-cell_y/2, cell_y/2, length=size(ez,2))


[JuliaFormatter] reported by reviewdog 🐶

hm = heatmap!(ax1, collect(xs_2d), collect(ys_2d), ez,
colormap=:bluesreds, colorrange=(-vmax, vmax))


[JuliaFormatter] reported by reviewdog 🐶

hlines!(ax1, [-wg_width/2, wg_width/2], color=:white, linestyle=:dash, linewidth=1)


[JuliaFormatter] reported by reviewdog 🐶

ax2 = Axis(f[2, 1], xlabel="y (μm)", ylabel="|Ez| (a.u.)",
title="Near field (x=$(round(mon_x1,digits=1)))")


[JuliaFormatter] reported by reviewdog 🐶

lines!(ax2, collect(ys_mon), profile_near ./ maximum(profile_near), color=:blue)


[JuliaFormatter] reported by reviewdog 🐶

vlines!(ax2, [-wg_width/2, wg_width/2], color=:gray, linestyle=:dash)


[JuliaFormatter] reported by reviewdog 🐶

ax3 = Axis(f[2, 2], xlabel="y (μm)", ylabel="|Ez| (a.u.)",
title="Far field (x=$(round(mon_x2,digits=1)))")


[JuliaFormatter] reported by reviewdog 🐶

lines!(ax3, collect(ys_mon), profile_far ./ maximum(profile_far), color=:blue)


[JuliaFormatter] reported by reviewdog 🐶

vlines!(ax3, [-wg_width/2, wg_width/2], color=:gray, linestyle=:dash)


[JuliaFormatter] reported by reviewdog 🐶

@info("Preparing simulation object ($(sim.Nx)×$(sim.Ny)×$(sim.Nz) = $(num_voxels) voxels)...")


[JuliaFormatter] reported by reviewdog 🐶

@info("Simulation complete: $(total_steps) steps in $(round(elapsed, digits=3))s " *
"($(round(mcells_per_s, digits=1)) MVoxels/s overall, " *
"$(round(steady_rate, digits=1)) MVoxels/s after warmup)")


[JuliaFormatter] reported by reviewdog 🐶

@info("Simulation complete: $(total_steps) steps in $(round(elapsed, digits=3))s ($(round(mcells_per_s, digits=1)) MVoxels/s)")


[JuliaFormatter] reported by reviewdog 🐶

Khronos.jl/src/Timestep.jl

Lines 183 to 189 in 475ddae

Ax, Ay, Az,
Tx, Ty, Tz,
Cx, Cy, Cz,
Ux, Uy, Uz,
σDx, σDy, σDz,
σx, σy, σz,
Δt, Δx, Δy, Δz,


[JuliaFormatter] reported by reviewdog 🐶

@inline function generic_curl!(K, C, U::Nothing, T, σD, σ_next::AbstractArray, σ_prev, idx_array)


[JuliaFormatter] reported by reviewdog 🐶

Khronos.jl/src/Timestep.jl

Lines 451 to 457 in 475ddae

Ax, Ay, Az,
Tx, Ty, Tz,
Wx, Wy, Wz,
Px, Py, Pz,
Sx, Sy, Sz,
m_inv, m_inv_x, m_inv_y, m_inv_z,
σx, σy, σz,

Comment on lines +18 to +22
εx = 4.0 # permittivity seen by x-polarized light
εy = 1.5 # permittivity seen by y-polarized light
εz = 4.0 # permittivity along propagation direction
d_slab = 0.5 # slab thickness (μm)
dpml = 0.5 # PML thickness
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
εx = 4.0 # permittivity seen by x-polarized light
εy = 1.5 # permittivity seen by y-polarized light
εz = 4.0 # permittivity along propagation direction
d_slab = 0.5 # slab thickness (μm)
dpml = 0.5 # PML thickness
εx = 4.0 # permittivity seen by x-polarized light
εy = 1.5 # permittivity seen by y-polarized light
εz = 4.0 # permittivity along propagation direction
d_slab = 0.5 # slab thickness (μm)
dpml = 0.5 # PML thickness

Comment on lines +25 to +26
function main(; resolution=resolution, εx=εx, εy=εy, εz=εz,
d_slab=d_slab, dpml=dpml)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
function main(; resolution=resolution, εx=εx, εy=εy, εz=εz,
d_slab=d_slab, dpml=dpml)
function main(;
resolution = resolution,
εx = εx,
εy = εy,
εz = εz,
d_slab = d_slab,
dpml = dpml,
)

Khronos.choose_backend(Khronos.CUDADevice(), Float64)

# Use the same source parameters as the known-working periodic_slab.jl
λ = range(1.0, 2.0, length=100)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
λ = range(1.0, 2.0, length=100)
λ = range(1.0, 2.0, length = 100)

monitor_xy = 0.5
monitor_height = 0.5
sxy = 2*dpml + monitor_xy
sz = 2*dpml + 4*monitor_height
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
sz = 2*dpml + 4*monitor_height
sz = 2*dpml + 4*monitor_height

Comment on lines +42 to +44
t = ((1 .+ rho) .* (1 .- rho) .* exp.(-2im * π * n * d ./ λ_vec)) ./
(1 .- rho.^2 .* exp.(-4im * π * n * d ./ λ_vec))
return abs.(t).^2
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
t = ((1 .+ rho) .* (1 .- rho) .* exp.(-2im * π * n * d ./ λ_vec)) ./
(1 .- rho.^2 .* exp.(-4im * π * n * d ./ λ_vec))
return abs.(t).^2
t =
((1 .+ rho) .* (1 .- rho) .* exp.(-2im * π * n * d ./ λ_vec)) ./
(1 .- rho .^ 2 .* exp.(-4im * π * n * d ./ λ_vec))
return abs.(t) .^ 2

Comment on lines +39 to +41
resolution = 10,
sources = sources,
boundaries = [[dpml, dpml], [dpml, dpml], [dpml, dpml]],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
resolution = 10,
sources = sources,
boundaries = [[dpml, dpml], [dpml, dpml], [dpml, dpml]],
resolution = 10,
sources = sources,
boundaries = [[dpml, dpml], [dpml, dpml], [dpml, dpml]],

working_set_mb = n_arrays * array_bytes / 1024^2

# Warmup
for _ in 1:n_warmup
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
for _ in 1:n_warmup
for _ = 1:n_warmup

stop_event = CUDA.CuEvent()

CUDA.record(start_event)
for _ in 1:n_measure
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
for _ in 1:n_measure
for _ = 1:n_measure


# Also measure with run_benchmark for comparison (has the sync bug)
sim2 = Khronos.Simulation(
cell_size = [cell_val, cell_val, cell_val],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
cell_size = [cell_val, cell_val, cell_val],
cell_size = [cell_val, cell_val, cell_val],

Comment on lines +83 to +85
resolution = 10,
sources = sources,
boundaries = [[dpml, dpml], [dpml, dpml], [dpml, dpml]],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
resolution = 10,
sources = sources,
boundaries = [[dpml, dpml], [dpml, dpml], [dpml, dpml]],
resolution = 10,
sources = sources,
boundaries = [[dpml, dpml], [dpml, dpml], [dpml, dpml]],

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant