NVIDIA PTX: f64 support for local memory, conversions, and math by Zaneham · Pull Request #62 · Zaneham/BarraCUDA

Zaneham · 2026-03-25T07:16:54Z

Summary

Add double-precision (f64) codegen to the NVIDIA PTX backend. First f64 workload: quantum chemistry ERI kernel.

Bugs found and fixed (five in the NVIDIA backend, one in the middle end):

| Bug | File | Fix |
|---|---|---|
| Local ld/st used `.u32` for f64 | isel.c | Added `NV_LD/ST_LOC_F64` dispatch |
| `fabs` hardcoded to f32 | isel.c | Check `rf == NV_RF_F64` → `NV_ABS_F64` |
| `int→double` cvt used `.f32` target | isel.c | Added `NV_CVT_F64_S32/U32` |
| `double→int` cvt used `.f32` source | isel.c | Added `NV_CVT_S32/U32_F64` |
| `mov.f64` immediate bare `0` | emit.c | Emit `0d0000000000000000` hex format |
| `int * double` no implicit promotion | bir_lower.c | `coerce_to()` inserts `SITOFP` |
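
For reviewers unfamiliar with the `0d` form: a PTX f64 literal is `0d` followed by the 16 hex digits of the value's IEEE-754 bit pattern. A minimal sketch of that formatting follows. The helper name `fmt_ptx_f64_imm` is hypothetical and this is not the code in emit.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not BarraCUDA's actual emit.c code.
 * Formats a double as a PTX f64 literal: "0d" followed by the 16 hex
 * digits of its IEEE-754 bit pattern. Returns the number of characters
 * written (18 on success), as snprintf does. */
static int fmt_ptx_f64_imm(char *buf, size_t len, double value)
{
    uint64_t bits;

    /* The sketch assumes a 64-bit IEEE-754 double. */
    assert(sizeof(double) == sizeof(uint64_t));

    /* memcpy reinterprets the bits portably; a pointer cast would
       violate strict aliasing. */
    memcpy(&bits, &value, sizeof bits);
    return snprintf(buf, len, "0d%016llX", (unsigned long long)bits);
}
```

With this sketch, `0.0` formats as `0d0000000000000000` and `1.0` as `0d3FF0000000000000`.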

Middle-end fix (bir_lower.c): coerce_to() applies the C usual arithmetic conversions to the operands of binary ops, so the one fix benefits all backends. The AMD backend has the same f64 gaps (scratch_dword, v_and_b32 for fabs, v_cvt_f32_i32 for sitofp); those are deferred to Phase 2.
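
A rough sketch of the promotion rule described above, restricted to four scalar types. Every name here (`Type`, `CastOp`, `common_type`, `coerce_op`) is made up for illustration. Only the SITOFP and UITOFP op names come from this PR; `CAST_FPEXT` is an assumption, and the real coerce_to() in bir_lower.c may be structured quite differently.

```c
#include <assert.h>

/* All names here are illustrative; they are not BarraCUDA's real types. */
typedef enum { TY_I32, TY_U32, TY_F32, TY_F64 } Type;
typedef enum { CAST_NONE, CAST_SITOFP, CAST_UITOFP, CAST_FPEXT } CastOp;

/* Common type of a binary expression, following the C usual arithmetic
   conversions restricted to these four types. */
static Type common_type(Type a, Type b)
{
    if (a == TY_F64 || b == TY_F64) return TY_F64;
    if (a == TY_F32 || b == TY_F32) return TY_F32;
    if (a == TY_U32 || b == TY_U32) return TY_U32;
    return TY_I32;
}

/* Cast needed to bring an operand of type `from` to the common type `to`. */
static CastOp coerce_op(Type from, Type to)
{
    if (from == to) return CAST_NONE;
    if (to == TY_F32 || to == TY_F64) {
        if (from == TY_I32) return CAST_SITOFP;  /* signed int -> float */
        if (from == TY_U32) return CAST_UITOFP;  /* unsigned int -> float */
        return CAST_FPEXT;                       /* f32 -> f64 (assumed op) */
    }
    /* i32 -> u32 keeps the same bits, so no instruction is needed. */
    return CAST_NONE;
}
```

For `int * double`, `common_type` yields f64, the int operand gets `CAST_SITOFP`, and the double operand is left alone, which matches the SITOFP + FMUL lowering named in the commit message.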

Validation

90/90 existing tests pass (no regressions)
Moa (f32 neutron transport): k_eff = 0.995 on RTX 4060 Ti ✓
Kokako (f64 quantum chemistry): benzene C₆H₆ ERIs match CPU to all digits, 20× speedup

Test plan

BarraCUDA test suite: 90/90 pass
Moa GPU smoke test: Godiva k_eff = 0.995
Kokako H₂O/STO-3G: E = -74.9630 (CPU match)
Kokako CH₄/STO-3G: E = -39.7267 (CPU match)
Kokako C₆H₆/STO-3G: E = -227.8907 (CPU match, 20× speedup)

This was generated with Claude Code (Cheers to the peeps gifting me coupons, lol!)

Add double-precision (f64) codegen support to the NVIDIA PTX backend. Previously all f64 local loads/stores fell through to u32, fabs used f32 abs, and int-to-float conversions hardcoded f32 targets. Discovered by Kokako (open-source quantum chemistry), the first f64 GPU workload.

NVIDIA backend (isel.c, emit.c, nvidia.h):
- NV_LD_LOC_F64, NV_ST_LOC_F64: f64 local (scratch) memory
- NV_CVT_F64_S32, NV_CVT_F64_U32: int32 -> fp64 conversion
- NV_CVT_S32_F64, NV_CVT_U32_F64: fp64 -> int32 conversion
- BIR_FABS: dispatch to NV_ABS_F64 when operand is f64
- BIR_SITOFP/UITOFP: dispatch to f64 cvt when result type is f64
- BIR_FPTOSI/FPTOUI: dispatch to f64 cvt when source type is f64
- NV_MOV_F64 immediate: emit 0d hex format (not bare integer)

Middle-end (bir_lower.c):
- coerce_to(): implicit operand promotion for binary expressions
- Handles int * double -> SITOFP + FMUL (C usual arithmetic)
- One fix, all backends benefit

Frontend (sema.c):
- Register sqrt as f64 math builtin (-> PTX sqrt.rn.f64, exact)

Validated: Kokako ERI kernel (Obara-Saika VRR+HRR) produces bit-identical results to CPU for benzene C6H6 (36 basis functions, 222K integrals) on RTX 4060 Ti. 20x speedup at FP64. 90/90 existing tests pass. Moa (f32) unaffected.
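
The type-directed conversion dispatch listed above could look roughly like the sketch below. The four `NV_CVT_*` opcode names are from this PR. The enum layout, `NvType`, `NV_OP_NONE`, and both functions are hypothetical. The mnemonic strings are also an assumption: they use `.rn` for int -> float and `.rzi` (round toward zero, matching C truncation) for float -> int, and the PR does not state which rounding modifiers BarraCUDA actually emits.

```c
#include <assert.h>
#include <string.h>

/* Opcode names are taken from the PR; the enum layout, NvType, and both
   functions are illustrative and not BarraCUDA's actual isel.c/emit.c. */
typedef enum { NV_TY_S32, NV_TY_U32, NV_TY_F64 } NvType;
typedef enum {
    NV_OP_NONE,
    NV_CVT_F64_S32, NV_CVT_F64_U32,   /* int32 -> fp64 */
    NV_CVT_S32_F64, NV_CVT_U32_F64    /* fp64 -> int32 */
} NvOp;

/* Pick the conversion opcode from the (destination, source) type pair.
   Pairs this sketch does not cover return NV_OP_NONE. */
static NvOp select_cvt(NvType dst, NvType src)
{
    if (dst == NV_TY_F64 && src == NV_TY_S32) return NV_CVT_F64_S32;
    if (dst == NV_TY_F64 && src == NV_TY_U32) return NV_CVT_F64_U32;
    if (dst == NV_TY_S32 && src == NV_TY_F64) return NV_CVT_S32_F64;
    if (dst == NV_TY_U32 && src == NV_TY_F64) return NV_CVT_U32_F64;
    return NV_OP_NONE;
}

/* Assumed PTX mnemonics for each opcode; NULL for anything else. */
static const char *cvt_mnemonic(NvOp op)
{
    switch (op) {
    case NV_CVT_F64_S32: return "cvt.rn.f64.s32";
    case NV_CVT_F64_U32: return "cvt.rn.f64.u32";
    case NV_CVT_S32_F64: return "cvt.rzi.s32.f64";
    case NV_CVT_U32_F64: return "cvt.rzi.u32.f64";
    default:             return NULL;
    }
}
```

Keying the choice on both the destination and source types is what avoids the original bug, where the f32 form was picked regardless of the operand width.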

Zaneham merged commit 358a9ff into master Mar 25, 2026
3 checks passed

Zaneham merged 1 commit into master from f64-nvidia-support
