NVIDIA PTX: f64 support for local memory, conversions, and math#62

Merged: Zaneham merged 1 commit into master from f64-nvidia-support on Mar 25, 2026
Zaneham (Owner) commented Mar 25, 2026

Summary

Add double-precision (f64) codegen to the NVIDIA PTX backend. First f64 workload: quantum chemistry ERI kernel.

6 bugs found and fixed:

| Bug | File | Fix |
|---|---|---|
| Local ld/st used .u32 for f64 | isel.c | Added NV_LD/ST_LOC_F64 dispatch |
| fabs hardcoded to f32 | isel.c | Check rf == NV_RF_F64 → NV_ABS_F64 |
| int→double cvt used .f32 target | isel.c | Added NV_CVT_F64_S32/U32 |
| double→int cvt used .f32 source | isel.c | Added NV_CVT_S32/U32_F64 |
| mov.f64 immediate bare 0 | emit.c | Emit 0d0000000000000000 hex format |
| int * double no implicit promotion | bir_lower.c | coerce_to() inserts SITOFP |
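
The first four fixes all have the same shape: the instruction selector must look at the operand or result type instead of assuming 32 bits. A minimal sketch of that dispatch follows, written as a mapping from type to PTX mnemonic. The `ScalarTy` enum and the `ptx_*` helper names are hypothetical and are not taken from isel.c, which dispatches to opcodes such as NV_CVT_F64_S32 rather than to strings; the rounding modifiers (`.rn`, `.rzi`) are an assumption about what the backend emits.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical scalar types; the real backend tracks a register file (rf). */
typedef enum { T_S32, T_U32, T_F32, T_F64 } ScalarTy;

/* int -> float: the destination width comes from the RESULT type. */
static const char *ptx_int_to_fp(ScalarTy src, ScalarTy dst)
{
    assert(src == T_S32 || src == T_U32);
    assert(dst == T_F32 || dst == T_F64);
    if (dst == T_F64)
        return src == T_S32 ? "cvt.rn.f64.s32" : "cvt.rn.f64.u32";
    return src == T_S32 ? "cvt.rn.f32.s32" : "cvt.rn.f32.u32";
}

/* float -> int: the source width comes from the OPERAND type.
 * .rzi rounds toward zero, which matches a C cast. */
static const char *ptx_fp_to_int(ScalarTy src, ScalarTy dst)
{
    assert(src == T_F32 || src == T_F64);
    assert(dst == T_S32 || dst == T_U32);
    if (src == T_F64)
        return dst == T_S32 ? "cvt.rzi.s32.f64" : "cvt.rzi.u32.f64";
    return dst == T_S32 ? "cvt.rzi.s32.f32" : "cvt.rzi.u32.f32";
}

/* fabs: pick the width from the operand instead of hardcoding f32. */
static const char *ptx_fabs(ScalarTy t)
{
    assert(t == T_F32 || t == T_F64);
    return t == T_F64 ? "abs.f64" : "abs.f32";
}

/* Local (scratch) load: f64 needs its own case, otherwise it falls
 * through to the 32-bit default and only half the value is moved. */
static const char *ptx_ld_local(ScalarTy t)
{
    switch (t) {
    case T_F64: return "ld.local.f64";
    case T_F32: return "ld.local.f32";
    case T_S32: return "ld.local.s32";
    default:    return "ld.local.u32";
    }
}
```

With these helpers, `ptx_int_to_fp(T_S32, T_F64)` yields `cvt.rn.f64.s32`, the instruction the pre-fix selector could never produce.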

Middle-end fix (bir_lower.c): coerce_to() applies C's usual arithmetic conversions to the operands of binary ops. Because the fix lives in the middle end, every backend benefits from it. The AMD backend has the same f64 gaps (scratch_dword for local loads/stores, v_and_b32 for fabs, v_cvt_f32_i32 for sitofp); those are deferred to Phase 2.
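
To illustrate the idea behind coerce_to(), here is a toy sketch of operand promotion for `int * double`. Everything except the name `coerce_to` and the SITOFP/UITOFP/FMUL op names is hypothetical: the `Ty`, `Op`, and `Val` types, the `trace` buffer, and `lower_mul` are stand-ins for illustration, and the real function in bir_lower.c may be structured quite differently.

```c
#include <assert.h>

/* Toy stand-ins; not the project's BIR data structures. The enum order
 * encodes conversion rank: i32 < u32 < f32 < f64. */
typedef enum { TY_I32, TY_U32, TY_F32, TY_F64 } Ty;
typedef enum { OP_SITOFP, OP_UITOFP, OP_FPEXT, OP_MUL, OP_FMUL } Op;
typedef struct { int id; Ty ty; } Val;

static Op  trace[16];        /* ops emitted so far, for inspection */
static int ntrace, next_id;

/* Record an instruction and return its result (operands omitted for brevity). */
static Val emit(Op op, Ty ty)
{
    assert(ntrace < 16);
    trace[ntrace++] = op;
    return (Val){ next_id++, ty };
}

static int is_float(Ty t) { return t == TY_F32 || t == TY_F64; }

/* Usual arithmetic conversions, restricted to the four toy types:
 * the higher-ranked operand type wins. */
static Ty common_type(Ty a, Ty b) { return a > b ? a : b; }

/* Convert v to type `to`, emitting a conversion only when one is needed.
 * Only widening cases occur when `to` comes from common_type(). */
static Val coerce_to(Val v, Ty to)
{
    if (v.ty == to) return v;
    if (is_float(to) && v.ty == TY_I32) return emit(OP_SITOFP, to);
    if (is_float(to) && v.ty == TY_U32) return emit(OP_UITOFP, to);
    if (to == TY_F64 && v.ty == TY_F32) return emit(OP_FPEXT, to);
    v.ty = to;               /* i32 -> u32: same bits, no instruction */
    return v;
}

/* Lower a * b: promote both operands first, then pick the multiply. */
static Val lower_mul(Val a, Val b)
{
    Ty t = common_type(a.ty, b.ty);
    a = coerce_to(a, t);
    b = coerce_to(b, t);
    (void)a; (void)b;        /* a real emitter would pass these as operands */
    return emit(is_float(t) ? OP_FMUL : OP_MUL, t);
}
```

Lowering an i32 value times an f64 value through `lower_mul` records SITOFP followed by FMUL in `trace`, which is the sequence the fix table describes.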

Validation

  • 90/90 existing tests pass (no regressions)
  • Moa (f32 neutron transport): k_eff = 0.995 on RTX 4060 Ti ✓
  • Kokako (f64 quantum chemistry): benzene C₆H₆ ERIs match CPU to all digits, 20× speedup

Test plan

  • BarraCUDA test suite: 90/90 pass
  • Moa GPU smoke test: Godiva k_eff = 0.995
  • Kokako H₂O/STO-3G: E = -74.9630 (CPU match)
  • Kokako CH₄/STO-3G: E = -39.7267 (CPU match)
  • Kokako C₆H₆/STO-3G: E = -227.8907 (CPU match, 20× speedup)

This was generated with Claude Code (Cheers to the peeps gifting me coupons, lol!)

Add double-precision (f64) codegen support to the NVIDIA PTX backend.
Previously all f64 local loads/stores fell through to u32, fabs used
f32 abs, and int-to-float conversions hardcoded f32 targets. These gaps
were discovered by Kokako (open-source quantum chemistry), the first f64
GPU workload.

NVIDIA backend (isel.c, emit.c, nvidia.h):
  - NV_LD_LOC_F64, NV_ST_LOC_F64: f64 local (scratch) memory
  - NV_CVT_F64_S32, NV_CVT_F64_U32: int32 -> fp64 conversion
  - NV_CVT_S32_F64, NV_CVT_U32_F64: fp64 -> int32 conversion
  - BIR_FABS: dispatch to NV_ABS_F64 when operand is f64
  - BIR_SITOFP/UITOFP: dispatch to f64 cvt when result type is f64
  - BIR_FPTOSI/FPTOUI: dispatch to f64 cvt when source type is f64
  - NV_MOV_F64 immediate: emit 0d hex format (not bare integer)
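
The 0d immediate format mentioned in the last item is "0d" followed by the 16 hex digits of the value's IEEE-754 bit pattern. A minimal sketch of producing it is below; `fmt_ptx_f64_imm` is a hypothetical helper name, not the function used in emit.c.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not the project's emit.c code: writes a double as a
 * PTX f64 literal ("0d" + 16 hex digits of the IEEE-754 bit pattern).
 * Returns the number of characters written, excluding the terminating NUL. */
static int fmt_ptx_f64_imm(char *buf, size_t cap, double v)
{
    uint64_t bits;
    assert(sizeof bits == sizeof v);
    memcpy(&bits, &v, sizeof bits);   /* copy the bits without aliasing UB */
    return snprintf(buf, cap, "0d%016" PRIX64, bits);
}
```

For 0.0 this yields 0d0000000000000000, and for 1.0 it yields 0d3FF0000000000000.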

Middle-end (bir_lower.c):
  - coerce_to(): implicit operand promotion for binary expressions
  - Handles int * double -> SITOFP + FMUL (C usual arithmetic)
  - One fix, all backends benefit

Frontend (sema.c):
  - Register sqrt as f64 math builtin (-> PTX sqrt.rn.f64, exact)

Validated: Kokako ERI kernel (Obara-Saika VRR+HRR) produces
bit-identical results to CPU for benzene C6H6 (36 basis functions,
222K integrals) on RTX 4060 Ti. 20x speedup at FP64.

90/90 existing tests pass. Moa (f32) unaffected.
@Zaneham Zaneham merged commit 358a9ff into master Mar 25, 2026
3 checks passed