NVIDIA PTX: f64 support for local memory, conversions, and math#62
Merged
Conversation
Add double-precision (f64) codegen support to the NVIDIA PTX backend. Previously, all f64 local loads/stores fell through to u32, fabs used the f32 abs, and int-to-float conversions hardcoded f32 targets. Discovered by Kokako (open-source quantum chemistry), the first f64 GPU workload.

NVIDIA backend (isel.c, emit.c, nvidia.h):
- NV_LD_LOC_F64, NV_ST_LOC_F64: f64 local (scratch) memory
- NV_CVT_F64_S32, NV_CVT_F64_U32: int32 -> fp64 conversion
- NV_CVT_S32_F64, NV_CVT_U32_F64: fp64 -> int32 conversion
- BIR_FABS: dispatch to NV_ABS_F64 when the operand is f64
- BIR_SITOFP/UITOFP: dispatch to the f64 cvt when the result type is f64
- BIR_FPTOSI/FPTOUI: dispatch to the f64 cvt when the source type is f64
- NV_MOV_F64 immediate: emit the 0d hex format (not a bare integer)

Middle-end (bir_lower.c):
- coerce_to(): implicit operand promotion for binary expressions
- Handles int * double -> SITOFP + FMUL (C usual arithmetic conversions)
- One fix, all backends benefit

Frontend (sema.c):
- Register sqrt as an f64 math builtin (-> PTX sqrt.rn.f64, exact)

Validated: the Kokako ERI kernel (Obara-Saika VRR+HRR) produces bit-identical results to the CPU path for benzene C6H6 (36 basis functions, 222K integrals) on an RTX 4060 Ti, with a 20x speedup at FP64. All 90/90 existing tests pass; Moa (f32) is unaffected.
Summary
Add double-precision (f64) codegen to the NVIDIA PTX backend. First f64 workload: quantum chemistry ERI kernel.
Six bugs found and fixed:
| Bug | Symptom | Fix |
| --- | --- | --- |
| f64 local load/store | emitted `.u32` for f64 | `NV_LD/ST_LOC_F64` dispatch |
| `fabs` | hardcoded to f32 | `rf == NV_RF_F64` → `NV_ABS_F64` |
| int → double | cvt used `.f32` target | `NV_CVT_F64_S32/U32` |
| double → int | cvt used `.f32` source | `NV_CVT_S32/U32_F64` |
| `mov.f64` immediate | bare `0` | `0d0000000000000000` hex format |
| `int * double` | no implicit promotion | `coerce_to()` inserts `SITOFP` |

Middle-end fix (bir_lower.c):
`coerce_to()` handles C usual arithmetic conversions for binary ops. One fix, all backends benefit. The AMD backend has the same f64 gaps (`scratch_dword`, `v_and_b32` for fabs, `v_cvt_f32_i32` for sitofp); that is Phase 2.

Validation
Test plan
This was generated with Claude Code (Cheers to the peeps gifting me coupons, lol!)