feat: Implement CUDA acceleration for Ricci observable #7
Open
This commit introduces a CUDA-based implementation for the average sphere
distance calculation within the Ricci observable. The goal is to leverage
GPU parallelism to speed up this computationally intensive part of the
simulation.
Key changes include:
1. **Makefile Updates:** The Makefile now compiles `.cu` files with NVCC,
tracks their dependencies, and links against the CUDA libraries.
2. **CUDA Kernels (`observables/ricci_cuda_kernels.cu`, `.hpp`):**
* Introduced `pairwise_bfs_kernel`, a CUDA kernel that computes the
distance between every pair of vertices (u, v) with u taken from
sphere s1 and v from sphere s2.
* Each pair's distance is found with a breadth-first search (BFS),
implemented in the `__device__` function
`calculate_distance_bfs_device`. The BFS is depth-limited to
3*epsilon and keeps its queue and visited set in fixed-size
thread-local arrays, which bounds per-thread memory use.
* A C++ wrapper function,
`RicciCUDATask::calculate_sum_and_count_distances_cuda`,
handles GPU memory allocation, conversion of the adjacency list to
compressed sparse row (CSR) format, host-to-device (H2D) and
device-to-host (D2H) transfers, the kernel launch, and cleanup.
3. **Ricci Observable Modification (`observables/ricci.cpp`):**
* The `Ricci::averageSphereDistance` method now calls the
CUDA wrapper function instead of performing the BFS calculations
on the CPU.
* The original logic for sphere generation and the specific averaging
formula (`sum_distances / (epsilon * count_distances)`) are
preserved.
* Added safety checks for empty data structures and zero epsilon.
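
The adjacency-list-to-CSR conversion mentioned for the wrapper can be sketched on the host as follows. This is a minimal illustration, not the code from this PR: the `CsrGraph` struct, its field names, and the function `adjacency_to_csr` are hypothetical, and the sketch assumes vertices are numbered 0..N-1 with `int` indices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR container; names are illustrative, not taken from the PR.
struct CsrGraph {
    std::vector<int> row_offsets;  // size = num_vertices + 1
    std::vector<int> col_indices;  // size = total number of stored neighbors
};

// Flatten an adjacency list into two contiguous arrays, which is the
// layout that can be copied to the GPU with plain memory transfers.
CsrGraph adjacency_to_csr(const std::vector<std::vector<int>>& adjacency) {
    CsrGraph csr;
    csr.row_offsets.reserve(adjacency.size() + 1);
    csr.row_offsets.push_back(0);
    for (const auto& neighbors : adjacency) {
        for (int v : neighbors) {
            // Assumption: every neighbor index refers to an existing vertex.
            assert(v >= 0 && static_cast<std::size_t>(v) < adjacency.size());
            csr.col_indices.push_back(v);
        }
        // The offset after vertex i marks where vertex i + 1 begins.
        csr.row_offsets.push_back(static_cast<int>(csr.col_indices.size()));
    }
    return csr;
}
```

With this layout, the neighbors of vertex `u` are `col_indices[row_offsets[u]]` up to, but not including, `col_indices[row_offsets[u + 1]]`, so a device function can iterate over them without following pointers.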
This implementation aims to replace the previous CPU-bound calculation
with a parallel GPU version. Further testing and validation in a compiled
environment are needed to verify correctness, performance, and robustness,
especially concerning the fixed-size limitations in the per-thread BFS.
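
One way to approach that validation is a small CPU reference that mirrors the per-thread BFS and can be compared against the GPU result on the same input. The sketch below is an assumption-laden illustration, not the PR's code: the capacity `kMaxQueue`, all function and struct names, the choice to skip pairs that are not reached within the depth limit, and the choice to return 0 when epsilon or the pair count is zero are hypothetical. It also simplifies the storage by letting the never-wrapping queue array double as the visited set, whereas the PR describes separate arrays.

```cpp
#include <cassert>
#include <vector>

// Illustrative capacity; the limits used in the PR are not stated.
constexpr int kMaxQueue = 64;

// Depth-limited BFS over a CSR graph using fixed-size local arrays.
// Returns the hop distance from src to dst, or -1 if dst is not reached
// within max_depth hops or the fixed capacity is exhausted.
int bounded_bfs_distance(const std::vector<int>& row_offsets,
                         const std::vector<int>& col_indices,
                         int src, int dst, int max_depth) {
    assert(!row_offsets.empty());
    if (src == dst) return 0;
    int queue[kMaxQueue];  // entries are never overwritten, so this is also the visited set
    int depth[kMaxQueue];
    int head = 0, tail = 0;
    queue[tail] = src;
    depth[tail] = 0;
    ++tail;
    while (head < tail) {
        const int u = queue[head];
        const int d = depth[head];
        ++head;
        if (d == max_depth) continue;  // do not expand beyond the depth limit
        for (int e = row_offsets[u]; e < row_offsets[u + 1]; ++e) {
            const int v = col_indices[e];
            if (v == dst) return d + 1;
            bool seen = false;
            for (int i = 0; i < tail; ++i) {
                if (queue[i] == v) { seen = true; break; }
            }
            if (seen) continue;
            if (tail == kMaxQueue) return -1;  // fixed-size limit hit
            queue[tail] = v;
            depth[tail] = d + 1;
            ++tail;
        }
    }
    return -1;
}

struct SumCount {
    long long sum;
    long long count;
};

// Sum and count distances over all pairs (a in s1, b in s2), with the
// BFS limited to 3 * epsilon hops. Unreached pairs are skipped (assumption).
SumCount pairwise_sum_and_count(const std::vector<int>& row_offsets,
                                const std::vector<int>& col_indices,
                                const std::vector<int>& s1,
                                const std::vector<int>& s2, int epsilon) {
    SumCount result{0, 0};
    for (int a : s1) {
        for (int b : s2) {
            const int d = bounded_bfs_distance(row_offsets, col_indices,
                                               a, b, 3 * epsilon);
            if (d >= 0) {
                result.sum += d;
                ++result.count;
            }
        }
    }
    return result;
}

// sum_distances / (epsilon * count_distances), guarded against zero.
double average_sphere_distance(const SumCount& r, int epsilon) {
    if (epsilon == 0 || r.count == 0) return 0.0;
    return static_cast<double>(r.sum) /
           (static_cast<double>(epsilon) * static_cast<double>(r.count));
}
```

Returning -1 when the capacity is exhausted makes the fixed-size limitation observable, so a test can count how many pairs hit it instead of silently receiving a wrong distance.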