Eliminating O(n³) bottlenecks at the silicon level: Automated architectural auditing for next-generation GPU-accelerated solvers.
gemini-cuda is a high-performance C++ utility bridging frontier LLM reasoning with GPU systems engineering. It performs deep architectural audits of NVIDIA .cu source code to identify synchronization errors, unoptimized memory patterns, and hardware-level bottlenecks in parallel solvers that traditional static analysis misses.
As GPU compute becomes the primary line item in AI infrastructure TCO, code efficiency at the kernel level is critical. gemini-cuda leverages frontier reasoning models with massive context windows to help engineering teams:
- Reduce Warp Divergence: Identify branch-heavy logic that degrades streaming multiprocessor (SM) throughput.
- Eliminate Race Conditions: Detect missing
__syncthreads()in complex reduction algorithms across multi-file dependencies. - Optimize Memory Coalescing: Ensure global memory access patterns are aligned for maximum memory bandwidth.
Ensure you have the required networking libraries to communicate with the API.
sudo apt-get update
sudo apt-get install libcurl4-openssl-dev cmake g++We support two frameworks - Gemini and Claude, you can configure which one should be used on compile time using a make variable.
git clone [https://github.com/abokov/gemini-cuda.git](https://github.com/abokov/gemini-cuda.git)
cd gemini-cuda
mkdir build && cd build
cmake ..
makegit clone https://github.com/abokov/gemini-cuda.git cd gemini-cuda mkdir build_claude && cd build_claude cmake -DUSE_CLAUDE=ON .. make
Copy the example environment file and add your Google AI Studio API key.
cp ../.env.example ../.env
# Edit .env to add your keys, then source it:
export $(grep -v '^#' ../.env | xargs)Run the tool against any CUDA kernel. A broken reduction sample is included for testing.
export GEMINI_API_KEY="your_api_key_here"
./gemini-cuda ../samples/broken_reduction.cuWhen running against a kernel with hidden synchronization flaws, the engine outputs actionable, architecturally-aware fixes:
🚀 Dispatching audit to: gemini-pro-latest...
--- AUDIT REPORT ---
[CRITICAL] Race Condition Detected:
Kernel `buggy_sum_reduction` accesses shared memory `sdata` without proper synchronization.
[ANALYSIS]:
Threads are entering the reduction loop before all memory loads from `input` to `sdata` are complete across the block.
[RESOLUTION]:
Insert `__syncthreads()` at line 14, immediately before the `for` loop, to ensure all threads have finished writing to shared memory.
The samples/ directory contains deliberately flawed CUDA kernels designed to evaluate the engine's ability to detect deep silicon-level bottlenecks. You can run gemini-cuda against any of these to test the LLM's diagnostic accuracy:
broken_reduction.cu: Demonstrates critical race conditions (missing__syncthreads()) and severe warp divergence caused by branching within a warp.uncoalesced_transpose.cu: Highlights global memory transaction overhead caused by strided, uncoalesced memory writes.bank_conflict_matmul.cu: Simulates n-way shared memory bank conflicts during column-wise reads in a tiled matrix multiplication kernel.atomic_bottleneck.cu: Shows extreme execution serialization by forcing an entire grid of threads to queue for a single global atomic counter.naive_softmax.cu: Exposes severe global memory bandwidth thrashing common in unoptimized GenAI/Attention mechanisms (missing kernel fusion).blocking_streams.cu: Simulates PCIe pipeline stalls caused by synchronous Host-to-Device memory transfers on the default stream.tail_effect_imbalance.cu: Demonstrates SM resource waste due to extreme thread-level workload imbalance and warp divergence.
Author: Alexey Bokov
Contact: alex@bokov.net
License: Apache 2.0