# CUDA Programming Guide: From Basics to Advanced
    CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model. This guide will help you understand and implement CUDA kernels efficiently.

## Prerequisites

- NVIDIA GPU (Compute Capability 3.0+)
- CUDA Toolkit installed
- Basic C/C++ knowledge
- Understanding of parallel computing concepts

## Thread Hierarchy

```
Grid
└── Blocks
    └── Threads
```
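
Launching a kernel makes this hierarchy concrete: the host chooses how many threads run per block and how many blocks make up the grid. Below is a minimal sketch (the kernel name `myKernel` and the sizes are illustrative placeholders, not from the original text):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread computes its global index across the grid.
__global__ void myKernel(int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // block offset + thread offset
    if (idx < n) {
        // ... process element idx ...
    }
}

int main() {
    int n = 1 << 20;                // total work items (example value)
    int threadsPerBlock = 256;      // a common starting point
    // Round up so every element is covered even when n is not a multiple of 256.
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

    myKernel<<<blocksPerGrid, threadsPerBlock>>>(n);
    cudaDeviceSynchronize();        // wait for the grid to finish
    return 0;
}
```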

## Example: Vector Addition

```cuda
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
```

Compile and run:

```bash
nvcc -o vector_add vector_add.cu
./vector_add
```
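
For completeness, a host-side driver for `vectorAdd` might look like the sketch below (error checking omitted for brevity; the array sizes are illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main() {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Allocate and fill host arrays.
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device arrays and copy the inputs over.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // One thread per element, rounded up to whole blocks.
    int block = 256;
    int grid = (n + block - 1) / block;
    vectorAdd<<<grid, block>>>(d_a, d_b, d_c, n);

    // Copy the result back and spot-check one element.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```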

## Memory Hierarchy

| Memory Type | Scope | Lifetime | Speed |
|---|---|---|---|
| Registers | Thread | Thread | Fastest | 
| Shared Memory | Block | Block | Very Fast | 
| Global Memory | Grid | Application | Slow | 
| Constant Memory | Grid | Application | Fast (cached) | 
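
In code, these memory spaces correspond to different declarations. The sketch below is illustrative (the identifiers `coeffs`, `tile`, and `memorySpaces` are made up for this example); it assumes a launch with 256 threads per block:

```cuda
__constant__ float coeffs[16];  // constant memory: read-only from kernels, cached, grid scope

__global__ void memorySpaces(float *in, float *out) {  // in/out point into global memory
    __shared__ float tile[256];       // shared memory: one copy per block, visible block-wide

    float x = in[threadIdx.x];        // x lives in a register (per-thread, fastest)
    tile[threadIdx.x] = x * coeffs[0];
    __syncthreads();                  // wait until the whole block has written its tile

    out[threadIdx.x] = tile[threadIdx.x];
}
```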

## Optimization Techniques

- **Memory Coalescing** - Align memory accesses
  - Use appropriate data types
- **Occupancy Optimization** - Balance resource usage
  - Optimize block sizes
- **Warp Efficiency** - Minimize divergent branching
  - Utilize warp-level primitives (see the sketch after this list)
 
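As one example of a warp-level primitive, `__shfl_down_sync` (available since CUDA 9) lets the 32 threads of a warp exchange register values directly, avoiding shared memory for the final steps of a reduction. A minimal sketch (the helper name `warpReduceSum` is illustrative):

```cuda
// Sum a value across the 32 threads of a warp using register shuffles.
__inline__ __device__ float warpReduceSum(float val) {
    // Each step folds the upper half of the active lanes into the lower half.
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;  // lane 0 ends up holding the warp-wide sum
}
```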

## Example: Matrix Multiplication

```cuda
__global__ void matrixMul(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
```
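
The naive kernel above reads each element of `A` and `B` from global memory `N` times. A common refinement is to stage tiles of both matrices in shared memory so each element is loaded once per tile. The sketch below is illustrative (the name `matrixMulTiled` and the `TILE` size are assumptions) and expects a launch with `blockDim = (TILE, TILE)`:

```cuda
#define TILE 16

__global__ void matrixMulTiled(float *A, float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];  // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];  // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; t++) {
        // Cooperatively load one tile of A and B, zero-padding at the edges.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (col < N && bRow < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // the tile must be fully loaded before anyone reads it

        for (int k = 0; k < TILE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();  // finish reading before the next iteration overwrites the tile
    }

    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}
```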

## Profiling

```bash
nvprof ./your_program  # Profile CUDA applications
```

On recent GPUs and toolkits, `nvprof` has been superseded by Nsight Systems and Nsight Compute.

## Best Practices

- **Memory Transfer** - Minimize host-device transfers
  - Use pinned memory for better bandwidth (see the sketch after this list)
- **Kernel Configuration** - Choose optimal block sizes
  - Consider hardware limitations
- **Algorithm Design** - Design for parallelism
  - Reduce sequential dependencies
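
Pinned (page-locked) host memory keeps the OS from paging a buffer out, which both speeds up transfers and allows them to run asynchronously on a stream. A minimal sketch (buffer names and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    size_t bytes = (1 << 20) * sizeof(float);

    // cudaMallocHost allocates pinned host memory (vs. pageable malloc memory).
    float *h_buf;
    cudaMallocHost(&h_buf, bytes);

    float *d_buf;
    cudaMalloc(&d_buf, bytes);

    // Pinned memory enables truly asynchronous copies on a stream,
    // so transfers can overlap with kernel execution.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);  // pinned memory is freed with cudaFreeHost, not free()
    return 0;
}
```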