Repository to explore the CUDA C programming language and build GPU kernels for 100 days!
This repository contains my personal notes and code written while going through the open-access book (code located here) in my_notes :
Summary : An example of vector addition on the host and device, using a simple kernel to add two vectors together. The kernel is launched with a grid of blocks, where each block contains a number of threads, and each thread is responsible for adding one element of the two vectors.
Concepts used :
- Allocated memory for both the host and device (i.e. GPU).
- Learned how threads, blocks and grids work in CUDA (for more information, check out here).
- Launched a kernel with the `__global__` keyword.
- Timed our execution with `cudaEventCreate`, `cudaEventRecord` and `cudaEventSynchronize`.
- Managed the deallocation of the vectors.
TakeAways : We saw that in such a case the memory transfers from Host to Device (HtD) and Device to Host (DtH) take up the majority of the execution time (about 90%).
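As an illustration, here is a minimal sketch of what such a day-1 program might look like (my own sketch, not the repo's exact code; error checking omitted, block size of 256 assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // one element per thread
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;                          // device (GPU) allocations
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // HtD transfers
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);   // grid of blocks, 256 threads each
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // DtH transfer

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                            // wait until all work is done
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("c[0] = %f, elapsed = %f ms\n", h_c[0], ms);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);           // device deallocation
    free(h_a); free(h_b); free(h_c);                       // host deallocation
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return 0;
}
```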
Summary : An example of a vector addition on the host and device, using a simple kernel to add two vectors together. Day 2 is basically the same as day 1 but uses CUDA streams to reduce the cost of the HtD and DtH data transfers.
Concepts used :
- Used CUDA streams to load data asynchronously.
- Split the data according to the number of streams used.
- Delved into memory transfer between the host and device.
TakeAways : Thanks to CUDA streams, which leverage asynchronous data transfers, the overall execution time was greatly reduced.
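A sketch of the streamed variant (my own illustration; it assumes `n` divides evenly among the streams and uses pinned host memory, which `cudaMemcpyAsync` needs to be truly asynchronous):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20, nStreams = 4, chunk = n / nStreams;
    size_t bytes = n * sizeof(float), chunkBytes = chunk * sizeof(float);

    // Async copies require pinned (page-locked) host memory.
    float *h_a, *h_b, *h_c;
    cudaMallocHost(&h_a, bytes); cudaMallocHost(&h_b, bytes); cudaMallocHost(&h_c, bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, runs the kernel on it, and copies it back;
    // transfers in one stream can overlap with kernels running in another.
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_a + off, h_a + off, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        cudaMemcpyAsync(d_b + off, h_b + off, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        vecAdd<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_a + off, d_b + off, d_c + off, chunk);
        cudaMemcpyAsync(h_c + off, d_c + off, chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    printf("c[0] = %f\n", h_c[0]);

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    cudaFreeHost(h_a); cudaFreeHost(h_b); cudaFreeHost(h_c);
    return 0;
}
```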
Summary : An example of a vector addition on the host and device, using a simple kernel that adds two vectors together while leveraging CUDA streams and the CUDA type `float4`. Day 3 is basically the same as day 2 but with the type `float4`.
Concepts used :
- Learned about the `float4` CUDA type.
- Saw a grid-stride loop over `float4` elements (each thread now has more operations to do).
TakeAways : The CUDA type `float4` uses one memory instruction to load 4 floats. Grid-stride loops reduce overhead and allow for instruction-level parallelism (ILP), i.e. more independent instructions within a thread.
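A possible shape for such a kernel (illustrative only; assumes the vector length is a multiple of 4, otherwise a scalar tail loop would be needed):

```cuda
#include <cuda_runtime.h>

// Grid-stride loop over float4 elements: each thread processes several
// vectorized elements, and each float4 load moves 4 floats in one instruction.
__global__ void vecAdd4(const float4 *a, const float4 *b, float4 *c, int n4) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n4;
         i += gridDim.x * blockDim.x) {              // stride = total threads in the grid
        float4 va = a[i], vb = b[i];
        c[i] = make_float4(va.x + vb.x, va.y + vb.y, va.z + vb.z, va.w + vb.w);
    }
}
// Launch sketch:
//   int n4 = n / 4;
//   vecAdd4<<<numBlocks, 256>>>((const float4*)d_a, (const float4*)d_b, (float4*)d_c, n4);
```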
Summary : An example of a matrix addition on the host and device. Typically, a matrix is stored linearly in global memory with a row-major layout. Using the row-major layout, the matrix addition can be treated as a 1D vector addition.
Concepts used :
- The matrix is stored in global memory with a row-major layout.
TakeAways : With such a grid and block configuration, vector and matrix addition are the same, but you have to be careful with the indices.
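For illustration, a sketch of the flattened kernel (my own sketch, assuming a 1D launch over all `rows * cols` elements):

```cuda
#include <cuda_runtime.h>

// A rows x cols matrix stored row-major is just a 1D array of rows*cols floats:
// element (row, col) lives at flat index row * cols + col.
__global__ void matAdd1D(const float *A, const float *B, float *C, int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // flat 1D index
    if (idx < rows * cols)
        C[idx] = A[idx] + B[idx];                     // identical to vector addition
}
// Launch sketch: matAdd1D<<<(rows*cols + 255) / 256, 256>>>(dA, dB, dC, rows, cols);
```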
Summary : An example of a matrix addition on the host and device. Using a 1D block and 1D grid configuration, where each block is responsible for a chunk of matrix columns.
TakeAways : Used another grid and block configuration for the same operation as in day 04.
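A sketch of one way to realize this configuration (my illustration: one thread per column, each thread looping over the rows of its column):

```cuda
#include <cuda_runtime.h>

// 1D grid / 1D block: each thread owns one column and loops over the rows,
// so a block covers a contiguous chunk of blockDim.x columns.
__global__ void matAddCols(const float *A, const float *B, float *C, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < cols)
        for (int row = 0; row < rows; ++row) {
            int idx = row * cols + col;               // row-major index
            C[idx] = A[idx] + B[idx];
        }
}
// Launch sketch: matAddCols<<<(cols + 255) / 256, 256>>>(dA, dB, dC, rows, cols);
```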
Summary : An example of a matrix addition on the host and device. Using a 2D block and 2D grid configuration, where each block is responsible for a matrix chunk.
TakeAways : Used another grid and block configuration for the same operation as in day 05.
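A sketch of the 2D variant (illustrative; a 16x16 block tile is an assumption of mine):

```cuda
#include <cuda_runtime.h>

// 2D grid / 2D block: each thread maps to one (row, col) element,
// and each block covers a 2D tile (chunk) of the matrix.
__global__ void matAdd2D(const float *A, const float *B, float *C, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {
        int idx = row * cols + col;                   // row-major index
        C[idx] = A[idx] + B[idx];
    }
}
// Launch sketch:
//   dim3 block(16, 16);
//   dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
//   matAdd2D<<<grid, block>>>(dA, dB, dC, rows, cols);
```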
Summary : An example of warp divergence in CUDA and how to deal with it.
Concepts used :
- Issues with control-flow instructions in CUDA.
- Intra-Warp divergence degrades performance.
- Inter-Warp divergence (called Inter-Warp Variation) doesn't degrade performance, because different warps are allowed to make different branch decisions based on their data.
TakeAways : Intra-Warp Divergence will make your code's performance drop.
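For illustration, two toy kernels contrasting the two cases (my own sketch, not the repo's code):

```cuda
#include <cuda_runtime.h>

// Intra-warp divergence: threads of the same warp take different branches,
// so the warp executes both paths serially.
__global__ void divergent(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0) out[tid] = 100.0f;   // even/odd split inside every warp
    else              out[tid] = 200.0f;
}

// Branching on the warp index instead: all 32 threads of a warp agree,
// so neither path is serialized (inter-warp variation, no penalty).
__global__ void uniform(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid / warpSize) % 2 == 0) out[tid] = 100.0f;
    else                           out[tid] = 200.0f;
}
```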
Summary : An example of the two types of latency hiding: arithmetic and memory.
Concepts used :
- Discussed the "Occupancy" metric, which shows how well the hardware is utilized.
- Inspected arithmetic and memory latency hiding.
- Computed the number of warps needed to get latency hiding for a given block and grid configuration.
TakeAways : To hide arithmetic and memory latency, a balance between grid and block size needs to be struck.
- Small thread blocks: too few threads per block causes the hardware limit on the number of warps per SM to be reached before all resources are fully utilized.
- Large thread blocks: too many threads per block leaves fewer per-SM hardware resources available to each thread.
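As an illustration of inspecting this balance, a small sketch using the runtime's occupancy API (`dummy` is a placeholder kernel of my own, and the 256-thread block size is an assumption):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy(float *x) { x[threadIdx.x] = 0.0f; }

int main() {
    int blockSize = 256, numBlocksPerSm = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // How many blocks of this kernel can be resident on one SM at once?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, dummy, blockSize, 0);
    int activeWarps = numBlocksPerSm * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Occupancy: %d/%d warps per SM (%.0f%%)\n",
           activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
    return 0;
}
```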
Summary : An example of thread and global synchronization.
Concepts used :
- Used `__syncthreads()` and `cudaDeviceSynchronize()`.
- Very briefly saw the concept of "sticky errors" in CUDA (meaning the CUDA context is corrupted). Great link here!
TakeAways : CUDA kernel launches are asynchronous. `cudaDeviceSynchronize()` is crucial for host-device coordination and for reliably catching kernel runtime errors, while `__syncthreads()` manages synchronization within a thread block. Beware of sticky errors impacting subsequent operations.
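A minimal sketch combining both primitives (illustrative; `reverseInBlock` is a toy kernel of my own):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void reverseInBlock(float *data) {
    __shared__ float tile[256];
    int t = threadIdx.x;
    tile[t] = data[t];
    __syncthreads();              // every thread in the block must reach this barrier
    data[t] = tile[255 - t];      // safe: all writes to tile are now visible
}

int main() {
    float *d; cudaMalloc(&d, 256 * sizeof(float));
    reverseInBlock<<<1, 256>>>(d);                 // the launch itself is asynchronous
    // Without synchronizing, a kernel runtime error would go unnoticed here.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)                        // a sticky error corrupts the context
        printf("kernel failed: %s\n", cudaGetErrorString(err));
    cudaFree(d);
    return 0;
}
```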
Summary : An example of the parallel reduction operator in CUDA.
Concepts used :
- Discussed and implemented parallel reduction with the neighbored-pair method.
- Warp divergence is present in our implementation.
TakeAways : The parallel reduction operator is a fundamental technique in CUDA for efficiently combining elements of an array (e.g., summing, finding min/max) in parallel.
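A sketch of the neighbored-pair kernel (illustrative; assumes a power-of-two block size, an input length that is a multiple of it, and reduces in place):

```cuda
#include <cuda_runtime.h>

// Neighbored-pair reduction: at stride s, threads whose index is a multiple
// of 2*s add the element s positions away. The active threads thin out in a
// scattered pattern across warps, hence heavy intra-warp divergence.
__global__ void reduceNeighbored(float *g_in, float *g_out) {
    int tid = threadIdx.x;
    float *block = g_in + blockIdx.x * blockDim.x;  // this block's slice of the input
    for (int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            block[tid] += block[tid + s];
        __syncthreads();                            // finish the step before the next
    }
    if (tid == 0) g_out[blockIdx.x] = block[0];     // one partial sum per block
}
// The host then sums the per-block partial results in g_out.
```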
Summary : Second example of the parallel reduction operator in CUDA.
Concepts used :
- Reduced the warp divergence present in our problem.
TakeAways : We changed the parallel reduction implementation by leveraging a different thread indexing to reduce warp divergence.
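A sketch of the less-divergent variant (illustrative; power-of-two block size assumed):

```cuda
#include <cuda_runtime.h>

// Same neighbored-pair pairing, but the working threads are remapped to be
// contiguous (thread tid works on element 2*s*tid), so whole warps retire
// together instead of diverging within a warp.
__global__ void reduceNeighboredLess(float *g_in, float *g_out) {
    int tid = threadIdx.x;
    float *block = g_in + blockIdx.x * blockDim.x;
    for (int s = 1; s < blockDim.x; s *= 2) {
        int idx = 2 * s * tid;                // contiguous tids -> strided elements
        if (idx + s < blockDim.x)
            block[idx] += block[idx + s];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = block[0];
}
```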
Summary : Third example of the parallel reduction operator in CUDA.
Concepts used :
- Implemented the interleaved-pair parallel reduction method.
- Same amount of warp divergence as in `day11`.
TakeAways : This method performs better thanks to improved memory management (coalesced memory accesses). Compared to `day11` we saw a 2x improvement.
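A sketch of the interleaved-pair kernel (illustrative; power-of-two block size assumed):

```cuda
#include <cuda_runtime.h>

// Interleaved-pair reduction: the stride starts at half the block and is
// halved each step, so the active threads always touch contiguous
// addresses (coalesced global memory accesses).
__global__ void reduceInterleaved(float *g_in, float *g_out) {
    int tid = threadIdx.x;
    float *block = g_in + blockIdx.x * blockDim.x;
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            block[tid] += block[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = block[0];
}
```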
Summary : Fourth example of the parallel reduction operator in CUDA.
Concepts used :
- Implemented the interleaved-pair parallel reduction method with unrolling.
TakeAways : Unrolling gives the compiler more room for optimization, which enhances performance.
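One possible unrolled variant (illustrative; here each block folds a second data chunk into its own before reducing, which is one common form of unrolling and gives each thread two independent loads):

```cuda
#include <cuda_runtime.h>

// Unrolling factor 2: each block first adds the neighboring data chunk into
// its own (more ILP per thread), then reduces as in the interleaved kernel.
__global__ void reduceUnrolled2(float *g_in, float *g_out) {
    int tid = threadIdx.x;
    float *block = g_in + blockIdx.x * blockDim.x * 2;  // block covers 2 chunks
    block[tid] += block[tid + blockDim.x];              // unrolled first step
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) block[tid] += block[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = block[0];
}
// Launch with half as many blocks: grid.x = n / (blockDim.x * 2).
```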
Summary : Fifth example of the parallel reduction operator in CUDA.
Concepts used :
- Dealt with warp divergence by using CUDA warp-level primitives in the interleaved-pair parallel reduction method.
TakeAways : Used `__shfl_down_sync`, which allows the data exchange to be performed between registers; this is more efficient than going through shared memory, which requires a load, a store, and an extra register to hold the address.
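A sketch of a shuffle-based reduction (illustrative; assumes the block size is a multiple of 32 and at most 1024):

```cuda
#include <cuda_runtime.h>

// __shfl_down_sync exchanges values directly between the registers of the
// 32 threads of a warp; no shared-memory load/store is needed.
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)       // offsets 16, 8, 4, 2, 1
        val += __shfl_down_sync(0xffffffff, val, offset); // add value from lane + offset
    return val;                                           // lane 0 holds the warp's sum
}

__global__ void reduceShfl(const float *g_in, float *g_out) {
    __shared__ float warpSums[32];                 // one partial sum per warp
    int tid = threadIdx.x;
    float val = warpReduceSum(g_in[blockIdx.x * blockDim.x + tid]);
    if (tid % 32 == 0) warpSums[tid / 32] = val;   // each warp's lane 0 stores its sum
    __syncthreads();
    if (tid < 32) {                                // the first warp reduces the warp sums
        float v = (tid < blockDim.x / 32) ? warpSums[tid] : 0.0f;
        v = warpReduceSum(v);
        if (tid == 0) g_out[blockIdx.x] = v;
    }
}
```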
Summary : Introduction of dynamic parallelism in CUDA.
Concepts used :
- Launched a simple HelloWorld with dynamic parallelism.
TakeAways : CUDA Dynamic Parallelism allows new GPU kernels to be created and synchronized directly on the GPU. The ability to dynamically add parallelism to a GPU application at arbitrary points in a kernel offers exciting new capabilities.
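A minimal sketch of such a hello-world (illustrative; dynamic parallelism requires relocatable device code, e.g. `nvcc -rdc=true -lcudadevrt`, and a GPU with compute capability 3.5 or higher):

```cuda
#include <cstdio>

__global__ void childKernel() {
    printf("Hello from child thread %d\n", threadIdx.x);
}

__global__ void parentKernel() {
    printf("Hello from parent thread %d\n", threadIdx.x);
    if (threadIdx.x == 0)
        childKernel<<<1, 4>>>();   // a kernel launched from the device itself
}

int main() {
    parentKernel<<<1, 4>>>();
    cudaDeviceSynchronize();       // waits for both parent and child to finish
    return 0;
}
```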