A focused HPC benchmarking project comparing matrix multiplication performance across different parallel computing paradigms.
This project benchmarks matrix multiplication using:
- Baseline C - Simple, unoptimized reference implementation
- CUDA - GPU acceleration with NVIDIA CUDA
- MPI - Distributed computing across multiple nodes
- OpenMP - Multi-threaded shared memory parallelism
- Compiler Optimizations - Testing -O1, -O2, -O3, -Ofast flags
- GCC/G++ compiler
- NVIDIA CUDA Toolkit (for CUDA)
- MPI implementation (OpenMPI/MPICH)
- Python 3.8+ with pandas, matplotlib, numpy
./scripts/build.sh./scripts/run_benchmarks.shpython scripts/plot_results.pysrc/
├── baseline/ # Basic C implementation
│ └── matrix_mult.c
├── cuda/ # CUDA implementation
│ └── matrix_multiplication.cu
├── mpi/ # MPI implementation
│ └── matrix_mult_mpi.c
├── openmp/ # OpenMP implementation
│ └── matrix_mult_omp.c
└── optimized/ # Compiler optimization tests
└── matrix_mult.c
scripts/
├── build.sh # Build all implementations
├── run_benchmarks.sh # Run all benchmarks
└── plot_results.py # Generate performance plots
results/
├── raw/ # CSV output from benchmarks
└── plots/ # Generated performance graphs
Simple triple-loop matrix multiplication with no optimizations. Used as reference for speedup calculations.
- Naive global memory kernel
- Tiled shared memory kernel (16x16, 32x32 blocks)
- Performance metrics: kernel time, transfer time, GFLOPS
- Distributed row-based decomposition
- Each process handles a subset of rows
- Collective communication for result gathering
- Parallel loops with different scheduling strategies
- Thread scaling tests (1, 2, 4, 8, 16 threads)
- Shared memory optimization
Tests same baseline code with:
-O0- No optimization-O1- Basic optimization-O2- Moderate optimization-O3- Aggressive optimization-Ofast- Maximum optimization (may affect accuracy)
- Small: 512×512, 1024×1024
- Medium: 2048×2048, 4096×4096
- Large: 8192×8192 (if memory permits)
All implementations output CSV with:
timestamp,implementation,matrix_size,time_ms,gflops,threads/processes,node
The Python plotting script generates:
- Execution time comparison across implementations
- GFLOPS comparison
- Speedup relative to baseline
- Scaling efficiency (for MPI and OpenMP)
# Baseline
gcc -O0 src/baseline/matrix_mult.c -o bin/baseline
# CUDA
cd src/cuda && make
# OpenMP
gcc -O3 -fopenmp src/openmp/matrix_mult_omp.c -o bin/openmp
# MPI
mpicc -O3 src/mpi/matrix_mult_mpi.c -o bin/mpi# Baseline
./bin/baseline 1024 > results/raw/baseline.csv
# CUDA
./src/cuda/matrix_multiplication > results/raw/cuda.csv
# OpenMP (8 threads)
OMP_NUM_THREADS=8 ./bin/openmp 1024 > results/raw/openmp_8.csv
# MPI (4 processes)
mpirun -np 4 ./bin/mpi 1024 > results/raw/mpi_4.csv- Execution Time: Wall-clock time in milliseconds
- GFLOPS: Billion floating-point operations per second
- Formula: (2 × N³) / (time_seconds × 10⁹)
- Speedup: Time(baseline) / Time(optimized)
- Efficiency: Speedup / Number of processors
When adding new optimizations:
- Follow the existing code structure
- Output results in standard CSV format
- Include verification against baseline
- Update this README
See LICENSE file for details.