matmul

Fast matrix multiplication in Rust, built from scratch. Achieves up to 62% of NumPy/OpenBLAS performance through SIMD, cache blocking, and adaptive multi-threading.

Performance

Results (1024×1024 matrices)

Platform	CPU	Best Single-Thread	Best Multi-Thread	vs Naive
macOS	i5-8257U @ 1.4 GHz	22.37 GFLOPS	58.82 GFLOPS	126×
WSL2	i7-1185G7 @ 4.8 GHz	31.74 GFLOPS	59.11 GFLOPS	100×

vs NumPy (OpenBLAS)

Matrix Size	This Library	NumPy	Ratio
512×512	49 GFLOPS	79 GFLOPS	62%
1024×1024	55 GFLOPS	112 GFLOPS	49%

NumPy/OpenBLAS represents 20+ years of hand-tuned assembly. This implementation demonstrates the same techniques built from scratch in Rust.

Usage

use matmul::{multiply, multiply_parallel};

// Single-threaded (auto-selects best kernel for your CPU)
let a = vec![1.0f64; 1024 * 1024];
let b = vec![1.0f64; 1024 * 1024];
let mut c = vec![0.0f64; 1024 * 1024];

multiply(&a, &b, &mut c, 1024, 1024, 1024);

// Multi-threaded
multiply_parallel(&a, &b, &mut c, 1024, 1024, 1024, 4);

What's Inside

SIMD Kernels:

4×4 AVX2 (28× speedup)
12×4 AVX2 (33× speedup)
8×8 AVX-512 (39× speedup)

Optimizations:

Cache blocking tuned for L1/L2
Matrix packing for sequential access
FMA (fused multiply-add) instructions
Adaptive threading (scales down for small matrices)

Project Structure

src/
├── kernels/          # SIMD microkernels (4×4, 12×4, 8×8)
├── blocked/          # Cache-blocked GEMM implementations
├── threaded/         # Multi-threaded wrappers
├── matrix/           # Naive implementations, transpose
└── lib.rs            # Public API

Building

cargo build --release
cargo test
cargo bench

Requirements

Rust 1.70+
AVX2 support (Intel Haswell+ / AMD Excavator+)
AVX-512 for 8×8 kernel (Intel Skylake-X+ / 11th gen+)

Key Insight

The Mac at 1.4 GHz achieves nearly identical performance to WSL2 at 4.8 GHz:

macOS: 42.0 GFLOPS per GHz
WSL2: 11.5 GFLOPS per GHz

3.6× efficiency difference due to thermal management, native OS vs virtualization, and memory subsystem behavior. Understanding why performance differs matters as much as the raw numbers.

References

Optimization Journey - Deep dive into each optimization step with performance analysis
Goto & van de Geijn - Anatomy of High-Performance Matrix Multiplication
Intel Intrinsics Guide

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
docs		docs
python		python
src		src
tests		tests
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
justfile		justfile

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

matmul

Performance

Results (1024×1024 matrices)

vs NumPy (OpenBLAS)

Usage

What's Inside

Project Structure

Building

Requirements

Key Insight

References

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

matmul

Performance

Results (1024×1024 matrices)

vs NumPy (OpenBLAS)

Usage

What's Inside

Project Structure

Building

Requirements

Key Insight

References

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages