Fast matrix multiplication in Rust, built from scratch. Achieves up to 62% of NumPy/OpenBLAS performance through SIMD, cache blocking, and adaptive multi-threading.
Benchmarks:

| Platform | CPU | Best Single-Thread | Best Multi-Thread | vs Naive |
|---|---|---|---|---|
| macOS | i5-8257U @ 1.4 GHz | 22.37 GFLOPS | 58.82 GFLOPS | 126× |
| WSL2 | i7-1185G7 @ 4.8 GHz | 31.74 GFLOPS | 59.11 GFLOPS | 100× |
Compared with NumPy/OpenBLAS:

| Matrix Size | This Library | NumPy | Ratio |
|---|---|---|---|
| 512×512 | 49 GFLOPS | 79 GFLOPS | 62% |
| 1024×1024 | 55 GFLOPS | 112 GFLOPS | 49% |
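The GFLOPS figures follow from the standard 2·n³ flop count for an n×n multiply; a minimal helper, assuming that conventional accounting:

```rust
/// GFLOPS for an n x n matrix multiply, using the conventional 2*n^3 flop count.
fn gflops(n: usize, seconds: f64) -> f64 {
    2.0 * (n as f64).powi(3) / seconds / 1e9
}
```

For example, 55 GFLOPS at 1024×1024 means one multiply takes roughly 2·1024³ / 55×10⁹ ≈ 39 ms.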
NumPy/OpenBLAS represents 20+ years of hand-tuned assembly. This implementation demonstrates the same techniques built from scratch in Rust.
```rust
use matmul::{multiply, multiply_parallel};

// Single-threaded (auto-selects the best kernel for your CPU)
let a = vec![1.0f64; 1024 * 1024];
let b = vec![1.0f64; 1024 * 1024];
let mut c = vec![0.0f64; 1024 * 1024];
multiply(&a, &b, &mut c, 1024, 1024, 1024);

// Multi-threaded with an explicit thread count
multiply_parallel(&a, &b, &mut c, 1024, 1024, 1024, 4);
```

SIMD Kernels (sketched in code after the list):
- 4×4 AVX2 (28× speedup)
- 12×4 AVX2 (33× speedup)
- 8×8 AVX-512 (39× speedup)
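For illustration, here is a minimal sketch of the general shape of a 4×4 AVX2/FMA microkernel (the function name, packing layout, and strides are assumptions, not the library's actual code):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Sketch of a 4x4 f64 microkernel: C[4x4] += A[4xK] * B[Kx4].
/// A and B are row-major with strides `lda`/`ldb`; the caller must verify
/// AVX2+FMA support (e.g. via is_x86_feature_detected!) before calling.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn kernel_4x4(k: usize, a: *const f64, lda: usize,
                     b: *const f64, ldb: usize, c: *mut f64, ldc: usize) {
    // Each YMM register holds one 4-wide row of the C tile for the whole loop.
    let mut c0 = _mm256_loadu_pd(c);
    let mut c1 = _mm256_loadu_pd(c.add(ldc));
    let mut c2 = _mm256_loadu_pd(c.add(2 * ldc));
    let mut c3 = _mm256_loadu_pd(c.add(3 * ldc));
    for p in 0..k {
        let bp = _mm256_loadu_pd(b.add(p * ldb)); // row p of B
        // Broadcast A[i][p] and fuse multiply-add into C row i.
        c0 = _mm256_fmadd_pd(_mm256_set1_pd(*a.add(p)), bp, c0);
        c1 = _mm256_fmadd_pd(_mm256_set1_pd(*a.add(lda + p)), bp, c1);
        c2 = _mm256_fmadd_pd(_mm256_set1_pd(*a.add(2 * lda + p)), bp, c2);
        c3 = _mm256_fmadd_pd(_mm256_set1_pd(*a.add(3 * lda + p)), bp, c3);
    }
    _mm256_storeu_pd(c, c0);
    _mm256_storeu_pd(c.add(ldc), c1);
    _mm256_storeu_pd(c.add(2 * ldc), c2);
    _mm256_storeu_pd(c.add(3 * ldc), c3);
}
```

Keeping the C tile in registers for the entire k loop replaces memory traffic with back-to-back FMAs, which is where speedups of this magnitude come from.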
Optimizations (loop-structure sketch after the list):
- Cache blocking tuned for L1/L2
- Matrix packing for sequential access
- FMA (fused multiply-add) instructions
- Adaptive threading (scales down for small matrices)
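How the blocking and packing fit together, as a condensed scalar sketch (block sizes and the scalar inner loops are placeholders; in the real code the inner tile work is done by the SIMD microkernels):

```rust
// Illustrative block sizes -- assumptions, not the library's tuned values.
const MC: usize = 256;  // rows of A per block (sized for L2)
const KC: usize = 128;  // shared dimension per block (sized for L1)
const NC: usize = 2048; // columns of B per block

/// Blocked GEMM loop nest: C (m x n) += A (m x k) * B (k x n), all row-major.
fn blocked_gemm(a: &[f64], b: &[f64], c: &mut [f64], m: usize, n: usize, k: usize) {
    for jc in (0..n).step_by(NC) {
        let nc = NC.min(n - jc);
        for pc in (0..k).step_by(KC) {
            let kc = KC.min(k - pc);
            // Pack B[pc.., jc..] into a contiguous buffer here for sequential loads.
            for ic in (0..m).step_by(MC) {
                let mc = MC.min(m - ic);
                // Pack A[ic.., pc..] here so it stays hot in L2.
                for i in ic..ic + mc {
                    for j in jc..jc + nc {
                        let mut acc = c[i * n + j];
                        for p in pc..pc + kc {
                            acc += a[i * k + p] * b[p * n + j];
                        }
                        c[i * n + j] = acc;
                    }
                }
            }
        }
    }
}
```

And a hypothetical form of the adaptive-threading rule: cap the thread count so every thread keeps enough work to amortize spawn and synchronization overhead (the threshold below is a made-up placeholder):

```rust
/// Hypothetical heuristic; the actual policy is not spelled out in this README.
fn adaptive_thread_count(m: usize, n: usize, k: usize, max_threads: usize) -> usize {
    let flops = 2 * m * n * k; // standard GEMM flop count
    const MIN_FLOPS_PER_THREAD: usize = 10_000_000; // assumed amortization floor
    (flops / MIN_FLOPS_PER_THREAD).clamp(1, max_threads)
}
```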
```
src/
├── kernels/   # SIMD microkernels (4×4, 12×4, 8×8)
├── blocked/   # Cache-blocked GEMM implementations
├── threaded/  # Multi-threaded wrappers
├── matrix/    # Naive implementations, transpose
└── lib.rs     # Public API
```
```sh
cargo build --release
cargo test
cargo bench
```

Requirements:
- Rust 1.70+
- AVX2 support (Intel Haswell+ / AMD Excavator+)
- AVX-512 for the 8×8 kernel (Intel Skylake-X+ / 11th gen+); see the dispatch sketch below
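Runtime kernel selection can use std's CPU feature detection; a hedged sketch (the function and the dispatch shown are illustrative, not the library's actual API):

```rust
/// Hypothetical dispatch: pick the widest kernel the CPU supports.
fn best_kernel_name() -> &'static str {
    if is_x86_feature_detected!("avx512f") {
        "8x8 AVX-512"
    } else if is_x86_feature_detected!("avx2") {
        "12x4 AVX2" // the 4x4 AVX2 kernel is the narrower fallback
    } else {
        "scalar fallback"
    }
}
```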
The Mac at 1.4 GHz achieves nearly identical multi-threaded throughput to WSL2 at 4.8 GHz:
- macOS: 42.0 GFLOPS per GHz (58.82 / 1.4)
- WSL2: 12.3 GFLOPS per GHz (59.11 / 4.8)
That 3.4× per-clock efficiency gap comes down to thermal management, native OS versus virtualization, and memory subsystem behavior. Understanding why performance differs matters as much as the raw numbers.
- Optimization Journey - Deep dive into each optimization step with performance analysis
- Goto & van de Geijn - Anatomy of High-Performance Matrix Multiplication
- Intel Intrinsics Guide
License: MIT