Skip to content

AaronMeslin/krypton

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Krypton: High-Performance Tensor Engine with Autograd (Rust)

  • ~80× speedup from naive to optimized matmul (512²)
  • ~50–70 GFLOPS SIMD+parallel (peak ~58 at 1024², ~128 at 512²)
  • ~20–25% of OpenBLAS at 1024²; peak ~50%+ at 512²
  • SIMD (NEON/AVX2) + cache tiling (L1/L2) + register blocking + B packing + Rayon parallelism

Overview

Krypton is a tensor engine with define-by-run autograd, built from scratch in Rust. No ndarray, no tch, no burn — just Vec<f32> and stride math. What sets it apart: a hand-optimized matmul kernel using L1/L2 blocking, SIMD, register-blocked micro-kernels, and B-matrix packing. Systems-level optimization directly translates to faster ML training.


Performance Results

Kernel 512² 1024² 2048²
naive 2.9 2.2
tiled 13.0 12.8 11.5
simd 22 19.5 16.0
simd+par 49 58 54
openblas 235 275 366

Key Takeaways:

  • SIMD+par is ~18× faster than naive at 1024²
  • ~22% of OpenBLAS at 1024²; ~56% at 512² (optimized runs)
  • Peak efficiency at 512–1024²; OpenBLAS scales better at 2048²
  • Larger matrices favor OpenBLAS due to multi-level blocking and memory bandwidth

Optimization Journey

Stage Technique Impact
1. Naive Triple loop, no optimization Baseline ~2–3 GFLOPS
2. Tiled L1 blocking (64×64, ~48KB) ~5× — tiles fit in L1d
3. SIMD NEON/AVX2, 4×4 or 4×8 micro-kernel ~8× — vectorized FMAs
4. Parallel Rayon over 128-row chunks ~3× — multi-core scaling

~80× speedup from full optimization stack (multiplicative gains).


Key Techniques

  • Cache tiling (L1/L2): 64×64 tiles (~48KB) fit in L1d; 256×256 outer blocks use L2. Keeps working set local → fewer cache misses.

  • SIMD: NEON 4-wide (aarch64) or AVX2 8-wide (x86_64). Processes 4–8 floats per FMA → higher arithmetic intensity.

  • Register blocking: 4×8 (AVX2) or 4×4 (NEON) micro-kernel. Accumulators stay in registers → 1 load/store per 32 FMAs instead of 32.

  • B packing: b_packed[p*n_tile+j] = B[k0+p,j0+j]. Rows become contiguous → full-width SIMD loads, 2–3× faster than strided.

  • Parallelism: Rayon over 128-row chunks. Each thread owns its output rows → no false sharing, good load balance.


Engineering Insights

  • Multiplicative speedups: Tiling × SIMD × parallel = ~80×. Each optimization targets a different bottleneck.

  • Memory layout dominates: Strided B[p,j] kills performance; packing to contiguous rows is often the biggest single win.

  • Sublinear scaling: 8 threads → 3.6×, not 8×. Memory bandwidth saturates before compute.

  • Arithmetic intensity: Register blocking increases FMAs per byte loaded; moves the kernel toward compute-bound.

  • L1 is tiny: 64×64×3×4B ≈ 48KB. Tile size is a hard constraint; overshoot and you thrash.

  • OpenBLAS gap: Hand-tuned assembly, multi-level blocking, AVX-512, panel packing. 40–60% is expected for intrinsics-only.

  • Peak at mid-size: 512–1024² sweet spot. Smaller = overhead; larger = bandwidth-limited.


Thread Scaling

Threads GFLOPS (1024²) Speedup vs 1
1 12.6 1.0×
2 23.0 1.8×
4 38.6 3.1×
8 45.1 3.6×

Interpretation: Good scaling to 4 cores (~3×); 8 cores show diminishing returns. Memory bandwidth bottleneck — cores outpace DRAM. Thread pinning (pinned feature on Linux) can reduce cache thrashing.


Why Not 100% of OpenBLAS?

  • Assembly vs intrinsics: OpenBLAS uses hand-tuned assembly; we use compiler-generated code from intrinsics. Assembly wins 10–20%.

  • Multi-level blocking: OpenBLAS tunes L1/L2/L3 per CPU; we use fixed 64/256.

  • AVX-512 vs AVX2: On x86, OpenBLAS may use AVX-512; we use AVX2. Wider vectors = more throughput.

  • Packing strategy: OpenBLAS packs entire panels; we pack per tile. More packing overhead on our side.

40–60% of OpenBLAS is a realistic target for a custom implementation without assembly.


From Systems → ML

  • Matmul is the core of neural nets: Linear layers, attention, and most compute in training are dense matmuls.

  • Autograd uses our kernels: Var::matmul calls the optimized path; gradients flow through the same fast matmul.

  • Faster kernels = faster training: Each backward pass does matmuls; 80× faster matmul directly speeds up epochs.

  • MNIST as validation target: End goal is training a 3-layer MLP on MNIST; matmul performance is the main bottleneck.

  • Define-by-run: Graph built as ops execute; backward traverses in reverse. No static graph, minimal overhead.


How to Run

cargo build
cargo test      # 86+ tests
cargo clippy   # 0 warnings

MNIST training (proves matmul → training speed):

# Fast only (SIMD+Parallel, ~30–60s for 3 epochs)
cargo run --release --bin mnist_train --features mnist,parallel -- --fast

# Full comparison (all 4 backends, naive is slow)
cargo run --release --bin mnist_train --features mnist,parallel

Downloads MNIST on first run. Trains 2-layer MLP (784→256→10) for 3 epochs.

Benchmarks (parallel kernels):

cargo bench --bench matmul_bench --features parallel

Thread pinning (Linux):

cargo bench --bench matmul_bench --features parallel,pinned

Repo Structure

src/
├── lib.rs           re-exports
├── tensor.rs        Tensor struct, constructors, indexing
├── shape.rs         strides, broadcasting, reshape
├── ops.rs           add, mul, transpose, matmul
├── matmul.rs        naive → tiled → SIMD → parallel kernels
├── autograd.rs      define-by-run autograd
├── display.rs       Display/Debug
└── bin/
    ├── mnist_train.rs   MNIST training with backend switching
    └── matmul_gflops.rs Quick matmul GFLOPS (for benchmark script)

Docs: Architecture · Benchmarks · API


MNIST Training Results

Training time scales directly with matmul performance. Each backend trains the same 2-layer MLP (784→256→10) for 3 epochs on 50k samples.

Backend Epoch Time (avg) Total Time Accuracy
Naive ~15–30 min ~45–90 min ~0.90
Tiled ~2–4 min ~6–12 min ~0.90
SIMD ~30–60 s ~1.5–3 min ~0.90
SIMD+Parallel ~10–20 s ~30–60 s ~0.90

Insights:

  • SIMD+parallel dramatically reduces epoch time — often 50–100× faster than naive.
  • Accuracy remains similar across backends — same math, same gradients; only the matmul kernel changes.
  • Matmul dominates training cost — forward and backward are almost entirely matmuls.

Why does matmul dominate training?

  • Forward: Y = X @ W (linear layer)
  • Backward: dW = X^T @ dY, dX = dY @ W^T
  • Each step = multiple matmuls → kernel performance directly determines epoch time.

Krypton vs PyTorch

Head-to-head benchmark: matmul GFLOPS and MNIST training time.

pip install -r scripts/requirements-benchmark.txt  # torch, torchvision, numpy
python scripts/benchmark_vs_pytorch.py

Compares:

  • Krypton (SIMD+parallel) — our hand-tuned kernel
  • PyTorch (CPU) — MKL/OpenBLAS under the hood
  • NumPy (OpenBLAS) — raw BLAS reference for matmul

Output: Matmul GFLOPS (1024²), MNIST epoch time, total time, accuracy.

Where Krypton wins:

  • Kernel-level control — switch naive/tiled/SIMD/par at runtime; PyTorch hides the kernel.
  • Transparency — no framework overhead; you know exactly what runs.
  • Tunable — optimize for your workload (tile sizes, packing, prefetch).

Where PyTorch wins:

  • Mature BLAS integration (MKL, OpenBLAS)
  • More kernel optimizations out of the box
  • Broader ecosystem

Even if Krypton is slower in some areas, the benchmark shows why — and where kernel control pays off.


Performance Graphs

Requires matplotlib (pip install matplotlib). Run from repo root:

python scripts/plot_performance.py

Produces performance_bar.png (kernels at 1024²) and scaling_plot.png (GFLOPS vs size). Add to README:

![Kernel comparison (1024×1024)](performance_bar.png)
![GFLOPS vs matrix size](scaling_plot.png)

Future Work

  • Micro-kernel tuning: 6×16, 8×8 blocks; auto-tune per CPU for 10–20% gain.
  • A packing: Symmetric optimization for A tiles; currently only B is packed.
  • Autograd expansion: Gradient checkpointing, mixed precision, more ops.
  • ML workloads: MNIST MLP training as end-to-end validation; extend to larger models.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors