This repository collects my solutions and writeups for the NVIDIA SOL-ExecBench benchmark.
- Build a structured set of SOL-ExecBench solutions.
- Provide reproducible implementations with clear code comments.
- Document transferable GPU kernel optimization patterns.
Each problem writeup will typically include:
- Problem understanding and constraints
- Baseline implementation
- Optimized versions (e.g., memory access, parallel strategy, fusion)
- Performance comparison and key takeaways
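To illustrate the "baseline, then optimize, then compare" workflow at a toy scale, here is a minimal, dependency-free timing harness sketch. The `baseline` and `optimized` functions are placeholders invented for this example (not kernels from this repo); the fused list comprehension stands in for kernel fusion:

```python
import time

def bench(fn, *args, iters=100):
    """Time `fn` over several iterations; return mean seconds per call."""
    fn(*args)  # warm-up so one-time setup does not skew the measurement
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

def baseline(xs):
    # Two separate passes over the data (analogue of two kernel launches).
    out = [x * 2 for x in xs]
    return [y + 1 for y in out]

def optimized(xs):
    # Fuses both passes into one, mirroring kernel fusion in miniature.
    return [x * 2 + 1 for x in xs]

data = list(range(10_000))
assert baseline(data) == optimized(data)  # verify correctness before speed
t_base = bench(baseline, data)
t_opt = bench(optimized, data)
print(f"speedup: {t_base / t_opt:.2f}x")
```

Real writeups in this repo compare GPU kernels, but the shape of the comparison is the same: establish correctness against the reference first, then measure.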
This repository is a work in progress and will be updated continuously.
- 001_attn_bwd: Backward pass for attention softmax, dropout, and value matmul.
- 002_vae_conv2d: Fused VAE residual block with Conv3x3, GroupNorm, SiLU, and residual addition.
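For context on the softmax piece of 001_attn_bwd: given probabilities p = softmax(s) and an upstream gradient g, the gradient with respect to the logits is p_i · (g_i − Σ_j g_j p_j). A dependency-free sketch of this rule (not the repo's Triton implementation), verified against finite differences:

```python
import math

def softmax(s):
    m = max(s)
    e = [math.exp(v - m) for v in s]  # subtract max for numerical stability
    z = sum(e)
    return [v / z for v in e]

def softmax_backward(p, g):
    """Gradient of the loss wrt logits, given p = softmax(s) and upstream g."""
    dot = sum(gi * pi for gi, pi in zip(g, p))
    return [pi * (gi - dot) for pi, gi in zip(p, g)]

# Finite-difference check of the analytic gradient on arbitrary inputs.
s = [0.5, -1.0, 2.0]
g = [1.0, 0.0, -0.5]
analytic = softmax_backward(softmax(s), g)
eps = 1e-6
for i in range(len(s)):
    s_hi, s_lo = s[:], s[:]
    s_hi[i] += eps
    s_lo[i] -= eps
    numeric = sum(gj * (a - b) for gj, a, b in
                  zip(g, softmax(s_hi), softmax(s_lo))) / (2 * eps)
    assert abs(numeric - analytic[i]) < 1e-5
```

The actual kernel additionally threads the dropout mask and the value matmul through this backward computation, but the softmax Jacobian-vector product above is the core identity it relies on.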
The `.claude/skills/` directory contains model-invoked skills for this project:
| Skill | Triggers when… |
|---|---|
| `new-kernel` | Creating a new kernel implementation from a torch reference |
| `b200-tuning` | Optimizing for B200/Blackwell performance (tiles, TMA, WGMMA, pipeline) |
| `kernel-testing` | Running `test.py`, diagnosing failures, or using Triton IR debug flags |
- Benchmark: https://research.nvidia.com/benchmarks/sol-execbench
- Original repository: https://github.com/nvidia/sol-execbench