# Optimized GPU compute kernels for AMD MI300X

Focus: FP8 GEMM, MoE Inference, MLA Decode (KV cache + MHA)

This work was done as part of team Shinsato / Akash (Blog).
Check out Akash's GitHub solution as well: Akash-Github
## FP8 GEMM
| Metric | Description |
|---|---|
| Latency | 120 µs |
| Details | FP8 blockwise matmul with MFMA; double-buffered shared memory pipeline; vectorized tile access for peak throughput (sketched below) |
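
A minimal sketch of the double-buffered shared-memory pipeline, with plain fp32 FMA standing in for the FP8 MFMA inner loop; the kernel name `gemm_double_buffered`, the 32-wide tile size, and the row-major layouts are illustrative assumptions, not the shipped kernel.

```cpp
// Sketch only: double-buffered shared-memory GEMM tile loop in HIP.
// Plain fp32 FMA stands in for the FP8 MFMA path; tile sizes are illustrative.
#include <hip/hip_runtime.h>

constexpr int TILE = 32;  // square tile edge, one thread per output element

__global__ void gemm_double_buffered(const float* A, const float* B, float* C,
                                     int M, int N, int K) {
    // Two shared-memory buffers per operand so the next tile can be staged
    // while the current one is being consumed.
    __shared__ float sA[2][TILE][TILE];
    __shared__ float sB[2][TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    int numTiles = (K + TILE - 1) / TILE;
    int buf = 0;

    // Prologue: stage the first K-tile into buffer 0.
    sA[buf][threadIdx.y][threadIdx.x] =
        (row < M && (int)threadIdx.x < K) ? A[row * K + threadIdx.x] : 0.0f;
    sB[buf][threadIdx.y][threadIdx.x] =
        (col < N && (int)threadIdx.y < K) ? B[threadIdx.y * N + col] : 0.0f;
    __syncthreads();

    for (int t = 0; t < numTiles; ++t) {
        int next = buf ^ 1;
        // Preload the next K-tile into the other buffer (overlaps with compute).
        if (t + 1 < numTiles) {
            int k = (t + 1) * TILE;
            sA[next][threadIdx.y][threadIdx.x] =
                (row < M && k + (int)threadIdx.x < K) ? A[row * K + k + threadIdx.x] : 0.0f;
            sB[next][threadIdx.y][threadIdx.x] =
                (col < N && k + (int)threadIdx.y < K) ? B[(k + threadIdx.y) * N + col] : 0.0f;
        }
        // Consume the current buffer; in the real kernel this is the MFMA inner loop.
        for (int kk = 0; kk < TILE; ++kk)
            acc = fmaf(sA[buf][threadIdx.y][kk], sB[buf][kk][threadIdx.x], acc);
        __syncthreads();  // make the preloaded tile visible before swapping buffers
        buf = next;
    }

    if (row < M && col < N) C[row * N + col] = acc;
}
```

Launch with `dim3((N + TILE - 1) / TILE, (M + TILE - 1) / TILE)` blocks of `dim3(TILE, TILE)` threads. The two-buffer ping-pong is the point of the sketch: the next K-tile is staged while the current one is consumed, so global-memory latency hides behind the compute.
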
## MoE Inference
| Metric | Description |
|---|---|
| Latency | 8.75 ms |
| Details | Fused routing, matmul, and activation; expert-parallel batching; shared workspace reuse (sketched below) |
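
A minimal sketch of the fused routing + matmul + activation step for top-1 routing, one token per block; the dimensions, weight layouts, SiLU gate, and the name `moe_fused_top1` are illustrative assumptions rather than the shipped kernel.

```cpp
// Sketch only: a fused top-1 MoE step, one token per block.
// Routing (argmax over router logits), the expert matmul, and the activation
// stay in a single kernel so the chosen expert id and the token activations
// never round-trip through global memory.
#include <hip/hip_runtime.h>
#include <cmath>

__global__ void moe_fused_top1(const float* x,         // [tokens, d_model]
                               const float* router_w,  // [n_experts, d_model]
                               const float* expert_w,  // [n_experts, d_ff, d_model]
                               float* y,               // [tokens, d_ff]
                               int d_model, int d_ff, int n_experts) {
    int token = blockIdx.x;
    const float* xt = x + (size_t)token * d_model;

    // --- Routing: thread 0 scores every expert and keeps the argmax. ---
    // Serial here for clarity; the real kernel parallelizes and batches this.
    __shared__ int best_expert;
    if (threadIdx.x == 0) {
        float best = -1e30f;
        best_expert = 0;
        for (int e = 0; e < n_experts; ++e) {
            float score = 0.0f;
            for (int i = 0; i < d_model; ++i)
                score += router_w[(size_t)e * d_model + i] * xt[i];
            if (score > best) { best = score; best_expert = e; }
        }
    }
    __syncthreads();

    // --- Expert matmul + activation, fused in the same launch. ---
    const float* w = expert_w + (size_t)best_expert * d_ff * d_model;
    for (int r = threadIdx.x; r < d_ff; r += blockDim.x) {
        float acc = 0.0f;
        for (int i = 0; i < d_model; ++i)
            acc = fmaf(w[(size_t)r * d_model + i], xt[i], acc);
        y[(size_t)token * d_ff + r] = acc / (1.0f + expf(-acc));  // SiLU activation
    }
}
```

Launch with one block per token, e.g. `moe_fused_top1<<<tokens, 256>>>(x, router_w, expert_w, y, d_model, d_ff, n_experts)`. The expert-parallel batching and shared workspace reuse listed in the table sit on top of this basic fused step.
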
## MLA Decode
| Metric | Description |
|---|---|
| Latency | 2.68 ms |
| Details | KV cache + Multi-Head Attention; fused FP8 projection; vectorized access pattern and shared memory (sketched below) |
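
A minimal sketch of a decode-step attention pass over the KV cache, one head per block, with plain fp32 arithmetic in place of the fused FP8 projection and vectorized loads; `mla_decode_attn` and the tensor layouts are illustrative assumptions.

```cpp
// Sketch only: single-token (decode-step) attention over a cached K/V sequence,
// one head per block. A plain two-pass softmax stands in for the fused FP8
// projection, vectorized access, and online softmax of the real kernel.
#include <hip/hip_runtime.h>
#include <cmath>

__global__ void mla_decode_attn(const float* q,        // [heads, d_head], query for the new token
                                const float* k_cache,  // [heads, seq_len, d_head]
                                const float* v_cache,  // [heads, seq_len, d_head]
                                float* out,            // [heads, d_head]
                                int seq_len, int d_head) {
    int head = blockIdx.x;
    const float* qh = q + (size_t)head * d_head;
    const float* kh = k_cache + (size_t)head * seq_len * d_head;
    const float* vh = v_cache + (size_t)head * seq_len * d_head;

    extern __shared__ float scores[];   // [seq_len] attention logits for this head
    float scale = rsqrtf((float)d_head);

    // Pass 1: q.k scores against every cached position.
    for (int t = threadIdx.x; t < seq_len; t += blockDim.x) {
        float s = 0.0f;
        for (int i = 0; i < d_head; ++i)
            s = fmaf(qh[i], kh[(size_t)t * d_head + i], s);
        scores[t] = s * scale;
    }
    __syncthreads();

    // Softmax normalization (serial max/sum on thread 0 for brevity;
    // the real kernel uses a parallel reduction).
    __shared__ float s_max, s_sum;
    if (threadIdx.x == 0) {
        float m = -1e30f;
        for (int t = 0; t < seq_len; ++t) m = fmaxf(m, scores[t]);
        float sum = 0.0f;
        for (int t = 0; t < seq_len; ++t) sum += expf(scores[t] - m);
        s_max = m; s_sum = sum;
    }
    __syncthreads();

    // Pass 2: weighted sum of cached V rows.
    for (int i = threadIdx.x; i < d_head; i += blockDim.x) {
        float acc = 0.0f;
        for (int t = 0; t < seq_len; ++t)
            acc += expf(scores[t] - s_max) / s_sum * vh[(size_t)t * d_head + i];
        out[(size_t)head * d_head + i] = acc;
    }
}
```

Launch with one block per head and `seq_len * sizeof(float)` bytes of dynamic shared memory for the score buffer.
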
Platform: AMD MI300X — HIP / ROCm — FP8 (E4M3)