
GPU Kernel Performance Benchmarks

Optimized GPU compute kernels for AMD MI300X
Focus: FP8 GEMM, MoE Inference, MLA Decode (KV cache + MHA)

This work was part of team Shinsato / AkashBlog.

Check out Akash's GitHub solution as well: Akash-Github


FP8 GEMM

| Metric  | Description |
|---------|-------------|
| Latency | 120 µs |
| Details | FP8 blockwise matmul with MFMA; double-buffered shared-memory pipeline; vectorized tile access for peak throughput |
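
The double-buffered pipeline prefetches the next K-tile into one shared-memory buffer while the math loop consumes the other, hiding global-memory latency behind compute. Below is a minimal HIP sketch of that structure. It uses plain `float` tiles and a scalar inner product purely for readability; the actual kernel operates on FP8 (E4M3) blocks and accumulates with MFMA instructions, and the tile size, kernel name, and layout here are illustrative assumptions rather than the repo's real parameters.

```cpp
// Minimal double-buffered tiled GEMM sketch (HIP). Illustrative only:
// the real kernel uses FP8 inputs, MFMA accumulation, and vectorized loads.
#include <hip/hip_runtime.h>

constexpr int TILE = 32;  // assumed tile edge, not the repo's actual value

__global__ void gemm_double_buffered(const float* __restrict__ A,
                                     const float* __restrict__ B,
                                     float* __restrict__ C,
                                     int M, int N, int K) {
    // Two shared-memory buffers per operand: while buffer `cur` feeds the
    // math loop, buffer `nxt` is filled with the next K-tile.
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    auto load_tile = [&](int buf, int ktile) {
        int a_col = ktile * TILE + threadIdx.x;
        int b_row = ktile * TILE + threadIdx.y;
        As[buf][threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[buf][threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
    };

    int numTiles = (K + TILE - 1) / TILE;
    load_tile(0, 0);
    __syncthreads();

    for (int kt = 0; kt < numTiles; ++kt) {
        int cur = kt & 1, nxt = cur ^ 1;
        if (kt + 1 < numTiles) load_tile(nxt, kt + 1);  // prefetch next tile
        for (int k = 0; k < TILE; ++k)
            acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];
        __syncthreads();  // prefetched tile ready, current tile fully consumed
    }

    if (row < M && col < N) C[row * N + col] = acc;
}
```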

MoE Inference

| Metric  | Description |
|---------|-------------|
| Latency | 8.75 ms |
| Details | Fused routing, matmul, and activation; expert-parallel batching; shared workspace reuse |
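
Expert-parallel batching means tokens routed to the same expert are gathered into one contiguous batch, so each expert runs a single dense GEMM instead of per-token work. The host-side C++ sketch below shows only that grouping step for top-1 routing; the function names and the top-1 simplification are assumptions, and the repo's kernel fuses routing, matmul, and activation into a single launch rather than splitting them out like this.

```cpp
// Host-side sketch of expert-parallel batching for top-1 MoE routing.
// Tokens assigned to the same expert become one contiguous batch that
// feeds that expert's GEMM + activation, reusing a shared workspace.
#include <cstdio>
#include <vector>

struct ExpertBatch {
    std::vector<int> token_ids;  // rows of the activation matrix owned by this expert
};

std::vector<ExpertBatch> build_expert_batches(const std::vector<int>& expert_of_token,
                                              int num_experts) {
    std::vector<ExpertBatch> batches(num_experts);
    for (int t = 0; t < (int)expert_of_token.size(); ++t)
        batches[expert_of_token[t]].token_ids.push_back(t);
    return batches;
}

int main() {
    // Toy example: 8 tokens routed across 4 experts.
    std::vector<int> routing = {2, 0, 2, 1, 3, 0, 2, 1};
    auto batches = build_expert_batches(routing, 4);
    for (int e = 0; e < 4; ++e)
        printf("expert %d handles %zu tokens\n", e, batches[e].token_ids.size());
    return 0;
}
```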

MLA Decode

| Metric  | Description |
|---------|-------------|
| Latency | 2.68 ms |
| Details | KV cache + multi-head attention; fused FP8 projection; vectorized access pattern with shared memory |
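
At decode time only one new query token is processed per step, so attention reduces to scoring that query against the cached K rows, a softmax over the context, and a weighted sum over the cached V rows. The HIP sketch below shows that per-head structure in fp32; the fused FP8 projections, vectorized loads, and reduction strategy of the real kernel are omitted, and the names, layouts, and context limit here are assumptions.

```cpp
// Simplified single-token (decode) attention over a KV cache, one block
// per head. Illustrative only; fp32 and a serial softmax keep it short.
#include <hip/hip_runtime.h>
#include <math.h>

constexpr int MAX_CTX = 1024;   // assumed upper bound on cached positions

__global__ void decode_attention(const float* __restrict__ q,       // [heads, head_dim]
                                 const float* __restrict__ k_cache, // [heads, ctx, head_dim]
                                 const float* __restrict__ v_cache, // [heads, ctx, head_dim]
                                 float* __restrict__ out,           // [heads, head_dim]
                                 int ctx, int head_dim) {
    int h = blockIdx.x;                       // one block per attention head
    const float* qh = q + h * head_dim;
    const float* kh = k_cache + (size_t)h * ctx * head_dim;
    const float* vh = v_cache + (size_t)h * ctx * head_dim;

    __shared__ float scores[MAX_CTX];

    // 1) Scaled dot-product scores q . k_t for every cached position t.
    float scale = rsqrtf((float)head_dim);
    for (int t = threadIdx.x; t < ctx; t += blockDim.x) {
        float s = 0.0f;
        for (int d = 0; d < head_dim; ++d) s += qh[d] * kh[t * head_dim + d];
        scores[t] = s * scale;
    }
    __syncthreads();

    // 2) Softmax statistics over the context (computed redundantly by each
    //    thread for brevity; a real kernel uses a parallel reduction).
    float m = -INFINITY;
    for (int t = 0; t < ctx; ++t) m = fmaxf(m, scores[t]);
    float denom = 0.0f;
    for (int t = 0; t < ctx; ++t) denom += expf(scores[t] - m);

    // 3) Weighted sum of cached V rows, parallelized over head_dim.
    for (int d = threadIdx.x; d < head_dim; d += blockDim.x) {
        float acc = 0.0f;
        for (int t = 0; t < ctx; ++t)
            acc += (expf(scores[t] - m) / denom) * vh[t * head_dim + d];
        out[h * head_dim + d] = acc;
    }
}
```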

Platform: AMD MI300X — HIP / ROCm — FP8 (E4M3)
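
For reference, E4M3 packs a sign bit, 4 exponent bits (bias 7), and 3 mantissa bits into one byte, trading range for precision relative to E5M2. The scalar C++ decoder below is purely illustrative of the format; on MI300X the conversion happens in the hardware datapath, not in code like this.

```cpp
// Reference decoder for FP8 E4M3: 1 sign bit, 4 exponent bits (bias 7),
// 3 mantissa bits. Exponent 0 encodes subnormals; 0x7F / 0xFF encode NaN
// (E4M3 has no infinities). Illustrative only.
#include <cmath>
#include <cstdint>
#include <cstdio>

float fp8_e4m3_to_float(uint8_t v) {
    int sign = (v >> 7) & 0x1;
    int exp  = (v >> 3) & 0xF;
    int man  = v & 0x7;

    float mag;
    if (exp == 0xF && man == 0x7) {
        mag = NAN;                                   // only NaN encoding, no infinities
    } else if (exp == 0) {
        mag = (man / 8.0f) * std::ldexp(1.0f, -6);   // subnormal: no implicit leading 1
    } else {
        mag = (1.0f + man / 8.0f) * std::ldexp(1.0f, exp - 7);
    }
    return sign ? -mag : mag;
}

int main() {
    printf("0x38 -> %g\n", fp8_e4m3_to_float(0x38));  // 1.0
    printf("0x7E -> %g\n", fp8_e4m3_to_float(0x7E));  // 448, largest finite E4M3 value
    printf("0xB0 -> %g\n", fp8_e4m3_to_float(0xB0));  // -0.5
    return 0;
}
```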
