
[CUDA] fp and int4 quants for qmm_sm80 #3268

Merged
zcbenz merged 1 commit into ml-explore:main from zcbenz:qmm-sm80-update on Mar 19, 2026

Conversation

zcbenz (Collaborator) commented Mar 17, 2026

Follow-up to #3255 finishing the missing quants; it was blocked by NVIDIA/cutlass#3111.

Unfortunately we can't support int2/3/5/6 in this version, since the copy sizes supported by the cp.async instruction are limited.
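For reference, a minimal sketch (not the actual kernel code in this PR) of the kind of async global-to-shared copy the sm80 path uses. The cp-size operand of `cp.async` can only be 4, 8, or 16 bytes (and the `.cg` variant below accepts 16 only), so a thread's slice of packed weights has to land exactly on one of those sizes, which the int2/3/5/6 packings don't:

```cuda
// Sketch of a 16-byte cp.async copy from global to shared memory.
// cp.async.ca accepts cp-size 4, 8, or 16 bytes; cp.async.cg accepts 16 only.
__device__ inline void cp_async_16B(void* smem_dst, const void* gmem_src) {
  auto smem_addr = static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n" ::"r"(smem_addr),
               "l"(gmem_src));
}

__device__ inline void cp_async_wait_all() {
  // Commit the outstanding copies and wait for them to land in shared memory.
  asm volatile("cp.async.commit_group;");
  asm volatile("cp.async.wait_group 0;");
}
```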

angeloskath (Member) left a comment

Looks great!

How's the initial perf looking?

zcbenz (Collaborator, Author) commented Mar 18, 2026

The memory bandwidth of int4 is about 50% of int8 on A100:

| M | N | K | QMM (GiB/s) | CUBLAS (GiB/s) | QMM (TF/s) | CUBLAS (TF/s) | Speedup (x) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 16 | 4096 | 4096 | 174.3 | 1397.5 | 9.65 | 22.19 | 0.43 |

I think this means the kernel is bottlenecked on loading the quantized weights.
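As a rough sanity check (purely back-of-the-envelope, assuming the GiB/s column counts the packed int4 weights plus fp16 scales/biases at a group_size of 64 and the fp16 activations/outputs; this is not the benchmark script), the kernel time implied by the bandwidth column roughly matches the time implied by the TF/s column, which is consistent with the weight loads dominating:

```cuda
// Host-side estimate only; the byte-counting convention and group_size are
// assumptions, not taken from the benchmark code.
#include <cstdio>

int main() {
  const double M = 16, N = 4096, K = 4096;
  const double group_size = 64;                              // assumed
  const double weight_bytes = N * K * 0.5;                   // 4-bit packed weights
  const double meta_bytes = 2.0 * 2.0 * N * K / group_size;  // fp16 scales + biases
  const double act_bytes = 2.0 * M * K + 2.0 * M * N;        // fp16 input + output
  const double bytes = weight_bytes + meta_bytes + act_bytes;

  const double t_bw = bytes / (174.3 * (1ull << 30));        // from GiB/s column
  const double t_flops = 2.0 * M * N * K / 9.65e12;          // from TF/s column
  std::printf("bandwidth-implied time %.1f us, flops-implied time %.1f us\n",
              t_bw * 1e6, t_flops * 1e6);
  return 0;
}
```

Both come out in the 50–55 us range for this shape.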

zcbenz merged commit dbfbc0f into ml-explore:main on Mar 19, 2026
16 checks passed
zcbenz deleted the qmm-sm80-update branch on March 19, 2026 00:38