[CUDA] columnwise quantize with tma#3157

Open
nastya236 wants to merge 18 commits intoml-explore:mainfrom
nastya236:tma_load

Conversation


@nastya236 nastya236 commented Feb 23, 2026

Columnwise quantization with TMA (mxfp8), bfloat16:

| Size | With TMA (ms) | Without TMA (ms) |
| --- | --- | --- |
| 4096×4096 | 68.48 | 77.21 |
| 4096×8192 | 78.74 | 102.58 |
| 8192×4096 | 80.73 | 100.61 |
| 8192×8192 | 100.67 | 145.16 |
| 4096×16384 | 97.08 | 144.45 |
| 16384×4096 | 99.50 | 137.13 |

This PR:

  • Adds PTX instructions for asynchronous copy with TMA
  • Adds fp_quantize_columnwise_mxfp8 kernel for columnwise MXFP8 quantization using TMA on SM100+
  • Splits fp_quantize.cu into fp_quantize.cu (dispatch) and fp_quantize.cuh (kernels) to reduce file size
  • Moves swizzle_scales constants and get_swizzle_launch_args into cu:: namespace for consistency

TODO:
nvfp4 requires a separate columnwise kernel due to TMA tile size constraints. In the proposed kernel, each thread processes a tile of size (N, M) and stores a transposed result. M equals group_size: 32 bytes for mxfp8, but only 8 bytes for nvfp4. Since TMA requires the innermost tile dimension to be at least 128 bits (16 bytes), the nvfp4 kernel would need to load a larger tile and iterate over multiple groups.

@nastya236 nastya236 requested a review from zcbenz March 17, 2026 15:42
@nastya236 nastya236 marked this pull request as ready for review March 17, 2026 15:43
@nastya236 nastya236 changed the title [WIP] columnwise quantize with tma [CUDA] columnwise quantize with tma Mar 17, 2026
#if (CUDART_VERSION >= 12080) && (__CUDA_ARCH__ >= 1000) && \
defined(__CUDA_ARCH_SPECIFIC__)

__device__ __forceinline__ void mbarrier_init(uint64_t* mbar, uint32_t count) {
Should we use the cuda::ptx APIs like cuda::ptx::mbarrier_init API instead? They don't have good documentation and you would have to search https://github.com/NVIDIA/cccl to find out API names though.
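A sketch of what that substitution might look like, assuming the `cuda::ptx::mbarrier_init` wrapper from CCCL's `<cuda/ptx>` header; the exact overload set should be verified against the CCCL sources as noted above:

```cuda
#include <cuda/ptx>

// Hypothetical rewrite of the hand-written wrapper: the CCCL helper emits
// the same `mbarrier.init.shared.b64` instruction as the inline PTX version.
__device__ __forceinline__ void mbarrier_init(uint64_t* mbar, uint32_t count) {
  cuda::ptx::mbarrier_init(mbar, count);
}
```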

@@ -10,11 +10,6 @@ namespace mlx::core {

namespace cg = cooperative_groups;

This should be moved to namespace cu too.

auto tidy = block_idx.y * block_size.y + idx_in_block.y;
auto grid_dim_x = cg::this_grid().dim_blocks().x * block_size.x;

size_t thread_idx = tidx + grid_dim_x * size_t(tidy);

I think we can just use cg::this_grid().thread_rank()?
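A minimal sketch of the suggested simplification (illustrative, not from the PR). One caveat: `thread_rank()` linearizes block-by-block, so the ordering differs from the manual x-major computation; the two are interchangeable only where a unique linear index is all that matters, e.g. grid-stride partitioning.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__device__ size_t global_thread_index() {
  // One call replaces the manual tidx/tidy/grid_dim_x arithmetic.
  return cg::this_grid().thread_rank();
}
```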

int in_size_bytes, // itemsize
int bits) {
dim3 grid;
grid.x = (grid_dim_x_size + block_size_x - 1) / block_size_x;

Can you use cuda::ceil_div when possible?

Suggested change
grid.x = (grid_dim_x_size + block_size_x - 1) / block_size_x;
grid.x = cuda::ceil_div(grid_dim_x_size, block_size_x);


constexpr size_t out_tile_elems = BUFF_ELEMS / elem_per_byte;
constexpr size_t out_tile_size = out_tile_elems;
constexpr size_t out_buff_size_aligned =

This is not used anywhere.

(reinterpret_cast<uintptr_t>(shared_mem) + TMA_SHMEM_ALIGNMENT - 1) &
~(static_cast<uintptr_t>(TMA_SHMEM_ALIGNMENT - 1));

T* in_sh = reinterpret_cast<T*>(aligned_shared);
@zcbenz zcbenz Mar 18, 2026

You can make sure you get necessary alignment with dynamic allocated shared memory with this:

  extern __shared__ uint128_t shared_mem[];
  T* in_sh = reinterpret_cast<T*>(shared_mem);

or:

extern __shared__ alignas(128) char shared_mem[];

Also, I think in_smem would be an easier to understand name.

((out_tile_elems * BUFFS_NUM + TMA_SHMEM_ALIGNMENT - 1) /
TMA_SHMEM_ALIGNMENT) *
TMA_SHMEM_ALIGNMENT;
const size_t smem_size =
@zcbenz zcbenz Mar 18, 2026

It appears that the size of shared memory is static? I don't think you need to use dynamic shared memory in this case, you can ensure alignment with:

__shared__ alignas(128) T smem[SIZE];
