Conversation
|
What an unexpected and amazing surprise! I'm absolutely thrilled. |
|
@awni |
|
I think this is good to stay as an experiment branch for some time while we work on core and CUDA. I don't think we have the bandwidth to merge this for a few months at least. Sorry if this is disappointing, @NripeshN; I don't mean to discourage you from working on it. |
|
I would love to see the ROCm backend get more traction. AMD's new AI series of processors has a similar advantage to Apple Silicon with unified memory, and getting MLX to run on those processors would be neat. |
|
Stole my idea :( |
|
How is this even possible for such an awesome PR to be left like this? |
Pull request overview
This PR adds experimental ROCm backend support to MLX, enabling execution on AMD GPUs. The implementation mirrors the CUDA backend structure, providing HIP-based implementations of core operations, memory management, and device handling.
Changes:
- Added ROCm backend infrastructure with device management, memory allocation, and stream handling
- Implemented HIP kernels for unary, binary, ternary operations, reductions, normalization (softmax, layer_norm, rms_norm), RoPE, and sorting
- Updated build system (CMake) to support ROCm compilation with configurable GPU architectures
Reviewed changes
Copilot reviewed 59 out of 59 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| CMakeLists.txt | Added MLX_BUILD_ROCM option and ROCm library detection |
| mlx/CMakeLists.txt | Integrated ROCm backend build configuration |
| mlx/device.cpp | Added ROCm device availability checks |
| mlx/backend/rocm/*.hip | HIP kernel implementations for various operations |
| mlx/backend/rocm/device.* | ROCm device and stream management |
| mlx/backend/rocm/allocator.* | ROCm-specific memory allocator using HIP unified memory |
| mlx/backend/rocm/worker.* | Async task execution worker for stream synchronization |
| mlx/backend/rocm/utils.* | HIP utility functions and error handling |
| mlx/backend/rocm/jit_module.* | JIT compilation support using HIPRTC |
| mlx/backend/rocm/device/*.hpp | Device-side utility functions and type definitions |
| mlx/backend/rocm/CMakeLists.txt | ROCm backend build configuration |
…ather, scatter, logsumexp, random bits generation, and sorting. Introduce new kernels for efficient computation and integrate with existing ROCm utilities. Update CMake configuration to include new source files and dependencies. Enhance error handling and ensure compatibility with different data types. This commit significantly expands the functionality of the ROCm backend.
|
👑👑👑 |
|
Can anyone run:

```shell
CMAKE_ARGS="-DMLX_BUILD_ROCM=ON" pip install -e .
```

or, to target a specific GPU architecture:

```shell
CMAKE_ARGS="-DMLX_BUILD_ROCM=ON -DMLX_ROCM_ARCHITECTURES={based on your GPU}" pip install -e .
```

Replace `{based on your GPU}` with your GPU architecture. You can run `rocm-smi` to get your GPU information. |
|
I'm getting this CMake error, running on Strix Halo (gfx1151): |
Could you retry with the latest push, please? (P.S. keep your fingers crossed while it compiles; it worked for me on the 138th time) 😅
… string formatting, replacing fmt library usage. Remove unused event.cpp file. Update kernel name generation and parameter formatting for consistency.
Now what can I test? 😍 |
|
I'm getting this: |
I forgot to test the Python build, my bad. Can you try it now? Unfortunately I might not be able to help after it compiles, since I don't have an AMD GPU to run tests 😔. I've tried replicating most things from CUDA, so hopefully it works.
Use a hash of the module name for hiprtcCreateProgram to avoid filesystem filename limits when HIP runtime compiler creates temporary files. Also add get_hsaco_path() helper to split long module names into nested directories for disk caching. This fixes JIT compilation failures with complex fused kernels that generate very long module names (>255 chars).
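The hashing-and-nesting scheme described above can be sketched in Python (the real implementation is C++ against HIPRTC; `hashed_program_name`, `hsaco_cache_path`, and the chunk size are hypothetical names chosen for illustration):

```python
import hashlib
import os

def hashed_program_name(module_name: str, max_len: int = 255) -> str:
    # Long fused-kernel names would exceed filesystem filename limits
    # when the runtime compiler writes temporary files, so swap in a
    # fixed-length hash whenever the name is too long.
    if len(module_name) <= max_len:
        return module_name
    return hashlib.sha1(module_name.encode()).hexdigest()

def hsaco_cache_path(cache_dir: str, module_name: str, chunk: int = 64) -> str:
    # Split a long module name into nested directories so every single
    # path component stays short when caching compiled code on disk.
    parts = [module_name[i:i + chunk] for i in range(0, len(module_name), chunk)]
    return os.path.join(cache_dir, *parts) + ".hsaco"
```

Hashing keeps the compiler-facing name bounded, while the nested cache path preserves a readable, collision-free layout on disk.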
HIP doesn't provide native math functions for hip_bfloat16 and __half, so add device function overloads that convert to float, compute, and convert back. This enables JIT-compiled kernels to use math operations on reduced-precision tensors. Functions added: abs, exp, log, sqrt, rsqrt, sin, cos, tan, sinh, cosh, tanh, asin, acos, atan, asinh, acosh, atanh, ceil, floor, rint, log2, log10, log1pf, expm1f, erff, erfinvf, powf, fmodf, truncf, atan2f.
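The widen-compute-narrow pattern behind these overloads can be illustrated in Python using the fact that bfloat16 is simply the top 16 bits of an IEEE-754 float32 (a sketch of the idea, not MLX's actual device code):

```python
import math
import struct

def float_to_bf16(x: float) -> int:
    # bfloat16 is the top 16 bits of an IEEE-754 float32 (truncated).
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_to_float(b: int) -> float:
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x

def bf16_exp(b: int) -> int:
    # The overload pattern: widen to float, use the float math
    # function, narrow the result back to the reduced-precision type.
    return float_to_bf16(math.exp(bf16_to_float(b)))
```

Each listed function (abs, exp, log, sqrt, …) follows the same three-step shape, paying a small precision cost that is already inherent to the 8-bit bfloat16 mantissa.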
Add has_only_singleton_batch_dims() helper to correctly detect when broadcasted singleton dimensions can be treated as non-batched matrices, fixing page faults and incorrect results in certain quantized matmul cases.
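A minimal sketch of what such a helper checks (Python stand-in for the C++ helper, assuming the usual convention that the trailing two dims are the matrix dims):

```python
def has_only_singleton_batch_dims(shape) -> bool:
    # The trailing two dims are the matrix dims; everything before
    # them is batch. If every batch dim is 1, the operand can be
    # treated as a plain non-batched matrix instead of taking the
    # batched dispatch path (which was faulting on these shapes).
    return all(dim == 1 for dim in shape[:-2])
```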
- Add qmv_warp_shared_batched_kernel to optimize batched QMV with singleton dimensions.
- Add gather_qmv_warp_shared_kernel to accelerate MoE gather operations during decode.
- Update dispatch logic in QuantizedMatmul::eval_gpu and GatherQMM::eval_gpu to use these fast paths.
Improves decoding speed for 4-bit and 6-bit quantized models by 10-15%. By reading up to 8 quantized values at once using uint32_t vector loads, we better saturate the memory bandwidth instead of doing multiple byte-sized loads. Also unskips passing tests in rocm_skip.py.
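The vectorized-load idea can be illustrated as nibble unpacking: one 32-bit word carries eight 4-bit quantized values (Python sketch; the low-nibble-first ordering is an assumption):

```python
def unpack_u4x8(word: int):
    # One 32-bit load carries eight 4-bit quantized values; on the GPU
    # this replaces eight byte-sized loads and better saturates memory
    # bandwidth. Low nibble first (assumed ordering).
    return [(word >> (4 * i)) & 0xF for i in range(8)]
```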
Tuning the number of threads per column to 16 rather than full WARP_SIZE significantly improves decoding generation performance (from 14.5 to 18.2 TPS on GLM-4 6bit) due to better hardware occupancy and register usage.
- Use sincosf() instead of separate cosf() + sinf() calls for better performance
- Add optimized 1D kernels (rope_single_1d, rope_single_freqs_1d) for single-token decode
- Use 256-thread 1D blocks instead of 16x16 2D blocks for small workloads
- Inline implementation in 1D kernels to reduce function call overhead

The decode case (B=1, T=1) now uses flat indexing, which provides better occupancy for the small number of elements typical in LLM decode steps.
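The decode-path math can be sketched as a reference rotation with one sin/cos pair per rotated feature pair (Python stand-in; the function name and the half-split pairing layout are assumptions, and the paired computation of sin and cos mirrors the sincosf() optimization):

```python
import math

def rope_decode_1d(x, position, base=10000.0):
    # Reference RoPE for a single decode token (B=1, T=1): each
    # feature pair (i, i + half) is rotated by one angle, so the sine
    # and cosine of that angle are computed together (sincosf on GPU).
    half = len(x) // 2
    out = list(x)
    for i in range(half):
        theta = position * base ** (-2.0 * i / len(x))
        s, c = math.sin(theta), math.cos(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i] * s + x[i + half] * c
    return out
```

Because the rotation is length-preserving pair by pair, it is a cheap sanity check for the "different rounding" instabilities mentioned later in this thread.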
Tune quantized matmul path selection for decode/prefill shapes, add bounded dequant cache with safe source retention, and wire QMV block sizing heuristics. Extend ROCm SDPA/flash dispatch to head dim 256 and add a pointwise conv fast path to reduce launch overhead in decode-like workloads.
Key dequant-cache entries by GPU buffer pointers to avoid stale hits from array-id reuse, and align QMV thread/column defaults with architecture-aware warp sizing across both QMM and GatherQMM paths.
Prefer flash SDPA for decode-like BF16/F16 configurations with long KV cache and no masks, while preserving vector fallback behavior. Also skip the AddMM input copy when beta is zero to eliminate redundant device-to-device copy work.
Allow strided-batched GEMM when collapsed batch dimensions are uniformly strided (including flattened multi-dimensional batches) instead of restricting to single-dimension batches only. This reduces fallback per-batch launch overhead and keeps more matmuls on the rocBLAS batched path.
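The uniform-stride condition can be sketched as a contiguity check across the collapsed batch axes (Python sketch; the return convention is hypothetical):

```python
from math import prod

def collapse_batch_for_strided_gemm(shape, strides):
    # All dims except the trailing two matrix dims are batch dims.
    # They collapse into one uniformly strided batch when each batch
    # dim's stride equals the next inner batch dim's stride times its
    # size, i.e. the batch axes are contiguous relative to each other.
    bshape, bstrides = list(shape[:-2]), list(strides[:-2])
    if not bshape:
        return True, 1, 0
    for i in range(len(bshape) - 1):
        if bstrides[i] != bstrides[i + 1] * bshape[i + 1]:
            return False, 0, 0  # fall back to per-batch GEMM launches
    return True, prod(bshape), bstrides[-1]
```

When the check passes, a multi-dimensional batch like (2, 3) flattens to a single batch count of 6 with one uniform stride, so a single rocBLAS strided-batched call covers it.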
Add env-configurable rocBLAS solution-index selection for float32 and bfloat16 GEMM/strided-batched GEMM paths across matmul, quantized QMM dequant GEMM, and shared rocBLAS wrappers. Keep default behavior unchanged (index 0), and automatically fall back to standard algorithms if a configured solution index fails.
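The env-var-with-fallback behavior can be sketched as follows (Python sketch; the variable name is hypothetical, and per the commit message the fallback also triggers at call time if rocBLAS rejects the configured index):

```python
import os

def rocblas_solution_index(env_var: str, default: int = 0) -> int:
    # Index 0 keeps rocBLAS's standard algorithm selection; any other
    # value requests a specific solution. Unset or malformed values
    # fall back to the default so behavior is unchanged out of the box.
    raw = os.environ.get(env_var)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default
```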
Select QMV threads-per-column based on problem size instead of forcing warp-size on RDNA, and tune cols-per-block accordingly for 8-bit paths. This restores better out-of-box decode throughput on smaller models while preserving faster large-model defaults.
Use a larger shared-memory chunk (2048 vs 1024) in QMV warp-shared kernels to reduce chunk loop overhead and synchronization frequency. This improves out-of-box decode throughput on Qwen3.5 models without requiring runtime tuning knobs.
Deduplicate temporary buffer keepalive entries per command buffer to lower host-side bookkeeping and callback payload size, and raise the default max-ops-per-buffer threshold to reduce commit frequency on decode workloads.
|
I have a lot of changes to merge in. I am testing my port of mlx-swift-lm (https://github.com/lemonade-sdk/lemon-mlx-engine) against the MLX ROCm core (https://github.com/lemonade-sdk/lemon-mlx-core-amd). I got Qwen3 working and am working on Qwen3Next right now; it's having weird issues. There are tons of problems with the ROCm backend that I have traced to "different rounding" causing unstable outputs, but a lot of that is fixed now, at least regarding Qwen models. Once I get your changes merged into my repo, I will push a PR into yours with my changes. I have made optimizations as well; there are problems with the fallback system when rocBLAS functions aren't compatible with, or don't exist for, the architecture. |
|
Once I get Qwen3Next working at a reasonable speed I will do the PR. |
I have added you as a collaborator on my fork, so you should be able to push changes directly to this branch (and therefore directly to this PR). Again, amazing work 🚀 |
Experiment with ROCm backend.
Install MLX with the ROCm backend using:

```shell
CMAKE_ARGS="-DMLX_BUILD_ROCM=ON" pip install -e .
```
closes #2556
Inspired by @zcbenz