
Support for ROCm for AMD cards. #29

Open
Mateusz-Dera wants to merge 18 commits into ekwek1:main from Mateusz-Dera:main

Conversation

@Mateusz-Dera

I added support for AMD cards.
While doing so, I also tested the installation using uv.
I additionally added a simple interface written in Gradio.
Everything was tested on Ubuntu 24.04 Podman container with ROCm 7.1.1 on an AMD Radeon 7900 XTX.

@ekwek1
Owner

ekwek1 commented Jan 10, 2026

Hi @Mateusz-Dera, thank you for the PR! I combined your gradio interface with #10 and pushed it onto the repo already. Regarding ROCm, I do not have an AMD card, so I am unable to test this. What kind of speeds are you getting? I looked through your PR and it appears like it is still using CPU.

@Mateusz-Dera
Author

@ekwek1
Yes, it uses a GPU. However, I decided to add LMDeploy support; in my case, the gains are on average around one third. Additionally, I implemented backend selection and the display of execution time and the device being used.
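For readers curious what the backend selection plus execution-time display could look like, here is a minimal sketch. The backend names and `generate` callables are illustrative assumptions, not the actual PR code.

```python
import time

# Hypothetical registry of generation backends (e.g. transformers, lmdeploy).
# In the real PR the callables would wrap the actual model pipelines.
BACKENDS = {}

def register_backend(name, generate_fn):
    """Register a generation callable under a backend name."""
    BACKENDS[name] = generate_fn

def generate_with_timing(backend, text):
    """Run the chosen backend and return its output plus elapsed seconds,
    mirroring the execution-time display described above."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    start = time.perf_counter()
    output = BACKENDS[backend](text)
    elapsed = time.perf_counter() - start
    return output, elapsed

# Dummy backends standing in for the real Transformers / LMDeploy paths:
register_backend("transformers", lambda text: f"[transformers] {text}")
register_backend("lmdeploy", lambda text: f"[lmdeploy] {text}")
```

A UI like the Gradio one in this PR would then feed the dropdown value straight into `generate_with_timing` and show `elapsed` next to the audio output.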

[screenshots: lmdeploy, transformers]

@Mateusz-Dera
Author

Mateusz-Dera commented Jan 10, 2026

For comparison, using my interface I tested the CPU-only version: #15

On a standalone Ryzen 9950X3D, I get execution times that are about 2× higher than ROCm with plain Transformers, and almost 3× higher than with LMDeploy.

[screenshot: cpu]

@TommyBark

I tried to make this work on my Strix Halo (gfx1151), but it doesn't seem to work out of the box. I can provide details later.

I am also confused that you mention you are using ROCm 7.1.1, while the PyTorch ROCm wheels URL points to version 6.4.

@Mateusz-Dera
Author

@TommyBark

I tried to make this work on my Strix Halo (gfx1151), but it doesn't seem to work out of the box. I can provide details later.

I am also confused that you mention you are using ROCm 7.1.1, while the PyTorch ROCm wheels URL points to version 6.4.

What I meant was that ROCm 7.1.1 is installed as a complete ROCm package (/opt/rocm-7.1.1).
The Python package itself is for version 6.4, because version 7.x is not yet marked as stable.

From what I can see, gfx1151 is supported only starting with 7.1.1, and the problem will be PyTorch 6.4 — most likely it will be necessary to switch to a nightly build.
[screenshot: rocm]

It is also possible that some environment variable export is missing, other than HSA_OVERRIDE_GFX_VERSION=x.x.x (I do not know which version applies to gfx1151) and TORCH_COMPILE_DISABLE=1.

In general, I am testing everything in a Podman container (https://github.com/Mateusz-Dera/ROCm-AI-Installer).
There are a few more ROCm-related variables defined there; possibly one of them affects how the whole installation works, but I doubt it.
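As a sketch of how such variables might be wired up: they have to be set before torch is imported, or the HIP runtime will not see them. The `11.0.0` override value is the one commonly used for gfx1100-class (RDNA 3) cards and is an assumption here; as noted above, the correct value for gfx1151 is not confirmed in this thread.

```python
import os

# Set ROCm-related environment variables BEFORE importing torch; once the
# HIP runtime initializes, changes to these have no effect.
# "11.0.0" is the common override for gfx1100-class cards (an assumption
# here, not a confirmed value for gfx1151).
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")
os.environ.setdefault("TORCH_COMPILE_DISABLE", "1")

# import torch  # import only after the environment is prepared
```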

I'll try adding the rocm-nightly version of PyTorch, though I see the website lists it as 7.1 rather than 7.1.1, so I'm not sure whether it will actually help.

[screenshot: pytorch]

@Mateusz-Dera
Author

Mateusz-Dera commented Jan 15, 2026

@TommyBark
Okay, I've added rocm-nightly with PyTorch 7.1. Could you check if it helped?

pip install -e .[rocm-nightly]

@Mateusz-Dera
Author

@ekwek1

I updated everything to the current version of the repository.
While doing so, I noticed that the previous version with LMDeploy enabled was generating a lot of noise at the end of sentences and often truncating the beginning; I’ve fixed that.
I also updated the installation instructions: they are now more consistent with the rest of the documentation, and I added two PyTorch variants for ROCm: 6.4 stable and 7.1 nightly.
Additionally, I fixed the webui so that when running under ROCm it no longer throws an error when LMDeploy is enabled.

@TommyBark
I think you can try this version with PyTorch 7.1; if it doesn’t work, I’ll continue looking for a solution.

The card is correctly detected as a CUDA device (this is normal with ROCm). Below I have attached screenshots comparing: 7900 XTX with the fixed LMDeploy, 7900 XTX with Transformers, and a Ryzen 9950X3D CPU with Transformers.
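To make the "detected as a CUDA device" point concrete: ROCm builds of PyTorch expose AMD GPUs through the CUDA device API, and `torch.version.hip` is the reliable way to tell the two apart. A small illustrative helper (the arguments mirror `torch.cuda.is_available()`, `torch.version.hip`, and `torch.cuda.get_device_name(0)`; the function itself is not part of the PR):

```python
def describe_accelerator(cuda_available, hip_version, device_name=None):
    """Label the active accelerator.

    On ROCm, torch.cuda.is_available() is True and the GPU answers to the
    CUDA API, but torch.version.hip is set; on a real CUDA build it is None.
    """
    if not cuda_available:
        return "CPU"
    backend = f"ROCm (HIP {hip_version})" if hip_version else "CUDA"
    return f"{backend}: {device_name}" if device_name else backend
```

With real torch this would be called as `describe_accelerator(torch.cuda.is_available(), torch.version.hip, torch.cuda.get_device_name(0))`.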

[screenshots: cpu, transformers, lmdeploy]

Everything was tested in a Podman container running Ubuntu 24.04 with ROCm 7.1.1 installed and an AMD Radeon 7900 XTX passed through. On this setup, both PyTorch versions (6.4 and 7.1) were tested with Python 3.12.3 (a clean Python installation using a standard venv, without uv).

@TommyBark

Thank you! Unfortunately there is another unrelated bug ROCm/TheRock#2850 and I don't want to downgrade just yet so I can't test this just now.

@Koko2110

In my testing on a 7900 XTX and an RX 6800, I saw good performance over the web UI, but issues with the CLI and also with streaming, especially with LMDeploy. I am in no way qualified to go into detail about what is going wrong, since I had Claude fix it for me, but it was something about torch.compile and TorchDynamo using a wrong instruction set on ROCm unless you use one of the provided LMDeploy Docker images. Disabling them with environment variables seemed to work.

Then for soprano itself there was an issue with the hidden_state tensors. I think it boiled down to hidden_state often being 'None' on ROCm, but that not being considered in the soprano pipeline.
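A guard for that failure mode might look like the following. The names are illustrative, not the actual soprano pipeline code; the point is simply that `hidden_states` can come back as `None` on ROCm and must not be indexed blindly.

```python
def safe_last_hidden_state(hidden_states, fallback=None):
    """Return the last hidden state, or a fallback when it is missing.

    On ROCm the model's hidden_states output was reportedly often None;
    an empty falsy check covers both None and an empty tuple/list.
    """
    if not hidden_states:
        return fallback
    return hidden_states[-1]
```

A pipeline would then call `safe_last_hidden_state(outputs.hidden_states, fallback=some_default)` instead of `outputs.hidden_states[-1]`.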

Funnily, I got better performance in streaming (>7x realtime) with an Nvidia GTX 1660 than with my AMD RX 6800. I also had slightly better performance with transformers on the 6800. Though, this shouldn't be taken as fact, as my machine with both those GPUs is most definitely CPU bound.

These tests were on python 3.10, 3.11 and on pytorch 2.5.1 and 2.9 on both ROCm 6.2 and 6.4. Versions are unusual since I was working on making a pipeline from soprano to RVC.

Apologies for lacking the competence to explain precisely what issues I encountered, but I thought sharing my experience with streaming from an AMD card would help either way, as it doesn't seem you guys have discussed that yet.

I could submit the edits to the code Claude came up with to get it working if you're interested. Not going ahead with that on my own, though, as I am apprehensive about submitting code I don't fully understand myself.

@TommyBark

TommyBark commented Jan 16, 2026

I actually made ROCm work despite the bugs, by using a specific working nightly of PyTorch for ROCm 7.1.1, but so far only with the Transformers backend, not LMDeploy yet.

Still, I got about a 20× speedup compared to CPU on longer texts. So I would say this PR works well, but it is quite sensitive to the specifics of the AMD hardware involved.

@Mateusz-Dera
Author

@Koko2110

Thanks for the feedback!
The poor performance of the RX 6800 on LMDeploy is probably due to the fact that it does not support bfloat16, so everything is likely being converted to float16, although I may be mistaken.
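An illustrative dtype-selection helper matching this explanation (with real torch, `bf16_supported` would come from `torch.cuda.is_bf16_supported()`; the function is a sketch, not PR code):

```python
def pick_compute_dtype(bf16_supported: bool, on_gpu: bool = True) -> str:
    """Choose an inference dtype name from hardware capabilities.

    RDNA 2 cards such as the RX 6800 lack bfloat16 support, so inference
    would fall back to float16 there; RDNA 3 cards like the 7900 XTX can
    keep bfloat16.
    """
    if not on_gpu:
        return "float32"  # CPU inference typically stays in float32
    return "bfloat16" if bf16_supported else "float16"
```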

In general, the LMDeploy implementation is quite problematic, and I am not sure whether it will work on GPUs older than RDNA 3.
Overall, I am wondering whether there is any library that could be used instead of LMDeploy specifically for AMD GPUs.

If you can share the code, I would be happy to analyze it.
By the way, could you tell me what average generation times you are seeing on the different hardware configurations, with and without LMDeploy?
If possible, how does the RX 6800 running Transformers compare to CPU-only performance?

@Mateusz-Dera
Author

@TommyBark
I’m glad you managed to get it running!
Could you tell me which version of PyTorch you are using (and possibly provide a link to that version)? Also, did you compile it from source?

@Koko2110

@Mateusz-Dera , thanks for the response.

You're right, I definitely saw a bunch of warnings about bfloat16 and did come to the conclusion it was some kind of bottleneck.

I will try to review the code this evening and send over what ended up working.

I will do some tests too, but I'm not certain how conclusive they'd be. The RX 6800 is in a PC with an AMD FX processor, you see. I could run some tests on my desktop with a Ryzen 5600X, though, as well as with the 7900XTX.

Quite sobering to see a GTX 1660 beat a GPU that is 3 times its size, that's for sure.

@Mateusz-Dera
Author

@Koko2110

Warnings are popping up even on the 7900 XTX, which supports bfloat16.
I had to implement a bit of a workaround because it was generating a lot of audio distortion otherwise (I used Claude Code to help with that part, so there might still be some bugs), but overall it solved my audio issues on LMDeploy.
Does it show '✓ ROCm Hook Registered successfully (bfloat16 capable)' for you after the warnings?

[screenshot: bfloat16]

I think Vulkan (mentioned here: #46) might be a good solution for AMD cards (and others).
I’ve tested it a few times, and it often yields good results when native ROCm isn’t working properly.
I think I can try adding an alternative Vulkan-based solution next week, but I can’t promise it’ll work.
I’ve never ported anything strictly for Vulkan before, only for ROCm.

@TommyBark

@TommyBark I’m glad you managed to get it running! Could you tell me which version of PyTorch you are using (and possibly provide a link to that version)? Also, did you compile it from source?

No, I didn't build from source. I basically looked at the newest workflow runs in the TheRock repo and took the first one I saw that passed for gfx1151 and PyTorch 2.9.1, which is 2.9.1+devrocm7.11.0.dev0.5ea8103cfef6709bd676f728088c51ea6b9545a3 from the TheRock staging index https://rocm.devreleases.amd.com/v2-staging/gfx1151/

@davidsdearaujo

@Mateusz-Dera

I think Vulkan(mentioned here #46) might be a good solution for AMD cards (and others).

It would allow running it on Windows on my Strix Halo, I guess 🤔

@Mateusz-Dera
Author

Mateusz-Dera commented Jan 23, 2026

One of my experiments looks promising.
It is an implementation of a hybrid vLLM + Transformers backend, where in some cases the generation time drops even below 0.2, which was impossible even with LMDeploy.
For now, this is an early development version; the installation itself is perhaps not complicated, but it does require several additional steps. The good news is that the whole setup works on PyTorch ROCm 7.2 (https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/).
According to the vLLM website, it should work on the following cards:

  • MI200s (gfx90a)
  • MI300 (gfx942)
  • MI350 (gfx950)
  • Radeon RX 7900 series (gfx1100/1101)
  • Radeon RX 9000 series (gfx1200/1201)
  • Ryzen AI MAX / AI 300 Series (gfx1151/1150)

First, however, I want to review the entire codebase and apply optimizations (a significant part of it was written by Claude Code, and I want to be sure there are no errors), as well as simplify the installation process. If that works out, I would also like to make it run entirely on vLLM, without Transformers support, for even better optimization.
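A small hypothetical helper reflecting the support list above, which a backend selector could use to decide whether the vLLM path is expected to work. The gfx set is taken from this thread, not from vLLM itself.

```python
# gfx targets listed above as supported by vLLM's ROCm backend
# (per this thread; verify against the vLLM docs before relying on it).
VLLM_ROCM_SUPPORTED = {
    "gfx90a",               # MI200 series
    "gfx942",               # MI300
    "gfx950",               # MI350
    "gfx1100", "gfx1101",   # Radeon RX 7900 series
    "gfx1200", "gfx1201",   # Radeon RX 9000 series
    "gfx1150", "gfx1151",   # Ryzen AI MAX / AI 300 series
}

def vllm_backend_supported(gfx_target: str) -> bool:
    """Return True if the gfx target appears in the supported list."""
    return gfx_target.lower() in VLLM_ROCM_SUPPORTED
```

Notably, the RX 6800 (gfx1030) discussed earlier is not on this list, consistent with the doubts about 6000-series compatibility below.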

[screenshot: vllm]

Edit:
A simple fix allowed me to save even more time. In the end, one can say that this is a bit closer to the performance of NVIDIA cards. I will update the instructions and add a test version.

[screenshot: vllm2]

@Mateusz-Dera
Author

@TommyBark The vLLM backend should run on gfx1151. Please check if you can.
@Koko2110 I suspect it won't be compatible with 6000 series GPUs, but feel free to verify.

vLLM requires Python 3.13 to run properly.

