Support for ROCm for AMD cards #29
Conversation
Hi @Mateusz-Dera, thank you for the PR! I combined your Gradio interface with #10 and pushed it to the repo already. Regarding ROCm, I do not have an AMD card, so I am unable to test this. What kind of speeds are you getting? I looked through your PR and it appears that it is still using the CPU.
@ekwek1
For comparison, I used my interface to test the CPU-only version (#15) on a standalone Ryzen 9950X3D. There I get execution times about 2× higher than ROCm with plain Transformers, and almost 3× higher than ROCm with LMDeploy.
I tried to make this work on my Strix Halo (gfx1151), but it doesn't seem to work out of the box. I can provide details later. I am also confused that you mention you are using ROCm 7.1.1, but the …
What I meant was that ROCm 7.1.1 is installed as the complete ROCm package (/opt/rocm-7.1.1). From what I can see, gfx1151 is supported only starting with 7.1.1, and the problem will be the PyTorch build for ROCm 6.4; most likely it will be necessary to switch to a nightly build. It is also possible that some environment variable export is missing beyond HSA_OVERRIDE_GFX_VERSION=x.x.x (I do not know which version applies to gfx1151) and TORCH_COMPILE_DISABLE=1. In general, I am testing everything in a Podman container (https://github.com/Mateusz-Dera/ROCm-AI-Installer).
"I'll try adding the rocm-nightly version of PyTorch, though I see the website lists it as 7.1 rather than 7.1.1, so I'm not sure if it'll actually help."
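As a sketch, the workaround environment mentioned above would look something like this (the correct override value for gfx1151 is unknown here, so it stays a placeholder):

```shell
# Workarounds discussed above; values are examples, not verified for gfx1151.
# HSA_OVERRIDE_GFX_VERSION makes the ROCm runtime treat the GPU as a
# supported architecture; pick the value matching your card.
export HSA_OVERRIDE_GFX_VERSION=x.x.x   # placeholder, not a real value
# Disable torch.compile, which has caused problems on ROCm in this thread.
export TORCH_COMPILE_DISABLE=1
```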
@TommyBark `pip install -e .[rocm-nightly]`
I updated everything to the current version of the repository. @TommyBark The card is correctly detected as a CUDA device (this is normal with ROCm). Below I have attached screenshots comparing: 7900 XTX with the fixed LMDeploy, 7900 XTX with Transformers, and a Ryzen 9950X3D CPU with Transformers.
Everything was tested in a Podman container running Ubuntu 24.04 with ROCm 7.1.1 installed and an AMD Radeon 7900 XTX passed through. On this setup, both PyTorch builds (for ROCm 6.4 and 7.1) were tested with Python 3.12.3 (a clean Python installation using a standard venv, without uv).
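For reference, passing an AMD GPU into a Podman container uses the standard ROCm device nodes. This is an illustrative sketch, not the exact invocation used for the tests above:

```shell
# /dev/kfd is the ROCm compute interface; /dev/dri holds the GPU render nodes.
# Image name and extra flags are illustrative assumptions.
podman run -it \
  --device=/dev/kfd \
  --device=/dev/dri \
  --security-opt seccomp=unconfined \
  ubuntu:24.04 bash
```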
Thank you! Unfortunately, there is another unrelated bug (ROCm/TheRock#2850), and I don't want to downgrade just yet, so I can't test this right now.
In my testing on a 7900 XTX and an RX 6800, I saw good performance over the web UI, but issues with the CLI and with streaming, especially with LMDeploy. I am in no way qualified to go into detail about what was going wrong, since I had Claude fix it for me, but it was something about torch compile and torch dynamo using a wrong instruction set on ROCm unless you use one of the provided LMDeploy Docker images. Disabling them with environment variables seemed to work.
Then, for soprano itself, there was an issue with the hidden_state tensors. I think it boiled down to hidden_state often being `None` on ROCm, and that case not being handled in the soprano pipeline.
Funnily, I got better streaming performance (>7× realtime) with an Nvidia GTX 1660 than with my AMD RX 6800. I also had slightly better performance with Transformers on the 6800. This shouldn't be taken as fact, though, as the machine with both those GPUs is most definitely CPU-bound. These tests were on Python 3.10 and 3.11, with PyTorch 2.5.1 and 2.9, on both ROCm 6.2 and 6.4. The versions are unusual because I was working on a pipeline from soprano to RVC.
Apologies for lacking the competence to explain precisely what issues I encountered, but I thought sharing my experience with streaming from an AMD card would help either way, as it doesn't seem you have discussed that yet. I could submit the edits Claude came up with to get it working, if you're interested. I'm not going ahead with that on my own, though, as I am apprehensive about submitting code I don't fully understand.
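The hidden_state problem described above amounts to a missing `None` guard. A minimal sketch of what such a fix could look like (`last_hidden_state`, `outputs`, and `fallback` are hypothetical names, not the actual soprano pipeline code):

```python
from types import SimpleNamespace

def last_hidden_state(outputs, fallback=None):
    # On some ROCm setups, model outputs reportedly arrive with
    # hidden_states set to None; a pipeline that indexes into it
    # unconditionally then crashes. Guard and let the caller recover.
    hs = getattr(outputs, "hidden_states", None)
    if hs is None:
        return fallback
    return hs[-1]

# Simulated model outputs: one broken (as seen on ROCm), one normal.
broken = SimpleNamespace(hidden_states=None)
ok = SimpleNamespace(hidden_states=["h0", "h1", "h2"])
print(last_hidden_state(broken))  # None: caller must handle the fallback
print(last_hidden_state(ok))      # h2
```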
I actually made ROCm work despite the bugs by using a specific working nightly of torch for ROCm 7.1.1, but so far only using … Still, it's great: I got about a 20× speedup compared to CPU on longer texts. So I would say this PR works well but is quite sensitive to the specifics of the AMD hardware.
Thanks for the feedback! In general, the LMDeploy implementation is quite problematic, and I am not sure whether it will work on GPUs older than RDNA 3. If you can share the code, I would be happy to analyze it. |
@TommyBark
@Mateusz-Dera, thanks for the response. You're right: I definitely saw a bunch of warnings about bfloat16 and did conclude it was some kind of bottleneck. I will try to review the code this evening and send over what ended up working. I will run some tests too, but I'm not certain how conclusive they'd be; the RX 6800 is in a PC with an AMD FX processor, you see. I could run some tests on my desktop with a Ryzen 5600X, though, as well as with the 7900 XTX. Quite sobering to see a GTX 1660 beat a GPU three times its size, that's for sure.
Warnings pop up even on the 7900 XTX, which supports bfloat16.
I think Vulkan (mentioned here: #46) might be a good solution for AMD cards (and others).
No, I didn't build from source. I basically looked at the newest workflow runs in the TheRock repo and took the first one I saw that passed for gfx1151 and PyTorch 2.9.1, which is …
It would allow running it on Windows on my Strix Halo, I guess 🤔
One of my experiments looks promising.
First, however, I want to review the entire codebase and apply optimizations (a significant part of it was written by Claude Code, and I want to be sure there are no errors), as well as simplify the installation process. If that works out, I would also like to make it run entirely on vLLM, without Transformers support, for even better optimization.
Edit:
@TommyBark The vLLM backend should run on gfx1151; please check if you can. vLLM requires Python 3.13 to run properly.

I added support for AMD cards.
While doing so, I also tested the installation using uv.
I additionally added a simple interface written in Gradio.
Everything was tested in an Ubuntu 24.04 Podman container with ROCm 7.1.1 on an AMD Radeon 7900 XTX.