
Support for ROCm for AMD cards. #29

Open
Mateusz-Dera wants to merge 18 commits into ekwek1:main from Mateusz-Dera:main

Conversation

@Mateusz-Dera

I added support for AMD cards.
While doing so, I also tested the installation using uv.
I additionally added a simple interface written in Gradio.
Everything was tested on Ubuntu 24.04 Podman container with ROCm 7.1.1 on an AMD Radeon 7900 XTX.

@ekwek1
Owner

ekwek1 commented Jan 10, 2026

Hi @Mateusz-Dera, thank you for the PR! I combined your gradio interface with #10 and pushed it onto the repo already. Regarding ROCm, I do not have an AMD card, so I am unable to test this. What kind of speeds are you getting? I looked through your PR and it appears like it is still using CPU.

@Mateusz-Dera
Author

@ekwek1
Yes, it uses a GPU. However, I decided to add LMDeploy support; in my case, the gains are on average around one third. Additionally, I implemented backend selection and the display of execution time and the device being used.
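For readers curious what the backend selection plus execution-time display could look like, here is a minimal sketch. The backend names and `generate` callables are illustrative assumptions, not the actual PR code.

```python
import time

# Hypothetical registry of generation backends (e.g. transformers, lmdeploy).
# In the real PR the callables would wrap the actual model pipelines.
BACKENDS = {}

def register_backend(name, generate_fn):
    """Register a generation callable under a backend name."""
    BACKENDS[name] = generate_fn

def generate_with_timing(backend, text):
    """Run the chosen backend and return its output plus elapsed seconds,
    mirroring the execution-time display described above."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    start = time.perf_counter()
    output = BACKENDS[backend](text)
    elapsed = time.perf_counter() - start
    return output, elapsed

# Dummy backends standing in for the real Transformers / LMDeploy paths:
register_backend("transformers", lambda text: f"[transformers] {text}")
register_backend("lmdeploy", lambda text: f"[lmdeploy] {text}")
```

A UI like the Gradio one in this PR would then feed the dropdown value straight into `generate_with_timing` and show `elapsed` next to the audio output.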

[screenshots: lmdeploy, transformers]

@Mateusz-Dera
Author

Mateusz-Dera commented Jan 10, 2026

For comparison, using my interface I tested the CPU-only version: #15

On a standalone Ryzen 9950X3D, I get execution times that are about 2× higher than ROCm with plain Transformers, and almost 3× higher than with LMDeploy.

[screenshot: cpu]

@TommyBark

I tried to make this work on my Strix Halo (gfx1151), but it doesn't seem to work out of the box. I can provide details later.

I am also confused that you mention you are using ROCm 7.1.1, while the PyTorch ROCm wheels URL points to version 6.4.

@Mateusz-Dera
Author

@TommyBark

I tried to make this work on my Strix Halo (gfx1151), but it doesn't seem to work out of the box. I can provide details later.

I am also confused that you mention you are using ROCm 7.1.1, while the PyTorch ROCm wheels URL points to version 6.4.

What I meant was that ROCm 7.1.1 is installed as a complete ROCm package (/opt/rocm-7.1.1).
The Python package itself is for version 6.4, because version 7.x is not yet marked as stable.

From what I can see, gfx1151 is supported only starting with 7.1.1, and the problem will be PyTorch 6.4 — most likely it will be necessary to switch to a nightly build.
[screenshot: rocm]

It is also possible that some environment variable export is missing, other than HSA_OVERRIDE_GFX_VERSION=x.x.x (I do not know which version applies to gfx1151) and TORCH_COMPILE_DISABLE=1.

In general, I am testing everything in a Podman container (https://github.com/Mateusz-Dera/ROCm-AI-Installer).
There are a few more ROCm-related variables defined there; possibly one of them affects how the whole installation works, but I doubt it.
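As a sketch of how such variables might be wired up: they have to be set before torch is imported, or the HIP runtime will not see them. The `11.0.0` override value is the one commonly used for gfx1100-class (RDNA 3) cards and is an assumption here; as noted above, the correct value for gfx1151 is not confirmed in this thread.

```python
import os

# Set ROCm-related environment variables BEFORE importing torch; once the
# HIP runtime initializes, changes to these have no effect.
# "11.0.0" is the common override for gfx1100-class cards (an assumption
# here, not a confirmed value for gfx1151).
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")
os.environ.setdefault("TORCH_COMPILE_DISABLE", "1")

# import torch  # import only after the environment is prepared
```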

I'll try adding the rocm-nightly version of PyTorch, though I see the website lists it as 7.1 rather than 7.1.1, so I'm not sure whether it will actually help.

[screenshot: pytorch]

@Mateusz-Dera
Author

Mateusz-Dera commented Jan 15, 2026

@TommyBark
Okay, I've added rocm-nightly with PyTorch 7.1. Could you check if it helped?

pip install -e .[rocm-nightly]

@Mateusz-Dera
Author

@ekwek1

I updated everything to the current version of the repository.
While doing so, I noticed that the previous version with LMDeploy enabled was generating a lot of noise at the end of sentences and often truncating the beginning; I’ve fixed that.
I also updated the installation instructions: they are now more consistent with the rest of the documentation, and I added two PyTorch variants for ROCm: 6.4 stable and 7.1 nightly.
Additionally, I fixed the webui so that when running under ROCm it no longer throws an error when LMDeploy is enabled.

@TommyBark
I think you can try this version with PyTorch 7.1; if it doesn’t work, I’ll continue looking for a solution.

The card is correctly detected as a CUDA device (this is normal with ROCm). Below I have attached screenshots comparing: 7900 XTX with the fixed LMDeploy, 7900 XTX with Transformers, and a Ryzen 9950X3D CPU with Transformers.
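To make the "detected as a CUDA device" point concrete: ROCm builds of PyTorch expose AMD GPUs through the CUDA device API, and `torch.version.hip` is the reliable way to tell the two apart. A small illustrative helper (the arguments mirror `torch.cuda.is_available()`, `torch.version.hip`, and `torch.cuda.get_device_name(0)`; the function itself is not part of the PR):

```python
def describe_accelerator(cuda_available, hip_version, device_name=None):
    """Label the active accelerator.

    On ROCm, torch.cuda.is_available() is True and the GPU answers to the
    CUDA API, but torch.version.hip is set; on a real CUDA build it is None.
    """
    if not cuda_available:
        return "CPU"
    backend = f"ROCm (HIP {hip_version})" if hip_version else "CUDA"
    return f"{backend}: {device_name}" if device_name else backend
```

With real torch this would be called as `describe_accelerator(torch.cuda.is_available(), torch.version.hip, torch.cuda.get_device_name(0))`.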

[screenshots: cpu, transformers, lmdeploy]

Everything was tested in a Podman container running Ubuntu 24.04 with ROCm 7.1.1 installed and an AMD Radeon 7900 XTX passed through. On this setup, both PyTorch versions (6.4 and 7.1) were tested with Python 3.12.3 (a clean Python installation using a standard venv, without uv).

@TommyBark

Thank you! Unfortunately there is another unrelated bug ROCm/TheRock#2850 and I don't want to downgrade just yet so I can't test this just now.

@Koko2110

In my testing on a 7900 XTX and an RX 6800, I saw good performance over the web UI, but issues with the CLI and also with streaming, especially with LMDeploy. I am in no way qualified to go into detail about what is going wrong, since I had Claude fix it for me, but it was something about torch.compile and TorchDynamo using a wrong instruction set on ROCm unless you use one of the provided LMDeploy Docker images. Disabling them with environment variables seemed to work.

Then for soprano itself there was an issue with the hidden_state tensors. I think it boiled down to hidden_state often being 'None' on ROCm, but that not being considered in the soprano pipeline.
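A guard for that failure mode might look like the following. The names are illustrative, not the actual soprano pipeline code; the point is simply that `hidden_states` can come back as `None` on ROCm and must not be indexed blindly.

```python
def safe_last_hidden_state(hidden_states, fallback=None):
    """Return the last hidden state, or a fallback when it is missing.

    On ROCm the model's hidden_states output was reportedly often None;
    an empty falsy check covers both None and an empty tuple/list.
    """
    if not hidden_states:
        return fallback
    return hidden_states[-1]
```

A pipeline would then call `safe_last_hidden_state(outputs.hidden_states, fallback=some_default)` instead of `outputs.hidden_states[-1]`.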

Funnily, I got better performance in streaming (>7x realtime) with an Nvidia GTX 1660 than with my AMD RX 6800. I also had slightly better performance with transformers on the 6800. Though, this shouldn't be taken as fact, as my machine with both those GPUs is most definitely CPU bound.

These tests were on python 3.10, 3.11 and on pytorch 2.5.1 and 2.9 on both ROCm 6.2 and 6.4. Versions are unusual since I was working on making a pipeline from soprano to RVC.

Apologies for lacking the competence to explain precisely what issues I encountered, but I thought sharing my experience with streaming from an AMD card would help either way, as it doesn't seem you guys have discussed that yet.

I could submit the edits to the code Claude came up with to get it working if you're interested. Not going ahead with that on my own, though, as I am apprehensive about submitting code I don't fully understand myself.

@TommyBark

TommyBark commented Jan 16, 2026

I actually made ROCm work despite the bugs, by using a specific working nightly of PyTorch for ROCm 7.1.1, but so far only with the Transformers backend, not LMDeploy yet.

Still, I got about a 20× speedup compared to CPU on longer texts. So I would say this PR works well, but it is quite sensitive to the specifics of the AMD hardware involved.

@Mateusz-Dera
Author

@Koko2110

Thanks for the feedback!
The poor performance of the RX 6800 on LMDeploy is probably due to the fact that it does not support bfloat16, so everything is likely being converted to float16, although I may be mistaken.
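An illustrative dtype-selection helper matching this explanation (with real torch, `bf16_supported` would come from `torch.cuda.is_bf16_supported()`; the function is a sketch, not PR code):

```python
def pick_compute_dtype(bf16_supported: bool, on_gpu: bool = True) -> str:
    """Choose an inference dtype name from hardware capabilities.

    RDNA 2 cards such as the RX 6800 lack bfloat16 support, so inference
    would fall back to float16 there; RDNA 3 cards like the 7900 XTX can
    keep bfloat16.
    """
    if not on_gpu:
        return "float32"  # CPU inference typically stays in float32
    return "bfloat16" if bf16_supported else "float16"
```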

In general, the LMDeploy implementation is quite problematic, and I am not sure whether it will work on GPUs older than RDNA 3.
Overall, I am wondering whether there is any library that could be used instead of LMDeploy specifically for AMD GPUs.

If you can share the code, I would be happy to analyze it.
By the way, could you tell me what average generation times you are seeing on the different hardware configurations, with and without LMDeploy?
If possible, how does the RX 6800 running Transformers compare to CPU-only performance?

@Mateusz-Dera
Author

@TommyBark
I’m glad you managed to get it running!
Could you tell me which version of PyTorch you are using (and possibly provide a link to that version)? Also, did you compile it from source?

@Koko2110

@Mateusz-Dera , thanks for the response.

You're right, I definitely saw a bunch of warnings about bfloat16 and did come to the conclusion it was some kind of bottleneck.

I will try to review the code this evening and send over what ended up working.

I will do some tests too, but I'm not certain how conclusive they'd be. The RX 6800 is in a PC with an AMD FX processor, you see. I could run some tests on my desktop with a Ryzen 5600X, though, as well as with the 7900XTX.

Quite sobering to see a GTX 1660 beat a GPU that is 3 times its size, that's for sure.

@Mateusz-Dera
Author

@Koko2110

Warnings are popping up even on the 7900 XTX, which supports bfloat16.
I had to implement a bit of a workaround because it was generating a lot of audio distortion otherwise (I used Claude Code to help with that part, so there might still be some bugs), but overall it solved my audio issues on LMDeploy.
Does it show '✓ ROCm Hook Registered successfully (bfloat16 capable)' for you after the warnings?

[screenshot: bfloat16]

I think Vulkan (mentioned here: #46) might be a good solution for AMD cards (and others).
I’ve tested it a few times, and it often yields good results when native ROCm isn’t working properly.
I think I can try adding an alternative Vulkan-based solution next week, but I can’t promise it’ll work.
I’ve never ported anything strictly for Vulkan before, only for ROCm.

@TommyBark

@TommyBark I’m glad you managed to get it running! Could you tell me which version of PyTorch you are using (and possibly provide a link to that version)? Also, did you compile it from source?

No, I didn't build from source. I basically looked at the newest workflow runs in the TheRock repo and took the first one I saw that passed for gfx1151 and PyTorch 2.9.1, which is 2.9.1+devrocm7.11.0.dev0.5ea8103cfef6709bd676f728088c51ea6b9545a3 from the TheRock staging index https://rocm.devreleases.amd.com/v2-staging/gfx1151/

@davidsdearaujo

@Mateusz-Dera

I think Vulkan(mentioned here #46) might be a good solution for AMD cards (and others).

It would allow running it on Windows on my Strix Halo, I guess 🤔

@Mateusz-Dera
Author

Mateusz-Dera commented Jan 23, 2026

One of my experiments looks promising.
It is an implementation of a hybrid vLLM + Transformers backend, where in some cases the generation time drops even below 0.2, which was impossible even with LMDeploy.
For now, this is an early development version; the installation itself is perhaps not complicated, but it does require several additional steps. The good news is that the whole setup works on PyTorch ROCm 7.2 (https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/).
According to the vLLM website, it should work on the following cards:

  • MI200s (gfx90a)
  • MI300 (gfx942)
  • MI350 (gfx950)
  • Radeon RX 7900 series (gfx1100/1101)
  • Radeon RX 9000 series (gfx1200/1201)
  • Ryzen AI MAX / AI 300 Series (gfx1151/1150)

First, however, I want to review the entire codebase and apply optimizations (a significant part of it was written by Claude Code, and I want to be sure there are no errors), as well as simplify the installation process. If that works out, I would also like to make it run entirely on vLLM, without Transformers support, for even better optimization.
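A small hypothetical helper reflecting the support list above, which a backend selector could use to decide whether the vLLM path is expected to work. The gfx set is taken from this thread, not from vLLM itself.

```python
# gfx targets listed above as supported by vLLM's ROCm backend
# (per this thread; verify against the vLLM docs before relying on it).
VLLM_ROCM_SUPPORTED = {
    "gfx90a",               # MI200 series
    "gfx942",               # MI300
    "gfx950",               # MI350
    "gfx1100", "gfx1101",   # Radeon RX 7900 series
    "gfx1200", "gfx1201",   # Radeon RX 9000 series
    "gfx1150", "gfx1151",   # Ryzen AI MAX / AI 300 series
}

def vllm_backend_supported(gfx_target: str) -> bool:
    """Return True if the gfx target appears in the supported list."""
    return gfx_target.lower() in VLLM_ROCM_SUPPORTED
```

Notably, the RX 6800 (gfx1030) discussed earlier is not on this list, consistent with the doubts about 6000-series compatibility below.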

[screenshot: vllm]

Edit:
A simple fix allowed me to save even more time. In the end, one can say that this is a bit closer to the performance of NVIDIA cards. I will update the instructions and add a test version.

[screenshot: vllm2]

@Mateusz-Dera
Author

@TommyBark The vLLM backend should run on gfx1151. Please check if you can.
@Koko2110 I suspect it won't be compatible with 6000 series GPUs, but feel free to verify.

vLLM requires Python 3.13 to run properly.

