[Feat] Add support for RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES (Fix AMD support) #1465

Open
HollowMan6 wants to merge 1 commit into main from the amd branch
Conversation

@HollowMan6 (Contributor) commented May 9, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Add support for RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES and fix AMD support.

High-Level Design

The current approach for supporting AMD in verl is fundamentally incorrect and has only been working by luck:

Calls such as torch.cuda.is_available() or torch.cuda.get_device_name() will initialize the CUDA/ROCm environment:
https://github.com/pytorch/pytorch/blob/c65ee728f069ea9544bdcac815eb0825f45d1633/torch/cuda/__init__.py#L342-L392

Setting CUDA/HIP/ROCR_VISIBLE_DEVICES after CUDA/ROCm is initialized will not take effect (please check pytorch/pytorch#141678), which means that all the current code wrapped inside [SUPPORT AMD: torch] is mostly a no-op.

CUDA_VISIBLE_DEVICES also works for AMD, but only because a lot of AMD-migrated software calls torch.cuda.* at import time, e.g.:

  • ROCm/TransformerEngine#183
  • vllm-project/vllm#15246

Meanwhile, ray/vllm manipulate those *_VISIBLE_DEVICES at runtime, so those torch.cuda.* calls poison the current process whenever the CUDA/ROCm environment is initialized before the manipulation happens.

So the solution here is to use a single environment variable (CUDA_VISIBLE_DEVICES) for everything, for consistency and hardware agnosticism, and fold all the other *_VISIBLE_DEVICES into it. Note that we must pay attention when both the HIP/CUDA and the ROCR env vars are set, as they have different meanings: both accept either a list of ints or a list of UUIDs, but the ROCR variable is processed first and reduces the set of GPUs that HIP can then select from (see pytorch/pytorch#144026). To avoid this complexity, we simply raise an error if both are set (which also keeps consistency with Ray's behavior as of 2.45.0).
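
A minimal sketch of that consolidation (illustrative only and simplified compared to the actual PR code; the exact error message is an assumption):

import os

hip_val = os.environ.get("HIP_VISIBLE_DEVICES")
rocr_val = os.environ.get("ROCR_VISIBLE_DEVICES")

if hip_val is not None and rocr_val is not None:
    # ROCR is applied first and narrows what HIP can select from, so having both
    # set is ambiguous; Ray >= 2.45 also rejects this combination.
    raise ValueError("Please set only one of HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES.")

val = hip_val if hip_val is not None else rocr_val
if val is not None:
    # Fold everything into CUDA_VISIBLE_DEVICES, the single hardware-agnostic variable.
    os.environ["CUDA_VISIBLE_DEVICES"] = val
    os.environ.pop("HIP_VISIBLE_DEVICES", None)
    os.environ.pop("ROCR_VISIBLE_DEVICES", None)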

For the poisoning issue, until those 2 PRs are merged, we will need to ask users to set RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES or RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES, so that Ray no longer manipulates these variables, and so that verl also works when no *_VISIBLE_DEVICES is set.

Note that for the latest Ray (after its switch to HIP_VISIBLE_DEVICES), we also need this patch: ray-project/ray#52794
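
As a stop-gap, opting out roughly looks like this (a hedged example; whether to use the HIP or the ROCR variant depends on the Ray version as noted above, and the flag must be in the environment before Ray starts its workers):

import os

# Must be set before Ray starts, e.g. at the very top of the launcher script.
os.environ["RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES"] = "1"    # newer Ray (HIP_VISIBLE_DEVICES)
# os.environ["RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES"] = "1"  # older Ray (ROCR_VISIBLE_DEVICES)

import ray

ray.init()  # Ray now leaves the *_VISIBLE_DEVICES variables alone in its workers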

Test

Tested manually on both the Megatron and FSDP backends with vLLM.

Additional Info.

  • Issue Number: none
  • Training: both FSDP and Megatron
  • Inference: both vLLM and SGLang

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@HollowMan6 (Contributor, Author) commented:

cc @yushengsu-thu and @eric-haibin-lin for awareness, as I saw another ongoing PR #1453

@vermouth1992 requested a review from yushengsu-thu on May 9, 2025 12:58
@eric-haibin-lin (Collaborator) left a comment:

Thanks. Is there a unit test that can easily reproduce the issue?

@HollowMan6 (Contributor, Author) commented May 9, 2025

> Thanks. Is there a unit test that can easily reproduce the issue?

For the AMD part, since those poisonings are AMD-specific, we would need AMD devices in CI/CD to easily reproduce the issue. I saw there are some ongoing discussions in #1453 about setting up CI with AMD devices for verl. A quick demo of why the current approach is wrong can be found in pytorch/pytorch#141678:

import os
import torch

# Hide all GPUs, then trigger initialization of the CUDA/ROCm runtime.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
print(f"torch.cuda.is_available(): {torch.cuda.is_available()}")  # expected: False

# Re-exposing GPUs afterwards has no effect: the runtime is already initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
print(f"torch.cuda.is_available(): {torch.cuda.is_available()}")  # still False
print(f"torch.cuda.device_count(): {torch.cuda.device_count()}")  # still 0

For the RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES part, for the NVIDIA case, on top of the existing test cases we could of course add another full test run with RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 set, but I'm not sure whether this is necessary and would appreciate your advice.
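
If we do add such a test, a minimal sketch could look like the following (the test name and structure are illustrative, not part of this PR):

import os

import ray


def test_noset_cuda_visible_devices():
    # Opt out of Ray's CUDA_VISIBLE_DEVICES management before the local cluster starts.
    os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "1"
    ray.init(num_gpus=1)

    @ray.remote(num_gpus=1)
    def worker_visible_devices():
        # With the NOSET flag, Ray should not rewrite CUDA_VISIBLE_DEVICES in the worker.
        return os.environ.get("CUDA_VISIBLE_DEVICES")

    # The worker should see whatever the launching environment had, untouched by Ray.
    assert ray.get(worker_visible_devices.remote()) == os.environ.get("CUDA_VISIBLE_DEVICES")
    ray.shutdown()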

@HollowMan6 force-pushed the amd branch 2 times, most recently from df26023 to 1a2791b on May 9, 2025 17:17
@yushengsu-thu self-assigned this on May 9, 2025
@yushengsu-thu (Collaborator) commented May 9, 2025

@HollowMan6, thank you. I am also aware of this issue; my colleague Vicky also sent this patch to Ray. If you're willing, could we arrange a quick online call (my email: yushengsu.thu@gmail.com) to review this PR? It would be helpful for us if you could provide more suggestions for maintenance and modification.

is_ray_noset_visible_devices = ray_noset_visible_devices()

# Prevent use of clashing `{CUDA/HIP/ROCR}_VISIBLE_DEVICES`
rocr_val = os.environ.get("ROCR_VISIBLE_DEVICES", None)

Ray >=2.45 will throw an error if ROCR_VISIBLE_DEVICES is used. In general, it can cause unexpected behavior in pytorch distributed.

@HollowMan6 (Contributor, Author) replied on May 12, 2025:

This is just for compatibility with Ray < 2.45, and I don't think it will cause issues for Ray >= 2.45: Ray already performs that check when ROCR_VISIBLE_DEVICES is set from the beginning, and for Ray < 2.45 we just copy its value into CUDA_VISIBLE_DEVICES and unset it.
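
Roughly, the compatibility path described here looks like this (a simplified sketch, not the exact PR code; using packaging for the version comparison is my own assumption):

import os

import ray
from packaging import version

rocr_val = os.environ.get("ROCR_VISIBLE_DEVICES")
if rocr_val is not None and version.parse(ray.__version__) < version.parse("2.45.0"):
    # Older Ray still manages ROCR_VISIBLE_DEVICES itself; fold its value into
    # CUDA_VISIBLE_DEVICES and unset it so it cannot clash later on.
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", rocr_val)
    os.environ.pop("ROCR_VISIBLE_DEVICES")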

@vickytsang commented:

@HollowMan6 can you share any passing example or tests you have done for these changes?

@HollowMan6 (Contributor, Author) replied:

> @HollowMan6 can you share any passing example or tests you have done for these changes?

I just ran end-to-end training with my own use cases on AMD MI250x clusters, with Ray 2.45 and a patch similar to yours (ray-project/ray#52794). My ROCm version is 6.2.4 and my PyTorch version is torch-2.8.0.dev20250417+rocm6.2.4; everything else is at the latest version. I tested both FSDP and Megatron separately, with vLLM, similar to these examples:

I also tried both settings, with and without RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES, with the patches here applied, and all of them work fine:

Without the above 2 patches, we got errors when RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES was not set.

I don't have more resources to test this PR more comprehensively, but if you see any errors, please let me know and I can help check and figure them out.

@yushengsu-thu (Collaborator) replied to the above:

@HollowMan6 thanks,
Vicky and I will test this on our side.

@yushengsu-thu (Collaborator) commented May 13, 2025

Update @HollowMan6: regarding PR 1, I've contacted the AMD Megatron-LM owner and am now waiting for them to merge it (link).

@wenchenvincent commented:

Hi @HollowMan6, I would like to understand the motivation for setting HIP_VISIBLE_DEVICES after ROCm is initialized. Usually these variables are set before the run starts. Is there any particular reason to manipulate them during the run?

@HollowMan6 (Contributor, Author) commented May 14, 2025

@wenchenvincent It's mainly because vLLM needs to manipulate CUDA_VISIBLE_DEVICES after the process starts (so that it has more flexibility), but the manipulation only takes effect if CUDA/ROCm has not yet been initialized.

Similar problems are already well known on the vLLM community side; you can check vllm-project/vllm#6056 for reference.
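
In other words, the ordering is what matters: restricting devices only works if it happens before the first torch.cuda.* call, for example (expected behavior, shown for illustration):

import os

# Restrict visibility *before* anything touches torch.cuda.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # importing torch alone does not initialize the CUDA/ROCm runtime

# The first torch.cuda.* call initializes the runtime and honors the restriction above.
print(torch.cuda.device_count())  # expected: 1 on a multi-GPU node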

Commit: [Feat] Add support for RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES (Fix AMD support)

Signed-off-by: Hollow Man <hollowman@opensuse.org>