[AMD] Unify run-qwen3-4B.sh to support both AMD and NVIDIA GPUs #597

Open

lizamd wants to merge 3 commits into radixark:main from lizamd:unify-qwen3-4b-amd-nvidia

Conversation

@lizamd
Contributor

lizamd commented Feb 13, 2026

Auto-detect GPU vendor (/dev/kfd or torch.version.hip for AMD, nvidia-smi for NVIDIA) and conditionally apply platform-specific settings:

  • AMD: HIP_VISIBLE_DEVICES, RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES, --no-gradient-accumulation-fusion, --no-offload-train/rollout
  • NVIDIA: NVLink detection, NCCL_NVLS_ENABLE
  • Both: dynamic Megatron-LM path detection, configurable MODEL_DIR/DATA_DIR

This eliminates the need for a separate run-qwen3-4B-amd.sh script.
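
For illustration, here is a minimal sketch of the auto-detection described above; the variable names and the HIP_VISIBLE_DEVICES default are assumptions, not necessarily the script's exact code:

# AMD is detected via the KFD device node or a ROCm build of torch;
# NVIDIA via a usable nvidia-smi.
if [ -e /dev/kfd ] || python3 -c 'import torch, sys; sys.exit(0 if torch.version.hip else 1)' 2>/dev/null; then
  GPU_VENDOR=amd
  export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1
  export HIP_VISIBLE_DEVICES=${HIP_VISIBLE_DEVICES:-0,1,2,3,4,5,6,7}  # assumed default
elif command -v nvidia-smi >/dev/null 2>&1; then
  GPU_VENDOR=nvidia
else
  echo "No supported GPU detected" >&2
  exit 1
fi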

@gemini-code-assist
Contributor

Summary of Changes

Hello @lizamd, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the run-qwen3-4B.sh script by introducing robust GPU vendor detection and applying tailored configurations for both AMD and NVIDIA environments. The changes streamline the workflow for users operating on different hardware, making the script more versatile and user-friendly. It also improves path management by making model and data directories configurable and dynamically locating the Megatron-LM installation.

Highlights

  • Unified GPU Support: The run-qwen3-4B.sh script now automatically detects the GPU vendor (AMD or NVIDIA) and applies platform-specific configurations, eliminating the need for separate scripts.
  • Platform-Specific Optimizations: For AMD GPUs, the script sets HIP_VISIBLE_DEVICES, RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES, and disables gradient accumulation fusion and offloading. For NVIDIA, it continues to detect NVLink and configure NCCL_NVLS_ENABLE.
  • Configurable Paths: Introduced MODEL_DIR and DATA_DIR environment variables, allowing users to easily configure the paths for model checkpoints and prompt data.
  • Dynamic Megatron-LM Path Detection: The script now dynamically detects the installation path of Megatron-LM, improving flexibility and reducing hardcoded dependencies.
Changelog
  • scripts/run-qwen3-4B.sh
    • Implemented GPU vendor detection logic (AMD via /dev/kfd or torch.version.hip, NVIDIA via nvidia-smi).
    • Introduced and utilized MODEL_DIR and DATA_DIR environment variables for configurable paths.
    • Added AMD-specific environment variables (RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES, HIP_VISIBLE_DEVICES) and training arguments (--no-gradient-accumulation-fusion, --no-offload-train, --no-offload-rollout).
    • Updated ray start and ray job submit commands to use dynamically determined GPU counts (NUM_GPUS).
    • Modified PYTHONPATH in the Ray runtime environment to dynamically detect the Megatron-LM installation path (see the sketch after this changelog).
    • Updated checkpoint, rollout, and evaluation arguments to reference the new MODEL_DIR and DATA_DIR variables.
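
For reference, a hedged sketch of the NUM_GPUS-based Ray startup and the dynamic Megatron-LM lookup listed above; the import-based path detection shown here is an assumption, not necessarily the PR's exact approach:

# Locate Megatron-LM by importing it and walking up from megatron/__init__.py.
MEGATRON_PATH=$(python3 -c 'import megatron, os; print(os.path.dirname(os.path.dirname(megatron.__file__)))' 2>/dev/null)
export PYTHONPATH="${MEGATRON_PATH}${PYTHONPATH:+:$PYTHONPATH}"

# Pass the detected GPU count to Ray instead of a hardcoded value.
ray start --head --num-gpus "$NUM_GPUS"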

gemini-code-assist bot left a comment

Code Review

This pull request does a great job of unifying the run script to support both AMD and NVIDIA GPUs, which improves maintainability by removing a duplicate script. The introduction of platform detection, configurable paths, and dynamic discovery of Megatron-LM is a solid improvement.

I have a couple of suggestions to make the script even more robust and readable:

  • The logic for determining NUM_GPUS for NVIDIA is currently hardcoded, unlike the dynamic approach for AMD. I've suggested a change to determine this dynamically, which would make the script more flexible across different NVIDIA hardware setups.
  • I've also suggested a minor formatting change to an if statement to improve readability.

Overall, these are excellent changes that make the script more generic and easier to use.

  HAS_NVLINK=0
else
  NVLINK_COUNT=$(nvidia-smi topo -m 2>/dev/null | grep -o 'NV[0-9][0-9]*' | wc -l)
  if [ "$NVLINK_COUNT" -gt 0 ]; then HAS_NVLINK=1; else HAS_NVLINK=0; fi
Contributor

medium

For better readability and maintainability, it's recommended to expand this compact if-then-else statement into a multi-line block. This makes the logic clearer at a glance.

Suggested change
if [ "$NVLINK_COUNT" -gt 0 ]; then HAS_NVLINK=1; else HAS_NVLINK=0; fi
if [ "$NVLINK_COUNT" -gt 0 ]; then
HAS_NVLINK=1
else
HAS_NVLINK=0
fi

NVLINK_COUNT=$(nvidia-smi topo -m 2>/dev/null | grep -o 'NV[0-9][0-9]*' | wc -l)
if [ "$NVLINK_COUNT" -gt 0 ]; then HAS_NVLINK=1; else HAS_NVLINK=0; fi
echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"
NUM_GPUS=8
Contributor

medium

Hardcoding NUM_GPUS=8 for NVIDIA is less flexible and inconsistent with the dynamic calculation for AMD GPUs. It's better to determine the number of GPUs dynamically for NVIDIA as well. This can be done by checking the CUDA_VISIBLE_DEVICES environment variable or using nvidia-smi. This makes the script more robust and adaptable to different environments.

Suggested change
NUM_GPUS=8
if [ -n "${CUDA_VISIBLE_DEVICES-}" ]; then
  NUM_GPUS=$(echo "${CUDA_VISIBLE_DEVICES}" | tr ',' '\n' | wc -l)
else
  # Fall back to nvidia-smi if CUDA_VISIBLE_DEVICES is not set. --query-gpu=count
  # prints the count once per GPU, so take only the first line, with a final fallback to 8.
  NUM_GPUS=$(nvidia-smi --query-gpu=count --format=csv,noheader 2>/dev/null | head -n 1)
  NUM_GPUS=${NUM_GPUS:-8}
fi
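
An even simpler alternative, which a later commit in this PR adopts (see the commit notes below), is to count the lines of nvidia-smi -L; a minimal sketch:

NUM_GPUS=$(nvidia-smi -L 2>/dev/null | wc -l)
if [ "$NUM_GPUS" -eq 0 ]; then
  NUM_GPUS=8  # conservative fallback when nvidia-smi reports no GPUs
fi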

@yushengsu-thu yushengsu-thu self-assigned this Feb 13, 2026
@yushengsu-thu yushengsu-thu self-requested a review February 13, 2026 19:47
PLATFORM_TRAIN_ARGS=()
if [ "$GPU_VENDOR" = "amd" ]; then
  # Apex not available on ROCm
  MISC_ARGS+=(--no-gradient-accumulation-fusion)
Collaborator

We do not need this: MISC_ARGS+=(--no-gradient-accumulation-fusion)
The Megatron build and related dependencies inside the AMD Docker image already support gradient accumulation fusion.

cc. @zyzshishui to confirm this.

Contributor

yes, no need

# Apex not available on ROCm
MISC_ARGS+=(--no-gradient-accumulation-fusion)
# Disable offloading (torch_memory_saver may not support ROCm; MI300X has 192GB HBM)
PLATFORM_TRAIN_ARGS+=(--no-offload-train --no-offload-rollout)
Collaborator

We do not need this: PLATFORM_TRAIN_ARGS+=(--no-offload-train --no-offload-rollout)
torch_memory_saver has already resolved this issue, and the AMD Docker image already supports it.

cc. @zyzshishui to confirm this.

Contributor

correct, can be removed

lizamd and others added 2 commits February 16, 2026 19:03
- Use dynamic NVIDIA GPU count via nvidia-smi -L instead of hardcoded 8
- Remove --no-gradient-accumulation-fusion (AMD Docker now supports it)
- Remove --no-offload-train/rollout (torch_memory_saver resolved for ROCm)
- Expand compact if/else to multi-line for readability

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevent driver-level deadlocks when offload is enabled on AMD GPUs,
consistent with PR radixark#588 changes to run-qwen3-4B-amd.sh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>