
Conversation


@3outeille 3outeille commented Nov 26, 2025

  • Since the v5 pushes, the MoE refactor breaks compatibility with vLLM. I don't think torch.zeros_like(router_logits).scatter_(1, router_indices, router_top_value) is needed (the argument was that it is required during EP, but when EP is enabled this whole function is skipped anyway).

NOTE: EP was already broken, so this needs to be checked in the follow-up PR #42183
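To make the argument concrete, here is a minimal sketch of the scatter under discussion (not the actual transformers implementation; shapes are hypothetical). The scatter only redistributes the top-k values into a dense (num_tokens, num_experts) tensor, so the non-EP path can work directly from (router_top_value, router_indices):

```python
import torch

torch.manual_seed(0)
num_tokens, num_experts, top_k = 4, 8, 2          # hypothetical tiny shapes
router_logits = torch.randn(num_tokens, num_experts)

# Standard top-k routing.
router_top_value, router_indices = torch.topk(router_logits, top_k, dim=-1)

# The line under discussion: scatter the top-k values back into a dense
# (num_tokens, num_experts) tensor, zero everywhere else.
router_scores = torch.zeros_like(router_logits).scatter_(1, router_indices, router_top_value)

# The dense tensor carries no information beyond (values, indices):
# gathering it back recovers the top-k values exactly.
assert torch.equal(router_scores.gather(1, router_indices), router_top_value)
```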

Takeaway:

Test results

(screenshot of test results)

Testing methodology

mkdir tests_results && ./test_moe_inference.sh tests_results/

#!/bin/bash

# Script to test vLLM inference with transformers backend for MoE models
# Each model is tested in parallel on separate GPUs

# Define colors for output
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
DIM='\033[0;90m'
NC='\033[0m' # No Color

# Define models to test (architecture -> HuggingFace checkpoint)
declare -A MODELS=(
    ["flex_olmo"]="hf-internal-testing/tiny-random-FlexOlmoForCausalLM"
    ["gpt_oss"]="hf-internal-testing/tiny-random-GptOssForCausalLM"
    ["minimax"]="hf-internal-testing/tiny-random-MiniMaxForCausalLM"
    ["mixtral"]="hf-internal-testing/tiny-random-MixtralForCausalLM"
    ["olmoe"]="hf-internal-testing/tiny-random-OlmoeForCausalLM"
    ["qwen2_moe"]="hf-internal-testing/tiny-random-Qwen2MoeForCausalLM"
    ["qwen3_moe"]="hf-internal-testing/tiny-random-Qwen3MoeForCausalLM"
    ["qwen3_next"]="hf-internal-testing/tiny-random-Qwen3NextForCausalLM"
    ["qwen3_vl_moe"]="hf-internal-testing/tiny-random-Qwen3VLMoeForConditionalGeneration"
)

# Get number of available GPUs
NUM_GPUS=$(nvidia-smi -L 2>/dev/null | wc -l)
if [ "$NUM_GPUS" -eq 0 ]; then
    echo "No GPUs detected!"
    exit 1
fi
echo "Detected $NUM_GPUS GPUs"

# Handle results directory - use provided path or create temp directory
if [ -n "$1" ]; then
    RESULTS_DIR="$1"
    mkdir -p "$RESULTS_DIR"
    CLEANUP_RESULTS=false
elif [ -n "$RESULTS_DIR" ]; then
    # RESULTS_DIR already set via environment variable
    mkdir -p "$RESULTS_DIR"
    CLEANUP_RESULTS=false
else
    RESULTS_DIR=$(mktemp -d)
    CLEANUP_RESULTS=true
fi

# Only cleanup if we created a temp directory
if [ "$CLEANUP_RESULTS" = true ]; then
    trap 'rm -rf "$RESULTS_DIR"' EXIT
fi

echo "Results directory: $RESULTS_DIR"

echo "=========================================="
echo "  MoE Models Inference Test Script"
echo "  (Parallel execution on $NUM_GPUS GPUs)"
echo "=========================================="
echo ""

# Function to run inference test on a specific GPU
run_test() {
    local model_name=$1
    local model_checkpoint=$2
    local gpu_id=$3
    local result_file="$RESULTS_DIR/${model_name}.result"
    
    echo -e "${YELLOW}[GPU $gpu_id] Starting: ${model_name} (${model_checkpoint})${NC}"
    
    # Set environment and run the command on specific GPU
    CUDA_VISIBLE_DEVICES=$gpu_id VLLM_ENABLE_V1_MULTIPROCESSING=0 \
        python vllm/examples/offline_inference/basic/generate.py \
        --model "$model_checkpoint" \
        --model-impl transformers \
        --enforce-eager \
        --no-enable-prefix-caching \
        > "$RESULTS_DIR/${model_name}.log" 2>&1
    
    local exit_code=$?
    
    # Write result to file (for collection later)
    if [ $exit_code -eq 0 ]; then
        echo "SUCCESS" > "$result_file"
        echo -e "${GREEN}✓ [GPU $gpu_id] ${model_name}: SUCCESS${NC}"
    else
        echo "FAILED (exit code: $exit_code)" > "$result_file"
        echo -e "${RED}✗ [GPU $gpu_id] ${model_name}: FAILED (exit code: $exit_code)${NC}"
    fi
}

# Convert associative array keys to indexed array for scheduling
MODEL_NAMES=("${!MODELS[@]}")
NUM_MODELS=${#MODEL_NAMES[@]}

# Track PIDs for waiting
declare -a PIDS=()

# Launch tests in parallel, cycling through available GPUs
for i in "${!MODEL_NAMES[@]}"; do
    model_name="${MODEL_NAMES[$i]}"
    model_checkpoint="${MODELS[$model_name]}"
    gpu_id=$((i % NUM_GPUS))
    
    # If we've used all GPUs, wait for a slot to free up
    if [ ${#PIDS[@]} -ge $NUM_GPUS ]; then
        # Wait for any one process to complete
        wait -n 2>/dev/null || wait "${PIDS[0]}"
        # Remove completed PIDs (simplified: just clear and rebuild)
        NEW_PIDS=()
        for pid in "${PIDS[@]}"; do
            if kill -0 "$pid" 2>/dev/null; then
                NEW_PIDS+=("$pid")
            fi
        done
        PIDS=("${NEW_PIDS[@]}")
    fi
    
    run_test "$model_name" "$model_checkpoint" "$gpu_id" &
    PIDS+=($!)
done

# Wait for all remaining background jobs to complete
echo ""
echo "Waiting for all tests to complete..."
wait

# Print summary
echo ""
echo "=========================================="
echo "  SUMMARY"
echo "=========================================="
echo ""

success_count=0
fail_count=0

for model_name in "${MODEL_NAMES[@]}"; do
    result_file="$RESULTS_DIR/${model_name}.result"
    if [ -f "$result_file" ]; then
        result=$(cat "$result_file")
        if [[ "$result" == "SUCCESS" ]]; then
            echo -e "${GREEN}✓ ${model_name}: ${result}${NC}"
            ((success_count++))
        else
            echo -e "${RED}✗ ${model_name}: ${result}${NC}"
            # Show last few lines of error
            echo -e "${DIM}  Error snippet:"
            tail -n 3 "$RESULTS_DIR/${model_name}.log" | while read -r line; do echo -e "    ${DIM}${line}${NC}"; done
            ((fail_count++))
        fi
    else
        echo -e "${RED}✗ ${model_name}: NO RESULT (test may have crashed)${NC}"
        ((fail_count++))
    fi
done

echo ""
echo "-------------------------------------------"
echo -e "Total: ${GREEN}${success_count} passed${NC}, ${RED}${fail_count} failed${NC}"
echo "=========================================="

# Show logs for failed tests
if [ $fail_count -gt 0 ]; then
    echo ""
    echo "Failed test logs available in: $RESULTS_DIR"
    echo "To view: cat $RESULTS_DIR/<model_name>.log"
fi

@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: flex_olmo, gpt_oss, minimax, mixtral, olmoe, qwen2_moe, qwen3_moe, qwen3_next, qwen3_omni_moe, qwen3_vl_moe

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


@ArthurZucker ArthurZucker left a comment


Ok! We can probably just put router_scores = torch.zeros_like(router_logits).scatter_(1, router_indices, router_top_value) for EP?

EP testing needs to be done with GPT OSS rather than any other model, or you can use Mixtral.

TLDR As long as it has:

    base_model_tp_plan = {
        "layers.*.self_attn.q_proj": "colwise",
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.mlp.gate": "ep_router",  # we need to replicate here to correctly route experts
        "layers.*.mlp.experts.gate_up_proj": "local_colwise",
        "layers.*.mlp.experts.down_proj": "local_rowwise",
        "layers.*.mlp.experts": "gather",
    }
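For intuition on why EP wants the dense scores, here is a hedged sketch (hypothetical shapes; not the transformers internals): with expert parallelism each rank owns a column block of experts, and slicing the dense (tokens, num_experts) tensor tells the rank which tokens routed to its local experts and with what weight:

```python
import torch

torch.manual_seed(0)
num_tokens, num_experts, top_k = 4, 8, 2
ep_size, ep_rank = 2, 1                        # assumed 2-way expert parallelism, rank 1
experts_per_rank = num_experts // ep_size

router_logits = torch.randn(num_tokens, num_experts)
router_top_value, router_indices = torch.topk(router_logits, top_k, dim=-1)

# Dense scores: zeros everywhere except the top-k slots.
router_scores = torch.zeros_like(router_logits).scatter_(1, router_indices, router_top_value)

# Each EP rank slices the column block for its local experts; nonzero entries
# tell it which tokens routed to its experts and with what weight.
local = router_scores[:, ep_rank * experts_per_rank:(ep_rank + 1) * experts_per_rank]
tokens_for_rank = (local != 0).any(dim=-1)
```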


to remove when we merge!

@ArthurZucker ArthurZucker marked this pull request as ready for review November 27, 2025 14:00
@ArthurZucker

run-slow: flex_olmo, gpt_oss, minimax, mixtral, olmoe, qwen2_moe, qwen3_moe, qwen3_next, qwen3_omni_moe, qwen3_vl_moe

@github-actions

This comment contains run-slow, running the specified jobs:

models: ["models/flex_olmo", "models/gpt_oss", "models/minimax", "models/mixtral", "models/olmoe", "models/qwen2_moe", "models/qwen3_moe", "models/qwen3_next", "models/qwen3_omni_moe", "models/qwen3_vl_moe"]
quantizations: []

@github-actions

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !
