
Conversation


@3outeille 3outeille commented Nov 26, 2025

  • Since the v5 pushes, the MoE refactor breaks compatibility with vLLM. I don't think torch.zeros_like(router_logits).scatter_(1, router_indices, router_top_value) is needed (the argument was that it is required during EP, but when EP is enabled this whole function is skipped anyway).

NOTE: EP was already broken, so this needs to be checked in the follow-up PR #42183
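To make the argument concrete, here is a minimal sketch of the scatter under discussion (not the actual transformers implementation; shapes are hypothetical). The scatter only redistributes the top-k values into a dense (num_tokens, num_experts) tensor, so the non-EP path can work directly from (router_top_value, router_indices):

```python
import torch

torch.manual_seed(0)
num_tokens, num_experts, top_k = 4, 8, 2          # hypothetical tiny shapes
router_logits = torch.randn(num_tokens, num_experts)

# Standard top-k routing.
router_top_value, router_indices = torch.topk(router_logits, top_k, dim=-1)

# The line under discussion: scatter the top-k values back into a dense
# (num_tokens, num_experts) tensor, zero everywhere else.
router_scores = torch.zeros_like(router_logits).scatter_(1, router_indices, router_top_value)

# The dense tensor carries no information beyond (values, indices):
# gathering it back recovers the top-k values exactly.
assert torch.equal(router_scores.gather(1, router_indices), router_top_value)
```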

Takeaway:

Test results

(screenshot of test results)

Testing methodology

mkdir tests_results && ./test_moe_inference.sh tests_results/

#!/bin/bash

# Script to test vLLM inference with transformers backend for MoE models
# Each model is tested in parallel on separate GPUs

# Define colors for output
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
DIM='\033[0;90m'
NC='\033[0m' # No Color

# Define models to test (architecture -> HuggingFace checkpoint)
declare -A MODELS=(
    ["flex_olmo"]="hf-internal-testing/tiny-random-FlexOlmoForCausalLM"
    ["gpt_oss"]="hf-internal-testing/tiny-random-GptOssForCausalLM"
    ["minimax"]="hf-internal-testing/tiny-random-MiniMaxForCausalLM"
    ["mixtral"]="hf-internal-testing/tiny-random-MixtralForCausalLM"
    ["olmoe"]="hf-internal-testing/tiny-random-OlmoeForCausalLM"
    ["qwen2_moe"]="hf-internal-testing/tiny-random-Qwen2MoeForCausalLM"
    ["qwen3_moe"]="hf-internal-testing/tiny-random-Qwen3MoeForCausalLM"
    ["qwen3_next"]="hf-internal-testing/tiny-random-Qwen3NextForCausalLM"
    ["qwen3_vl_moe"]="hf-internal-testing/tiny-random-Qwen3VLMoeForConditionalGeneration"
)

# Get number of available GPUs
NUM_GPUS=$(nvidia-smi -L 2>/dev/null | wc -l)
if [ "$NUM_GPUS" -eq 0 ]; then
    echo "No GPUs detected!"
    exit 1
fi
echo "Detected $NUM_GPUS GPUs"

# Handle results directory - use provided path or create temp directory
if [ -n "$1" ]; then
    RESULTS_DIR="$1"
    mkdir -p "$RESULTS_DIR"
    CLEANUP_RESULTS=false
elif [ -n "$RESULTS_DIR" ]; then
    # RESULTS_DIR already set via environment variable
    mkdir -p "$RESULTS_DIR"
    CLEANUP_RESULTS=false
else
    RESULTS_DIR=$(mktemp -d)
    CLEANUP_RESULTS=true
fi

# Only cleanup if we created a temp directory
if [ "$CLEANUP_RESULTS" = true ]; then
    trap 'rm -rf "$RESULTS_DIR"' EXIT
fi

echo "Results directory: $RESULTS_DIR"

echo "=========================================="
echo "  MoE Models Inference Test Script"
echo "  (Parallel execution on $NUM_GPUS GPUs)"
echo "=========================================="
echo ""

# Function to run inference test on a specific GPU
run_test() {
    local model_name=$1
    local model_checkpoint=$2
    local gpu_id=$3
    local result_file="$RESULTS_DIR/${model_name}.result"
    
    echo -e "${YELLOW}[GPU $gpu_id] Starting: ${model_name} (${model_checkpoint})${NC}"
    
    # Set environment and run the command on specific GPU
    CUDA_VISIBLE_DEVICES=$gpu_id VLLM_ENABLE_V1_MULTIPROCESSING=0 \
        python vllm/examples/offline_inference/basic/generate.py \
        --model "$model_checkpoint" \
        --model-impl transformers \
        --enforce-eager \
        --no-enable-prefix-caching \
        > "$RESULTS_DIR/${model_name}.log" 2>&1
    
    local exit_code=$?
    
    # Write result to file (for collection later)
    if [ $exit_code -eq 0 ]; then
        echo "SUCCESS" > "$result_file"
        echo -e "${GREEN}✓ [GPU $gpu_id] ${model_name}: SUCCESS${NC}"
    else
        echo "FAILED (exit code: $exit_code)" > "$result_file"
        echo -e "${RED}✗ [GPU $gpu_id] ${model_name}: FAILED (exit code: $exit_code)${NC}"
    fi
}

# Convert associative array keys to indexed array for scheduling
MODEL_NAMES=("${!MODELS[@]}")
NUM_MODELS=${#MODEL_NAMES[@]}

# Track PIDs for waiting
declare -a PIDS=()

# Launch tests in parallel, cycling through available GPUs
for i in "${!MODEL_NAMES[@]}"; do
    model_name="${MODEL_NAMES[$i]}"
    model_checkpoint="${MODELS[$model_name]}"
    gpu_id=$((i % NUM_GPUS))
    
    # If we've used all GPUs, wait for a slot to free up
    if [ ${#PIDS[@]} -ge $NUM_GPUS ]; then
        # Wait for any one process to complete
        wait -n 2>/dev/null || wait "${PIDS[0]}"
        # Remove completed PIDs (simplified: just clear and rebuild)
        NEW_PIDS=()
        for pid in "${PIDS[@]}"; do
            if kill -0 "$pid" 2>/dev/null; then
                NEW_PIDS+=("$pid")
            fi
        done
        PIDS=("${NEW_PIDS[@]}")
    fi
    
    run_test "$model_name" "$model_checkpoint" "$gpu_id" &
    PIDS+=($!)
done

# Wait for all remaining background jobs to complete
echo ""
echo "Waiting for all tests to complete..."
wait

# Print summary
echo ""
echo "=========================================="
echo "  SUMMARY"
echo "=========================================="
echo ""

success_count=0
fail_count=0

for model_name in "${MODEL_NAMES[@]}"; do
    result_file="$RESULTS_DIR/${model_name}.result"
    if [ -f "$result_file" ]; then
        result=$(cat "$result_file")
        if [[ "$result" == "SUCCESS" ]]; then
            echo -e "${GREEN}✓ ${model_name}: ${result}${NC}"
            ((success_count++))
        else
            echo -e "${RED}✗ ${model_name}: ${result}${NC}"
            # Show last few lines of error
            echo -e "${DIM}  Error snippet:"
            tail -n 3 "$RESULTS_DIR/${model_name}.log" | while read -r line; do echo -e "    ${DIM}${line}${NC}"; done
            ((fail_count++))
        fi
    else
        echo -e "${RED}✗ ${model_name}: NO RESULT (test may have crashed)${NC}"
        ((fail_count++))
    fi
done

echo ""
echo "-------------------------------------------"
echo -e "Total: ${GREEN}${success_count} passed${NC}, ${RED}${fail_count} failed${NC}"
echo "=========================================="

# Show logs for failed tests
if [ $fail_count -gt 0 ]; then
    echo ""
    echo "Failed test logs available in: $RESULTS_DIR"
    echo "To view: cat $RESULTS_DIR/<model_name>.log"
fi

@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: flex_olmo, gpt_oss, minimax, mixtral, olmoe, qwen2_moe, qwen3_moe, qwen3_next, qwen3_omni_moe, qwen3_vl_moe

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


@ArthurZucker ArthurZucker left a comment


Ok! We can probably just put router_scores = torch.zeros_like(router_logits).scatter_(1, router_indices, router_top_value) for EP?

EP testing needs to be done with GPT OSS rather than any other model, or you can use Mixtral.

TLDR As long as it has:

    base_model_tp_plan = {
        "layers.*.self_attn.q_proj": "colwise",
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.mlp.gate": "ep_router",  # we need to replicate here to correctly route experts
        "layers.*.mlp.experts.gate_up_proj": "local_colwise",
        "layers.*.mlp.experts.down_proj": "local_rowwise",
        "layers.*.mlp.experts": "gather",
    }
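For intuition on why EP wants the dense scores, here is a hedged sketch (hypothetical shapes; not the transformers internals): with expert parallelism each rank owns a column block of experts, and slicing the dense (tokens, num_experts) tensor tells the rank which tokens routed to its local experts and with what weight:

```python
import torch

torch.manual_seed(0)
num_tokens, num_experts, top_k = 4, 8, 2
ep_size, ep_rank = 2, 1                        # assumed 2-way expert parallelism, rank 1
experts_per_rank = num_experts // ep_size

router_logits = torch.randn(num_tokens, num_experts)
router_top_value, router_indices = torch.topk(router_logits, top_k, dim=-1)

# Dense scores: zeros everywhere except the top-k slots.
router_scores = torch.zeros_like(router_logits).scatter_(1, router_indices, router_top_value)

# Each EP rank slices the column block for its local experts; nonzero entries
# tell it which tokens routed to its experts and with what weight.
local = router_scores[:, ep_rank * experts_per_rank:(ep_rank + 1) * experts_per_rank]
tokens_for_rank = (local != 0).any(dim=-1)
```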


to remove when we merge!

@ArthurZucker ArthurZucker marked this pull request as ready for review November 27, 2025 14:00
@ArthurZucker

run-slow: flex_olmo, gpt_oss, minimax, mixtral, olmoe, qwen2_moe, qwen3_moe, qwen3_next, qwen3_omni_moe, qwen3_vl_moe

@github-actions

This comment contains run-slow, running the specified jobs:

models: ["models/flex_olmo", "models/gpt_oss", "models/minimax", "models/mixtral", "models/olmoe", "models/qwen2_moe", "models/qwen3_moe", "models/qwen3_next", "models/qwen3_omni_moe", "models/qwen3_vl_moe"]
quantizations: []

@github-actions

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !
