A comprehensive SLURM-based deployment system for running large language models with vLLM on HPC clusters, supporting multi-node distributed inference with proper vLLM v0.15.0 configurations.
## Table of Contents
- Features
- Requirements
- Installation
- Configuration
- Usage
- Multi-Node Deployment
- Available Models
- Troubleshooting
- Best Practices

## Features
- Multi-Node Support: Distributed inference across multiple SLURM nodes using torchrun
- vLLM v0.15.0: Latest vLLM features including improved MoE support, FP8 quantization, and multi-modal capabilities
- Centralized Configuration: All settings in `config.yaml`; no hardcoded paths in job scripts
- SLURM Integration: Automatic job submission with configurable time limits, partitions, and accounts
- Multiple Models: Pre-configured support for GPT-OSS, GLM-4, Qwen3-VL, Llama-4 models
- Quantization Support: FP8, MXFP4, BF16 quantization for efficient inference
- OpenAI-Compatible API: Drop-in replacement for OpenAI API endpoints

## Requirements
### Software

```bash
# Python >= 3.10
python --version
```

Required Python packages:

```text
vllm>=0.15.0          # Latest vLLM with multi-node improvements
transformers>=4.51.0  # For Llama-4 and latest model support
torch>=2.10.0         # PyTorch with distributed improvements
```

### Hardware

- GPU: NVIDIA H100 (80GB) or A100 (40GB/80GB) recommended
- Network: InfiniBand or high-speed Ethernet for multi-node deployments
- Storage: Shared filesystem accessible from all compute nodes
- CUDA: 12.1 or later

### Cluster

- SLURM workload manager
- NCCL 2.18+ for multi-GPU communication
- Shared cache directories across nodes
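The version floor above can be checked from Python before submitting jobs. A minimal sketch using only the standard library; the `meets_minimum` helper and the package list are illustrative, not part of the repo:

```python
# Sketch: verify the toolchain meets the minimum versions listed above.
# Uses only the standard library; the helper names are illustrative.
import sys
from importlib import metadata

def parse(v: str) -> tuple:
    """Turn '0.15.0' into (0, 15, 0) for comparison; ignores pre-release tags."""
    return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())

def meets_minimum(installed: str, minimum: str) -> bool:
    """True when the installed version is at least the required one."""
    return parse(installed) >= parse(minimum)

REQUIRED = {"vllm": "0.15.0", "transformers": "4.51.0", "torch": "2.10.0"}

if __name__ == "__main__":
    assert sys.version_info >= (3, 10), "Python >= 3.10 required"
    for pkg, minimum in REQUIRED.items():
        try:
            installed = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            print(f"{pkg}: NOT INSTALLED (need >= {minimum})")
            continue
        ok = meets_minimum(installed, minimum)
        print(f"{pkg} {installed}: {'ok' if ok else f'too old, need >= {minimum}'}")
```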
## Installation

```bash
cd /data/horse/ws/YOUR_USERNAME
git clone <repository_url> host_vllm
cd host_vllm

# Create venv
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies (quote the specs so ">=" is not shell redirection)
pip install --upgrade pip
pip install "vllm>=0.15.0"
pip install "transformers>=4.51.0"
pip install "torch>=2.10.0"
pip install pyyaml openai
```

## Configuration

Edit `config.yaml` to set your paths and SLURM settings:
```yaml
slurm:
  account: "p_scads_pathology"  # Your SLURM account
  partition: "capella"          # Your partition name
  time_default: "12:00:00"      # Default job time limit

cache:
  xdg_cache_home: "/path/to/your/cache"  # Or leave empty for default
  triton_cache_dir: ""                   # Leave empty for auto

paths:
  venv_dir: "/data/horse/ws/YOUR_USERNAME/host_vllm/.venv"
```

### SLURM settings

```yaml
slurm:
  partition: "capella"          # SLURM partition
  account: "p_scads_pathology"  # SLURM account (required)
  time_default: "12:00:00"      # Default time limit (HH:MM:SS)
  cpus_per_task: 16             # CPUs per GPU task
  mem_per_gpu: "64G"            # Memory per GPU
```

### Cache settings

```yaml
cache:
  xdg_cache_home: ""     # Leave empty for $HOME/.cache
  triton_cache_dir: ""   # Leave empty for auto ($XDG_CACHE_HOME/triton)
  huggingface_cache: ""  # Optional: HF model cache directory
```

Important: Leave cache paths empty to use system defaults. This prevents exposing personal paths in the configuration.
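The fallback behavior described in the note can be sketched in Python. This is an illustrative reading of the config, not code from the repo; `resolve_cache` and its key names mirror the YAML above (the real job scripts would load the file with pyyaml first):

```python
# Sketch: how empty cache settings fall back to system defaults, mirroring
# the comments in the YAML above. The helper is illustrative, not repo code.
import os

def resolve_cache(cfg: dict) -> dict:
    """Fill empty cache paths with the documented defaults."""
    cache = dict(cfg.get("cache", {}))
    if not cache.get("xdg_cache_home"):    # empty -> $HOME/.cache
        cache["xdg_cache_home"] = os.path.join(os.path.expanduser("~"), ".cache")
    if not cache.get("triton_cache_dir"):  # empty -> $XDG_CACHE_HOME/triton
        cache["triton_cache_dir"] = os.path.join(cache["xdg_cache_home"], "triton")
    return cache

# Example: everything left empty, as the note recommends
resolved = resolve_cache({"cache": {"xdg_cache_home": "", "triton_cache_dir": ""}})
```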
### Paths

```yaml
paths:
  venv_dir: ""  # Leave empty to auto-detect from current environment
                # Or set to: /path/to/your/.venv
```

### vLLM settings

```yaml
vllm:
  gpu_memory_utilization: 0.90  # Fraction of GPU memory (0.0-1.0)
  max_model_len_default: 32768  # Default context length
  max_num_seqs: 16              # Parallel sequences
  dtype: "auto"                 # Data type: auto, bfloat16, float16
```

### Distributed and NCCL settings

```yaml
distributed:
  rdzv_backend: "c10d"     # PyTorch rendezvous backend
  rdzv_timeout: 1200       # Timeout in seconds
  master_port_base: 29500  # Base port for communication

nccl:
  socket_ifname: ""   # Network interface (e.g., ib0, eth0)
  blocksize: 1048576  # NCCL block size
  ib_timeout: 22      # InfiniBand timeout
  debug: "WARN"       # Debug level: TRACE, INFO, WARN, ERROR
```

### Model definitions

```yaml
models:
  llama-4-maverick:
    name: "Llama-4-Maverick-17B-128E-Instruct"
    huggingface_id: "meta-llama/Llama-4-Maverick-17B-128E-Instruct"
    job_script: "jobs/llama4_maverick_2nodes_8gpus.sh"
    port: 8006
    gpus: 8              # Total GPUs across all nodes
    nodes: 2             # Number of SLURM nodes
    quantization: "fp8"
    vision: true
    time: "24:00:00"     # Override default time limit
    # partition: "long"  # Optional: override partition
```

## Usage

- Select Model: Edit
`config.yaml` and set `active_model`:

  ```yaml
  active_model: llama-4-scout  # Choose from available models
  ```

- Submit Job:

  ```bash
  ./run.sh
  ```

- Monitor Job:

  ```bash
  # Check job status
  squeue -u $USER

  # View logs
  tail -f logs/llama4_scout_*.out
  ```

Use a custom config file:

```bash
./run.sh -c my_config.yaml
```

### Querying the Server

Once the job is running, query via the OpenAI-compatible API:
```bash
# Set environment variables
export VLLM_HOST=<node_name>  # From job output
export VLLM_PORT=8006

# Query with Python script
python scripts/query_openai_compatible.py \
    --host $VLLM_HOST \
    --port $VLLM_PORT \
    --prompt "Explain quantum computing"

# Or use main.py
python main.py \
    --model llama-4-scout \
    --host $VLLM_HOST \
    --prompt "Your prompt here"
```

## Multi-Node Deployment

vLLM multi-node deployment uses:
- Tensor Parallelism (TP): Model weights distributed across GPUs
- torchrun: PyTorch distributed launcher
- c10d backend: Reliable rendezvous for node coordination
- NCCL: High-performance GPU communication
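The resulting layout can be sketched with a toy calculation (illustrative only; real values come from `config.yaml`): total GPUs split evenly across the SLURM nodes, with tensor parallelism spanning all of them:

```python
# Toy sketch of the parallel layout implied above: total GPUs divide evenly
# across nodes, and tensor parallel size equals the total GPU count.
# Names are illustrative; the real values come from config.yaml.

def parallel_layout(nodes: int, gpus: int) -> dict:
    """Derive per-node GPU count and tensor parallel size from the config."""
    if gpus % nodes != 0:
        raise ValueError("total GPUs must divide evenly across nodes")
    return {
        "nodes": nodes,
        "gpus_per_node": gpus // nodes,    # GPUs per torchrun worker group
        "tensor_parallel_size": gpus,      # weights sharded across all GPUs
    }

# Llama-4-Maverick example from the model table: 2 nodes, 8 GPUs total
layout = parallel_layout(nodes=2, gpus=8)
```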
For models requiring multiple nodes (e.g., Llama-4-Maverick, GLM-4.7):
```yaml
models:
  model-name:
    nodes: 2          # Number of SLURM nodes
    gpus: 8           # Total GPUs (nodes × gpus-per-node)
    time: "24:00:00"  # Longer time for large models
```

### Network Configuration

If you have InfiniBand:
```yaml
nccl:
  socket_ifname: "ib0"  # InfiniBand interface
  ib_timeout: 22        # IB-specific timeout
```

For Ethernet-based clusters:

```yaml
nccl:
  socket_ifname: "eth0"  # Ethernet interface
  # Or leave empty for auto-detection
```

### Verifying Deployment

After job submission, check logs for:
```text
Master address: node001
Master port: 29542
Starting <Model> on 2 nodes (8 GPUs)...
Tensor Parallel Size: 8
```

Successful multi-node communication shows:

```text
INFO: Started server process [PID]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
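Once the log shows startup complete, the server can also be queried from plain Python without extra dependencies. A sketch using only the standard library; host, port, and model name are placeholders taken from the job output and `config.yaml`:

```python
# Sketch: query the running server over its OpenAI-compatible API using only
# the standard library. Host, port, and model name are placeholders.
import json
import urllib.request

def build_request(host: str, port: int, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def query(host: str, port: int, model: str, prompt: str) -> str:
    """Send the request and return the first choice's text (needs a live server)."""
    with urllib.request.urlopen(build_request(host, port, model, prompt)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (requires a running job):
# query("node001", 8006, "llama-4-scout", "Explain quantum computing")
```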
## Available Models

### Single-Node Models

| Model | GPUs | VRAM | Context | Description |
|---|---|---|---|---|
| GPT-OSS-20B | 1 | 24GB | 128K | Fast reasoning with tool use |
| GPT-OSS-120B | 1 | 80GB | 128K | High-reasoning production model |
| GLM-4.7-Flash | 2 | 160GB | 128K | MoE with fast inference |
| GLM-4.6V (FP8) | 2 | 160GB | 128K | Vision-language MoE |
| GLM-4.6V (FP16) | 4 | 320GB | 128K | Full precision VLM |
| Llama-4-Scout | 2 | 160GB | 10M | Multimodal with 16 experts |

### Multi-Node Models

| Model | Nodes | GPUs | VRAM | Context | Description |
|---|---|---|---|---|---|
| GLM-4.7 | 2 | 8 | 640GB | 131K | 358B MoE language model |
| Qwen3-VL-235B | 2 | 8 | 640GB | 256K | Vision-language agent |
| Llama-4-Maverick | 2 | 8 | 640GB | 1M | 128 experts, multimodal |

### Quantization Options

- FP8: 8-bit floating point, ~2x memory reduction
- MXFP4: 4-bit MX format for GPT-OSS models
- BF16: 16-bit bfloat, full precision baseline
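A back-of-envelope way to read these numbers: a parameter stored in b bits costs b/8 bytes, so FP8 roughly halves a BF16 footprint and MXFP4 roughly quarters it. A toy estimate (weights only; activations, KV cache, and quantization metadata add more, so real usage is higher):

```python
# Toy weight-memory estimate behind the quantization options above.
# Illustrative only: ignores activations, KV cache, and scaling metadata.

def weight_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB (decimal units) at a given precision."""
    return params_billion * 1e9 * bits / 8 / 1e9

bf16_gb = weight_gb(358, 16)  # a GLM-4.7-sized (358B) model at BF16
fp8_gb = weight_gb(358, 8)    # the same model at FP8: half the bytes
```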
## Troubleshooting

### Job stuck in queue

Symptom: Job remains in pending (PD) state

Solution:

```bash
# Check job details
scontrol show job <JOB_ID>

# Verify account access
sacctmgr show user $USER

# Check partition availability
sinfo -p capella
```

### Multi-node rendezvous failures

Symptom: Timeout errors, "No available node types" message

Solution:

```yaml
# In config.yaml, try a different network interface
nccl:
  socket_ifname: "ib0"  # Or "eth0", or leave empty

# Increase the timeout
distributed:
  rdzv_timeout: 1800  # 30 minutes
```

### Out of memory

Symptom: CUDA out of memory errors
Solution:

```yaml
# Reduce GPU memory utilization
vllm:
  gpu_memory_utilization: 0.85  # From 0.90

# Reduce batch size
vllm:
  max_num_seqs: 8  # From 16

# Reduce context length
vllm:
  max_model_len_default: 16384  # From 32768
```

### Port conflicts

Symptom: "Address already in use" error
Solution:

```yaml
# Change port in config.yaml
models:
  model-name:
    port: 8007  # Different port
```

### Cache permission errors

Symptom: Permission denied writing to cache

Solution:

```yaml
# Set explicit cache paths
cache:
  xdg_cache_home: "/data/horse/ws/$USER/cache"
  triton_cache_dir: "/data/horse/ws/$USER/cache/triton"
```

### Debugging

Enable verbose NCCL logging in `config.yaml`:

```yaml
nccl:
  debug: "INFO"  # Or "TRACE" for maximum verbosity
```

```bash
# In job output, look for:
# "NCCL INFO Ring 00 : ..."
# Indicates successful GPU ring formation

# Check network connectivity
srun --nodes=2 --ntasks=2 hostname
```

Check GPU allocation:

```bash
# Inside running job
nvidia-smi
# Should show allocated GPUs
# Check that all expected GPUs are visible
```

The system uses SLURM/torchrun, not Ray, but for Ray-based deployments:

```bash
# Ray dashboard (if using Ray)
ray dashboard
```

## Best Practices

- Time Limits: Add 20% buffer for model loading and initialization
- Memory: Start with `gpu_memory_utilization: 0.90`, reduce if OOM
- CPUs: Use 16 CPUs per GPU for optimal data preprocessing
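The 20% buffer rule above can be sketched as a small helper (illustrative, not repo code) that pads a runtime estimate and formats it as a SLURM `HH:MM:SS` limit:

```python
# Sketch of the "add 20% buffer" rule of thumb: pad an estimated runtime
# and render it in SLURM's HH:MM:SS time-limit format. Illustrative only.

def padded_walltime(est_hours: float, buffer: float = 0.20) -> str:
    """Return est_hours plus a safety buffer, formatted for a SLURM time limit."""
    total_seconds = round(est_hours * (1 + buffer) * 3600)
    h, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

limit = padded_walltime(10)  # a 10-hour estimate becomes "12:00:00"
```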
### Network tuning

```yaml
# Optimal NCCL settings for Ethernet
nccl:
  socket_ifname: ""   # Auto-detect
  blocksize: 1048576  # 1MB blocks
  debug: "WARN"       # Minimal overhead

# Increase timeouts for large models
distributed:
  rdzv_timeout: 1800  # 30 min for slow networks
```

### Model selection

- Development: Start with smaller models (GPT-OSS-20B, Llama-4-Scout)
- Production: Use FP8 quantization for 2x throughput
- Vision tasks: GLM-4.6V or Qwen3-VL for multimodal
### Cache management

```bash
# Clean old model caches periodically
rm -rf $XDG_CACHE_HOME/huggingface/hub/*

# Set explicit HF cache to control disk usage
export HF_HOME=/scratch/$USER/hf_cache
```

### Monitoring

```bash
# Create monitoring script
watch -n 5 'squeue -u $USER && nvidia-smi'

# Check logs in real-time
tail -f logs/*.out
```

### Reproducibility

```yaml
# Fix random seeds in model config
vllm:
  seed: 42  # Reproducible generation
```

## Changelog
- v2.0 (2026-02-03): Complete refactoring
  - vLLM v0.15.0 support
  - Removed hardcoded paths
  - Enhanced multi-node configuration
  - Comprehensive NCCL settings
  - Centralized configuration management
- v1.0: Initial release
  - Basic SLURM integration
  - Single-node deployments
## Support

For issues:

- Check job logs in the `logs/` directory
- Verify SLURM account access
- Test with single-node models first
- Consult the vLLM documentation