A flexible benchmarking framework for text language models with automated Docker orchestration and MLPerf-compliant evaluation.
- Zero-setup benchmarking - Automatic Docker container orchestration
- Universal hardware support - Auto-detects CUDA, ROCm, ARM, and CPU devices
- MLPerf-compliant scenarios - Server, Offline, and SingleStream inference modes
- Performance & accuracy evaluation - Comprehensive metrics with built-in datasets
- QPS sweep mode - Automatic performance curve discovery
- Existing server integration - Connect to your running vLLM server
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install via clone
git clone https://github.com/flexaihq/flexbench.git
cd flexbench
uv venv
source .venv/bin/activate
uv pip install -e .
# Install via git URL
uv venv
source .venv/bin/activate
uv pip install git+https://github.com/flexaihq/flexbench.git
- Docker and Docker Compose (or the `docker compose` plugin)
- NVIDIA Docker runtime (for GPU support)
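Before the first run, you can sanity-check these prerequisites. A minimal sketch (illustrative only, not part of FlexBench), assuming Docker Compose v2 and, optionally, the NVIDIA tooling are installed:

```python
# Illustrative pre-flight check for the prerequisites above (not part of FlexBench).
import shutil
import subprocess

assert shutil.which("docker"), "Docker is required"
subprocess.run(["docker", "compose", "version"], check=True)  # Docker Compose v2 plugin
if shutil.which("nvidia-smi"):  # only needed for GPU benchmarking
    subprocess.run(["nvidia-smi", "-L"], check=True)  # list visible NVIDIA GPUs
```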
FlexBench provides a single command with smart defaults for immediate benchmarking:
# View all available options
flexbench --help
# Basic benchmark with default model (HuggingFaceTB/SmolLM2-135M-Instruct) and dataset (ctuning/MLPerf-OpenOrca)
flexbench # lightweight model for quick testing
# Gated models (requires HuggingFace token)
export HF_TOKEN=your_hf_token_here
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct # supports --hf-token argument as well
# Larger model with multi-GPU support
flexbench --model-path meta-llama/Llama-3.2-70B-Instruct --gpu-devices "0,1" --tensor-parallel-size 2 # or use CUDA_VISIBLE_DEVICES environment variable
# Force CPU mode
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --device-type cpu
# Specify target QPS (queries per second)
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --target-qps 5
# QPS sweep to find performance limits
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --sweep
# Accuracy evaluation mode (default is performance)
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --mode accuracy
# Full benchmark with both performance and accuracy in sequence
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --mode all
# Use existing vLLM server
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --vllm-server http://localhost:8000 # assumes vLLM server is running
FlexBench automatically handles Docker container orchestration, model loading, benchmarking, and result collection with zero manual setup.
FlexBench uses Docker Compose to orchestrate two containers that communicate over a dedicated network:
flowchart TD
A[FlexBench CLI] -->|Orchestrates| B[Docker Compose Network]
subgraph B[flexbench-network]
C[vLLM Server Container]
D[FlexBench Runner Container]
D <-->|API Calls| C
end
subgraph Host Machine
E[HuggingFace Cache]
F[Results Directory]
G[GPU Devices]
end
C -->|Loads Models| E
D -->|Saves Results| F
C -->|Uses| G
H[Existing vLLM Server] -.->|Optional| D
H -.->|Bypasses container orchestration| B
Container Orchestration:
- vLLM Server Container: Loads and serves the model via OpenAI-compatible API
- FlexBench Runner Container: Generates load, collects metrics, and saves results
- Automatic networking: Containers communicate over a dedicated Docker network
- GPU allocation: Automatic device detection and resource management
External Server Option:
- Use `--vllm-server` to connect to an existing vLLM server
- Bypasses vLLM container creation for maximum flexibility
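Before pointing `--vllm-server` at an existing server, it can help to confirm that its OpenAI-compatible endpoint is reachable. A minimal sketch, assuming a vLLM server listening on `http://localhost:8000`:

```python
# Reachability check against a running vLLM server's OpenAI-compatible API.
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same value you would pass to --vllm-server

with urllib.request.urlopen(f"{BASE_URL}/v1/models", timeout=10) as resp:
    models = json.load(resp)

# The served model id should match the --model-path you benchmark against.
print([m["id"] for m in models.get("data", [])])
```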
FlexBench supports multiple inference scenarios based on MLPerf standards:
For more details on the MLPerf Inference Benchmark and the design of modes and metrics, refer to the MLPerf Inference Benchmark paper.
When running without an existing vLLM server specified, FlexBench automatically detects your hardware with `--device-type auto` (the default):
Detection Priority: CUDA → ROCm → ARM → CPU
| Device Type | Default vLLM Image | Build Method | Hardware |
|---|---|---|---|
| `auto` | Auto-detected | Varies by detected device | Automatic hardware detection |
| `cuda` | `vllm/vllm-openai:latest` | Pull from registry | NVIDIA GPUs |
| `rocm` (WIP) | `rocm/vllm:latest` | Pull from registry | AMD GPUs |
| `arm` | `vllm-arm-local:latest` | Built from source | ARM processors |
| `cpu` | `public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.9.1` | Pull from registry | CPU-only systems |
Note: ARM devices require building vLLM from source since no pre-built ARM images are available. FlexBench automatically clones the vLLM repository and builds the image locally.
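For intuition, the detection priority could be implemented along these lines (a simplified sketch, not FlexBench's actual detection code):

```python
# Simplified sketch of the CUDA -> ROCm -> ARM -> CPU detection order (illustrative).
import platform
import shutil

def detect_device_type() -> str:
    if shutil.which("nvidia-smi"):  # NVIDIA driver tooling present
        return "cuda"
    if shutil.which("rocm-smi"):    # AMD ROCm tooling present
        return "rocm"
    if platform.machine().lower() in ("arm64", "aarch64"):
        return "arm"
    return "cpu"

print(detect_device_type())
```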
Force specific device:
# Force CPU even with GPUs available
flexbench --device-type cpu
# Force CUDA with a specific GPU
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --device-type cuda --gpu-devices "1"
# equivalent to setting CUDA_VISIBLE_DEVICES=1 env variable
# Run multiple GPUs with tensor parallelism
flexbench --model-path meta-llama/Llama-3.2-70B-Instruct --device-type cuda --gpu-devices "0,1" --tensor-parallel-size 2
FlexBench supports multiple evaluation modes via `--mode`:
| Mode | Description | Usage |
|---|---|---|
| `performance` | Benchmark throughput and latency (default) | `--mode performance` |
| `accuracy` | Evaluate model outputs against reference data | `--mode accuracy` |
| `all` | Run performance benchmark, then accuracy evaluation | `--mode all` |
Examples:
# Performance only (default)
flexbench --model-path HuggingFaceTB/SmolLM2-135M-Instruct
# Accuracy evaluation
flexbench --model-path HuggingFaceTB/SmolLM2-135M-Instruct --mode accuracy
# Both modes sequentially
flexbench --model-path HuggingFaceTB/SmolLM2-135M-Instruct --mode all
FlexBench uses the cTuning/MLPerf-OpenOrca dataset by default, the official MLPerf dataset for text inference benchmarking. Pre-configured column mappings (see the sketch after this list):
- Input column: `question`
- Output column: `response` (used for accuracy evaluation)
- System prompt: `system_prompt`
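To make the mapping concrete, here is a hypothetical example of how one dataset record maps onto a chat-style request and an accuracy reference (illustrative field values, not FlexBench's internal code):

```python
# Hypothetical record illustrating the default column mapping (values are made up).
record = {
    "system_prompt": "You are a helpful assistant.",
    "question": "Summarize the following article in one sentence.",
    "response": "The article describes ...",
}

# The system prompt and input columns form the request sent to the vLLM server.
messages = [
    {"role": "system", "content": record["system_prompt"]},
    {"role": "user", "content": record["question"]},
]

# The output column serves as the reference answer for --mode accuracy.
reference = record["response"]
```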
Override defaults:
# Use custom dataset
flexbench --model-path HuggingFaceTB/SmolLM2-135M-Instruct \
--dataset-path your-org/your-dataset \
--dataset-input-column your_input_column \
--dataset-output-column your_output_column
Sweep mode automatically discovers your model's performance characteristics by testing multiple QPS levels. It first finds the maximum QPS your model can handle, then runs benchmarks at evenly spaced QPS points between 0 and that maximum plus 20%.
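As a rough illustration, such a grid of QPS points could be derived as follows (a sketch of the behaviour described above, assuming the maximum QPS has already been measured; not the actual implementation):

```python
# Sketch: evenly spaced QPS points up to the measured maximum plus 20% headroom.
def sweep_points(max_qps: float, num_points: int = 10) -> list[float]:
    upper = max_qps * 1.2          # 20% above the measured maximum
    step = upper / num_points
    return [round(step * i, 2) for i in range(1, num_points + 1)]

print(sweep_points(max_qps=8.0))   # 10 points between 0.96 and 9.6
```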
Usage:
# Basic sweep with 10 QPS points (default)
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --sweep
# Custom sweep with 5 QPS points
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --sweep --num-sweep-points 5
All sweep results are saved in a single file.
Note: Sweep mode is incompatible with `--target-qps` (the QPS range is determined automatically) and with `--mode accuracy` (performance analysis only).
We are developing MLCommons CMX automations to help users prepare, validate, and submit official MLPerf inference results using FlexBench. These automations are based on our MLPerf inference v5.0 submission, featuring DeepSeek-R1-Distill-Llama-8B and vLLM.
This project is licensed under the Apache License 2.0.
© 2025 FlexAI
Portions of the code were adapted from the following MLCommons repositories, which are also licensed under the Apache 2.0 license:
Daniel Altunay and Grigori Fursin (FCS Labs)
We would like to thank Dali Kilani, Venkataraju Koppada, Rahul Thangallapally, and other colleagues for their valuable discussions and feedback.
We welcome contributions to this project!
If you have ideas, bug reports, or feature requests, please open an issue. To contribute code, feel free to submit a pull request. By contributing, you agree that your contributions will be licensed under the same Apache License 2.0.