FlexBench

A flexible benchmarking framework for text language models with automated Docker orchestration and MLPerf-compliant evaluation.

Features

  • Zero-setup benchmarking - Automatic Docker container orchestration
  • Universal hardware support - Auto-detects CUDA, ROCm, ARM, and CPU devices
  • MLPerf-compliant scenarios - Server, Offline, and SingleStream inference modes
  • Performance & accuracy evaluation - Comprehensive metrics with built-in datasets
  • QPS sweep mode - Automatic performance curve discovery
  • Existing server integration - Connect to your running vLLM server

Installation

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install via clone
git clone https://github.com/flexaihq/flexbench.git
cd flexbench
uv venv
source .venv/bin/activate
uv pip install -e .

# Install via git URL
uv venv
source .venv/bin/activate
uv pip install git+https://github.com/flexaihq/flexbench.git

Prerequisites

  • Docker and Docker Compose (the docker compose plugin or standalone docker-compose)
  • NVIDIA Docker runtime (for GPU support)
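
To sanity-check these prerequisites before a first run, the commands below are a minimal sketch (the GPU checks assume an NVIDIA host; skip them on CPU-only machines):

# Docker and the Compose plugin
docker --version
docker compose version

# NVIDIA driver and container runtime (GPU hosts only)
nvidia-smi
docker info | grep -i nvidia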

Quick Start

FlexBench provides a single command with smart defaults for immediate benchmarking:

# View all available options
flexbench --help

# Basic benchmark with default model (HuggingFaceTB/SmolLM2-135M-Instruct) and dataset (ctuning/MLPerf-OpenOrca)
flexbench  # lightweight model for quick testing

# Gated models (requires HuggingFace token)
export HF_TOKEN=your_hf_token_here
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct  # supports --hf-token argument as well

# Larger model with multi-GPU support
flexbench --model-path meta-llama/Llama-3.2-70B-Instruct --gpu-devices "0,1" --tensor-parallel-size 2  # or use CUDA_VISIBLE_DEVICES environment variable

# Force CPU mode
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --device-type cpu

# Specify target QPS (queries per second)
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --target-qps 5

# QPS sweep to find performance limits
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --sweep

# Accuracy evaluation mode (default is performance)
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --mode accuracy

# Full benchmark with both performance and accuracy in sequence
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --mode all

# Use existing vLLM server
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --vllm-server http://localhost:8000  # assumes vLLM server is running

FlexBench automatically handles Docker container orchestration, model loading, benchmarking, and result collection with zero manual setup.

Architecture

FlexBench uses Docker Compose to orchestrate two containers that communicate over a dedicated network:

flowchart TD
    A[FlexBench CLI] -->|Orchestrates| B[Docker Compose Network]

    subgraph B[flexbench-network]
        C[vLLM Server Container]
        D[FlexBench Runner Container]
        D <-->|API Calls| C
    end

    subgraph Host Machine
        E[HuggingFace Cache]
        F[Results Directory]
        G[GPU Devices]
    end

    C -->|Loads Models| E
    D -->|Saves Results| F
    C -->|Uses| G

    H[Existing vLLM Server] -.->|Optional| D
    H -.->|Bypasses container orchestration| B

Container Orchestration:

  • vLLM Server Container: Loads and serves the model via OpenAI-compatible API
  • FlexBench Runner Container: Generates load, collects metrics, and saves results
  • Automatic networking: Containers communicate over a dedicated Docker network
  • GPU allocation: Automatic device detection and resource management
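
During a run you can watch both containers from the host with standard Docker commands. The network name below comes from the diagram above; the container name placeholder is illustrative, so check the docker ps output for the names FlexBench assigns:

# List the containers attached to the FlexBench network
docker ps --filter "network=flexbench-network"

# Follow a container's logs once you know its name
docker logs -f <container-name>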

External Server Option:

  • Use --vllm-server to connect to an existing vLLM server
  • Bypasses vLLM container creation for maximum flexibility
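
A minimal sketch of the external-server path, assuming standard vLLM Docker usage (adjust the port, GPUs, and model to your setup):

# Start a standalone vLLM server (standard vLLM Docker image and flags)
docker run --rm --gpus all --ipc=host -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest --model meta-llama/Llama-3.2-1B-Instruct

# Confirm the OpenAI-compatible API is reachable, then benchmark against it
curl http://localhost:8000/v1/models
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --vllm-server http://localhost:8000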

Inference Scenarios

FlexBench supports multiple inference scenarios based on MLPerf standards:

| Scenario | Description | Load Generation | Use Case |
|---|---|---|---|
| Server | Queries arrive following a Poisson distribution, mimicking real-world load. | Server load generation | Online serving, latency testing |
| Offline | All queries are sent at once, maximizing throughput. | Offline load generation | Throughput benchmarking |
| SingleStream | Queries are processed one at a time, measuring sequential latency (90th percentile). | Single stream load generation | Real-time, interactive, or mobile inference (e.g., autocomplete, AR) |

For more details on the MLPerf Inference Benchmark and the design of modes and metrics, refer to the MLPerf Inference Benchmark paper.
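
To make the Server scenario concrete, the sketch below approximates Poisson arrivals at roughly 2 QPS by sleeping for exponentially distributed gaps between requests. It is an illustration only and assumes an OpenAI-compatible server already serving meta-llama/Llama-3.2-1B-Instruct at localhost:8000; FlexBench's own load generator handles this for you.

QPS=2
for i in $(seq 1 10); do
  # Exponentially distributed inter-arrival time with rate QPS => Poisson arrivals
  sleep "$(awk -v qps=$QPS -v seed=$RANDOM 'BEGIN { srand(seed); print -log(1 - rand()) / qps }')"
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.2-1B-Instruct", "prompt": "Hello", "max_tokens": 16}' > /dev/null &
done
wait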

Device Support

When run without an external vLLM server, FlexBench automatically detects your hardware with --device-type auto (the default):

Detection Priority: CUDA → ROCm → ARM → CPU

| Device Type | Default vLLM Image | Build Method | Hardware |
|---|---|---|---|
| auto | Auto-detected | Varies by detected device | Automatic hardware detection |
| cuda | vllm/vllm-openai:latest | Pull from registry | NVIDIA GPUs |
| rocm (WIP) | rocm/vllm:latest | Pull from registry | AMD GPUs |
| arm | vllm-arm-local:latest | Built from source | ARM processors |
| cpu | public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.9.1 | Pull from registry | CPU-only systems |

Note: ARM devices require building vLLM from source since no pre-built ARM images are available. FlexBench automatically clones the vLLM repository and builds the image locally.
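
To avoid image download time during the first benchmark, you can pre-pull the default image for your device type (names taken from the table above):

# NVIDIA GPUs
docker pull vllm/vllm-openai:latest

# CPU-only systems
docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.9.1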

Force specific device:

# Force CPU even with GPUs available
flexbench --device-type cpu

# Force CUDA with a specific GPU
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --device-type cuda --gpu-devices "1"
# equivalent to setting CUDA_VISIBLE_DEVICES=1 env variable

# Run multiple GPUs with tensor parallelism
flexbench --model-path meta-llama/Llama-3.2-70B-Instruct --device-type cuda --gpu-devices "0,1" --tensor-parallel-size 2

Benchmark Modes

FlexBench supports multiple evaluation modes via --mode:

| Mode | Description | Usage |
|---|---|---|
| performance | Benchmark throughput and latency (default) | --mode performance |
| accuracy | Evaluate model outputs against reference data | --mode accuracy |
| all | Run performance benchmark, then accuracy evaluation | --mode all |

Examples:

# Performance only (default)
flexbench --model-path HuggingFaceTB/SmolLM2-135M-Instruct

# Accuracy evaluation
flexbench --model-path HuggingFaceTB/SmolLM2-135M-Instruct --mode accuracy

# Both modes sequentially
flexbench --model-path HuggingFaceTB/SmolLM2-135M-Instruct --mode all

Default Dataset

FlexBench uses the cTuning/MLPerf-OpenOrca dataset by default - the official MLPerf dataset for text inference benchmarking. Pre-configured column mappings:

  • Input column: question
  • Output column: response (used for accuracy evaluation)
  • System prompt: system_prompt

Override defaults:

# Use custom dataset
flexbench --model-path HuggingFaceTB/SmolLM2-135M-Instruct \
  --dataset-path your-org/your-dataset \
  --dataset-input-column your_input_column \
  --dataset-output-column your_output_column
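
Optionally, you can warm the HuggingFace cache that the vLLM container mounts (see the architecture diagram) so the first run starts faster. This is a sketch assuming huggingface-cli from the huggingface_hub package and the default dataset and model names used above:

# Pre-fetch the default dataset and a small model into the local HuggingFace cache
huggingface-cli download --repo-type dataset ctuning/MLPerf-OpenOrca
huggingface-cli download HuggingFaceTB/SmolLM2-135M-Instruct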

Sweep Mode

Sweep mode automatically discovers your model's performance characteristics by testing multiple QPS levels. It first finds the maximum QPS your model can handle, then runs benchmarks at evenly spaced QPS points between zero and that maximum plus 20%.

Usage:

# Basic sweep with 10 QPS points (default)
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --sweep

# Custom sweep with 5 QPS points
flexbench --model-path meta-llama/Llama-3.2-1B-Instruct --sweep --num-sweep-points 5

All sweep results are saved to a single file.

Note: Sweep mode is incompatible with --target-qps (the QPS range is determined automatically) and with --mode accuracy (sweeps measure performance only).
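
As an illustration of the spacing only (hypothetical numbers; the exact point placement is an assumption): if the discovered maximum were 10 QPS and you requested 5 points, the swept targets would be evenly spaced up to 12 QPS (maximum + 20%):

awk 'BEGIN { max = 10 * 1.2; n = 5; for (i = 1; i <= n; i++) printf "sweep point %d: %.1f QPS\n", i, max * i / n }'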

Using MLCommons CMX automation language

We are developing MLCommons CMX automations to help users prepare, validate, and submit official MLPerf inference results using FlexBench. These automations are based on our MLPerf inference v5.0 submission, featuring DeepSeek-R1-Distill-Llama-8B and vLLM.

License and Copyright

This project is licensed under the Apache License 2.0.

© 2025 FlexAI

Portions of the code were adapted from MLCommons repositories, which are also licensed under the Apache 2.0 license.

Authors and maintainers

Daniel Altunay and Grigori Fursin (FCS Labs)

We would like to thank Dali Kilani, Venkataraju Koppada, Rahul Thangallapally, and other colleagues for their valuable discussions and feedback.

Contributing

We welcome contributions to this project!

If you have ideas, bug reports, or feature requests, please open an issue. To contribute code, feel free to submit a pull request. By contributing, you agree that your contributions will be licensed under the same Apache License 2.0.
