A production-grade benchmark harness that rigorously compares vLLM and SGLang LLM inference engines across latency, throughput, KV-cache efficiency, structured generation, and speculative decoding.
I benchmarked 16 models total (2B–9B parameters) on a single NVIDIA A10G 24 GB GPU, running 5 scenarios across both engines. The 14-model core baseline (table below) drove the headline comparison; the two later-arriving Gemma 4 models (E2B, E4B) were added in a separate block, so they don't appear in the "X / 14" tallies. The baseline plus speculative-decoding suite produced 152 result files at 100% success rate; follow-on phases (variance, concurrency-64, decode sweep, Gemma 4 baseline + ngram) add another ~380 files. Every cell is now complete. Speculative decoding: Ngram worked on Llama 3.1 8B, Qwen3 8B, and Gemma 4 E2B/E4B across both engines; Eagle3 worked on Llama 3.1 8B with vLLM only (SGLang OOM on A10G; Qwen3 8B draft model not yet published). See Benchmark Execution Status below for the per-phase breakdown.
| Metric | vLLM | SGLang |
|---|---|---|
| Lower TTFT (single request) | 13 / 14 models | 1 / 14 |
| Higher throughput (≤4B) | 4 / 6 models | 2 / 6 (incl. Gemma 3 +77%) |
| Higher throughput (7–9B) | — | — (tied within 3%) |
| Structured generation wins | 9 / 14 (plus 3 ties) | 2 / 14 |
| Prefix-sharing TTFT wins | 4 / 14 | 10 / 14 |
| Best single-request TTFT | 20 ms (Gemma 2 2B) | 30 ms (Gemma 2 2B) |
| Peak throughput | 265 tok/s (Gemma 2 2B) | 258 tok/s (Gemma 2 2B) |
Bottom line: vLLM is the stronger general-purpose default on A10G-class hardware — it wins TTFT on nearly every model, wins most small-model throughput matchups by 2–12%, and dominates structured generation. SGLang matches vLLM on 7–9B throughput, has a decisive advantage on Gemma 3 4B (+77% throughput), and wins prefix-sharing TTFT on 10/14 models.
Hardware: AWS g5.2xlarge (NVIDIA A10G 24 GB), sequential execution, one engine at a time
Full reports: reports/final_benchmark_report_2026-03-31.md (latest) · 2026-03-28 · 2026-03-22 · HTML: 03-31 · 03-28 · dated snapshots: 03-31 · 03-28 · 03-22 · all charts/tables: reports/index.html
Supporting analyses: variance · TPOT · goodput · decode-length sweep · decode-length deep-dive · concurrency-64 ramp · cross-model summary · blog companion guides
Figures: spec-dec · decode-length sweep · variance CV · goodput · concurrency-64 throughput · tradeoff map — all regenerable via python -m analysis.generate_*_figure.
Benchmark status: 152 headline result files (140 baseline + 12 speculative-decoding) plus ~380 files from the extended phases (variance, concurrency-64, decode-length sweep, Gemma 4). Two known open items tracked below: Llama 3.1 8B SGLang-Eagle3 (retired nightly image) and a Gemma 4 E2B single/throughput rerun. Full matrix reproduces via scripts/run_all_benchmarks.sh (baseline) and scripts/run_new_benchmarks.sh (extended).
| Phase | Description | Status | Result Files |
|---|---|---|---|
| Baseline | 14 models × 5 scenarios × 2 engines | ✅ Complete | 140 / 140 |
| Speculative decoding | Llama 3.1 8B (Ngram + Eagle3), Qwen3 8B (Ngram) | ⚠️ Complete except Llama sglang-eagle3 (blocked on retired nightly image) | 12 (in results/) |
| Variance subset | 4 models × 5 scenarios × 2 engines × 5 iterations | ✅ Complete | 201 / 200 — CV chart: reports/figures/variance_cv.svg |
| Concurrency-64 ramp | 4 models × throughput_ramp_extended × 2 engines × 1 iteration | ✅ Complete | 8 / 8 (0% error rate) |
| Decode-length sweep (4-model base) | 4 models × 4 lengths × 2 engines × 3 iterations | ✅ Complete | 96 / 96 |
| Decode-length sweep (Gemma 4) | 2 models × 4 lengths × 2 engines × 3 iterations | ✅ Complete | 48 / 48 |
| Gemma 4 baseline + ngram | 2 models (E2B, E4B) × 5 scenarios × 2 engines + ngram spec-dec | ✅ Complete | 28 / 28 |
Prompt ≈ 512 tokens, max_output_tokens ∈ {64, 256, 1024, 4096}, concurrency 8, 180 requests/run. Mean across iterations. Full table: reports/decode_length_sweep_summary.md.
All cells at n=3 iterations after 2026-04-19 top-ups.
| Model | Decode | Engine | n | Tokens/s | TTFT p50 (ms) | TTFT p99 (ms) | Latency p99 (ms) | Err |
|---|---|---|---|---|---|---|---|---|
| gemma-2-2b-it | 64 | sglang | 3 | 519.1 | 39.4 | 67.9 | 918 | 0.009 |
| gemma-2-2b-it | 64 | vllm | 3 | 523.0 | 42.1 | 188.7 | 1108 | 0.000 |
| gemma-2-2b-it | 256 | sglang | 3 | 484.3 | 41.9 | 70.5 | 3577 | 0.004 |
| gemma-2-2b-it | 256 | vllm | 3 | 493.8 | 36.5 | 60.4 | 3587 | 0.000 |
| gemma-2-2b-it | 1024 | sglang | 3 | 469.7 | 37.9 | 56.3 | 12742 | 0.000 |
| gemma-2-2b-it | 1024 | vllm | 3 | 458.0 | 37.3 | 57.4 | 12864 | 0.000 |
| gemma-2-2b-it | 4096 | sglang | 3 | 467.0 | 37.9 | 56.7 | 11044 | 0.000 |
| gemma-2-2b-it | 4096 | vllm | 3 | 459.2 | 37.5 | 53.7 | 12779 | 0.000 |
| phi-4-mini-instruct | 64 | sglang | 3 | 340.1 | 49.2 | 105.2 | 1378 | 0.000 |
| phi-4-mini-instruct | 64 | vllm | 3 | 354.4 | 55.1 | 82.7 | 1321 | 0.000 |
| phi-4-mini-instruct | 256 | sglang | 3 | 333.4 | 46.8 | 76.2 | 5350 | 0.000 |
| phi-4-mini-instruct | 256 | vllm | 3 | 346.2 | 56.0 | 70.5 | 5269 | 0.000 |
| phi-4-mini-instruct | 1024 | sglang | 3 | 322.6 | 47.6 | 73.9 | 22881 | 0.000 |
| phi-4-mini-instruct | 1024 | vllm | 3 | 304.7 | 56.4 | 80.5 | 23149 | 0.000 |
| phi-4-mini-instruct | 4096 | sglang | 3 | 293.5 | 48.3 | 99.9 | 87423 | 0.000 |
| phi-4-mini-instruct | 4096 | vllm | 3 | 287.2 | 56.8 | 74.5 | 79221 | 0.000 |
| gemma-3-4b-it | 64 | sglang | 3 | 280.8 | 128.8 | 155.3 | 1598 | 0.006 |
| gemma-3-4b-it | 64 | vllm | 3 | 146.3 | 128.2 | 2827.0 | 5758 | 0.000 |
| gemma-3-4b-it | 256 | sglang | 3 | 289.0 | 126.3 | 153.4 | 6101 | 0.004 |
| gemma-3-4b-it | 256 | vllm | 3 | 156.7 | 126.8 | 149.5 | 11259 | 0.000 |
| gemma-3-4b-it | 1024 | sglang | 3 | 274.9 | 100.1 | 162.9 | 25977 | 0.000 |
| gemma-3-4b-it | 1024 | vllm | 3 | 152.7 | 122.6 | 150.2 | 45465 | 0.000 |
| gemma-3-4b-it | 4096 | sglang | 3 | 269.3 | 100.2 | 153.5 | 36119 | 0.000 |
| gemma-3-4b-it | 4096 | vllm | 3 | 149.4 | 123.5 | 1886.5 | 65409 | 0.000 |
| llama-3-1-8b-instruct | 64 | sglang | 3 | 191.9 | 69.1 | 108.9 | 2394 | 0.000 |
| llama-3-1-8b-instruct | 64 | vllm | 3 | 189.2 | 96.2 | 128.8 | 2417 | 0.000 |
| llama-3-1-8b-instruct | 256 | sglang | 3 | 190.3 | 69.7 | 111.7 | 9452 | 0.000 |
| llama-3-1-8b-instruct | 256 | vllm | 3 | 189.4 | 93.3 | 126.2 | 9489 | 0.000 |
| llama-3-1-8b-instruct | 1024 | sglang | 3 | 186.5 | 69.1 | 103.6 | 39165 | 0.000 |
| llama-3-1-8b-instruct | 1024 | vllm | 3 | 185.1 | 96.3 | 128.7 | 39359 | 0.000 |
| llama-3-1-8b-instruct | 4096 | sglang | 3 | 157.6 | 106.4 | 99231.8 | 301590 | 0.030 |
| llama-3-1-8b-instruct | 4096 | vllm | 3 | 158.5 | 113.6 | 36139.6 | 283530 | 0.000 |
Observations:
- SGLang has lower TTFT (p50) than vLLM in most cells (gemma-2-2b at 256+ output tokens is the main exception); vLLM edges ahead on small-model decode throughput at short outputs.
- Gemma 3 4B: SGLang ≈ 1.8× vLLM tokens/s at every decode length (n=3 now confirms the Phase 0 finding with tighter CIs). The analysis script reports no throughput crossover — SGLang leads throughout, with vLLM trailing by 44.5% at max_tokens=4096.
- Llama 8B at 4096 tokens: p99 latency blows out to ~5 min and tail TTFT spikes to ~99 s (sglang) / ~36 s (vllm) — the A10G is queue-saturated at concurrency 8 for this size.
- Crossovers surfaced by the analysis script (reports/decode_length_analysis.md; detection logic sketched below): vllm→sglang at max_tokens=1024 for phi-4-mini and gemma-2-2b; sglang→vllm at max_tokens=4096 for Llama-3.1-8B (within CI).
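The crossover check itself is simple; a minimal sketch of the core idea (the shipped analysis/decode_length_analysis.py additionally applies the CI filter mentioned above):

```python
from collections import defaultdict

def find_crossovers(rows):
    """rows: (model, max_tokens, engine, tokens_per_s) tuples, e.g. taken from
    the sweep table above. Prints the decode length where the leader flips."""
    by_model = defaultdict(dict)
    for model, max_tokens, engine, tps in rows:
        by_model[model].setdefault(max_tokens, {})[engine] = tps
    for model, cells in by_model.items():
        prev = None
        for max_tokens in sorted(cells):
            leader = max(cells[max_tokens], key=cells[max_tokens].get)
            if prev and leader != prev:
                print(f"{model}: {prev} -> {leader} at max_tokens={max_tokens}")
            prev = leader

find_crossovers([
    ("phi-4-mini-instruct", 64, "vllm", 354.4), ("phi-4-mini-instruct", 64, "sglang", 340.1),
    ("phi-4-mini-instruct", 1024, "vllm", 304.7), ("phi-4-mini-instruct", 1024, "sglang", 322.6),
])  # -> phi-4-mini-instruct: vllm -> sglang at max_tokens=1024
```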
Single-iteration runs at concurrency levels {1, 4, 8, 16, 32, 64}, 150 req/level (900 total). Prompt 128 tok, output 256 tok. Full table: reports/concurrency64_summary.md. Charts: throughput · TTFT p50 · TTFT p99 · end-to-end latency p99.
| Model | Engine | Succ | Tokens/s | TTFT p50 (ms) | TTFT p99 (ms) | Latency p50 (ms) | Latency p99 (ms) | Err |
|---|---|---|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.3 | vllm | 900/900 | 123.5 | 93.0 | 283.9 | 8686 | 10136 | 0.000 |
| Mistral-7B-Instruct-v0.3 | sglang | 900/900 | 123.6 | 69.0 | 195.1 | 8607 | 10145 | 0.000 |
| Llama-3.1-8B-Instruct | vllm | 900/900 | 117.8 | 97.2 | 235.0 | 9078 | 10516 | 0.000 |
| Llama-3.1-8B-Instruct | sglang | 900/900 | 118.0 | 71.0 | 188.2 | 8993 | 10574 | 0.000 |
| Qwen3-8B | vllm | 900/900 | 113.7 | 103.2 | 232.2 | 9430 | 11683 | 0.000 |
| Qwen3-8B | sglang | 900/900 | 113.9 | 73.5 | 403.0 | 9355 | 11663 | 0.000 |
| google/gemma-2-9b-it† | vllm | 900/900 | 92.2 | 130.9 | 1859.1 | 11723 | 23037 | 0.000 |
| google/gemma-2-9b-it† | sglang | 900/900 | 89.5 | 130.4 | 29426.6 | 11631 | 43557 | 0.000 |
† gemma-2-9b-it on vLLM required --max-model-len 2048 --enforce-eager --gpu-memory-utilization 0.90 to fit the 9B model's KV cache on A10G 24 GB; the default --max-model-len 8192 fails engine-core init (OOM). SGLang fits under default flags.
Key findings:
- Zero errors across all 8 cells at concurrency=64 — A10G 24 GB sustains 7–9B-class models end-to-end at 128/256 prompt/output.
- SGLang has consistently lower median TTFT on the 7–8B models (p50 69–74 ms vs vLLM's 93–103 ms) — a ~25–30 ms edge from RadixAttention prefix lookup. On gemma-2-9b-it the medians are effectively tied (~130 ms).
- vLLM has substantially tighter tail TTFT on gemma-2-9b-it: p99 1859 ms vs SGLang 29427 ms — a ~16× gap. SGLang's tail collapses on this specific model at high concurrency; vLLM is the safer choice for latency-SLO gemma-2-9b serving.
- Throughput is engine-agnostic on all 7–8B models (within 0.2 tok/s). Both engines saturate A10G equivalently on decode once KV cache is warm.
- Gemma-2-9b-it throughput is ~25% lower than Mistral-7B (92 vs 123 tok/s) — expected from the 9B/7B parameter ratio plus the smaller max-model-len for vLLM.
Gemma 4 landed in Transformers 5.5.0 and introduced QK-norm (k_norm/q_norm) on top of the Gemma 3 architecture. Both engines needed careful image selection:
- vLLM — `vllm/vllm-openai:latest` has a native Gemma 4 loader (it derives from the existing Gemma 3 class and handles QK-norm correctly). Works out of the box once `transformers` is upgraded inside the container.
- SGLang — `lmsysorg/sglang:latest` (Apr-09 snapshot) does not have a native Gemma 4 class yet. It falls back to the generic `TransformersMultiModalForCausalLM` wrapper and dies during weight load: `ValueError: No module or parameter named 'model.language_model.layers.15.self_attn.k_norm' in TransformersMultiModalForCausalLM`. Fix: pin `GEMMA4_SGLANG_IMAGE="lmsysorg/sglang:dev-cu13"` (Apr-16 snapshot off `main`) in `scripts/run_new_benchmarks.sh` — that image ships the native Gemma 4 model class and loads the weights cleanly.
- Llama 3.1 8B SGLang-Eagle3 — still blocked on the retired `lmsysorg/sglang:nightly-dev-cu13-20260321-94194537` image. Needs a new nightly pin before retry.
- Gemma 4 E2B rerun — `single_request_latency` and `throughput_ramp` produced 0 output tokens; the other scenarios are clean. A rerun is needed to publish E2B spec-dec throughput numbers.
Re-running any phase is idempotent: each phase block auto-skips cells whose result file already exists, so scripts/run_new_benchmarks.sh --all is safe to launch at any time. Analysis scripts (variance_analysis, tpot_analysis, decode_length_analysis, goodput) regenerate their markdown reports from whatever is on disk — see the Benchmark Scenarios section below for commands.
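A minimal sketch of that resume logic — the result-file naming convention here is illustrative; the real one lives in scripts/run_new_benchmarks.sh:

```python
from pathlib import Path

def cell_already_done(results_dir: str, model: str, scenario: str, engine: str) -> bool:
    """True if this (model, scenario, engine) cell already has a result file
    on disk, in which case the phase block skips it on re-launch."""
    stem = f"{model.replace('/', '_')}__{scenario}__{engine}"
    return any(Path(results_dir).glob(f"{stem}*.json"))
```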
```mermaid
graph TB
subgraph Client["Benchmark Client (Python / asyncio)"]
CLI["run_experiment.py<br/>typer CLI"]
Runner["BenchmarkRunner<br/>asyncio.gather"]
Dashboard["FastAPI Dashboard<br/>port 3000"]
end
subgraph vLLM["vLLM Engine (port 8000)"]
VR["REST API<br/>/v1/completions"]
PA["PagedAttention<br/>Block Manager"]
PC["Prefix Cache<br/>(LRU block reuse)"]
VG["vLLM Scheduler<br/>Continuous Batching"]
VM["Prometheus /metrics"]
VR --> PA --> PC --> VG
end
subgraph SGLang["SGLang Engine (port 8001)"]
SR["REST API<br/>/v1/completions"]
RA["RadixAttention<br/>Trie KV Cache"]
FK["sgl.fork()<br/>Parallel Branches"]
CD["Constrained Decode<br/>regex / JSON schema"]
SI["/get_server_info"]
SR --> RA --> FK
SR --> CD
end
GPU["GPU (NVIDIA A10G 24 GB)<br/>CUDA"]
Runner -->|"httpx SSE"| VR
Runner -->|"httpx SSE"| SR
CLI --> Runner
CLI --> Dashboard
Dashboard -->|"httpx"| VR
Dashboard -->|"httpx"| SR
VG -->|"CUDA kernels"| GPU
FK -->|"CUDA kernels"| GPU
VM -->|"scrape"| Runner
    SI -->|"poll"| Runner
```
```
inference-engine-benchmark-system/
├── engines/
│ ├── base_client.py # Abstract base + GenerationResult / EngineMetrics + retry helper
│ ├── vllm_client.py # vLLM OpenAI-compat client (SSE streaming, Prometheus metrics)
│ ├── sglang_client.py # SGLang client (REST + native sgl.Runtime support)
│ └── py.typed # PEP 561 type marker
│
├── benchmarks/
│ ├── metrics.py # LatencyStats, ThroughputStats, CDF, compare_metrics
│ ├── scenarios.py # Scenario configs + default prompt-pack mapping
│ ├── prompt_packs.py # Prompt-pack loaders (JSONL/JSON)
│ └── runner.py # BenchmarkRunner (asyncio.gather, metrics polling, JSON output)
│
├── sglang_programs/ # Reserved for native @sgl.function programs
│
├── dashboard/
│ └── app.py # FastAPI: REST API + WebSocket live metrics stream
│
├── analysis/
│ ├── report.py # HTML report generator (matplotlib CDF/throughput/KV charts)
│ ├── final_report.py # Aggregated markdown final summary across runs
│ ├── generate_final_benchmark_report.py # Public-facing dated final report builder
│ ├── variance_analysis.py # CV, 95% CI, t-distribution across 5-iteration variance runs
│ ├── tpot_analysis.py # Per-request TPOT P50/P95/P99 from saved result files
│ ├── decode_length_analysis.py # Decode-length sweep crossover + CI analysis
│ └── goodput.py # Joint (TTFT, TPOT) SLO goodput with configurable thresholds
│
├── prompts/
│ ├── short_chat.jsonl # Low-latency chat prompts
│ ├── long_generation.jsonl # Decode-heavy prompts
│ ├── long_context.jsonl # Context-stress prompts
│ ├── structured_json.jsonl # Schema-oriented extraction prompts
│ ├── reasoning.jsonl # Multi-step / reasoning prompts
│ ├── shared_prefix.json # Shared-prefix cache benchmark pack
│ └── schemas/ # JSON schemas referenced by structured prompts
│
├── tests/ # pytest suite (httpx mocking via respx, no live engines needed)
├── results/ # Raw JSON results — baseline (14 models × 5 scenarios × 2 engines)
├── results_variance/ # variance subset (5 iterations per scenario/engine/model)
├── results_concurrency64/ # concurrency-64 extended ramp (7–9B models)
├── results_decode_sweep/ # decode-length sweep (output tokens: 64/256/1024/4096)
├── reports/ # Generated reports and SVG figures
│ └── figures/ # SVG charts (TTFT, throughput, tradeoff)
├── docs/ # Detailed guides (getting started, spec-dec runbook, roadmap)
├── scripts/
│ ├── run_all_benchmarks.sh # 14-model baseline suite (the 152-file headline set)
│ ├── run_new_benchmarks.sh # extended suite: --variance, --concurrency, --decode-sweep, --gemma4
│ └── EXECUTION_GUIDE.md # prerequisites, env-var knobs, troubleshooting
├── deploy/
│ ├── ec2_deploy.sh # Self-contained bash AWS deployment
│ └── terraform/ # Terraform module for team/repeatable workflows
├── run_experiment.py # Typer CLI (run / compare / matrix / report / serve / health)
├── docker-compose.yml # 6 engine profiles: baseline + Eagle3 + Ngram for each engine
├── Dockerfile.dashboard # Lightweight dashboard container
└── pyproject.toml               # Python 3.11+ project metadata
```
```bash
pip install -e ".[dev]"
cp .env.example .env
# Edit .env — add your HUGGING_FACE_HUB_TOKEN for gated models (Llama, Gemma, Mistral)
mkdir -p model-cache
```

```bash
# vLLM
docker compose --profile vllm up -d vllm
curl http://localhost:8000/health

# SGLang
docker compose --profile sglang up -d sglang
curl http://localhost:8001/health
```

On a single A10G, run engines sequentially — start one, benchmark, stop, then switch.
```bash
python run_experiment.py health
python run_experiment.py health --engines vllm
python run_experiment.py health --engines sglang
```

For detailed setup guides, see docs/.
```bash
# Single scenario
python run_experiment.py run --scenario single_request_latency --engines vllm
# Both engines
python run_experiment.py run --scenario throughput_ramp --engines vllm,sglang
# Custom model + prompt pack
python run_experiment.py run \
--scenario prefix_sharing_benefit \
--engines vllm,sglang \
--model Qwen/Qwen2.5-7B-Instruct \
--prompt-pack shared_prefix
# Head-to-head comparison
python run_experiment.py compare --scenario structured_generation_speed
# Sequential matrix (scenario × engine × iteration)
python run_experiment.py matrix \
--model Qwen/Qwen2.5-7B-Instruct \
--scenarios single_request_latency,throughput_ramp \
--engines sglang,vllm \
--iterations 2 --cooldown-seconds 300
# Reports
python run_experiment.py report --output report.html
python run_experiment.py final-report --output final_report.md
# Dashboard
python run_experiment.py serve # http://localhost:3000
# List available scenarios and prompt packs
python run_experiment.py list-scenarios
python run_experiment.py list-prompt-packs
```

| Scenario | Requests | Concurrency | Focus |
|---|---|---|---|
| `single_request_latency` | 50 | 1 | P50/P95/P99 TTFT, pure engine overhead |
| `throughput_ramp` | 100 × 7 levels | 1 → 32 | Max tokens/sec, saturation point |
| `long_context_stress` | 20 | 4 | 8K-token prompts, GPU memory pressure |
| `prefix_sharing_benefit` | 100 | 8 | 60% shared prefix, KV cache reuse |
| `structured_generation_speed` | 200 | 16 | JSON schema-constrained decode |
| `throughput_ramp_extended` | 150 × 6 levels | 1 → 64 | Extended ramp to concurrency 64 (saturation + OOM ceiling) |
| `decode_length_sweep_64` | 180 | 8 | 512-token prompt, 64-token decode |
| `decode_length_sweep_256` | 180 | 8 | 512-token prompt, 256-token decode |
| `decode_length_sweep_1024` | 180 | 8 | 512-token prompt, 1024-token decode |
| `decode_length_sweep_4096` | 180 | 8 | 512-token prompt, 4096-token decode |
Four additional benchmark blocks run on top of the baseline suite:
| Block | Script | Scenarios | Iterations | Output Dir | Status |
|---|---|---|---|---|---|
| Variance subset | `scripts/run_new_benchmarks.sh --variance` | 5 baseline | 5× | `results_variance/` | ✅ Complete (201 files) |
| Concurrency-64 ramp | `scripts/run_new_benchmarks.sh --concurrency` | `throughput_ramp_extended` | 1× | `results_concurrency64/` | ✅ Complete (8 files) |
| Decode-length sweep | `scripts/run_new_benchmarks.sh --decode-sweep` | `decode_length_sweep_{64,256,1024,4096}` | 3× | `results_decode_sweep/` | ✅ Complete (144 files, incl. Gemma 4) |
| Gemma 4 baseline + ngram | `scripts/run_new_benchmarks.sh --gemma4` | 5 baseline + 2 ngram | 1× | `results/` | ✅ Complete (28 files) |
The concurrency-64 ramp pushes 7–9B models up to concurrency=64 to find the saturation point and OOM ceiling. The decode-length sweep fixes the prompt at ~512 tokens and sweeps max_output_tokens ∈ {64, 256, 1024, 4096} to isolate how output length affects throughput. The Gemma 4 block runs the 5 baseline scenarios plus ngram spec-dec on both engines for E2B and E4B.
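For reference, the base sweep is just the cross product of those dimensions — a sketch using the model IDs from the tables above (the final print stands in for the actual runner invocation):

```python
from itertools import product

MODELS = ["google/gemma-2-2b-it", "microsoft/Phi-4-mini-instruct",
          "google/gemma-3-4b-it", "meta-llama/Llama-3.1-8B-Instruct"]
ENGINES = ["vllm", "sglang"]
DECODE_LENGTHS = [64, 256, 1024, 4096]   # max_output_tokens variants
ITERATIONS = 3

cells = list(product(MODELS, ENGINES, DECODE_LENGTHS, range(1, ITERATIONS + 1)))
assert len(cells) == 96  # matches the base-sweep phase count above

for model, engine, max_tokens, iteration in cells:
    scenario = f"decode_length_sweep_{max_tokens}"
    print(f"run {scenario} --engines {engine} --model {model}  # iter {iteration}")
```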
Run all phases in the background:
```bash
nohup bash scripts/run_new_benchmarks.sh 2>&1 | tee logs/new_benchmarks_$(date +%Y%m%dT%H%M%S).log &
```

Analyse results after completion:

```bash
python -m analysis.variance_analysis --results-dir results_variance
python -m analysis.tpot_analysis --results-dir results_variance
python -m analysis.decode_length_analysis --results-dir results_decode_sweep
python -m analysis.goodput --results-dir results_variance
```

Default scenario → pack mapping (override with --prompt-pack):
| Pack | File | Used by |
|---|---|---|
| `short_chat` | `prompts/short_chat.jsonl` | `single_request_latency` |
| `long_generation` | `prompts/long_generation.jsonl` | `throughput_ramp` |
| `long_context` | `prompts/long_context.jsonl` | `long_context_stress` |
| `shared_prefix` | `prompts/shared_prefix.json` | `prefix_sharing_benefit` |
| `structured_json` | `prompts/structured_json.jsonl` | `structured_generation_speed` |
Speculative decoding is an engine startup configuration, not a separate scenario. The same scenarios run against 6 engine variants for apples-to-apples comparison.
| Variant | Engine | Method | Draft model needed |
|---|---|---|---|
| `vllm` | vLLM | Baseline | No |
| `vllm-eagle3` | vLLM | Eagle3 | Yes (~1–2 GB) |
| `vllm-ngram` | vLLM | Ngram | No |
| `sglang` | SGLang | Baseline | No |
| `sglang-eagle3` | SGLang | Eagle3 | Yes (~1–2 GB) |
| `sglang-ngram` | SGLang | Ngram | No |
```bash
# Example: Eagle3 on Llama 3.1 8B with vLLM
export MODEL=meta-llama/Llama-3.1-8B-Instruct
docker compose --profile vllm-eagle3 up -d vllm-eagle3 && sleep 180
python run_experiment.py run -s single_request_latency -e vllm-eagle3 --model $MODEL
docker compose --profile vllm-eagle3 down
```

Full runbook and draft model reference: docs/SPECULATIVE_DECODING.md
| Component | Details |
|---|---|
| GPU | NVIDIA A10G 24 GB |
| Instance | AWS g5.2xlarge (8 vCPU, 32 GB RAM) |
| vLLM | v0.18.0-cu130 |
| SGLang | nightly-dev-cu13-20260321 |
| Precision | bfloat16 |
| Model | Size | Category | Both Engines |
|---|---|---|---|
| google/gemma-2-2b-it | 2B | General | Yes |
| HuggingFaceTB/SmolLM3-3B | 3B | General | Yes |
| meta-llama/Llama-3.2-3B-Instruct | 3B | General | Yes |
| microsoft/Phi-3-mini-4k-instruct | 3.8B | General | Yes |
| google/gemma-3-4b-it | 4B | General | Yes |
| microsoft/Phi-4-mini-instruct | 4B | General | Yes |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 7B | Reasoning | Yes |
| Qwen/Qwen2.5-7B-Instruct | 7B | General | Yes |
| mistralai/Mistral-7B-Instruct-v0.3 | 7B | General | Yes |
| meta-llama/Llama-3.1-8B-Instruct | 8B | General | Yes |
| Qwen/Qwen3-8B | 8B | General | Yes |
| ibm-granite/granite-3.3-8b-instruct | 8B | Enterprise | Yes |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | 8B | Reasoning | Yes |
| google/gemma-2-9b-it | 9B | General | Yes |
Not benchmarked: Qwen3-30B-A3B (~60 GB at bf16) and Gemma 3 12B (~24 GB weights, no KV cache headroom) exceed A10G capacity.
| Model | Min GPU | Notes |
|---|---|---|
| Mistral Small 3.2 24B | A100 40 GB | Strong multilingual |
| Qwen3 32B | A100 80 GB | Top open-weight at 32B |
| Llama 3.3 70B | 2× A100 80 GB | Full Eagle3 draft support |
Interactive version (hover for exact values, zoom, toggle engines):
`reports/figures/throughput_tradeoff_interactive.html`

Bubble size = model parameter count. Top-left is ideal: high throughput, low latency.
TTFT and per-request decode speed at concurrency 1. Lower TTFT is better; higher tok/s is better.
| Model | vLLM TTFT | SGLang TTFT | vLLM tok/s | SGLang tok/s |
|---|---|---|---|---|
| gemma-2-2b-it | 20 ms | 30 ms | 77.6 | 78.2 |
| smollm3-3b | 24 ms | 57 ms | 69.2 | 63.4 |
| llama-3.2-3b-instruct | 23 ms | 32 ms | 66.3 | 67.7 |
| phi-3-mini-4k-instruct | 25 ms | 43 ms | 57.8 | 55.7 |
| gemma-3-4b-it | 87 ms | 78 ms | 23.8 | 45.0 |
| phi-4-mini-instruct | 33 ms | 40 ms | 56.8 | 52.7 |
| deepseek-r1-distill-qwen-7b | 40 ms | 66 ms | 30.5 | 30.9 |
| qwen2.5-7b-instruct | 41 ms | 66 ms | 30.6 | 30.9 |
| mistral-7b-instruct-v0.3 | 41 ms | 62 ms | 31.8 | 31.8 |
| llama-3.1-8b-instruct | 43 ms | 69 ms | 30.3 | 30.3 |
| qwen3-8b | 44 ms | 72 ms | 29.2 | 29.4 |
| granite-3.3-8b-instruct | 46 ms | 76 ms | 27.7 | 27.6 |
| deepseek-r1-distill-llama-8b | 42 ms | 69 ms | 30.3 | 30.3 |
| gemma-2-9b-it | 74 ms | 83 ms | 24.0 | 24.1 |
vLLM wins TTFT on 13/14 models. The exception is Gemma 3 4B, where vLLM requires --enforce-eager (disabling CUDA graphs due to its interleaved sliding-window + global attention), giving SGLang a 9 ms edge. In decode speed (tok/s), differences at concurrency 1 are negligible — both engines are equally GPU-bound.
Peak tokens/second during throughput ramp (concurrency 1 → 32). Higher is better.
| Model | vLLM tok/s | SGLang tok/s | Winner |
|---|---|---|---|
| gemma-2-2b-it | 265 | 258 | vLLM +3% |
| smollm3-3b | 230 | 205 | vLLM +12% |
| llama-3.2-3b-instruct | 223 | 226 | SGLang +1% |
| phi-3-mini-4k-instruct | 191 | 187 | vLLM +2% |
| gemma-3-4b-it | 84 | 149 | SGLang +77% |
| phi-4-mini-instruct | 189 | 176 | vLLM +7% |
| deepseek-r1-distill-qwen-7b | 106 | 106 | Tie |
| qwen2.5-7b-instruct | 105 | 106 | SGLang +1% |
| mistral-7b-instruct-v0.3 | 107 | 107 | Tie |
| llama-3.1-8b-instruct | 102 | 102 | Tie |
| qwen3-8b | 98 | 99 | SGLang +1% |
| granite-3.3-8b-instruct | 93 | 93 | Tie |
| deepseek-r1-distill-llama-8b | 102 | 102 | Tie |
| gemma-2-9b-it | 80 | 78 | vLLM +3% |
vLLM wins on 4 of the 6 ≤4B models (by 2–12%); Llama 3.2 3B is a 1% SGLang edge, and Gemma 3 4B a large one (see below). At 7–9B scale, engines converge to the same GPU-bottlenecked ceiling.
Anomaly — Gemma 3 4B vLLM (84 tok/s vs SGLang 149): vLLM must run with --enforce-eager, disabling CUDA graph capture for Gemma 3's interleaved sliding-window attention. This prevents kernel fusion at high concurrency, causing 2,137 s total wall time vs SGLang's 1,200 s for the same 179K tokens. Not a scheduler issue — it's a CUDA graph incompatibility between the benchmarked vLLM build and this architecture.

Anomaly — SmolLM3 3B SGLang (205 vs vLLM 230): SGLang is slower on SmolLM3 despite being competitive at larger scales. SmolLM3 uses an updated HuggingFace architecture that vLLM's kernel selection handled more efficiently at the time of benchmarking.
Anomaly — Gemma 2 9B SGLang p95 (46,027 ms vs vLLM 14,399 ms): The p95 tail latency under throughput ramp is ~3× worse for SGLang on Gemma 2 9B. Gemma 2 uses alternating local/global attention layers; SGLang's continuous batch scheduler appears to stall under high queue depth for this attention pattern, causing severe tail-latency spikes. The median and throughput numbers are comparable — this is a scheduling outlier under extreme concurrency, not a general regression.
Performance with 8,192-token input prompts (20 requests, ~4 concurrent). Tests KV cache handling under memory pressure. tok/s = total output tokens generated per second across all concurrent requests.
| Model | vLLM TTFT | SGLang TTFT | vLLM tok/s | SGLang tok/s |
|---|---|---|---|---|
| gemma-2-2b-it | 34 ms | 40 ms | 311.5 | 290.6 |
| smollm3-3b | 42 ms | 71 ms | 265.0 | 234.6 |
| llama-3.2-3b-instruct | 40 ms | 43 ms | 254.5 | 253.6 |
| phi-3-mini-4k-instruct | 53 ms | 48 ms | 231.3 | 211.6 |
| gemma-3-4b-it | 127 ms | 75 ms | 98.0 | 180.2 |
| phi-4-mini-instruct | 47 ms | 38 ms | 212.0 | 211.8 |
| deepseek-r1-distill-qwen-7b | 87 ms | 58 ms | 121.2 | 125.9 |
| qwen2.5-7b-instruct | 91 ms | 59 ms | 118.4 | 126.0 |
| mistral-7b-instruct-v0.3 | 94 ms | 93 ms | 117.4 | 112.8 |
| llama-3.1-8b-instruct | 90 ms | 63 ms | 115.7 | 115.6 |
| qwen3-8b | 102 ms | 70 ms | 110.7 | 115.5 |
| granite-3.3-8b-instruct | 110 ms | 113 ms | 100.7 | 99.1 |
| deepseek-r1-distill-llama-8b | 92 ms | 63 ms | 115.8 | 115.3 |
| gemma-2-9b-it | 125 ms | 125 ms | 80.8 | 82.5 |
SGLang wins long-context TTFT on 8/14 models, particularly at 7–9B scale. This contrasts with single-request latency where vLLM dominates. On decode throughput, vLLM leads for ≤3B models while SGLang edges ahead at 7–9B — consistent with the throughput ramp pattern. Gemma 3 4B is again the outlier: SGLang delivers 180 tok/s vs vLLM's 98 due to the same --enforce-eager constraint.
KV cache reuse across 100 requests with 60% shared prefix. tok/s = total output tokens per second across all concurrent requests.
| Model | vLLM TTFT | SGLang TTFT | vLLM tok/s | SGLang tok/s |
|---|---|---|---|---|
| gemma-2-2b-it | 44 ms | 40 ms | 567.2 | 557.9 |
| smollm3-3b | 42 ms | 72 ms | 522.0 | 398.6 |
| llama-3.2-3b-instruct | 50 ms | 41 ms | 489.1 | 489.3 |
| phi-3-mini-4k-instruct | 59 ms | 57 ms | 397.6 | 396.6 |
| gemma-3-4b-it | 121 ms | 103 ms | 178.6 | 326.9 |
| phi-4-mini-instruct | 51 ms | 56 ms | 403.9 | 375.6 |
| deepseek-r1-distill-qwen-7b | 87 ms | 78 ms | 235.7 | 235.1 |
| qwen2.5-7b-instruct | 90 ms | 92 ms | 228.9 | 233.0 |
| mistral-7b-instruct-v0.3 | 93 ms | 93 ms | 228.2 | 220.6 |
| llama-3.1-8b-instruct | 93 ms | 65 ms | 219.4 | 217.8 |
| qwen3-8b | 95 ms | 59 ms | 212.2 | 210.4 |
| granite-3.3-8b-instruct | 110 ms | 66 ms | 199.0 | 198.1 |
| deepseek-r1-distill-llama-8b | 94 ms | 55 ms | 219.5 | 225.8 |
| gemma-2-9b-it | 128 ms | 125 ms | 167.2 | 157.4 |
SGLang wins prefix-sharing TTFT on 10/14 models. Its radix-tree KV cache provides superior prefix reuse, shaving 20–40 ms at 7–9B scale. Decode throughput favours vLLM for most models once prefix overhead is amortised; Gemma 3 4B is again the exception (SGLang +83%) for the same --enforce-eager reason.
JSON-constrained generation throughput across 200 requests. Higher tok/s is better.
| Model | vLLM tok/s | SGLang tok/s | Winner |
|---|---|---|---|
| gemma-2-2b-it | 1,225 | 957 | vLLM +28% |
| smollm3-3b | 930 | 774 | vLLM +20% |
| llama-3.2-3b-instruct | 970 | 909 | vLLM +7% |
| phi-3-mini-4k-instruct | 785 | 783 | Tie |
| gemma-3-4b-it | 340 | 617 | SGLang +81% |
| phi-4-mini-instruct | 736 | 669 | vLLM +10% |
| deepseek-r1-distill-qwen-7b | 452 | 451 | Tie |
| qwen2.5-7b-instruct | 456 | 384 | vLLM +19% |
| mistral-7b-instruct-v0.3 | 440 | 411 | vLLM +7% |
| llama-3.1-8b-instruct | 426 | 423 | vLLM +1% |
| qwen3-8b | 398 | 398 | Tie |
| granite-3.3-8b-instruct | 368 | 379 | SGLang +3% |
| deepseek-r1-distill-llama-8b | 426 | 422 | vLLM +1% |
| gemma-2-9b-it | 317 | 290 | vLLM +9% |
vLLM dominates structured generation — 9 outright wins and 3 ties across 14 models, losing only Gemma 3 4B and Granite. Most pronounced on smaller models (Gemma 2 2B: +28%, SmolLM3: +20%).
TPOT (Time Per Output Token) = inter-token decode latency after the first token: (total_ms − ttft_ms) / max(output_tokens − 1, 1). Computed per request from existing result data; no re-runs required. Full per-scenario tables: reports/tpot_analysis.md.
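The per-request computation is a one-liner; a minimal sketch with the percentile step included (the record tuples are illustrative, not real measurements):

```python
import statistics

def tpot_ms(total_ms: float, ttft_ms: float, output_tokens: int) -> float:
    # TPOT = inter-token decode latency after the first token,
    # exactly as defined above.
    return (total_ms - ttft_ms) / max(output_tokens - 1, 1)

# Illustrative per-request records: (total_ms, ttft_ms, output_tokens)
records = [(2150.0, 43.0, 64), (2230.0, 48.0, 64), (2310.0, 45.0, 64), (2405.0, 51.0, 64)]
samples = sorted(tpot_ms(*r) for r in records)
cuts = statistics.quantiles(samples, n=100)  # 99 cut points
print(f"TPOT p50={cuts[49]:.1f} ms  p99={cuts[98]:.1f} ms")
```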
At serial load, TPOT reflects raw GPU decode speed — engines are near-identical for every model except Gemma 3 4B, where vLLM's --enforce-eager constraint doubles decode time.
| Model | vLLM P50 | vLLM P99 | SGLang P50 | SGLang P99 |
|---|---|---|---|---|
| gemma-2-2b-it | 12.9 | 13.1 | 12.8 | 12.9 |
| smollm3-3b | 14.6 | 14.7 | 15.7 | 15.7 |
| llama-3.2-3b-instruct | 15.0 | 15.0 | 14.7 | 14.7 |
| phi-3-mini-4k-instruct | 17.4 | 17.4 | 17.7 | 18.0 |
| gemma-3-4b-it | 41.0 | 42.3 | 21.7 | 21.9 |
| phi-4-mini-instruct | 17.8 | 18.0 | 18.8 | 19.3 |
| deepseek-r1-distill-qwen-7b | 32.7 | 33.1 | 32.4 | 32.4 |
| qwen2.5-7b-instruct | 32.7 | 33.3 | 32.3 | 32.8 |
| mistral-7b-instruct-v0.3 | 31.5 | 31.8 | 31.4 | 31.7 |
| llama-3.1-8b-instruct | 35.7 | 40.3 | 33.0 | 38.9 |
| qwen3-8b | 35.6 | 37.2 | 37.0 | 40.9 |
| granite-3.3-8b-instruct | 36.1 | 36.1 | 36.0 | 36.0 |
| deepseek-r1-distill-llama-8b | 33.1 | 33.1 | 32.9 | 33.0 |
| gemma-2-9b-it | 41.4 | 42.4 | 41.2 | 42.2 |
Takeaway: At concurrency 1, both engines are GPU-bound equally. TPOT tracks model size. The sole outlier is Gemma 3 4B: SGLang achieves 21.7 ms vs vLLM's 41.0 ms — the same CUDA graph incompatibility that drives the throughput gap.
Under high concurrency, TPOT P99 reveals scheduling behaviour. vLLM holds tail latency significantly better on larger models.
| Model | vLLM P99 | SGLang P99 | SGLang / vLLM |
|---|---|---|---|
| gemma-2-2b-it | 17.6 | 18.4 | 1.05× |
| smollm3-3b | 21.2 | 41.9 | 2.0× worse |
| llama-3.2-3b-instruct | 20.6 | 21.3 | 1.04× |
| phi-3-mini-4k-instruct | 30.7 | 29.1 | 0.95× |
| gemma-3-4b-it | 44.6 | 56.8 | 1.27× worse |
| phi-4-mini-instruct | 28.9 | 32.7 | 1.13× worse |
| qwen2.5-7b-instruct | 37.6 | 36.9 | 0.98× |
| mistral-7b-instruct-v0.3 | 39.0 | 39.1 | 1.00× |
| llama-3.1-8b-instruct | 92.7 | 220.8 | 2.4× worse |
| qwen3-8b | 54.3 | 256.2 | 4.7× worse |
| granite-3.3-8b-instruct | 47.5 | 48.7 | 1.03× |
| deepseek-r1-distill-llama-8b | 40.4 | 41.1 | 1.02× |
| deepseek-r1-distill-qwen-7b | 35.8 | 36.8 | 1.03× |
| gemma-2-9b-it | 72.1 | 80.1 | 1.11× worse |
Takeaway: vLLM tail latency is substantially more stable at 7–9B scale under high concurrency. SGLang P99 TPOT spikes to 4.7× vLLM on Qwen3 8B and 2.4× on Llama 3.1 8B — the same scheduler stall behaviour that produces its P95 tail-latency anomaly on Gemma 2 9B. At ≤4B, the gap closes to <5% for most models.
Goodput = requests/sec satisfying both SLOs simultaneously, summed across all scenarios. Re-run with different thresholds: python -m analysis.goodput --ttft-slo-ms <X> --tpot-slo-ms <Y>.
The chart uses the stricter TTFT ≤ 100 ms / TPOT ≤ 35 ms pair from reports/goodput_slo100_35.md; the table below uses the 200 ms / 40 ms pair.
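A minimal sketch of the joint-SLO computation (the per-request record shape is assumed; analysis/goodput.py derives it from saved result files):

```python
def goodput_rps(records, wall_seconds, ttft_slo_ms=200.0, tpot_slo_ms=40.0):
    """records: iterable of (ttft_ms, tpot_ms) per request.
    A request counts toward goodput only if it meets BOTH SLOs."""
    good = sum(1 for ttft, tpot in records
               if ttft <= ttft_slo_ms and tpot <= tpot_slo_ms)
    return good / wall_seconds

# e.g. 180 requests finished in 120 s, 171 inside both SLOs -> 1.43 rps goodput
```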
| Model | vLLM goodput | SGLang goodput | SLO pass % (vLLM / SGLang) |
|---|---|---|---|
| gemma-2-2b-it | 1.47 rps | 1.32 rps | 99.9% / 97.0% |
| smollm3-3b | 1.12 rps | 0.90 rps | 98.8% / 88.9% |
| llama-3.2-3b-instruct | 1.08 rps | 1.11 rps | 98.2% / 99.9% |
| phi-3-mini-4k-instruct | 0.90 rps | 0.92 rps | 96.0% / 99.8% |
| gemma-3-4b-it | 0.004 rps | 0.60 rps | 1.0% / 81.4% |
| phi-4-mini-instruct | 0.94 rps | 1.00 rps | 98.8% / 98.9% |
| deepseek-r1-distill-qwen-7b | 0.51 rps | 0.47 rps | 97.1% / 91.3% |
| qwen2.5-7b-instruct | 0.52 rps | 0.53 rps | 97.1% / 98.6% |
| mistral-7b-instruct-v0.3 | 0.51 rps | 0.53 rps | 95.7% / 99.9% |
| llama-3.1-8b-instruct | 0.28 rps | 0.42 rps | 60.4% / 71.7% |
| qwen3-8b | 0.34 rps | 0.37 rps | 78.4% / 50.9% |
| granite-3.3-8b-instruct | 0.25 rps | 0.24 rps | 54.3% / 53.8% |
| deepseek-r1-distill-llama-8b | 0.47 rps | 0.47 rps | 94.0% / 93.1% |
| gemma-2-9b-it | 0 rps | 0 rps | 0% / 0% — TPOT ~44 ms exceeds SLO |
Takeaways:
- Gemma 2 2B / SmolLM3 — vLLM leads by 10–12% in goodput (CUDA graphs, lower TPOT variance).
- Gemma 3 4B — SGLang delivers 150× the goodput of vLLM (0.60 vs 0.004 rps) because vLLM almost never meets the TPOT SLO without CUDA graphs.
- Llama 3.1 8B — SGLang wins goodput (0.42 vs 0.28 rps) despite vLLM's lower serial TTFT, because SGLang's better TTFT under concurrent load keeps more requests inside the 200 ms window.
- Gemma 2 9B — neither engine meets a 40 ms TPOT SLO (native TPOT ~44 ms); relax it to ≤50 ms to get meaningful results.
| Variant | Engine | TTFT (med) | Single tok/s | Peak Throughput |
|---|---|---|---|---|
| Baseline | vLLM | 43 ms | 30.3 | 102 tok/s |
| Baseline | SGLang | 67 ms | 30.3 | 102 tok/s |
| Ngram | vLLM | 42 ms | 27.8 | 96 tok/s |
| Ngram | SGLang | 39 ms | 26.0 | 73 tok/s |
| Eagle3 | vLLM | 48 ms | 24.6 | 82 tok/s |
| Eagle3 | SGLang | — | — | — |
Eagle3 draft model (vLLM): `RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3`

SGLang Eagle3 exceeds A10G 24 GB capacity (main + draft + KV cache). Future scope on ≥40 GB GPUs.
| Variant | Engine | TTFT (med) | Single tok/s | Peak Throughput |
|---|---|---|---|---|
| Baseline | vLLM | 44 ms | 29.2 | 98 tok/s |
| Baseline | SGLang | 72 ms | 29.4 | 99 tok/s |
| Ngram | vLLM | 44 ms | 26.8 | 91 tok/s |
| Ngram | SGLang | 40 ms | 25.6 | 64 tok/s |
Eagle3 not tested — `RedHatAI/Qwen3-8B-speculator.eagle3` is not yet published.
| Variant | Engine | TTFT p50 | Single tok/s | Peak Throughput |
|---|---|---|---|---|
| Baseline | vLLM | 84 ms | 24.5 | 83.8 tok/s |
| Baseline | SGLang | 87 ms | 24.8 | 81.3 tok/s |
| Ngram | vLLM | 47 ms | 20.7 | 77.1 tok/s |
| Ngram | SGLang | 48 ms | 23.2 | 58.0 tok/s |
Ngram cuts TTFT by ~45% on both engines for E4B; peak throughput slightly regresses (SGLang-ngram −29%), matching the "spec-dec hurts on A10G" pattern seen on Llama/Qwen.
TTFT numbers land in the same regime (vLLM baseline 71 ms, Ngram 41 ms; SGLang baseline 72 ms, Ngram 40 ms), but the single_request_latency and throughput_ramp result files for E2B report total_tokens_generated=0 — a data-quality issue in those two scenarios only (long-context, prefix-sharing, and structured-generation E2B runs are clean). Decode-throughput numbers for E2B spec-dec are therefore not published here and need a rerun. See docs/RUN_STATUS.md.
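The zero-token condition is easy to scan for. A minimal sketch, assuming total_tokens_generated sits at the top level of each result JSON (the field name comes from the run status above; the exact layout is an assumption):

```python
import json
from pathlib import Path

def zero_token_results(results_dir: str = "results") -> list[Path]:
    """Flag result files whose run generated no output tokens at all."""
    bad = []
    for path in sorted(Path(results_dir).glob("*.json")):
        data = json.loads(path.read_text())
        if data.get("total_tokens_generated") == 0:
            bad.append(path)
    return bad

for p in zero_token_results():
    print(f"needs rerun: {p}")
```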
Regenerate with `python -m analysis.generate_spec_decoding_figure` after new spec-dec runs land. The plotly-based interactive variant is out of date (still Llama/Qwen only) and pending a refresh.
How key metrics scale with model size (best engine, single request):
| Size | TTFT range | tok/s range | Peak throughput |
|---|---|---|---|
| 2–3B | 20–57 ms | 62–78 tok/s | 230–265 tok/s |
| 3.8–4B | 25–87 ms | 24–57 tok/s | 84–191 tok/s |
| 7B | 40–66 ms | 30–32 tok/s | 105–107 tok/s |
| 8B | 42–76 ms | 28–30 tok/s | 93–102 tok/s |
| 9B | 74–83 ms | 24 tok/s | 78–80 tok/s |
TTFT grows ~4× from 2B to 9B. Throughput drops ~3×. The steepest jump is at 7B where VRAM pressure begins on 24 GB.
vLLM wins TTFT at low concurrency. 13/14 models, 20–60% faster to first token. CUDA graph execution eliminates kernel launch overhead.
Throughput converges at 7–9B. Both engines hit the same GPU-bottlenecked ceiling. Differences <3%.
vLLM wins small-model throughput on 4 of 6 ≤4B models. SmolLM3 3B: 230 vs 205 (+12%), Phi-4 mini: 189 vs 176 (+7%), Gemma 2 2B: 265 vs 258 (+3%).
Gemma 3 is SGLang's strongest case. vLLM requires --enforce-eager for hybrid attention, giving SGLang +77% throughput (149 vs 84 tok/s). Architectural compatibility issue, not fundamental engine difference.
SGLang wins prefix sharing. Radix-tree KV cache provides better prefix reuse — wins TTFT on 10/14 models.
vLLM dominates structured generation. 9 wins and 3 ties out of 14. Gap ranges from marginal (<1%) to substantial (+28%).
Speculative decoding hurts on A10G. Ngram: vLLM −7%, SGLang −28%. Eagle3: vLLM −20%. Draft proposal overhead exceeds decode savings. Constrained --max-model-len 2048 limits batch efficiency. Better realized on ≥40 GB GPUs.
| Use Case | Recommendation | Why |
|---|---|---|
| Latency-sensitive serving | vLLM | Wins TTFT on 13/14 models |
| Structured/JSON output | vLLM | Wins or ties on 12/14 models |
| Prefix-heavy workloads (RAG) | SGLang | Wins prefix-sharing TTFT on 10/14 |
| High-throughput batch (7B+) | Either | Tied within 3% |
| Gemma 3 models | SGLang | +77% throughput (vLLM CUDA graph limitation) |
- KV cache split into fixed-size pages (blocks), managed by a block allocator
- Prefix cache: LRU reuse of blocks for repeated prompt prefixes
- Continuous batching: adds/removes requests mid-batch for high utilisation
- Metrics exposed via Prometheus at `/metrics`
- SSE streaming at `/v1/completions` (OpenAI-compatible) — see the client sketch below
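For concreteness, a minimal httpx SSE client that times the first streamed event (the payload follows the OpenAI completions format; this is an illustrative sketch, not the harness's own client code):

```python
import time
import httpx

def measure_ttft_ms(prompt: str, base_url: str = "http://localhost:8000") -> float:
    """Stream a completion and time the first SSE data event (≈ TTFT)."""
    payload = {"model": "google/gemma-2-2b-it", "prompt": prompt,
               "max_tokens": 64, "stream": True}
    start = time.perf_counter()
    with httpx.stream("POST", f"{base_url}/v1/completions",
                      json=payload, timeout=120) as resp:
        for line in resp.iter_lines():
            if line.startswith("data: ") and line != "data: [DONE]":
                return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended without a token")

print(f"TTFT: {measure_ttft_ms('Hello, world.'):.1f} ms")
```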
- KV cache stored as a radix tree (trie) keyed on token sequences
- All in-flight requests share the trie — automatic prefix deduplication
- `sgl.fork()` creates parallel decode branches sharing the same KV prefix (see the sketch below)
- Constrained decode built-in: regex / JSON schema enforces valid tokens
- Metrics via the `/get_server_info` JSON endpoint
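A minimal sketch of the fork pattern using SGLang's frontend DSL as documented upstream (`@sgl.function`, `fork`, `gen`); the prompt text and generation names here are illustrative, not taken from this repo:

```python
import sglang as sgl

@sgl.function
def two_views(s, document):
    # The shared document prefix is inserted once; both branches
    # reuse its KV cache through the radix tree.
    s += "Document: " + document + "\n"
    forks = s.fork(2)
    forks[0] += "One-sentence summary: " + sgl.gen("summary", max_tokens=48)
    forks[1] += "Three keywords: " + sgl.gen("keywords", max_tokens=16)
    # Reading fork outputs blocks until both branches finish.
    s += "Summary: " + forks[0]["summary"] + "\nKeywords: " + forks[1]["keywords"]

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:8001"))
state = two_views.run(document="SGLang stores KV cache as a radix tree keyed on token sequences.")
print(state.text())
```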
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/` | Browser-friendly dashboard home |
| `GET` | `/api/results` | List saved result files (`?model=...` optional) |
| `GET` | `/api/results/{id}` | Load a specific result |
| `GET` | `/api/current` | Detect currently running benchmark + active services |
| `GET` | `/api/compare/{scenario}` | vLLM + SGLang delta for a scenario |
| `POST` | `/api/run` | Start a background benchmark run |
| `GET` | `/api/run/{job_id}/status` | Poll run progress |
| `WS` | `/ws/live` | Real-time metric stream (JSON messages) |
| Environment Variable | Default | Description |
|---|---|---|
| `HUGGING_FACE_HUB_TOKEN` | — | HF token for gated models |
| `VLLM_HOST` / `VLLM_PORT` | `localhost` / `8000` | vLLM server |
| `SGLANG_HOST` / `SGLANG_PORT` | `localhost` / `8001` | SGLang server |
| `RESULTS_DIR` | `results/` | JSON result file directory |
| `ALLOWED_ORIGINS` | `http://localhost:3000` | CORS origins for dashboard |
| `LOG_FORMAT` | `console` | `console` (colored) or `json` (structured) |
| `LOG_LEVEL` | `INFO` | `DEBUG`, `INFO`, `WARNING`, `ERROR` |
```bash
pytest tests/ -v                          # All tests (no live engines needed)
pytest tests/ --cov=engines --cov=benchmarks --cov-report=term-missing
```

Two deployment options: a self-contained bash script and a Terraform module.
deploy/ec2_deploy.sh handles everything end-to-end with only the AWS CLI and jq.
```bash
# Single GPU (~$1.21/hr)
./deploy/ec2_deploy.sh --mode single --key my-key-pair --region us-east-1
# Two dedicated GPUs (~$2.46/hr)
./deploy/ec2_deploy.sh --mode multi --key my-key-pair --hf-token hf_TOKEN --region us-east-1
# Teardown
./deploy/ec2_deploy.sh --destroy
```

```bash
cd deploy/terraform
terraform init
terraform apply \
-var="key_pair_name=my-key" \
-var="your_ip_cidr=$(curl -s https://checkip.amazonaws.com)/32" \
-var="hf_token=hf_TOKEN" \
-var="deployment_mode=single"| Mode | Instance(s) | Monthly est. (8h/day × 22 days) |
|---|---|---|
| Single | 1× g5.2xlarge | ~$213 |
| Multi | 2× g5.2xlarge + 1× t3.medium | ~$435 |
| Single Spot | 1× g5.2xlarge (spot) | ~$64–$100 |
```bash
# Teardown
terraform destroy -var="key_pair_name=my-key" -var="your_ip_cidr=$(curl -s https://checkip.amazonaws.com)/32"
```

Use terraform.tfvars to avoid typing variables repeatedly. See deploy/terraform/ for the full variable reference.
| Model | Notes |
|---|---|
| Gemma 3 4B (vLLM) | Requires --enforce-eager --disable-frontend-multiprocessing — hybrid sliding-window + full attention incompatible with CUDA graph capture |
| Gemma 2 9B (vLLM) | Requires --max-model-len 4096 --gpu-memory-utilization 0.92 to fit on A10G |
| Llama 3.1 8B Eagle3 | Requires --gpu-memory-utilization 0.95 --enforce-eager --max-model-len 2048 — main + draft use ~16.8 GiB |
| SGLang Eagle3 | OOM on A10G for all tested models — main + draft + KV cache exceeds 24 GB |
| Qwen3-30B-A3B | Not benchmarked — ~60 GB at bf16, exceeds A10G capacity |
| Gemma 3 12B | Not benchmarked — ~24 GB weights, no KV cache headroom |
| Symptom | Fix |
|---|---|
| SSH connection refused | Check your_ip_cidr matches current IP |
| curl localhost:8000/health hangs | Model still downloading — check docker compose logs -f vllm |
| GPU not visible in Docker | Run nvidia-smi. If it fails, reboot |
| Out of GPU memory | Reduce --gpu-memory-utilization or --max-model-len |
| Bootstrap failed | Check /var/log/benchmark-setup.log |
| Dashboard 502 | Verify port 3000 is open in security group |
Two entry-point scripts, depending on what you want:
```bash
# ─── Option A: 14-model baseline only (the 152-file headline result set) ───
# Simple, sequential, one engine at a time. Skips any model with ≥10 files.
chmod +x scripts/run_all_benchmarks.sh
tmux new -s bench
./scripts/run_all_benchmarks.sh 2>&1 | tee logs/run_$(date +%Y%m%dT%H%M%S).log
./scripts/run_all_benchmarks.sh --force # ignore existing results
# ─── Option B: extended phases (variance, concurrency-64, decode sweep, Gemma 4) ───
# Everything beyond the baseline — idempotent, resume-safe.
bash scripts/run_new_benchmarks.sh --all
bash scripts/run_new_benchmarks.sh --variance # variance (4 models × 5 iter)
bash scripts/run_new_benchmarks.sh --concurrency # concurrency-64 ramp
bash scripts/run_new_benchmarks.sh --decode-sweep # decode-length sweep
bash scripts/run_new_benchmarks.sh --gemma4 # Gemma 4 baseline + ngram
# Generate summary reports after runs finish
conda run -n base python -m analysis.generate_final_benchmark_report
```

Full walk-through, env-var knobs, and troubleshooting: scripts/EXECUTION_GUIDE.md.
- SGLang Eagle3 on ≥40 GB GPU (A100/H100) — OOMs on A10G
- Eagle3 for Qwen3-8B once `RedHatAI/Qwen3-8B-speculator.eagle3` is published
- Quantized models (AWQ/GPTQ) — test if spec-dec becomes viable at lower precision
- Multi-GPU tensor parallel benchmarks
- Nightly CI benchmark regression runs (current CI only gates lint + unit tests)
See CONTRIBUTING.md for guidelines on adding models, scenarios, or hardware results.