vLLM vs SGLang — Inference Engine Benchmark System

A production-grade benchmark harness that rigorously compares vLLM and SGLang LLM inference engines across latency, throughput, KV-cache efficiency, structured generation, and speculative decoding.

Summary

I benchmarked 16 models total (2B–9B parameters) on a single NVIDIA A10G 24 GB GPU, running 5 scenarios across both engines. The 14-model core baseline (table below) drove the headline comparison; the two later-arriving Gemma 4 models (E2B, E4B) were added in a separate block, so they don't appear in the "X / 14" tallies. The baseline plus speculative-decoding suite produced 152 result files at 100% success rate; follow-on phases (variance, concurrency-64, decode sweep, Gemma 4 baseline + ngram) add another ~380 files. Every cell is now complete. Speculative decoding: Ngram worked on Llama 3.1 8B, Qwen3 8B, and Gemma 4 E2B/E4B across both engines; Eagle3 worked on Llama 3.1 8B with vLLM only (SGLang OOM on A10G; Qwen3 8B draft model not yet published). See Benchmark Execution Status below for the per-phase breakdown.

Metric vLLM SGLang
Lower TTFT (single request) 13 / 14 models 1 / 14
Higher throughput (≤4B) 5 / 6 models 1 / 6 (Gemma 3)
Higher throughput (7–9B) — (tied within 3%)
Structured generation wins 12 / 14 2 / 14
Prefix-sharing TTFT wins 4 / 14 10 / 14
Best single-request TTFT 20 ms (Gemma 2 2B) 30 ms (Gemma 2 2B)
Peak throughput 265 tok/s (Gemma 2 2B) 258 tok/s (Gemma 2 2B)

Bottom line: vLLM is the stronger general-purpose default on A10G-class hardware — wins TTFT on nearly every model, wins small-model throughput by 3–12%, and dominates structured generation. SGLang matches vLLM on 7–9B throughput, has a decisive advantage on Gemma 3 4B (+77% throughput), and wins prefix-sharing TTFT on 10/14 models.

Hardware: AWS g5.2xlarge (NVIDIA A10G 24 GB), sequential execution, one engine at a time.

Full reports: reports/final_benchmark_report_2026-03-31.md (latest) · 2026-03-28 · 2026-03-22 · HTML: 03-31 · 03-28 · dated snapshots: 03-31 · 03-28 · 03-22 · all charts/tables: reports/index.html.

Supporting analyses: variance · TPOT · goodput · decode-length sweep · decode-length deep-dive · concurrency-64 ramp · cross-model summary · blog companion guides.

Figures: spec-dec · decode-length sweep · variance CV · goodput · concurrency-64 throughput · tradeoff map — all regenerable via python -m analysis.generate_*_figure.

Benchmark status: 152 headline result files (140 baseline + 12 speculative-decoding) plus ~380 files from the extended phases (variance, concurrency-64, decode-length sweep, Gemma 4). Two known open items tracked below: Llama 3.1 8B SGLang-Eagle3 (retired nightly image) and a Gemma 4 E2B single/throughput rerun. The full matrix reproduces via scripts/run_all_benchmarks.sh (baseline) and scripts/run_new_benchmarks.sh (extended).


Benchmark Execution Status

Phase Description Status Result Files
Baseline 14 models × 5 scenarios × 2 engines ✅ Complete 140 / 140
Speculative decoding Llama 3.1 8B (Ngram + Eagle3), Qwen3 8B (Ngram) ✅ Complete (except Llama sglang-eagle3 — blocked on missing nightly image) In results/
Variance subset 4 models × 5 scenarios × 2 engines × 5 iterations ✅ Complete 201 / 200 — CV chart: reports/figures/variance_cv.svg
Concurrency-64 ramp 4 models × throughput_ramp_extended × 2 engines × 1 iteration ✅ Complete 8 / 8 (0% error rate)
Decode-length sweep (4-model base) 4 models × 4 lengths × 2 engines × 3 iterations ✅ Complete 96 / 96
Decode-length sweep (Gemma 4) 2 models × 4 lengths × 2 engines × 3 iterations ✅ Complete 48 / 48
Gemma 4 baseline + ngram 2 models (E2B, E4B) × 5 scenarios × 2 engines + ngram spec-dec ✅ Complete 28 / 28

Decode-Length Sweep Results

Prompt ≈ 512 tokens, max_output_tokens ∈ {64, 256, 1024, 4096}, concurrency 8, 180 requests/run. Mean across iterations. Full table: reports/decode_length_sweep_summary.md.

All cells at n=3 iterations after 2026-04-19 top-ups.

Decode-length sweep: tokens/sec vs max_output_tokens

Model Decode Engine n Tokens/s TTFT p50 (ms) TTFT p99 (ms) Latency p99 (ms) Err
gemma-2-2b-it 64 sglang 3 519.1 39.4 67.9 918 0.009
gemma-2-2b-it 64 vllm 3 523.0 42.1 188.7 1108 0.000
gemma-2-2b-it 256 sglang 3 484.3 41.9 70.5 3577 0.004
gemma-2-2b-it 256 vllm 3 493.8 36.5 60.4 3587 0.000
gemma-2-2b-it 1024 sglang 3 469.7 37.9 56.3 12742 0.000
gemma-2-2b-it 1024 vllm 3 458.0 37.3 57.4 12864 0.000
gemma-2-2b-it 4096 sglang 3 467.0 37.9 56.7 11044 0.000
gemma-2-2b-it 4096 vllm 3 459.2 37.5 53.7 12779 0.000
phi-4-mini-instruct 64 sglang 3 340.1 49.2 105.2 1378 0.000
phi-4-mini-instruct 64 vllm 3 354.4 55.1 82.7 1321 0.000
phi-4-mini-instruct 256 sglang 3 333.4 46.8 76.2 5350 0.000
phi-4-mini-instruct 256 vllm 3 346.2 56.0 70.5 5269 0.000
phi-4-mini-instruct 1024 sglang 3 322.6 47.6 73.9 22881 0.000
phi-4-mini-instruct 1024 vllm 3 304.7 56.4 80.5 23149 0.000
phi-4-mini-instruct 4096 sglang 3 293.5 48.3 99.9 87423 0.000
phi-4-mini-instruct 4096 vllm 3 287.2 56.8 74.5 79221 0.000
gemma-3-4b-it 64 sglang 3 280.8 128.8 155.3 1598 0.006
gemma-3-4b-it 64 vllm 3 146.3 128.2 2827.0 5758 0.000
gemma-3-4b-it 256 sglang 3 289.0 126.3 153.4 6101 0.004
gemma-3-4b-it 256 vllm 3 156.7 126.8 149.5 11259 0.000
gemma-3-4b-it 1024 sglang 3 274.9 100.1 162.9 25977 0.000
gemma-3-4b-it 1024 vllm 3 152.7 122.6 150.2 45465 0.000
gemma-3-4b-it 4096 sglang 3 269.3 100.2 153.5 36119 0.000
gemma-3-4b-it 4096 vllm 3 149.4 123.5 1886.5 65409 0.000
llama-3-1-8b-instruct 64 sglang 3 191.9 69.1 108.9 2394 0.000
llama-3-1-8b-instruct 64 vllm 3 189.2 96.2 128.8 2417 0.000
llama-3-1-8b-instruct 256 sglang 3 190.3 69.7 111.7 9452 0.000
llama-3-1-8b-instruct 256 vllm 3 189.4 93.3 126.2 9489 0.000
llama-3-1-8b-instruct 1024 sglang 3 186.5 69.1 103.6 39165 0.000
llama-3-1-8b-instruct 1024 vllm 3 185.1 96.3 128.7 39359 0.000
llama-3-1-8b-instruct 4096 sglang 3 157.6 106.4 99231.8 301590 0.030
llama-3-1-8b-instruct 4096 vllm 3 158.5 113.6 36139.6 283530 0.000

Observations:

  • SGLang has consistently lower TTFT (p50) than vLLM; vLLM edges ahead on small-model decode throughput at short outputs.
  • Gemma 3 4B: SGLang ≈ 1.8× vLLM tokens/s at every decode length (n=3 now confirms the Phase 0 finding with tighter CIs). Analysis script reports no throughput crossover — SGLang leads throughout; 44.5% gap at max_tokens=4096.
  • Llama 8B at 4096 tokens: p99 latency blows out to ~5 min and tail TTFT spikes to ~99 s (sglang) / ~36 s (vllm) — the A10G is queue-saturated at concurrency 8 for this size.
  • Crossovers surfaced by the analysis script (reports/decode_length_analysis.md): vllm→sglang at max_tokens=1024 for phi-4-mini and gemma-2-2b; sglang→vllm at max_tokens=4096 for Llama-3.1-8B (within CI).
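The crossover check in the last bullet just compares which engine leads at consecutive decode lengths. A minimal sketch of that logic (illustrative data only; the real analysis, including confidence-interval handling, lives in analysis/decode_length_analysis.py):

# Minimal crossover check (illustrative; real logic incl. CIs is in analysis/decode_length_analysis.py)
from collections import defaultdict

# mean tokens/s keyed by (model, decode_len, engine), e.g. parsed from the sweep table above
means = {
    ("phi-4-mini-instruct", 256, "vllm"): 346.2,
    ("phi-4-mini-instruct", 256, "sglang"): 333.4,
    ("phi-4-mini-instruct", 1024, "vllm"): 304.7,
    ("phi-4-mini-instruct", 1024, "sglang"): 322.6,
}

def crossovers(means):
    by_model = defaultdict(dict)
    for (model, decode, engine), tps in means.items():
        by_model[model].setdefault(decode, {})[engine] = tps
    flips = []
    for model, rows in by_model.items():
        prev = None
        for decode in sorted(rows):
            leader = max(rows[decode], key=rows[decode].get)
            if prev is not None and leader != prev[1]:
                flips.append((model, prev[0], decode, f"{prev[1]}→{leader}"))
            prev = (decode, leader)
    return flips

print(crossovers(means))  # [('phi-4-mini-instruct', 256, 1024, 'vllm→sglang')]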

Concurrency-64 Results

Single-iteration runs at concurrency levels {1, 4, 8, 16, 32, 64}, 150 req/level (900 total). Prompt 128 tok, output 256 tok. Full table: reports/concurrency64_summary.md. Charts: throughput · TTFT p50 · TTFT p99 · end-to-end latency p99.

Model Engine Succ Tokens/s TTFT p50 (ms) TTFT p99 (ms) Latency p50 (ms) Latency p99 (ms) Err
Mistral-7B-Instruct-v0.3 vllm 900/900 123.5 93.0 283.9 8686 10136 0.000
Mistral-7B-Instruct-v0.3 sglang 900/900 123.6 69.0 195.1 8607 10145 0.000
Llama-3.1-8B-Instruct vllm 900/900 117.8 97.2 235.0 9078 10516 0.000
Llama-3.1-8B-Instruct sglang 900/900 118.0 71.0 188.2 8993 10574 0.000
Qwen3-8B vllm 900/900 113.7 103.2 232.2 9430 11683 0.000
Qwen3-8B sglang 900/900 113.9 73.5 403.0 9355 11663 0.000
google/gemma-2-9b-it† vllm 900/900 92.2 130.9 1859.1 11723 23037 0.000
google/gemma-2-9b-it† sglang 900/900 89.5 130.4 29426.6 11631 43557 0.000

† gemma-2-9b-it vLLM required --max-model-len 2048 --enforce-eager --gpu-memory-utilization 0.90 to fit the 9B param KV cache on A10G 24 GB; default --max-model-len 8192 fails engine-core init (OOM). SGLang fits under default flags.

Key findings:

  • Zero errors across all 8 cells at concurrency=64 — A10G 24 GB sustains 7–9B-class models end-to-end at 128/256 prompt/output.
  • SGLang has consistently lower median TTFT (p50 69–74 ms vs vLLM's 93–131 ms) across every 7–9B model — a ~25–30 ms edge from RadixAttention prefix lookup.
  • vLLM has substantially tighter tail TTFT on gemma-2-9b-it: p99 1859 ms vs SGLang 29427 ms — a ~16× gap. SGLang's tail collapses on this specific model at high concurrency; vLLM is the safer choice for latency-SLO gemma-2-9b serving.
  • Throughput is engine-agnostic on all 7–8B models (within 0.2 tok/s). Both engines saturate A10G equivalently on decode once KV cache is warm.
  • Gemma-2-9b-it throughput is ~24% lower than Mistral-7B (92 vs 123 tok/s) — expected from the 9B/7B parameter ratio plus the smaller max-model-len for vLLM.

Notes on getting Gemma 4 working

Gemma 4 landed in Transformers 5.5.0 and introduced QK-norm (k_norm/q_norm) on top of the Gemma 3 architecture. Both engines needed careful image selection:

  • vLLM — vllm/vllm-openai:latest has a native Gemma 4 loader (it derives from the existing Gemma 3 class and handles QK-norm correctly). Works out of the box once transformers is upgraded inside the container.
  • SGLang — lmsysorg/sglang:latest (Apr-09 snapshot) does not have a native Gemma 4 class yet. It falls back to the generic TransformersMultiModalForCausalLM wrapper and dies during weight load:
    ValueError: No module or parameter named
      'model.language_model.layers.15.self_attn.k_norm'
      in TransformersMultiModalForCausalLM
    
    Fix: pin GEMMA4_SGLANG_IMAGE="lmsysorg/sglang:dev-cu13" (Apr-16 snapshot off main) in scripts/run_new_benchmarks.sh. That image ships the native Gemma 4 model class and loads the weights cleanly.

Open items

  • Llama 3.1 8B SGLang-Eagle3 — still blocked on the retired lmsysorg/sglang:nightly-dev-cu13-20260321-94194537 image. Needs a new nightly pin before retry.
  • Gemma 4 E2B rerun — single_request_latency and throughput_ramp produced 0 output tokens; other scenarios are clean. Rerun needed to publish E2B spec-dec throughput numbers.

Re-running any phase is idempotent: each phase block auto-skips cells whose result file already exists, so scripts/run_new_benchmarks.sh --all is safe to launch at any time. Analysis scripts (variance_analysis, tpot_analysis, decode_length_analysis, goodput) regenerate their markdown reports from whatever is on disk — see the Benchmark Scenarios section below for commands.


Architecture Diagram

graph TB
    subgraph Client["Benchmark Client (Python / asyncio)"]
        CLI["run_experiment.py<br/>typer CLI"]
        Runner["BenchmarkRunner<br/>asyncio.gather"]
        Dashboard["FastAPI Dashboard<br/>port 3000"]
    end

    subgraph vLLM["vLLM Engine (port 8000)"]
        VR["REST API<br/>/v1/completions"]
        PA["PagedAttention<br/>Block Manager"]
        PC["Prefix Cache<br/>(LRU block reuse)"]
        VG["vLLM Scheduler<br/>Continuous Batching"]
        VM["Prometheus /metrics"]
        VR --> PA --> PC --> VG
    end

    subgraph SGLang["SGLang Engine (port 8001)"]
        SR["REST API<br/>/v1/completions"]
        RA["RadixAttention<br/>Trie KV Cache"]
        FK["sgl.fork()<br/>Parallel Branches"]
        CD["Constrained Decode<br/>regex / JSON schema"]
        SI["/get_server_info"]
        SR --> RA --> FK
        SR --> CD
    end

    GPU["GPU (NVIDIA A10G 24 GB)<br/>CUDA"]

    Runner -->|"httpx SSE"| VR
    Runner -->|"httpx SSE"| SR
    CLI --> Runner
    CLI --> Dashboard
    Dashboard -->|"httpx"| VR
    Dashboard -->|"httpx"| SR
    VG -->|"CUDA kernels"| GPU
    FK -->|"CUDA kernels"| GPU
    VM -->|"scrape"| Runner
    SI -->|"poll"| Runner
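The runner measures TTFT and decode throughput directly off the SSE stream of each engine's OpenAI-compatible /v1/completions endpoint. A minimal sketch of that measurement loop (simplified — the real clients in engines/ add retries and metrics polling, and benchmarks/runner.py drives concurrency via asyncio.gather):

# Minimal TTFT / tokens-per-second measurement over the OpenAI-compatible SSE stream
# (simplified sketch; see engines/vllm_client.py and engines/sglang_client.py for the real clients)
import asyncio
import time
import httpx

async def measure(base_url: str, model: str, prompt: str, max_tokens: int = 128) -> dict:
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens, "stream": True}
    start = time.perf_counter()
    ttft = None
    chunks = 0
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", f"{base_url}/v1/completions", json=payload) as resp:
            resp.raise_for_status()
            async for line in resp.aiter_lines():
                if not line.startswith("data: ") or line.endswith("[DONE]"):
                    continue
                if ttft is None:
                    ttft = time.perf_counter() - start   # first streamed token
                chunks += 1                              # roughly one token per SSE chunk
    total = time.perf_counter() - start
    return {"ttft_ms": (ttft or total) * 1000, "tok_per_s": chunks / max(total - (ttft or 0), 1e-6)}

# asyncio.run(measure("http://localhost:8000", "google/gemma-2-2b-it", "Hello"))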

Project Structure

inference-engine-benchmark-system/
├── engines/
│   ├── base_client.py          # Abstract base + GenerationResult / EngineMetrics + retry helper
│   ├── vllm_client.py          # vLLM OpenAI-compat client (SSE streaming, Prometheus metrics)
│   ├── sglang_client.py        # SGLang client (REST + native sgl.Runtime support)
│   └── py.typed                # PEP 561 type marker
│
├── benchmarks/
│   ├── metrics.py              # LatencyStats, ThroughputStats, CDF, compare_metrics
│   ├── scenarios.py            # Scenario configs + default prompt-pack mapping
│   ├── prompt_packs.py         # Prompt-pack loaders (JSONL/JSON)
│   └── runner.py               # BenchmarkRunner (asyncio.gather, metrics polling, JSON output)
│
├── sglang_programs/            # Reserved for native @sgl.function programs
│
├── dashboard/
│   └── app.py                  # FastAPI: REST API + WebSocket live metrics stream
│
├── analysis/
│   ├── report.py                      # HTML report generator (matplotlib CDF/throughput/KV charts)
│   ├── final_report.py                # Aggregated markdown final summary across runs
│   ├── generate_final_benchmark_report.py  # Public-facing dated final report builder
│   ├── variance_analysis.py           # CV, 95% CI, t-distribution across 5-iteration variance runs
│   ├── tpot_analysis.py               # Per-request TPOT P50/P95/P99 from saved result files
│   ├── decode_length_analysis.py      # Decode-length sweep crossover + CI analysis
│   └── goodput.py                     # Joint (TTFT, TPOT) SLO goodput with configurable thresholds
│
├── prompts/
│   ├── short_chat.jsonl        # Low-latency chat prompts
│   ├── long_generation.jsonl   # Decode-heavy prompts
│   ├── long_context.jsonl      # Context-stress prompts
│   ├── structured_json.jsonl   # Schema-oriented extraction prompts
│   ├── reasoning.jsonl         # Multi-step / reasoning prompts
│   ├── shared_prefix.json      # Shared-prefix cache benchmark pack
│   └── schemas/                # JSON schemas referenced by structured prompts
│
├── tests/                      # pytest suite (httpx mocking via respx, no live engines needed)
├── results/                    # Raw JSON results — baseline (14 models × 5 scenarios × 2 engines)
├── results_variance/           # variance subset (5 iterations per scenario/engine/model)
├── results_concurrency64/      # concurrency-64 extended ramp (7–9B models)
├── results_decode_sweep/       # decode-length sweep (output tokens: 64/256/1024/4096)
├── reports/                    # Generated reports and SVG figures
│   └── figures/                # SVG charts (TTFT, throughput, tradeoff)
├── docs/                       # Detailed guides (getting started, spec-dec runbook, roadmap)
├── scripts/
│   ├── run_all_benchmarks.sh      # 14-model baseline suite (the 152-file headline set)
│   ├── run_new_benchmarks.sh      # extended suite: --variance, --concurrency, --decode-sweep, --gemma4
│   └── EXECUTION_GUIDE.md         # prerequisites, env-var knobs, troubleshooting
├── deploy/
│   ├── ec2_deploy.sh           # Self-contained bash AWS deployment
│   └── terraform/              # Terraform module for team/repeatable workflows
├── run_experiment.py           # Typer CLI (run / compare / matrix / report / serve / health)
├── docker-compose.yml          # 6 engine profiles: baseline + Eagle3 + Ngram for each engine
├── Dockerfile.dashboard        # Lightweight dashboard container
└── pyproject.toml              # Python 3.11+ project metadata

Quick Start

1. Install

pip install -e ".[dev]"

2. Configure environment

cp .env.example .env
# Edit .env — add your HUGGING_FACE_HUB_TOKEN for gated models (Llama, Gemma, Mistral)
mkdir -p model-cache

3. Start one engine at a time (single GPU)

# vLLM
docker compose --profile vllm up -d vllm
curl http://localhost:8000/health

# SGLang
docker compose --profile sglang up -d sglang
curl http://localhost:8001/health

On a single A10G, run engines sequentially — start one, benchmark, stop, then switch.

4. Check engine health

python run_experiment.py health
python run_experiment.py health --engines vllm
python run_experiment.py health --engines sglang

For detailed setup guides, see docs/ (getting started, speculative-decoding runbook, roadmap) and scripts/EXECUTION_GUIDE.md.


CLI Usage

# Single scenario
python run_experiment.py run --scenario single_request_latency --engines vllm

# Both engines
python run_experiment.py run --scenario throughput_ramp --engines vllm,sglang

# Custom model + prompt pack
python run_experiment.py run \
  --scenario prefix_sharing_benefit \
  --engines vllm,sglang \
  --model Qwen/Qwen2.5-7B-Instruct \
  --prompt-pack shared_prefix

# Head-to-head comparison
python run_experiment.py compare --scenario structured_generation_speed

# Sequential matrix (scenario × engine × iteration)
python run_experiment.py matrix \
  --model Qwen/Qwen2.5-7B-Instruct \
  --scenarios single_request_latency,throughput_ramp \
  --engines sglang,vllm \
  --iterations 2 --cooldown-seconds 300

# Reports
python run_experiment.py report --output report.html
python run_experiment.py final-report --output final_report.md

# Dashboard
python run_experiment.py serve    # http://localhost:3000

# List available scenarios and prompt packs
python run_experiment.py list-scenarios
python run_experiment.py list-prompt-packs

Benchmark Scenarios

Core scenarios (registered in benchmarks/scenarios.py)

Scenario Requests Concurrency Focus
single_request_latency 50 1 P50/P95/P99 TTFT, pure engine overhead
throughput_ramp 100×7 levels 1 → 32 Max tokens/sec, saturation point
long_context_stress 20 4 8K-token prompts, GPU memory pressure
prefix_sharing_benefit 100 8 60% shared prefix, KV cache reuse
structured_generation_speed 200 16 JSON schema-constrained decode
throughput_ramp_extended 150×6 levels 1 → 64 Extended ramp to concurrency 64 (saturation + OOM ceiling)
decode_length_sweep_64 180 8 512-token prompt, 64-token decode
decode_length_sweep_256 180 8 512-token prompt, 256-token decode
decode_length_sweep_1024 180 8 512-token prompt, 1024-token decode
decode_length_sweep_4096 180 8 512-token prompt, 4096-token decode

Extended Benchmark Phases

Four additional benchmark blocks run on top of the baseline suite:

Block Script Scenarios Iterations Output Dir Status
Variance subset scripts/run_new_benchmarks.sh --variance 5 baseline results_variance/ ✅ Complete (201 files)
Concurrency-64 ramp scripts/run_new_benchmarks.sh --concurrency throughput_ramp_extended results_concurrency64/ ✅ Complete (8 files)
Decode-length sweep scripts/run_new_benchmarks.sh --decode-sweep decode_length_sweep_{64,256,1024,4096} results_decode_sweep/ ✅ Complete (144 files, incl. Gemma 4)
Gemma 4 baseline + ngram scripts/run_new_benchmarks.sh --gemma4 5 baseline + 2 ngram results/ ✅ Complete (28 files)

The concurrency-64 ramp pushes 7–9B models up to concurrency=64 to find the saturation point and OOM ceiling. The decode-length sweep fixes the prompt at ~512 tokens and sweeps max_output_tokens ∈ {64, 256, 1024, 4096} to isolate how output length affects throughput. The Gemma 4 block runs the 5 baseline scenarios plus ngram spec-dec on both engines for E2B and E4B.
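A hypothetical sketch of how a decode-length scenario can be expressed — field names are illustrative only; the real definitions live in benchmarks/scenarios.py:

# Illustrative only — the actual config objects in benchmarks/scenarios.py may differ
from dataclasses import dataclass

@dataclass
class ScenarioConfig:
    name: str
    num_requests: int
    concurrency: int
    prompt_tokens: int
    max_output_tokens: int

DECODE_SWEEP = [
    ScenarioConfig(name=f"decode_length_sweep_{n}", num_requests=180, concurrency=8,
                   prompt_tokens=512, max_output_tokens=n)
    for n in (64, 256, 1024, 4096)
]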

Run all phases in the background:

nohup bash scripts/run_new_benchmarks.sh 2>&1 | tee logs/new_benchmarks_$(date +%Y%m%dT%H%M%S).log &

Analyse results after completion:

python -m analysis.variance_analysis      --results-dir results_variance
python -m analysis.tpot_analysis          --results-dir results_variance
python -m analysis.decode_length_analysis --results-dir results_decode_sweep
python -m analysis.goodput                --results-dir results_variance

Prompt Packs

Default scenario → pack mapping (override with --prompt-pack):

Pack File Used by
short_chat prompts/short_chat.jsonl single_request_latency
long_generation prompts/long_generation.jsonl throughput_ramp
long_context prompts/long_context.jsonl long_context_stress
shared_prefix prompts/shared_prefix.json prefix_sharing_benefit
structured_json prompts/structured_json.jsonl structured_generation_speed
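Packs are plain JSONL/JSON files loaded by benchmarks/prompt_packs.py. A minimal loader sketch — the "prompt" field name is an assumption, not the confirmed schema:

# Minimal JSONL prompt-pack loader sketch (field name assumed; real loaders in benchmarks/prompt_packs.py)
import json
from pathlib import Path

def load_prompt_pack(path: str) -> list[str]:
    lines = Path(path).read_text().splitlines()
    return [json.loads(line)["prompt"] for line in lines if line.strip()]

# prompts = load_prompt_pack("prompts/short_chat.jsonl")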

Speculative Decoding

Speculative decoding is an engine startup configuration, not a separate scenario. The same scenarios run against 6 engine variants for apples-to-apples comparison.

Variant Engine Method Draft model needed
vllm vLLM Baseline No
vllm-eagle3 vLLM Eagle3 Yes (~1–2 GB)
vllm-ngram vLLM Ngram No
sglang SGLang Baseline No
sglang-eagle3 SGLang Eagle3 Yes (~1–2 GB)
sglang-ngram SGLang Ngram No

# Example: Eagle3 on Llama 3.1 8B with vLLM
export MODEL=meta-llama/Llama-3.1-8B-Instruct
docker compose --profile vllm-eagle3 up -d vllm-eagle3 && sleep 180
python run_experiment.py run -s single_request_latency -e vllm-eagle3 --model $MODEL
docker compose --profile vllm-eagle3 down

Full runbook and draft model reference: docs/SPECULATIVE_DECODING.md


Models Tested

Hardware & Software

Component Details
GPU NVIDIA A10G 24 GB
Instance AWS g5.2xlarge (8 vCPU, 32 GB RAM)
vLLM v0.18.0-cu130
SGLang nightly-dev-cu13-20260321
Precision bfloat16

A10G 24 GB — All 14 Models Benchmarked

Model Size Category Both Engines
google/gemma-2-2b-it 2B General Yes
HuggingFaceTB/SmolLM3-3B 3B General Yes
meta-llama/Llama-3.2-3B-Instruct 3B General Yes
microsoft/Phi-3-mini-4k-instruct 3.8B General Yes
google/gemma-3-4b-it 4B General Yes
microsoft/Phi-4-mini-instruct 4B General Yes
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B 7B Reasoning Yes
Qwen/Qwen2.5-7B-Instruct 7B General Yes
mistralai/Mistral-7B-Instruct-v0.3 7B General Yes
meta-llama/Llama-3.1-8B-Instruct 8B General Yes
Qwen/Qwen3-8B 8B General Yes
ibm-granite/granite-3.3-8b-instruct 8B Enterprise Yes
deepseek-ai/DeepSeek-R1-Distill-Llama-8B 8B Reasoning Yes
google/gemma-2-9b-it 9B General Yes

Not benchmarked: Qwen3-30B-A3B (~60 GB at bf16) and Gemma 3 12B (~24 GB weights, no KV cache headroom) exceed A10G capacity.

A100/H100 (larger hardware — future scope)

Model Min GPU Notes
Mistral Small 3.2 24B A100 40 GB Strong multilingual
Qwen3 32B A100 80 GB Top open-weight at 32B
Llama 3.3 70B 2× A100 80 GB Full Eagle3 draft support

Benchmark Results

Visual Summary

Single-request latency (TTFT P95)

Throughput (tokens/sec)

Throughput (requests/sec)

Throughput vs latency tradeoff

Interactive version (hover for exact values, zoom, toggle engines): reports/figures/throughput_tradeoff_interactive.html. Bubble size = model parameter count. Top-left is ideal: high throughput, low latency.

P95 latency under load


1. Single Request Latency

TTFT and per-request decode speed at concurrency 1. Lower TTFT is better; higher tok/s is better.

Model vLLM TTFT SGLang TTFT vLLM tok/s SGLang tok/s
gemma-2-2b-it 20 ms 30 ms 77.6 78.2
smollm3-3b 24 ms 57 ms 69.2 63.4
llama-3.2-3b-instruct 23 ms 32 ms 66.3 67.7
phi-3-mini-4k-instruct 25 ms 43 ms 57.8 55.7
gemma-3-4b-it 87 ms 78 ms 23.8 45.0
phi-4-mini-instruct 33 ms 40 ms 56.8 52.7
deepseek-r1-distill-qwen-7b 40 ms 66 ms 30.5 30.9
qwen2.5-7b-instruct 41 ms 66 ms 30.6 30.9
mistral-7b-instruct-v0.3 41 ms 62 ms 31.8 31.8
llama-3.1-8b-instruct 43 ms 69 ms 30.3 30.3
qwen3-8b 44 ms 72 ms 29.2 29.4
granite-3.3-8b-instruct 46 ms 76 ms 27.7 27.6
deepseek-r1-distill-llama-8b 42 ms 69 ms 30.3 30.3
gemma-2-9b-it 74 ms 83 ms 24.0 24.1

vLLM wins TTFT on 13/14 models. The exception is Gemma 3 4B, where vLLM requires --enforce-eager (disabling CUDA graphs due to its interleaved sliding-window + global attention), giving SGLang a 9 ms edge. Per-request decode speed (tok/s) differences are negligible at concurrency 1 — both engines are equally GPU-bound.


2. Sustained Throughput

Peak tokens/second during throughput ramp (concurrency 1 → 32). Higher is better.

Model vLLM tok/s SGLang tok/s Winner
gemma-2-2b-it 265 258 vLLM +3%
smollm3-3b 230 205 vLLM +12%
llama-3.2-3b-instruct 223 226 SGLang +1%
phi-3-mini-4k-instruct 191 187 vLLM +2%
gemma-3-4b-it 84 149 SGLang +77%
phi-4-mini-instruct 189 176 vLLM +7%
deepseek-r1-distill-qwen-7b 106 106 Tie
qwen2.5-7b-instruct 105 106 SGLang +1%
mistral-7b-instruct-v0.3 107 107 Tie
llama-3.1-8b-instruct 102 102 Tie
qwen3-8b 98 99 SGLang +1%
granite-3.3-8b-instruct 93 93 Tie
deepseek-r1-distill-llama-8b 102 102 Tie
gemma-2-9b-it 80 78 vLLM +3%

vLLM wins on ≤4B models (3–12%). At 7–9B scale, engines converge to the same GPU-bottlenecked ceiling.

Anomaly — Gemma 3 4B vLLM (84 tok/s vs SGLang 149): vLLM must run with --enforce-eager, disabling CUDA graph capture for Gemma 3's interleaved sliding-window attention. This prevents kernel fusion at high concurrency, causing 2,137 s total wall time vs SGLang's 1,200 s for the same 179K tokens. Not a scheduler issue — it's a CUDA graph incompatibility in vLLM 0.6.x with this architecture.

Anomaly — SmolLM3 3B SGLang (205 vs vLLM 230): SGLang trails by 12% here — its largest deficit among the small models. SmolLM3 uses a recently updated HuggingFace architecture that vLLM's kernel selection handled more efficiently at the time of benchmarking.

Anomaly — Gemma 2 9B SGLang p95 (46,027 ms vs vLLM 14,399 ms): The p95 tail latency under throughput ramp is ~3× worse for SGLang on Gemma 2 9B. Gemma 2 uses alternating local/global attention layers; SGLang's continuous batch scheduler appears to stall under high queue depth for this attention pattern, causing severe tail-latency spikes. The median and throughput numbers are comparable — this is a scheduling outlier under extreme concurrency, not a general regression.


3. Long Context Stress (8K Tokens)

Performance with 8,192-token input prompts (20 requests, ~4 concurrent). Tests KV cache handling under memory pressure. tok/s = total output tokens generated per second across all concurrent requests.

Model vLLM TTFT SGLang TTFT vLLM tok/s SGLang tok/s
gemma-2-2b-it 34 ms 40 ms 311.5 290.6
smollm3-3b 42 ms 71 ms 265.0 234.6
llama-3.2-3b-instruct 40 ms 43 ms 254.5 253.6
phi-3-mini-4k-instruct 53 ms 48 ms 231.3 211.6
gemma-3-4b-it 127 ms 75 ms 98.0 180.2
phi-4-mini-instruct 47 ms 38 ms 212.0 211.8
deepseek-r1-distill-qwen-7b 87 ms 58 ms 121.2 125.9
qwen2.5-7b-instruct 91 ms 59 ms 118.4 126.0
mistral-7b-instruct-v0.3 94 ms 93 ms 117.4 112.8
llama-3.1-8b-instruct 90 ms 63 ms 115.7 115.6
qwen3-8b 102 ms 70 ms 110.7 115.5
granite-3.3-8b-instruct 110 ms 113 ms 100.7 99.1
deepseek-r1-distill-llama-8b 92 ms 63 ms 115.8 115.3
gemma-2-9b-it 125 ms 125 ms 80.8 82.5

SGLang wins long-context TTFT on 8/14 models, particularly at 7–9B scale. This contrasts with single-request latency where vLLM dominates. On decode throughput, vLLM leads for ≤3B models while SGLang edges ahead at 7–9B — consistent with the throughput ramp pattern. Gemma 3 4B is again the outlier: SGLang delivers 180 tok/s vs vLLM's 98 due to the same --enforce-eager constraint.


4. Prefix Sharing (60% Overlap)

KV cache reuse across 100 requests with 60% shared prefix. tok/s = total output tokens per second across all concurrent requests.

Model vLLM TTFT SGLang TTFT vLLM tok/s SGLang tok/s
gemma-2-2b-it 44 ms 40 ms 567.2 557.9
smollm3-3b 42 ms 72 ms 522.0 398.6
llama-3.2-3b-instruct 50 ms 41 ms 489.1 489.3
phi-3-mini-4k-instruct 59 ms 57 ms 397.6 396.6
gemma-3-4b-it 121 ms 103 ms 178.6 326.9
phi-4-mini-instruct 51 ms 56 ms 403.9 375.6
deepseek-r1-distill-qwen-7b 87 ms 78 ms 235.7 235.1
qwen2.5-7b-instruct 90 ms 92 ms 228.9 233.0
mistral-7b-instruct-v0.3 93 ms 93 ms 228.2 220.6
llama-3.1-8b-instruct 93 ms 65 ms 219.4 217.8
qwen3-8b 95 ms 59 ms 212.2 210.4
granite-3.3-8b-instruct 110 ms 66 ms 199.0 198.1
deepseek-r1-distill-llama-8b 94 ms 55 ms 219.5 225.8
gemma-2-9b-it 128 ms 125 ms 167.2 157.4

SGLang wins prefix-sharing TTFT on 10/14 models. Its radix-tree KV cache provides superior prefix reuse, shaving 20–40 ms at 7–9B scale. Decode throughput favours vLLM for most models once prefix overhead is amortised; Gemma 3 4B is again the exception (SGLang +83%) for the same --enforce-eager reason.


5. Structured Generation (JSON Schema)

JSON-constrained generation throughput across 200 requests. Higher tok/s is better.

Model vLLM tok/s SGLang tok/s Winner
gemma-2-2b-it 1,225 957 vLLM +28%
smollm3-3b 930 774 vLLM +20%
llama-3.2-3b-instruct 970 909 vLLM +7%
phi-3-mini-4k-instruct 785 783 Tie
gemma-3-4b-it 340 617 SGLang +81%
phi-4-mini-instruct 736 669 vLLM +10%
deepseek-r1-distill-qwen-7b 452 451 Tie
qwen2.5-7b-instruct 456 384 vLLM +19%
mistral-7b-instruct-v0.3 440 411 vLLM +7%
llama-3.1-8b-instruct 426 423 vLLM +1%
qwen3-8b 398 398 Tie
granite-3.3-8b-instruct 368 379 SGLang +3%
deepseek-r1-distill-llama-8b 426 422 vLLM +1%
gemma-2-9b-it 317 290 vLLM +9%

vLLM dominates structured generation — wins 12/14 models. Most pronounced on smaller models (Gemma 2 2B: +28%, SmolLM3: +20%).


6. TPOT & Goodput Analysis

TPOT (Time Per Output Token) = inter-token decode latency after the first token: (total_ms − ttft_ms) / max(output_tokens − 1, 1). Computed per request from existing result data; no re-runs required. Full per-scenario tables: reports/tpot_analysis.md.
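Applied per request, the formula is a few lines of Python. A sketch with assumed result-file field names (the real implementation is analysis/tpot_analysis.py):

# Per-request TPOT and percentiles from a saved result file
# (field names "requests", "total_ms", "ttft_ms", "output_tokens" are assumed;
#  the real implementation is analysis/tpot_analysis.py)
import json
import statistics

def tpot_ms(req: dict) -> float:
    return (req["total_ms"] - req["ttft_ms"]) / max(req["output_tokens"] - 1, 1)

def tpot_percentiles(result_path: str) -> dict:
    with open(result_path) as f:
        requests = json.load(f)["requests"]
    values = sorted(tpot_ms(r) for r in requests)
    q = statistics.quantiles(values, n=100)        # 99 cut points
    return {"p50": q[49], "p95": q[94], "p99": q[98]}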

TPOT at concurrency 1 (single_request_latency, P50/P99 ms)

At serial load, TPOT reflects raw GPU decode speed — engines are near-identical for every model except Gemma 3 4B, where vLLM's --enforce-eager constraint doubles decode time.

Model vLLM P50 vLLM P99 SGLang P50 SGLang P99
gemma-2-2b-it 12.9 13.1 12.8 12.9
smollm3-3b 14.6 14.7 15.7 15.7
llama-3.2-3b-instruct 15.0 15.0 14.7 14.7
phi-3-mini-4k-instruct 17.4 17.4 17.7 18.0
gemma-3-4b-it 41.0 42.3 21.7 21.9
phi-4-mini-instruct 17.8 18.0 18.8 19.3
deepseek-r1-distill-qwen-7b 32.7 33.1 32.4 32.4
qwen2.5-7b-instruct 32.7 33.3 32.3 32.8
mistral-7b-instruct-v0.3 31.5 31.8 31.4 31.7
llama-3.1-8b-instruct 35.7 40.3 33.0 38.9
qwen3-8b 35.6 37.2 37.0 40.9
granite-3.3-8b-instruct 36.1 36.1 36.0 36.0
deepseek-r1-distill-llama-8b 33.1 33.1 32.9 33.0
gemma-2-9b-it 41.4 42.4 41.2 42.2

Takeaway: At concurrency 1, both engines are GPU-bound equally. TPOT tracks model size. The sole outlier is Gemma 3 4B: SGLang achieves 21.7 ms vs vLLM's 41.0 ms — the same CUDA graph incompatibility that drives the throughput gap.

TPOT tail latency under load (throughput_ramp, P99 ms)

Under high concurrency, TPOT P99 reveals scheduling behaviour. vLLM holds tail latency significantly better on larger models.

Model vLLM P99 SGLang P99 SGLang / vLLM
gemma-2-2b-it 17.6 18.4 1.05×
smollm3-3b 21.2 41.9 2.0× worse
llama-3.2-3b-instruct 20.6 21.3 1.04×
phi-3-mini-4k-instruct 30.7 29.1 0.95×
gemma-3-4b-it 44.6 56.8 1.27× worse
phi-4-mini-instruct 28.9 32.7 1.13× worse
qwen2.5-7b-instruct 37.6 36.9 0.98×
mistral-7b-instruct-v0.3 39.0 39.1 1.00×
llama-3.1-8b-instruct 92.7 220.8 2.4× worse
qwen3-8b 54.3 256.2 4.7× worse
granite-3.3-8b-instruct 47.5 48.7 1.03×
deepseek-r1-distill-llama-8b 40.4 41.1 1.02×
deepseek-r1-distill-qwen-7b 35.8 36.8 1.03×
gemma-2-9b-it 72.1 80.1 1.11× worse

Takeaway: vLLM tail latency is substantially more stable at 7–9B scale under high concurrency. SGLang P99 TPOT spikes to 4.7× vLLM on Qwen3 8B and 2.4× on Llama 3.1 8B — the same scheduler stall behaviour that produces its P95 tail-latency anomaly on Gemma 2 9B. At ≤4B, the gap closes to <5% for most models.

Goodput (TTFT ≤ 200 ms, TPOT ≤ 40 ms)

Goodput = requests/sec satisfying both SLOs simultaneously, summed across all scenarios. Re-run with different thresholds: python -m analysis.goodput --ttft-slo-ms <X> --tpot-slo-ms <Y>.
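A sketch of the goodput computation itself, reusing the per-request tpot_ms() helper from the section above (field names assumed; analysis/goodput.py is the real implementation):

# Joint (TTFT, TPOT) SLO goodput (sketch; thresholds mirror the CLI flags above)
def goodput_rps(requests: list[dict], wall_clock_s: float,
                ttft_slo_ms: float = 200.0, tpot_slo_ms: float = 40.0) -> float:
    ok = [r for r in requests
          if r["ttft_ms"] <= ttft_slo_ms and tpot_ms(r) <= tpot_slo_ms]
    return len(ok) / wall_clock_s   # requests/sec meeting BOTH SLOs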

Goodput comparison — TTFT ≤ 100 ms and TPOT ≤ 35 ms

The chart uses the stricter TTFT ≤ 100 ms, TPOT ≤ 35 ms pair from reports/goodput_slo100_35.md; the table below uses the 200 ms / 40 ms pair.

Model vLLM goodput SGLang goodput SLO pass % (vLLM / SGLang)
gemma-2-2b-it 1.47 rps 1.32 rps 99.9% / 97.0%
smollm3-3b 1.12 rps 0.90 rps 98.8% / 88.9%
llama-3.2-3b-instruct 1.08 rps 1.11 rps 98.2% / 99.9%
phi-3-mini-4k-instruct 0.90 rps 0.92 rps 96.0% / 99.8%
gemma-3-4b-it 0.004 rps 0.60 rps 1.0% / 81.4%
phi-4-mini-instruct 0.94 rps 1.00 rps 98.8% / 98.9%
deepseek-r1-distill-qwen-7b 0.51 rps 0.47 rps 97.1% / 91.3%
qwen2.5-7b-instruct 0.52 rps 0.53 rps 97.1% / 98.6%
mistral-7b-instruct-v0.3 0.51 rps 0.53 rps 95.7% / 99.9%
llama-3.1-8b-instruct 0.28 rps 0.42 rps 60.4% / 71.7%
qwen3-8b 0.34 rps 0.37 rps 78.4% / 50.9%
granite-3.3-8b-instruct 0.25 rps 0.24 rps 54.3% / 53.8%
deepseek-r1-distill-llama-8b 0.47 rps 0.47 rps 94.0% / 93.1%
gemma-2-9b-it 0 rps 0 rps 0% / 0% — TPOT ~44 ms exceeds SLO

Takeaways:

  • Gemma 2 2B / SmolLM3 — vLLM leads by 10–12% in goodput (CUDA graphs, lower TPOT variance).
  • Gemma 3 4B — SGLang delivers 150× the goodput of vLLM (0.60 vs 0.004 rps) because vLLM almost never meets the TPOT SLO without CUDA graphs.
  • Llama 3.1 8B — SGLang wins goodput (0.42 vs 0.28 rps) despite vLLM's lower serial TTFT, because SGLang's better TTFT under concurrent load keeps more requests inside the 200 ms window.
  • Gemma 2 9B — neither engine meets a 40 ms TPOT SLO (native TPOT ~44 ms); relax the TPOT threshold to ≤50 ms to get meaningful results.

7. Speculative Decoding

Llama 3.1 8B

Variant Engine TTFT (med) Single tok/s Peak Throughput
Baseline vLLM 43 ms 30.3 102 tok/s
Baseline SGLang 67 ms 30.3 102 tok/s
Ngram vLLM 42 ms 27.8 96 tok/s
Ngram SGLang 39 ms 26.0 73 tok/s
Eagle3 vLLM 48 ms 24.6 82 tok/s
Eagle3 SGLang — — — (not run: OOM on A10G; see note below)

Eagle3 draft model (vLLM): RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3. SGLang Eagle3 exceeds A10G 24 GB capacity (main + draft + KV cache); future scope on ≥40 GB GPUs.

Qwen3 8B

Variant Engine TTFT (med) Single tok/s Peak Throughput
Baseline vLLM 44 ms 29.2 98 tok/s
Baseline SGLang 72 ms 29.4 99 tok/s
Ngram vLLM 44 ms 26.8 91 tok/s
Ngram SGLang 40 ms 25.6 64 tok/s

Eagle3 not tested — RedHatAI/Qwen3-8B-speculator.eagle3 not yet published.

Gemma 4 E4B

Variant Engine TTFT p50 Single tok/s Peak Throughput
Baseline vLLM 84 ms 24.5 83.8 tok/s
Baseline SGLang 87 ms 24.8 81.3 tok/s
Ngram vLLM 47 ms 20.7 77.1 tok/s
Ngram SGLang 48 ms 23.2 58.0 tok/s

Ngram cuts TTFT by ~45% on both engines for E4B; peak throughput slightly regresses (SGLang-ngram −29%), matching the "spec-dec hurts on A10G" pattern seen on Llama/Qwen.

Gemma 4 E2B

TTFT numbers land in the same regime (vLLM baseline 71 ms, Ngram 41 ms; SGLang baseline 72 ms, Ngram 40 ms), but the single_request_latency and throughput_ramp result files for E2B report total_tokens_generated=0 — a data-quality issue in those two scenarios only (long-context, prefix-sharing, and structured-generation E2B runs are clean). Decode-throughput numbers for E2B spec-dec are therefore not published here and need a rerun. See docs/RUN_STATUS.md.

Speculative decoding comparison

Regenerate with python -m analysis.generate_spec_decoding_figure after new spec-dec runs land. The plotly-based interactive variant is out of date (still Llama/Qwen only) and pending a refresh.


Model Size vs Performance

How key metrics scale with model size (best engine, single request):

Size TTFT range tok/s range Peak throughput
2–3B 20–57 ms 62–78 tok/s 230–265 tok/s
3.8–4B 25–87 ms 24–57 tok/s 84–191 tok/s
7B 40–66 ms 30–32 tok/s 105–107 tok/s
8B 42–76 ms 28–30 tok/s 93–102 tok/s
9B 74–83 ms 24 tok/s 78–80 tok/s

TTFT grows ~4× from 2B to 9B. Throughput drops ~3×. The steepest jump is at 7B where VRAM pressure begins on 24 GB.


Key Findings

vLLM wins TTFT at low concurrency. 13/14 models, 20–60% faster to first token. CUDA graph execution eliminates kernel launch overhead.

Throughput converges at 7–9B. Both engines hit the same GPU-bottlenecked ceiling. Differences <3%.

vLLM wins small-model throughput. SmolLM3 3B: 230 vs 205 (+12%), Phi-4 mini: 189 vs 176 (+7%), Gemma 2 2B: 265 vs 258 (+3%).

Gemma 3 is SGLang's strongest case. vLLM requires --enforce-eager for hybrid attention, giving SGLang +77% throughput (149 vs 84 tok/s). Architectural compatibility issue, not fundamental engine difference.

SGLang wins prefix sharing. Radix-tree KV cache provides better prefix reuse — wins TTFT on 10/14 models.

vLLM dominates structured generation. 12/14 wins. Gap ranges from marginal (<1%) to substantial (+28%).

Speculative decoding hurts on A10G. Ngram: vLLM −7%, SGLang −28%. Eagle3: vLLM −20%. Draft proposal overhead exceeds decode savings. Constrained --max-model-len 2048 limits batch efficiency. Better realized on ≥40 GB GPUs.

When to Use Which Engine

Use Case Recommendation Why
Latency-sensitive serving vLLM Wins TTFT on 13/14 models
Structured/JSON output vLLM Wins throughput on 12/14 models
Prefix-heavy workloads (RAG) SGLang Wins prefix-sharing TTFT on 10/14
High-throughput batch (7B+) Either Tied within 3%
Gemma 3 models SGLang +77% throughput (vLLM CUDA graph limitation)

Architecture Deep-Dive

vLLM — PagedAttention

  • KV cache split into fixed-size pages (blocks), managed by a block allocator
  • Prefix cache: LRU reuse of blocks for repeated prompt prefixes
  • Continuous batching: adds/removes requests mid-batch for high utilisation
  • Metrics exposed via Prometheus at /metrics
  • SSE streaming at /v1/completions (OpenAI-compat)

SGLang — RadixAttention

  • KV cache stored as a radix tree (trie) keyed on token sequences
  • All in-flight requests share the trie — automatic prefix deduplication (see the workload sketch after this list)
  • sgl.fork() creates parallel decode branches sharing the same KV prefix
  • Constrained decode built-in: regex / JSON schema enforces valid tokens
  • Metrics via /get_server_info JSON endpoint
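The prefix_sharing_benefit scenario exercises exactly this: many requests sharing a long common prefix. A workload-shape sketch, reusing the measure() helper from the Architecture section (illustrative only — the real prompts come from prompts/shared_prefix.json and the real scheduling in benchmarks/runner.py caps concurrency at 8):

# Prefix-sharing workload sketch: ~60% shared prefix across 100 requests
# (illustrative; real prompts come from prompts/shared_prefix.json)
import asyncio

SHARED_PREFIX = "You are a support agent for AcmeCloud. Policy excerpt: ... " * 40  # ~60% of each prompt
TAILS = [f"Customer question #{i}: how do I rotate my API key?" for i in range(100)]

async def run_prefix_sharing(base_url: str, model: str) -> float:
    results = await asyncio.gather(*[
        measure(base_url, model, SHARED_PREFIX + tail, max_tokens=128)   # measure() from the SSE sketch
        for tail in TAILS
    ])
    ttfts = sorted(r["ttft_ms"] for r in results)
    return ttfts[len(ttfts) // 2]   # median TTFT — where RadixAttention's prefix reuse shows up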

Dashboard API

Method Endpoint Description
GET / Browser-friendly dashboard home
GET /api/results List saved result files (?model=... optional)
GET /api/results/{id} Load a specific result
GET /api/current Detect currently running benchmark + active services
GET /api/compare/{scenario} vLLM + SGLang delta for a scenario
POST /api/run Start a background benchmark run
GET /api/run/{job_id}/status Poll run progress
WS /ws/live Real-time metric stream (JSON messages)
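A hedged example of driving a run through this API — the request body for POST /api/run and the status payload fields are assumptions; check dashboard/app.py for the actual schema:

# Start a benchmark run via the dashboard and poll it to completion
# (body and response field names are assumed — see dashboard/app.py)
import time
import httpx

def run_via_dashboard(scenario: str, engine: str, base_url: str = "http://localhost:3000") -> dict:
    job = httpx.post(f"{base_url}/api/run", json={"scenario": scenario, "engines": [engine]}).json()
    job_id = job["job_id"]                                   # assumed response field
    while True:
        status = httpx.get(f"{base_url}/api/run/{job_id}/status").json()
        if status.get("state") in ("completed", "failed"):   # assumed states
            return status
        time.sleep(5)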

Configuration

Environment Variable Default Description
HUGGING_FACE_HUB_TOKEN (unset) HF token for gated models
VLLM_HOST / VLLM_PORT localhost / 8000 vLLM server
SGLANG_HOST / SGLANG_PORT localhost / 8001 SGLang server
RESULTS_DIR results/ JSON result file directory
ALLOWED_ORIGINS http://localhost:3000 CORS origins for dashboard
LOG_FORMAT console console (colored) or json (structured)
LOG_LEVEL INFO DEBUG, INFO, WARNING, ERROR

Running Tests

pytest tests/ -v                    # All tests (no live engines needed)
pytest tests/ --cov=engines --cov=benchmarks --cov-report=term-missing
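The suite mocks each engine's HTTP surface with respx, so it runs without a GPU or a live server. A minimal sketch of the pattern (illustrative — the real tests in tests/ exercise the engine clients; requires pytest-asyncio):

# Minimal respx mocking pattern (illustrative; real tests in tests/ target engines/*_client.py)
import httpx
import pytest
import respx

@respx.mock
@pytest.mark.asyncio          # pytest-asyncio
async def test_vllm_health_endpoint():
    respx.get("http://localhost:8000/health").mock(return_value=httpx.Response(200))
    async with httpx.AsyncClient() as client:
        resp = await client.get("http://localhost:8000/health")
    assert resp.status_code == 200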

AWS Deployment

Two deployment options: a self-contained bash script and a Terraform module.

Option 1 — Bash Script

deploy/ec2_deploy.sh handles everything end-to-end with only the AWS CLI and jq.

# Single GPU (~$1.21/hr)
./deploy/ec2_deploy.sh --mode single --key my-key-pair --region us-east-1

# Two dedicated GPUs (~$2.46/hr)
./deploy/ec2_deploy.sh --mode multi --key my-key-pair --hf-token hf_TOKEN --region us-east-1

# Teardown
./deploy/ec2_deploy.sh --destroy

Option 2 — Terraform

cd deploy/terraform
terraform init
terraform apply \
  -var="key_pair_name=my-key" \
  -var="your_ip_cidr=$(curl -s https://checkip.amazonaws.com)/32" \
  -var="hf_token=hf_TOKEN" \
  -var="deployment_mode=single"

Mode Instance(s) Monthly est. (8h/day × 22 days)
Single 1× g5.2xlarge ~$213
Multi 2× g5.2xlarge + 1× t3.medium ~$435
Single Spot 1× g5.2xlarge (spot) ~$64–$100

# Teardown
terraform destroy -var="key_pair_name=my-key" -var="your_ip_cidr=$(curl -s https://checkip.amazonaws.com)/32"

Use terraform.tfvars to avoid typing variables repeatedly. See deploy/terraform/ for full variable reference.


Model-Specific Notes

Model Notes
Gemma 3 4B (vLLM) Requires --enforce-eager --disable-frontend-multiprocessing — hybrid sliding-window + full attention incompatible with CUDA graph capture
Gemma 2 9B (vLLM) Requires --max-model-len 4096 --gpu-memory-utilization 0.92 to fit on A10G
Llama 3.1 8B Eagle3 Requires --gpu-memory-utilization 0.95 --enforce-eager --max-model-len 2048 — main + draft use ~16.8 GiB
SGLang Eagle3 OOM on A10G for all tested models — main + draft + KV cache exceeds 24 GB
Qwen3-30B-A3B Not benchmarked — ~60 GB at bf16, exceeds A10G capacity
Gemma 3 12B Not benchmarked — ~24 GB weights, no KV cache headroom

Troubleshooting

Symptom Fix
SSH connection refused Check your_ip_cidr matches current IP
curl localhost:8000/health hangs Model still downloading — check docker compose logs -f vllm
GPU not visible in Docker Run nvidia-smi. If it fails, reboot
Out of GPU memory Reduce --gpu-memory-utilization or --max-model-len
Bootstrap failed Check /var/log/benchmark-setup.log
Dashboard 502 Verify port 3000 is open in security group

Reproducing These Results

Two entry-point scripts, depending on what you want:

# ─── Option A: 14-model baseline only (the 152-file headline result set) ───
# Simple, sequential, one engine at a time. Skips any model with ≥10 files.
chmod +x scripts/run_all_benchmarks.sh
tmux new -s bench
./scripts/run_all_benchmarks.sh 2>&1 | tee logs/run_$(date +%Y%m%dT%H%M%S).log
./scripts/run_all_benchmarks.sh --force       # ignore existing results

# ─── Option B: extended phases (variance, concurrency-64, decode sweep, Gemma 4) ───
# Everything beyond the baseline — idempotent, resume-safe.
bash scripts/run_new_benchmarks.sh --all
bash scripts/run_new_benchmarks.sh --variance   # variance (4 models × 5 iter)
bash scripts/run_new_benchmarks.sh --concurrency   # concurrency-64 ramp
bash scripts/run_new_benchmarks.sh --decode-sweep   # decode-length sweep
bash scripts/run_new_benchmarks.sh --gemma4   # Gemma 4 baseline + ngram

# Generate summary reports after runs finish
conda run -n base python -m analysis.generate_final_benchmark_report

Full walk-through, env-var knobs, and troubleshooting: scripts/EXECUTION_GUIDE.md.


Future Work

  • SGLang Eagle3 on ≥40 GB GPU (A100/H100) — OOMs on A10G
  • Eagle3 for Qwen3-8B once RedHatAI/Qwen3-8B-speculator.eagle3 is published
  • Quantized models (AWQ/GPTQ) — test if spec-dec becomes viable at lower precision
  • Multi-GPU tensor parallel benchmarks
  • Nightly CI benchmark regression runs (current CI only gates lint + unit tests)

Contributing

See CONTRIBUTING.md for guidelines on adding models, scenarios, or hardware results.

License

MIT
