A gateway for vLLM that adds OpenAI-compatible batch processing with adaptive backpressure control. The system automatically prevents batch jobs from degrading interactive request latency.
This monorepo provides two packages:
- vllm-gateway: FastAPI proxy that sits in front of vLLM, adding batch processing with intelligent resource management
- vllm-perf: Tools for benchmarking, tuning, and quantizing vLLM deployments
The main feature is adaptive backpressure: batch workers monitor vLLM's queue depth in real time and automatically reduce their concurrency when interactive requests start waiting, so single-shot requests stay fast even while large batches are being processed.
Interactive requests go straight through the gateway to vLLM with priority 0. Batch requests get queued in Redis and processed by workers with priority 10. Workers watch vLLM's queue depth and throttle themselves when they see interactive requests waiting.
vLLM's priority scheduler ensures interactive requests get processed first. The workers poll the /metrics endpoint every 2 seconds and adjust how many batch requests they send concurrently.
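The waiting count the workers react to comes from vLLM's Prometheus metrics endpoint. A minimal sketch of that polling step, assuming the standard `vllm:num_requests_waiting` gauge and a vLLM server on `localhost:8000` (the function names here are illustrative, not the gateway's actual code):

```python
import re
import urllib.request

def parse_waiting(metrics_text: str) -> int:
    """Extract the waiting-request count from Prometheus text-format output."""
    # Matches e.g.: vllm:num_requests_waiting{model_name="Qwen/Qwen3-4B"} 3.0
    m = re.search(
        r'^vllm:num_requests_waiting(?:\{[^}]*\})?\s+([0-9.eE+-]+)',
        metrics_text,
        re.MULTILINE,
    )
    return int(float(m.group(1))) if m else 0

def poll_vllm(url: str = "http://localhost:8000/metrics") -> int:
    """Fetch /metrics and return how many requests are waiting in vLLM's queue."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return parse_waiting(resp.read().decode())
```

A worker would call `poll_vllm()` on each 2-second tick and feed the result into its concurrency adjustment.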
The gateway runs on port 8010. Create a batch job:
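Each line of `batch_input.jsonl` is one request in the OpenAI batch input format (`custom_id`, `method`, `url`, `body`). A sketch that writes a two-request file (model and prompts are just examples):

```python
import json

requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "Qwen/Qwen3-4B",
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    for i, prompt in enumerate(["Hello!", "Summarize vLLM in one line."])
]

# One JSON object per line, as the batch endpoint expects
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
```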
```bash
# Upload input file
curl -X POST http://localhost:8010/v1/files \
  -F "file=@batch_input.jsonl" \
  -F "purpose=batch"

# Create batch
curl -X POST http://localhost:8010/v1/batches \
  -H "Content-Type: application/json" \
  -d '{"input_file_id": "file-xyz", "endpoint": "/v1/chat/completions", "completion_window": "24h"}'
```
```bash
# Check status
curl http://localhost:8010/v1/batches/{batch_id}
```

Key settings in .env:
- VLLM_MODEL: Model to load (default: Qwen/Qwen3-4B)
- BATCH_MAX_CONCURRENT_REQUESTS: Max batch concurrency (default: 50)
- BACKPRESSURE_ENABLED: Enable adaptive backpressure (default: true)
- BACKPRESSURE_POLL_INTERVAL: How often to check vLLM metrics, in seconds (default: 2.0)
- CHAT_COMPLETION_PRIORITY: Priority for interactive requests (default: 0)
- BATCH_PRIORITY: Priority for batch requests (default: 10)
See packages/vllm-gateway/src/vllm_gateway/core/config.py for all options.
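A sample .env using the defaults above (values shown are illustrative; any of them can be overridden):

```env
VLLM_MODEL=Qwen/Qwen3-4B
BATCH_MAX_CONCURRENT_REQUESTS=50
BACKPRESSURE_ENABLED=true
BACKPRESSURE_POLL_INTERVAL=2.0
CHAT_COMPLETION_PRIORITY=0
BATCH_PRIORITY=10
```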
Workers poll vLLM's metrics endpoint and look at how many requests are waiting. Based on that, they adjust their concurrency:
- Nothing waiting: run at full speed
- 1-5 waiting: drop to 3 concurrent requests
- 6-10 waiting: drop to 2
- More than 10: drop to 1
The target concurrency is stored in Redis so all workers see it.
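The tier mapping above can be sketched as a pure function (the name and default are illustrative, not the gateway's actual code):

```python
def target_concurrency(num_waiting: int, max_concurrent: int = 50) -> int:
    """Map vLLM's waiting-request count to a batch concurrency target."""
    if num_waiting == 0:
        return max_concurrent  # nothing waiting: run at full speed
    if num_waiting <= 5:
        return 3
    if num_waiting <= 10:
        return 2
    return 1
```

Each worker would write the computed target to a shared Redis key so that all workers converge on the same value.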
Start with the monitoring stack:

```bash
docker compose --profile monitoring up -d
```

- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000
- Gateway metrics: http://localhost:8010/metrics
```
packages/
├── vllm-gateway/          # API gateway and batch workers
│   ├── src/
│   │   ├── api/           # FastAPI routes
│   │   ├── workers/       # ARQ batch processing
│   │   ├── services/      # Business logic
│   │   └── metrics/       # Prometheus instrumentation
│   └── tests/
│
└── vllm-perf/             # Performance tools
    └── src/
        ├── benchmarks/    # guidellm wrapper
        ├── tuning/        # Ray Tune + Optuna
        └── quantization/  # Model quantization

docker/                    # Orchestration
└── config/                # Prometheus & Grafana
```
- Python 3.11+
- Docker and Docker Compose
- Redis 7.0+
- NVIDIA GPU with CUDA (for vLLM)