A gateway for vLLM that adds OpenAI-compatible batch processing with adaptive backpressure control. The system automatically prevents batch jobs from degrading interactive request latency.
This monorepo provides two packages:
- vllm-gateway: FastAPI proxy that sits in front of vLLM, adding batch processing with intelligent resource management
- vllm-perf: Tools for benchmarking, tuning, and quantizing vLLM deployments
The main feature is adaptive backpressure: batch workers monitor vLLM's queue depth in real time and automatically reduce their concurrency when interactive requests start waiting, so single-shot requests stay fast even while large batches are being processed.
Interactive requests go straight through the gateway to vLLM with priority 0. Batch requests get queued in Redis and processed by workers with priority 10. Workers watch vLLM's queue depth and throttle themselves when they see interactive requests waiting.
vLLM's priority scheduler ensures interactive requests get processed first. The workers poll the /metrics endpoint every 2 seconds and adjust how many batch requests they send concurrently.
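The waiting count the workers react to comes from vLLM's Prometheus metrics endpoint. A minimal sketch of that polling step, assuming the standard `vllm:num_requests_waiting` gauge and a vLLM server on `localhost:8000` (the function names here are illustrative, not the gateway's actual code):

```python
import re
import urllib.request

def parse_waiting(metrics_text: str) -> int:
    """Extract the waiting-request count from Prometheus text-format output."""
    # Matches e.g.: vllm:num_requests_waiting{model_name="Qwen/Qwen3-4B"} 3.0
    m = re.search(
        r'^vllm:num_requests_waiting(?:\{[^}]*\})?\s+([0-9.eE+-]+)',
        metrics_text,
        re.MULTILINE,
    )
    return int(float(m.group(1))) if m else 0

def poll_vllm(url: str = "http://localhost:8000/metrics") -> int:
    """Fetch /metrics and return how many requests are waiting in vLLM's queue."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return parse_waiting(resp.read().decode())
```

A worker would call `poll_vllm()` on each 2-second tick and feed the result into its concurrency adjustment.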
The gateway runs on port 8010. Create a batch job:
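Each line of `batch_input.jsonl` is one request in the OpenAI batch input format (`custom_id`, `method`, `url`, `body`). A sketch that writes a two-request file (model and prompts are just examples):

```python
import json

requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "Qwen/Qwen3-4B",
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    for i, prompt in enumerate(["Hello!", "Summarize vLLM in one line."])
]

# One JSON object per line, as the batch endpoint expects
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
```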
```bash
# Upload input file
curl -X POST http://localhost:8010/v1/files \
  -F "file=@batch_input.jsonl" \
  -F "purpose=batch"

# Create batch
curl -X POST http://localhost:8010/v1/batches \
  -H "Content-Type: application/json" \
  -d '{"input_file_id": "file-xyz", "endpoint": "/v1/chat/completions", "completion_window": "24h"}'
```
```bash
# Check status
curl http://localhost:8010/v1/batches/{batch_id}
```

Key settings in .env:
- VLLM_MODEL: Model to load (default: Qwen/Qwen3-4B)
- BATCH_MAX_CONCURRENT_REQUESTS: Max batch concurrency (default: 50)
- BACKPRESSURE_ENABLED: Enable adaptive backpressure (default: true)
- BACKPRESSURE_POLL_INTERVAL: How often to check vLLM metrics, in seconds (default: 2.0)
- CHAT_COMPLETION_PRIORITY: Priority for interactive requests (default: 0)
- BATCH_PRIORITY: Priority for batch requests (default: 10)
See packages/vllm-gateway/src/vllm_gateway/core/config.py for all options.
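A sample .env using the defaults above (values shown are illustrative; any of them can be overridden):

```env
VLLM_MODEL=Qwen/Qwen3-4B
BATCH_MAX_CONCURRENT_REQUESTS=50
BACKPRESSURE_ENABLED=true
BACKPRESSURE_POLL_INTERVAL=2.0
CHAT_COMPLETION_PRIORITY=0
BATCH_PRIORITY=10
```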
Workers poll vLLM's metrics endpoint and look at how many requests are waiting. Based on that, they adjust their concurrency:
- Nothing waiting: run at full speed
- 1-5 waiting: drop to 3 concurrent requests
- 6-10 waiting: drop to 2
- More than 10: drop to 1
The target concurrency is stored in Redis so all workers see it.
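The tier mapping above can be sketched as a pure function (the name and default are illustrative, not the gateway's actual code):

```python
def target_concurrency(num_waiting: int, max_concurrent: int = 50) -> int:
    """Map vLLM's waiting-request count to a batch concurrency target."""
    if num_waiting == 0:
        return max_concurrent  # nothing waiting: run at full speed
    if num_waiting <= 5:
        return 3
    if num_waiting <= 10:
        return 2
    return 1
```

Each worker would write the computed target to a shared Redis key so that all workers converge on the same value.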
Start with the monitoring stack:

```bash
docker compose --profile monitoring up -d
```

- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000
- Gateway metrics: http://localhost:8010/metrics
```
packages/
├── vllm-gateway/          # API gateway and batch workers
│   ├── src/
│   │   ├── api/           # FastAPI routes
│   │   ├── workers/       # ARQ batch processing
│   │   ├── services/      # Business logic
│   │   └── metrics/       # Prometheus instrumentation
│   └── tests/
│
└── vllm-perf/             # Performance tools
    └── src/
        ├── benchmarks/    # guidellm wrapper
        ├── tuning/        # Ray Tune + Optuna
        └── quantization/  # Model quantization

docker/                    # Orchestration
└── config/                # Prometheus & Grafana
```
- Python 3.11+
- Docker and Docker Compose
- Redis 7.0+
- NVIDIA GPU with CUDA (for vLLM)