rvLLM Serverless for RunPod

RunPod Serverless wrapper for rvLLM, keeping the Rust inference server intact and adding only the minimal serverless layer needed for deployment.

This repo follows the shape of the official RunPod worker repos:

a thin Python runpod.serverless.start(...) handler
rvllm serve running as the local OpenAI-compatible backend
generic-image deployment with MODEL_ID at runtime
baked-image deployment with a model snapshot inside the image

Current Status

generic serverless image is built and published
tested against real RunPod GPU workers
validated startup path after fixing missing PTX kernel packaging
still a work in progress: deployment ergonomics and tuning are expected to keep changing

Published Image

Current test image:

reniyap/rvllm-serverless:exp-20260401
digest: sha256:9fa7c365b125f15ad7f703d7952ba5b41291e8d14bea00efa0be52cdf2552e0d

For a reproducible pull in RunPod, pin the image by digest:

reniyap/rvllm-serverless@sha256:9fa7c365b125f15ad7f703d7952ba5b41291e8d14bea00efa0be52cdf2552e0d

Why This Shape

rvLLM already handles the core inference work:

OpenAI-compatible HTTP API
Hugging Face model-id loading
Rust-native runtime with much smaller overhead than Python vLLM

This serverless layer only does three things:

Launch rvllm serve with env-driven configuration.
Wait for /health.
Proxy RunPod jobs to the local OpenAI-compatible API.

That leaves rvLLM itself untouched and avoids growing a second inference implementation in Python.
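
The second step above, waiting for /health, can be sketched as a simple polling loop. This is a minimal sketch rather than the repo's actual launcher code: the names `http_health_probe` and `wait_until_ready` are hypothetical, and it assumes the backend answers HTTP 200 on /health at the local port once it is ready to serve.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_probe(port: int = 8000) -> bool:
    """Return True if GET /health on the local backend answers 200.

    Assumption: a 200 response means the server is ready for requests.
    """
    url = f"http://127.0.0.1:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx: treat as "not ready yet".
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 900.0,  # mirrors the SERVER_READY_TIMEOUT default
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A launcher could call `wait_until_ready(http_health_probe)` right after spawning `rvllm serve` and fail the worker if it returns False. Passing the probe, sleep, and clock as arguments keeps the loop testable without a running server.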

Quick Start

Option 1. Use the Published Image in RunPod

In RunPod Serverless:

Create a Custom deployment.
Choose Deploy from Docker registry.
Use the image above.
Set endpoint type to Queue-based.
Leave Container start command empty.
Leave Expose HTTP ports and Expose TCP ports empty.
Add runtime env vars like this:

MODEL_ID=Qwen/Qwen2.5-7B-Instruct
DTYPE=half
MAX_MODEL_LEN=4096
GPU_MEMORY_UTILIZATION=0.80
MAX_NUM_SEQS=16
MAX_CONCURRENCY=4

For gated/private models, add HF_TOKEN as a RunPod Secret.

Option 2. Build Your Own Image

cd rvLLM-serverless
./scripts/build.sh --tag your-registry/rvllm-serverless:latest --push

To inspect the generated Docker command without building:

cd rvLLM-serverless
./scripts/build.sh --tag your-registry/rvllm-serverless:latest --dry-run

Option 3. Bake a Model into the Image

cd rvLLM-serverless
HF_TOKEN=hf_xxx ./scripts/build.sh \
  --tag your-registry/rvllm-serverless:qwen25-7b \
  --bake-model \
  --model-id Qwen/Qwen2.5-7B-Instruct \
  --push

For baked images, set runtime environment variables such as:

MODEL_TARGET=/models/default
SERVED_MODEL_NAME=Qwen/Qwen2.5-7B-Instruct
DTYPE=half
MAX_MODEL_LEN=4096
GPU_MEMORY_UTILIZATION=0.80
MAX_NUM_SEQS=16
MAX_CONCURRENCY=4

How To Call It

This is a queue-based RunPod worker, so call the RunPod endpoint APIs, not the container port directly.

List Models

curl --request POST \
  --url "https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync" \
  -H "authorization: <RUNPOD_API_KEY>" \
  -H "content-type: application/json" \
  -d '{
    "input": {
      "path": "/v1/models",
      "method": "GET"
    }
  }'

Chat Completion

curl --request POST \
  --url "https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync" \
  -H "authorization: <RUNPOD_API_KEY>" \
  -H "content-type: application/json" \
  -d '{
    "input": {
      "messages": [
        {"role": "system", "content": "Answer briefly."},
        {"role": "user", "content": "What is rvLLM?"}
      ],
      "temperature": 0.2,
      "max_tokens": 128
    }
  }'

Streamed Chat Completion

curl --request POST \
  --url "https://api.runpod.ai/v2/<ENDPOINT_ID>/run" \
  -H "authorization: <RUNPOD_API_KEY>" \
  -H "content-type: application/json" \
  -d '{
    "input": {
      "messages": [
        {"role": "user", "content": "Write three bullet points about rvLLM."}
      ],
      "stream": true,
      "max_tokens": 128
    }
  }'

Then read the stream with the returned job id:

curl --request GET \
  --url "https://api.runpod.ai/v2/<ENDPOINT_ID>/stream/<JOB_ID>" \
  -H "authorization: <RUNPOD_API_KEY>"
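
The same calls can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_job_request` and `send` are hypothetical, and the headers simply mirror the curl examples above (the API key is passed as-is in the authorization header).

```python
import json
import urllib.request

API_BASE = "https://api.runpod.ai/v2"


def build_job_request(endpoint_id: str, api_key: str, job_input: dict,
                      operation: str = "runsync") -> urllib.request.Request:
    """Build a POST request for /run or /runsync, mirroring the curl calls."""
    if operation not in ("run", "runsync"):
        raise ValueError(f"unsupported operation: {operation}")
    body = json.dumps({"input": job_input}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/{endpoint_id}/{operation}",
        data=body,
        method="POST",
        headers={
            "authorization": api_key,
            "content-type": "application/json",
        },
    )


def send(request: urllib.request.Request, timeout_s: float = 600.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `send(build_job_request("<ENDPOINT_ID>", "<RUNPOD_API_KEY>", {"path": "/v1/models", "method": "GET"}))` corresponds to the List Models call. Building the request separately from sending it makes the payload easy to inspect before anything goes over the network.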

Configuration

Core Runtime Variables

| Variable | Default | Purpose |
| --- | --- | --- |
| `MODEL_ID` | unset | Public Hugging Face model id for generic images. |
| `MODEL_TARGET` | unset | Actual value passed to `rvllm serve --model`. |
| `SERVED_MODEL_NAME` | `MODEL_ID` or `MODEL_TARGET` | Public model name exposed to clients. |
| `TOKENIZER_ID` | unset | Optional tokenizer override. |
| `HF_TOKEN` | unset | Hugging Face token for gated/private models. |
| `HF_HOME` | `/runpod-volume/huggingface` | Hugging Face cache root. |
| `HUGGINGFACE_HUB_CACHE` | `${HF_HOME}/hub` | Hugging Face hub cache path. |
| `RVLLM_PORT` | `8000` | Local port used by `rvllm serve`. |
| `MAX_CONCURRENCY` | `30` | RunPod worker concurrency hint. |
| `SERVER_READY_TIMEOUT` | `900` | Startup timeout in seconds. |
| `REQUEST_TIMEOUT` | `600` | Proxy request timeout in seconds. |
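
To illustrate how these defaults and fallbacks fit together, here is a minimal sketch of resolving the core variables from the environment. It is not the repo's actual config.py: `WorkerConfig` and `load_config` are hypothetical names, and the rule that `MODEL_TARGET` takes precedence over `MODEL_ID` as the serve target is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WorkerConfig:
    model_target: str          # value passed to `rvllm serve --model`
    served_model_name: str     # model name exposed to clients
    hub_cache: str
    rvllm_port: int
    max_concurrency: int
    server_ready_timeout: int  # seconds
    request_timeout: int       # seconds


def load_config(env: Optional[Mapping[str, str]] = None) -> WorkerConfig:
    """Resolve the core runtime variables, applying the documented defaults."""
    env = os.environ if env is None else env
    model_id = env.get("MODEL_ID") or None
    # Assumption: an explicit MODEL_TARGET takes precedence over MODEL_ID.
    model_target = env.get("MODEL_TARGET") or model_id
    if not model_target:
        raise ValueError("set MODEL_ID or MODEL_TARGET")
    hf_home = env.get("HF_HOME") or "/runpod-volume/huggingface"
    return WorkerConfig(
        model_target=model_target,
        served_model_name=env.get("SERVED_MODEL_NAME") or model_id or model_target,
        hub_cache=env.get("HUGGINGFACE_HUB_CACHE") or f"{hf_home}/hub",
        rvllm_port=int(env.get("RVLLM_PORT", "8000")),
        max_concurrency=int(env.get("MAX_CONCURRENCY", "30")),
        server_ready_timeout=int(env.get("SERVER_READY_TIMEOUT", "900")),
        request_timeout=int(env.get("REQUEST_TIMEOUT", "600")),
    )
```

Accepting an optional mapping instead of always reading `os.environ` makes it possible to exercise the generic-image and baked-image cases in tests without touching the process environment.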

`rvLLM` Launch Variables

| Variable | Default |
| --- | --- |
| `DTYPE` | `auto` |
| `MAX_MODEL_LEN` | `2048` |
| `GPU_MEMORY_UTILIZATION` | `0.9` |
| `TENSOR_PARALLEL_SIZE` | `1` |
| `MAX_NUM_SEQS` | `256` |
| `RUST_LOG` | `info` |
| `DISABLE_TELEMETRY` | `false` |

Job Input Contract

The worker accepts two styles of input.

Direct OpenAI-Style Input

If input contains messages, it becomes /v1/chat/completions.

{
  "input": {
    "messages": [
      { "role": "user", "content": "What is rvLLM?" }
    ],
    "max_tokens": 128
  }
}

If input contains prompt, it becomes /v1/completions.

{
  "input": {
    "prompt": "Write a one-line summary of RunPod Serverless.",
    "max_tokens": 64
  }
}

If model is omitted, the worker injects SERVED_MODEL_NAME.
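
The routing rules above can be sketched as a small pure function. This is an illustration under stated assumptions, not the repo's actual request_mapping.py: `map_direct_input` is a hypothetical name, and giving `messages` priority when both keys are present is an assumption.

```python
from typing import Any, Dict, Tuple


def map_direct_input(job_input: Dict[str, Any],
                     served_model_name: str) -> Tuple[str, Dict[str, Any]]:
    """Return (path, body) for a direct OpenAI-style job input."""
    # Assumption: "messages" wins if both "messages" and "prompt" are present.
    if "messages" in job_input:
        path = "/v1/chat/completions"
    elif "prompt" in job_input:
        path = "/v1/completions"
    else:
        raise ValueError("input needs 'messages' or 'prompt'")
    body = dict(job_input)  # shallow copy so the caller's dict is untouched
    # Inject the served model name only when the client did not choose one.
    body.setdefault("model", served_model_name)
    return path, body
```

With `served_model_name="Qwen/Qwen2.5-7B-Instruct"`, the chat example above maps to `/v1/chat/completions` with `model` filled in, while an input that already carries `model` is forwarded unchanged.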

Explicit Proxy Input

For direct control over the local endpoint, pass path, method, and body explicitly:

{
  "input": {
    "path": "/v1/chat/completions",
    "method": "POST",
    "body": {
      "model": "Qwen/Qwen2.5-7B-Instruct",
      "messages": [
        { "role": "user", "content": "Return JSON only." }
      ],
      "stream": true
    }
  }
}
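
How such an explicit input might be turned into a call against the local backend can be sketched as follows. The names `ProxyCall` and `resolve_proxy_input` are hypothetical, and two behaviours are assumptions rather than documented facts: the method defaults to POST when a body is present and GET otherwise, and streaming is detected from `body.stream`.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ProxyCall:
    method: str
    url: str
    body: Optional[Dict[str, Any]]
    stream: bool


def resolve_proxy_input(job_input: Dict[str, Any], port: int = 8000) -> ProxyCall:
    """Translate an explicit proxy input into a call on the local backend."""
    path = job_input.get("path")
    if not isinstance(path, str) or not path.startswith("/"):
        raise ValueError("'path' must be a string starting with '/'")
    body = job_input.get("body")
    # Assumption: POST when a body is present, GET otherwise, unless
    # the input names a method explicitly.
    default_method = "POST" if body is not None else "GET"
    method = str(job_input.get("method", default_method)).upper()
    # Assumption: streaming is requested via "stream": true in the body.
    stream = bool(body.get("stream")) if isinstance(body, dict) else False
    return ProxyCall(method, f"http://127.0.0.1:{port}{path}", body, stream)
```

For the example above this yields a streaming POST to `http://127.0.0.1:8000/v1/chat/completions`, where 8000 is the `RVLLM_PORT` default.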

Repository Layout

rvLLM-serverless/
├── .runpod/hub.json
├── builder/
│   ├── download_model.py
│   └── requirements.txt
├── scripts/
│   ├── build.sh
│   └── smoke_test.sh
├── src/
│   ├── config.py
│   ├── handler.py
│   ├── proxy.py
│   ├── request_mapping.py
│   └── server_launcher.py
└── tests/
    ├── test_config.py
    └── test_request_mapping.py

Verification

What has been verified so far:

local Python tests for config and request mapping
local Docker build flow on macOS with linux/amd64
published Docker image build and push
real RunPod GPU startup testing
startup fix for missing PTX kernel packaging

Run local checks:

cd rvLLM-serverless
./scripts/smoke_test.sh

Notes

The wrapper intentionally targets the existing rvllm serve CLI surface.
This repo is meant to stay thin. Inference behavior belongs in rvLLM, not here.
PTX kernels are compiled during image build and packaged into the runtime image.
