Production-ready gateway that puts an OpenAI-compatible and Anthropic Messages-compatible API in front of any open-source LLM runtime. One deployment pattern, one client library set, any backend.
```
          ┌────────────────────────────────────────────────────────┐
clients   │ OpenAI SDK / Anthropic SDK / curl / any HTTP client    │
──────────┼────────────────────────────────────────────────────────┤
TLS       │ Nginx / Caddy (optional reverse proxy + TLS)           │
──────────┼────────────────────────────────────────────────────────┤
          │ FastAPI gateway   auth · rate limit · CORS · metrics   │
this      │   /livez  /readyz  /health  /metrics                   │
repo      │   /v1/chat/completions                  (stream: yes)  │
          │   /v1/completions                       (stream: yes)  │
          │   /v1/embeddings                                       │
          │   /v1/messages                          (stream: yes)  │
          │   /v1/messages/count_tokens                            │
──────────┼────────────────────────────────────────────────────────┤
          │ Backend (pick one, swap via env + compose profile)     │
runtime   │ vLLM · Ollama · llama.cpp · TGI · SGLang · LocalAI     │
          │ LM Studio · any OpenAI-compatible URL                  │
          └────────────────────────────────────────────────────────┘
```
```bash
git clone https://github.com/varad-more/selfhosted-chat-api selfhosted-chat-api
cd selfhosted-chat-api
make demo   # env + compose up + pull qwen2.5:0.5b-instruct (~400 MB)

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Authorization: Bearer demo-key" \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5:0.5b-instruct","messages":[{"role":"user","content":"hi"}]}'
```

That's it. Point any OpenAI/Anthropic SDK at `http://127.0.0.1:8000/v1` and it works.
| Feature | Status |
|---|---|
| OpenAI chat, completions, models, embeddings | Yes |
| OpenAI streaming (SSE passthrough) | Yes |
| Anthropic `/v1/messages` (non-streaming) | Yes |
| Anthropic `/v1/messages` (streaming, event-translated) | Yes |
| Anthropic `/v1/messages/count_tokens` (heuristic) | Yes |
| API key auth (multiple keys, `Authorization` + `x-api-key`) | Yes |
| Structured JSON logs with request IDs | Yes |
| Prometheus metrics at `/metrics` | Yes |
| `/livez` + `/readyz` + `/health` probes | Yes |
| In-process rate limiting (token bucket) | Yes |
| CORS, proxy-header handling | Yes |
| Tests + CI + linted codebase | Yes |
Supports any of these open-source LLM runtimes out of the box, selectable via compose profile and a single env var:

- vLLM
- Ollama
- llama.cpp (`llama-server`)
- Hugging Face Text Generation Inference (TGI)
- SGLang
- LocalAI
- LM Studio (local server mode)
- any other OpenAI-compatible endpoint (set `BACKEND_KIND=openai`)
Why a gateway in front of an OpenAI-compatible server? A stable public surface, API key enforcement, a real Anthropic-compatible facade (including streaming), metrics, rate limiting, health/readiness probes, and uniform error envelopes — without bolting any of that onto the inference engine.
- Quick start
- Repository layout
- Supported backends
- Configuration reference
- API surface
- Client examples
- Reproducibility matrix
- Observability
- Security & production posture
- Testing & CI
- Troubleshooting
- Further docs
```bash
git clone <your-fork-url> selfhosted-chat-api
cd selfhosted-chat-api

# Pick a backend. Examples: vllm, ollama, llamacpp, tgi, sglang, localai, external
make env-vllm          # copies deploy/env/vllm.env to .env
$EDITOR .env           # set API_KEYS=... and any backend knobs
make up BACKEND=vllm   # docker compose --profile vllm up -d --build
make health            # confirms /health reports backend_ok
```

First request:

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Say hi."}],
    "max_tokens": 32
  }'
```

Point any OpenAI SDK at `http://127.0.0.1:8000/v1`:
```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="YOUR_API_KEY")
print(client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
).choices[0].message.content)
```
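Streaming uses the same client; a minimal sketch (model name and key are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="YOUR_API_KEY")

# stream=True: the gateway passes the backend's SSE chunks straight through
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```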
Point any Anthropic SDK at the same base URL:

```python
import anthropic

client = anthropic.Anthropic(
    base_url="http://127.0.0.1:8000",
    api_key="YOUR_API_KEY",
)
msg = client.messages.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_tokens=256,
    messages=[{"role": "user", "content": "ping"}],
)
print(msg.content[0].text)
```
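Anthropic-style streaming also works, since the gateway translates the backend's SSE into Anthropic stream events; a minimal sketch with the same placeholders:

```python
import anthropic

client = anthropic.Anthropic(base_url="http://127.0.0.1:8000", api_key="YOUR_API_KEY")

# The SDK's streaming helper consumes the gateway's event-translated stream
with client.messages.stream(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_tokens=256,
    messages=[{"role": "user", "content": "Count to five."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
print()
```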
```
selfhosted-chat-api/
├── api/
│ ├── app/
│ │ ├── backends.py # Backend profiles (vLLM, Ollama, ...)
│ │ ├── claude_translate.py # OpenAI <-> Anthropic translation (incl. SSE)
│ │ ├── config.py # Env-driven settings
│ │ ├── docs_page.py # Human-readable /docs page
│ │ ├── errors.py
│ │ ├── http_client.py # Shared httpx.AsyncClient lifecycle
│ │ ├── logging_setup.py # Structured JSON logs + request IDs
│ │ ├── main.py # FastAPI factory + lifespan
│ │ ├── metrics.py # Dependency-free Prometheus counters/histograms
│ │ ├── middleware.py # Request ID, access log, rate limit
│ │ ├── proxy.py # Backend proxying helpers (JSON + SSE)
│ │ ├── rate_limit.py # In-process token bucket
│ │ └── routes/
│ │ ├── claude.py # /v1/messages + /v1/messages/count_tokens
│ │ ├── openai.py # /v1/models, /v1/chat/completions, /v1/completions, /v1/embeddings
│ │ └── system.py # /, /docs, /reference, /health, /livez, /readyz, /metrics
│ ├── tests/ # Pytest + httpx MockTransport
│ ├── Dockerfile # Non-root, slim, with HEALTHCHECK
│ ├── main.py # Thin shim for backward compat
│ ├── pyproject.toml # Ruff + pytest config
│ ├── requirements.txt
│ └── requirements-dev.txt
├── deploy/
│ ├── env/ # Per-backend .env templates
│ └── nginx/ # Reverse proxy sample config
├── docs/
│ ├── API_CLAUDE.md
│ ├── API_OPENAI.md
│ ├── BACKENDS.md # Per-backend install & quirks
│ ├── DEPLOYMENT.md
│ ├── MODELS.md # Curated open-source model catalog
│ ├── NGINX.md
│ ├── OPERATIONS.md
│ └── USE_CASE_TUNING.md
├── examples/ # curl + Python clients
├── .github/workflows/ci.yml # Lint, tests, Docker build, compose validation
├── .env.example
├── docker-compose.yml # One file, compose profiles per backend
├── Makefile # Dev + ops shortcuts
└── README.md
```
| `BACKEND_KIND` | Runtime | Compose profile | Embeddings | Streaming | Tools |
|---|---|---|---|---|---|
| `vllm` | vLLM | `vllm` | yes | yes | yes (model-dependent) |
| `ollama` | Ollama | `ollama` | yes | yes | yes (model-dependent) |
| `llamacpp` | llama.cpp `llama-server` | `llamacpp` | yes | yes | partial |
| `tgi` | HF TGI | `tgi` | no | yes | yes (model-dependent) |
| `sglang` | SGLang | `sglang` | yes | yes | yes |
| `localai` | LocalAI | `localai` | yes | yes | yes |
| `lmstudio` | LM Studio | — (run on host) | yes | yes | yes |
| `openai` | Any OpenAI-compatible URL | none | depends | depends | depends |
See `docs/BACKENDS.md` for launch flags, model formats, GPU requirements, and specific gotchas per backend.

To switch backends:

```bash
docker compose down
make env-ollama
make up BACKEND=ollama
```

Or manage `.env` manually and run `docker compose --profile <name> up -d --build`.
All configuration is environment-driven. Every variable has a safe default where one makes sense; see `.env.example` for the canonical template.
| Variable | Default | Purpose |
|---|---|---|
| `API_HOST` | `127.0.0.1` | Host interface the gateway binds on. Keep `127.0.0.1` when fronting with Nginx. |
| `API_PORT` | `8000` | Gateway port. |
| `API_KEYS` | (empty) | Comma-separated API keys. Empty disables auth (dev only). |
| `CORS_ORIGINS` | `*` | Comma-separated list; tighten in production. |
| `CORS_ALLOW_CREDENTIALS` | `false` | Enable only with explicit origins. |
| `RATE_LIMIT_ENABLED` | `false` | Turn on token-bucket limiting for `/v1/*`. |
| `RATE_LIMIT_RPM` | `120` | Per-identity sustained rate. |
| `RATE_LIMIT_BURST` | `30` | Per-identity burst capacity. |
| `REQUEST_TIMEOUT_S` | `600` | End-to-end backend timeout. |
| `CONNECT_TIMEOUT_S` | `10` | TCP connect timeout. |
| `LOG_LEVEL` | `INFO` | Python log level. |
| `LOG_JSON` | `true` | Structured JSON logs (disable for local dev). |
| `LOG_PROMPTS` | `false` | If true, prompt payloads may show up in debug logs. Treat as PII. |
| `METRICS_ENABLED` | `true` | Expose `/metrics`. |
| Variable | Default | Purpose |
|---|---|---|
| `BACKEND_KIND` | `vllm` | One of `vllm`, `ollama`, `llamacpp`, `tgi`, `sglang`, `localai`, `lmstudio`, `openai`. |
| `BACKEND_BASE_URL` | `http://vllm:8001/v1` | OpenAI-compatible base URL of the runtime. |
| `BACKEND_API_KEY` | (empty) | Set if the backend requires its own API key (e.g. `llama-server --api-key`). |
| `MODEL_NAME` | `Qwen/Qwen2.5-7B-Instruct` | Default model advertised in docs and examples. |
Backend-specific knobs (`VLLM_DTYPE`, `LLAMACPP_CTX`, etc.) live in `.env.example` and only apply when that profile is active.
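As a worked example, a production-leaning `.env` combining both tables might look like this (every value here is illustrative):

```bash
# Gateway
API_HOST=127.0.0.1                     # loopback only; Nginx fronts with TLS
API_PORT=8000
API_KEYS=key-for-app-a,key-for-app-b   # comma-separated, one per consumer
CORS_ORIGINS=https://app.example.com   # explicit origin, not *
RATE_LIMIT_ENABLED=true
RATE_LIMIT_RPM=120
RATE_LIMIT_BURST=30

# Backend
BACKEND_KIND=vllm
BACKEND_BASE_URL=http://vllm:8001/v1
MODEL_NAME=Qwen/Qwen2.5-7B-Instruct
```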
| Method | Path | Auth | What |
|---|---|---|---|
| GET | `/` | no | Service summary, endpoint map. |
| GET | `/docs` | no | Human-readable HTML docs. |
| GET | `/reference` | no | Swagger UI. |
| GET | `/openapi.json` | no | OpenAPI schema. |
| GET | `/livez` | no | Liveness probe (doesn't touch backend). |
| GET | `/readyz` | no | Readiness probe (verifies backend reachable). |
| GET | `/health` | no | Combined gateway + backend health. |
| GET | `/metrics` | no | Prometheus exposition. |
| Method | Path |
|---|---|
| GET | /v1/models |
| POST | /v1/chat/completions (streaming + non-streaming) |
| POST | /v1/completions (streaming + non-streaming) |
| POST | /v1/embeddings |
| Method | Path |
|---|---|
| POST | /v1/messages (streaming + non-streaming) |
| POST | /v1/messages/count_tokens (heuristic) |
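For example, token counting through the Anthropic SDK (the gateway's count is heuristic, so treat the number as an estimate; the response field is per the SDK's token-count schema):

```python
import anthropic

client = anthropic.Anthropic(base_url="http://127.0.0.1:8000", api_key="YOUR_API_KEY")

count = client.messages.count_tokens(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "How many tokens is this?"}],
)
print(count.input_tokens)  # heuristic estimate from the gateway
```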
All `/v1/*` endpoints require an API key when `API_KEYS` is set. Send it as `Authorization: Bearer <key>` or `x-api-key: <key>`.
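Both header styles pass the same check:

```bash
# OpenAI-style header
curl http://127.0.0.1:8000/v1/models -H "Authorization: Bearer $API_KEY"

# Anthropic-style header
curl http://127.0.0.1:8000/v1/models -H "x-api-key: $API_KEY"
```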
Detailed docs: `docs/API_OPENAI.md` (OpenAI surface) and `docs/API_CLAUDE.md` (Anthropic facade).
Runnable scripts live under `examples/`:

| Script | What it does |
|---|---|
| `openai_chat_curl.sh` | Non-streaming OpenAI chat completion. |
| `openai_stream_curl.sh` | Streaming OpenAI chat completion (SSE). |
| `openai_embeddings_curl.sh` | Embeddings request. |
| `claude_messages_curl.sh` | Non-streaming Anthropic Messages request. |
| `claude_stream_curl.sh` | Streaming Anthropic Messages request. |
| `rag_chat_curl.sh` | RAG prompt template. |
| `extraction_json_curl.sh` | JSON extraction prompt template. |
| `agent_tools_curl.sh` | OpenAI tool-calling request (requires a tool-capable model). |
| `python_openai_client.py` | Minimal OpenAI SDK client. |
| `python_anthropic_client.py` | Minimal Anthropic SDK client. |
These pinned combinations are what CI and smoke tests exercise. Use them verbatim for a known-good baseline and vary intentionally from there.
| Component | Pinned value |
|---|---|
| Gateway Python | 3.12 |
| FastAPI / Uvicorn / httpx | 0.116.1 / 0.35.0 / 0.28.1 (see api/requirements.txt) |
| Gateway Docker base | python:3.12-slim |
| Default backend | vllm/vllm-openai:latest |
| Default model | Qwen/Qwen2.5-7B-Instruct |
| Default dtype | half |
| Default max context | 16384 |
| Default GPU memory util. | 0.92 |
| Reference GPU class | NVIDIA A10G / L4 / 24 GB-class |
| CUDA (via vLLM image) | shipped by the image (do not install locally) |
Deterministic build:

```bash
docker compose --profile vllm build --pull api
docker compose --profile vllm up -d
```

To pin upstream images for immutability, tag them in your registry and replace `vllm/vllm-openai:latest` (etc.) with the SHA-pinned digest in `docker-compose.yml`.
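An illustrative pin in `docker-compose.yml` (the digest is a placeholder, not a published one):

```yaml
services:
  vllm:
    image: vllm/vllm-openai@sha256:<digest-from-your-registry>   # placeholder digest
```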
Model weight locations are mounted to:
- `./data/hf-cache` — Hugging Face cache (vLLM, TGI, SGLang)
- `./data/ollama` — Ollama model store
- `./data/models` — llama.cpp GGUF files
- `./data/localai-models` — LocalAI catalog
- `./data/vllm-cache` — vLLM compilation cache
Back these up (or restore into them) to reproduce an environment on a new host without re-downloading weights.
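A plain-tar sketch of that backup (archive only the directories your backend actually populates):

```bash
# Snapshot the weight caches listed above; extract on the new host to restore
tar czf model-caches.tgz ./data/hf-cache ./data/ollama ./data/models \
    ./data/localai-models ./data/vllm-cache
```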
A curated catalog of tested open-source models (by task, GPU class, and license) is in `docs/MODELS.md`.
Structured JSON by default, one record per line, including a `request_id`:

```json
{"ts":"2026-04-22T12:00:00.123Z","level":"INFO","logger":"chat_api.http",
 "msg":"request","request_id":"7f9...","method":"POST",
 "path":"/v1/chat/completions","status":200,"duration_ms":842.12}
```

Disable with `LOG_JSON=false` for friendlier local output.
`GET /metrics` returns Prometheus text:

- `chat_api_uptime_seconds`
- `chat_api_requests_total{method,path,status}`
- `chat_api_request_duration_seconds` (histogram)
- `chat_api_backend_errors_total{kind}`
- `chat_api_rate_limited_total{identity}`

Point any Prometheus scraper at `<gateway>/metrics`.
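A minimal scrape job, assuming the scraper can reach the gateway (the target is a placeholder):

```yaml
scrape_configs:
  - job_name: selfhosted-chat-api
    metrics_path: /metrics
    static_configs:
      - targets: ["gateway-host:8000"]   # placeholder host:port
```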
| Path | Use |
|---|---|
| `/livez` | Kubernetes liveness — returns 200 unless the process is dead. |
| `/readyz` | Readiness — 200 when the backend responds, 503 otherwise. |
| `/health` | Combined human-readable status. |
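On Kubernetes, the probes wire up along these lines (a sketch assuming the container listens on 8000):

```yaml
livenessProbe:
  httpGet:
    path: /livez
    port: 8000
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 8000
  periodSeconds: 5
```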
Let a teammate hit your machine directly:

```bash
# in .env
API_HOST=0.0.0.0   # bind on all interfaces, still protected by API_KEYS
API_KEYS=pick-something-random-and-share-it
```

Then give them your IP: `http://<your-lan-ip>:8000/v1`. Keep `API_KEYS` set — `0.0.0.0` means anyone on the same network can reach the port.
Expose a local instance over HTTPS for a few minutes of demoing:

```bash
# Option 1: cloudflared (free, no signup for quick tunnels)
cloudflared tunnel --url http://127.0.0.1:8000

# Option 2: ngrok
ngrok http 8000
```

Both print a public `https://...` URL your peer can use. Tear down when done.
Share these lines and they're running an OpenAI SDK client against your deployment:

```python
from openai import OpenAI

client = OpenAI(base_url="https://<your-url>/v1", api_key="<the-shared-key>")
client.chat.completions.create(model="<your-model>", messages=[{"role":"user","content":"hi"}])
```

- Always set `API_KEYS` in any deployment reachable beyond localhost.
- Bind the API container to `127.0.0.1` and place Nginx / Caddy in front with TLS. Sample config at `deploy/nginx/selfhosted-chat-api.conf`; a sketch of the idea follows this list.
- Keep the inference backend on the internal Docker network. Never expose vLLM/Ollama/TGI/... directly unless you want an unauthenticated inference endpoint on the internet.
- The API container runs non-root with a `read_only` rootfs, `no-new-privileges`, and all Linux capabilities dropped.
- For multi-replica deployments, disable the in-process limiter and use a real one (Nginx `limit_req`, Envoy, or Redis-backed).
- `LOG_PROMPTS=false` by default — prompt bodies are not logged. Keep it off unless you accept the PII implications.
- `CORS_ORIGINS=*` is fine for public APIs that don't use cookies. For any deployment that does, set explicit origins.
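The shipped `deploy/nginx/selfhosted-chat-api.conf` is the reference config; the shape of the idea, with placeholder server name and cert paths, is roughly:

```nginx
server {
    listen 443 ssl;
    server_name api.example.com;                  # placeholder
    ssl_certificate     /etc/ssl/fullchain.pem;   # placeholder
    ssl_certificate_key /etc/ssl/privkey.pem;     # placeholder

    location / {
        proxy_pass http://127.0.0.1:8000;         # gateway stays on loopback
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_buffering off;                      # keep SSE streaming unbuffered
        proxy_read_timeout 600s;                  # match REQUEST_TIMEOUT_S
    }
}
```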
Run locally against a conda env:

```bash
conda create -y -n selfhosted-chat-api python=3.12
conda activate selfhosted-chat-api
cd api && pip install -r requirements-dev.txt
pytest -q
ruff check app tests
```

GitHub Actions runs on every push/PR:

- `ruff check`
- `pytest`
- Docker build for the gateway
- `docker compose config` against every profile
Tests use httpx.MockTransport and cover: auth, backend proxying, Claude
translation (both non-streaming and streaming SSE), health/readyz/metrics,
request IDs, rate-limit math, and backend capability detection.
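Reduced to a sketch, the `httpx.MockTransport` pattern looks like this (the handler and assertions are illustrative, not the repo's actual fixtures):

```python
import httpx

def fake_backend(request: httpx.Request) -> httpx.Response:
    # Stand in for an OpenAI-compatible backend without any network I/O
    return httpx.Response(200, json={"choices": [{"message": {"content": "ok"}}]})

client = httpx.Client(
    transport=httpx.MockTransport(fake_backend),
    base_url="http://backend",
)
resp = client.post("/v1/chat/completions", json={"messages": []})
assert resp.json()["choices"][0]["message"]["content"] == "ok"
```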
| Symptom | Likely cause | Fix |
|---|---|---|
| `/health` returns 503 | Backend not reachable (still downloading weights, wrong URL, OOM) | `docker compose logs <backend>` |
| 401 on `/v1/*` | Missing/mismatched API key | Send `Authorization: Bearer ...` or `x-api-key: ...` |
| 429 | In-process rate limit triggered | Lower traffic or raise `RATE_LIMIT_RPM` / `RATE_LIMIT_BURST` |
| 501 from `/v1/embeddings` | Backend doesn't expose embeddings (e.g. TGI) | Switch backend or run a separate embeddings server |
| Claude streaming is empty | Backend emitted no content or returned an error | Check backend logs; verify `stream: true` is supported |
| Slow first request | Model weights + kernel compile | Expected; later restarts reuse the cache volumes |
See `docs/OPERATIONS.md` for the full runbook.
- `docs/DEPLOYMENT.md` — host setup, first boot, rollback
- `docs/BACKENDS.md` — per-backend notes, GPU requirements, launch flags
- `docs/MODELS.md` — curated open-source model catalog by task and GPU class
- `docs/API_OPENAI.md` — OpenAI-compatible endpoints
- `docs/API_CLAUDE.md` — Anthropic Messages facade
- `docs/USE_CASE_TUNING.md` — chat / RAG / extraction / agent patterns
- `docs/NGINX.md` — reverse proxy configuration
- `docs/OPERATIONS.md` — day-2 operations and troubleshooting
This repo ships deployment glue only. The inference runtimes and models it orchestrates carry their own licenses — read them before using a model.