Self-Hosted Chat API

License: MIT

Production-ready gateway that puts an OpenAI-compatible and Anthropic Messages-compatible API in front of any open-source LLM runtime. One deployment pattern, one client library set, any backend.

Architecture

          ┌────────────────────────────────────────────────────────┐
 clients  │  OpenAI SDK / Anthropic SDK / curl / any HTTP client   │
──────────┼────────────────────────────────────────────────────────┤
   TLS    │  Nginx / Caddy     (optional reverse proxy + TLS)      │
──────────┼────────────────────────────────────────────────────────┤
          │  FastAPI gateway    auth · rate limit · CORS · metrics │
   this   │                     /livez /readyz /health /metrics    │
   repo   │                     /v1/chat/completions  (stream: yes)│
          │                     /v1/completions       (stream: yes)│
          │                     /v1/embeddings                     │
          │                     /v1/messages          (stream: yes)│
          │                     /v1/messages/count_tokens          │
──────────┼────────────────────────────────────────────────────────┤
          │  Backend (pick one, swap via env + compose profile)    │
 runtime  │  vLLM · Ollama · llama.cpp · TGI · SGLang · LocalAI    │
          │  LM Studio · any OpenAI-compatible URL                 │
          └────────────────────────────────────────────────────────┘

Demo on your laptop (no GPU)

git clone https://github.com/varad-more/selfhosted-chat-api selfhosted-chat-api
cd selfhosted-chat-api
make demo           # env + compose up + pull qwen2.5:0.5b-instruct (~400 MB)

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Authorization: Bearer demo-key" \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5:0.5b-instruct","messages":[{"role":"user","content":"hi"}]}'

That's it. Point any OpenAI SDK at http://127.0.0.1:8000/v1 (or any Anthropic SDK at http://127.0.0.1:8000) and it works.

| Feature | Status |
| --- | --- |
| OpenAI chat, completions, models, embeddings | Yes |
| OpenAI streaming (SSE passthrough) | Yes |
| Anthropic /v1/messages (non-streaming) | Yes |
| Anthropic /v1/messages (streaming, event-translated) | Yes |
| Anthropic /v1/messages/count_tokens (heuristic) | Yes |
| API key auth (multiple keys, Authorization + x-api-key) | Yes |
| Structured JSON logs with request IDs | Yes |
| Prometheus metrics at /metrics | Yes |
| /livez + /readyz + /health probes | Yes |
| In-process rate limiting (token bucket) | Yes |
| CORS, proxy-header handling | Yes |
| Tests + CI + linted codebase | Yes |

The gateway supports all of these open-source LLM runtimes out of the box, selectable via a compose profile and a single env var: vLLM, Ollama, llama.cpp, TGI, SGLang, LocalAI, LM Studio, plus any other OpenAI-compatible URL (see Supported backends below).

Why a gateway in front of an OpenAI-compatible server? A stable public surface, API key enforcement, a real Anthropic-compatible facade (including streaming), metrics, rate limiting, health/readiness probes, and uniform error envelopes — without bolting any of that onto the inference engine.


Table of contents

  1. Quick start
  2. Repository layout
  3. Supported backends
  4. Configuration reference
  5. API surface
  6. Client examples
  7. Reproducibility matrix
  8. Observability
  9. Security & production posture
  10. Testing & CI
  11. Troubleshooting
  12. Further docs

Quick start

git clone <your-fork-url> selfhosted-chat-api
cd selfhosted-chat-api

# Pick a backend. Examples: vllm, ollama, llamacpp, tgi, sglang, localai, external
make env-vllm                    # copies deploy/env/vllm.env to .env
$EDITOR .env                     # set API_KEYS=... and any backend knobs

make up BACKEND=vllm             # docker compose --profile vllm up -d --build
make health                      # confirms /health reports backend_ok

First request:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Say hi."}],
    "max_tokens": 32
  }'

Point any OpenAI SDK at http://127.0.0.1:8000/v1:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="YOUR_API_KEY")
print(client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
).choices[0].message.content)

Point any Anthropic SDK at the same base URL:

import anthropic
client = anthropic.Anthropic(
    base_url="http://127.0.0.1:8000",
    api_key="YOUR_API_KEY",
)
msg = client.messages.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_tokens=256,
    messages=[{"role": "user", "content": "ping"}],
)
print(msg.content[0].text)

Repository layout

selfhosted-chat-api/
├── api/
│   ├── app/
│   │   ├── backends.py          # Backend profiles (vLLM, Ollama, ...)
│   │   ├── claude_translate.py  # OpenAI <-> Anthropic translation (incl. SSE)
│   │   ├── config.py            # Env-driven settings
│   │   ├── docs_page.py         # Human-readable /docs page
│   │   ├── errors.py
│   │   ├── http_client.py       # Shared httpx.AsyncClient lifecycle
│   │   ├── logging_setup.py     # Structured JSON logs + request IDs
│   │   ├── main.py              # FastAPI factory + lifespan
│   │   ├── metrics.py           # Dependency-free Prometheus counters/histograms
│   │   ├── middleware.py        # Request ID, access log, rate limit
│   │   ├── proxy.py             # Backend proxying helpers (JSON + SSE)
│   │   ├── rate_limit.py        # In-process token bucket
│   │   └── routes/
│   │       ├── claude.py        # /v1/messages + /v1/messages/count_tokens
│   │       ├── openai.py        # /v1/models, /v1/chat/completions, /v1/completions, /v1/embeddings
│   │       └── system.py        # /, /docs, /reference, /health, /livez, /readyz, /metrics
│   ├── tests/                   # Pytest + httpx MockTransport
│   ├── Dockerfile               # Non-root, slim, with HEALTHCHECK
│   ├── main.py                  # Thin shim for backward compat
│   ├── pyproject.toml           # Ruff + pytest config
│   ├── requirements.txt
│   └── requirements-dev.txt
├── deploy/
│   ├── env/                     # Per-backend .env templates
│   └── nginx/                   # Reverse proxy sample config
├── docs/
│   ├── API_CLAUDE.md
│   ├── API_OPENAI.md
│   ├── BACKENDS.md              # Per-backend install & quirks
│   ├── DEPLOYMENT.md
│   ├── MODELS.md                # Curated open-source model catalog
│   ├── NGINX.md
│   ├── OPERATIONS.md
│   └── USE_CASE_TUNING.md
├── examples/                    # curl + Python clients
├── .github/workflows/ci.yml     # Lint, tests, Docker build, compose validation
├── .env.example
├── docker-compose.yml           # One file, compose profiles per backend
├── Makefile                     # Dev + ops shortcuts
└── README.md

Supported backends

| BACKEND_KIND | Runtime | Compose profile | Embeddings | Streaming | Tools |
| --- | --- | --- | --- | --- | --- |
| vllm | vLLM | vllm | yes | yes | yes (model-dependent) |
| ollama | Ollama | ollama | yes | yes | yes (model-dependent) |
| llamacpp | llama.cpp llama-server | llamacpp | yes | yes | partial |
| tgi | HF TGI | tgi | no | yes | yes (model-dependent) |
| sglang | SGLang | sglang | yes | yes | yes |
| localai | LocalAI | localai | yes | yes | yes |
| lmstudio | LM Studio | — (run on host) | yes | yes | yes |
| openai | Any OpenAI-compatible URL | none | depends | depends | depends |

See docs/BACKENDS.md for launch flags, model formats, GPU requirements, and specific gotchas per backend.

Switching backends

docker compose down
make env-ollama
make up BACKEND=ollama

Or manage .env manually and run docker compose --profile <name> up -d --build.


Configuration reference

All configuration is environment-driven. Every variable has a safe default when it makes sense; see .env.example for the canonical template.

Gateway

| Variable | Default | Purpose |
| --- | --- | --- |
| API_HOST | 127.0.0.1 | Host interface the gateway binds on. Keep 127.0.0.1 when fronting with Nginx. |
| API_PORT | 8000 | Gateway port. |
| API_KEYS | (empty) | Comma-separated API keys. Empty disables auth (dev only). |
| CORS_ORIGINS | * | Comma-separated list; tighten in production. |
| CORS_ALLOW_CREDENTIALS | false | Enable only with explicit origins. |
| RATE_LIMIT_ENABLED | false | Turn on token-bucket limiting for /v1/*. |
| RATE_LIMIT_RPM | 120 | Per-identity sustained rate. |
| RATE_LIMIT_BURST | 30 | Per-identity burst capacity. |
| REQUEST_TIMEOUT_S | 600 | End-to-end backend timeout. |
| CONNECT_TIMEOUT_S | 10 | TCP connect timeout. |
| LOG_LEVEL | INFO | Python log level. |
| LOG_JSON | true | Structured JSON logs (disable for local dev). |
| LOG_PROMPTS | false | If true, prompt payloads may show up in debug logs. Treat as PII. |
| METRICS_ENABLED | true | Expose /metrics. |

Backend

| Variable | Default | Purpose |
| --- | --- | --- |
| BACKEND_KIND | vllm | One of vllm, ollama, llamacpp, tgi, sglang, localai, lmstudio, openai. |
| BACKEND_BASE_URL | http://vllm:8001/v1 | OpenAI-compatible base URL of the runtime. |
| BACKEND_API_KEY | (empty) | If the backend requires its own API key (e.g. llama-server --api-key). |
| MODEL_NAME | Qwen/Qwen2.5-7B-Instruct | Default model advertised in docs and examples. |

Backend-specific knobs (VLLM_DTYPE, LLAMACPP_CTX, etc.) live in .env.example and only apply when that profile is active.
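
Pulling the two tables together, a minimal .env for the vLLM profile could look like this (illustrative values assembled from the reference above, not a copy of deploy/env/vllm.env):

# gateway
API_HOST=127.0.0.1
API_PORT=8000
API_KEYS=replace-with-a-long-random-key
RATE_LIMIT_ENABLED=true
RATE_LIMIT_RPM=120
RATE_LIMIT_BURST=30

# backend
BACKEND_KIND=vllm
BACKEND_BASE_URL=http://vllm:8001/v1
MODEL_NAME=Qwen/Qwen2.5-7B-Instruct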


API surface

Service & observability

| Method | Path | Auth | What |
| --- | --- | --- | --- |
| GET | / | no | Service summary, endpoint map. |
| GET | /docs | no | Human-readable HTML docs. |
| GET | /reference | no | Swagger UI. |
| GET | /openapi.json | no | OpenAPI schema. |
| GET | /livez | no | Liveness probe (doesn't touch backend). |
| GET | /readyz | no | Readiness probe (verifies backend reachable). |
| GET | /health | no | Combined gateway + backend health. |
| GET | /metrics | no | Prometheus exposition. |

OpenAI-compatible

| Method | Path |
| --- | --- |
| GET | /v1/models |
| POST | /v1/chat/completions (streaming + non-streaming) |
| POST | /v1/completions (streaming + non-streaming) |
| POST | /v1/embeddings |

Anthropic-compatible

| Method | Path |
| --- | --- |
| POST | /v1/messages (streaming + non-streaming) |
| POST | /v1/messages/count_tokens (heuristic) |

All /v1/* endpoints require an API key when API_KEYS is set. Send it as Authorization: Bearer <key> or x-api-key: <key>.
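
For example, both header styles carry the same key; the second request also streams through the Anthropic facade (the model name assumes the default MODEL_NAME, so substitute whatever your backend actually serves):

# OpenAI-style bearer header
curl http://127.0.0.1:8000/v1/models \
  -H "Authorization: Bearer $API_KEY"

# Anthropic-style header, streaming Messages request (-N disables curl buffering)
curl -N http://127.0.0.1:8000/v1/messages \
  -H "x-api-key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","max_tokens":64,"stream":true,"messages":[{"role":"user","content":"hi"}]}'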

Detailed docs: docs/API_OPENAI.md and docs/API_CLAUDE.md.


Client examples

Runnable scripts live under examples/:

| Script | What it does |
| --- | --- |
| openai_chat_curl.sh | Non-streaming OpenAI chat completion. |
| openai_stream_curl.sh | Streaming OpenAI chat completion (SSE). |
| openai_embeddings_curl.sh | Embeddings request. |
| claude_messages_curl.sh | Non-streaming Anthropic Messages request. |
| claude_stream_curl.sh | Streaming Anthropic Messages request. |
| rag_chat_curl.sh | RAG prompt template. |
| extraction_json_curl.sh | JSON extraction prompt template. |
| agent_tools_curl.sh | OpenAI tool-calling request (requires tool-capable model). |
| python_openai_client.py | Minimal OpenAI SDK client. |
| python_anthropic_client.py | Minimal Anthropic SDK client. |
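
If you want streaming from the OpenAI SDK rather than curl, here is a minimal sketch (not one of the bundled scripts; same base URL and key as in Quick start, model name illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="YOUR_API_KEY")

# SSE chunks are relayed from the backend; print content deltas as they arrive.
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Stream a short haiku."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()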

Reproducibility matrix

These pinned combinations are what CI and smoke tests exercise. Use them verbatim for a known-good baseline and vary intentionally from there.

| Component | Pinned value |
| --- | --- |
| Gateway Python | 3.12 |
| FastAPI / Uvicorn / httpx | 0.116.1 / 0.35.0 / 0.28.1 (see api/requirements.txt) |
| Gateway Docker base | python:3.12-slim |
| Default backend | vllm/vllm-openai:latest |
| Default model | Qwen/Qwen2.5-7B-Instruct |
| Default dtype | half |
| Default max context | 16384 |
| Default GPU memory util. | 0.92 |
| Reference GPU class | NVIDIA A10G / L4 / 24 GB-class |
| CUDA (via vLLM image) | shipped by the image (do not install locally) |

Deterministic build:

docker compose --profile vllm build --pull api
docker compose --profile vllm up -d

To pin upstream images for immutability, tag them in your registry and replace vllm/vllm-openai:latest (etc.) with the SHA-pinned digest in docker-compose.yml.
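
A digest-pinned service entry might look like the sketch below; <digest> is a placeholder for the digest you mirror into your registry, and the service/profile name assumes the vllm layout used elsewhere in this README (check docker-compose.yml for the exact structure):

services:
  vllm:
    # an immutable digest instead of the mutable :latest tag
    image: vllm/vllm-openai@sha256:<digest>
    profiles: ["vllm"]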

Model weight locations are mounted to:

  • ./data/hf-cache — Hugging Face cache (vLLM, TGI, SGLang)
  • ./data/ollama — Ollama model store
  • ./data/models — llama.cpp GGUF files
  • ./data/localai-models — LocalAI catalog
  • ./data/vllm-cache — vLLM compilation cache

Back these up (or restore into them) to reproduce an environment on a new host without re-downloading weights.
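
A minimal backup/restore sketch under those mount paths (plain tar; adjust to whichever backends you actually run):

# old host: snapshot the weight caches
tar czf model-data.tgz ./data

# new host: restore before `make up` so nothing is re-downloaded
tar xzf model-data.tgz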

A curated catalog of tested open-source models (by task, GPU class, and license) is in docs/MODELS.md.


Observability

Logs

Structured JSON by default, one record per line, includes request_id:

{"ts":"2026-04-22T12:00:00.123Z","level":"INFO","logger":"chat_api.http",
 "msg":"request","request_id":"7f9...","method":"POST",
 "path":"/v1/chat/completions","status":200,"duration_ms":842.12}

Disable with LOG_JSON=false for friendlier local output.

Metrics

GET /metrics returns Prometheus text:

  • chat_api_uptime_seconds
  • chat_api_requests_total{method,path,status}
  • chat_api_request_duration_seconds (histogram)
  • chat_api_backend_errors_total{kind}
  • chat_api_rate_limited_total{identity}

Point any Prometheus scraper at <gateway>/metrics.
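
A minimal scrape job, assuming Prometheus can reach the gateway at 127.0.0.1:8000:

scrape_configs:
  - job_name: chat_api
    metrics_path: /metrics
    static_configs:
      - targets: ["127.0.0.1:8000"]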

Probes

| Path | Use |
| --- | --- |
| /livez | Kubernetes liveness — returns 200 unless the process is dead. |
| /readyz | Readiness — 200 when the backend responds, 503 otherwise. |
| /health | Combined human-readable status. |
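
The repo ships compose files rather than manifests, but if you run the gateway container on Kubernetes yourself, the probes map directly onto the pod spec; a sketch, assuming the container listens on 8000:

# under the gateway container in a Deployment spec
livenessProbe:
  httpGet:
    path: /livez
    port: 8000
readinessProbe:
  httpGet:
    path: /readyz
    port: 8000
  periodSeconds: 10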

Sharing with peers

Same network (LAN)

Let a teammate hit your machine directly:

# in .env
API_HOST=0.0.0.0      # bind on all interfaces, still protected by API_KEYS
API_KEYS=pick-something-random-and-share-it

Then give them your IP: http://<your-lan-ip>:8000/v1. Keep API_KEYS set; 0.0.0.0 means anyone on the same network can reach the port.

Remote peer (no VPN, no deployment)

Expose a local instance over HTTPS for a few minutes of demoing:

# Option 1: cloudflared (free, no signup for quick tunnels)
cloudflared tunnel --url http://127.0.0.1:8000

# Option 2: ngrok
ngrok http 8000

Both print a public https://... URL your peer can use. Tear down when done.

Give them a config

Share these three lines and they're running an OpenAI SDK client against your deployment:

from openai import OpenAI
client = OpenAI(base_url="https://<your-url>/v1", api_key="<the-shared-key>")
client.chat.completions.create(model="<your-model>", messages=[{"role":"user","content":"hi"}])

Security & production posture

  • Always set API_KEYS in any deployment reachable beyond localhost.
  • Bind the API container to 127.0.0.1 and place Nginx / Caddy in front with TLS. Sample config at deploy/nginx/selfhosted-chat-api.conf.
  • Keep the inference backend on the internal Docker network. Never expose vLLM/Ollama/TGI/... directly unless you want an unauthenticated inference endpoint on the internet.
  • The API container runs non-root with read_only rootfs, no-new-privileges, and all Linux capabilities dropped.
  • For multi-replica deployments, disable the in-process limiter and use a real one (Nginx limit_req, Envoy, or Redis-backed); a minimal Nginx sketch follows this list.
  • LOG_PROMPTS=false by default — prompt bodies are not logged. Keep it off unless you accept the PII implications.
  • CORS_ORIGINS=* is fine for public APIs that don't use cookies. For any deployment that does, set explicit origins.
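
A minimal Nginx limit_req sketch (illustrative only; the shipped sample at deploy/nginx/selfhosted-chat-api.conf is the reference and may differ):

# in the http {} context: 2 requests/second sustained per client IP, 10 MB of shared state
limit_req_zone $binary_remote_addr zone=chat_api:10m rate=2r/s;

server {
    listen 80;  # TLS termination omitted here
    location /v1/ {
        limit_req zone=chat_api burst=30 nodelay;
        proxy_pass http://127.0.0.1:8000;
    }
}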

Testing & CI

Run locally against a conda env:

conda create -y -n selfhosted-chat-api python=3.12
conda activate selfhosted-chat-api
cd api && pip install -r requirements-dev.txt
pytest -q
ruff check app tests

GitHub Actions runs on every push/PR:

  • ruff check
  • pytest
  • Docker build for the gateway
  • docker compose config against every profile

Tests use httpx.MockTransport and cover: auth, backend proxying, Claude translation (both non-streaming and streaming SSE), health/readyz/metrics, request IDs, rate-limit math, and backend capability detection.
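
The core pattern, reduced to a self-contained sketch (not a test from api/tests/): swap the outbound transport for httpx.MockTransport so no real inference backend is needed.

import httpx

def fake_backend(request: httpx.Request) -> httpx.Response:
    # Stand-in for the inference runtime: every call returns a canned completion.
    return httpx.Response(200, json={"choices": [{"message": {"content": "pong"}}]})

def test_mock_backend_roundtrip():
    transport = httpx.MockTransport(fake_backend)
    with httpx.Client(transport=transport, base_url="http://backend") as client:
        r = client.post("/v1/chat/completions", json={"messages": []})
        assert r.status_code == 200
        assert r.json()["choices"][0]["message"]["content"] == "pong"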


Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| /health returns 503 | Backend not reachable (still downloading weights, wrong URL, OOM) | docker compose logs <backend> |
| 401 on /v1/* | Missing/mismatched API key | Send Authorization: Bearer ... or x-api-key: ... |
| 429 | In-process rate limit triggered | Lower traffic or raise RATE_LIMIT_RPM / RATE_LIMIT_BURST |
| 501 from /v1/embeddings | Backend doesn't expose embeddings (e.g. TGI) | Switch backend or run a separate embeddings server |
| Claude streaming is empty | Backend emitted no content or returned an error | Check backend logs; verify stream: true is supported |
| Slow first request | Model weights + kernel compile | Expected; later restarts reuse the cache volumes |

See docs/OPERATIONS.md for the full runbook.


Further docs

  • docs/API_OPENAI.md, docs/API_CLAUDE.md: full endpoint references
  • docs/BACKENDS.md: per-backend install notes and quirks
  • docs/MODELS.md: curated open-source model catalog
  • docs/DEPLOYMENT.md, docs/NGINX.md, docs/OPERATIONS.md, docs/USE_CASE_TUNING.md: deployment, reverse proxy, operations runbook, and use-case tuning guides

License

This repo ships deployment glue only. The inference runtimes and models it orchestrates carry their own licenses — read them before using a model.
