Production-ready gateway that puts an OpenAI-compatible and Anthropic Messages-compatible API in front of any open-source LLM runtime. One deployment pattern, one client library set, any backend.
```
          ┌────────────────────────────────────────────────────────┐
clients   │ OpenAI SDK / Anthropic SDK / curl / any HTTP client    │
──────────┼────────────────────────────────────────────────────────┤
TLS       │ Nginx / Caddy (optional reverse proxy + TLS)           │
──────────┼────────────────────────────────────────────────────────┤
          │ FastAPI gateway   auth · rate limit · CORS · metrics   │
this      │   /livez  /readyz  /health  /metrics                   │
repo      │   /v1/chat/completions                  (stream: yes)  │
          │   /v1/completions                       (stream: yes)  │
          │   /v1/embeddings                                       │
          │   /v1/messages                          (stream: yes)  │
          │   /v1/messages/count_tokens                            │
──────────┼────────────────────────────────────────────────────────┤
          │ Backend (pick one, swap via env + compose profile)     │
runtime   │ vLLM · Ollama · llama.cpp · TGI · SGLang · LocalAI     │
          │ LM Studio · any OpenAI-compatible URL                  │
          └────────────────────────────────────────────────────────┘
```
```bash
git clone https://github.com/varad-more/selfhosted-chat-api selfhosted-chat-api
cd selfhosted-chat-api
make demo   # env + compose up + pull qwen2.5:0.5b-instruct (~400 MB)

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Authorization: Bearer demo-key" \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5:0.5b-instruct","messages":[{"role":"user","content":"hi"}]}'
```

That's it. Point any OpenAI/Anthropic SDK at `http://127.0.0.1:8000/v1` and it works.
| Feature | Status |
|---|---|
| OpenAI chat, completions, models, embeddings | Yes |
| OpenAI streaming (SSE passthrough) | Yes |
| Anthropic `/v1/messages` (non-streaming) | Yes |
| Anthropic `/v1/messages` (streaming, event-translated) | Yes |
| Anthropic `/v1/messages/count_tokens` (heuristic) | Yes |
| API key auth (multiple keys, `Authorization` + `x-api-key`) | Yes |
| Structured JSON logs with request IDs | Yes |
| Prometheus metrics at `/metrics` | Yes |
| `/livez` + `/readyz` + `/health` probes | Yes |
| In-process rate limiting (token bucket) | Yes |
| CORS, proxy-header handling | Yes |
| Tests + CI + linted codebase | Yes |
Supports any of these open-source LLM runtimes out of the box, selectable via compose profile and a single env var:

- vLLM
- Ollama
- llama.cpp (`llama-server`)
- Hugging Face Text Generation Inference (TGI)
- SGLang
- LocalAI
- LM Studio (local server mode)
- any other OpenAI-compatible endpoint (set `BACKEND_KIND=openai`)
Why a gateway in front of an OpenAI-compatible server? A stable public surface, API key enforcement, a real Anthropic-compatible facade (including streaming), metrics, rate limiting, health/readiness probes, and uniform error envelopes — without bolting any of that onto the inference engine.
- Quick start
- Repository layout
- Supported backends
- Configuration reference
- API surface
- Client examples
- Reproducibility matrix
- Observability
- Security & production posture
- Testing & CI
- Troubleshooting
- Further docs
```bash
git clone <your-fork-url> selfhosted-chat-api
cd selfhosted-chat-api

# Pick a backend. Examples: vllm, ollama, llamacpp, tgi, sglang, localai, external
make env-vllm          # copies deploy/env/vllm.env to .env
$EDITOR .env           # set API_KEYS=... and any backend knobs
make up BACKEND=vllm   # docker compose --profile vllm up -d --build
make health            # confirms /health reports backend_ok
```

First request:

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Say hi."}],
    "max_tokens": 32
  }'
```

Point any OpenAI SDK at `http://127.0.0.1:8000/v1`:
```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="YOUR_API_KEY")
print(client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
).choices[0].message.content)
```
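Streaming uses the same client; a minimal sketch (model name and key are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="YOUR_API_KEY")

# stream=True: the gateway passes the backend's SSE chunks straight through
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```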
Point any Anthropic SDK at the same base URL:

```python
import anthropic

client = anthropic.Anthropic(
    base_url="http://127.0.0.1:8000",
    api_key="YOUR_API_KEY",
)
msg = client.messages.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_tokens=256,
    messages=[{"role": "user", "content": "ping"}],
)
print(msg.content[0].text)
```
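Anthropic-style streaming also works, since the gateway translates the backend's SSE into Anthropic stream events; a minimal sketch with the same placeholders:

```python
import anthropic

client = anthropic.Anthropic(base_url="http://127.0.0.1:8000", api_key="YOUR_API_KEY")

# The SDK's streaming helper consumes the gateway's event-translated stream
with client.messages.stream(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_tokens=256,
    messages=[{"role": "user", "content": "Count to five."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
print()
```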
```
selfhosted-chat-api/
├── api/
│ ├── app/
│ │ ├── backends.py # Backend profiles (vLLM, Ollama, ...)
│ │ ├── claude_translate.py # OpenAI <-> Anthropic translation (incl. SSE)
│ │ ├── config.py # Env-driven settings
│ │ ├── docs_page.py # Human-readable /docs page
│ │ ├── errors.py
│ │ ├── http_client.py # Shared httpx.AsyncClient lifecycle
│ │ ├── logging_setup.py # Structured JSON logs + request IDs
│ │ ├── main.py # FastAPI factory + lifespan
│ │ ├── metrics.py # Dependency-free Prometheus counters/histograms
│ │ ├── middleware.py # Request ID, access log, rate limit
│ │ ├── proxy.py # Backend proxying helpers (JSON + SSE)
│ │ ├── rate_limit.py # In-process token bucket
│ │ └── routes/
│ │ ├── claude.py # /v1/messages + /v1/messages/count_tokens
│ │ ├── openai.py # /v1/models, /v1/chat/completions, /v1/completions, /v1/embeddings
│ │ └── system.py # /, /docs, /reference, /health, /livez, /readyz, /metrics
│ ├── tests/ # Pytest + httpx MockTransport
│ ├── Dockerfile # Non-root, slim, with HEALTHCHECK
│ ├── main.py # Thin shim for backward compat
│ ├── pyproject.toml # Ruff + pytest config
│ ├── requirements.txt
│ └── requirements-dev.txt
├── deploy/
│ ├── env/ # Per-backend .env templates
│ └── nginx/ # Reverse proxy sample config
├── docs/
│ ├── API_CLAUDE.md
│ ├── API_OPENAI.md
│ ├── BACKENDS.md # Per-backend install & quirks
│ ├── DEPLOYMENT.md
│ ├── MODELS.md # Curated open-source model catalog
│ ├── NGINX.md
│ ├── OPERATIONS.md
│ └── USE_CASE_TUNING.md
├── examples/ # curl + Python clients
├── .github/workflows/ci.yml # Lint, tests, Docker build, compose validation
├── .env.example
├── docker-compose.yml # One file, compose profiles per backend
├── Makefile # Dev + ops shortcuts
└── README.md
```
| `BACKEND_KIND` | Runtime | Compose profile | Embeddings | Streaming | Tools |
|---|---|---|---|---|---|
| `vllm` | vLLM | `vllm` | yes | yes | yes (model-dependent) |
| `ollama` | Ollama | `ollama` | yes | yes | yes (model-dependent) |
| `llamacpp` | llama.cpp `llama-server` | `llamacpp` | yes | yes | partial |
| `tgi` | HF TGI | `tgi` | no | yes | yes (model-dependent) |
| `sglang` | SGLang | `sglang` | yes | yes | yes |
| `localai` | LocalAI | `localai` | yes | yes | yes |
| `lmstudio` | LM Studio | — (run on host) | yes | yes | yes |
| `openai` | Any OpenAI-compatible URL | none | depends | depends | depends |
See `docs/BACKENDS.md` for launch flags, model formats, GPU requirements, and specific gotchas per backend.

To switch backends:

```bash
docker compose down
make env-ollama
make up BACKEND=ollama
```

Or manage `.env` manually and run `docker compose --profile <name> up -d --build`.
All configuration is environment-driven. Every variable has a safe default where one makes sense; see `.env.example` for the canonical template.
| Variable | Default | Purpose |
|---|---|---|
| `API_HOST` | `127.0.0.1` | Host interface the gateway binds on. Keep `127.0.0.1` when fronting with Nginx. |
| `API_PORT` | `8000` | Gateway port. |
| `API_KEYS` | (empty) | Comma-separated API keys. Empty disables auth (dev only). |
| `CORS_ORIGINS` | `*` | Comma-separated list; tighten in production. |
| `CORS_ALLOW_CREDENTIALS` | `false` | Enable only with explicit origins. |
| `RATE_LIMIT_ENABLED` | `false` | Turn on token-bucket limiting for `/v1/*`. |
| `RATE_LIMIT_RPM` | `120` | Per-identity sustained rate. |
| `RATE_LIMIT_BURST` | `30` | Per-identity burst capacity. |
| `REQUEST_TIMEOUT_S` | `600` | End-to-end backend timeout. |
| `CONNECT_TIMEOUT_S` | `10` | TCP connect timeout. |
| `LOG_LEVEL` | `INFO` | Python log level. |
| `LOG_JSON` | `true` | Structured JSON logs (disable for local dev). |
| `LOG_PROMPTS` | `false` | If true, prompt payloads may show up in debug logs. Treat as PII. |
| `METRICS_ENABLED` | `true` | Expose `/metrics`. |
| Variable | Default | Purpose |
|---|---|---|
| `BACKEND_KIND` | `vllm` | One of `vllm`, `ollama`, `llamacpp`, `tgi`, `sglang`, `localai`, `lmstudio`, `openai`. |
| `BACKEND_BASE_URL` | `http://vllm:8001/v1` | OpenAI-compatible base URL of the runtime. |
| `BACKEND_API_KEY` | (empty) | Set if the backend requires its own API key (e.g. `llama-server --api-key`). |
| `MODEL_NAME` | `Qwen/Qwen2.5-7B-Instruct` | Default model advertised in docs and examples. |
Backend-specific knobs (`VLLM_DTYPE`, `LLAMACPP_CTX`, etc.) live in `.env.example` and only apply when that profile is active.
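As a worked example, a production-leaning `.env` combining both tables might look like this (every value here is illustrative):

```bash
# Gateway
API_HOST=127.0.0.1                     # loopback only; Nginx fronts with TLS
API_PORT=8000
API_KEYS=key-for-app-a,key-for-app-b   # comma-separated, one per consumer
CORS_ORIGINS=https://app.example.com   # explicit origin, not *
RATE_LIMIT_ENABLED=true
RATE_LIMIT_RPM=120
RATE_LIMIT_BURST=30

# Backend
BACKEND_KIND=vllm
BACKEND_BASE_URL=http://vllm:8001/v1
MODEL_NAME=Qwen/Qwen2.5-7B-Instruct
```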
| Method | Path | Auth | What |
|---|---|---|---|
| GET | `/` | no | Service summary, endpoint map. |
| GET | `/docs` | no | Human-readable HTML docs. |
| GET | `/reference` | no | Swagger UI. |
| GET | `/openapi.json` | no | OpenAPI schema. |
| GET | `/livez` | no | Liveness probe (doesn't touch backend). |
| GET | `/readyz` | no | Readiness probe (verifies backend reachable). |
| GET | `/health` | no | Combined gateway + backend health. |
| GET | `/metrics` | no | Prometheus exposition. |
| Method | Path |
|---|---|
| GET | /v1/models |
| POST | /v1/chat/completions (streaming + non-streaming) |
| POST | /v1/completions (streaming + non-streaming) |
| POST | /v1/embeddings |
| Method | Path |
|---|---|
| POST | /v1/messages (streaming + non-streaming) |
| POST | /v1/messages/count_tokens (heuristic) |
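For example, token counting through the Anthropic SDK (the gateway's count is heuristic, so treat the number as an estimate; the response field is per the SDK's token-count schema):

```python
import anthropic

client = anthropic.Anthropic(base_url="http://127.0.0.1:8000", api_key="YOUR_API_KEY")

count = client.messages.count_tokens(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "How many tokens is this?"}],
)
print(count.input_tokens)  # heuristic estimate from the gateway
```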
All `/v1/*` endpoints require an API key when `API_KEYS` is set. Send it as `Authorization: Bearer <key>` or `x-api-key: <key>`.
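Both header styles pass the same check:

```bash
# OpenAI-style header
curl http://127.0.0.1:8000/v1/models -H "Authorization: Bearer $API_KEY"

# Anthropic-style header
curl http://127.0.0.1:8000/v1/models -H "x-api-key: $API_KEY"
```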
Detailed docs: `docs/API_OPENAI.md` (OpenAI surface) and `docs/API_CLAUDE.md` (Anthropic facade).
Runnable scripts live under `examples/`:

| Script | What it does |
|---|---|
| `openai_chat_curl.sh` | Non-streaming OpenAI chat completion. |
| `openai_stream_curl.sh` | Streaming OpenAI chat completion (SSE). |
| `openai_embeddings_curl.sh` | Embeddings request. |
| `claude_messages_curl.sh` | Non-streaming Anthropic Messages request. |
| `claude_stream_curl.sh` | Streaming Anthropic Messages request. |
| `rag_chat_curl.sh` | RAG prompt template. |
| `extraction_json_curl.sh` | JSON extraction prompt template. |
| `agent_tools_curl.sh` | OpenAI tool-calling request (requires a tool-capable model). |
| `python_openai_client.py` | Minimal OpenAI SDK client. |
| `python_anthropic_client.py` | Minimal Anthropic SDK client. |
These pinned combinations are what CI and smoke tests exercise. Use them verbatim for a known-good baseline and vary intentionally from there.
| Component | Pinned value |
|---|---|
| Gateway Python | 3.12 |
| FastAPI / Uvicorn / httpx | 0.116.1 / 0.35.0 / 0.28.1 (see api/requirements.txt) |
| Gateway Docker base | python:3.12-slim |
| Default backend | vllm/vllm-openai:latest |
| Default model | Qwen/Qwen2.5-7B-Instruct |
| Default dtype | half |
| Default max context | 16384 |
| Default GPU memory util. | 0.92 |
| Reference GPU class | NVIDIA A10G / L4 / 24 GB-class |
| CUDA (via vLLM image) | shipped by the image (do not install locally) |
Deterministic build:

```bash
docker compose --profile vllm build --pull api
docker compose --profile vllm up -d
```

To pin upstream images for immutability, tag them in your registry and replace `vllm/vllm-openai:latest` (etc.) with the SHA-pinned digest in `docker-compose.yml`.
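An illustrative pin in `docker-compose.yml` (the digest is a placeholder, not a published one):

```yaml
services:
  vllm:
    image: vllm/vllm-openai@sha256:<digest-from-your-registry>   # placeholder digest
```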
Model weight locations are mounted to:
- `./data/hf-cache` — Hugging Face cache (vLLM, TGI, SGLang)
- `./data/ollama` — Ollama model store
- `./data/models` — llama.cpp GGUF files
- `./data/localai-models` — LocalAI catalog
- `./data/vllm-cache` — vLLM compilation cache
Back these up (or restore into them) to reproduce an environment on a new host without re-downloading weights.
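A plain-tar sketch of that backup (archive only the directories your backend actually populates):

```bash
# Snapshot the weight caches listed above; extract on the new host to restore
tar czf model-caches.tgz ./data/hf-cache ./data/ollama ./data/models \
    ./data/localai-models ./data/vllm-cache
```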
A curated catalog of tested open-source models (by task, GPU class, and license) is in `docs/MODELS.md`.
Structured JSON by default, one record per line, including a `request_id`:

```json
{"ts":"2026-04-22T12:00:00.123Z","level":"INFO","logger":"chat_api.http",
 "msg":"request","request_id":"7f9...","method":"POST",
 "path":"/v1/chat/completions","status":200,"duration_ms":842.12}
```

Disable with `LOG_JSON=false` for friendlier local output.
`GET /metrics` returns Prometheus text:

- `chat_api_uptime_seconds`
- `chat_api_requests_total{method,path,status}`
- `chat_api_request_duration_seconds` (histogram)
- `chat_api_backend_errors_total{kind}`
- `chat_api_rate_limited_total{identity}`

Point any Prometheus scraper at `<gateway>/metrics`.
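A minimal scrape job, assuming the scraper can reach the gateway (the target is a placeholder):

```yaml
scrape_configs:
  - job_name: selfhosted-chat-api
    metrics_path: /metrics
    static_configs:
      - targets: ["gateway-host:8000"]   # placeholder host:port
```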
| Path | Use |
|---|---|
| `/livez` | Kubernetes liveness — returns 200 unless the process is dead. |
| `/readyz` | Readiness — 200 when the backend responds, 503 otherwise. |
| `/health` | Combined human-readable status. |
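On Kubernetes, the probes wire up along these lines (a sketch assuming the container listens on 8000):

```yaml
livenessProbe:
  httpGet:
    path: /livez
    port: 8000
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 8000
  periodSeconds: 5
```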
Let a teammate hit your machine directly:

```bash
# in .env
API_HOST=0.0.0.0   # bind on all interfaces, still protected by API_KEYS
API_KEYS=pick-something-random-and-share-it
```

Then give them your IP: `http://<your-lan-ip>:8000/v1`. Keep `API_KEYS` set — `0.0.0.0` means anyone on the same network can reach the port.
Expose a local instance over HTTPS for a few minutes of demoing:

```bash
# Option 1: cloudflared (free, no signup for quick tunnels)
cloudflared tunnel --url http://127.0.0.1:8000

# Option 2: ngrok
ngrok http 8000
```

Both print a public `https://...` URL your peer can use. Tear down when done.
Share these lines and they're running an OpenAI SDK client against your deployment:

```python
from openai import OpenAI

client = OpenAI(base_url="https://<your-url>/v1", api_key="<the-shared-key>")
client.chat.completions.create(model="<your-model>", messages=[{"role":"user","content":"hi"}])
```

- Always set `API_KEYS` in any deployment reachable beyond localhost.
- Bind the API container to `127.0.0.1` and place Nginx / Caddy in front with TLS. Sample config at `deploy/nginx/selfhosted-chat-api.conf`; a sketch of the idea follows this list.
- Keep the inference backend on the internal Docker network. Never expose vLLM/Ollama/TGI/... directly unless you want an unauthenticated inference endpoint on the internet.
- The API container runs non-root with a `read_only` rootfs, `no-new-privileges`, and all Linux capabilities dropped.
- For multi-replica deployments, disable the in-process limiter and use a real one (Nginx `limit_req`, Envoy, or Redis-backed).
- `LOG_PROMPTS=false` by default — prompt bodies are not logged. Keep it off unless you accept the PII implications.
- `CORS_ORIGINS=*` is fine for public APIs that don't use cookies. For any deployment that does, set explicit origins.
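The shipped `deploy/nginx/selfhosted-chat-api.conf` is the reference config; the shape of the idea, with placeholder server name and cert paths, is roughly:

```nginx
server {
    listen 443 ssl;
    server_name api.example.com;                  # placeholder
    ssl_certificate     /etc/ssl/fullchain.pem;   # placeholder
    ssl_certificate_key /etc/ssl/privkey.pem;     # placeholder

    location / {
        proxy_pass http://127.0.0.1:8000;         # gateway stays on loopback
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_buffering off;                      # keep SSE streaming unbuffered
        proxy_read_timeout 600s;                  # match REQUEST_TIMEOUT_S
    }
}
```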
Run locally against a conda env:

```bash
conda create -y -n selfhosted-chat-api python=3.12
conda activate selfhosted-chat-api
cd api && pip install -r requirements-dev.txt
pytest -q
ruff check app tests
```

GitHub Actions runs on every push/PR:

- `ruff check`
- `pytest`
- Docker build for the gateway
- `docker compose config` against every profile
Tests use httpx.MockTransport and cover: auth, backend proxying, Claude
translation (both non-streaming and streaming SSE), health/readyz/metrics,
request IDs, rate-limit math, and backend capability detection.
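Reduced to a sketch, the `httpx.MockTransport` pattern looks like this (the handler and assertions are illustrative, not the repo's actual fixtures):

```python
import httpx

def fake_backend(request: httpx.Request) -> httpx.Response:
    # Stand in for an OpenAI-compatible backend without any network I/O
    return httpx.Response(200, json={"choices": [{"message": {"content": "ok"}}]})

client = httpx.Client(
    transport=httpx.MockTransport(fake_backend),
    base_url="http://backend",
)
resp = client.post("/v1/chat/completions", json={"messages": []})
assert resp.json()["choices"][0]["message"]["content"] == "ok"
```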
| Symptom | Likely cause | Fix |
|---|---|---|
| `/health` returns 503 | Backend not reachable (still downloading weights, wrong URL, OOM) | `docker compose logs <backend>` |
| 401 on `/v1/*` | Missing/mismatched API key | Send `Authorization: Bearer ...` or `x-api-key: ...` |
| 429 | In-process rate limit triggered | Lower traffic or raise `RATE_LIMIT_RPM` / `RATE_LIMIT_BURST` |
| 501 from `/v1/embeddings` | Backend doesn't expose embeddings (e.g. TGI) | Switch backend or run a separate embeddings server |
| Claude streaming is empty | Backend emitted no content or returned an error | Check backend logs; verify `stream: true` is supported |
| Slow first request | Model weights + kernel compile | Expected; later restarts reuse the cache volumes |
See `docs/OPERATIONS.md` for the full runbook.
- `docs/DEPLOYMENT.md` — host setup, first boot, rollback
- `docs/BACKENDS.md` — per-backend notes, GPU requirements, launch flags
- `docs/MODELS.md` — curated open-source model catalog by task and GPU class
- `docs/API_OPENAI.md` — OpenAI-compatible endpoints
- `docs/API_CLAUDE.md` — Anthropic Messages facade
- `docs/USE_CASE_TUNING.md` — chat / RAG / extraction / agent patterns
- `docs/NGINX.md` — reverse proxy configuration
- `docs/OPERATIONS.md` — day-2 operations and troubleshooting
This repo ships deployment glue only. The inference runtimes and models it orchestrates carry their own licenses — read them before using a model.