This guide covers day-to-day operations for running ModelRelay in
production. It assumes you have one modelrelay-server instance and one or
more modelrelay-worker processes.
The proxy server exposes a dedicated /health endpoint:

```shell
# Primary health check — returns JSON with version, worker count, queue depth, and uptime.
curl -sf http://proxy:8080/health | jq .
```

Example response:
```json
{
  "status": "ok",
  "version": "0.1.6",
  "workers_connected": 2,
  "queue_depth": 0,
  "uptime_secs": 3621.5
}
```

Use /health for liveness probes, Kubernetes readiness checks, and monitoring. A `workers_connected` of 0 means the proxy is running but no workers are registered.
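A basic liveness alert can be built on this response. The sketch below parses an inlined sample payload with sed (so it does not require jq) and alerts when the proxy answers but no workers are registered; in production the payload would come from `curl -sf http://proxy:8080/health`.

```shell
# Sample /health payload (in production: health=$(curl -sf http://proxy:8080/health))
health='{"status":"ok","version":"0.1.6","workers_connected":0,"queue_depth":3,"uptime_secs":3621.5}'

# Extract the worker count with sed (no jq dependency).
workers=$(printf '%s' "$health" | sed -n 's/.*"workers_connected":\([0-9]*\).*/\1/p')

# Alert when the proxy is reachable but no workers are registered.
if [ "$workers" -eq 0 ]; then
  echo "ALERT: proxy reachable but workers_connected is 0"
fi
```

Wire the same check into whatever alerting you already run; the condition mirrors the `workers_connected` semantics described above.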
You can also list routable models directly:

```shell
curl -s http://proxy:8080/v1/models | jq '.data[].id'
```

The worker daemon does not expose its own HTTP port — it connects outward to the proxy. Health is observable from the proxy side:
```shell
# Check if workers are registered by listing models.
curl -s http://proxy:8080/v1/models | jq '.data[].id'
```

If expected models are missing, the worker is either down or failed to register. Check worker logs for connection errors or authentication failures.
ModelRelay includes admin endpoints for inspecting workers, request metrics,
and managing client API keys. All /admin/* endpoints require a Bearer
token.
Set MODELRELAY_ADMIN_TOKEN when starting the server:
```shell
modelrelay-server --worker-secret mysecret --admin-token my-admin-secret
```

Without this token, all /admin/* endpoints return 403 Forbidden.
```shell
TOKEN="my-admin-secret"

# List connected workers (models, load, capabilities)
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/workers | jq .

# Request stats and queue depth
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/stats | jq .

# List client API keys (metadata only, no secrets)
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/keys | jq .
```

When MODELRELAY_REQUIRE_API_KEYS=true, clients must send a valid API key as a Bearer token on inference requests.
```shell
TOKEN="my-admin-secret"

# Create a new API key (the secret is returned only at creation time)
curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "production-app"}' \
  http://proxy:8080/admin/keys | jq .

# Revoke a key by ID
curl -s -X DELETE \
  -H "Authorization: Bearer $TOKEN" \
  http://proxy:8080/admin/keys/{key-id}
```

Clients use the returned secret as a Bearer token:
```shell
curl -H "Authorization: Bearer mr-..." \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hi"}]}' \
  http://proxy:8080/v1/chat/completions
```

The proxy serves a built-in web UI:
- Dashboard — visit http://proxy:8080/dashboard for real-time worker status, request metrics, and queue depth.
- Setup Wizard — visit http://proxy:8080/setup for a step-by-step guide to connecting a new worker (platform detection, backend setup, binary download, and live connection verification).

The wizard is always accessible, not just on first run — use it whenever you add another GPU box.
Admin endpoints return 403:
MODELRELAY_ADMIN_TOKEN is not set on the server, or the Authorization
header doesn't match. Verify the token value and ensure the header format
is Authorization: Bearer <token>.
Client requests return 401 when API key auth is enabled:
The client is not sending a Bearer token, or the key has been revoked.
Create a new key via POST /admin/keys and ensure the client sends
Authorization: Bearer <key>.
API key auth not taking effect:
MODELRELAY_REQUIRE_API_KEYS must be set to true. When false
(the default), inference endpoints accept unauthenticated requests.
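The auth failures above are distinguishable by status code, which you can fetch with `curl -s -o /dev/null -w '%{http_code}'`. The helper below is a hypothetical sketch that maps a code to the likely cause described in this section:

```shell
# Hypothetical helper: map an HTTP status from the proxy's auth paths to a likely cause.
# Fetch a status with, e.g.:
#   curl -s -o /dev/null -w '%{http_code}' -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/workers
diagnose_auth() {
  case "$1" in
    200) echo "authorized" ;;
    401) echo "client key missing or revoked (create one via POST /admin/keys)" ;;
    403) echo "admin token unset on the server or Authorization header mismatch" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

diagnose_auth 403
```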
After starting a new worker, confirm it registered:
```shell
# Should include the worker's advertised models.
curl -s http://proxy:8080/v1/models | jq .
```

If a worker's models don't appear within ~10 seconds:
- Check the worker secret — does WORKER_SECRET on the worker match the proxy?
- Check connectivity — can the worker reach PROXY_URL?

  ```shell
  curl -v http://proxy:8080/v1/worker/connect
  # Should get 400 or upgrade-required, not a connection error
  ```

- Check worker logs — look for register/register_ack messages or error lines.
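The ~10-second registration window can be polled rather than eyeballed. Below is a sketch of a wait loop — `wait_for_model` is a hypothetical helper, and its third argument stands in for the real `curl -s http://proxy:8080/v1/models` call so the example is self-contained:

```shell
# Hypothetical helper: poll until $1 appears in the output of command $3,
# for up to $2 seconds. In production, $3 would be:
#   "curl -s http://proxy:8080/v1/models"
wait_for_model() {
  model="$1"; timeout="$2"; list_cmd="$3"
  i=0
  while [ "$i" -lt "$timeout" ]; do
    if $list_cmd | grep -q "$model"; then
      echo "registered"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "not registered after ${timeout}s"
  return 1
}

# Stand-in command so the sketch runs anywhere:
wait_for_model "llama3-8b" 10 "echo llama3-8b"
```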
To remove a worker from rotation without dropping in-flight requests:
1. Send SIGTERM to the modelrelay-worker process. The daemon initiates a graceful disconnect — the proxy sends a GracefulShutdown message and stops routing new requests to that worker.
2. In-flight requests finish normally. The proxy waits up to drain_timeout_secs (from the shutdown message) for active requests to complete.
3. Once idle, the WebSocket closes. The worker process exits.
```shell
# Graceful stop via systemd
systemctl stop modelrelay-worker@gpu-box-1

# Or with Docker
docker stop --time 60 worker-gpu-box-1
```

Monitoring drain progress: watch the proxy logs for "worker drained" or similar messages. If the worker still has in-flight requests, you'll see ongoing ResponseChunk / ResponseComplete messages until they finish.
Start a new modelrelay-worker instance pointing at the same proxy:
```shell
PROXY_URL=http://proxy:8080 \
WORKER_SECRET=your-secret \
WORKER_NAME=gpu-box-4 \
BACKEND_URL=http://localhost:8000 \
modelrelay-worker --models llama3-8b
```

The proxy discovers it within seconds via the WebSocket registration handshake. No proxy restart or config change needed.
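The systemctl examples elsewhere in this guide imply a templated unit. A hypothetical `modelrelay-worker@.service` sketch — the paths, secret, and model name are placeholders, not shipped defaults:

```ini
# /etc/systemd/system/modelrelay-worker@.service — hypothetical sketch, adjust to your install.
[Unit]
Description=ModelRelay worker (%i)
After=network-online.target

[Service]
Environment=PROXY_URL=http://proxy:8080
Environment=WORKER_SECRET=your-secret
Environment=WORKER_NAME=%i
Environment=BACKEND_URL=http://localhost:8000
ExecStart=/usr/local/bin/modelrelay-worker --models llama3-8b
# systemd's default stop signal is SIGTERM, which triggers the graceful drain described above.
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

With a unit like this, `systemctl enable --now modelrelay-worker@gpu-box-4` brings up a named worker per machine.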
Use the graceful drain procedure above. The proxy automatically routes around disconnected workers.
The proxy is a single-process server. To scale:
- Vertical: increase MAX_QUEUE_LEN and system file descriptor limits.
- Horizontal: run multiple proxy instances behind a load balancer, but note that each worker connects to one proxy. Workers must be distributed across proxy instances manually or via DNS round-robin.
| Log pattern | Meaning |
|---|---|
| `worker registered` / `register_ack` | Worker connected and authenticated |
| `request dispatched` | Request sent to a worker |
| `response complete` | Worker returned a result |
| `worker heartbeat timed out` | Worker missed pings — WebSocket closed |
| `request requeued` | Worker died mid-request, retrying on another worker |
| `requeue exhausted` | Request failed after MAX_REQUEUE_COUNT (3) retries |
| `queue full` | Rejected request — queue at MAX_QUEUE_LEN capacity |
| `queue timeout` | Request sat in queue longer than QUEUE_TIMEOUT_SECS |
| `graceful shutdown` | Worker drain initiated |
| Log pattern | Meaning |
|---|---|
| `connected to proxy` | WebSocket connection established |
| `registered` | Registration acknowledged by proxy |
| `forwarding request` | Proxying a request to the local backend |
| `backend error` | Local backend returned an error or is unreachable |
| `cancelled` | Proxy sent a cancel for an in-flight request |
| `graceful shutdown` | Drain in progress, finishing active requests |
Set the LOG_LEVEL environment variable on either component:

```shell
LOG_LEVEL=debug modelrelay-server   # trace, debug, info (default), warn, error
LOG_LEVEL=debug modelrelay-worker
```

Symptoms: Worker logs show connection refused or timeouts.
Checklist:
- Is the proxy running? `curl http://proxy:8080/v1/models`
- Is PROXY_URL correct? The worker connects to {PROXY_URL}/v1/worker/connect via WebSocket.
- Firewall / network: the worker makes an outbound connection to the proxy — no inbound ports needed on the worker machine.
- If using TLS (nginx/reverse proxy in front), ensure WebSocket upgrade headers are forwarded. See the TLS Setup guide.
Symptoms: /v1/models shows the model, but requests return 502 or
timeout.
Checklist:
- Is the local backend running? `curl http://localhost:8000/v1/models` (or whatever BACKEND_URL is set to)
- Does the backend support the requested endpoint? (/v1/chat/completions, /v1/messages, /v1/responses)
- Check worker logs for backend error messages.
- Try a direct request to the backend to isolate the issue.
Symptoms: Clients hang, then get a timeout error after
QUEUE_TIMEOUT_SECS.
Causes:
- No workers are connected (check /v1/models)
- Workers are at capacity (max_concurrent reached on all workers)
- Workers are connected but not advertising the requested model
Fix: Add more workers, increase max_concurrent if the hardware
allows, or reduce QUEUE_TIMEOUT_SECS to fail faster.
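Clients that hit queue-full rejections or queue timeouts can also retry with backoff instead of failing outright. A minimal sketch — the `retry` helper is hypothetical, and `true` stands in for the real curl call to /v1/chat/completions so the example runs anywhere:

```shell
# Hypothetical retry wrapper: run a command up to N times with linear backoff.
retry() {
  attempts="$1"; shift
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$n"        # back off 1s, 2s, ... between attempts
    n=$((n + 1))
  done
  echo "giving up after $attempts attempts" >&2
  return 1
}

# The real command would be the curl invocation to /v1/chat/completions;
# `true` stands in so the sketch is self-contained.
retry 3 true && echo "request succeeded"
```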
Symptoms: SSE chunks arrive garbled or out of order.
Checklist:
- Ensure no intermediate proxy is buffering. Disable response buffering in nginx: `proxy_buffering off;`
- If using a CDN or reverse proxy, ensure it supports chunked transfer encoding and doesn't aggregate small writes.
Symptoms: Proxy RSS grows over time.
Causes:
- Large queue of pending requests (each holds the full request body)
- Many concurrent streaming responses with large chunk buffers
Fix: Lower MAX_QUEUE_LEN, set QUEUE_TIMEOUT_SECS to a shorter
value, or add workers to drain the queue faster.
Symptoms: Worker logs show repeated connect/disconnect cycles.
Causes:
- Heartbeat timeout — the worker or network is too slow to respond to pings within HEARTBEAT_INTERVAL
- WORKER_SECRET mismatch — worker connects, fails auth, gets disconnected, retries
Fix: Check secrets match, check network latency between worker and proxy.
| Env Var | Default | Description |
|---|---|---|
| `LISTEN_ADDR` | `127.0.0.1:8080` | HTTP listen address |
| `PROVIDER_NAME` | `local` | Provider name for routing |
| `WORKER_SECRET` | (required) | Shared secret for worker auth |
| `MAX_QUEUE_LEN` | `100` | Max queued requests before rejecting |
| `QUEUE_TIMEOUT_SECS` | `30` | How long a request can wait in queue |
| `REQUEST_TIMEOUT_SECS` | `300` | Total request timeout (5 min) |
| `LOG_LEVEL` | `info` | Log verbosity |
| `MODELRELAY_ADMIN_TOKEN` | (none) | Bearer token for /admin/* endpoints (if unset, admin returns 403) |
| `MODELRELAY_REQUIRE_API_KEYS` | `false` | When true, client requests require a valid API key |
| Env Var | Default | Description |
|---|---|---|
| `PROXY_URL` | `http://127.0.0.1:8080` | Proxy server URL |
| `WORKER_SECRET` | (required) | Must match proxy's secret |
| `WORKER_NAME` | `worker` | Human-readable worker name |
| `BACKEND_URL` | `http://127.0.0.1:8000` | Local model server URL |
| `LOG_LEVEL` | `info` | Log verbosity |
```powershell
Get-Service ModelRelayServer
Get-Service ModelRelayWorker
```

```powershell
Start-Service ModelRelayServer
Stop-Service ModelRelayServer
Start-Service ModelRelayWorker
Stop-Service ModelRelayWorker
```

Stop-Service sends a stop control signal and waits for the process to
exit. ModelRelay handles this as a graceful shutdown — in-flight
requests finish before the process terminates. To set an explicit
timeout:
```powershell
# Request a stop without blocking, give the service up to 60 seconds,
# then confirm it actually reached the Stopped state.
Stop-Service ModelRelayServer -NoWait
Start-Sleep -Seconds 60
(Get-Service ModelRelayServer).WaitForStatus("Stopped", "00:00:05")
```

Windows Services don't write to stdout by default. Two options:
- Windows Event Log — ModelRelay writes to the Application log. View with:

  ```powershell
  Get-EventLog -LogName Application -Source ModelRelayServer -Newest 50
  ```

- File logging via RUST_LOG — set RUST_LOG as a system environment variable and redirect output to a file by wrapping the binary in a small script, or use the RUST_LOG_FILE convention if supported. The simplest approach:

  ```powershell
  [Environment]::SetEnvironmentVariable("RUST_LOG", "info", "Machine")
  ```
To drain a worker gracefully before maintenance:
```powershell
# Stop the service — this triggers graceful shutdown.
Stop-Service ModelRelayWorker

# Verify it has stopped.
Get-Service ModelRelayWorker
```

The worker completes in-flight requests before exiting, identical to the systemctl stop behavior on Linux.
For production deployments, monitor these signals:
- Proxy process is up — HTTP health check on /health
- At least one worker registered — /health returns workers_connected > 0
- Queue depth — /health returns queue_depth; watch for sustained growth
- Request latency — track time from client request to first byte
- Worker reconnect rate — frequent reconnects indicate network or auth issues
- Error rates — 4xx (client errors) vs 5xx (backend/proxy errors)
- Backend health — each worker's local model server should be independently monitored