This guide covers day-to-day operations for running ModelRelay in
production. It assumes you have one modelrelay-server instance and one or
more modelrelay-worker processes.
The proxy server exposes a dedicated /health endpoint:

```shell
# Primary health check — returns JSON with version, worker count, queue depth, and uptime.
curl -sf http://proxy:8080/health | jq .
```

Example response:
```json
{
  "status": "ok",
  "version": "0.1.6",
  "workers_connected": 2,
  "queue_depth": 0,
  "uptime_secs": 3621.5
}
```

Use /health for liveness probes, Kubernetes readiness checks, and monitoring. A `workers_connected` of 0 means the proxy is running but no workers are registered.
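A basic liveness alert can be built on this response. The sketch below parses an inlined sample payload with sed (so it does not require jq) and alerts when the proxy answers but no workers are registered; in production the payload would come from `curl -sf http://proxy:8080/health`.

```shell
# Sample /health payload (in production: health=$(curl -sf http://proxy:8080/health))
health='{"status":"ok","version":"0.1.6","workers_connected":0,"queue_depth":3,"uptime_secs":3621.5}'

# Extract the worker count with sed (no jq dependency).
workers=$(printf '%s' "$health" | sed -n 's/.*"workers_connected":\([0-9]*\).*/\1/p')

# Alert when the proxy is reachable but no workers are registered.
if [ "$workers" -eq 0 ]; then
  echo "ALERT: proxy reachable but workers_connected is 0"
fi
```

Wire the same check into whatever alerting you already run; the condition mirrors the `workers_connected` semantics described above.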
You can also list routable models directly:

```shell
curl -s http://proxy:8080/v1/models | jq '.data[].id'
```

The worker daemon does not expose its own HTTP port — it connects outward to the proxy. Health is observable from the proxy side:
```shell
# Check if workers are registered by listing models.
curl -s http://proxy:8080/v1/models | jq '.data[].id'
```

If expected models are missing, the worker is either down or failed to register. Check worker logs for connection errors or authentication failures.
ModelRelay includes admin endpoints for inspecting workers, request metrics,
and managing client API keys. All /admin/* endpoints require a Bearer
token.
Set MODELRELAY_ADMIN_TOKEN when starting the server:
```shell
modelrelay-server --worker-secret mysecret --admin-token my-admin-secret
```

Without this token, all /admin/* endpoints return 403 Forbidden.
```shell
TOKEN="my-admin-secret"

# List connected workers (models, load, capabilities)
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/workers | jq .

# Request stats and queue depth
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/stats | jq .

# List client API keys (metadata only, no secrets)
curl -s -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/keys | jq .
```

When MODELRELAY_REQUIRE_API_KEYS=true, clients must send a valid API key as a Bearer token on inference requests.
```shell
TOKEN="my-admin-secret"

# Create a new API key (the secret is returned only at creation time)
curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "production-app"}' \
  http://proxy:8080/admin/keys | jq .

# Revoke a key by ID
curl -s -X DELETE \
  -H "Authorization: Bearer $TOKEN" \
  http://proxy:8080/admin/keys/{key-id}
```

Clients use the returned secret as a Bearer token:
```shell
curl -H "Authorization: Bearer mr-..." \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hi"}]}' \
  http://proxy:8080/v1/chat/completions
```

The proxy serves a built-in web UI:
- Dashboard — visit http://proxy:8080/dashboard for real-time worker status, request metrics, and queue depth.
- Setup Wizard — visit http://proxy:8080/setup for a step-by-step guide to connecting a new worker (platform detection, backend setup, binary download, and live connection verification).

The wizard is always accessible, not just on first run — use it whenever you add another GPU box.
Admin endpoints return 403:
MODELRELAY_ADMIN_TOKEN is not set on the server, or the Authorization
header doesn't match. Verify the token value and ensure the header format
is Authorization: Bearer <token>.
Client requests return 401 when API key auth is enabled:
The client is not sending a Bearer token, or the key has been revoked.
Create a new key via POST /admin/keys and ensure the client sends
Authorization: Bearer <key>.
API key auth not taking effect:
MODELRELAY_REQUIRE_API_KEYS must be set to true. When false
(the default), inference endpoints accept unauthenticated requests.
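The auth failures above are distinguishable by status code, which you can fetch with `curl -s -o /dev/null -w '%{http_code}'`. The helper below is a hypothetical sketch that maps a code to the likely cause described in this section:

```shell
# Hypothetical helper: map an HTTP status from the proxy's auth paths to a likely cause.
# Fetch a status with, e.g.:
#   curl -s -o /dev/null -w '%{http_code}' -H "Authorization: Bearer $TOKEN" http://proxy:8080/admin/workers
diagnose_auth() {
  case "$1" in
    200) echo "authorized" ;;
    401) echo "client key missing or revoked (create one via POST /admin/keys)" ;;
    403) echo "admin token unset on the server or Authorization header mismatch" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

diagnose_auth 403
```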
After starting a new worker, confirm it registered:
```shell
# Should include the worker's advertised models.
curl -s http://proxy:8080/v1/models | jq .
```

If a worker's models don't appear within ~10 seconds:
- Check the worker secret — does WORKER_SECRET on the worker match the proxy?
- Check connectivity — can the worker reach PROXY_URL?

  ```shell
  curl -v http://proxy:8080/v1/worker/connect
  # Should get 400 or upgrade-required, not a connection error
  ```

- Check worker logs — look for register/register_ack messages or error lines.
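The ~10-second registration window can be polled rather than eyeballed. Below is a sketch of a wait loop — `wait_for_model` is a hypothetical helper, and its third argument stands in for the real `curl -s http://proxy:8080/v1/models` call so the example is self-contained:

```shell
# Hypothetical helper: poll until $1 appears in the output of command $3,
# for up to $2 seconds. In production, $3 would be:
#   "curl -s http://proxy:8080/v1/models"
wait_for_model() {
  model="$1"; timeout="$2"; list_cmd="$3"
  i=0
  while [ "$i" -lt "$timeout" ]; do
    if $list_cmd | grep -q "$model"; then
      echo "registered"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "not registered after ${timeout}s"
  return 1
}

# Stand-in command so the sketch runs anywhere:
wait_for_model "llama3-8b" 10 "echo llama3-8b"
```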
To remove a worker from rotation without dropping in-flight requests:
1. Send SIGTERM to the modelrelay-worker process. The daemon initiates a graceful disconnect — the proxy sends a GracefulShutdown message and stops routing new requests to that worker.
2. In-flight requests finish normally. The proxy waits up to drain_timeout_secs (from the shutdown message) for active requests to complete.
3. Once idle, the WebSocket closes. The worker process exits.
```shell
# Graceful stop via systemd
systemctl stop modelrelay-worker@gpu-box-1

# Or with Docker
docker stop --time 60 worker-gpu-box-1
```

Monitoring drain progress: watch the proxy logs for "worker drained" or similar messages. If the worker still has in-flight requests, you'll see ongoing ResponseChunk / ResponseComplete messages until they finish.
Start a new modelrelay-worker instance pointing at the same proxy:
```shell
PROXY_URL=http://proxy:8080 \
WORKER_SECRET=your-secret \
WORKER_NAME=gpu-box-4 \
BACKEND_URL=http://localhost:8000 \
modelrelay-worker --models llama3-8b
```

The proxy discovers it within seconds via the WebSocket registration handshake. No proxy restart or config change needed.
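The systemctl examples elsewhere in this guide imply a templated unit. A hypothetical `modelrelay-worker@.service` sketch — the paths, secret, and model name are placeholders, not shipped defaults:

```ini
# /etc/systemd/system/modelrelay-worker@.service — hypothetical sketch, adjust to your install.
[Unit]
Description=ModelRelay worker (%i)
After=network-online.target

[Service]
Environment=PROXY_URL=http://proxy:8080
Environment=WORKER_SECRET=your-secret
Environment=WORKER_NAME=%i
Environment=BACKEND_URL=http://localhost:8000
ExecStart=/usr/local/bin/modelrelay-worker --models llama3-8b
# systemd's default stop signal is SIGTERM, which triggers the graceful drain described above.
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

With a unit like this, `systemctl enable --now modelrelay-worker@gpu-box-4` brings up a named worker per machine.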
Use the graceful drain procedure above. The proxy automatically routes around disconnected workers.
The proxy is a single-process server. To scale:
- Vertical: increase MAX_QUEUE_LEN and system file descriptor limits.
- Horizontal: run multiple proxy instances behind a load balancer, but note that each worker connects to one proxy. Workers must be distributed across proxy instances manually or via DNS round-robin.
| Log pattern | Meaning |
|---|---|
| `worker registered` / `register_ack` | Worker connected and authenticated |
| `request dispatched` | Request sent to a worker |
| `response complete` | Worker returned a result |
| `worker heartbeat timed out` | Worker missed pings — WebSocket closed |
| `request requeued` | Worker died mid-request, retrying on another worker |
| `requeue exhausted` | Request failed after MAX_REQUEUE_COUNT (3) retries |
| `queue full` | Rejected request — queue at MAX_QUEUE_LEN capacity |
| `queue timeout` | Request sat in queue longer than QUEUE_TIMEOUT_SECS |
| `graceful shutdown` | Worker drain initiated |
| Log pattern | Meaning |
|---|---|
| `connected to proxy` | WebSocket connection established |
| `registered` | Registration acknowledged by proxy |
| `forwarding request` | Proxying a request to the local backend |
| `backend error` | Local backend returned an error or is unreachable |
| `cancelled` | Proxy sent a cancel for an in-flight request |
| `graceful shutdown` | Drain in progress, finishing active requests |
Set the LOG_LEVEL environment variable on either component:

```shell
LOG_LEVEL=debug modelrelay-server   # trace, debug, info (default), warn, error
LOG_LEVEL=debug modelrelay-worker
```

Symptoms: Worker logs show connection refused or timeouts.
Checklist:
- Is the proxy running? `curl http://proxy:8080/v1/models`
- Is PROXY_URL correct? The worker connects to {PROXY_URL}/v1/worker/connect via WebSocket.
- Firewall / network: the worker makes an outbound connection to the proxy — no inbound ports needed on the worker machine.
- If using TLS (nginx/reverse proxy in front), ensure WebSocket upgrade headers are forwarded. See the TLS Setup guide.
Symptoms: /v1/models shows the model, but requests return 502 or
timeout.
Checklist:
- Is the local backend running? `curl http://localhost:8000/v1/models` (or whatever BACKEND_URL is set to)
- Does the backend support the requested endpoint? (/v1/chat/completions, /v1/messages, /v1/responses)
- Check worker logs for backend error messages.
- Try a direct request to the backend to isolate the issue.
Symptoms: Clients hang, then get a timeout error after
QUEUE_TIMEOUT_SECS.
Causes:
- No workers are connected (check /v1/models)
- Workers are at capacity (max_concurrent reached on all workers)
- Workers are connected but not advertising the requested model
Fix: Add more workers, increase max_concurrent if the hardware
allows, or reduce QUEUE_TIMEOUT_SECS to fail faster.
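Clients that hit queue-full rejections or queue timeouts can also retry with backoff instead of failing outright. A minimal sketch — the `retry` helper is hypothetical, and `true` stands in for the real curl call to /v1/chat/completions so the example runs anywhere:

```shell
# Hypothetical retry wrapper: run a command up to N times with linear backoff.
retry() {
  attempts="$1"; shift
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$n"        # back off 1s, 2s, ... between attempts
    n=$((n + 1))
  done
  echo "giving up after $attempts attempts" >&2
  return 1
}

# The real command would be the curl invocation to /v1/chat/completions;
# `true` stands in so the sketch is self-contained.
retry 3 true && echo "request succeeded"
```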
Symptoms: SSE chunks arrive garbled or out of order.
Checklist:
- Ensure no intermediate proxy is buffering. Disable response buffering in nginx: `proxy_buffering off;`
- If using a CDN or reverse proxy, ensure it supports chunked transfer encoding and doesn't aggregate small writes.
Symptoms: Proxy RSS grows over time.
Causes:
- Large queue of pending requests (each holds the full request body)
- Many concurrent streaming responses with large chunk buffers
Fix: Lower MAX_QUEUE_LEN, set QUEUE_TIMEOUT_SECS to a shorter
value, or add workers to drain the queue faster.
Symptoms: Worker logs show repeated connect/disconnect cycles.
Causes:
- Heartbeat timeout — the worker or network is too slow to respond to pings within HEARTBEAT_INTERVAL
- WORKER_SECRET mismatch — worker connects, fails auth, gets disconnected, retries
Fix: Check secrets match, check network latency between worker and proxy.
| Env Var | Default | Description |
|---|---|---|
| `LISTEN_ADDR` | `127.0.0.1:8080` | HTTP listen address |
| `PROVIDER_NAME` | `local` | Provider name for routing |
| `WORKER_SECRET` | (required) | Shared secret for worker auth |
| `MAX_QUEUE_LEN` | `100` | Max queued requests before rejecting |
| `QUEUE_TIMEOUT_SECS` | `30` | How long a request can wait in queue |
| `REQUEST_TIMEOUT_SECS` | `300` | Total request timeout (5 min) |
| `LOG_LEVEL` | `info` | Log verbosity |
| `MODELRELAY_ADMIN_TOKEN` | (none) | Bearer token for /admin/* endpoints (if unset, admin returns 403) |
| `MODELRELAY_REQUIRE_API_KEYS` | `false` | When true, client requests require a valid API key |
| Env Var | Default | Description |
|---|---|---|
| `PROXY_URL` | `http://127.0.0.1:8080` | Proxy server URL |
| `WORKER_SECRET` | (required) | Must match proxy's secret |
| `WORKER_NAME` | `worker` | Human-readable worker name |
| `BACKEND_URL` | `http://127.0.0.1:8000` | Local model server URL |
| `LOG_LEVEL` | `info` | Log verbosity |
```powershell
Get-Service ModelRelayServer
Get-Service ModelRelayWorker
```

```powershell
Start-Service ModelRelayServer
Stop-Service ModelRelayServer
Start-Service ModelRelayWorker
Stop-Service ModelRelayWorker
```

Stop-Service sends a stop control signal and waits for the process to
exit. ModelRelay handles this as a graceful shutdown — in-flight
requests finish before the process terminates. To set an explicit
timeout:
```powershell
# Request a stop without blocking, give the service up to 60 seconds,
# then confirm it actually reached the Stopped state.
Stop-Service ModelRelayServer -NoWait
Start-Sleep -Seconds 60
(Get-Service ModelRelayServer).WaitForStatus("Stopped", "00:00:05")
```

Windows Services don't write to stdout by default. Two options:
- Windows Event Log — ModelRelay writes to the Application log. View with:

  ```powershell
  Get-EventLog -LogName Application -Source ModelRelayServer -Newest 50
  ```

- File logging via RUST_LOG — set RUST_LOG as a system environment variable and redirect output to a file by wrapping the binary in a small script, or use the RUST_LOG_FILE convention if supported. The simplest approach:

  ```powershell
  [Environment]::SetEnvironmentVariable("RUST_LOG", "info", "Machine")
  ```
To drain a worker gracefully before maintenance:
```powershell
# Stop the service — this triggers graceful shutdown.
Stop-Service ModelRelayWorker

# Verify it has stopped.
Get-Service ModelRelayWorker
```

The worker completes in-flight requests before exiting, identical to the systemctl stop behavior on Linux.
For production deployments, monitor these signals:
- Proxy process is up — HTTP health check on /health
- At least one worker registered — /health returns workers_connected > 0
- Queue depth — /health returns queue_depth; watch for sustained growth
- Request latency — track time from client request to first byte
- Worker reconnect rate — frequent reconnects indicate network or auth issues
- Error rates — 4xx (client errors) vs 5xx (backend/proxy errors)
- Backend health — each worker's local model server should be independently monitored