MATA-SERVER is the production inference runtime server for the MATA ecosystem. It wraps MATA's unified model adapter behind a REST + WebSocket API, providing on-demand model loading, memory-aware eviction, and real-time streaming inference — all without coupling your application to a specific ML runtime.
- Unified inference API — single endpoint for detection, segmentation, classification, pose estimation, OCR, depth, and VLM tasks
- REST + WebSocket — `POST /v1/infer` for single-shot requests; `WS /v1/stream/{session_id}` for real-time frame streaming
- On-demand model loading — models are loaded on first request and evicted under memory pressure (LRU policy)
- Memory-aware eviction — configurable VRAM and RAM utilization ceilings with keep-alive protection for active models
- Multi-source model pulls — fetch models from HuggingFace Hub, arbitrary URLs, or local directories via `POST /v1/models/pull`
- `mata.v1` response schema — consistent, versioned JSON output across all task types
- GPU + CPU support — CUDA GPU inference via NVIDIA runtime; automatic CPU fallback
- API key authentication — bearer token auth with configurable key list (or disabled for development)
- OpenAPI docs — auto-generated Swagger UI at `/docs` and ReDoc at `/redoc`
- Docker-ready — multi-stage CUDA image; GPU passthrough with NVIDIA Container Toolkit
Requirements: Python 3.10+, pip
```bash
# 1. Clone the repository
git clone https://github.com/datamata-io/mata-server.git
cd mata-server

# 2. Install (CPU-only; add [onnx] or [torch] for GPU backends)
pip install -e .

# 3. Configure
cp .env.example .env
# Edit .env — set MATA_SERVER_API_KEYS, or set MATA_SERVER_AUTH_MODE=none for local dev

# 4. Start the server
mataserver serve
```

The server starts on http://0.0.0.0:8110. Visit http://localhost:8110/docs for the interactive API explorer.
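Once the server is up, you can also verify it programmatically. A minimal stdlib-only sketch (assumes the default port 8110; `server_healthy` is an illustrative helper, not part of any shipped client):

```python
import json
import urllib.request

def server_healthy(base: str = "http://localhost:8110") -> bool:
    # /v1/health requires no API key, so a plain GET is enough.
    try:
        with urllib.request.urlopen(f"{base}/v1/health", timeout=5) as resp:
            return json.load(resp).get("status") == "ok"
    except OSError:
        # Connection refused, DNS failure, or timeout -> treat as unhealthy.
        return False

if __name__ == "__main__":
    print("healthy" if server_healthy() else "unreachable")
```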
```bash
# Pull the pre-built image
docker pull ghcr.io/datamata-io/mataserver:latest

# Run (CPU-only)
docker run -p 8110:8110 \
  -e MATA_SERVER_AUTH_MODE=none \
  -v mataserver-data:/var/lib/mataserver \
  ghcr.io/datamata-io/mataserver:latest

# Run with GPU (requires NVIDIA Container Toolkit)
docker run --gpus all -p 8110:8110 \
  -e MATA_SERVER_AUTH_MODE=none \
  -v mataserver-data:/var/lib/mataserver \
  ghcr.io/datamata-io/mataserver:latest
```

Or with Docker Compose:

```bash
cp .env.example .env
# Edit .env with your settings
docker compose up -d
```

Verify the server is running:

```bash
curl http://localhost:8110/v1/health
# {"status":"ok","version":"0.1.0","gpu_available":false}
```

The `mataserver` console script provides commands for server management and model operations.
| Command | Description |
|---|---|
| `mataserver serve` | Start the inference server |
| `mataserver pull <m> --task T` | Download/install and register a model (HuggingFace or pip backend) |
| `mataserver list` | List all registered models (alias: `ls`) |
| `mataserver show <m>` | Show detailed info for a model |
| `mataserver rm <m>` | Remove a model from the registry |
| `mataserver load <m>` | Preload a model into memory (alias: `warmup`) |
| `mataserver stop <m>` | Unload a model from memory |
| `mataserver version` | Print version (also: `mataserver -v`) |
For full usage details, argument references, and examples, see docs/api.md.
All settings use the `MATA_SERVER_` environment-variable prefix and can also be set in a `.env` file or in a YAML config specified by `MATA_SERVER_CONFIG_FILE`.
| Variable | Default | Description |
|---|---|---|
| `MATA_SERVER_HOST` | `0.0.0.0` | Bind address |
| `MATA_SERVER_PORT` | `8110` | Bind port |
| `MATA_SERVER_LOG_LEVEL` | `info` | Logging level (`debug`, `info`, `warning`, `error`) |
| `MATA_SERVER_AUTH_MODE` | `api_key` | Auth mode: `api_key` (enforce bearer tokens) or `none` (open, dev only) |
| `MATA_SERVER_API_KEYS` | (empty) | Comma-separated list of valid API keys (required when `auth_mode=api_key`) |
| `MATA_SERVER_KEEP_ALIVE` | `600` | Seconds a loaded model stays in memory after its last request |
| `MATA_SERVER_MAX_VRAM_UTIL` | `0.85` | Fraction of GPU VRAM that triggers model eviction (0.0–1.0) |
| `MATA_SERVER_MAX_RAM_UTIL` | `0.80` | Fraction of system RAM that triggers model eviction (0.0–1.0) |
| `MATA_SERVER_EVICTION_POLICY` | `lru` | Model eviction policy — currently only `lru` (least-recently-used) |
| `MATA_SERVER_DATA_DIR` | `/var/lib/mataserver` | Root directory for models, cache, and blob storage |
| `MATA_SERVER_CONFIG_FILE` | (unset) | Optional path to a YAML config file (env vars always take priority) |
Priority (highest → lowest): environment variables → YAML config file → .env file → built-in defaults.
See .env.example for a fully annotated template with production-recommended values.
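For the YAML route, a hypothetical config sketch is shown below. The key names here are an assumption (environment variables with the `MATA_SERVER_` prefix stripped and lowercased) — check `.env.example` or `docs/api.md` for the exact spelling before relying on them:

```yaml
# config.yaml — pointed to by MATA_SERVER_CONFIG_FILE=/etc/mataserver/config.yaml
host: 0.0.0.0
port: 8110
log_level: info
auth_mode: api_key
api_keys: "key-one,key-two"
keep_alive: 600
max_vram_util: 0.85
max_ram_util: 0.80
eviction_policy: lru
data_dir: /var/lib/mataserver
```

Remember that environment variables override anything set in this file.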
Interactive docs are served at http://localhost:8110/docs (Swagger UI) and http://localhost:8110/redoc.
| Method | Path | Auth | Description |
|---|---|---|---|
| `GET` | `/v1/health` | No | Server health check — always returns 200 OK |
| `GET` | `/v1/models` | Yes | List all registered models |
| `GET` | `/v1/models/{model_id}` | Yes | Get details and load state for a single model |
| `POST` | `/v1/models/pull` | Yes | Pull a model from HuggingFace, URL, or local path |
| `POST` | `/v1/models/warmup` | Yes | Pre-load a model into memory |
| `POST` | `/v1/infer` | Yes | Single-shot inference (JSON body, base64 image) |
| `POST` | `/v1/infer/upload` | Yes | Single-shot inference (multipart file upload) |
| `POST` | `/v1/sessions` | Yes | Create a WebSocket streaming session |
| `DELETE` | `/v1/sessions/{session_id}` | Yes | Close and clean up a streaming session |
| WebSocket | `/v1/stream/{session_id}` | `?token` | Real-time binary frame → JSON result streaming |
REST endpoints authenticate via `Authorization: Bearer <key>`. The WebSocket endpoint uses `?token=<key>` as a query parameter.
For full request/response schemas, per-endpoint error codes, and additional curl examples, see docs/api.md.
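The bearer-auth scheme above translates directly into client code. A minimal stdlib sketch — `auth_headers` and `list_models` are invented helper names, and the key/URL are placeholders:

```python
import json
import urllib.request

API_KEY = "your-api-key"  # placeholder — use a key from MATA_SERVER_API_KEYS

def auth_headers(api_key: str) -> dict[str, str]:
    # REST endpoints expect: Authorization: Bearer <key>
    return {"Authorization": f"Bearer {api_key}"}

def list_models(base: str = "http://localhost:8110") -> dict:
    # Example call: GET /v1/models with bearer auth.
    req = urllib.request.Request(f"{base}/v1/models", headers=auth_headers(API_KEY))
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

For the WebSocket endpoint, the same key goes in the URL instead: `ws://host:8110/v1/stream/{session_id}?token=<key>`.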
```bash
curl http://localhost:8110/v1/health
```

```json
{ "status": "ok", "version": "0.1.0", "gpu_available": false }
```

```bash
# HuggingFace model
curl -X POST http://localhost:8110/v1/models/pull \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "datamata/rtdetr-l", "task": "detect"}'
```

```json
{ "status": "pulled", "model": "datamata/rtdetr-l" }
```

```bash
# Pip-based OCR backend
curl -X POST http://localhost:8110/v1/models/pull \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "easyocr", "task": "ocr"}'
```

Or via the CLI:
```bash
# HuggingFace detection model (example: RT-DETR, ResNet-18 backbone)
mataserver pull PekingU/rtdetr_r18vd --task detect

# HuggingFace classification model (example: ResNet-50)
mataserver pull microsoft/resnet-50 --task classify

# HuggingFace segmentation model (example: Mask2Former Swin-Tiny trained on COCO)
mataserver pull facebook/mask2former-swin-tiny-coco-instance --task segment

# HuggingFace depth model (example: Depth Anything V2 Small)
mataserver pull depth-anything/Depth-Anything-V2-Small-hf --task depth

# HuggingFace visual language model (VLM)
mataserver pull Qwen/Qwen3-VL-2B-Instruct --task vlm

# HuggingFace OCR model
mataserver pull stepfun-ai/GOT-OCR-2.0-hf --task ocr

# Pip-installed OCR backends
mataserver pull easyocr --task ocr
mataserver pull paddleocr --task ocr
mataserver pull tesseract --task ocr   # requires the tesseract system binary
```

Single-shot inference with a base64-encoded image:

```bash
IMAGE_B64=$(base64 -w0 /path/to/image.jpg)

curl -X POST http://localhost:8110/v1/infer \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"datamata/rtdetr-l\", \"image\": \"${IMAGE_B64}\", \"confidence\": 0.5}"
```

```json
{
  "schema_version": "mata.v1",
  "task": "detect",
  "model": "datamata/rtdetr-l",
  "timestamp": "2026-03-04T10:00:00Z",
  "detections": [
    { "label": "person", "confidence": 0.92, "bbox": [120, 45, 300, 480] }
  ]
}
```

Or as a multipart file upload:

```bash
curl -X POST http://localhost:8110/v1/infer/upload \
  -H "Authorization: Bearer your-api-key" \
  -F "model=datamata/rtdetr-l" \
  -F "confidence=0.5" \
  -F "file=@/path/to/image.jpg"
```

For real-time streaming:

```bash
# 1. Create the session
SESSION=$(curl -s -X POST http://localhost:8110/v1/sessions \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "datamata/rtdetr-l", "task": "detect"}' \
  | python3 -c "import sys, json; print(json.load(sys.stdin)['session_id'])")
echo "Session ID: $SESSION"
```

```python
# 2. Connect via WebSocket and stream frames
import asyncio, struct, time, websockets

SESSION_ID = "sess_xxxxxxxxxxxx"  # replace with session_id from above
API_KEY = "your-api-key"

async def stream():
    uri = f"ws://localhost:8110/v1/stream/{SESSION_ID}?token={API_KEY}"
    async with websockets.connect(uri) as ws:
        with open("/path/to/image.jpg", "rb") as f:
            image = f.read()
        # 13-byte header: frame_id (uint32 BE) + timestamp (float64 BE) + encoding (uint8, 0=JPEG)
        header = struct.pack(">IdB", 1, time.time(), 0)
        await ws.send(header + image)
        result = await ws.recv()
        print(result)

asyncio.run(stream())
```

See docs/streaming.md for the full binary frame protocol specification and a complete async client example.
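The 13-byte frame header used by the streaming client above can be packed and unpacked symmetrically with `struct`. A small sketch — the helper names are illustrative, not part of the server API:

```python
import struct

# Frame header, as described above: frame_id (uint32, big-endian) +
# timestamp (float64, big-endian) + encoding (uint8, 0 = JPEG).
HEADER_FORMAT = ">IdB"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 4 + 8 + 1 = 13 bytes

def pack_header(frame_id: int, timestamp: float, encoding: int = 0) -> bytes:
    return struct.pack(HEADER_FORMAT, frame_id, timestamp, encoding)

def unpack_header(frame: bytes) -> tuple[int, float, int]:
    # Works on a whole frame: the header is the first 13 bytes, payload follows.
    return struct.unpack(HEADER_FORMAT, frame[:HEADER_SIZE])
```

Using big-endian (`>`) on both sides keeps the wire format machine-independent.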
The examples/ directory contains ready-to-run Python clients:
| Script | Description |
|---|---|
| `examples/rest_infer.py` | REST inference — detect, classify, segment |
| `examples/rest_vlm.py` | REST inference — visual language model (VLM) |
| `examples/ws_video_infer.py` | WebSocket video streaming — frame-by-frame results |
See examples/README.md for full usage, argument reference, and sample output for each script.
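The `mata.v1` detect response shown earlier can be consumed in a few lines. A sketch — `boxes_above` is an invented helper, and the field names follow the example response in this README:

```python
def boxes_above(response: dict, threshold: float = 0.5) -> list[tuple[str, list[int]]]:
    # Keep (label, bbox) pairs for detections at or above the confidence threshold.
    return [
        (d["label"], d["bbox"])
        for d in response.get("detections", [])
        if d["confidence"] >= threshold
    ]

# Example mata.v1 detect payload (abridged from the response above).
example = {
    "schema_version": "mata.v1",
    "task": "detect",
    "detections": [
        {"label": "person", "confidence": 0.92, "bbox": [120, 45, 300, 480]},
        {"label": "dog", "confidence": 0.31, "bbox": [10, 10, 50, 50]},
    ],
}
```

Because the schema is versioned, a client can check `response["schema_version"] == "mata.v1"` before parsing.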
```bash
# 1. Clone
git clone https://github.com/datamata-io/mata-server.git
cd mata-server

# 2. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate   # Linux / macOS
venv\Scripts\activate      # Windows

# 3. Install with all extras and dev tools
pip install -e ".[all,dev]"

# 4. Run the full test suite
pytest

# 5. Run with coverage report
pytest --cov=mataserver --cov-report=term-missing

# 6. Lint and format
ruff check mataserver/ tests/
ruff format mataserver/ tests/

# 7. Start the server in development mode (auth disabled, auto-reload)
MATA_SERVER_AUTH_MODE=none uvicorn mataserver.main:create_app --factory --reload
```

```
mataserver/
├── api/        # FastAPI routers and middleware
│   └── v1/     # Versioned endpoints: health, models, infer, sessions, stream
├── core/       # Lifecycle state machine, model cache, memory manager, runtime manager
├── engines/    # Engine base class (MATA handles runtime internals)
├── models/     # Pack manifest parsing, registry, pull system
├── schemas/    # Pydantic request / response models
├── streaming/  # WebSocket binary protocol, session dataclass, session manager
└── utils/      # GPU utilities, structured logging helpers
```
Contributions are welcome! Please follow these steps:

- Fork the repository and create a feature branch:

  ```bash
  git checkout -b feature/my-feature
  ```

- Implement your change, following the coding standards:
  - Python 3.10+ syntax; type annotations on all public functions
  - Docstrings on all public classes and functions
  - Line length ≤ 100 characters (`ruff`-enforced)

- Test your change — all tests must pass and coverage must stay above 85%:

  ```bash
  pytest --cov=mataserver
  ruff check mataserver/ tests/
  ruff format --check mataserver/ tests/
  ```

- Commit using conventional commit messages:

  ```
  feat(api): add streaming frame drop policy
  fix(memory): prevent eviction of active models
  docs(readme): update configuration table
  ```

- Open a Pull Request against `main`. Describe what the PR changes and reference any related issue.
CI runs lint and tests on Python 3.10, 3.11, and 3.12. PRs that fail CI will not be merged.
Bug reports and feature requests are welcome via GitHub Issues.
MIT — see LICENSE
