MATA-SERVER


MATA-SERVER is the production inference runtime server for the MATA ecosystem. It wraps MATA's unified model adapter behind a REST + WebSocket API, providing on-demand model loading, memory-aware eviction, and real-time streaming inference — all without coupling your application to a specific ML runtime.


Features

  • Unified inference API — single endpoint for detection, segmentation, classification, pose estimation, OCR, depth, and VLM tasks
  • REST + WebSocket — POST /v1/infer for single-shot requests; WS /v1/stream/{session_id} for real-time frame streaming
  • On-demand model loading — models are loaded on first request and evicted under memory pressure (LRU policy)
  • Memory-aware eviction — configurable VRAM and RAM utilization ceilings with keep-alive protection for active models
  • Multi-source model pulls — fetch models from HuggingFace Hub, arbitrary URLs, or local directories via POST /v1/models/pull
  • mata.v1 response schema — consistent, versioned JSON output across all task types
  • GPU + CPU support — CUDA GPU inference via NVIDIA runtime; automatic CPU fallback
  • API key authentication — bearer token auth with configurable key list (or disabled for development)
  • OpenAPI docs — auto-generated Swagger UI at /docs and ReDoc at /redoc
  • Docker-ready — multi-stage CUDA image; GPU passthrough with NVIDIA Container Toolkit
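To make the on-demand loading and memory-aware eviction behavior concrete, here is a small sketch of an LRU cache with keep-alive protection. This is purely illustrative, not the server's actual cache code, and the "10% freed per eviction" figure is an invented placeholder:

```python
import time
from collections import OrderedDict

class LruModelCache:
    """Illustrative sketch: evict least-recently-used models when a
    utilization ceiling is exceeded, but never evict a model that is
    still inside its keep-alive window."""

    def __init__(self, max_util: float = 0.85, keep_alive: float = 600.0):
        self.max_util = max_util      # utilization fraction that triggers eviction
        self.keep_alive = keep_alive  # seconds a model is protected after last use
        self._models = OrderedDict()  # model_id -> last-used timestamp (oldest first)

    def touch(self, model_id: str) -> None:
        """Record a request: (re)load the model and mark it most recently used."""
        self._models[model_id] = time.monotonic()
        self._models.move_to_end(model_id)

    def evict_if_needed(self, current_util: float) -> list[str]:
        """Evict LRU models while utilization exceeds the ceiling."""
        evicted = []
        while current_util > self.max_util and self._models:
            model_id, last_used = next(iter(self._models.items()))
            if time.monotonic() - last_used < self.keep_alive:
                break  # oldest model is still within keep-alive; stop evicting
            del self._models[model_id]
            evicted.append(model_id)
            current_util -= 0.10  # assume each eviction frees ~10% (illustrative)
        return evicted
```

The real server applies the same idea against measured VRAM/RAM utilization (see the MATA_SERVER_MAX_VRAM_UTIL and MATA_SERVER_KEEP_ALIVE settings below).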

Quick Start

Local install

Requirements: Python 3.10+, pip

# 1. Clone the repository
git clone https://github.com/datamata-io/mata-server.git
cd mata-server

# 2. Install (CPU-only; add [onnx] or [torch] for GPU backends)
pip install -e .

# 3. Configure
cp .env.example .env
# Edit .env — set MATA_SERVER_API_KEYS, or set MATA_SERVER_AUTH_MODE=none for local dev

# 4. Start the server
mataserver serve

The server starts on http://0.0.0.0:8110. Visit http://localhost:8110/docs for the interactive API explorer.

Docker

# Pull the pre-built image
docker pull ghcr.io/datamata-io/mataserver:latest

# Run (CPU-only)
docker run -p 8110:8110 \
  -e MATA_SERVER_AUTH_MODE=none \
  -v mataserver-data:/var/lib/mataserver \
  ghcr.io/datamata-io/mataserver:latest

# Run with GPU (requires NVIDIA Container Toolkit)
docker run --gpus all -p 8110:8110 \
  -e MATA_SERVER_AUTH_MODE=none \
  -v mataserver-data:/var/lib/mataserver \
  ghcr.io/datamata-io/mataserver:latest

Docker Compose

cp .env.example .env
# Edit .env with your settings
docker compose up -d

Verify the server is running:

curl http://localhost:8110/v1/health
# {"status":"ok","version":"0.1.0","gpu_available":false}

CLI Commands

The mataserver console script provides commands for server management and model operations.

Command                        Description
mataserver serve               Start the inference server
mataserver pull <m> --task T   Download/install and register a model (HuggingFace or pip backend)
mataserver list                List all registered models (alias: ls)
mataserver show <m>            Show detailed info for a model
mataserver rm <m>              Remove a model from the registry
mataserver load <m>            Preload a model into memory (alias: warmup)
mataserver stop <m>            Unload a model from memory
mataserver version             Print version (also: mataserver -v)

For full usage details, argument references, and examples, see docs/api.md.


Configuration

All settings use the MATA_SERVER_ environment variable prefix and can also be set in a .env file or a YAML config specified by MATA_SERVER_CONFIG_FILE.

Variable                      Default              Description
MATA_SERVER_HOST              0.0.0.0              Bind address
MATA_SERVER_PORT              8110                 Bind port
MATA_SERVER_LOG_LEVEL         info                 Logging level (debug, info, warning, error)
MATA_SERVER_AUTH_MODE         api_key              Auth mode: api_key (enforce bearer tokens) or none (open, dev only)
MATA_SERVER_API_KEYS          (empty)              Comma-separated list of valid API keys (required when auth_mode=api_key)
MATA_SERVER_KEEP_ALIVE        600                  Seconds a loaded model stays in memory after its last request
MATA_SERVER_MAX_VRAM_UTIL     0.85                 Fraction of GPU VRAM that triggers model eviction (0.0–1.0)
MATA_SERVER_MAX_RAM_UTIL      0.80                 Fraction of system RAM that triggers model eviction (0.0–1.0)
MATA_SERVER_EVICTION_POLICY   lru                  Model eviction policy — currently only lru (least-recently-used)
MATA_SERVER_DATA_DIR          /var/lib/mataserver  Root directory for models, cache, and blob storage
MATA_SERVER_CONFIG_FILE       (unset)              Optional path to a YAML config file (env vars always take priority)

Priority (highest → lowest): environment variables → YAML config file → .env file → built-in defaults.
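The resolution order can be sketched as a dictionary merge applied lowest-priority-first, so later (higher-priority) sources overwrite earlier ones. This is an illustrative sketch, not the server's actual settings loader:

```python
def resolve_settings(env: dict, yaml_cfg: dict, dotenv: dict, defaults: dict) -> dict:
    """Merge setting sources in the documented priority order:
    environment variables > YAML config file > .env file > built-in defaults."""
    merged = dict(defaults)
    # Apply sources from lowest to highest priority; each update overwrites.
    for source in (dotenv, yaml_cfg, env):
        merged.update(source)
    return merged
```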

See .env.example for a fully annotated template with production-recommended values.


API Endpoints

Interactive docs are served at http://localhost:8110/docs (Swagger UI) and http://localhost:8110/redoc.

Method     Path                        Auth    Description
GET        /v1/health                  No      Server health check — always returns 200 OK
GET        /v1/models                  Yes     List all registered models
GET        /v1/models/{model_id}       Yes     Get details and load state for a single model
POST       /v1/models/pull             Yes     Pull a model from HuggingFace, URL, or local path
POST       /v1/models/warmup           Yes     Pre-load a model into memory
POST       /v1/infer                   Yes     Single-shot inference (JSON body, base64 image)
POST       /v1/infer/upload            Yes     Single-shot inference (multipart file upload)
POST       /v1/sessions                Yes     Create a WebSocket streaming session
DELETE     /v1/sessions/{session_id}   Yes     Close and clean up a streaming session
WebSocket  /v1/stream/{session_id}     ?token  Real-time binary frame → JSON result streaming

REST endpoints authenticate via Authorization: Bearer <key>. The WebSocket endpoint uses ?token=<key> as a query parameter.

For full request/response schemas, per-endpoint error codes, and additional curl examples, see docs/api.md.
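Two small helpers capture the two authentication forms (the function names here are illustrative, not part of any shipped client library):

```python
from urllib.parse import quote

def rest_headers(api_key: str) -> dict:
    """Bearer-token header used by all authenticated REST endpoints."""
    return {"Authorization": f"Bearer {api_key}"}

def ws_url(base: str, session_id: str, api_key: str) -> str:
    """WebSocket URL with the key passed as a ?token= query parameter."""
    return f"{base}/v1/stream/{session_id}?token={quote(api_key)}"
```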


Example Usage

Health check

curl http://localhost:8110/v1/health
{ "status": "ok", "version": "0.1.0", "gpu_available": false }

Pull a model

# HuggingFace model
curl -X POST http://localhost:8110/v1/models/pull \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "datamata/rtdetr-l", "task": "detect"}'
{ "status": "pulled", "model": "datamata/rtdetr-l" }
# Pip-based OCR backend
curl -X POST http://localhost:8110/v1/models/pull \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "easyocr", "task": "ocr"}'

Or via the CLI:

# HuggingFace detection model (example: RT-DETR with ResNet-18 backbone)
mataserver pull PekingU/rtdetr_r18vd --task detect

# HuggingFace classification model (example: ResNet-50)
mataserver pull microsoft/resnet-50 --task classify

# HuggingFace segmentation model (example: Mask2Former Swin-Tiny trained on COCO)
mataserver pull facebook/mask2former-swin-tiny-coco-instance --task segment

# HuggingFace depth model (example: Depth Anything V2 Small)
mataserver pull depth-anything/Depth-Anything-V2-Small-hf --task depth

# HuggingFace vision-language model (VLM)
mataserver pull Qwen/Qwen3-VL-2B-Instruct --task vlm

# HuggingFace OCR model
mataserver pull stepfun-ai/GOT-OCR-2.0-hf --task ocr

# Pip-installed OCR backends
mataserver pull easyocr --task ocr
mataserver pull paddleocr --task ocr
mataserver pull tesseract --task ocr  # requires tesseract system binary

Single-shot inference (base64 JSON)

IMAGE_B64=$(base64 -w0 /path/to/image.jpg)

curl -X POST http://localhost:8110/v1/infer \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"datamata/rtdetr-l\", \"image\": \"${IMAGE_B64}\", \"confidence\": 0.5}"
{
  "schema_version": "mata.v1",
  "task": "detect",
  "model": "datamata/rtdetr-l",
  "timestamp": "2026-03-04T10:00:00Z",
  "detections": [
    { "label": "person", "confidence": 0.92, "bbox": [120, 45, 300, 480] }
  ]
}

Single-shot inference (file upload)

curl -X POST http://localhost:8110/v1/infer/upload \
  -H "Authorization: Bearer your-api-key" \
  -F "model=datamata/rtdetr-l" \
  -F "confidence=0.5" \
  -F "file=@/path/to/image.jpg"

Create a streaming session and connect via WebSocket

# 1. Create the session
SESSION=$(curl -s -X POST http://localhost:8110/v1/sessions \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "datamata/rtdetr-l", "task": "detect"}' \
  | python3 -c "import sys, json; print(json.load(sys.stdin)['session_id'])")

echo "Session ID: $SESSION"

# 2. Connect via WebSocket and stream frames (separate Python script; requires the websockets package)
import asyncio, struct, time, websockets

SESSION_ID = "sess_xxxxxxxxxxxx"  # replace with session_id from above
API_KEY    = "your-api-key"

async def stream():
    uri = f"ws://localhost:8110/v1/stream/{SESSION_ID}?token={API_KEY}"
    async with websockets.connect(uri) as ws:
        with open("/path/to/image.jpg", "rb") as f:
            image = f.read()
        # 13-byte header: frame_id (uint32 BE) + timestamp (float64 BE) + encoding (uint8, 0=JPEG)
        header = struct.pack(">IdB", 1, time.time(), 0)
        await ws.send(header + image)
        result = await ws.recv()
        print(result)

asyncio.run(stream())

See docs/streaming.md for the full binary frame protocol specification and a complete async client example.
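Based on the header layout shown in the example above (docs/streaming.md remains the authoritative spec), the 13-byte frame header can be packed and unpacked with a pair of small helpers:

```python
import struct

# frame_id (uint32 BE) + timestamp (float64 BE) + encoding (uint8): 4 + 8 + 1 = 13 bytes
FRAME_HEADER = struct.Struct(">IdB")

def pack_frame(frame_id: int, timestamp: float, encoding: int, payload: bytes) -> bytes:
    """Prepend the 13-byte header to an encoded image payload."""
    return FRAME_HEADER.pack(frame_id, timestamp, encoding) + payload

def unpack_frame(message: bytes) -> tuple[int, float, int, bytes]:
    """Split a binary frame back into (frame_id, timestamp, encoding, payload)."""
    frame_id, timestamp, encoding = FRAME_HEADER.unpack_from(message)
    return frame_id, timestamp, encoding, message[FRAME_HEADER.size:]
```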

Runnable example scripts

The examples/ directory contains ready-to-run Python clients:

Script                       Description
examples/rest_infer.py       REST inference — detect, classify, segment
examples/rest_vlm.py         REST inference — visual language model (VLM)
examples/ws_video_infer.py   WebSocket video streaming — frame-by-frame results

See examples/README.md for full usage, argument reference, and sample output for each script.


Development Setup

# 1. Clone
git clone https://github.com/datamata-io/mata-server.git
cd mata-server

# 2. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate       # Linux / macOS
venv\Scripts\activate          # Windows

# 3. Install with all extras and dev tools
pip install -e ".[all,dev]"

# 4. Run the full test suite
pytest

# 5. Run with coverage report
pytest --cov=mataserver --cov-report=term-missing

# 6. Lint and format
ruff check mataserver/ tests/
ruff format mataserver/ tests/

# 7. Start the server in development mode (auth disabled, auto-reload)
MATA_SERVER_AUTH_MODE=none uvicorn mataserver.main:create_app --factory --reload

Project structure

mataserver/
├── api/          # FastAPI routers and middleware
│   └── v1/       # Versioned endpoints: health, models, infer, sessions, stream
├── core/         # Lifecycle state machine, model cache, memory manager, runtime manager
├── engines/      # Engine base class (MATA handles runtime internals)
├── models/       # Pack manifest parsing, registry, pull system
├── schemas/      # Pydantic request / response models
├── streaming/    # WebSocket binary protocol, session dataclass, session manager
└── utils/        # GPU utilities, structured logging helpers

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository and create a feature branch:

    git checkout -b feature/my-feature
  2. Implement your change, following the coding standards:

    • Python 3.10+ syntax; type annotations on all public functions
    • Docstrings on all public classes and functions
    • Line length ≤ 100 characters (ruff enforced)
  3. Test your change — all tests must pass and coverage must stay above 85%:

    pytest --cov=mataserver
    ruff check mataserver/ tests/
    ruff format --check mataserver/ tests/
  4. Commit using conventional commit messages:

    feat(api): add streaming frame drop policy
    fix(memory): prevent eviction of active models
    docs(readme): update configuration table
    
  5. Open a Pull Request against main. Describe what the PR changes and reference any related issue.

CI runs lint and tests on Python 3.10, 3.11, and 3.12. PRs that fail CI will not be merged.

Bug reports and feature requests are welcome via GitHub Issues.


License

MIT — see LICENSE
