Security: tushu1232/turboquant-server

Security Policy

Supported versions

| Version | Supported |
|---------|-----------|
| 1.2.x   | ✅ Active |
| 1.1.x   | ⚠️ Critical fixes only |
| 1.0.x   | ❌ End of life |

Only the latest minor release receives security patches. We strongly recommend running v1.2.x in production.


Reporting a vulnerability

Please do not open a public GitHub issue for security vulnerabilities.

Report privately via one of these channels:

| Channel | Address |
|---------|---------|
| Email | contact@eviox.tech |
| GitHub private advisory | New advisory |

Include in your report:

  • A description of the vulnerability and its potential impact
  • Affected component (turboquant_corrected.py, app/server.py, app/engine.py, Docker image, etc.)
  • Steps to reproduce (proof of concept, curl command, or code snippet)
  • Suggested fix if you have one

You will receive an acknowledgement within 48 hours and a full response (fix timeline or decision) within 7 days.

We follow coordinated disclosure. We ask that you give us 90 days to release a patch before publishing details publicly. We will credit you in the changelog and release notes unless you prefer to remain anonymous.


Scope

In scope

  • Authentication bypass on the /v1/ API endpoints
  • Server-side request forgery (SSRF) via model name or path parameters
  • Arbitrary file read/write through the codebook path (CODEBOOK_PATH env var)
  • Remote code execution via prompt injection into generation parameters
  • Denial-of-service via malformed requests that bypass Pydantic validation
  • Container escape or privilege escalation via the Docker image
  • Secrets leakage (HF_TOKEN, API_KEY) in logs, error responses, or metrics
  • Dependency vulnerabilities in requirements.txt with a known CVE and a working exploit

Out of scope

  • Vulnerabilities in the base model weights (Llama, Mistral, etc.) — report to Meta / Mistral AI
  • Vulnerabilities in upstream dependencies without a working exploit against this server specifically
  • Rate limiting / abuse without authentication bypass
  • Social engineering attacks
  • Physical access attacks
  • Issues only reproducible on end-of-life versions

Security architecture

Authentication

The server supports optional bearer token authentication via the API_KEY environment variable.

```bash
# Enable auth: set a strong random token
export API_KEY=$(openssl rand -hex 32)
```

When API_KEY is set, all /v1/ endpoints require:

```http
Authorization: Bearer <token>
```

Requests without a valid token receive HTTP 401. The /health and /ready endpoints are intentionally unauthenticated for load-balancer probes. The /metrics endpoint should be network-restricted separately (see below).

The default deployment has no authentication (API_KEY is empty). You must set API_KEY before exposing the server to any network.
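The token comparison should be constant-time to avoid leaking key material through timing differences. A minimal sketch of such a check, assuming a helper like the following (the function name is hypothetical; app/server.py may implement this differently):

```python
import hmac
from typing import Optional


def is_authorized(auth_header: Optional[str], api_key: str) -> bool:
    """Validate an Authorization header against the configured API key.

    Uses hmac.compare_digest for a constant-time comparison, so the
    response time does not reveal how many leading characters matched.
    """
    if not api_key:  # auth disabled when API_KEY is empty
        return True
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    token = auth_header[len("Bearer "):]
    return hmac.compare_digest(token, api_key)
```

A plain `token == api_key` comparison would short-circuit on the first mismatching byte, which is the timing side channel `compare_digest` avoids.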

Secrets handling

| Secret | How it is used | How to protect |
|--------|----------------|----------------|
| HF_TOKEN | Passed to the HuggingFace Hub client to download model weights at startup. Not stored. | Use a read-only HF token scoped to the specific model. Rotate regularly. |
| API_KEY | Compared in-memory for request authentication. Never logged. | Generate with `openssl rand -hex 32`. Store in Docker secrets or a secrets manager, never in docker-compose.yml directly. |
| GRAFANA_PASSWORD | Grafana admin password. | Change from the default `admin` before any network exposure. |

Never commit secrets to the repository. The .gitignore excludes .env files. Use environment variable injection at runtime:

```bash
# docker-compose: use an .env file (not committed)
echo "HF_TOKEN=hf_..." > .env
echo "API_KEY=$(openssl rand -hex 32)" >> .env
docker compose up -d
```

Network exposure

By default all three services bind to 0.0.0.0:

| Port | Service | Recommended production access |
|------|---------|-------------------------------|
| 8000 | TurboQuant API | Behind a reverse proxy (nginx / Caddy) with TLS. Restrict to trusted clients. |
| 9090 | Prometheus | Internal network only. Block from the public internet. |
| 3000 | Grafana | Internal network only or behind an SSO proxy. Change the default password. |

Use a firewall or Docker network to isolate Prometheus and Grafana from public access. Example using ufw:

```bash
ufw allow from 10.0.0.0/8 to any port 9090
ufw allow from 10.0.0.0/8 to any port 3000
ufw deny 9090
ufw deny 3000
```

TLS

The server does not handle TLS termination directly. Place it behind a reverse proxy that enforces HTTPS:

```nginx
# nginx example
server {
    listen 443 ssl;
    ssl_certificate     /etc/ssl/certs/turboquant.crt;
    ssl_certificate_key /etc/ssl/private/turboquant.key;
    ssl_protocols       TLSv1.2 TLSv1.3;

    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Authorization $http_authorization;
        proxy_set_header Host $host;
        proxy_buffering off;   # avoid buffering SSE streaming responses
    }
}
```

Container security

The Docker image (docker/Dockerfile) runs as root by default. For hardened deployments, add a non-root user:

```dockerfile
RUN useradd -m -u 1000 turboquant
USER turboquant
```

Additional hardening recommendations:

```yaml
# docker-compose.yml additions
security_opt:
  - no-new-privileges:true
read_only: true
tmpfs:
  - /tmp
cap_drop:
  - ALL
cap_add:
  - SYS_PTRACE   # required by PyTorch CUDA profiling; remove if not profiling
```

Model path injection

The MODEL_NAME parameter is passed directly to AutoModelForCausalLM.from_pretrained(). This accepts both HuggingFace Hub IDs and local filesystem paths. In a multi-tenant environment, this could allow loading arbitrary model files. Mitigate by:

  • Setting MODEL_NAME only at container build/start time via environment variable, not accepting it from API requests
  • Running the container with a read-only filesystem (read_only: true) except for the /data volume
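If MODEL_NAME ever does come from an untrusted source, an allowlist guard is one way to constrain it. A minimal sketch, assuming an allowlist of Hub IDs (the model IDs and function name are placeholders for illustration; the server does not ship such a guard):

```python
# Hypothetical allowlist of HuggingFace Hub IDs; adjust to your deployment.
ALLOWED_MODELS = {
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
}


def validate_model_name(name: str) -> str:
    """Reject local filesystem paths and anything outside the allowlist."""
    if name.startswith(("/", ".")) or ".." in name:
        raise ValueError("local filesystem paths are not permitted")
    if name not in ALLOWED_MODELS:
        raise ValueError(f"model {name!r} is not in the allowlist")
    return name
```

The path check runs first as defence in depth: even a misconfigured allowlist entry cannot be used to reach the container filesystem.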

Codebook path

The CODEBOOK_PATH environment variable points to a writable file. The server performs an atomic write (tmp + os.replace) when recomputing codebooks. Ensure the /data volume is not world-writable and is mounted from a trusted storage backend.
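The tmp + os.replace pattern the server uses can be sketched as follows (the function name and fsync details are illustrative, not the exact implementation):

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to path atomically.

    The temp file is created in the same directory so os.replace stays
    on one filesystem (a requirement for atomic rename on POSIX);
    readers therefore never observe a partially written codebook.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".codebook-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # ensure data hits disk before the rename
        os.replace(tmp, path)      # atomic swap into place
    except BaseException:
        os.unlink(tmp)             # clean up the temp file on any failure
        raise
```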

Prompt content

The server does not filter, sanitise, or validate the content of messages or prompt fields. This is intentional — it is an inference server, not an application gateway. If you are building a product on top of this server, implement prompt filtering and content moderation at the application layer before requests reach this API.


Known limitations

Single-worker, single-GPU deployment

The server uses a single generation lock (threading.Lock). Concurrent requests queue behind each other. This means a long streaming request (e.g. 100k token context) blocks all other clients. There is no per-client timeout beyond the HTTP request timeout enforced by the reverse proxy.

Mitigation: Set a request timeout at the reverse proxy level (e.g. proxy_read_timeout 120s in nginx) and consider horizontal scaling for production multi-tenant deployments.
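A fail-fast variant of the single generation lock can be sketched as follows. This is illustrative only, not the server's behaviour: the server simply queues requests behind the lock, whereas acquiring with a timeout lets a handler return HTTP 503 instead of blocking indefinitely.

```python
import threading

generation_lock = threading.Lock()


def try_generate(run_generation, timeout_s: float = 0.5):
    """Run generation if the lock is free within timeout_s.

    Returns None when the model is busy; a request handler would map
    that to HTTP 503 so clients can retry instead of queueing.
    """
    if not generation_lock.acquire(timeout=timeout_s):
        return None
    try:
        return run_generation()
    finally:
        generation_lock.release()
```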

/metrics endpoint is unauthenticated

The Prometheus metrics endpoint exposes model name, request counts, and GPU memory usage. This does not include request content, but it may leak deployment details.

Mitigation: Restrict /metrics to the internal monitoring network at the reverse proxy or firewall level.

No rate limiting

The server does not implement rate limiting. A malicious or misconfigured client can exhaust GPU memory by submitting many large-context requests.

Mitigation: Implement rate limiting at the reverse proxy (e.g. nginx limit_req) or an API gateway layer.
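To illustrate what a gateway-level limiter does, here is a minimal token-bucket sketch; it is not part of this server and a production deployment would use nginx `limit_req` or an API gateway instead:

```python
import time


class TokenBucket:
    """Minimal per-client token bucket.

    Each request consumes one token; tokens refill at `rate` per second
    up to `burst`, so short bursts pass while sustained floods are cut.
    """

    def __init__(self, rate: float, burst: int):
        self.rate = rate              # tokens added per second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```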


Dependency security

Dependencies are declared in requirements.txt. We use Dependabot (.github/dependabot.yml) to receive automated PRs for security updates on a weekly schedule.

To audit dependencies locally:

```bash
pip install pip-audit
pip-audit -r requirements.txt
```

Critical dependencies and their security posture:

| Package | Version constraint | Notes |
|---------|--------------------|-------|
| torch | >=2.3.0 | Pin to a specific version in production. Avoid `latest`. |
| transformers | >=4.43.0 | The Cache API changed significantly in v5.x. We tested with v5.3.0. |
| fastapi | >=0.111.0 | Keep updated; FastAPI releases address Pydantic validation edge cases. |
| uvicorn | >=0.30.0 | Use `uvicorn[standard]` in production (includes httptools and uvloop). |

CI/CD security

The GitHub Actions workflows follow least-privilege principles:

  • ci.yml — reads contents, writes packages (Docker push only on branch pushes, not PRs)
  • release.yml — reads contents, writes packages and contents (release creation only on version tags)
  • No workflow accepts arbitrary input from pull request bodies or issue comments
  • GITHUB_TOKEN is used for all GitHub API operations; no long-lived personal access tokens are stored as secrets
  • HF_TOKEN is stored as a repository secret and only injected into the self-hosted GPU runner workflow

Changelog

Security-relevant fixes are marked in CHANGELOG.md with their severity. Notable security-adjacent fixes in past releases:

| Version | Fix |
|---------|-----|
| 1.2.0 | Calibration hooks removed via try/finally; OOM during calibration no longer leaves model hooks active |
| 1.2.0 | Codebook file written atomically; prevents a race condition between simultaneous process starts |
| 1.1.0 | SSE error format changed to OpenAI-compatible JSON; prevents raw exception details leaking into the stream |
| 1.1.0 | Generation lock added to generate_stream; prevents concurrent model access and OOM |
| 1.0.0 | Initial release |

This policy was last updated: 2026-03-25
Maintained by Eviox Tech — contact@eviox.tech
