Security: tushu1232/turboquant-server

Security Policy

Supported versions

| Version | Supported |
|---------|-----------|
| 1.2.x   | ✅ Active |
| 1.1.x   | ⚠️ Critical fixes only |
| 1.0.x   | ❌ End of life |

Only the latest minor release receives security patches. We strongly recommend running v1.2.x in production.


Reporting a vulnerability

Please do not open a public GitHub issue for security vulnerabilities.

Report privately via one of these channels:

| Channel | Address |
|---------|---------|
| Email | contact@eviox.tech |
| GitHub private advisory | New advisory |

Include in your report:

  • A description of the vulnerability and its potential impact
  • Affected component (turboquant_corrected.py, app/server.py, app/engine.py, Docker image, etc.)
  • Steps to reproduce (proof of concept, curl command, or code snippet)
  • Suggested fix if you have one

You will receive an acknowledgement within 48 hours and a full response (fix timeline or decision) within 7 days.

We follow coordinated disclosure. We ask that you give us 90 days to release a patch before publishing details publicly. We will credit you in the changelog and release notes unless you prefer to remain anonymous.


Scope

In scope

  • Authentication bypass on the /v1/ API endpoints
  • Server-side request forgery (SSRF) via model name or path parameters
  • Arbitrary file read/write through the codebook path (CODEBOOK_PATH env var)
  • Remote code execution via prompt injection into generation parameters
  • Denial-of-service via malformed requests that bypass Pydantic validation
  • Container escape or privilege escalation via the Docker image
  • Secrets leakage (HF_TOKEN, API_KEY) in logs, error responses, or metrics
  • Dependency vulnerabilities in requirements.txt with a known CVE and a working exploit

Out of scope

  • Vulnerabilities in the base model weights (Llama, Mistral, etc.) — report to Meta / Mistral AI
  • Vulnerabilities in upstream dependencies without a working exploit against this server specifically
  • Rate limiting / abuse without authentication bypass
  • Social engineering attacks
  • Physical access attacks
  • Issues only reproducible on end-of-life versions

Security architecture

Authentication

The server supports optional bearer token authentication via the API_KEY environment variable.

```bash
# Enable auth: set a strong random token
export API_KEY=$(openssl rand -hex 32)
```

When API_KEY is set, all /v1/ endpoints require:

```http
Authorization: Bearer <token>
```

Requests without a valid token receive HTTP 401. The /health and /ready endpoints are intentionally unauthenticated for load-balancer probes. The /metrics endpoint should be network-restricted separately (see below).

The default deployment has no authentication (API_KEY is empty). You must set API_KEY before exposing the server to any network.
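The token comparison should be constant-time to avoid leaking key material through timing differences. A minimal sketch of such a check, assuming a helper like the following (the function name is hypothetical; app/server.py may implement this differently):

```python
import hmac
from typing import Optional


def is_authorized(auth_header: Optional[str], api_key: str) -> bool:
    """Validate an Authorization header against the configured API key.

    Uses hmac.compare_digest for a constant-time comparison, so the
    response time does not reveal how many leading characters matched.
    """
    if not api_key:  # auth disabled when API_KEY is empty
        return True
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    token = auth_header[len("Bearer "):]
    return hmac.compare_digest(token, api_key)
```

A plain `token == api_key` comparison would short-circuit on the first mismatching byte, which is the timing side channel `compare_digest` avoids.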

Secrets handling

| Secret | How it is used | How to protect |
|--------|----------------|----------------|
| HF_TOKEN | Passed to the HuggingFace Hub client to download model weights at startup. Not stored. | Use a read-only HF token scoped to the specific model. Rotate regularly. |
| API_KEY | Compared in-memory for request authentication. Never logged. | Generate with `openssl rand -hex 32`. Store in Docker secrets or a secrets manager, never in docker-compose.yml directly. |
| GRAFANA_PASSWORD | Grafana admin password. | Change from the default `admin` before any network exposure. |

Never commit secrets to the repository. The .gitignore excludes .env files. Use environment variable injection at runtime:

```bash
# docker-compose: use an .env file (not committed)
echo "HF_TOKEN=hf_..." > .env
echo "API_KEY=$(openssl rand -hex 32)" >> .env
docker compose up -d
```

Network exposure

By default all three services bind to 0.0.0.0:

| Port | Service | Recommended production access |
|------|---------|-------------------------------|
| 8000 | TurboQuant API | Behind a reverse proxy (nginx / Caddy) with TLS. Restrict to trusted clients. |
| 9090 | Prometheus | Internal network only. Block from the public internet. |
| 3000 | Grafana | Internal network only or behind an SSO proxy. Change the default password. |

Use a firewall or Docker network to isolate Prometheus and Grafana from public access. Example using ufw:

```bash
ufw allow from 10.0.0.0/8 to any port 9090
ufw allow from 10.0.0.0/8 to any port 3000
ufw deny 9090
ufw deny 3000
```

TLS

The server does not handle TLS termination directly. Place it behind a reverse proxy that enforces HTTPS:

```nginx
# nginx example
server {
    listen 443 ssl;
    ssl_certificate     /etc/ssl/certs/turboquant.crt;
    ssl_certificate_key /etc/ssl/private/turboquant.key;
    ssl_protocols       TLSv1.2 TLSv1.3;

    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Authorization $http_authorization;
        proxy_set_header Host $host;
        proxy_buffering off;   # avoid buffering SSE streaming responses
    }
}
```

Container security

The Docker image (docker/Dockerfile) runs as root by default. For hardened deployments, add a non-root user:

```dockerfile
RUN useradd -m -u 1000 turboquant
USER turboquant
```

Additional hardening recommendations:

```yaml
# docker-compose.yml additions
security_opt:
  - no-new-privileges:true
read_only: true
tmpfs:
  - /tmp
cap_drop:
  - ALL
cap_add:
  - SYS_PTRACE   # required by PyTorch CUDA profiling; remove if not profiling
```

Model path injection

The MODEL_NAME parameter is passed directly to AutoModelForCausalLM.from_pretrained(). This accepts both HuggingFace Hub IDs and local filesystem paths. In a multi-tenant environment, this could allow loading arbitrary model files. Mitigate by:

  • Setting MODEL_NAME only at container build/start time via environment variable, not accepting it from API requests
  • Running the container with a read-only filesystem (read_only: true) except for the /data volume
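If MODEL_NAME ever does come from an untrusted source, an allowlist guard is one way to constrain it. A minimal sketch, assuming an allowlist of Hub IDs (the model IDs and function name are placeholders for illustration; the server does not ship such a guard):

```python
# Hypothetical allowlist of HuggingFace Hub IDs; adjust to your deployment.
ALLOWED_MODELS = {
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
}


def validate_model_name(name: str) -> str:
    """Reject local filesystem paths and anything outside the allowlist."""
    if name.startswith(("/", ".")) or ".." in name:
        raise ValueError("local filesystem paths are not permitted")
    if name not in ALLOWED_MODELS:
        raise ValueError(f"model {name!r} is not in the allowlist")
    return name
```

The path check runs first as defence in depth: even a misconfigured allowlist entry cannot be used to reach the container filesystem.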

Codebook path

The CODEBOOK_PATH environment variable points to a writable file. The server performs an atomic write (tmp + os.replace) when recomputing codebooks. Ensure the /data volume is not world-writable and is mounted from a trusted storage backend.
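The tmp + os.replace pattern the server uses can be sketched as follows (the function name and fsync details are illustrative, not the exact implementation):

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to path atomically.

    The temp file is created in the same directory so os.replace stays
    on one filesystem (a requirement for atomic rename on POSIX);
    readers therefore never observe a partially written codebook.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".codebook-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # ensure data hits disk before the rename
        os.replace(tmp, path)      # atomic swap into place
    except BaseException:
        os.unlink(tmp)             # clean up the temp file on any failure
        raise
```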

Prompt content

The server does not filter, sanitise, or validate the content of messages or prompt fields. This is intentional — it is an inference server, not an application gateway. If you are building a product on top of this server, implement prompt filtering and content moderation at the application layer before requests reach this API.


Known limitations

Single-worker, single-GPU deployment

The server uses a single generation lock (threading.Lock). Concurrent requests queue behind each other. This means a long streaming request (e.g. 100k token context) blocks all other clients. There is no per-client timeout beyond the HTTP request timeout enforced by the reverse proxy.

Mitigation: Set a request timeout at the reverse proxy level (e.g. proxy_read_timeout 120s in nginx) and consider horizontal scaling for production multi-tenant deployments.
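A fail-fast variant of the single generation lock can be sketched as follows. This is illustrative only, not the server's behaviour: the server simply queues requests behind the lock, whereas acquiring with a timeout lets a handler return HTTP 503 instead of blocking indefinitely.

```python
import threading

generation_lock = threading.Lock()


def try_generate(run_generation, timeout_s: float = 0.5):
    """Run generation if the lock is free within timeout_s.

    Returns None when the model is busy; a request handler would map
    that to HTTP 503 so clients can retry instead of queueing.
    """
    if not generation_lock.acquire(timeout=timeout_s):
        return None
    try:
        return run_generation()
    finally:
        generation_lock.release()
```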

/metrics endpoint is unauthenticated

The Prometheus metrics endpoint exposes model name, request counts, and GPU memory usage. This does not include request content, but it may leak deployment details.

Mitigation: Restrict /metrics to the internal monitoring network at the reverse proxy or firewall level.

No rate limiting

The server does not implement rate limiting. A malicious or misconfigured client can exhaust GPU memory by submitting many large-context requests.

Mitigation: Implement rate limiting at the reverse proxy (e.g. nginx limit_req) or an API gateway layer.
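To illustrate what a gateway-level limiter does, here is a minimal token-bucket sketch; it is not part of this server and a production deployment would use nginx `limit_req` or an API gateway instead:

```python
import time


class TokenBucket:
    """Minimal per-client token bucket.

    Each request consumes one token; tokens refill at `rate` per second
    up to `burst`, so short bursts pass while sustained floods are cut.
    """

    def __init__(self, rate: float, burst: int):
        self.rate = rate              # tokens added per second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```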


Dependency security

Dependencies are declared in requirements.txt. We use Dependabot (.github/dependabot.yml) to receive automated PRs for security updates on a weekly schedule.

To audit dependencies locally:

```bash
pip install pip-audit
pip-audit -r requirements.txt
```

Critical dependencies and their security posture:

| Package | Version constraint | Notes |
|---------|--------------------|-------|
| torch | >=2.3.0 | Pin to a specific version in production. Avoid `latest`. |
| transformers | >=4.43.0 | The Cache API changed significantly in v5.x. We tested with v5.3.0. |
| fastapi | >=0.111.0 | Keep updated; FastAPI releases address Pydantic validation edge cases. |
| uvicorn | >=0.30.0 | Use `uvicorn[standard]` in production (includes httptools and uvloop). |

CI/CD security

The GitHub Actions workflows follow least-privilege principles:

  • ci.yml — reads contents, writes packages (Docker push only on branch pushes, not PRs)
  • release.yml — reads contents, writes packages and contents (release creation only on version tags)
  • No workflow accepts arbitrary input from pull request bodies or issue comments
  • GITHUB_TOKEN is used for all GitHub API operations; no long-lived personal access tokens are stored as secrets
  • HF_TOKEN is stored as a repository secret and only injected into the self-hosted GPU runner workflow

Changelog

Security-relevant fixes are marked in CHANGELOG.md with their severity. Notable security-adjacent fixes in past releases:

| Version | Fix |
|---------|-----|
| 1.2.0 | Calibration hooks removed via try/finally; OOM during calibration no longer leaves model hooks active |
| 1.2.0 | Codebook file written atomically; prevents a race condition between simultaneous process starts |
| 1.1.0 | SSE error format changed to OpenAI-compatible JSON; prevents raw exception details leaking into the stream |
| 1.1.0 | Generation lock added to generate_stream; prevents concurrent model access and OOM |
| 1.0.0 | Initial release |

This policy was last updated: 2026-03-25
Maintained by Eviox Tech — contact@eviox.tech
