| Version | Supported |
|---|---|
| 1.2.x | ✅ Active |
| 1.1.x | ⚠️ No security patches |
| 1.0.x | ❌ End of life |
Only the latest minor release receives security patches. We strongly recommend running v1.2.x in production.
Please do not open a public GitHub issue for security vulnerabilities.
Report privately via one of these channels:
| Channel | Address |
|---|---|
| Email | contact@eviox.tech |
| GitHub private advisory | New advisory |
Include in your report:
- A description of the vulnerability and its potential impact
- Affected component (`turboquant_corrected.py`, `app/server.py`, `app/engine.py`, Docker image, etc.)
- Steps to reproduce (proof of concept, curl command, or code snippet)
- Suggested fix if you have one
You will receive an acknowledgement within 48 hours and a full response (fix timeline or decision) within 7 days.
We follow coordinated disclosure. We ask that you give us 90 days to release a patch before publishing details publicly. We will credit you in the changelog and release notes unless you prefer to remain anonymous.
**In scope:**

- Authentication bypass on the `/v1/` API endpoints
- Server-side request forgery (SSRF) via model name or path parameters
- Arbitrary file read/write through the codebook path (`CODEBOOK_PATH` env var)
- Remote code execution via prompt injection into generation parameters
- Denial-of-service via malformed requests that bypass Pydantic validation
- Container escape or privilege escalation via the Docker image
- Secrets leakage (`HF_TOKEN`, `API_KEY`) in logs, error responses, or metrics
- Dependency vulnerabilities in `requirements.txt` with a known CVE and a working exploit
**Out of scope:**

- Vulnerabilities in the base model weights (Llama, Mistral, etc.) — report to Meta / Mistral AI
- Vulnerabilities in upstream dependencies without a working exploit against this server specifically
- Rate limiting / abuse without authentication bypass
- Social engineering attacks
- Physical access attacks
- Issues only reproducible on end-of-life versions
The server supports optional bearer token authentication via the API_KEY
environment variable.
```bash
# Enable auth — set a strong random token
export API_KEY=$(openssl rand -hex 32)
```

When `API_KEY` is set, all `/v1/` endpoints require:

```
Authorization: Bearer <token>
```
Requests without a valid token receive HTTP 401. The /health and /ready
endpoints are intentionally unauthenticated for load-balancer probes. The
/metrics endpoint should be network-restricted separately (see below).
Default deployment has no authentication (API_KEY is empty). You must
set API_KEY before exposing the server to any network.
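The bearer-token check can be sketched as follows. This is a minimal illustration only, not the server's actual implementation: the helper name `is_authorized` and the use of `hmac.compare_digest` are assumptions, though constant-time comparison is the standard way to avoid leaking token prefixes via timing.

```python
import hmac
import os

def is_authorized(auth_header: str) -> bool:
    """Validate an Authorization header against the API_KEY env var.

    Illustrative sketch: hmac.compare_digest performs a constant-time
    comparison so the check does not leak token contents via timing.
    """
    api_key = os.environ.get("API_KEY", "")
    if not api_key:          # auth disabled when API_KEY is unset
        return True
    prefix = "Bearer "
    if not auth_header.startswith(prefix):
        return False
    return hmac.compare_digest(auth_header[len(prefix):], api_key)

os.environ["API_KEY"] = "s3cret"
print(is_authorized("Bearer s3cret"))   # True
print(is_authorized("Bearer wrong"))    # False
```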
| Secret | How it is used | How to protect |
|---|---|---|
| `HF_TOKEN` | Passed to the HuggingFace Hub client to download model weights at startup. Not stored. | Use a read-only HF token scoped to the specific model. Rotate regularly. |
| `API_KEY` | Compared in-memory for request authentication. Never logged. | Generate with `openssl rand -hex 32`. Store in Docker secrets or a secrets manager — never in `docker-compose.yml` directly. |
| `GRAFANA_PASSWORD` | Grafana admin password. | Change from the default `admin` before any network exposure. |
Never commit secrets to the repository. The .gitignore excludes .env
files. Use environment variable injection at runtime:
```bash
# docker-compose — use an .env file (not committed)
echo "HF_TOKEN=hf_..." > .env
echo "API_KEY=$(openssl rand -hex 32)" >> .env
docker compose up -d
```

By default all three services bind to `0.0.0.0`:
| Port | Service | Recommended production access |
|---|---|---|
| 8000 | TurboQuant API | Behind a reverse proxy (nginx / Caddy) with TLS. Restrict to trusted clients. |
| 9090 | Prometheus | Internal network only. Block from public internet. |
| 3000 | Grafana | Internal network only or behind SSO proxy. Change default password. |
Use a firewall or Docker network to isolate Prometheus and Grafana from public
access. Example using ufw:
```bash
ufw allow from 10.0.0.0/8 to any port 9090
ufw allow from 10.0.0.0/8 to any port 3000
ufw deny 9090
ufw deny 3000
```

The server does not handle TLS termination directly. Place it behind a reverse proxy that enforces HTTPS:
```nginx
# nginx example
server {
    listen 443 ssl;
    ssl_certificate     /etc/ssl/certs/turboquant.crt;
    ssl_certificate_key /etc/ssl/private/turboquant.key;
    ssl_protocols TLSv1.2 TLSv1.3;

    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Authorization $http_authorization;
    }
}
```

The Docker image (`docker/Dockerfile`) runs as root by default. For hardened
deployments, add a non-root user:
```dockerfile
RUN useradd -m -u 1000 turboquant
USER turboquant
```

Additional hardening recommendations:
```yaml
# docker-compose.yml additions
security_opt:
  - no-new-privileges:true
read_only: true
tmpfs:
  - /tmp
cap_drop:
  - ALL
cap_add:
  - SYS_PTRACE  # required by PyTorch CUDA profiling; remove if not profiling
```

The `MODEL_NAME` parameter is passed directly to
`AutoModelForCausalLM.from_pretrained()`. This accepts both HuggingFace Hub IDs
and local filesystem paths. In a multi-tenant environment, this could allow
loading arbitrary model files. Mitigate by:
- Setting `MODEL_NAME` only at container build/start time via environment variable, not accepting it from API requests
- Running the container with a read-only filesystem (`read_only: true`) except for the `/data` volume
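The first mitigation can be enforced with a startup allowlist check; a sketch below, where the allowlist contents, model IDs, and function name are purely illustrative — populate `ALLOWED_MODELS` with whatever you actually deploy:

```python
import os

# Hypothetical allowlist; replace with the model IDs you actually deploy.
ALLOWED_MODELS = {
    "meta-llama/Llama-3.1-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
}

def resolve_model_name() -> str:
    """Read MODEL_NAME once at startup and refuse unknown values.

    Because the value never comes from an API request, a client cannot
    point from_pretrained() at an arbitrary Hub repo or local path.
    """
    name = os.environ.get("MODEL_NAME", "")
    if name not in ALLOWED_MODELS:
        raise RuntimeError(f"MODEL_NAME {name!r} is not in the allowlist")
    return name

os.environ["MODEL_NAME"] = "mistralai/Mistral-7B-Instruct-v0.3"
print(resolve_model_name())
```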
The CODEBOOK_PATH environment variable points to a writable file. The server
performs an atomic write (tmp + os.replace) when recomputing codebooks. Ensure
the /data volume is not world-writable and is mounted from a trusted storage
backend.
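The atomic-write pattern described above (write to a temp file, then `os.replace`) can be sketched like this; the function name is illustrative and the real implementation lives in the server code:

```python
import os
import pathlib
import tempfile

def write_codebook_atomic(path: str, data: bytes) -> None:
    """Write data so readers never observe a partially written file.

    The payload goes to a temporary file in the same directory, is
    flushed and fsync'd, then os.replace() swaps it into place — an
    atomic rename on POSIX filesystems.
    """
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname, prefix=".codebook-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic swap into the final name
    except BaseException:
        os.unlink(tmp_path)
        raise

target = pathlib.Path(tempfile.mkdtemp()) / "codebook.bin"
write_codebook_atomic(str(target), b"\x00\x01")
print(target.read_bytes())  # b'\x00\x01'
```

Writing the temp file into the same directory matters: `os.replace` is only atomic when source and destination are on the same filesystem.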
The server does not filter, sanitise, or validate the content of messages or
prompt fields. This is intentional — it is an inference server, not an
application gateway. If you are building a product on top of this server,
implement prompt filtering and content moderation at the application layer
before requests reach this API.
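If you do add a gateway-side filter, even a minimal pre-check can reject obviously disallowed input before it reaches the inference API. A toy sketch follows — the blocklist and function name are purely illustrative, and real moderation needs a proper classifier or moderation API rather than substring matching:

```python
# Toy examples only; substring matching is trivially bypassable.
BLOCKLIST = {"drop table", "ignore previous instructions"}

def passes_prefilter(prompt: str) -> bool:
    """Return False when the prompt contains a blocklisted phrase.

    A real application gateway would combine a check like this with a
    dedicated moderation model before forwarding to the inference API.
    """
    lowered = prompt.lower()
    return not any(phrase in lowered for phrase in BLOCKLIST)

print(passes_prefilter("Summarise this report"))                # True
print(passes_prefilter("Please IGNORE previous instructions"))  # False
```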
The server uses a single generation lock (threading.Lock). Concurrent requests
queue behind each other. This means a long streaming request (e.g. 100k token
context) blocks all other clients. There is no per-client timeout beyond the
HTTP request timeout enforced by the reverse proxy.
Mitigation: Set a request timeout at the reverse proxy level (e.g.
proxy_read_timeout 120s in nginx) and consider horizontal scaling for
production multi-tenant deployments.
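The serialisation behaviour can be demonstrated with a small sketch of the lock pattern (the names are assumed; the server's lock is the plain `threading.Lock` described above). Two simulated requests never interleave — the second client waits for the first to finish:

```python
import threading
import time

generation_lock = threading.Lock()

def generate(client: str, duration: float, log: list) -> None:
    """Simulate one generation request holding the global lock."""
    with generation_lock:
        log.append(f"{client} start")
        time.sleep(duration)  # stand-in for token generation
        log.append(f"{client} end")

log: list = []
t1 = threading.Thread(target=generate, args=("A", 0.05, log))
t2 = threading.Thread(target=generate, args=("B", 0.05, log))
t1.start(); t2.start()
t1.join(); t2.join()
print(log)  # each client's "start" is immediately followed by its own "end"
```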
The Prometheus metrics endpoint exposes model name, request counts, and GPU memory usage. This does not include request content, but it may leak deployment details.
Mitigation: Restrict /metrics to the internal monitoring network at the
reverse proxy or firewall level.
The server does not implement rate limiting. A malicious or misconfigured client can exhaust GPU memory by submitting many large-context requests.
Mitigation: Implement rate limiting at the reverse proxy (e.g. nginx
limit_req) or an API gateway layer.
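A minimal `limit_req` configuration might look like the sketch below; the zone name, rate, and burst values are illustrative and should be tuned to your traffic:

```nginx
# Shared zone keyed by client IP: 10 requests/second, 10 MB of state
limit_req_zone $binary_remote_addr zone=turboquant_api:10m rate=10r/s;

server {
    location /v1/ {
        # Allow short bursts of 20 requests, reject the rest with 429
        limit_req zone=turboquant_api burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://localhost:8000;
    }
}
```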
Dependencies are declared in requirements.txt. We use Dependabot
(.github/dependabot.yml) to receive automated PRs for security updates on a
weekly schedule.
To audit dependencies locally:
```bash
pip install pip-audit
pip-audit -r requirements.txt
```

Critical dependencies and their security posture:
| Package | Version constraint | Notes |
|---|---|---|
| `torch` | `>=2.3.0` | Pin to a specific version in production. Avoid `latest`. |
| `transformers` | `>=4.43.0` | The Cache API changed significantly in v5.x. We tested with v5.3.0. |
| `fastapi` | `>=0.111.0` | Keep updated — FastAPI releases address Pydantic validation edge cases. |
| `uvicorn` | `>=0.30.0` | Use `uvicorn[standard]` for production (includes httptools and uvloop). |
The GitHub Actions workflows follow least-privilege principles:
- `ci.yml` — reads `contents`, writes `packages` (Docker push only on branch pushes, not PRs)
- `release.yml` — reads `contents`, writes `packages` and `contents` (release creation only on version tags)
- No workflow accepts arbitrary input from pull request bodies or issue comments
- `GITHUB_TOKEN` is used for all GitHub API operations; no long-lived personal access tokens are stored as secrets
- `HF_TOKEN` is stored as a repository secret and only injected into the self-hosted GPU runner workflow
Security-relevant fixes are marked in CHANGELOG.md with their severity.
Notable security-adjacent fixes in past releases:
| Version | Fix |
|---|---|
| 1.2.0 | Calibration hooks removed via try/finally — OOM during calibration no longer leaves model hooks active |
| 1.2.0 | Codebook file written atomically — prevents race condition between simultaneous process starts |
| 1.1.0 | SSE error format changed to OpenAI-compatible JSON — prevents raw exception details leaking in stream |
| 1.1.0 | Generation lock added to generate_stream — prevents concurrent model access and OOM |
| 1.0.0 | Initial release |
This policy was last updated: 2026-03-25
Maintained by Eviox Tech — contact@eviox.tech