Note: this repo is a showcase of the LUMINAL setup. It's not a turnkey deployment and isn't meant to be cloned and run as-is.
LUMINAL is a self-hosted AI stack. It runs on a Proxmox VM with an NVIDIA GPU and stitches together workflow automation, local LLM inference, RAG, and smart-home control in one Docker Compose project.
The goal: run a real AI product entirely on self-hosted hardware. Models, auth, data, everything. No hosted inference APIs, no commercial accounts, nothing leaving the network.
Midnight is a custom assistant written on top of OpenWebUI that talks to the HELIOS media library through 7 Python tools (Plex, Radarr, Sonarr, Tautulli, Bazarr, SABnzbd, Overseerr). It answers questions by calling live APIs instead of making stuff up.
Everything runs as Docker containers on a Proxmox VM. The stack breaks down like this:
- Auth: Cloudflare Access sits in front. Google OAuth via trusted headers. No local passwords.
- Interface: OpenWebUI is the frontend. It hosts Midnight, does RAG against Qdrant, and sends LLM calls to Ollama.
- Inference & data: Ollama runs three local LLMs with GPU passthrough. Qdrant holds the RAG vectors. n8n handles visual workflow automation.
- Physical world: Home Assistant plus Matter Server, both on host networking so mDNS device discovery works.
Midnight talks to a separate media stack (HELIOS) over HTTP APIs. LUMINAL is the brain, HELIOS is the library.
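In compose terms, the shape is roughly this. A sketch only: service names, image tags, and wiring here are illustrative, not copied from the repo's actual file.

```yaml
services:
  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    depends_on: [ollama, qdrant]   # inference backend + RAG vector store

  ollama:
    image: ollama/ollama:latest

  qdrant:
    image: qdrant/qdrant:latest

  n8n:
    image: n8nio/n8n:latest

  homeassistant:
    image: ghcr.io/home-assistant/home-assistant:stable
    network_mode: host             # mDNS device discovery needs the host network

  matter-server:
    image: ghcr.io/home-assistant-libs/python-matter-server:stable
    network_mode: host
```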
| Category | Service | Role in the system |
|---|---|---|
| AI Services | n8n | Visual workflow engine. Glue for cross-service automation. |
| | OpenWebUI | Chat interface and tool runtime. Hosts Midnight, handles RAG. |
| | Ollama | Local LLM inference with GPU acceleration. |
| AI Infrastructure | Qdrant | Vector DB for OpenWebUI's RAG. |
| | Docker | Containers, named volumes, GPU passthrough. |
| Home Automation | Home Assistant | Device control hub. Runs on host network for discovery. |
| | Matter Server | Matter protocol bridge for HA. |
| Security | Cloudflare Access | Zero Trust SSO. Google OAuth in front of OpenWebUI. |
Three models get pulled on first boot and cached on disk. Each does something different:
- `llama3.1:8b` (4.9 GB): fast general-purpose model. Good for quick chat and simple tool calls.
- `gemma4:e4b` (9.6 GB): multimodal with native tool use. This is what Midnight runs on.
- `gpt-oss:20b` (~20 GB): heavier reasoning when capability matters more than latency.
All three share one Ollama instance and one GPU.
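The first-boot pull can be expressed as a one-shot compose service that talks to the running Ollama server and exits once the cache is warm. A sketch, not the repo's actual file: the service names and the use of the CLI's `OLLAMA_HOST` variable are assumptions.

```yaml
services:
  ollama-pull:
    image: ollama/ollama:latest
    environment:
      - OLLAMA_HOST=http://ollama:11434   # point the CLI at the long-running server
    # pull all three models through the server, which owns the model cache
    entrypoint: ["/bin/sh", "-c", "ollama pull llama3.1:8b && ollama pull gemma4:e4b && ollama pull gpt-oss:20b"]
    depends_on: [ollama]
    restart: "no"                         # one-shot: a clean exit is the happy path
```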
Midnight is a custom AI assistant built on top of OpenWebUI that queries the HELIOS media library. It uses function tools for everything. No question gets answered from what the model "knows" if a tool could answer from live data.
| Component | Description |
|---|---|
| Base Model | `gemma4:e4b` via Ollama |
| Interface | OpenWebUI with a custom system prompt |
| Tools | 7 Python function tools |
| Knowledge | RAG-indexed reference docs |
| Tool | What it does |
|---|---|
| `midnight_plex_tool` | Library search, recently added, episode details, cast, actor/director lookup |
| `midnight_radarr_tool` | Movie details, genres, synopses |
| `midnight_sonarr_tool` | TV show details, upcoming episodes |
| `midnight_tautulli_tool` | Watch history, current activity, most watched |
| `midnight_bazarr_tool` | Subtitle status and history |
| `midnight_sabnzbd_tool` | Download queue and history |
| `midnight_overseerr_tool` | Content requests and search |
- Always calls a tool. Never answers a library question from model knowledge.
- Normalizes curly quotes and special characters before sending to APIs.
- Pulls real episode synopses from Plex instead of guessing plot summaries.
- Returns the actual Plex "added on" date, not the file's download timestamp.
- Says "I don't see that in the library" when something isn't there, instead of inventing a plausible answer.
"What movies do we have with Tom Hanks?"
"What's new in the library?"
"What's the Bob's Burgers episode 'It's a Stunterful Life' about?"
"Show me Christmas movies"
"What's currently downloading?"
"Who's watching right now?"
See `midnight/README.md` for the full system prompt and tool docs.
Why things are set up the way they are.
OpenWebUI doesn't run its own login here. Cloudflare Access sits in front, redirects to Google, and passes the authenticated email via a trusted header (`Cf-Access-Authenticated-User-Email`). OpenWebUI auto-creates the user from that header. No local passwords to manage, and access policy lives in one place instead of scattered across services.
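On the OpenWebUI side this maps to its documented trusted-header setting; roughly (the env var name is from OpenWebUI's docs, the rest is illustrative):

```yaml
services:
  openwebui:
    environment:
      # Trust the email header injected by Cloudflare Access; OpenWebUI
      # creates/maps the user from it instead of showing its own login.
      - WEBUI_AUTH_TRUSTED_EMAIL_HEADER=Cf-Access-Authenticated-User-Email
```

The standing caveat with trusted-header auth: the container must only be reachable through Cloudflare, because anything that can hit the port directly could forge the header.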
Credentials (n8n encryption key, JWT secret, OpenWebUI session key) are mounted as files via Docker Secrets. They don't show up in env, process dumps, or compose logs. The plaintext files live in a locked-down system directory outside the repo.
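In compose that looks roughly like this, assuming n8n's documented `_FILE` convention and an illustrative secret filename:

```yaml
secrets:
  n8n_encryption_key:
    file: /etc/LUMINAL/secrets/n8n_encryption_key   # plaintext lives outside the repo

services:
  n8n:
    secrets: [n8n_encryption_key]
    environment:
      # _FILE variant: n8n reads the value from the mounted file,
      # so the key itself never appears in the environment.
      - N8N_ENCRYPTION_KEY_FILE=/run/secrets/n8n_encryption_key
```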
The real `env.sh` and `secrets/` directory live at `/etc/LUMINAL/`, symlinked into the project and gitignored. direnv picks them up on `cd` into the project, so interactive shells, cron jobs, and Docker Compose all see the same values without explicit sourcing. The pattern came after almost committing secrets one too many times.
Every piece of persistent state (n8n workflows, Ollama model cache, Qdrant indices, chat history, HA config) lives in an external Docker named volume. Containers get torn down and recreated without losing anything. Upgrades stop feeling risky.
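A sketch of the pattern with illustrative volume names (`/root/.ollama` is where the stock Ollama image keeps its model cache):

```yaml
volumes:
  ollama_models:
    external: true   # created once out-of-band; survives `docker compose down`
  qdrant_data:
    external: true

services:
  ollama:
    volumes:
      - ollama_models:/root/.ollama   # model cache persists across container rebuilds
  qdrant:
    volumes:
      - qdrant_data:/qdrant/storage
```

`external: true` means compose never creates or deletes these volumes itself, which is exactly what makes teardown-and-recreate safe.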
`scripts/docker-rebuild.sh` pulls new images first, then runs `docker compose up -d` so only the services whose images actually changed get recreated. Everything else keeps running. It also runs a health check that skips the one-shot Ollama pullers (they're supposed to exit), retries transient failures, and returns 0/1/2 exit codes so cron can alert properly. `--dry-run` shows what would change without touching anything.
Midnight's system prompt assumes the model will hallucinate if allowed to. Every question has to go through a tool call. The prompt explicitly bans answering from model knowledge when a tool could answer instead. It normalizes curly quotes in input and uses RAG against `MIDNIGHT_REFERENCE.md` to pick the right tool. Trade-off: Midnight is occasionally too strict and refuses things it could reasonably answer. Better than made-up movie titles.
Ollama and OpenWebUI both reserve an NVIDIA GPU in the compose file. Inference runs at hardware speed. No API costs, no rate limits, nothing leaving the box.
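The standard Compose syntax for that reservation (assumes the NVIDIA Container Toolkit is installed on the VM; the count is illustrative):

```yaml
services:
  ollama:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1             # or `all` to hand over every visible GPU
              capabilities: [gpu]
```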
Version history and evolution in `CHANGELOG.md`.