89 tok/s. Zero containers. 115GB GPU memory. Compiled from source.
```bash
curl -fsSL https://raw.githubusercontent.com/bong-water-water-bong/halo-ai/main/install.sh | bash
```

A complete AI platform for the AMD Ryzen AI MAX+ 395 — LLM inference, chat, deep research, voice, image generation, RAG, and workflows. All bare metal, all compiled from source, all on one chip with 128GB unified memory.
| Service | Port | Purpose |
|---|---|---|
| Lemonade | 8080 | Unified AI API (OpenAI/Ollama/Anthropic compatible) |
| llama.cpp | 8081 | LLM inference — Vulkan + HIP dual backends |
| Open WebUI | 3000 | Chat with RAG, documents, multi-model |
| Vane | 3001 | Deep research (Perplexica) |
| SearXNG | 8888 | Private search |
| Qdrant | 6333 | Vector DB for RAG |
| n8n | 5678 | Workflow automation |
| whisper.cpp | 8082 | Speech-to-text |
| Kokoro | 8083 | Text-to-speech |
| ComfyUI | 8188 | Image generation |
| Dashboard | 3003 | GPU metrics + service health (DreamServer fork) |
All services bind to 127.0.0.1. Access via SSH tunnel or Caddy reverse proxy.
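For example, you can forward the chat UI and the API to your workstation over SSH (the hostname `halo-box` and the user are placeholders — substitute your own SSH target):

```shell
# Forward Open WebUI (3000) and the Lemonade API (8080) to this machine.
# -N: no remote command, just the tunnel. Replace user@halo-box with your host.
ssh -N \
  -L 3000:127.0.0.1:3000 \
  -L 8080:127.0.0.1:8080 \
  user@halo-box
```

While the tunnel is up, http://localhost:3000 on your workstation reaches Open WebUI on the box.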
| Model | Speed | Size |
|---|---|---|
| Qwen3-30B-A3B (MoE) | 89 tok/s | 18 GB |
| Llama 3 70B | ~18 tok/s | 40 GB |
| Llama 3 70B Q8 | ~10 tok/s | 70 GB |
Full benchmarks: BENCHMARKS.md
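Once a model is active, anything that speaks the OpenAI API can query it through Lemonade on port 8080. A minimal sketch — the exact path and the model name are assumptions here; check `halo-models.sh list` and the Lemonade docs for the names on your install:

```shell
# Chat completion against the local OpenAI-compatible endpoint.
# Model name is illustrative; use one reported by halo-models.sh list.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-30B-A3B",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```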
```text
SSH / Caddy (only network entry point)
          │
┌─────────▼──────────┐
│   Lemonade :8080   │ ← OpenAI-compatible API
└──┬──────┬──────┬───┘
   ▼      ▼      ▼
llama.cpp  whisper  Kokoro    ← GPU inference (gfx1151, 115GB GTT)
          │
Open WebUI :3000              ← Chat UI
Vane       :3001              ← Deep research
n8n        :5678              ← Workflows
ComfyUI    :8188              ← Image gen
Qdrant :6333   SearXNG :8888  ← RAG + search
```
```bash
halo-models.sh list              # Browse models for Strix Halo
halo-models.sh download <name>   # Download a model
halo-models.sh activate <name>   # Switch models (auto-benchmark)
halo-driver-swap.sh vulkan       # Switch to Vulkan backend
halo-driver-swap.sh hip          # Switch to HIP/ROCm backend
halo-update.sh update            # Update everything (snapshot-protected)
halo-proxy-setup.sh              # Enable Caddy reverse proxy
halo-backup.sh                   # Backup service data
```

The halo-ai agent runs continuously:
- Monitors all services every 30s, auto-restarts failures
- Btrfs snapshot before every repair
- GPU health, thermals, memory monitoring
- Auto-applies system security patches
- Tracks inference performance trends
- Silent unless it cannot fix something
Local Copilot via Lemonade VS Code extension. Your code AI runs on your Strix Halo — private, offline, zero API costs.
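Most OpenAI-compatible tools can be pointed at the local stack the same way — a sketch using the environment variables the official OpenAI SDKs read (each tool may also have its own setting for this):

```shell
# Point OpenAI-compatible clients at the local Lemonade endpoint
# instead of the cloud. The key is unused by the local server but
# some clients refuse to start without one.
export OPENAI_BASE_URL="http://127.0.0.1:8080/v1"
export OPENAI_API_KEY="none"
```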
- SECURITY.md — Access model, SSH config, Caddy setup, firewall
- BENCHMARKS.md — Full performance data
- Credits
Before anything else, massive respect and gratitude to DreamServer and the Light-Heart-Labs team.
DreamServer is the reason halo-ai exists. Full stop.
They were the first to look at the Strix Halo chip and see not just an inference engine, but a complete AI platform. While everyone else was benchmarking single models, Light-Heart-Labs built an entire ecosystem — LLM inference, chat UI, voice I/O, RAG pipelines, autonomous agents, image generation, workflow automation, privacy tools, and a monitoring dashboard — all integrated, all working together, all on one chip.
That vision is what halo-ai is built on. The architecture, the service selection, the idea that a single Strix Halo box can replace a rack of cloud services — that came from DreamServer. Their open-source dashboard panel is what powers the halo-ai control center. We did not just take inspiration — we directly forked their work and built on it.
Fork: github.com/bong-water-water-bong/DreamServer
Lemonade by AMD is the backbone of this stack. Their unified API layer provides OpenAI, Ollama, and Anthropic compatibility through a single endpoint. Their dedicated engineering on gfx1151 ROCm support, NPU+GPU hybrid inference, and the llamacpp-rocm nightly builds turned scattered hardware support into a production-ready platform. Without Lemonade, Strix Halo AI would still be a series of disconnected experiments.
Every service in this stack is open source. We compile from source, but we build nothing — these teams do:
| Project | What It Does | License |
|---|---|---|
| llama.cpp | The LLM inference engine that powers the entire stack | MIT |
| Open WebUI | Full-featured chat interface with RAG and multi-model support | MIT |
| Vane (Perplexica) | AI-powered deep research with cited sources | MIT |
| whisper.cpp | Fast, GPU-accelerated speech-to-text | MIT |
| Kokoro-FastAPI | High-quality text-to-speech API | Apache 2.0 |
| ComfyUI | Node-based image generation pipeline | GPL 3.0 |
| SearXNG | Privacy-respecting meta-search engine | AGPL 3.0 |
| Qdrant | High-performance vector database for RAG | Apache 2.0 |
| n8n | Workflow automation with 400+ integrations | Sustainable Use |
| Caddy | Reverse proxy with automatic TLS | Apache 2.0 |
| ROCm / TheRock | AMD GPU compute stack for gfx1151 | Various |
- kyuz0 — Docker toolboxes and comprehensive backend benchmarks
- Gygeek — Framework laptop setup guides
- The Framework laptop community for extensive hardware testing
- The Arch Linux community for bleeding-edge packages
- Everyone contributing to ROCm gfx1151 support in kernel, Mesa, and userspace
Apache 2.0