| AI Capabilities | Voice & TTS | Code & Tools |
|---|---|---|
| 9B parameter LLM with thinking mode | Streaming audio — hear responses as they generate (~3s to first audio) | Code execution in 7 languages |
| Vision (images) and video analysis | Neural TTS with voice cloning | Syntax-highlighted Code Playground |
| 30 structured analysis patterns | Speech-to-text via Whisper | Document generation (PDF, DOCX, XLSX, PPTX) |
| Smart context windowing with semantic recall | Support for 10 languages | Web search via self-hosted SearXNG |
| Character-accurate letter counting | Voice Studio for managing cloned voices | BM25-ranked persistent memory |
| | Auto-unloads from VRAM when idle | |
```bash
git clone https://github.com/nisakson2000/Gizmo-AI.git
cd Gizmo-AI
bash scripts/download-model.sh   # Downloads ~14GB (LLM + TTS + vision projector)
bash scripts/build-llamacpp.sh   # Builds model server (~5-10min)
bash scripts/start.sh            # Starts all 6 services
# Open http://localhost:3100
```
```
┌─────────────── gizmo-net (10.90.0.0/24) ──────────────┐
│                                                       │
┌──────────┐ │ ┌────────────────┐      ┌───────────────┐ │
│ gizmo-ui │────▶│ │ gizmo-       │────────▶ │ gizmo-llama   │ │
│  :3100   │ │ │ orchestrator   │      │   :8080       │ │
│ SvelteKit│ │ │ :9100 FastAPI  │      │  Qwen3.5-9B   │ │
│ + nginx  │ │ └───────┬────────┘      │   [GPU]       │ │
└──────────┘ │    ┌────┼─────┐         └───────────────┘ │
│┌────▼──┐   │ ┌───▼─────┐  ┌─────────────┐              │
││searxng│   │ │gizmo-tts│  │gizmo-whisper│              │
││ :8300 │   │ │  :8400  │  │   :8200     │              │
││ [CPU] │   │ │  [GPU]  │  │   [CPU]     │              │
│└───────┘   │ └─────────┘  └─────────────┘              │
└──────────┴────────────────────────────────────────────┘
```
| Service | Port | Role | GPU |
|---|---|---|---|
| gizmo-llama | 8080 | LLM inference (Qwen3.5-9B Q8_0 + vision) | Yes |
| gizmo-orchestrator | 9100 | FastAPI backend — routing, streaming, tools | No |
| gizmo-ui | 3100 | SvelteKit web UI via nginx | No |
| gizmo-tts | 8400 | Qwen3-TTS neural voice cloning | Yes |
| gizmo-whisper | 8200 | faster-whisper speech-to-text | No |
| gizmo-searxng | 8300 | Self-hosted web search | No |
### Chat & Conversation
- Streaming chat with persistent server-side history and LLM-generated titles
- Regenerate & edit — re-roll any response or edit a sent message, with `< 1/N >` variant navigation
- Full-text search — sidebar filters by title; press Enter for deep message content search
- Conversation export as formatted Markdown
- Double-click to rename conversations
- Scroll-to-bottom floating button when scrolled up
- Mobile swipe gestures for sidebar (swipe right to open, left to close)
### AI Capabilities
- Mode switcher — 6 behavioral modes (Chat, Brainstorm, Coder, Research, Planner, Roleplay) + custom mode creation with prompt editor
- Usage analytics — token counts, response times, and cloud cost comparison dashboard at `/analytics`
- Thinking mode — step-by-step reasoning in collapsible blocks (toggle on/off)
- Vision — analyze images via multimodal vision projector (mmproj)
- Video analysis — upload video, extract frames, analyze visual content with playback
- Audio transcription — upload M4A/MP3/WAV for Whisper transcription + LLM analysis
- Multi-round tool calling — model autonomously chains up to 5 rounds of tool calls
- Web search via self-hosted SearXNG — no API keys
- Document upload — PDFs, text, code up to 50MB
- Memory — BM25-ranked facts with recency weighting + semantic session recall (CPU embeddings)
- Cross-conversation recall — two-tier semantic search across all past conversations with topic room categorization
- Conversation compaction — rolling LLM summaries preserve context awareness in long conversations
- Knowledge extraction — automatic temporal fact tracking with entity normalization and invalidation
- Smart context windowing — keeps most relevant older messages by semantic similarity
- Recitation — fetches authoritative text from the web for poems, lyrics, speeches
- Character analysis — accurate letter counting via pre-computed character maps
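The memory ranking described above combines a lexical score with a temporal signal. A minimal sketch of BM25 scoring with a recency multiplier, assuming a simple half-life decay (the function names, decay constant, and sample memories are illustrative, not Gizmo's actual implementation):

```python
import math

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Classic BM25 over pre-tokenized documents (generic sketch)."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    scores = [0.0] * n
    for term in query_terms:
        df = sum(1 for d in docs if term in d)
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for i, d in enumerate(docs):
            tf = d.count(term)
            norm = tf + k1 * (1 - b + b * len(d) / avgdl)
            scores[i] += idf * tf * (k1 + 1) / norm
    return scores

def recency_weight(age_days, half_life_days=30.0):
    """Halve a memory's weight every `half_life_days` (decay is illustrative)."""
    return 0.5 ** (age_days / half_life_days)

# (tokens, age in days): a fresh preference vs. a stale hardware fact
memories = [(["user", "prefers", "dark", "themes"], 2),
            (["user", "gpu", "is", "an", "rtx", "4090"], 90)]
scores = bm25_scores(["user", "gpu"], [tokens for tokens, _ in memories])
weighted = [s * recency_weight(age) for s, (_, age) in zip(scores, memories)]
```

Note how the second memory wins on raw BM25 score but loses after recency weighting; tuning the half-life trades freshness against long-term recall.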
### Voice & TTS
- Streaming TTS — sentence-level audio streaming via WebSocket (~3s to first audio vs 7-45s batch mode), with gapless browser playback
- Voice Studio — upload reference audio, name and save voices, adjustable clip duration
- Qwen3-TTS — GPU-accelerated neural voice cloning (x-vector mode) via faster-qwen3-tts
- Speed control — 0.5x to 2.0x
- 10 languages — English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
- Speech-to-text — dictate via microphone with Whisper
- Auto-unload — TTS model frees VRAM after 60s idle
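Sentence-level streaming works by cutting the LLM's token stream at sentence boundaries so synthesis can start before the full reply exists. A sketch of that chunking step, assuming a simple punctuation-based splitter (the regex and function name are illustrative):

```python
import re

SENTENCE_END = re.compile(r'(?<=[.!?])\s+')

def sentences_from_stream(token_stream):
    """Yield complete sentences as soon as they appear in a token stream,
    so TTS can begin synthesizing while the model is still generating."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:  # all but the last part are finished
            yield sentence
        buffer = parts[-1]
    if buffer.strip():               # flush whatever remains at end of stream
        yield buffer.strip()

chunks = list(sentences_from_stream(["Hello", " world. ", "How are", " you? ", "Bye"]))
# chunks == ["Hello world.", "How are you?", "Bye"]
```

Each yielded sentence would then be sent to the TTS service and played back gaplessly as it arrives.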
### Code & Tools
- Sandbox — 7 languages (Python, JavaScript, Bash, C, C++, Go, Lua) in isolated containers (no network, 256MB RAM, read-only fs)
- Code Playground — `/code` route with syntax highlighting (highlight.js), resizable split pane, auto-save, copy/download, word wrap, output file display
- AI code assistant — isolated chat overlay with multi-round tool calling
- Document generation — PDF, DOCX, XLSX, PPTX, CSV, TXT via natural language
- Markup preview — live rendering for HTML, CSS, SVG, Markdown
- Memory Manager — browse, add, and delete memories from the UI
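The sandbox constraints listed above map directly onto standard container flags. A hedged sketch of how such an invocation could be assembled (the image name and mount layout are assumptions; `--network`, `--memory`, and `--read-only` are standard podman/docker options):

```python
import subprocess

def sandbox_cmd(language_image, source_path, memory="256m"):
    """Build a container command matching the isolation described above:
    no network, capped RAM, read-only root filesystem. Illustrative only."""
    return [
        "podman", "run", "--rm",
        "--network=none",        # no network access
        f"--memory={memory}",    # RAM cap
        "--read-only",           # read-only filesystem
        "-v", f"{source_path}:/code/main.py:ro",
        language_image, "python3", "/code/main.py",
    ]

cmd = sandbox_cmd("docker.io/library/python:3.12-alpine", "/tmp/snippet.py")
# To actually execute: subprocess.run(cmd, capture_output=True, timeout=10)
```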
### Patterns & Routing
- 30 patterns — Fabric-inspired cognitive templates (extract_wisdom, summarize, analyze_threat, etc.)
- Intelligent routing — model sees only 3-8 relevant tools per request via keyword pre-routing
- Auto or explicit — patterns activate by keyword matching or `[pattern:name]` prefix
- Pattern-scoped tools — each pattern declares which tools are available
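Keyword pre-routing like the above can be sketched as a lookup from tool names to trigger phrases, with only matching tools exposed to the model (all tool names and keywords below are hypothetical, not Gizmo's actual routing table):

```python
TOOL_KEYWORDS = {
    "web_search":   ["search", "news", "look up", "latest"],
    "run_code":     ["run", "execute", "python", "script"],
    "generate_doc": ["pdf", "docx", "spreadsheet", "report"],
    "memory_store": ["remember", "recall", "forget"],
}

def route_tools(message, max_tools=8, always=("web_search",)):
    """Return only the tools whose keywords appear in the message,
    so the model sees a short tool list instead of every tool."""
    text = message.lower()
    hits = [tool for tool, kws in TOOL_KEYWORDS.items()
            if any(kw in text for kw in kws)]
    for tool in always:              # keep a small always-on set
        if tool not in hits:
            hits.append(tool)
    return hits[:max_tools]

selected = route_tools("Run this Python script and save the output as a PDF")
```

This keeps the tool schema in the prompt short, which both saves context tokens and reduces the chance of the model calling an irrelevant tool.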
### Task Tracker
- Built-in task and note management at `/tracker`
- Tags, priorities, due dates, recurrence (daily/weekly/biweekly/monthly/yearly), subtasks
- Free-text search across titles, descriptions, and tags
- Keyboard navigation — `j`/`k` navigate, `x` toggle status, `n` new task, `/` search
- Inline title editing (double-click), collapsible subtasks, undo delete with toast
- LLM chat overlay for natural language task creation
### UI & Accessibility
- 9 Nintendo themes — NES, SNES, GBA, N64, GameCube, Wii, DS, 3DS, Switch with console frames, sound effects, screen overlays, and boot animations
- Keyboard shortcuts — Ctrl+Shift+N (new chat), Ctrl+Shift+T (think), Ctrl+/ (focus), Escape (close)
- Mobile support — swipe gestures, always-visible message actions on touch devices
- Accessibility — focus trapping in modals, aria-expanded, sidebar keyboard nav, prefers-reduced-motion
- Service health — live status dashboard for all backend services
- Dual API — WebSocket for streaming UI, REST (`/api/chat`) for programmatic access
- Tailscale HTTPS — secure access from any device on your tailnet
- 100% local — your data never leaves your machine
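The REST side of the dual API can be scripted directly. A minimal stdlib sketch of posting to `/api/chat`; the JSON field name `message` is an assumption, so check the orchestrator's actual schema before relying on it:

```python
import json
import urllib.request

def build_chat_request(message, base_url="http://localhost:3100"):
    """Build a POST to the documented /api/chat endpoint.
    The payload shape here is assumed, not confirmed."""
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps({"message": message}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Summarize today's AI news")
# Send with: urllib.request.urlopen(req, timeout=120)
```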
### Android App
- Native Compose chat — streaming responses, markdown rendering, syntax highlighting
- Multi-server profiles — connect to LAN, Tailscale, or any Gizmo instance
- Thinking mode — collapsible reasoning blocks, tool call status cards
- Vision + documents — attach images and files for analysis
- Conversation management — search, rename, delete with undo
- Mode selector — Chat, Brainstorm, Coder, Research, and custom modes
- Auto-reconnect — exponential backoff on network interruption
- Build from source — containerized Podman build, no Android Studio needed
- CI releases — GitHub Actions builds APK on version tags
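Exponential backoff, as used for the app's reconnect logic, doubles the wait between attempts up to a cap. A language-neutral sketch in Python (the base, cap, and jitter choice are illustrative, not the app's actual tuning):

```python
import random

def backoff_delays(attempts, base=1.0, cap=30.0, jitter=False):
    """Exponential backoff schedule: the delay doubles each retry up to a cap.
    Optional jitter spreads reconnects out to avoid thundering herds."""
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            delay = random.uniform(0, delay)
        delays.append(delay)
    return delays

schedule = backoff_delays(6)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```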
| | Minimum | Tested |
|---|---|---|
| GPU | NVIDIA, 16GB+ VRAM | RTX 4090, 24GB |
| RAM | 32GB | 64GB DDR5 |
| Disk | 50GB free | NVMe SSD |
| OS | Linux (Ubuntu, Fedora, Arch) | Bazzite OS (Fedora) |
| Runtime | Podman or Docker + NVIDIA container support | Podman 5.8 |
### VRAM breakdown

| Component | VRAM | Notes |
|---|---|---|
| Qwen3.5-9B weights (Q8_0) | ~9.5 GB | Always loaded |
| KV cache (Q8_0, 32K context) | ~6.2 GB | Grows with conversation |
| Qwen3-TTS | ~4.0 GB | Auto-unloads after 60s idle |
| Peak total | ~20.7 GB | LLM + TTS active |
| Whisper | 0 GB | Runs on CPU |
Full documentation is available on the Wiki.
MIT