[research] omlx — SSD KV caching doubles agent swarm capacity on Apple Silicon #54
Description
oMLX — LLM inference server with SSD KV caching for Apple Silicon
What it does: oMLX is a native macOS LLM inference server built on Apple MLX with a two-tier KV cache: hot blocks stay in RAM, cold blocks spill to SSD in safetensors format. Unlike Ollama (which evicts KV state from RAM and recomputes), oMLX persists KV cache across restarts and context switches. It exposes OpenAI-compatible Chat Completions and Anthropic-compatible Messages endpoints, includes continuous batching, and ships a menu-bar UI + web admin dashboard.
Why it matters for ShellForge: ShellForge's swarm mode is fundamentally RAM-limited — the README itself documents the ceiling (e.g. 3-4 agents on M4 Pro 48 GB with qwen3:30b Q4). oMLX's SSD spill layer could double or triple that capacity without requiring more RAM: once a batch agent's KV blocks go cold between tool calls, they move to SSD and free up space for new agents. Because oMLX is OpenAI API-compatible, Crush (which already targets Ollama's OpenAI-compat endpoint) can point at it with a single endpoint config change — no code changes to the governance layer.
GitHub: https://github.com/jundot/omlx ⭐ 7,168 (created Feb 2026)
License: Apache 2.0 ✅
Rough integration effort: Moderate — swap OLLAMA_HOST for the oMLX endpoint in shellforge setup, add an oMLX status check to shellforge status, and document the SSD cache config (--kv-cache-ssd-path) in the README swarm section.
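Assuming oMLX serves its OpenAI-compatible API on a local port, the swap could look like the fragment below. The `omlx serve` subcommand, port, and spill path are assumptions made for illustration; only the `--kv-cache-ssd-path` flag is taken from the description above.

```shell
# Hypothetical launch: enable the SSD spill tier (subcommand and path are
# assumptions; --kv-cache-ssd-path is the flag named in this issue)
omlx serve --kv-cache-ssd-path /Volumes/Fast/omlx-kv &

# Point ShellForge/Crush at oMLX instead of Ollama. Crush already targets
# an OpenAI-compatible endpoint, so only the host changes (port assumed).
export OLLAMA_HOST=http://127.0.0.1:8080
```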