Local MLX Chat is a lightweight web app for fully local LLM conversation on Apple Silicon.
It is designed to keep setup simple while still supporting practical chat features:
- Local-only model inference with `mlx-lm`
- Multi-turn context-aware conversation
- Streaming and non-streaming generation
- Real-time token speed display (tok/s) during streaming
- Thought-section and final-answer rendering
- Thread history (new/switch/delete)
- Edit-and-regenerate from a previous user turn
- Tunable generation parameters
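The thought/answer rendering in the list above can be sketched as a simple split on reasoning delimiters. The `<think>…</think>` tag pair is an assumption about the chat template (common for thinking-style models), not this app's actual parser:

```python
# Hypothetical sketch: split a completion into a "thought" section and the
# final answer, assuming <think>...</think> delimiters (an assumption about
# the chat template, not this app's actual parsing code).
import re

def split_thought(text: str) -> tuple[str, str]:
    """Return (thought, answer); thought is empty if no tags are present."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    thought = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return thought, answer
```

The UI can then render the two parts in separate collapsible sections.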
Default model: `lukey03/Qwen3.5-9B-abliterated-MLX-8bit`
This is a local chat stack:
- Backend: FastAPI
- Inference: `mlx-lm` (`load`, `stream_generate`, sampler/logits processors)
- Frontend: single-page HTML/CSS/JS (no heavy framework)
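On the streaming path, the FastAPI backend frames each token delta as a Server-Sent Events message. A minimal sketch of that framing, assuming a simple `{delta, done}` JSON payload (the field names are illustrative, not this app's actual wire format):

```python
# Illustrative SSE framing for streamed token deltas; the JSON payload shape
# ({"delta": ..., "done": ...}) is an assumption, not the app's real schema.
import json

def sse_event(delta: str, done: bool = False) -> str:
    """Frame one token delta as a single SSE 'data:' message."""
    payload = {"delta": delta, "done": done}
    return f"data: {json.dumps(payload, ensure_ascii=False)}\n\n"
```

Each message ends with a blank line, which is what delimits events in the SSE protocol; the browser's `EventSource` (or a manual fetch reader) then appends deltas to the chat bubble as they arrive.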
Important:
- Inference runs on your local machine (Apple Silicon).
- The model is downloaded from Hugging Face on first use.
- After download, model files are reused from local cache.
- This project does not proxy requests to cloud LLM APIs.
- Multi-turn chat with history passed to the model each turn
- `Send`, `Stop`, and `New Chat`
- Session thread list with switch/delete
- User-message Edit & Regenerate
- Parameter presets (save/select/update/delete)
- Custom model management from the UI (add/remove model IDs)
- Optional streaming output (SSE)
- Optional thinking mode (model/template dependent)
- Language forcing (`Auto`, `中文`, `English`, `日本語`, `한국어`)
- Persistent settings in browser storage
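"History passed to the model each turn" means the whole thread is replayed as a chat-messages list on every request. A rough sketch of that assembly; the `(user, assistant)` tuple shape is an illustrative assumption, not the app's actual data model:

```python
# Sketch of multi-turn prompt assembly: replay the entire thread as a
# chat-messages list each turn. The (user, assistant) tuple shape is an
# illustrative assumption, not this app's actual data model.
def build_messages(system_prompt, history):
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    for user_msg, assistant_msg in history:
        messages.append({"role": "user", "content": user_msg})
        if assistant_msg:  # the in-flight turn has no assistant reply yet
            messages.append({"role": "assistant", "content": assistant_msg})
    return messages
```

A list in this shape is what a tokenizer's chat template consumes before generation; Edit & Regenerate amounts to truncating `history` at the edited turn and rebuilding the list.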
- macOS (Apple Silicon)
- Python 3.10+
cd <your-project-directory>
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app:app --host 127.0.0.1 --port 8000 --reload

Open: http://127.0.0.1:8000
Default model id: `lukey03/Qwen3.5-9B-abliterated-MLX-8bit`
You can append additional models to the dropdown by environment variable:
EXTRA_MODELS="mlx-community/Qwen2.5-7B-Instruct-4bit,mlx-community/Llama-3.2-3B-Instruct-4bit" \
uvicorn app:app --host 127.0.0.1 --port 8000

You can also add custom model IDs directly in the UI:
- Enter a model id in 模型管理(自定义模型 ID) ("Model Management (Custom Model ID)")
- Click 添加模型 ("Add Model")
- The model is saved in browser `localStorage` and appears in the model dropdown
- 删除当前模型 ("Delete Current Model") removes only custom-added models (not built-in defaults)
- Temperature: higher = more diverse, lower = more deterministic
- Top P: nucleus sampling probability mass
- Top K: sample only from the top K candidate tokens
- Repetition Penalty: discourages repetition
- Repeat Context: token window used by the repetition penalty
- Max Tokens: max newly generated tokens per response (backend cap: 4096)
- Streaming: stream token deltas or wait for the full response
- Thinking: enables thinking mode in the chat template (if the model supports it)
- System Prompt: global behavior instruction
- Response Language: language forcing rule
Note: Max Tokens is generation length per reply, not total context window.
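The backend cap can be enforced server-side regardless of what the UI sends. A minimal sketch of that clamping, using the 4096 cap noted above (the function name is illustrative, not this app's actual code):

```python
# Sketch of server-side clamping for the per-reply generation length, using
# the 4096 backend cap noted above; the name is illustrative, not app code.
MAX_TOKENS_CAP = 4096

def clamp_max_tokens(requested: int) -> int:
    """Bound the requested generation length to [1, MAX_TOKENS_CAP]."""
    return min(max(1, requested), MAX_TOKENS_CAP)
```

Clamping on the server keeps latency predictable even if a client bypasses the UI controls.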
- Threads and messages are stored in `sessionStorage`
- Refreshing the tab keeps the current browser-session history
- Closing the browser clears session chat history
- Settings are stored in `localStorage` (persist across reopens)
- `app.py`: API routes, model loading, prompt building, generation loop
- `static/index.html`: UI layout, chat rendering, history management
- `requirements.txt`: Python dependencies
- `.model_overrides/`: runtime compatibility overrides for model config
First request may download and initialize the model. Later requests are faster.
Model files are managed by huggingface_hub cache.
You can inspect cache root with:
python -c "from huggingface_hub import scan_cache_dir; print(scan_cache_dir().cache_dir)"

Then remove the target repo cache directory if needed.
This app intentionally caps generation length for stability and predictable latency.