Local MLX Chat

Chinese documentation: README.zh-CN.md

Local MLX Chat is a lightweight web app for fully local LLM conversation on Apple Silicon.

It is designed to keep setup simple while still supporting practical chat features:

Local-only model inference with mlx-lm
Multi-turn context-aware conversation
Streaming and non-streaming generation
Real-time token speed display (tok/s) during streaming
Thought section and final answer section rendering
Thread history (new/switch/delete)
Edit-and-regenerate from a previous user turn
Tunable generation parameters
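
Rendering a thought section separately from the final answer requires splitting the raw model output. The sketch below is one way to do that, assuming the model wraps its reasoning in `<think>...</think>` tags (common for Qwen3-family chat templates). The helper name `split_thought` is hypothetical and is not taken from app.py.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think>; other models may differ.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thought(text: str) -> tuple[str, str]:
    """Return (thought, answer) extracted from raw model output."""
    match = _THINK_RE.search(text)
    if match is None:
        # Mid-stream case: the opening tag arrived but the closing tag has not.
        if "<think>" in text:
            return text.split("<think>", 1)[1].strip(), ""
        # No thinking block at all: everything is the answer.
        return "", text.strip()
    thought = match.group(1).strip()
    # The answer is whatever surrounds the first thinking block.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return thought, answer
```

Calling this on every accumulated chunk during streaming lets a UI show the thought as it grows and switch to the answer once the closing tag appears.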

Default model:

lukey03/Qwen3.5-9B-abliterated-MLX-8bit

What This Project Is

This is a local chat stack:

Backend: FastAPI
Inference: mlx-lm (load, stream_generate, sampler/logits processors)
Frontend: single-page HTML/CSS/JS (no heavy framework)

Important:

Inference runs on your local machine (Apple Silicon).
The model is downloaded from Hugging Face on first use.
After download, model files are reused from local cache.
This project does not proxy requests to cloud LLM APIs.

Features

Multi-turn chat with history passed to the model each turn
Send, Stop, and New Chat
Session thread list with switch/delete
User-message Edit & Regenerate
Parameter presets (save/select/update/delete)
Custom model management from UI (add/remove model IDs)
Optional streaming output (SSE)
Optional thinking mode (model/template dependent)
Language forcing (Auto, 中文, English, 日本語, 한국어)
Persistent settings in browser storage
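
For the optional SSE streaming listed above, each token delta has to be framed as a server-sent event. The following is a minimal sketch of that framing; the payload keys `delta` and `done` are hypothetical, and the actual event schema used by app.py may differ.

```python
import json
from typing import Iterable, Iterator


def sse_events(deltas: Iterable[str]) -> Iterator[str]:
    """Frame text deltas as SSE messages ("data: ...", then a blank line)."""
    for delta in deltas:
        # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable on the wire.
        payload = json.dumps({"delta": delta}, ensure_ascii=False)
        yield f"data: {payload}\n\n"
    # Hypothetical end-of-stream marker so the client knows generation finished.
    yield 'data: {"done": true}\n\n'
```

In FastAPI, a generator like this can be returned through `StreamingResponse(..., media_type="text/event-stream")`.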

Requirements

macOS (Apple Silicon)
Python 3.10+

Quick Start

cd <your-project-directory>
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app:app --host 127.0.0.1 --port 8000 --reload

Open:

http://127.0.0.1:8000

Model Configuration

Default model id:

lukey03/Qwen3.5-9B-abliterated-MLX-8bit

You can append additional models to the dropdown with the EXTRA_MODELS environment variable (comma-separated model IDs):

EXTRA_MODELS="mlx-community/Qwen2.5-7B-Instruct-4bit,mlx-community/Llama-3.2-3B-Instruct-4bit" \
uvicorn app:app --host 127.0.0.1 --port 8000
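
To illustrate how a comma-separated EXTRA_MODELS value could become the dropdown list, here is a small sketch. The function name `available_models` and the de-duplication behavior are assumptions for illustration, not a description of app.py.

```python
import os

DEFAULT_MODEL = "lukey03/Qwen3.5-9B-abliterated-MLX-8bit"


def available_models(env=None) -> list[str]:
    """Return the default model followed by any IDs listed in EXTRA_MODELS."""
    env = os.environ if env is None else env
    raw = env.get("EXTRA_MODELS", "")
    models = [DEFAULT_MODEL]
    for item in raw.split(","):
        model_id = item.strip()
        # Skip empty entries (e.g. trailing commas) and duplicates.
        if model_id and model_id not in models:
            models.append(model_id)
    return models
```

Passing a plain dict as `env` makes the parsing easy to check without touching the real environment.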

You can also add custom model IDs directly in the UI:

Enter a model ID in 模型管理（自定义模型 ID） ("Model Management (Custom Model ID)")
Click 添加模型 ("Add Model")
The model is saved in browser localStorage and appears in the model dropdown
删除当前模型 ("Delete Current Model") removes only custom-added models (not built-in defaults)

Generation Parameters

Temperature: higher = more diverse, lower = more deterministic
Top P: nucleus sampling probability mass
Top K: sample only from top K candidate tokens
Repetition Penalty: discourages repetition
Repeat Context: token window used by repetition penalty
Max Tokens: max newly generated tokens per response (backend cap: 4096)
Streaming: stream token deltas or wait for full response
Thinking: enables thinking mode in chat template (if model supports it)
System Prompt: global behavior instruction
Response Language: language forcing rule

Note: Max Tokens is generation length per reply, not total context window.
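
How Temperature, Top K, and Top P interact can be seen on a toy example. The sketch below is a pure-Python illustration over a small logit table; it is not the mlx-lm sampler, and the function name `filter_candidates` is hypothetical. It assumes `temperature > 0`.

```python
import math


def filter_candidates(logits: dict[str, float], temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> dict[str, float]:
    """Return the renormalized distribution left after temperature, top-k, top-p."""
    # Temperature rescales logits before softmax: lower values sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)
    # Top K: keep only the K most likely tokens (0 disables the filter in this sketch).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top P: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}
```

A sampler would then draw the next token from the returned distribution, for example with `random.choices(list(dist), weights=list(dist.values()))`.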

Session and Persistence

Threads and messages are stored in sessionStorage
Refreshing the tab keeps the current browser-session history
Closing the tab or the browser clears session chat history
Settings are stored in localStorage (persist across reopen)

Project Structure

app.py: API routes, model loading, prompt building, generation loop
static/index.html: UI layout, chat rendering, history management
requirements.txt: Python dependencies
.model_overrides/: runtime compatibility overrides for model config

FAQ

Why is the first response slow?

The first request may need to download the model from Hugging Face and initialize it. Later requests reuse the locally cached files and are faster.

How do I remove local model cache?

Model files are managed by the huggingface_hub cache.
You can print the cache root with:

python -c "from huggingface_hub import constants; print(constants.HF_HUB_CACHE)"

Each model lives in its own directory under that root, named models--<org>--<name>. Remove the directory for the target repo if needed.
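
The huggingface_hub cache keeps each model repo in a folder named `models--<org>--<name>` under the cache root. The sketch below builds that path so you can confirm it before deleting anything; the helper name `repo_cache_dir` is hypothetical, and the cache root is passed in explicitly rather than looked up.

```python
from pathlib import Path


def repo_cache_dir(cache_root: str, repo_id: str) -> Path:
    """Return the expected cache folder for a model repo ID such as 'org/name'."""
    # The hub cache replaces "/" in the repo ID with "--" and prefixes "models--".
    folder = "models--" + repo_id.replace("/", "--")
    return Path(cache_root) / folder
```

After checking that the returned path exists and is the one you intend to remove, `shutil.rmtree(path)` deletes it; the model is downloaded again on next use.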

Why is `Max Tokens` limited to 4096 here?

This app intentionally caps generation length for stability and predictable latency.
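
A cap like this is usually enforced on the server regardless of what the UI sends. The sketch below shows one way to clamp a requested value; whether app.py clamps or rejects out-of-range values is not stated here, and the lower bound of 1 and the name `clamp_max_tokens` are assumptions.

```python
MAX_TOKENS_CAP = 4096  # backend cap stated in this README


def clamp_max_tokens(requested: int, cap: int = MAX_TOKENS_CAP) -> int:
    """Clamp a requested generation length into the range [1, cap]."""
    # Assumption: values below 1 are raised to 1 rather than rejected.
    return max(1, min(int(requested), cap))
```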
