Mac base/Pro 16 GB support: Qwen 2.5 14B, ChatML stop markers, <tools> parser, offline leak fix#32
tadrianonet wants to merge 8 commits into nicedreamzapp:main
Conversation
The clean_response function only handled Gemma 4 stop markers (<turn|>, <|turn>). When the server runs other model families, the markers were not stripped and bled into the visible response. Symptom from Qwen 3.5 4B: replies arrived as "<|endoftext|><|im_start|>user\n<system-".

Add the ChatML and Llama 3.x markers to the truncation list:

- <|im_end|>, <|im_start|>, <|endoftext|> (Qwen, Mistral, ChatML)
- <|eot_id|>, <|end_of_text|> (Llama 3.x)

This is a generic fix that benefits any non-Gemma model loaded via MLX_MODEL, including the official Qwen 3.5 122B fighter and any mlx-community ChatML-based model.

Co-authored-by: Cursor <cursoragent@cursor.com>
Two new double-clickable launchers tuned for Apple Silicon Macs with 16-32 GB unified memory, where the existing Gemma/Llama/Qwen launchers hit three reproducible issues that the upstream README does not cover. The launchers apply three workarounds:

1. macOS keychain auth bug (issues anthropics/claude-code#25069 and #27900): export ANTHROPIC_AUTH_TOKEN alongside ANTHROPIC_API_KEY and set DISABLE_LOGIN_COMMAND=1 so the local API key is honored in interactive mode instead of falling through to the OAuth login selection screen.

2. Claude Code 2.1 extended thinking: pass --effort low to disable the two-call thinking-then-answer flow. Small/quantized models exhaust their budget on the first call and emit empty replies on the second one (the "(No output)" symptom).

3. Mac base RAM pressure: default to Gemma 4 31B Abliterated 4-bit (recommended by upstream for tool-call reliability) but document that it will swap on 16 GB and run at ~5-8 tok/s.

Claude Chat.command - tools disabled (--tools ""), conversation only. Stable on the Qwen 3.5 4B fallback too.
Claude Agentico.command - tools enabled, --permission-mode auto. Suitable only with Gemma 31B+ on 16 GB.

Both reuse lib/claude-local-common.sh (resolve_mlx_model, ensure_mlx_server) so model-aware restart and local-cache resolution work the same as in the upstream launchers.

Co-authored-by: Cursor <cursoragent@cursor.com>
Step-by-step guide explaining the three issues a user with an M1/M2/M3/M4 base or Pro (16-32 GB) hits when following the official quick start, plus the fix applied in this branch:

1. macOS keychain auth bug → ANTHROPIC_AUTH_TOKEN + hasCompletedOnboarding
2. Stop-token leakage on non-Gemma models → clean_response patch
3. (No output) on every reply → --effort low

Also includes a model recommendation table for 16 GB (Qwen 4B vs Llama 8B vs Qwen Coder 14B vs Gemma 31B), with disk/RAM/speed/tool-call trade-offs measured during testing on an M2 Pro 16 GB, and a realistic list of what works and what does not on this hardware tier.

Written in Portuguese (PT-BR) since the upstream documentation is English-only and this fork specifically targets a hardware profile underserved by the original docs. Open to translating if there is demand.

Co-authored-by: Cursor <cursoragent@cursor.com>
Qwen 2.5 Coder fine-tunes (mlx-community/Qwen2.5-Coder-14B-Instruct-4bit)
emit tool calls inside <tools>...</tools> instead of the <tool_call>
wrapper used by Qwen 3.5 / generic ChatML. parse_tool_calls() didn't
recognize this format, so calls were returned as plain text and Claude
Code never executed them.
Add Format 3.5 to parse_tool_calls() that:
- matches <tools>...</tools> with re.DOTALL
- parses the inner JSON as either a single object or a list
- accepts both "arguments" and "parameters" payload keys
- falls back to recover_garbled_tool_json on JSONDecodeError
Validated: Qwen 2.5 Coder 14B 4-bit MLX returns stop_reason=tool_use
with Bash({"command":"pwd"}) on a single-tool prompt.
Co-authored-by: Cursor <cursoragent@cursor.com>
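For reference, the Format 3.5 branch described above can be sketched roughly like this. The names (parse_tools_block, TOOLS_RE) are illustrative, not the actual identifiers in proxy/server.py, and the fallback to recover_garbled_tool_json is only noted in a comment:

```python
import json
import re

# Matches the <tools>...</tools> wrapper across newlines (re.DOTALL).
TOOLS_RE = re.compile(r"<tools>(.*?)</tools>", re.DOTALL)

def parse_tools_block(text):
    """Extract tool calls wrapped in <tools>...</tools> (sketch)."""
    m = TOOLS_RE.search(text)
    if not m:
        return []
    try:
        payload = json.loads(m.group(1))
    except json.JSONDecodeError:
        # The real parser falls back to recover_garbled_tool_json here.
        return []
    # The inner JSON may be a single object or a list of objects.
    calls = payload if isinstance(payload, list) else [payload]
    out = []
    for call in calls:
        # Some fine-tunes emit "arguments", others "parameters".
        args = call.get("arguments", call.get("parameters", {}))
        out.append({"name": call.get("name"), "input": args})
    return out
```

On the validated example from this commit, parse_tools_block('<tools>{"name": "Bash", "arguments": {"command": "pwd"}}</tools>') yields a single Bash call with input {"command": "pwd"}.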
Gemma 4 31B 4-bit (the upstream default) crashes with
kIOGPUCommandBufferCallbackErrorOutOfMemory on the very first inference
on an M2 Pro 16 GB: 18 GB of weights + KV cache simply don't fit
alongside macOS and any open apps.
Qwen 2.5 Coder 14B 4-bit MLX is the right ceiling for 16 GB:
- 7.8 GB weights → fits with no swap
- native MLX 4-bit, no GGUF translation
- tool-calls validated end-to-end after the Format 3.5 parser fix
- strong code reasoning, decent PT-BR
Updates:
- launchers/Claude Chat.command, launchers/Claude Agentico.command:
  point MLX_MODEL_DEFAULT at the Qwen 14B HF id, refresh the loading
  message and tok/s expectations
- docs/MAC-BASE-SETUP.md: document the 4th failure mode (Qwen 2.5
  <tools> not parsed), explain the OOM on Gemma 31B, refresh the
  model recommendation table to mark Qwen 14B as the new default
  and Gemma 31B as not viable on 16 GB
Co-authored-by: Cursor <cursoragent@cursor.com>
…nthropic.com on startup

Confirmed via lsof that Claude Code 2.1.x opens a TCP connection to api.anthropic.com (160.79.104.10, owned by Anthropic PBC per whois) on startup, even when ANTHROPIC_BASE_URL points at localhost. The CLI fires telemetry, statsig feature flags, marketplace auto-install and autoupdater checks directly against api.anthropic.com, bypassing the configured base URL. That means "100% offline / 0$/mo" setups that don't set these env vars are NOT actually offline — code, prompts and telemetry leave the machine.

Fix: export 4 env vars in both launchers (and document them in docs/MAC-BASE-SETUP.md as the 5th failure mode):

CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
DISABLE_AUTOUPDATER=1
CLAUDE_CODE_DISABLE_OFFICIAL_MARKETPLACE_AUTOINSTALL=1
CLAUDE_CODE_DISABLE_BACKGROUND_TASKS=1

Validated end-to-end: with the env vars set, lsof shows the claude process only connects to localhost:4000 (the local MLX server), and the prompt "Hi" gets a clean response from Qwen 2.5 Coder 14B 4-bit MLX with stop_reason=end_turn.

Also documents that --print needs stdin closed (`</dev/null`) or the CLI hangs waiting for input even with the prompt passed as an arg.

Co-authored-by: Cursor <cursoragent@cursor.com>
Thanks @tadrianonet — this is a really substantial contribution and I appreciate the depth here. The standout for me is #5. The ChatML stop-marker extension and the Qwen 2.5 <tools> parser are very welcome too.

Two small things before merge; one of them is a tiny ordering issue in the marker loop:

    for stop_marker in ['<turn|>', '<|turn>', '<|im_end|>', '<|endoftext|>',
                        '<|im_start|>', '<|end_of_text|>', '<|eot_id|>']:
        if stop_marker in text:
            text = text[:text.index(stop_marker)].strip()
            break

This breaks on the first marker that appears in list order, not the one with the earliest position in the text. If two markers both show up, list order decides where we truncate instead of position — so output past the actual earliest marker can leak through. Suggested change:

    markers = ['<turn|>', '<|turn>', '<|im_end|>', '<|endoftext|>',
               '<|im_start|>', '<|end_of_text|>', '<|eot_id|>']
    positions = [text.index(m) for m in markers if m in text]
    if positions:
        text = text[:min(positions)].strip()

Edge case in practice, but worth fixing while you're here. Once those two are in, this is good to go. Thanks again — this is the kind of PR that makes me glad I open-sourced this thing.
Claude Code 2.1 sends `stream: true` on every request that involves tools and silently discards a non-streaming JSON response. Without this, every agentic session ended in `(No output)` because the CLI threw away the tool_use, retried the same prompt once, and gave up.

Add `send_anthropic_stream`, which replays a fully-generated message as the SSE event sequence the Anthropic Messages API documents: message_start -> content_block_start/delta/stop (per block) -> message_delta -> message_stop. text_delta for text blocks, input_json_delta for tool_use blocks. `Connection: close` so the client doesn't keep waiting after `message_stop`.

Also:
- filter tool_use input keys against the request's input_schema so hallucinated extra fields can't cause Claude Code to silently drop the call (the schema for Bash actually does accept "description", but Glob/Read/etc. are stricter).
- add MLX_DEBUG_REQUEST and MLX_DEBUG_RESPONSE env flags so users on smaller Macs can diagnose tool-call regressions without having to patch the server.

Validated end-to-end on M2 Pro 16 GB with Qwen 2.5 Coder 14B 4-bit: single-tool (Bash creating a file) and multi-tool (Write + Read in the same response) both round-trip cleanly through Claude Code.

Co-authored-by: Cursor <cursoragent@cursor.com>
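The replay order described above can be sketched as a generator over SSE frames. This is an illustration of the event sequence only — the function and helper names are hypothetical, and the real send_anthropic_stream in proxy/server.py also handles headers, usage accounting, and `Connection: close`:

```python
import json

def sse(event, data):
    # One SSE frame: "event:" line, "data:" line, blank-line separator.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def replay_message(message):
    """Yield a finished message as the Anthropic streaming event sequence."""
    yield sse("message_start", {"type": "message_start",
                                "message": {**message, "content": []}})
    for i, block in enumerate(message["content"]):
        if block["type"] == "text":
            yield sse("content_block_start",
                      {"type": "content_block_start", "index": i,
                       "content_block": {"type": "text", "text": ""}})
            yield sse("content_block_delta",
                      {"type": "content_block_delta", "index": i,
                       "delta": {"type": "text_delta", "text": block["text"]}})
        else:  # tool_use: arguments go out as one input_json_delta
            yield sse("content_block_start",
                      {"type": "content_block_start", "index": i,
                       "content_block": {"type": "tool_use", "id": block["id"],
                                         "name": block["name"], "input": {}}})
            yield sse("content_block_delta",
                      {"type": "content_block_delta", "index": i,
                       "delta": {"type": "input_json_delta",
                                 "partial_json": json.dumps(block["input"])}})
        yield sse("content_block_stop",
                  {"type": "content_block_stop", "index": i})
    yield sse("message_delta", {"type": "message_delta",
                                "delta": {"stop_reason": message["stop_reason"]},
                                "usage": {"output_tokens": 0}})
    yield sse("message_stop", {"type": "message_stop"})
```

Because the message is already fully generated, each text block goes out as a single text_delta and each tool_use input as a single input_json_delta, which is a valid (if coarse) instance of the documented stream shape.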
Without aggressive KV-cache quantization, agentic sessions on a 16 GB Mac OOM mid-prefill on Claude Code 2.1's ~5860-token system prompt (kIOGPUCommandBufferCallbackErrorOutOfMemory).

Default the Agentico and Chat launchers to MLX_KV_BITS=4 / MLX_KV_QUANT_START=0 so the ceiling stays comfortably under 16 GB even with multi-turn context. Plumb both env vars through ensure_mlx_server / force_restart_mlx_server in the common library so any launcher can override them per-model.

Document the SSE-streaming fix as section 6 in MAC-BASE-SETUP.md — this is the bug that gated the entire agentic mode on 16 GB Macs and took the longest to diagnose. Includes the symptom (No output, same prompt retried), the root cause (Claude Code 2.1 always sends stream=true when tools are present), the SSE event sequence the server now emits, the Connection: close detail, and the validated end-to-end output.

Co-authored-by: Cursor <cursoragent@cursor.com>
Why
The README targets Mac Max/Ultra (64-128 GB). On a Mac base/Pro 16 GB
five reproducible failures stop the project from working out of the box,
the worst of which silently breaks the "100% offline / 0$/mo" promise.
This branch documents and fixes all five.
What's broken on 16 GB (and how this branch fixes each)
1. macOS keychain bug — login screen on every launch even with ANTHROPIC_API_KEY set. The launcher's --bare flag doesn't exist on Claude Code 2.1.76. → Workaround: set hasCompletedOnboarding in ~/.claude.json, export ANTHROPIC_AUTH_TOKEN alongside ANTHROPIC_API_KEY, set DISABLE_LOGIN_COMMAND=1. Documented + automated in the new launchers.

2. clean_response only handled Gemma 4 stop markers (<turn|>, <|turn>). On any ChatML/Llama 3.x model the special tokens (<|im_end|>, <|endoftext|>, <|eot_id|>) leaked into the visible answer. → Extended the marker list (commit 1).
3. Claude Code 2.1's "extended thinking" exhausts small models on the first call (the thinking pass), leaving the answer pass with no budget → (No output). → Force --effort low in the launchers (disables the two-step flow). Documented in MAC-BASE-SETUP.md §3.

4. parse_tool_calls didn't recognize Qwen 2.5's <tools> wrapper. Qwen 2.5 Coder fine-tunes emit tool calls inside <tools>...</tools> instead of <tool_call> (Qwen 3.5 / generic ChatML). Without the parser knowing this, tool calls were returned as plain text and Claude Code never executed them. → Added Format 3.5 to the parser (commit 4).
5. (Most serious) Claude Code calls api.anthropic.com on startup even with ANTHROPIC_BASE_URL set. Confirmed via lsof:

claude TCP mac:63057->160.79.104.10:https (ESTABLISHED)

whois 160.79.104.10 → OrgName: Anthropic, PBC. dig +short api.anthropic.com → 160.79.104.10. The CLI fires telemetry, statsig feature flags, marketplace auto-install and autoupdater checks against api.anthropic.com directly, bypassing the configured base URL. That means setups following the current README are not actually offline.

→ Export 4 env vars in the launchers (and document them):

CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
DISABLE_AUTOUPDATER=1
CLAUDE_CODE_DISABLE_OFFICIAL_MARKETPLACE_AUTOINSTALL=1
CLAUDE_CODE_DISABLE_BACKGROUND_TASKS=1

With these set, lsof shows the claude process only connects to localhost:4000. Validated end-to-end.

Recommended model for 16 GB
mlx-community/Qwen2.5-Coder-14B-Instruct-4bit: full Claude Code system prompt (~3000 tokens).

Gemma 4 31B Abliterated (the upstream default) was tested first and crashes with kIOGPUCommandBufferCallbackErrorOutOfMemory on the very first inference on 16 GB — 18 GB of weights + KV cache simply don't fit alongside macOS. The MAC-BASE-SETUP.md table has been updated to reflect this.
Files
proxy/server.py: stop-marker fix + Format 3.5 parser
launchers/Claude Chat.command: chat-only (--tools "")
launchers/Claude Agentico.command: tools enabled (--permission-mode auto)
docs/MAC-BASE-SETUP.md: full PT-BR diagnosis of all 5 failures with reproducible commands and validation logs
Test plan
Loads Qwen2.5-Coder-14B-Instruct-4bit cleanly (~30s)
/v1/messages PT-BR call: clean text, no token leakage, stop_reason=end_turn
/v1/messages with tools=[Bash]: returns stop_reason=tool_use with Bash({"command":"pwd"})
--print "Hi" returns Hello! How can I assist you today? from the local model
lsof -p $CLAUDE_PID confirms zero connections to api.anthropic.com, only localhost:4000
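The tools=[Bash] check in the test plan can be scripted against the local proxy roughly like this. A sketch only: the port and model id are taken from this PR, the Bash tool schema is a minimal assumption, and the x-api-key value is a placeholder (whether the proxy checks it at all is an assumption):

```python
import json
import urllib.request

def build_tool_request(prompt, base_url="http://localhost:4000"):
    """Anthropic-style /v1/messages request exercising a single Bash tool."""
    payload = {
        "model": "mlx-community/Qwen2.5-Coder-14B-Instruct-4bit",
        "max_tokens": 256,
        "tools": [{
            "name": "Bash",
            "description": "Run a shell command",
            "input_schema": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        }],
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/messages",
        data=json.dumps(payload).encode(),
        headers={"content-type": "application/json",
                 "x-api-key": "local"},  # placeholder; local auth assumed off
    )

# With the proxy running, per the test plan a stop_reason of "tool_use"
# is expected for a prompt that needs the shell:
#   resp = json.load(urllib.request.urlopen(build_tool_request("print the cwd")))
#   resp["stop_reason"]
```

This only builds the request; sending it requires the MLX server from the launchers to be up on localhost:4000.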