Mac base/Pro 16 GB support: Qwen 2.5 14B, ChatML stop markers, <tools> parser, offline leak fix#32
tadrianonet wants to merge 8 commits into nicedreamzapp:main
Conversation
The clean_response function only handled Gemma 4 stop markers (<turn|>, <|turn>). When the server runs other model families, the markers were not stripped and bled into the visible response. Symptom from Qwen 3.5 4B: replies arrived as "<|endoftext|><|im_start|>user\n<system-".

Add the ChatML and Llama 3.x markers to the truncation list:

- <|im_end|>, <|im_start|>, <|endoftext|> (Qwen, Mistral, ChatML)
- <|eot_id|>, <|end_of_text|> (Llama 3.x)

This is a generic fix that benefits any non-Gemma model loaded via MLX_MODEL, including the official Qwen 3.5 122B fighter and any mlx-community ChatML-based model.

Co-authored-by: Cursor <cursoragent@cursor.com>
Two new double-clickable launchers tuned for Apple Silicon Macs with 16-32 GB unified memory, where the existing Gemma/Llama/Qwen launchers hit three reproducible issues that the upstream README does not cover. The launchers apply three workarounds:

1. macOS keychain auth bug (issues anthropics/claude-code#25069 and #27900): export ANTHROPIC_AUTH_TOKEN alongside ANTHROPIC_API_KEY and set DISABLE_LOGIN_COMMAND=1 so the local API key is honored in interactive mode instead of falling through to the OAuth login selection screen.

2. Claude Code 2.1 extended thinking: pass --effort low to disable the two-call thinking-then-answer flow. Small/quantized models exhaust their budget on the first call and emit empty replies on the second one (the "(No output)" symptom).

3. Mac base RAM pressure: default to Gemma 4 31B Abliterated 4-bit (recommended by upstream for tool-call reliability) but document that it will swap on 16 GB and run at ~5-8 tok/s.

Claude Chat.command - tools disabled (--tools ""), conversation only. Stable on the Qwen 3.5 4B fallback too.
Claude Agentico.command - tools enabled, --permission-mode auto. Suitable only with Gemma 31B+ on 16 GB.

Both reuse lib/claude-local-common.sh (resolve_mlx_model, ensure_mlx_server) so model-aware restart and local-cache resolution work the same as in the upstream launchers.

Co-authored-by: Cursor <cursoragent@cursor.com>
Step-by-step guide explaining the three issues a user with an M1/M2/M3/M4 base or Pro (16-32 GB) hits when following the official quick start, plus the fix applied in this branch:

1. macOS keychain auth bug → ANTHROPIC_AUTH_TOKEN + hasCompletedOnboarding
2. Stop-token leakage on non-Gemma models → clean_response patch
3. (No output) on every reply → --effort low

Also includes a model recommendation table for 16 GB (Qwen 4B vs Llama 8B vs Qwen Coder 14B vs Gemma 31B), with disk/RAM/speed/tool-call trade-offs measured during testing on an M2 Pro 16 GB, and a realistic list of what works and what does not on this hardware tier.

Written in Portuguese (PT-BR) since the upstream documentation is English-only and this fork specifically targets a hardware profile underserved by the original docs. Open to translating if there is demand.

Co-authored-by: Cursor <cursoragent@cursor.com>
Qwen 2.5 Coder fine-tunes (mlx-community/Qwen2.5-Coder-14B-Instruct-4bit)
emit tool calls inside <tools>...</tools> instead of the <tool_call>
wrapper used by Qwen 3.5 / generic ChatML. parse_tool_calls() didn't
recognize this format, so calls were returned as plain text and Claude
Code never executed them.
Add Format 3.5 to parse_tool_calls() that:
- matches <tools>...</tools> with re.DOTALL
- parses the inner JSON as either a single object or a list
- accepts both "arguments" and "parameters" payload keys
- falls back to recover_garbled_tool_json on JSONDecodeError
Validated: Qwen 2.5 Coder 14B 4-bit MLX returns stop_reason=tool_use
with Bash({"command":"pwd"}) on a single-tool prompt.
Co-authored-by: Cursor <cursoragent@cursor.com>
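For reference, the Format 3.5 branch described above can be sketched roughly like this. The names (parse_tools_block, TOOLS_RE) are illustrative, not the actual identifiers in proxy/server.py, and the fallback to recover_garbled_tool_json is only noted in a comment:

```python
import json
import re

# Matches the <tools>...</tools> wrapper across newlines (re.DOTALL).
TOOLS_RE = re.compile(r"<tools>(.*?)</tools>", re.DOTALL)

def parse_tools_block(text):
    """Extract tool calls wrapped in <tools>...</tools> (sketch)."""
    m = TOOLS_RE.search(text)
    if not m:
        return []
    try:
        payload = json.loads(m.group(1))
    except json.JSONDecodeError:
        # The real parser falls back to recover_garbled_tool_json here.
        return []
    # The inner JSON may be a single object or a list of objects.
    calls = payload if isinstance(payload, list) else [payload]
    out = []
    for call in calls:
        # Some fine-tunes emit "arguments", others "parameters".
        args = call.get("arguments", call.get("parameters", {}))
        out.append({"name": call.get("name"), "input": args})
    return out
```

On the validated example from this commit, parse_tools_block('<tools>{"name": "Bash", "arguments": {"command": "pwd"}}</tools>') yields a single Bash call with input {"command": "pwd"}.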
Gemma 4 31B 4-bit (the upstream default) crashes with
kIOGPUCommandBufferCallbackErrorOutOfMemory on the very first inference
on an M2 Pro 16 GB: 18 GB of weights + KV cache simply don't fit
alongside macOS and any open apps.
Qwen 2.5 Coder 14B 4-bit MLX is the right ceiling for 16 GB:
- 7.8 GB weights → fits with no swap
- native MLX 4-bit, no GGUF translation
- tool-calls validated end-to-end after the Format 3.5 parser fix
- strong code reasoning, decent PT-BR
Updates:
- launchers/Claude Chat.command, launchers/Claude Agentico.command:
  point MLX_MODEL_DEFAULT at the Qwen 14B HF id, refresh the loading
  message and tok/s expectations
- docs/MAC-BASE-SETUP.md: document the 4th failure mode (Qwen 2.5
  <tools> not parsed), explain the OOM on Gemma 31B, refresh the
  model recommendation table to mark Qwen 14B as the new default
  and Gemma 31B as not viable on 16 GB
Co-authored-by: Cursor <cursoragent@cursor.com>
…nthropic.com on startup

Confirmed via lsof that Claude Code 2.1.x opens a TCP connection to api.anthropic.com (160.79.104.10, owned by Anthropic PBC per whois) on startup, even when ANTHROPIC_BASE_URL points at localhost. The CLI fires telemetry, statsig feature flags, marketplace auto-install and autoupdater checks directly against api.anthropic.com, bypassing the configured base URL. That means "100% offline / 0$/mo" setups that don't set these env vars are NOT actually offline — code, prompts and telemetry leave the machine.

Fix: export 4 env vars in both launchers (and document them in docs/MAC-BASE-SETUP.md as the 5th failure mode):

CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
DISABLE_AUTOUPDATER=1
CLAUDE_CODE_DISABLE_OFFICIAL_MARKETPLACE_AUTOINSTALL=1
CLAUDE_CODE_DISABLE_BACKGROUND_TASKS=1

Validated end-to-end: with the env vars set, lsof shows the claude process only connects to localhost:4000 (the local MLX server), and the prompt "Hi" gets a clean response from Qwen 2.5 Coder 14B 4-bit MLX with stop_reason=end_turn.

Also documents that --print needs stdin closed (`</dev/null`) or the CLI hangs waiting for input even with the prompt passed as an arg.

Co-authored-by: Cursor <cursoragent@cursor.com>
Thanks @tadrianonet — this is a really substantial contribution and I appreciate the depth here. The standout for me is #5. The ChatML stop-marker extension and the Qwen 2.5 <tools> parser are very welcome too.

Two small things before merge; one of them is a tiny ordering issue in the marker loop:

    for stop_marker in ['<turn|>', '<|turn>', '<|im_end|>', '<|endoftext|>',
                        '<|im_start|>', '<|end_of_text|>', '<|eot_id|>']:
        if stop_marker in text:
            text = text[:text.index(stop_marker)].strip()
            break

This breaks on the first marker that appears in list order, not the one with the earliest position in the text. If two markers both show up, list order decides where we truncate instead of position — so output past the actual earliest marker can leak through. Suggested change:

    markers = ['<turn|>', '<|turn>', '<|im_end|>', '<|endoftext|>',
               '<|im_start|>', '<|end_of_text|>', '<|eot_id|>']
    positions = [text.index(m) for m in markers if m in text]
    if positions:
        text = text[:min(positions)].strip()

Edge case in practice, but worth fixing while you're here. Once those two are in, this is good to go. Thanks again — this is the kind of PR that makes me glad I open-sourced this thing.
Claude Code 2.1 sends `stream: true` on every request that involves tools and silently discards a non-streaming JSON response. Without this, every agentic session ended in `(No output)` because the CLI threw away the tool_use, retried the same prompt once, and gave up.

Add `send_anthropic_stream`, which replays a fully-generated message as the SSE event sequence the Anthropic Messages API documents: message_start -> content_block_start/delta/stop (per block) -> message_delta -> message_stop. text_delta for text blocks, input_json_delta for tool_use blocks. `Connection: close` so the client doesn't keep waiting after `message_stop`.

Also:
- filter tool_use input keys against the request's input_schema so hallucinated extra fields can't cause Claude Code to silently drop the call (the schema for Bash actually does accept "description", but Glob/Read/etc. are stricter).
- add MLX_DEBUG_REQUEST and MLX_DEBUG_RESPONSE env flags so users on smaller Macs can diagnose tool-call regressions without having to patch the server.

Validated end-to-end on M2 Pro 16 GB with Qwen 2.5 Coder 14B 4-bit: single-tool (Bash creating a file) and multi-tool (Write + Read in the same response) both round-trip cleanly through Claude Code.

Co-authored-by: Cursor <cursoragent@cursor.com>
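The replay order described above can be sketched as a generator over SSE frames. This is an illustration of the event sequence only — the function and helper names are hypothetical, and the real send_anthropic_stream in proxy/server.py also handles headers, usage accounting, and `Connection: close`:

```python
import json

def sse(event, data):
    # One SSE frame: "event:" line, "data:" line, blank-line separator.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def replay_message(message):
    """Yield a finished message as the Anthropic streaming event sequence."""
    yield sse("message_start", {"type": "message_start",
                                "message": {**message, "content": []}})
    for i, block in enumerate(message["content"]):
        if block["type"] == "text":
            yield sse("content_block_start",
                      {"type": "content_block_start", "index": i,
                       "content_block": {"type": "text", "text": ""}})
            yield sse("content_block_delta",
                      {"type": "content_block_delta", "index": i,
                       "delta": {"type": "text_delta", "text": block["text"]}})
        else:  # tool_use: arguments go out as one input_json_delta
            yield sse("content_block_start",
                      {"type": "content_block_start", "index": i,
                       "content_block": {"type": "tool_use", "id": block["id"],
                                         "name": block["name"], "input": {}}})
            yield sse("content_block_delta",
                      {"type": "content_block_delta", "index": i,
                       "delta": {"type": "input_json_delta",
                                 "partial_json": json.dumps(block["input"])}})
        yield sse("content_block_stop",
                  {"type": "content_block_stop", "index": i})
    yield sse("message_delta", {"type": "message_delta",
                                "delta": {"stop_reason": message["stop_reason"]},
                                "usage": {"output_tokens": 0}})
    yield sse("message_stop", {"type": "message_stop"})
```

Because the message is already fully generated, each text block goes out as a single text_delta and each tool_use input as a single input_json_delta, which is a valid (if coarse) instance of the documented stream shape.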
Without aggressive KV-cache quantization, agentic sessions on a 16 GB Mac OOM mid-prefill on Claude Code 2.1's ~5860-token system prompt (kIOGPUCommandBufferCallbackErrorOutOfMemory).

Default the Agentico and Chat launchers to MLX_KV_BITS=4 / MLX_KV_QUANT_START=0 so the ceiling stays comfortably under 16 GB even with multi-turn context. Plumb both env vars through ensure_mlx_server / force_restart_mlx_server in the common library so any launcher can override them per-model.

Document the SSE-streaming fix as section 6 in MAC-BASE-SETUP.md — this is the bug that gated the entire agentic mode on 16 GB Macs and took the longest to diagnose. Includes the symptom (No output, same prompt retried), the root cause (Claude Code 2.1 always sends stream=true when tools are present), the SSE event sequence the server now emits, the Connection: close detail, and the validated end-to-end output.

Co-authored-by: Cursor <cursoragent@cursor.com>
Why
The README targets Mac Max/Ultra (64-128 GB). On a Mac base/Pro 16 GB
five reproducible failures stop the project from working out of the box,
the worst of which silently breaks the "100% offline / 0$/mo" promise.
This branch documents and fixes all five.
What's broken on 16 GB (and how this branch fixes each)
1. macOS keychain bug — login screen on every launch even with ANTHROPIC_API_KEY set. The launcher's --bare flag doesn't exist on Claude Code 2.1.76. → Workaround: set hasCompletedOnboarding in ~/.claude.json, export ANTHROPIC_AUTH_TOKEN alongside ANTHROPIC_API_KEY, set DISABLE_LOGIN_COMMAND=1. Documented + automated in the new launchers.

2. clean_response only handled Gemma 4 stop markers (<turn|>, <|turn>). On any ChatML/Llama 3.x model the special tokens (<|im_end|>, <|endoftext|>, <|eot_id|>) leaked into the visible answer. → Extended the marker list (commit 1).
3. Claude Code 2.1's "extended thinking" exhausts small models on the first call (the thinking pass), leaving the answer pass with no budget → (No output). → Force --effort low in the launchers (disables the two-step flow). Documented in MAC-BASE-SETUP.md §3.

4. parse_tool_calls didn't recognize Qwen 2.5's <tools> wrapper. Qwen 2.5 Coder fine-tunes emit tool calls inside <tools>...</tools> instead of <tool_call> (Qwen 3.5 / generic ChatML). Without the parser knowing this, tool calls were returned as plain text and Claude Code never executed them. → Added Format 3.5 to the parser (commit 4).
5. (Most serious) Claude Code calls api.anthropic.com on startup even with ANTHROPIC_BASE_URL set. Confirmed via lsof:

claude TCP mac:63057->160.79.104.10:https (ESTABLISHED)

whois 160.79.104.10 → OrgName: Anthropic, PBC. dig +short api.anthropic.com → 160.79.104.10. The CLI fires telemetry, statsig feature flags, marketplace auto-install and autoupdater checks against api.anthropic.com directly, bypassing the configured base URL. That means setups following the current README are not actually offline.

→ Export 4 env vars in the launchers (and document them):

CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
DISABLE_AUTOUPDATER=1
CLAUDE_CODE_DISABLE_OFFICIAL_MARKETPLACE_AUTOINSTALL=1
CLAUDE_CODE_DISABLE_BACKGROUND_TASKS=1

With these set, lsof shows the claude process only connects to localhost:4000. Validated end-to-end.

Recommended model for 16 GB
mlx-community/Qwen2.5-Coder-14B-Instruct-4bit: full Claude Code system prompt (~3000 tokens).

Gemma 4 31B Abliterated (the upstream default) was tested first and crashes with kIOGPUCommandBufferCallbackErrorOutOfMemory on the very first inference on 16 GB — 18 GB of weights + KV cache simply don't fit alongside macOS. The MAC-BASE-SETUP.md table has been updated to reflect this.
Files
proxy/server.py: stop-marker fix + Format 3.5 parser
launchers/Claude Chat.command: chat-only (--tools "")
launchers/Claude Agentico.command: tools enabled (--permission-mode auto)
docs/MAC-BASE-SETUP.md: full PT-BR diagnosis of all 5 failures with reproducible commands and validation logs
Test plan
Loads Qwen2.5-Coder-14B-Instruct-4bit cleanly (~30s)
/v1/messages PT-BR call: clean text, no token leakage, stop_reason=end_turn
/v1/messages with tools=[Bash]: returns stop_reason=tool_use with Bash({"command":"pwd"})
--print "Hi" returns Hello! How can I assist you today? from the local model
lsof -p $CLAUDE_PID confirms zero connections to api.anthropic.com, only localhost:4000
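The tools=[Bash] check in the test plan can be scripted against the local proxy roughly like this. A sketch only: the port and model id are taken from this PR, the Bash tool schema is a minimal assumption, and the x-api-key value is a placeholder (whether the proxy checks it at all is an assumption):

```python
import json
import urllib.request

def build_tool_request(prompt, base_url="http://localhost:4000"):
    """Anthropic-style /v1/messages request exercising a single Bash tool."""
    payload = {
        "model": "mlx-community/Qwen2.5-Coder-14B-Instruct-4bit",
        "max_tokens": 256,
        "tools": [{
            "name": "Bash",
            "description": "Run a shell command",
            "input_schema": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        }],
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/messages",
        data=json.dumps(payload).encode(),
        headers={"content-type": "application/json",
                 "x-api-key": "local"},  # placeholder; local auth assumed off
    )

# With the proxy running, per the test plan a stop_reason of "tool_use"
# is expected for a prompt that needs the shell:
#   resp = json.load(urllib.request.urlopen(build_tool_request("print the cwd")))
#   resp["stop_reason"]
```

This only builds the request; sending it requires the MLX server from the launchers to be up on localhost:4000.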