
Mac base/Pro 16 GB support: Qwen 2.5 14B, ChatML stop markers, <tools> parser, offline leak fix#32

Open
tadrianonet wants to merge 8 commits into nicedreamzapp:main from
tadrianonet:mac-base-fixes

Conversation


@tadrianonet tadrianonet commented May 4, 2026

Why

The README targets Mac Max/Ultra (64-128 GB). On a base or Pro Mac with
16 GB, five reproducible failures stop the project from working out of
the box, the worst of which silently breaks the "100% offline / $0/mo"
promise.
This branch documents and fixes all five.

What's broken on 16 GB (and how this branch fixes each)

  1. macOS keychain bug — login screen on every launch even with
    ANTHROPIC_API_KEY set.
    The launcher's --bare flag doesn't
    exist on Claude Code 2.1.76. → Workaround: set hasCompletedOnboarding
    in ~/.claude.json, export ANTHROPIC_AUTH_TOKEN alongside
    ANTHROPIC_API_KEY, set DISABLE_LOGIN_COMMAND=1. Documented +
    automated in the new launchers.

  2. clean_response only handled Gemma 4 stop markers (<turn|>,
    <|turn>). On any ChatML/Llama 3.x model the special tokens leaked
    into the visible answer (<|im_end|>, <|endoftext|>, <|eot_id|>).
    → Extended the marker list (commit 1).

  3. Claude Code 2.1's "extended thinking" exhausts small models
    on the first call (the thinking pass), leaving the answer pass with
    no budget → (No output). → Force --effort low in the launchers
    (disables the two-step flow). Documented in MAC-BASE-SETUP.md §3.

  4. parse_tool_calls didn't recognize Qwen 2.5's <tools> wrapper.
    Qwen 2.5 Coder fine-tunes emit tool calls inside <tools>...</tools>
    instead of <tool_call> (Qwen 3.5 / generic ChatML). Without the
    parser knowing this, tool calls were returned as plain text and
    Claude Code never executed them. → Added Format 3.5 to the parser
    (commit 4).

  5. (Most serious) Claude Code calls api.anthropic.com on startup
    even with ANTHROPIC_BASE_URL set.
    Confirmed via lsof:

    claude TCP mac:63057->160.79.104.10:https (ESTABLISHED)

    whois 160.79.104.10          → OrgName: Anthropic, PBC
    dig +short api.anthropic.com → 160.79.104.10

The CLI fires telemetry, statsig feature flags, marketplace
auto-install and autoupdater checks against api.anthropic.com
directly, bypassing the configured base URL. That means setups
following the current README are not actually offline.

→ Export 4 env vars in the launchers (and document):

  CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
  DISABLE_AUTOUPDATER=1
  CLAUDE_CODE_DISABLE_OFFICIAL_MARKETPLACE_AUTOINSTALL=1
  CLAUDE_CODE_DISABLE_BACKGROUND_TASKS=1

With these set, lsof shows the claude process only connects to
localhost:4000. Validated end-to-end.
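The validation step above can be scripted. This is a hypothetical helper, not part of this branch: it scans `lsof` output for established TCP connections to anything other than the local MLX server, which is the check described above.

```python
def find_external_connections(lsof_output,
                              allowed=("localhost:4000", "127.0.0.1:4000")):
    """Return lsof lines describing established TCP connections to any
    endpoint other than the allowed local ones (illustrative helper)."""
    leaks = []
    for line in lsof_output.splitlines():
        if "TCP" not in line or "ESTABLISHED" not in line:
            continue
        # lsof prints connections as "local->remote (ESTABLISHED)"
        if "->" in line:
            remote = line.split("->", 1)[1].split(" ", 1)[0]
            if not any(a in remote for a in allowed):
                leaks.append(line.strip())
    return leaks
```

Feeding it the output of `lsof -p $CLAUDE_PID` before the fix would flag the `160.79.104.10:https` line; after the fix it should return an empty list.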

Recommended model for 16 GB

mlx-community/Qwen2.5-Coder-14B-Instruct-4bit:

  • 7.8 GB weights → fits in 16 GB unified memory with no swap
  • Tool calls validated end-to-end after Format 3.5 fix
  • ~10–15 tok/s generation, ~0.2–0.4 tok/s with cold prefill of the
    full Claude Code system prompt (~3000 tokens)

Gemma 4 31B Abliterated (the upstream default) was tested first and
crashes with kIOGPUCommandBufferCallbackErrorOutOfMemory on the very
first inference on 16 GB — 18 GB of weights + KV cache simply don't
fit alongside macOS. The MAC-BASE-SETUP.md table has been updated to
reflect this.

Files

  • proxy/server.py: stop-marker fix + Format 3.5 parser
  • launchers/Claude Chat.command: chat-only (--tools "")
  • launchers/Claude Agentico.command: tools enabled (--permission-mode auto)
  • docs/MAC-BASE-SETUP.md: full PT-BR diagnosis of all 5 failures with
    reproducible commands and validation logs

Test plan

  • Server loads Qwen2.5-Coder-14B-Instruct-4bit cleanly (~30s)
  • Plain /v1/messages PT-BR call: clean text, no token leakage,
    stop_reason=end_turn
  • /v1/messages with tools=[Bash]: returns stop_reason=tool_use
    with Bash({"command":"pwd"})
  • End-to-end with Claude Code CLI 2.1.76: --print "Hi" returns
    Hello! How can I assist you today? from the local model
  • lsof -p $CLAUDE_PID confirms zero connections to
    api.anthropic.com, only localhost:4000

Thiago Adriano and others added 6 commits May 3, 2026 06:40
The clean_response function only handled Gemma 4 stop markers
(<turn|>, <|turn>). When the server runs other model families the
markers were not stripped and ended up bleeding into the visible
response. Symptom from Qwen 3.5 4B: replies arrived as
"<|endoftext|><|im_start|>user\n<system-".

Add the ChatML and Llama 3.x markers to the truncation list:
- <|im_end|>, <|im_start|>, <|endoftext|>  (Qwen, Mistral, ChatML)
- <|eot_id|>, <|end_of_text|>              (Llama 3.x)

This is a generic fix that benefits any non-Gemma model loaded via
MLX_MODEL, including the official Qwen 3.5 122B fighter and any
mlx-community ChatML-based model.

Co-authored-by: Cursor <cursoragent@cursor.com>
Two new double-clickable launchers tuned for Apple Silicon Macs with
16-32 GB unified memory, where the existing Gemma/Llama/Qwen launchers
hit three reproducible issues that the upstream README does not cover.

The launchers apply three workarounds:

1. macOS keychain auth bug (issues anthropics/claude-code#25069 and
   #27900): export ANTHROPIC_AUTH_TOKEN alongside ANTHROPIC_API_KEY and
   set DISABLE_LOGIN_COMMAND=1 so the local API key is honored in
   interactive mode instead of falling through to the OAuth login
   selection screen.

2. Claude Code 2.1 extended thinking: pass --effort low to disable the
   two-call thinking-then-answer flow. Small/quantized models exhaust
   their budget on the first call and emit empty replies on the
   second one (the "(No output)" symptom).

3. Mac base RAM pressure: default to Gemma 4 31B Abliterated 4-bit
   (recommended by upstream for tool-call reliability) but document
   that it will swap on 16 GB and run at ~5-8 tok/s.

Claude Chat.command   - tools disabled (--tools ""), conversation only.
                        Stable on the Qwen 3.5 4B fallback too.
Claude Agentico.command - tools enabled, --permission-mode auto.
                          Suitable only with Gemma 31B+ on 16 GB.

Both reuse lib/claude-local-common.sh (resolve_mlx_model,
ensure_mlx_server) so model-aware restart and local-cache resolution
work the same as the upstream launchers.
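Workaround 1 above can be sketched in a few lines; this is an illustration of the `hasCompletedOnboarding` step described in the PR, assuming a top-level key in `~/.claude.json`, not the launchers' actual shell code.

```python
import json
import os

def mark_onboarding_complete(config_path=os.path.expanduser("~/.claude.json")):
    """Set hasCompletedOnboarding so the CLI skips the OAuth login
    selection screen (key name as described in this PR)."""
    config = {}
    if os.path.exists(config_path):
        with open(config_path) as f:
            config = json.load(f)
    config["hasCompletedOnboarding"] = True
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
```

The launchers pair this with `ANTHROPIC_AUTH_TOKEN` and `DISABLE_LOGIN_COMMAND=1` as listed in point 1.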

Co-authored-by: Cursor <cursoragent@cursor.com>
Step-by-step guide explaining the three issues a user with M1/M2/M3/M4
base or Pro (16-32 GB) hits when following the official quick start,
plus the fix applied in this branch:

1. macOS keychain auth bug → ANTHROPIC_AUTH_TOKEN + hasCompletedOnboarding
2. Stop-token leakage on non-Gemma models → clean_response patch
3. (No output) on every reply → --effort low

Also includes a model recommendation table for 16 GB (Qwen 4B vs
Llama 8B vs Qwen Coder 14B vs Gemma 31B), with disk/RAM/speed/tool-call
trade-offs measured during testing on M2 Pro 16 GB, and a realistic
list of what works and what does not on this hardware tier.

Written in Portuguese (PT-BR) since the upstream documentation is
English-only and this fork specifically targets a hardware profile
underserved by the original docs. Open to translating if there is
demand.

Co-authored-by: Cursor <cursoragent@cursor.com>
Qwen 2.5 Coder fine-tunes (mlx-community/Qwen2.5-Coder-14B-Instruct-4bit)
emit tool calls inside <tools>...</tools> instead of the <tool_call>
wrapper used by Qwen 3.5 / generic ChatML. parse_tool_calls() didn't
recognize this format, so calls were returned as plain text and Claude
Code never executed them.

Add Format 3.5 to parse_tool_calls() that:
  - matches <tools>...</tools> with re.DOTALL
  - parses the inner JSON as either a single object or a list
  - accepts both "arguments" and "parameters" payload keys
  - falls back to recover_garbled_tool_json on JSONDecodeError

Validated: Qwen 2.5 Coder 14B 4-bit MLX returns stop_reason=tool_use
with Bash({"command":"pwd"}) on a single-tool prompt.
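The four bullets above map to roughly this shape. A sketch only, not the actual `proxy/server.py` code: the real implementation also falls back to `recover_garbled_tool_json` on decode errors, which is skipped here.

```python
import json
import re

TOOLS_RE = re.compile(r"<tools>(.*?)</tools>", re.DOTALL)

def parse_tools_wrapper(text):
    """Sketch of Format 3.5: extract tool calls from Qwen 2.5's
    <tools>...</tools> wrapper. Accepts a single JSON object or a
    list, and either "arguments" or "parameters" as the payload key."""
    calls = []
    for match in TOOLS_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # real code: recover_garbled_tool_json fallback
        items = payload if isinstance(payload, list) else [payload]
        for item in items:
            args = item.get("arguments", item.get("parameters", {}))
            calls.append({"name": item.get("name"), "input": args})
    return calls
```

On the validated prompt, `'<tools>{"name": "Bash", "arguments": {"command": "pwd"}}</tools>'` yields a single `Bash` call with input `{"command": "pwd"}`.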

Co-authored-by: Cursor <cursoragent@cursor.com>
Gemma 4 31B 4-bit (the upstream default) crashes with
kIOGPUCommandBufferCallbackErrorOutOfMemory on the very first inference
on an M2 Pro 16 GB: 18 GB of weights + KV cache simply don't fit
alongside macOS and any open apps.

Qwen 2.5 Coder 14B 4-bit MLX is the right ceiling for 16 GB:
  - 7.8 GB weights → fits with no swap
  - native MLX 4-bit, no GGUF translation
  - tool-calls validated end-to-end after the Format 3.5 parser fix
  - strong code reasoning, decent PT-BR

Updates:
  - launchers/Claude Chat.command, launchers/Claude Agentico.command:
    point MLX_MODEL_DEFAULT at the Qwen 14B HF id, refresh the loading
    message and tok/s expectations
  - docs/MAC-BASE-SETUP.md: document the 4th failure mode (Qwen 2.5
    <tools> not parsed), explain the OOM on Gemma 31B, refresh the
    model recommendation table to mark Qwen 14B as the new default
    and Gemma 31B as unviable on 16 GB ("inviavel em 16 GB")

Co-authored-by: Cursor <cursoragent@cursor.com>
…nthropic.com on startup

Confirmed via lsof that Claude Code 2.1.x opens a TCP connection to
api.anthropic.com (160.79.104.10, owned by Anthropic PBC per whois) on
startup even when ANTHROPIC_BASE_URL points at localhost. The CLI
fires telemetry, statsig feature flags, marketplace autoinstall and
autoupdater checks directly against api.anthropic.com, bypassing the
configured base URL.

That means "100% offline / $0/mo" setups that don't set these env vars
are NOT actually offline: code, prompts and telemetry leave the machine.

Fix: export 4 env vars in both launchers (and document them in
docs/MAC-BASE-SETUP.md as the 5th failure mode):

  CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
  DISABLE_AUTOUPDATER=1
  CLAUDE_CODE_DISABLE_OFFICIAL_MARKETPLACE_AUTOINSTALL=1
  CLAUDE_CODE_DISABLE_BACKGROUND_TASKS=1

Validated end-to-end: with the env vars set, lsof shows the claude
process only connects to localhost:4000 (the local MLX server), and
the prompt "Hi" gets a clean response from Qwen 2.5 Coder 14B 4-bit
MLX with stop_reason=end_turn.

Also documents that --print needs stdin closed (`</dev/null`) or the
CLI hangs waiting for input even with the prompt passed as arg.

Co-authored-by: Cursor <cursoragent@cursor.com>
@nicedreamzapp (Owner) commented:

Thanks @tadrianonet — this is a really substantial contribution and I appreciate the depth here.

The standout for me is #5 — the api.anthropic.com leak on startup. The repo's whole pitch is "100% offline / $0/mo" and your lsof evidence shows that wasn't actually true without CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 and friends. That's the kind of finding I'd never have caught on Max/Ultra hardware where the auth path "just works" and nobody packet-watches the startup. Glad you did, and validating with lsof after the fix is exactly the right receipts. Merging that bit alone is worth the PR.

The ChatML stop-marker extension, the Qwen 2.5 <tools> Format 3.5 parser (nicely gated behind if not tool_calls: so it can't fight existing paths), and the --effort low workaround for the small-model two-pass thinking issue are all clean and well-motivated. The keychain workaround is the known upstream pattern, so no surprises there.

Two small things before merge:

1. MAC-BASE-SETUP.md is PT-BR only — could you add an English version, or at minimum an English summary at the top? Most of the repo's audience is English-speaking, and the diagnostics in this doc (especially the offline-leak section) are valuable enough that I want them to land for everyone. Keeping the PT-BR alongside as MAC-BASE-SETUP.pt-BR.md would be totally welcome.

2. Tiny ordering issue in the clean_response stop-marker loop:

for stop_marker in ['<turn|>', '<|turn>', '<|im_end|>', '<|endoftext|>',
                    '<|im_start|>', '<|end_of_text|>', '<|eot_id|>']:
    if stop_marker in text:
        text = text[:text.index(stop_marker)].strip()
        break

This breaks on the first marker that appears in list order, not the one with the earliest position in the text. If two markers both show up, list order decides where we truncate instead of position — so output past the actual earliest marker can leak through. Suggested change:

markers = ['<turn|>', '<|turn>', '<|im_end|>', '<|endoftext|>',
           '<|im_start|>', '<|end_of_text|>', '<|eot_id|>']
positions = [text.index(m) for m in markers if m in text]
if positions:
    text = text[:min(positions)].strip()

Edge case in practice, but worth fixing while you're here.
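To make the edge case concrete, here are both versions side by side with an input where they disagree: `<|im_end|>` sits earlier in the marker *list* but later in the *text* than `<|endoftext|>`, so the loop version truncates too late and leaks the tail.

```python
MARKERS = ['<turn|>', '<|turn>', '<|im_end|>', '<|endoftext|>',
           '<|im_start|>', '<|end_of_text|>', '<|eot_id|>']

def truncate_list_order(text):
    # current behavior: break on the first marker in list order
    for m in MARKERS:
        if m in text:
            return text[:text.index(m)].strip()
    return text

def truncate_earliest(text):
    # suggested behavior: cut at the earliest marker position in the text
    positions = [text.index(m) for m in MARKERS if m in text]
    return text[:min(positions)].strip() if positions else text

sample = "answer<|endoftext|>leaked junk<|im_end|>"
```

On `sample`, the list-order loop keeps `"answer<|endoftext|>leaked junk"` (it cut at `<|im_end|>`), while the position-based version correctly returns `"answer"`.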

Once those two are in, this is good to go. Thanks again — this is the kind of PR that makes me glad I open-sourced this thing.

Thiago Adriano and others added 2 commits May 3, 2026 23:36
Claude Code 2.1 sends `stream: true` on every request that involves
tools and silently discards a non-streaming JSON response. Without
this, every agentic session ended in `(No output)` because the CLI
threw away the tool_use, retried the same prompt once, and gave up.

Add `send_anthropic_stream` that replays a fully-generated message
as the SSE event sequence the Anthropic Messages API documents:
message_start -> content_block_start/delta/stop (per block) ->
message_delta -> message_stop. text_delta for text blocks,
input_json_delta for tool_use blocks. `Connection: close` so the
client doesn't keep waiting after `message_stop`.

Also:
- filter tool_use input keys against the request's input_schema so
  hallucinated extra fields can't cause Claude Code to silently drop
  the call (the schema for Bash actually does accept "description",
  but Glob/Read/etc. are stricter).
- add MLX_DEBUG_REQUEST and MLX_DEBUG_RESPONSE env flags so users
  on smaller Macs can diagnose tool-call regressions without having
  to patch the server.
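The input-key filtering described in the first bullet is essentially a set intersection against the tool's JSON-Schema `properties`. A sketch under that assumption (not the server's exact code):

```python
def filter_tool_input(tool_input, input_schema):
    """Drop hallucinated keys that are not declared in the tool's
    input_schema, so strict tools (Glob/Read/etc.) don't silently
    reject the call."""
    allowed = set(input_schema.get("properties", {}))
    return {k: v for k, v in tool_input.items() if k in allowed}
```

For a Bash-style schema declaring only `command`, an input carrying a stray `description` key would be trimmed to just `{"command": ...}`.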

Validated end-to-end on M2 Pro 16 GB with Qwen 2.5 Coder 14B 4-bit:
single-tool (Bash creating a file) and multi-tool (Write + Read in
the same response) both round-trip cleanly through Claude Code.
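The replay idea can be sketched as a generator. Event names follow the sequence listed above; the per-event payloads here are simplified illustrations, not the server's exact wire format.

```python
import json

def sse_events(message):
    """Replay a fully generated message as an SSE sequence:
    message_start -> content_block_start/delta/stop (per block) ->
    message_delta -> message_stop. Simplified sketch."""
    def event(name, data):
        return f"event: {name}\ndata: {json.dumps(data)}\n\n"
    yield event("message_start", {"type": "message_start"})
    for i, block in enumerate(message["content"]):
        yield event("content_block_start",
                    {"type": "content_block_start", "index": i})
        if block["type"] == "text":
            delta = {"type": "text_delta", "text": block["text"]}
        else:  # tool_use blocks stream arguments as input_json_delta
            delta = {"type": "input_json_delta",
                     "partial_json": json.dumps(block["input"])}
        yield event("content_block_delta",
                    {"type": "content_block_delta", "index": i, "delta": delta})
        yield event("content_block_stop",
                    {"type": "content_block_stop", "index": i})
    yield event("message_delta",
                {"type": "message_delta",
                 "delta": {"stop_reason": message.get("stop_reason")}})
    yield event("message_stop", {"type": "message_stop"})
```

A one-text-block message with `stop_reason=end_turn` produces exactly the six-event sequence the commit describes.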

Co-authored-by: Cursor <cursoragent@cursor.com>
Without aggressive KV-cache quantization, agentic sessions on a 16 GB
Mac OOM mid-prefill on Claude Code 2.1's ~5860-token system prompt
(kIOGPUCommandBufferCallbackErrorOutOfMemory). Default the Agentico
and Chat launchers to MLX_KV_BITS=4 / MLX_KV_QUANT_START=0 so the
ceiling stays comfortably under 16 GB even with multi-turn context.

Plumb both env vars through ensure_mlx_server / force_restart_mlx_server
in the common library so any launcher can override per-model.

Document the SSE-streaming fix as section 6 in MAC-BASE-SETUP.md —
this is the bug that gated the entire agentic mode on 16 GB Macs and
took the longest to diagnose. Includes the symptom (No output, same
prompt retried), root cause (Claude Code 2.1 always sends stream=true
when tools are present), the SSE event sequence the server now emits,
the Connection: close detail, and the validated end-to-end output.

Co-authored-by: Cursor <cursoragent@cursor.com>