Description
I am seeing malformed tool-call output with Qwen3.5-35B-A3B-4bit.
The important part is that this is reproducible with a direct mlx_lm.stream_generate(...) call, without going through any OpenAI-compatible server wrapper and without any downstream parser.
Environment
- mlx_lm==0.31.1
- model: mlx-community/Qwen3.5-35B-A3B-4bit
- prompt length in this repro: 20043 tokens
- tool calling enabled
- sampling used in the repro: temperature=0.7, top_p=0.95, top_k=0, min_p=0.0, repetition_penalty=1.3, presence_penalty=0.0, frequency_penalty=0.0, max_tokens=32768, seed=3
- chat template kwargs: {"enable_thinking": false}
What I attached
- request.json
  - direct replay bundle for mlx_lm
  - contains the normalized messages, tools, sampling settings, and the fully rendered prompt
- test_script.py
  - minimal script
  - just loads the model and streams the raw output
  - no parsing, no post-processing, no OpenAI server layer
Reproduction
Edit the model path on line 7 of request.json.
Run:
python test_script.py request.json
The script simply does:
- load(model_path)
- mx.random.seed(seed)
- stream_generate(...)
- print raw text to stdout
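For context, the core of the replay script looks roughly like the sketch below. The key names ("model", "messages", "tools", "sampling") and the helper name load_request are assumptions about the bundle layout, not the exact contents of the attached test_script.py; the mlx_lm import is kept inside main() so the JSON helper can run without the model installed.

```python
import json
import sys


def load_request(path):
    """Parse a replay bundle into (messages, tools, sampling).

    The key names here mirror an assumed request.json layout; adjust
    them if your bundle differs.
    """
    with open(path) as f:
        req = json.load(f)
    return req["messages"], req.get("tools"), req.get("sampling", {})


def main(path):
    # Imported lazily so load_request() is usable without mlx_lm.
    import mlx.core as mx
    from mlx_lm import load, stream_generate

    messages, tools, sampling = load_request(path)
    # The model path itself also lives in request.json (line 7).
    with open(path) as f:
        model_path = json.load(f)["model"]  # assumed key name
    model, tokenizer = load(model_path)

    mx.random.seed(sampling.get("seed", 3))
    prompt = tokenizer.apply_chat_template(
        messages,
        tools=tools,
        add_generation_prompt=True,
        enable_thinking=False,
    )
    # Print the raw stream with no parsing or post-processing.
    for chunk in stream_generate(
        model, tokenizer, prompt,
        max_tokens=sampling.get("max_tokens", 32768),
    ):
        print(chunk.text, end="", flush=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```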
Expected behavior
The model should either:
- emit a valid, fully closed tool-call block, or
- answer in plain text
It should not emit an incomplete / malformed tool-call block.
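"Fully closed" can be made concrete with a small tag-balance check. This is only a diagnostic sketch (the helper name is mine, and it is not the parser mlx_lm uses); the tag names follow the Qwen-style tool-call markup that appears in the actual output further down.

```python
import re


def tool_call_blocks_closed(text: str) -> bool:
    """Return True when every <tool_call> has a matching </tool_call>
    and every <function=...> has a matching </function>.

    A count-based sketch: it only verifies that opens and closes pair
    up, not that the block contents are otherwise valid.
    """
    for open_tag, close_tag in (
        (r"<tool_call>", r"</tool_call>"),
        (r"<function=", r"</function>"),
    ):
        if len(re.findall(open_tag, text)) != len(re.findall(close_tag, text)):
            return False
    return True


# Plain text passes; the truncated output from this report fails.
assert tool_call_blocks_closed("ok, no tool call") is True
assert tool_call_blocks_closed("<tool_call>\n<function=browser>") is False
```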
Actual behavior
With the attached prompt and seed=3, the direct raw generation produces malformed output like this:
好嘞!我这就用 agent-browser 来查:
<tool_call>
<function=browser>
This is a direct mlx_lm replay, so this is not caused by an external OpenAI-compatible server wrapper or a client-side parser.
Additional observation
This specific attached request is the last clean request before malformed tool-call content started polluting later history.
After one malformed tool-call appears in history, later turns become much more likely to produce even worse corrupted blocks, for example:
- unclosed <tool_call>
- unclosed <function=...>
- stray tags like </arg_key>
- malformed parameter sections such as binaryNameFromEntryPointMap
So there seem to be two separate issues:
- the first malformed tool-call can already happen from a clean context
- once it happens, feeding that malformed output back into history causes strong cascading contamination
As a comparison, I tested the GGUF version and the exact same issue does not occur.
Secondary note about mlx_lm.server
I also tested the same context through mlx_lm.server in streaming mode.
In that path, the client only received the leading assistant text, then the stream terminated without [DONE], and the server raised:
ValueError: No function provided.
inside the tool parser.
That server behavior is secondary, though. The main issue here is that the direct raw mlx_lm generation already produces malformed tool-call output.