Qwen3.5-35B-A3B-4bit can emit malformed tool-call output around 20k prompt tokens. #1061

@Christ9038

Description

I am seeing malformed tool-call output with Qwen3.5-35B-A3B-4bit.

The key point is that this is reproducible with a direct mlx_lm.stream_generate(...) call, with no OpenAI-compatible server wrapper and no downstream parser involved.

Environment

  • mlx_lm==0.31.1
  • model: mlx-community/Qwen3.5-35B-A3B-4bit
  • prompt length in this repro: 20043 tokens
  • tool calling enabled
  • sampling used in the repro:
    • temperature=0.7
    • top_p=0.95
    • top_k=0
    • min_p=0.0
    • repetition_penalty=1.3
    • presence_penalty=0.0
    • frequency_penalty=0.0
    • max_tokens=32768
    • seed=3
  • chat template kwargs:
    • {"enable_thinking": false}

What I attached

  • request.json
    • direct replay bundle for mlx_lm
    • contains normalized messages, tools, sampling, and the fully rendered prompt
  • test_script.py
    • minimal script
    • just loads the model and streams the raw output
    • no parsing, no post-processing, no OpenAI server layer

Reproduction

Edit the model path on line 7 of request.json.

Run:

python test_script.py request.json

The script simply does:

  • load(model_path)
  • mx.random.seed(seed)
  • stream_generate(...)
  • print raw text to stdout
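The full script is attached; in outline it is roughly the following (the request.json field names here are assumptions for illustration, the attached bundle defines the actual schema):

```python
import json
import sys

def load_request(path):
    """Read the replay bundle: model path, rendered prompt, sampling params."""
    with open(path) as f:
        return json.load(f)

def main(path):
    req = load_request(path)
    # Heavy imports are deferred so the helper above stays usable without MLX.
    import mlx.core as mx
    from mlx_lm import load, stream_generate
    from mlx_lm.sample_utils import make_logits_processors, make_sampler

    model, tokenizer = load(req["model"])
    mx.random.seed(req["seed"])                # seed=3 in this repro
    sampler = make_sampler(
        temp=req["temperature"],               # 0.7
        top_p=req["top_p"],                    # 0.95
        min_p=req["min_p"],                    # 0.0
        top_k=req["top_k"],                    # 0
    )
    logits_processors = make_logits_processors(
        repetition_penalty=req["repetition_penalty"],  # 1.3
    )
    # Stream the raw text to stdout with no parsing or post-processing.
    for chunk in stream_generate(
        model,
        tokenizer,
        req["prompt"],                         # the fully rendered prompt
        max_tokens=req["max_tokens"],          # 32768
        sampler=sampler,
        logits_processors=logits_processors,
    ):
        print(chunk.text, end="", flush=True)

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```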

Expected behavior

The model should either:

  • emit a valid, fully closed tool-call block, or
  • answer in plain text

It should not emit an incomplete / malformed tool-call block.

Actual behavior

With the attached prompt and seed=3, the direct raw generation produces malformed output like this:

好嘞!我这就用 agent-browser 来查: ("Sure! I'll use agent-browser to check:")

<tool_call>
<function=browser>

This is a direct mlx_lm replay, so this is not caused by an external OpenAI-compatible server wrapper or a client-side parser.

Additional observation

This specific attached request is the last clean request before malformed tool-call content started polluting later history.

After one malformed tool-call appears in history, later turns become much more likely to produce even worse corrupted blocks, for example:

  • unclosed <tool_call>
  • unclosed <function=...>
  • stray tags like </arg_key>
  • malformed parameter sections such as binaryNameFromEntryPointMap
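A quick mechanical way to spot these corrupted blocks is to count opening versus closing tags in the raw output. The tag set below is inferred from the observed corruption, not from any official spec:

```python
import re

# Open/close tag pairs seen in the corrupted outputs (inferred, may be incomplete).
TAG_PAIRS = [
    (r"<tool_call>", r"</tool_call>"),
    (r"<function=[^>]*>", r"</function>"),
    (r"<arg_key>", r"</arg_key>"),
    (r"<arg_value>", r"</arg_value>"),
]

def unbalanced_tags(text):
    """Return (open_pattern, open_count, close_count) for each mismatched pair."""
    problems = []
    for open_pat, close_pat in TAG_PAIRS:
        n_open = len(re.findall(open_pat, text))
        n_close = len(re.findall(close_pat, text))
        if n_open != n_close:
            problems.append((open_pat, n_open, n_close))
    return problems
```

Running this over the output shown above flags both the unclosed `<tool_call>` and the unclosed `<function=browser>`.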

So there seem to be two separate issues:

  1. the first malformed tool-call can already happen from a clean context
  2. once it happens, feeding that malformed output back into history causes strong cascading contamination

As a comparison, I tested the GGUF version, and the exact same issue does not occur there.

Secondary note about mlx_lm.server

I also tested the same context through mlx_lm.server in streaming mode.

In that path, the client only received the leading assistant text, then the stream terminated without [DONE], and the server raised:

ValueError: No function provided.

inside the tool parser.
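A minimal sketch of why a streaming tool parser can end up raising here (the real mlx_lm parser differs; this only illustrates that a truncated or name-less `<function=...>` tag leaves nothing to dispatch):

```python
import re

def parse_function_name(block):
    """Hypothetical parser step: extract the function name from a tool-call block.

    Raises if the <function=...> tag is missing, truncated, or has an empty name,
    mirroring the kind of failure seen in the server's tool parser.
    """
    m = re.search(r"<function=([^>\s]*)>", block)
    if m is None or not m.group(1):
        raise ValueError("No function provided.")
    return m.group(1)
```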

That server behavior is secondary, though. The main issue here is that the direct raw mlx_lm generation already produces malformed tool-call output.
