Description
Environment
- Model: mlx-community/Nemotron-Cascade-2-30B-A3B-8bit
- mlx-lm version: 0.31.1
- Hardware: Apple Silicon M1 Max, 64GB unified memory
- Python 3.12
Problem
When running Nemotron-Cascade-2-30B-A3B with enable_thinking=True, the model never generates </think> — it loops indefinitely in the reasoning block and never produces a response.
With enable_thinking=False, generation stops after only a few tokens with poor quality output.
Findings
</think> is token ID 13, which is not included in the default EOS token set ({2, 11}). Without it in the EOS set, the model has no stop condition for the thinking block.
vLLM handles this via --reasoning-parser nemotron_v3, which extends DeepSeekR1ReasoningParser and watches for </think> to transition from reasoning to content. No equivalent exists in mlx-lm for this model.
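For reference, the transition vLLM's parser performs can be sketched as a small streaming state machine. This is a standalone illustration of the idea, not vLLM or mlx-lm code; the class and method names are invented for the example:

```python
class ThinkTagSplitter:
    """Routes streamed text into reasoning until </think> appears,
    then into content. Rough sketch of what a reasoning parser does."""

    END_TAG = "</think>"

    def __init__(self):
        self.reasoning = []
        self.content = []
        self._buffer = ""          # holds text until the end tag is found
        self._in_reasoning = True

    def feed(self, chunk: str) -> None:
        if not self._in_reasoning:
            self.content.append(chunk)
            return
        # Buffer so the tag is detected even when split across chunks.
        self._buffer += chunk
        idx = self._buffer.find(self.END_TAG)
        if idx != -1:
            self.reasoning.append(self._buffer[:idx])
            self.content.append(self._buffer[idx + len(self.END_TAG):])
            self._buffer = ""
            self._in_reasoning = False

    def finish(self):
        # Flush any unterminated reasoning text (the looping case above).
        if self._buffer:
            self.reasoning.append(self._buffer)
            self._buffer = ""
        return "".join(self.reasoning), "".join(self.content)
```

Feeding chunks like "done</thi" followed by "nk>answer" still splits correctly, since the tag match happens on the accumulated buffer rather than per chunk.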
Workaround attempted
Manually adding token 13 to eos_token_ids stops the thinking phase cleanly. However, the answer phase immediately re-enters thinking, producing incoherent output:
```python
from mlx_lm import stream_generate

# Phase 1: thinking; adding </think> (token 13) to the EOS set stops it cleanly
thinking = ""
tokenizer.eos_token_ids.add(13)
for r in stream_generate(model, tokenizer, prompt=prompt, max_tokens=4096, sampler=sampler):
    thinking += r.text
    if r.finish_reason:
        break
tokenizer.eos_token_ids.discard(13)

# Phase 2: answer; generation immediately re-enters thinking, output is incoherent
prompt2 = prompt + thinking + "</think>\n"
for r in stream_generate(model, tokenizer, prompt=prompt2, ...):
    ...
```

Comparison
The same model via Ollama (GGUF Q4_K_M) works correctly — thinking terminates naturally and produces coherent output. This suggests the issue is in how mlx-lm handles the thinking state machine for this architecture, not the model weights themselves.
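Until a proper parser exists, a user-side stopgap that avoids mutating tokenizer.eos_token_ids is to watch token IDs in the streaming loop itself and split there in a single pass. The sketch below runs against a stub response stream so the logic is self-contained; in real use the loop would iterate stream_generate responses, which are assumed to expose .token and .text as above:

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

THINK_END_ID = 13  # </think> for this model, per the findings above

@dataclass
class Response:
    """Stand-in for mlx-lm's per-step generation response."""
    token: int
    text: str

def split_stream(responses: Iterable[Response]) -> Tuple[str, str]:
    """Collect reasoning until the </think> token ID appears,
    then collect the remainder as the answer, in one pass."""
    reasoning, answer = [], []
    in_reasoning = True
    for r in responses:
        if in_reasoning and r.token == THINK_END_ID:
            in_reasoning = False
            continue  # drop the tag token itself
        (reasoning if in_reasoning else answer).append(r.text)
    return "".join(reasoning), "".join(answer)
```

This sidesteps the phase-restart problem in the workaround above, though it only helps if the model does emit token 13 at some point; it does not fix the underlying looping behavior.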
Request
Would it be possible to implement a reasoning parser equivalent for nemotron_h / Nemotron-Cascade in mlx-lm, similar to how vLLM's nemotron_v3 parser handles the <think>/</think> transition?
The model runs well on Apple Silicon (MLX 8-bit loads correctly at ~33GB) and should be significantly faster than the GGUF alternative once thinking mode is handled correctly.