
Nemotron-Cascade-2: thinking mode infinite loop on Apple Silicon #1050

@roeex5

Description

Environment

  • Model: mlx-community/Nemotron-Cascade-2-30B-A3B-8bit
  • mlx-lm version: 0.31.1
  • Apple Silicon M1 Max, 64GB unified memory
  • Python 3.12

Problem

When running Nemotron-Cascade-2-30B-A3B with enable_thinking=True, the model never generates </think> — it loops indefinitely in the reasoning block and never produces a response.

With enable_thinking=False, generation stops after only a few tokens with poor quality output.

Findings

</think> is token ID 13, which is not included in the default EOS token set ({2, 11}). Without it in the EOS set, the model has no stop condition for the thinking block.
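The missing stop condition can be sketched in a few lines (the token ids are the ones observed above; `should_stop` is an illustrative stand-in for the generator's stopping check, not actual mlx-lm API):

```python
# Illustrative values from this report, not read from the model at runtime.
DEFAULT_EOS_IDS = {2, 11}   # default EOS token set for this model
END_THINK_ID = 13           # token id of "</think>"

def should_stop(token_id: int, eos_ids: set[int]) -> bool:
    """Generation stops only when the sampled token id is in the EOS set."""
    return token_id in eos_ids

# "</think>" never terminates generation with the default EOS set...
assert not should_stop(END_THINK_ID, DEFAULT_EOS_IDS)
# ...but does once it is added, which is the workaround below.
assert should_stop(END_THINK_ID, DEFAULT_EOS_IDS | {END_THINK_ID})
```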

vLLM handles this via --reasoning-parser nemotron_v3, which extends DeepSeekR1ReasoningParser and watches for </think> to transition from reasoning to content. No equivalent exists in mlx-lm for this model.
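For reference, a minimal streaming splitter in the spirit of that parser could look like the sketch below. The class name and interface are hypothetical, not existing mlx-lm or vLLM API; it only shows the state machine that routes streamed text to a reasoning buffer until `</think>` appears, then to a content buffer:

```python
# Hypothetical sketch, loosely modeled on the behavior of vLLM's
# DeepSeekR1ReasoningParser: watch streamed text for the "</think>"
# marker and split reasoning from content at that point.

END_THINK = "</think>"

class ReasoningSplitter:
    def __init__(self) -> None:
        self._reasoning: list[str] = []
        self._content: list[str] = []
        self._buf = ""          # tail that may still turn into "</think>"
        self._in_think = True

    def feed(self, chunk: str) -> None:
        if not self._in_think:
            self._content.append(chunk)
            return
        self._buf += chunk
        idx = self._buf.find(END_THINK)
        if idx != -1:
            # Marker complete: switch from reasoning to content.
            self._reasoning.append(self._buf[:idx])
            self._content.append(self._buf[idx + len(END_THINK):])
            self._buf = ""
            self._in_think = False
        else:
            # Emit all but a tail that could still be a marker prefix.
            safe = max(0, len(self._buf) - len(END_THINK) + 1)
            self._reasoning.append(self._buf[:safe])
            self._buf = self._buf[safe:]

    def result(self) -> tuple[str, str]:
        reasoning = "".join(self._reasoning) + (self._buf if self._in_think else "")
        return reasoning, "".join(self._content)
```

Feeding it the `r.text` chunks from `stream_generate` would then yield separated reasoning and answer strings once the model emits `</think>`, even when the marker is split across chunks.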

Workaround attempted

Manually adding token 13 to eos_token_ids stops the thinking phase cleanly. However, the answer phase immediately re-enters thinking, producing incoherent output:

# model, tokenizer, prompt, and sampler set up as usual
from mlx_lm import stream_generate

# Phase 1 (thinking): temporarily treat </think> (token id 13) as EOS
thinking = ""
tokenizer.eos_token_ids.add(13)
for r in stream_generate(model, tokenizer, prompt=prompt, max_tokens=4096, sampler=sampler):
    thinking += r.text
    if r.finish_reason:
        break
tokenizer.eos_token_ids.discard(13)

# Phase 2 (answer): the model re-enters the thinking loop; output is incoherent
prompt2 = prompt + thinking + "</think>\n"
for r in stream_generate(model, tokenizer, prompt=prompt2, ...):
    ...

Comparison

The same model via Ollama (GGUF Q4_K_M) works correctly — thinking terminates naturally and produces coherent output. This suggests the issue is in how mlx-lm handles the thinking state machine for this architecture, not the model weights themselves.

Request

Would it be possible to implement a reasoning parser equivalent for nemotron_h / Nemotron-Cascade in mlx-lm, similar to how vLLM's nemotron_v3 parser handles the <think>/</think> transition?

The model runs well on Apple Silicon (MLX 8-bit loads correctly at ~33GB) and should be significantly faster than the GGUF alternative once thinking mode is handled correctly.
