Description
Environment
- Model: mlx-community/Nemotron-Cascade-2-30B-A3B-8bit
- mlx-lm version: 0.31.1
- Hardware: Apple Silicon M1 Max, 64GB unified memory
- Python 3.12
Problem
When running Nemotron-Cascade-2-30B-A3B with enable_thinking=True, the model never generates </think> — it loops indefinitely in the reasoning block and never produces a response.
With enable_thinking=False, generation stops after only a few tokens with poor quality output.
Findings
</think> is token ID 13, which is not included in the default EOS token set ({2, 11}). Without it in the EOS set, the model has no stop condition for the thinking block.
vLLM handles this via --reasoning-parser nemotron_v3, which extends DeepSeekR1ReasoningParser and watches for </think> to transition from reasoning to content. No equivalent exists in mlx-lm for this model.
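For reference, the transition vLLM's parser performs can be sketched as a small streaming state machine. This is a standalone illustration of the idea, not vLLM or mlx-lm code; the class and method names are invented for the example:

```python
class ThinkTagSplitter:
    """Routes streamed text into reasoning until </think> appears,
    then into content. Rough sketch of what a reasoning parser does."""

    END_TAG = "</think>"

    def __init__(self):
        self.reasoning = []
        self.content = []
        self._buffer = ""          # holds text until the end tag is found
        self._in_reasoning = True

    def feed(self, chunk: str) -> None:
        if not self._in_reasoning:
            self.content.append(chunk)
            return
        # Buffer so the tag is detected even when split across chunks.
        self._buffer += chunk
        idx = self._buffer.find(self.END_TAG)
        if idx != -1:
            self.reasoning.append(self._buffer[:idx])
            self.content.append(self._buffer[idx + len(self.END_TAG):])
            self._buffer = ""
            self._in_reasoning = False

    def finish(self):
        # Flush any unterminated reasoning text (the looping case above).
        if self._buffer:
            self.reasoning.append(self._buffer)
            self._buffer = ""
        return "".join(self.reasoning), "".join(self.content)
```

Feeding chunks like "done</thi" followed by "nk>answer" still splits correctly, since the tag match happens on the accumulated buffer rather than per chunk.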
Workaround attempted
Manually adding token 13 to eos_token_ids stops the thinking phase cleanly. However, the answer phase immediately re-enters thinking, producing incoherent output:
```python
from mlx_lm import stream_generate

# Phase 1: thinking; adding </think> (token 13) to the EOS set stops it cleanly
thinking = ""
tokenizer.eos_token_ids.add(13)
for r in stream_generate(model, tokenizer, prompt=prompt, max_tokens=4096, sampler=sampler):
    thinking += r.text
    if r.finish_reason:
        break
tokenizer.eos_token_ids.discard(13)

# Phase 2: answer; generation immediately re-enters thinking, output is incoherent
prompt2 = prompt + thinking + "</think>\n"
for r in stream_generate(model, tokenizer, prompt=prompt2, ...):
    ...
```

Comparison
The same model via Ollama (GGUF Q4_K_M) works correctly — thinking terminates naturally and produces coherent output. This suggests the issue is in how mlx-lm handles the thinking state machine for this architecture, not the model weights themselves.
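Until a proper parser exists, a user-side stopgap that avoids mutating tokenizer.eos_token_ids is to watch token IDs in the streaming loop itself and split there in a single pass. The sketch below runs against a stub response stream so the logic is self-contained; in real use the loop would iterate stream_generate responses, which are assumed to expose .token and .text as above:

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

THINK_END_ID = 13  # </think> for this model, per the findings above

@dataclass
class Response:
    """Stand-in for mlx-lm's per-step generation response."""
    token: int
    text: str

def split_stream(responses: Iterable[Response]) -> Tuple[str, str]:
    """Collect reasoning until the </think> token ID appears,
    then collect the remainder as the answer, in one pass."""
    reasoning, answer = [], []
    in_reasoning = True
    for r in responses:
        if in_reasoning and r.token == THINK_END_ID:
            in_reasoning = False
            continue  # drop the tag token itself
        (reasoning if in_reasoning else answer).append(r.text)
    return "".join(reasoning), "".join(answer)
```

This sidesteps the phase-restart problem in the workaround above, though it only helps if the model does emit token 13 at some point; it does not fix the underlying looping behavior.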
Request
Would it be possible to implement a reasoning parser equivalent for nemotron_h / Nemotron-Cascade in mlx-lm, similar to how vLLM's nemotron_v3 parser handles the <think>/</think> transition?
The model runs well on Apple Silicon (MLX 8-bit loads correctly at ~33GB) and should be significantly faster than the GGUF alternative once thinking mode is handled correctly.