
Fix per-request enable_thinking toggle in server#1030

Open
eyupcanakman wants to merge 1 commit into ml-explore:main from eyupcanakman:fix/per-request-thinking-toggle-914

Conversation

@eyupcanakman
Contributor

Fixes #914.

PR #829 added chat_template_kwargs support so clients can send enable_thinking per request, but has_thinking was still read from the static tokenizer property. The per-request value was applied to the chat template yet ignored during response handling.

Added a _has_thinking() helper that checks per-request kwargs first, then CLI --chat-template-args, and finally falls back to tokenizer.has_thinking. Both the batch and single generation paths use it.

The chat_template_kwargs from client requests (e.g. enable_thinking)
were applied to the chat template during tokenization but ignored when
building the GenerationContext. This meant the response handler always
used the static tokenizer.has_thinking flag, so reasoning detection
and prompt checkpointing could not be toggled per request.

Add _has_thinking() that resolves the effective thinking state from
per-request kwargs, CLI --chat-template-args, then the tokenizer
default. Use it in both generation paths and prompt checkpointing.

Fixes ml-explore#914
@Thump604

The priority chain (per-request kwargs > CLI args > tokenizer default) is the right design and fixes a real bug: currently, sending enable_thinking: false per request still triggers think-token stripping in the response.

Heads up: this conflicts with PR #1006, which also modifies _compute_prompt_checkpoint (same function, and both PRs add an args parameter to its signature). Whichever lands first, the other will need a rebase.



Development

Successfully merging this pull request may close these issues.

Adjusting thinking/reasoning on a per-message basis?
