gpt-oss: implement harmony parsing #15181

Open · wants to merge 9 commits into master from feature/harmony-parser

Conversation

aldehir
Collaborator

@aldehir aldehir commented Aug 8, 2025

This is my attempt at implementing a harmony parser for gpt-oss.

Implementation

  • Reasoning format support - both auto and none are supported. When none, <|channel|>analysis<|message|>{reasoning content}<|end|> is added to the content (see the sketch after this list).
  • Tool parsing - tool parsing and grammar are implemented. If parse_tool_calls == false, tool calls are added to the content verbatim, which aligns with other implementations.
  • Commentary preamble - the harmony format allows for a preamble in the commentary channel. If present, it is added to the content.
  • Tests added - perhaps too many test cases. I wanted to ensure proper parsing of partial messages.
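
For illustration, a minimal sketch of how the channel markers map onto those fields. This is not the parser added by this PR (which is implemented in C++); the sample string and the routing below are simplified, made-up examples.

import re

# Illustrative only: split a harmony-style completion into channel segments and
# route analysis to reasoning_content, everything else (commentary preamble,
# final answer) to content. The RAW string below is invented for the example.
RAW = (
    "<|channel|>analysis<|message|>The user wants the weather.<|end|>"
    "<|start|>assistant<|channel|>commentary<|message|>Checking the forecast.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>It is sunny."
)

SEGMENT = re.compile(
    r"<\|channel\|>(?P<channel>\w+)<\|message\|>(?P<body>.*?)(?:<\|end\|>|$)",
    re.S,
)

reasoning, content = [], []
for m in SEGMENT.finditer(RAW):
    if m.group("channel") == "analysis":
        reasoning.append(m.group("body"))   # exposed as reasoning_content (auto)
    else:
        content.append(m.group("body"))     # preamble and final answer -> content

print({"reasoning_content": " ".join(reasoning), "content": "\n".join(content)})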

Remaining Work

  • The harmony format specifies that reasoning content from the assistant's last tool call should be included in the next prompt. This implementation assumes it comes from the client in reasoning_content; however, none of the clients I tested send it (a sketch of what that would look like follows below). A simple workaround is to use reasoning_format = none, or to add the reasoning to the content in tool calls.
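
For clients that can do it, here is a sketch of what sending the reasoning back might look like, assuming an OpenAI-style /v1/chat/completions request to a local llama-server; the tool, ids, URL, and values are made up for the example.

import json
import urllib.request

# Hypothetical follow-up request in which the client echoes the assistant's
# reasoning_content alongside its earlier tool call, as the harmony format expects.
messages = [
    {"role": "user", "content": "What's the weather in Lima?"},
    {
        "role": "assistant",
        "content": "",
        "reasoning_content": "Need the weather for Lima; call get_weather.",
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": "{\"location\": \"Lima\"}"},
        }],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"result\": \"Lima: +16°C\"}"},
]

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed local server address
    data=json.dumps({"messages": messages}).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment to actually send the request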

@aldehir aldehir requested a review from ngxson as a code owner August 8, 2025 18:51
@github-actions github-actions bot added the testing, examples, and server labels Aug 8, 2025
@abc-nix

abc-nix commented Aug 8, 2025

Thanks. This finally makes it much easier to use tools in Cherry Studio, and thinking boxes render properly.

@dagbs

dagbs commented Aug 8, 2025

Without the PR:
image

With the PR:
using gpt-oss-20b:f16 from unsloth with the updated gguf
image

It's better and much more usable, but there might still be some issues around tool calling.

@aldehir
Collaborator Author

aldehir commented Aug 9, 2025

@dagbs try setting function calling to native in open-webui
image

@aldehir aldehir force-pushed the feature/harmony-parser branch from d65e556 to 981886f on August 9, 2025 03:18
@victorb

victorb commented Aug 9, 2025

I tried this PR yesterday and compared it to #15158 (plus my own fixes on top of that PR). There were a couple of issues with this PR that I was going to share this morning, but since da67163 was pushed it finally seems to work better than that PR. In my (albeit limited) testing, tool calling and its formatting work a lot better. Thanks a ton for this patch @aldehir!

All the unit tests pass as well, unlike the other PR, and the code organization at a glance seems better too, though granted I'm no C++ expert, just a generalist.

@victorb

victorb commented Aug 9, 2025

Hmm, it still seems to break sometimes; I tried to understand why, but to no avail. Most of the time it works perfectly fine, but some edge case seems to break it. Running da67163 right now.

If I repeatedly run the same weather example maybe 10 times, I end up with a badly parsed response (on llama.cpp's side) maybe once.

Good run looks like this:

ChatCompletionResponse {
    choices: [
        Choice {
            message: ResponseMessage {
                content: Some(
                    "Here are the current conditions for the three cities, sorted by temperature (highest\u{202f}\u{202f}lowest):\n\n- **Barcelona**: ☀\u{fe0f}\u{202f}+25\u{202f}°C  \n- **Lima**: ⛅\u{fe0f}\u{202f}+16\u{202f}°C  \n- **Stockholm**: ☀\u{fe0f}\u{202f}+13\u{202f}°C  \n\n*(Temperatures are taken from the latest weather data at the time of the query.)*",
                ),
                reasoning_content: Some(
                    "The user asks: \"What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.\"\n\nWe have fetched weather for each location via the get_weather function. The function returns a JSON string with \"result\": \"Barcelona: ☀\u{fe0f}   +25°C\\n\". Similarly for Stockholm: \"Stockholm: ☀\u{fe0f}   +13°C\\n\". Lima: \"Lima: ⛅\u{fe0f}  +16°C\\n\". We need to parse these results, extract the temperature values, sort them descending, and display them in a list.\n\nWe need to produce a final answer that includes the weather for each location sorted by temperature highest first. The user wants a list sorted by temperature, highest first. So we need to sort: Barcelona +25°C, Lima +16°C, Stockholm +13°C.\n\nThus the sorted list: Barcelona: ☀\u{fe0f} +25°C, Lima: ⛅\u{fe0f} +16°C, Stockholm: ☀\u{fe0f} +13°C.\n\nWe should present them as a list, maybe bullet points.\n\nWe need to ensure we include the weather icons and temperature values as given.\n\nThus answer: \n\n- Barcelona: ☀\u{fe0f} +25°C\n- Lima: ⛅\u{fe0f} +16°C\n- Stockholm: ☀\u{fe0f} +13°C\n\nWe could also include the original strings.\n\nThus final answer: a list sorted by temperature highest first.\n\nWe should also note that the data is from the function calls.\n\nThus answer: \"Here are the current weather conditions for the three cities, sorted by temperature (highest first): ...\"\n\nWe should also mention that the temperatures are approximate and may change.\n\nThus final answer.",
                ),
                tool_calls: [],
            },
        },
    ],
}
sending:
[
    ChatMessage {
        role: "system",
        content: Some(
            "You are a helpful assistant. Help the user with whatever they need.\n",
        ),
        channel: None,
        recipient: None,
        tool_calls: None,
        tool_call_id: None,
    },
    ChatMessage {
        role: "user",
        content: Some(
            "What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.",
        ),
        channel: None,
        recipient: None,
        tool_calls: None,
        tool_call_id: None,
    },
    ChatMessage {
        role: "assistant",
        content: Some(
            "",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: Some(
            [
                ToolCall {
                    id: "ItCkpCeXs6jXspSwbFLidTHuATWM8MIj",
                    type: "function",
                    function: ToolCallFunction {
                        name: "get_weather",
                        arguments: "{\"location\":\"Barcelona\"}",
                    },
                },
            ],
        ),
        tool_call_id: None,
    },
    ChatMessage {
        role: "tool",
        content: Some(
            "{\"result\":\"Barcelona: ☀\u{fe0f}   +25°C\\n\"}",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: None,
        tool_call_id: Some(
            "ItCkpCeXs6jXspSwbFLidTHuATWM8MIj",
        ),
    },
    ChatMessage {
        role: "assistant",
        content: Some(
            "",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: Some(
            [
                ToolCall {
                    id: "d92fjsjS8L5xBMTxSCmSWcNyhgISwo4u",
                    type: "function",
                    function: ToolCallFunction {
                        name: "get_weather",
                        arguments: "{\"location\":\"Stockholm\"}",
                    },
                },
            ],
        ),
        tool_call_id: None,
    },
    ChatMessage {
        role: "tool",
        content: Some(
            "{\"result\":\"Stockholm: ☀\u{fe0f}   +13°C\\n\"}",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: None,
        tool_call_id: Some(
            "d92fjsjS8L5xBMTxSCmSWcNyhgISwo4u",
        ),
    },
    ChatMessage {
        role: "assistant",
        content: Some(
            "",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: Some(
            [
                ToolCall {
                    id: "0rIM7Xm598gzrRALjB4yMGZnuKRjOrSh",
                    type: "function",
                    function: ToolCallFunction {
                        name: "get_weather",
                        arguments: "{\"location\":\"Lima\"}",
                    },
                },
            ],
        ),
        tool_call_id: None,
    },
    ChatMessage {
        role: "tool",
        content: Some(
            "{\"result\":\"Lima: ⛅\u{fe0f}  +16°C\\n\"}",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: None,
        tool_call_id: Some(
            "0rIM7Xm598gzrRALjB4yMGZnuKRjOrSh",
        ),
    },
]
[src/lib.rs:38:9] &val = Object {
    "choices": Array [
        Object {
            "finish_reason": String("stop"),
            "index": Number(0),
            "message": Object {
                "role": String("assistant"),
                "reasoning_content": String("The user asks: \"What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.\"\n\nWe have fetched weather for each location via the get_weather function. The function returns a JSON string with \"result\": \"Barcelona: ☀\u{fe0f}   +25°C\\n\". Similarly for Stockholm: \"Stockholm: ☀\u{fe0f}   +13°C\\n\". Lima: \"Lima: ⛅\u{fe0f}  +16°C\\n\". We need to parse these results, extract the temperature values, sort them descending, and display them in a list.\n\nWe need to produce a final answer that includes the weather for each location sorted by temperature highest first. The user wants a list sorted by temperature, highest first. So we need to sort: Barcelona +25°C, Lima +16°C, Stockholm +13°C.\n\nThus the sorted list: Barcelona: ☀\u{fe0f} +25°C, Lima: ⛅\u{fe0f} +16°C, Stockholm: ☀\u{fe0f} +13°C.\n\nWe should present them as a list, maybe bullet points.\n\nWe need to ensure we include the weather icons and temperature values as given.\n\nThus answer: \n\n- Barcelona: ☀\u{fe0f} +25°C\n- Lima: ⛅\u{fe0f} +16°C\n- Stockholm: ☀\u{fe0f} +13°C\n\nWe could also include the original strings.\n\nThus final answer: a list sorted by temperature highest first.\n\nWe should also note that the data is from the function calls.\n\nThus answer: \"Here are the current weather conditions for the three cities, sorted by temperature (highest first): ...\"\n\nWe should also mention that the temperatures are approximate and may change.\n\nThus final answer."),
                "content": String("Here are the current conditions for the three cities, sorted by temperature (highest\u{202f}→\u{202f}lowest):\n\n- **Barcelona**: ☀\u{fe0f}\u{202f}+25\u{202f}°C  \n- **Lima**:  ⛅\u{fe0f}\u{202f}+16\u{202f}°C  \n- **Stockholm**: ☀\u{fe0f}\u{202f}+13\u{202f}°C  \n\n*(Temperatures are taken from the latest weather data at the time of the query.)*"),
            },
        },
    ],
    "created": Number(1754730237),
    "model": String("gpt-oss-20b-MXFP4.gguf"),
    "system_fingerprint": String("b6124-da671637"),
    "object": String("chat.completion"),
    "usage": Object {
        "completion_tokens": Number(440),
        "prompt_tokens": Number(361),
        "total_tokens": Number(801),
    },
    "id": String("chatcmpl-efjEpQIpXzIGe9j4F4gnC1X39B7mHOa3"),
    "__verbose": Object {
        "index": Number(0),
        "content": String("<|channel|>analysis<|message|>The user asks: \"What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.\"\n\nWe have fetched weather for each location via the get_weather function. The function returns a JSON string with \"result\": \"Barcelona: ☀\u{fe0f}   +25°C\\n\". Similarly for Stockholm: \"Stockholm: ☀\u{fe0f}   +13°C\\n\". Lima: \"Lima: ⛅\u{fe0f}  +16°C\\n\". We need to parse these results, extract the temperature values, sort them descending, and display them in a list.\n\nWe need to produce a final answer that includes the weather for each location sorted by temperature highest first. The user wants a list sorted by temperature, highest first. So we need to sort: Barcelona +25°C, Lima +16°C, Stockholm +13°C.\n\nThus the sorted list: Barcelona: ☀\u{fe0f} +25°C, Lima: ⛅\u{fe0f} +16°C, Stockholm: ☀\u{fe0f} +13°C.\n\nWe should present them as a list, maybe bullet points.\n\nWe need to ensure we include the weather icons and temperature values as given.\n\nThus answer: \n\n- Barcelona: ☀\u{fe0f} +25°C\n- Lima: ⛅\u{fe0f} +16°C\n- Stockholm: ☀\u{fe0f} +13°C\n\nWe could also include the original strings.\n\nThus final answer: a list sorted by temperature highest first.\n\nWe should also note that the data is from the function calls.\n\nThus answer: \"Here are the current weather conditions for the three cities, sorted by temperature (highest first): ...\"\n\nWe should also mention that the temperatures are approximate and may change.\n\nThus final answer.\n\n<|end|><|start|>assistant<|channel|>final<|message|>Here are the current conditions for the three cities, sorted by temperature (highest\u{202f}→\u{202f}lowest):\n\n- **Barcelona**: ☀\u{fe0f}\u{202f}+25\u{202f}°C  \n- **Lima**: ⛅\u{fe0f}\u{202f}+16\u{202f}°C  \n- **Stockholm**: ☀\u{fe0f}\u{202f}+13\u{202f}°C  \n\n*(Temperatures are taken from the latest weather data at the time of the query.)*"),
        "tokens": Array [],
        "id_slot": Number(0),
        "stop": Bool(true),
        "model": String("gpt-oss-20b-MXFP4.gguf"),
        "tokens_predicted": Number(440),
        "tokens_evaluated": Number(361),
        "generation_settings": Object {
            "n_predict": Number(4096),
            "seed": Number(4294967295),
            "temperature": Number(1.0),
            "dynatemp_range": Number(0.0),
            "dynatemp_exponent": Number(1.0),
            "top_k": Number(40),
            "top_p": Number(1.0),
            "min_p": Number(1.0),
            "top_n_sigma": Number(-1.0),
            "xtc_probability": Number(0.0),
            "xtc_threshold": Number(0.10000000149011612),
            "typical_p": Number(1.0),
            "repeat_last_n": Number(64),
            "repeat_penalty": Number(1.0),
            "presence_penalty": Number(0.0),
            "frequency_penalty": Number(0.0),
            "dry_multiplier": Number(0.0),
            "dry_base": Number(1.75),
            "dry_allowed_length": Number(2),
            "dry_penalty_last_n": Number(131072),
            "dry_sequence_breakers": Array [
                String("\n"),
                String(":"),
                String("\""),
                String("*"),
            ],
            "mirostat": Number(0),
            "mirostat_tau": Number(5.0),
            "mirostat_eta": Number(0.10000000149011612),
            "stop": Array [],
            "max_tokens": Number(4096),
            "n_keep": Number(0),
            "n_discard": Number(0),
            "ignore_eos": Bool(false),
            "stream": Bool(false),
            "logit_bias": Array [],
            "n_probs": Number(0),
            "min_keep": Number(0),
            "grammar": String("add-args ::= \"{\" space add-args-a-kv \",\" space add-args-b-kv \"}\" space\nadd-args-a-kv ::= \"\\\"a\\\"\" space \":\" space number\nadd-args-b-kv ::= \"\\\"b\\\"\" space \":\" space number\nadd-call ::= \"add\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" add-args\nchar ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\ndecimal-part ::= [0-9]{1,16}\nget-weather-args ::= \"{\" space get-weather-args-location-kv \"}\" space\nget-weather-args-location-kv ::= \"\\\"location\\\"\" space \":\" space string\nget-weather-call ::= \"get_weather\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" get-weather-args\nintegral-part ::= [0] | [1-9] [0-9]{0,15}\nmultiply-args ::= \"{\" space multiply-args-a-kv \",\" space multiply-args-b-kv \"}\" space\nmultiply-args-a-kv ::= \"\\\"a\\\"\" space \":\" space number\nmultiply-args-b-kv ::= \"\\\"b\\\"\" space \":\" space number\nmultiply-call ::= \"multiply\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" multiply-args\nnumber ::= (\"-\"? integral-part) (\".\" decimal-part)? ([eE] [-+]? integral-part)? space\nroot ::= \"<|channel|>commentary to=functions.\" tool-call\nspace ::= | \" \" | \"\\n\"{1,2} [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\ntool-call ::= add-call | multiply-call | get-weather-call\n"),
            "grammar_lazy": Bool(true),
            "grammar_triggers": Array [
                Object {
                    "type": Number(2),
                    "value": String("<\\|channel\\|>commentary to"),
                },
            ],
            "preserved_tokens": Array [
                Number(200003),
                Number(200005),
                Number(200006),
                Number(200007),
                Number(200008),
            ],
            "chat_format": String("GPT-OSS"),
            "reasoning_format": String("auto"),
            "reasoning_in_content": Bool(false),
            "thinking_forced_open": Bool(false),
            "samplers": Array [
                String("top_p"),
                String("min_p"),
                String("temperature"),
            ],
            "speculative.n_max": Number(16),
            "speculative.n_min": Number(0),
            "speculative.p_min": Number(0.75),
            "timings_per_token": Bool(false),
            "post_sampling_probs": Bool(false),
            "lora": Array [],
        },
        "prompt": String("<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-09\n\nReasoning: high\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.\nCalls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions\n\nYou are a helpful assistant. Help the user with whatever they need.\n\n\n# Tools\n\n## functions\n\nnamespace functions {\n\n// adds two numbers\ntype add = (_: {\na: number,\nb: number\n}) => any;\n\n// multiplies two numbers\ntype multiply = (_: {\na: number,\nb: number\n}) => any;\n\n// Get the weather for the specified location\ntype get_weather = (_: {\nlocation: string\n}) => any;\n\n} // namespace functions<|end|><|start|>user<|message|>What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.<|end|><|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>{\"location\": \"Barcelona\"}<|call|><|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>\"{\\\"result\\\":\\\"Barcelona: ☀\u{fe0f}   +25°C\\\\n\\\"}\"<|end|><|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>{\"location\": \"Stockholm\"}<|call|><|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>\"{\\\"result\\\":\\\"Stockholm: ☀\u{fe0f}   +13°C\\\\n\\\"}\"<|end|><|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>{\"location\": \"Lima\"}<|call|><|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>\"{\\\"result\\\":\\\"Lima: ⛅\u{fe0f}  +16°C\\\\n\\\"}\"<|end|><|start|>assistant"),
        "has_new_line": Bool(true),
        "truncated": Bool(false),
        "stop_type": String("eos"),
        "stopping_word": String(""),
        "tokens_cached": Number(800),
        "timings": Object {
            "prompt_n": Number(49),
            "prompt_ms": Number(80.661),
            "prompt_per_token_ms": Number(1.6461428571428571),
            "prompt_per_second": Number(607.4806907923283),
            "predicted_n": Number(440),
            "predicted_ms": Number(2565.854),
            "predicted_per_token_ms": Number(5.831486363636364),
            "predicted_per_second": Number(171.48286691292645),
        },
    },
    "timings": Object {
        "prompt_n": Number(49),
        "prompt_ms": Number(80.661),
        "prompt_per_token_ms": Number(1.6461428571428571),
        "prompt_per_second": Number(607.4806907923283),
        "predicted_n": Number(440),
        "predicted_ms": Number(2565.854),
        "predicted_per_token_ms": Number(5.831486363636364),
        "predicted_per_second": Number(171.48286691292645),
    },
}
got:
ChatCompletionResponse {
    choices: [
        Choice {
            message: ResponseMessage {
                content: Some(
                    "Here are the current conditions for the three cities, sorted by temperature (highest\u{202f}→\u{202f}lowest):\n\n- **Barcelona**: ☀\u{fe0f}\u{202f}+25\u{202f}°C  \n- **Lima**: ⛅\u{fe0f}\u{202f}+16\u{202f}°C  \n- **Stockholm**: ☀\u{fe0f}\u{202f}+13\u{202f}°C  \n\n*(Temperatures are taken from the latest weather data at the time of the query.)*",
                ),
                reasoning_content: Some(
                    "The user asks: \"What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.\"\n\nWe have fetched weather for each location via the get_weather function. The function returns a JSON string with \"result\": \"Barcelona: ☀\u{fe0f}   +25°C\\n\". Similarly for Stockholm: \"Stockholm: ☀\u{fe0f}   +13°C\\n\". Lima: \"Lima: ⛅\u{fe0f}  +16°C\\n\". We need to parse these results, extract the temperature values, sort them descending, and display them in a list.\n\nWe need to produce a final answer that includes the weather for each location sorted by temperature highest first. The user wants a list sorted by temperature, highest first. So we need to sort: Barcelona +25°C, Lima +16°C, Stockholm +13°C.\n\nThus the sorted list: Barcelona: ☀\u{fe0f} +25°C, Lima: ⛅\u{fe0f} +16°C, Stockholm: ☀\u{fe0f} +13°C.\n\nWe should present them as a list, maybe bullet points.\n\nWe need to ensure we include the weather icons and temperature values as given.\n\nThus answer: \n\n- Barcelona: ☀\u{fe0f} +25°C\n- Lima: ⛅\u{fe0f} +16°C\n- Stockholm: ☀\u{fe0f} +13°C\n\nWe could also include the original strings.\n\nThus final answer: a list sorted by temperature highest first.\n\nWe should also note that the data is from the function calls.\n\nThus answer: \"Here are the current weather conditions for the three cities, sorted by temperature (highest first): ...\"\n\nWe should also mention that the temperatures are approximate and may change.\n\nThus final answer.",
                ),
                tool_calls: [],
            },
        },
    ],
}
############# SHOULD BE RETURNING NOW< ALL DONE

Assistant: Here are the current conditions for the three cities, sorted by temperature (highest → lowest):

- **Barcelona**: ☀️ +25 °C
- **Lima**: ⛅️ +16 °C
- **Stockholm**: ☀️ +13 °C

*(Temperatures are taken from the latest weather data at the time of the query.)*

Meanwhile, a bad run ends up with:

ChatCompletionResponse {
    choices: [
        Choice {
            message: ResponseMessage {
                content: Some(
                    " to=functions.get_weather\u{a0}\u{200b}\u{200b}\u{a0}\u{a0}\n\n\n\n",
                ),
                reasoning_content: None,
                tool_calls: [],
            },
        },
    ],
}

Full logs from bad run:

sending:
[
    ChatMessage {
        role: "system",
        content: Some(
            "You are a helpful assistant. Help the user with whatever they need.\n",
        ),
        channel: None,
        recipient: None,
        tool_calls: None,
        tool_call_id: None,
    },
    ChatMessage {
        role: "user",
        content: Some(
            "What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.",
        ),
        channel: None,
        recipient: None,
        tool_calls: None,
        tool_call_id: None,
    },
    ChatMessage {
        role: "assistant",
        content: Some(
            "",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: Some(
            [
                ToolCall {
                    id: "uoYcwKVzv9haFDLHzVI9PcnAcICFcXmy",
                    type: "function",
                    function: ToolCallFunction {
                        name: "get_weather",
                        arguments: "{\"location\":\"Barcelona\"}",
                    },
                },
            ],
        ),
        tool_call_id: None,
    },
    ChatMessage {
        role: "tool",
        content: Some(
            "{\"result\":\"Barcelona: ☀\u{fe0f}   +25°C\\n\"}",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: None,
        tool_call_id: Some(
            "uoYcwKVzv9haFDLHzVI9PcnAcICFcXmy",
        ),
    },
    ChatMessage {
        role: "assistant",
        content: Some(
            "",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: Some(
            [
                ToolCall {
                    id: "qiY0di8Ec9BxfVuJa5Nw4flvAsEhs9DY",
                    type: "function",
                    function: ToolCallFunction {
                        name: "get_weather",
                        arguments: "{\"location\":\"Stockholm\"}",
                    },
                },
            ],
        ),
        tool_call_id: None,
    },
    ChatMessage {
        role: "tool",
        content: Some(
            "{\"result\":\"Stockholm: ☀\u{fe0f}   +13°C\\n\"}",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: None,
        tool_call_id: Some(
            "qiY0di8Ec9BxfVuJa5Nw4flvAsEhs9DY",
        ),
    },
]
[src/lib.rs:38:9] &val = Object {
    "choices": Array [
        Object {
            "finish_reason": String("stop"),
            "index": Number(0),
            "message": Object {
                "role": String("assistant"),
                "content": String(" to=functions.get_weather\u{a0}\u{200b}\u{200b}\u{a0}\u{a0}\n\n\n\n"),
            },
        },
    ],
    "created": Number(1754730110),
    "model": String("gpt-oss-20b-MXFP4.gguf"),
    "system_fingerprint": String("b6124-da671637"),
    "object": String("chat.completion"),
    "usage": Object {
        "completion_tokens": Number(12),
        "prompt_tokens": Number(310),
        "total_tokens": Number(322),
    },
    "id": String("chatcmpl-MKwVwT9hOE93A4IvYowdcn7f7mvFOvaR"),
    "__verbose": Object {
        "index": Number(0),
        "content": String(" to=functions.get_weather\u{a0}\u{200b}\u{200b}\u{a0}\u{a0}\n\n\n\n"),
        "tokens": Array [],
        "id_slot": Number(0),
        "stop": Bool(true),
        "model": String("gpt-oss-20b-MXFP4.gguf"),
        "tokens_predicted": Number(12),
        "tokens_evaluated": Number(310),
        "generation_settings": Object {
            "n_predict": Number(4096),
            "seed": Number(4294967295),
            "temperature": Number(1.0),
            "dynatemp_range": Number(0.0),
            "dynatemp_exponent": Number(1.0),
            "top_k": Number(40),
            "top_p": Number(1.0),
            "min_p": Number(1.0),
            "top_n_sigma": Number(-1.0),
            "xtc_probability": Number(0.0),
            "xtc_threshold": Number(0.10000000149011612),
            "typical_p": Number(1.0),
            "repeat_last_n": Number(64),
            "repeat_penalty": Number(1.0),
            "presence_penalty": Number(0.0),
            "frequency_penalty": Number(0.0),
            "dry_multiplier": Number(0.0),
            "dry_base": Number(1.75),
            "dry_allowed_length": Number(2),
            "dry_penalty_last_n": Number(131072),
            "dry_sequence_breakers": Array [
                String("\n"),
                String(":"),
                String("\""),
                String("*"),
            ],
            "mirostat": Number(0),
            "mirostat_tau": Number(5.0),
            "mirostat_eta": Number(0.10000000149011612),
            "stop": Array [],
            "max_tokens": Number(4096),
            "n_keep": Number(0),
            "n_discard": Number(0),
            "ignore_eos": Bool(false),
            "stream": Bool(false),
            "logit_bias": Array [],
            "n_probs": Number(0),
            "min_keep": Number(0),
            "grammar": String("add-args ::= \"{\" space add-args-a-kv \",\" space add-args-b-kv \"}\" space\nadd-args-a-kv ::= \"\\\"a\\\"\" space \":\" space number\nadd-args-b-kv ::= \"\\\"b\\\"\" space \":\" space number\nadd-call ::= \"add\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" add-args\nchar ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\ndecimal-part ::= [0-9]{1,16}\nget-weather-args ::= \"{\" space get-weather-args-location-kv \"}\" space\nget-weather-args-location-kv ::= \"\\\"location\\\"\" space \":\" space string\nget-weather-call ::= \"get_weather\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" get-weather-args\nintegral-part ::= [0] | [1-9] [0-9]{0,15}\nmultiply-args ::= \"{\" space multiply-args-a-kv \",\" space multiply-args-b-kv \"}\" space\nmultiply-args-a-kv ::= \"\\\"a\\\"\" space \":\" space number\nmultiply-args-b-kv ::= \"\\\"b\\\"\" space \":\" space number\nmultiply-call ::= \"multiply\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" multiply-args\nnumber ::= (\"-\"? integral-part) (\".\" decimal-part)? ([eE] [-+]? integral-part)? space\nroot ::= \"<|channel|>commentary to=functions.\" tool-call\nspace ::= | \" \" | \"\\n\"{1,2} [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\ntool-call ::= add-call | multiply-call | get-weather-call\n"),
            "grammar_lazy": Bool(true),
            "grammar_triggers": Array [
                Object {
                    "type": Number(2),
                    "value": String("<\\|channel\\|>commentary to"),
                },
            ],
            "preserved_tokens": Array [
                Number(200003),
                Number(200005),
                Number(200006),
                Number(200007),
                Number(200008),
            ],
            "chat_format": String("GPT-OSS"),
            "reasoning_format": String("auto"),
            "reasoning_in_content": Bool(false),
            "thinking_forced_open": Bool(false),
            "samplers": Array [
                String("top_p"),
                String("min_p"),
                String("temperature"),
            ],
            "speculative.n_max": Number(16),
            "speculative.n_min": Number(0),
            "speculative.p_min": Number(0.75),
            "timings_per_token": Bool(false),
            "post_sampling_probs": Bool(false),
            "lora": Array [],
        },
        "prompt": String("<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-09\n\nReasoning: low\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.\nCalls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions\n\nYou are a helpful assistant. Help the user with whatever they need.\n\n\n# Tools\n\n## functions\n\nnamespace functions {\n\n// adds two numbers\ntype add = (_: {\na: number,\nb: number\n}) => any;\n\n// multiplies two numbers\ntype multiply = (_: {\na: number,\nb: number\n}) => any;\n\n// Get the weather for the specified location\ntype get_weather = (_: {\nlocation: string\n}) => any;\n\n} // namespace functions<|end|><|start|>user<|message|>What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.<|end|><|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>{\"location\": \"Barcelona\"}<|call|><|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>\"{\\\"result\\\":\\\"Barcelona: ☀\u{fe0f}   +25°C\\\\n\\\"}\"<|end|><|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>{\"location\": \"Stockholm\"}<|call|><|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>\"{\\\"result\\\":\\\"Stockholm: ☀\u{fe0f}   +13°C\\\\n\\\"}\"<|end|><|start|>assistant"),
        "has_new_line": Bool(true),
        "truncated": Bool(false),
        "stop_type": String("eos"),
        "stopping_word": String(""),
        "tokens_cached": Number(321),
        "timings": Object {
            "prompt_n": Number(50),
            "prompt_ms": Number(78.391),
            "prompt_per_token_ms": Number(1.5678200000000002),
            "prompt_per_second": Number(637.8283221288157),
            "predicted_n": Number(12),
            "predicted_ms": Number(64.481),
            "predicted_per_token_ms": Number(5.3734166666666665),
            "predicted_per_second": Number(186.1013321753695),
        },
    },
    "timings": Object {
        "prompt_n": Number(50),
        "prompt_ms": Number(78.391),
        "prompt_per_token_ms": Number(1.5678200000000002),
        "prompt_per_second": Number(637.8283221288157),
        "predicted_n": Number(12),
        "predicted_ms": Number(64.481),
        "predicted_per_token_ms": Number(5.3734166666666665),
        "predicted_per_second": Number(186.1013321753695),
    },
}
got:
ChatCompletionResponse {
    choices: [
        Choice {
            message: ResponseMessage {
                content: Some(
                    " to=functions.get_weather\u{a0}\u{200b}\u{200b}\u{a0}\u{a0}\n\n\n\n",
                ),
                reasoning_content: None,
                tool_calls: [],
            },
        },
    ],
}
############# SHOULD BE RETURNING NOW< ALL DONE

Assistant: to=functions.get_weather ​​

This seems to happen more often when reasoning_effort is set to low than when it's set to high, but I'm not 100% sure; I might be imagining it. If it's real, it could be an inference problem with the model itself, where it gets the syntax wrong? I'm really not sure what's going on here.

@Mushoz
Copy link

Mushoz commented Aug 9, 2025

@victorb maybe use temperature = 0 and/or top-k = 1? If inference is the issue, making it deterministic would fix it.

@victorb
Copy link

victorb commented Aug 9, 2025

@Mushoz

maybe use temperature = 0 and/or top-k = 1? If inference is the issue, making it deterministic would fix it.

Running with these inference parameters for example:

{
        temperature: 0.0,
        top_p: 1.0,
        min_p: 0.0,
        top_k: 0,
        samplers: [
            "top_k",
            "top_p",
            "min_p",
            "temperature",
        ],
}

This seems to give me deterministic responses: once I get a good response it always works, and the runs that break always break, so it's useful for testing at the very least. Here's one example of broken parsing I'm currently getting, even with temperature=0 and various top-k values:

ChatCompletionResponse {
    choices: [
        Choice {
            message: ResponseMessage {
                content: Some(
                    " to=function\u{a0}\u{a0}...",
                ),
                reasoning_content: None,
                tool_calls: [],
            },
        },
    ],
}
sending:
[
    ChatMessage {
        role: "system",
        content: Some(
            "You are a helpful assistant. Help the user with whatever they need.\n",
        ),
        channel: None,
        recipient: None,
        tool_calls: None,
        tool_call_id: None,
    },
    ChatMessage {
        role: "user",
        content: Some(
            "What is the current weather in Barcelona, Stockholm, and Beijing? And also, display them in a list sorted by their temperatures, highest first.",
        ),
        channel: None,
        recipient: None,
        tool_calls: None,
        tool_call_id: None,
    },
    ChatMessage {
        role: "assistant",
        content: Some(
            "",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: Some(
            [
                ToolCall {
                    id: "h4fZmZGG2zWXlE6IOqBzTnzdrUFavQFu",
                    type: "function",
                    function: ToolCallFunction {
                        name: "get_weather",
                        arguments: "{\"location\":\"Barcelona\"}",
                    },
                },
            ],
        ),
        tool_call_id: None,
    },
    ChatMessage {
        role: "tool",
        content: Some(
            "{\"result\":\"Barcelona: ☀\u{fe0f}  19°C (mocked)\"}",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: None,
        tool_call_id: Some(
            "h4fZmZGG2zWXlE6IOqBzTnzdrUFavQFu",
        ),
    },
    ChatMessage {
        role: "assistant",
        content: Some(
            "",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: Some(
            [
                ToolCall {
                    id: "lek3lo184KRjWObZv6y1rgkdKkBgoFj7",
                    type: "function",
                    function: ToolCallFunction {
                        name: "get_weather",
                        arguments: "{\"location\":\"Stockholm\"}",
                    },
                },
            ],
        ),
        tool_call_id: None,
    },
    ChatMessage {
        role: "tool",
        content: Some(
            "{\"result\":\"Stockholm: ☀\u{fe0f}  26°C (mocked)\"}",
        ),
        channel: Some(
            "commentary",
        ),
        recipient: None,
        tool_calls: None,
        tool_call_id: Some(
            "lek3lo184KRjWObZv6y1rgkdKkBgoFj7",
        ),
    },
]
[src/lib.rs:38:9] &val = Object {
    "choices": Array [
        Object {
            "finish_reason": String("stop"),
            "index": Number(0),
            "message": Object {
                "role": String("assistant"),
                "content": String(" to=function\u{a0}\u{a0}..."),
            },
        },
    ],
    "created": Number(1754735169),
    "model": String("gpt-oss-120b-MXFP4.gguf"),
    "system_fingerprint": String("b6124-da671637"),
    "object": String("chat.completion"),
    "usage": Object {
        "completion_tokens": Number(7),
        "prompt_tokens": Number(314),
        "total_tokens": Number(321),
    },
    "id": String("chatcmpl-kXPt4WpoM4AUGLhbku8VlKSwZkktJUDA"),
    "__verbose": Object {
        "index": Number(0),
        "content": String(" to=function\u{a0}\u{a0}..."),
        "tokens": Array [],
        "id_slot": Number(0),
        "stop": Bool(true),
        "model": String("gpt-oss-120b-MXFP4.gguf"),
        "tokens_predicted": Number(7),
        "tokens_evaluated": Number(314),
        "generation_settings": Object {
            "n_predict": Number(4096),
            "seed": Number(4294967295),
            "temperature": Number(0.0),
            "dynatemp_range": Number(0.0),
            "dynatemp_exponent": Number(1.0),
            "top_k": Number(0),
            "top_p": Number(1.0),
            "min_p": Number(0.0),
            "top_n_sigma": Number(-1.0),
            "xtc_probability": Number(0.0),
            "xtc_threshold": Number(0.10000000149011612),
            "typical_p": Number(1.0),
            "repeat_last_n": Number(64),
            "repeat_penalty": Number(1.0),
            "presence_penalty": Number(0.0),
            "frequency_penalty": Number(0.0),
            "dry_multiplier": Number(0.0),
            "dry_base": Number(1.75),
            "dry_allowed_length": Number(2),
            "dry_penalty_last_n": Number(131072),
            "dry_sequence_breakers": Array [
                String("\n"),
                String(":"),
                String("\""),
                String("*"),
            ],
            "mirostat": Number(0),
            "mirostat_tau": Number(5.0),
            "mirostat_eta": Number(0.10000000149011612),
            "stop": Array [],
            "max_tokens": Number(4096),
            "n_keep": Number(0),
            "n_discard": Number(0),
            "ignore_eos": Bool(false),
            "stream": Bool(false),
            "logit_bias": Array [],
            "n_probs": Number(0),
            "min_keep": Number(0),
            "grammar": String("add-args ::= \"{\" space add-args-a-kv \",\" space add-args-b-kv \"}\" space\nadd-args-a-kv ::= \"\\\"a\\\"\" space \":\" space number\nadd-args-b-kv ::= \"\\\"b\\\"\" space \":\" space number\nadd-call ::= \"add\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" add-args\nchar ::= [^\"\\\\\\x7F\\x00-\\x1F] | [\\\\] ([\"\\\\bfnrt] | \"u\" [0-9a-fA-F]{4})\ndecimal-part ::= [0-9]{1,16}\nget-weather-args ::= \"{\" space get-weather-args-location-kv \"}\" space\nget-weather-args-location-kv ::= \"\\\"location\\\"\" space \":\" space string\nget-weather-call ::= \"get_weather\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" get-weather-args\nintegral-part ::= [0] | [1-9] [0-9]{0,15}\nmultiply-args ::= \"{\" space multiply-args-a-kv \",\" space multiply-args-b-kv \"}\" space\nmultiply-args-a-kv ::= \"\\\"a\\\"\" space \":\" space number\nmultiply-args-b-kv ::= \"\\\"b\\\"\" space \":\" space number\nmultiply-call ::= \"multiply\" space \"<|constrain|>\"? \"json\" space \"<|message|>\" multiply-args\nnumber ::= (\"-\"? integral-part) (\".\" decimal-part)? ([eE] [-+]? integral-part)? space\nroot ::= \"<|channel|>commentary to=functions.\" tool-call\nspace ::= | \" \" | \"\\n\"{1,2} [ \\t]{0,20}\nstring ::= \"\\\"\" char* \"\\\"\" space\ntool-call ::= add-call | multiply-call | get-weather-call\n"),
            "grammar_lazy": Bool(true),
            "grammar_triggers": Array [
                Object {
                    "type": Number(2),
                    "value": String("<\\|channel\\|>commentary to"),
                },
            ],
            "preserved_tokens": Array [
                Number(200003),
                Number(200005),
                Number(200006),
                Number(200007),
                Number(200008),
            ],
            "chat_format": String("GPT-OSS"),
            "reasoning_format": String("auto"),
            "reasoning_in_content": Bool(false),
            "thinking_forced_open": Bool(false),
            "samplers": Array [
                String("top_k"),
                String("top_p"),
                String("min_p"),
                String("temperature"),
            ],
            "speculative.n_max": Number(16),
            "speculative.n_min": Number(0),
            "speculative.p_min": Number(0.75),
            "timings_per_token": Bool(false),
            "post_sampling_probs": Bool(false),
            "lora": Array [],
        },
        "prompt": String("<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-09\n\nReasoning: low\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.\nCalls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions\n\nYou are a helpful assistant. Help the user with whatever they need.\n\n\n# Tools\n\n## functions\n\nnamespace functions {\n\n// adds two numbers\ntype add = (_: {\na: number,\nb: number\n}) => any;\n\n// multiplies two numbers\ntype multiply = (_: {\na: number,\nb: number\n}) => any;\n\n// Get the weather for the specified location\ntype get_weather = (_: {\nlocation: string\n}) => any;\n\n} // namespace functions<|end|><|start|>user<|message|>What is the current weather in Barcelona, Stockholm, and Beijing? And also, display them in a list sorted by their temperatures, highest first.<|end|><|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>{\"location\": \"Barcelona\"}<|call|><|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>\"{\\\"result\\\":\\\"Barcelona: ☀\u{fe0f}  19°C (mocked)\\\"}\"<|end|><|start|>assistant to=functions.get_weather<|channel|>commentary json<|message|>{\"location\": \"Stockholm\"}<|call|><|start|>functions.get_weather to=assistant<|channel|>commentary<|message|>\"{\\\"result\\\":\\\"Stockholm: ☀\u{fe0f}  26°C (mocked)\\\"}\"<|end|><|start|>assistant"),
        "has_new_line": Bool(false),
        "truncated": Bool(false),
        "stop_type": String("eos"),
        "stopping_word": String(""),
        "tokens_cached": Number(320),
        "timings": Object {
            "prompt_n": Number(52),
            "prompt_ms": Number(82.733),
            "prompt_per_token_ms": Number(1.5910192307692308),
            "prompt_per_second": Number(628.5279151003831),
            "predicted_n": Number(7),
            "predicted_ms": Number(54.931),
            "predicted_per_token_ms": Number(7.8472857142857135),
            "predicted_per_second": Number(127.4325972583787),
        },
    },
    "timings": Object {
        "prompt_n": Number(52),
        "prompt_ms": Number(82.733),
        "prompt_per_token_ms": Number(1.5910192307692308),
        "prompt_per_second": Number(628.5279151003831),
        "predicted_n": Number(7),
        "predicted_ms": Number(54.931),
        "predicted_per_token_ms": Number(7.8472857142857135),
        "predicted_per_second": Number(127.4325972583787),
    },
}
got:
ChatCompletionResponse {
    choices: [
        Choice {
            message: ResponseMessage {
                content: Some(
                    " to=function\u{a0}\u{a0}...",
                ),
                reasoning_content: None,
                tool_calls: [],
            },
        },
    ],
}
############# SHOULD BE RETURNING NOW< ALL DONE

Assistant: to=function  ...

I tried setting top-k to 0, 1, and 100 and got the same results.

@aldehir
Collaborator Author

aldehir commented Aug 9, 2025

@victorb thank you for that extensive testing. I can't seem to reproduce this on gpt-oss-20b. Can you provide the last entry in the server log where it begins parsing:

srv  update_chat_: Parsing chat message: <|channel|>analysis<|message|>We need to list sorted by temperature. Pr...

That will help me better understand the problem. It appears the model is emitting Unicode space characters, but I wasn't aware the space symbol in the grammar would accept Unicode. I'm still digging into that.

@aldehir
Collaborator Author

aldehir commented Aug 9, 2025

I managed to get gpt-oss-120b running, albeit slowly.

Looks like I missed a scenario where the model outputs the recipient (to=) in the role and the message in a commentary or analysis channel:

<|start|>assistant to=functions.get_weather<|channel|>commentary <|constrain|>json<|message|>{ ... }

I have yet to see the gpt-oss-20b model exhibit this behavior, but it is documented in the harmony docs.

I updated the parsing and grammar rule to handle this. It should at least parse the tool calls now.
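
For illustration only, here is a rough regex-based sketch of accepting the recipient in either position; the PR itself handles this in the C++ parser and the GBNF grammar, and the sample strings below are simplified.

import re

# Accept the recipient either in the role header
# ("<|start|>assistant to=functions.NAME<|channel|>commentary ...") or after the
# channel ("<|channel|>commentary to=functions.NAME ..."), per the harmony docs.
HEADER = re.compile(
    r"(?:to=functions\.(?P<pre>[\w-]+)\s*)?"     # recipient in the role
    r"<\|channel\|>(?:commentary|analysis)\s*"
    r"(?:to=functions\.(?P<post>[\w-]+)\s*)?"    # recipient after the channel
    r"(?:<\|constrain\|>)?json\s*<\|message\|>"
)

for sample in (
    "to=functions.get_weather<|channel|>commentary <|constrain|>json<|message|>{}",
    "<|channel|>commentary to=functions.get_weather json<|message|>{}",
):
    m = HEADER.search(sample)
    print(m.group("pre") or m.group("post"))     # prints get_weather for both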

I found that performance degrades by the third call: I get queries for "Lima??", "Lima?", or some variation with garbage at the end. However, if I pass reasoning_content with every message, I get good results. I was able to extend the query to 5 cities by doing so.

Give cf9a0d6 a shot.

@aldehir
Collaborator Author

aldehir commented Aug 10, 2025

For those interested, I implemented a basic cache for reasoning content in my fork aldehir#1.

Without prior reasoning content for its tool calls, gpt-oss seems to perform poorly in multi-turn scenarios. No client I know of passes reasoning_content back to the model, so a cache on the server side is the easiest way to address it. If this PR gets accepted, I'll submit that one for review.
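
For illustration, a simplified sketch of the idea (the actual code in the fork may differ): remember the reasoning that accompanied each tool call and re-attach it when the client later sends that tool-call message back without reasoning_content.

# Simplified sketch of a server-side reasoning cache keyed by tool call id.
reasoning_cache: dict[str, str] = {}

def remember(message: dict) -> None:
    # Call after a completed turn that produced tool calls.
    if message.get("reasoning_content"):
        for call in message.get("tool_calls", []):
            reasoning_cache[call["id"]] = message["reasoning_content"]

def inject(messages: list[dict]) -> None:
    # Call on the next request: restore reasoning that the client dropped.
    for msg in messages:
        if msg.get("role") != "assistant" or not msg.get("tool_calls"):
            continue
        if msg.get("reasoning_content"):
            continue
        for call in msg["tool_calls"]:
            if call["id"] in reasoning_cache:
                msg["reasoning_content"] = reasoning_cache[call["id"]]
                break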

@victorb

victorb commented Aug 10, 2025

It should at least parse the tool calls now.

Awesome @aldehir, I did a bunch of testing yesterday with 20b and 120b and tool parsing didn't fail once! 🎉

I do see the same inference quality degradation after a few messages, mainly hallucinations for the tool arguments (calling get_weather("...") or get_weather("?") for example) with both 20b and 120b.

However, trying out --reasoning_cache quickly for ~30 minutes (before going offline for a week!) seems to alleviate that particular issue. Nicely done figuring that out; it seems to help a lot!

Overall, this seems solid to me now. Since cf9a0d6, harmony parsing seems complete in all the examples I've run: everything goes into the right place, and tool calls/responses all look correct now.

@tarruda

tarruda commented Aug 10, 2025

@aldehir using your gpt-oss-inject-reasoning branch, there seems to be something wrong with tool calling: when I provide it with more than one tool, it always seems to call the first tool. For example, in my local CLI agent, I give it the prompt "explore this project" and two tools:

  • list_files
  • read_file

If read_file appears first in the tool list, then it reasons: "I need to use the list_files tool", followed by a read_file call.

If list_files appears first, it calls it successfully. Once it sees the tool result and it contains a README.md, it follows up with the reasoning "There's a readme, I need to call read_file", and then calls list_files again.

I wonder if this is related to the grammar generation for tool calls, which is somehow constraining it to always use the first tool.

BTW, this is the first model I've tried with llama-server that can mix reasoning with tool calls, so this is definitely a step in the right direction!

@aldehir
Collaborator Author

aldehir commented Aug 10, 2025

When I provide it with more than one tool, it always seems to call the first tool.

@tarruda good catch. I forgot to group up the tool calls when I reworked the grammar to account for the recipient in the role. I've updated both this PR and the one in my fork.

@tarruda

tarruda commented Aug 10, 2025

@tarruda good catch. I forgot to group up the tool calls when I reworked the grammar to account for the recipient in the role. I've updated both this PR and the one in my fork.

Thanks a lot, seems to be working perfectly now!

@tarruda

tarruda commented Aug 10, 2025

I've also been playing with calling tools in its CoT and can confirm it is working correctly. For example, if I provide this tool to the LLM:

async def arithmetic(code: str) -> str:
    """
    Evaluates arithmetic expression and returns the result.

    ANY arithmetic questions (no matter how trivial) should make use of this tool in your chain of thought. Always return this tool's response even if it is wrong!
    """
    return str(eval("5 + 5"))  # intentionally ignores `code` and always returns 10, per the docstring

Then it will always use it during reasoning.

There's something I'm wondering, though: looking at the template, I can see it tells the LLM about two possible built-in tools it can use in its CoT (browser and python). I imagine GPT-OSS was trained to make use of these tools. What I'm wondering is why these tools are treated specially, and whether it makes sense to "merge" the user-provided tools with these built-in tools.

@pwilkin
Contributor

pwilkin commented Aug 11, 2025

@tarruda what I mean is that previously, for example with DeepSeek, tool calls between think tags were not treated as real calls but as the LLM thinking about the tool calls and possibly constructing them in the thinking process. Only calls outside the think block were parsed.

@pwilkin
Contributor

pwilkin commented Aug 11, 2025

So, to give an example: this was a valid tool call:

<think>I will now call a search tool.</think>
<tool_call>
{"name": "search", "arguments": { "query": "foo" }}
</tool_call>

This was not:

<think>Let's call a search tool: 
<tool_call>
{"name": "search", "arguments": { "query": "foo" }}
</tool_call>
</think>

I have successfully called a search tool and am expecting the results.

@ggerganov
Member

I have been vibing a bit with Claude Code + this PR + gpt-oss-120b and tool calling seems to work as expected. Seems like a good job - let's wait for @ngxson to say the final word.

@tarruda

tarruda commented Aug 11, 2025

@pwilkin

What I meant was that on the OpenAI API (and its extensions), there's no nesting between the message parts, so it must all be parsed and exposed as a flat stream of message/parts by the server.

In your example above, the server could parse the tool call within the thinking tags, and it would be exposed as the following pseudo stream:

  • {"type": "think", "data": "Let's call a search tool:"}
  • {"type": "tool-call", "data": {"name": "search", "arguments": { "query": "foo" }}
  • {"type": "text", "data": "I have successfully called a search tool and am expecting the results."}

AFAICT there's no ambiguity on the client side. Client will take all the tool call parts in a message, invoke the actual implementations, and put the tool response parts in the next request.

@pwilkin
Contributor

pwilkin commented Aug 11, 2025

@pwilkin

What I meant was that on the OpenAI API (and its extensions), there's no nesting between the message parts, so it must all be parsed and exposed as a flat stream of message/parts by the server.

In your example above, the server could parse the tool call within the thinking tags, and it would be exposed as the following pseudo stream:

  • {"type": "think", "data": "Let's call a search tool:"}
  • {"type": "tool-call", "data": {"name": "search", "arguments": { "query": "foo" }}
  • {"type": "text", "data": "I have successfully called a search tool and am expecting the results."}

AFAICT there's no ambiguity on the client side. Client will take all the tool call parts in a message, invoke the actual implementations, and put the tool response parts in the next request.

The server could, but that's not what it did, nor what clients expected it to do. See the following snippet from chat.cpp:

static void common_chat_parse_deepseek_r1(common_chat_msg_parser & builder) {
    builder.try_parse_reasoning("<think>", "</think>");
    if (!builder.syntax().parse_tool_calls) {
        builder.add_content(builder.consume_rest());
        return;
    }

    static const common_regex tool_calls_begin("(?:<|tool▁calls▁begin|>|<|tool_calls_begin|>|<|tool calls begin|>|<|tool\\\\_calls\\\\_begin|>|<|tool▁calls|>)");
    static const common_regex tool_calls_end("<|tool▁calls▁end|>");
    static const common_regex function_regex("(?:<|tool▁call▁begin|>)?function<|tool▁sep|>([^\n]+)\n```json\n");
    static const common_regex close_regex("```[\\s\\r\\n]*<|tool▁call▁end|>");

    parse_json_tool_calls(
        builder,
        /* block_open= */ tool_calls_begin,
        /* function_regex_start_only= */ std::nullopt,
        function_regex,
        close_regex,
        tool_calls_end);
}

The reasoning block is consumed first, and only then is tool-call parsing performed. This was the expected behavior for thinking models before gpt-oss. gpt-oss introduces a completely new structure that breaks the implicit contract that tool calls within thinking blocks get ignored.

@createthis

Whoa! I just compiled this and ran it in conjunction with my Open Hands AI "Reasoning model" switch patch. The conversation wasn't very long, but it seems to be working! Here's the transcript if anyone is interested: https://github.com/createthis/open_hands_gpt_oss?tab=readme-ov-file#harmony-patch

This model is insanely fast on my system:
Screenshot 2025-08-11 at 1 12 36 PM

140 tok/s at 14k context. Wild. I look forward to seeing how competent it is now that we have a working agentic coding path forward.

@mallorbc

I have been vibing a bit with Claude Code + this PR + gpt-oss-120b and tool calling seems to work as expected. Seems like a good job - let's wait for @ngxson to say the final word.

I have been doing the same thing. I am using a LiteLLM proxy to do so (perhaps you are as well). Do you know how the system prompt works in conjunction with Claude Code? Claude sets its own system prompt, does this overwrite the reasoning: high setting? Do they co-exist?

@ggerganov
Member

ggerganov commented Aug 11, 2025

I am using CC + CC Router as explained in #14758.

Claude sets its own system prompt, does this overwrite the reasoning: high setting?

My guess would be that CC keeps default reasoning settings. Btw, you can change this with llama-server ... --chat-template-kwargs '{"reasoning_effort": "high"}'

For the moment, I am just vibing and not looking at the actual requests in detail. But I think there is a lot that can be optimized in llama-server for this use case. As soon as we get tool calls working correctly, I'll start looking at improving the agentic performance even further.

@createthis
Copy link

Btw, you can change this with llama-server ... --chat-template-kwargs '{"reasoning_effort": "high"}'

FYI, this doesn't work for me. See #15130

@tarruda
Copy link

tarruda commented Aug 11, 2025

My guess would be that CC keeps default reasoning settings. Btw, you can change this with llama-server ... --chat-template-kwargs '{"reasoning_effort": "high"}'

Thanks for that trick!

For the moment, I am just vibing and not looking at the actual requests in detail. But I think there is a lot that can be optimized in llama-server for this use case. As soon as we get tool calls working correctly, I'll start looking at improving the agentic performance even further.

Claude Code doesn't seem very efficient in its system prompt handling, which might be fine for Claude but not very good for local models. Using mitmproxy, I extracted parts of the conversation into a nicely readable markdown format: https://gist.github.com/tarruda/edda27617da8e219e70eb4b9b9503a5e

One of the issues is that they do inject a bunch of information about the date, working directory and repository/branches in the system prompt, which can mess a bit with kv-caching.
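
Roughly speaking, the server can only reuse its prompt cache for the longest prefix that matches the previous request, so a date or branch name near the top of the system prompt invalidates everything cached after it, including the large tool definitions. A toy, character-level illustration (the prompt text here is invented):

// prefix_cache.cpp - toy illustration of why variable text early in the prompt hurts prompt caching.
// Anything after the first differing character (a date, a branch name) cannot be reused
// from the previous request's kv-cache and must be recomputed.
#include <iostream>
#include <string>

static size_t common_prefix(const std::string & a, const std::string & b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) n++;
    return n;
}

int main() {
    std::string req1 = "system: You are Claude Code. date=2025-08-11 branch=main\n...thousands of tokens of tool definitions...";
    std::string req2 = "system: You are Claude Code. date=2025-08-12 branch=main\n...thousands of tokens of tool definitions...";
    std::cout << "reusable prefix: " << common_prefix(req1, req2) << " of " << req1.size() << " chars\n";
}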

Another issue is the number of tools available by default, which consume a lot of context and are rendered after the system prompt (which has variable parts such as branches), not to mention there are a lot of tools that are rarely used (such as notebook manipulation tools). I think they could make it much more efficient by splitting tasks into multiple sub-agents that only have access to a small subset of tools.

@mallorbc
Copy link

I am using CC + CC Router as explained in #14758.

Claude sets its own system prompt, does this overwrite the reasoning: high setting?

My guess would be that CC keeps default reasoning settings. Btw, you can change this with llama-server ... --chat-template-kwargs '{"reasoning_effort": "high"}'

For the moment, I am just vibing and not looking at the actual requests in detail. But I think there is a lot that can be optimized in llama-server for this use case. As soon as we get tool calls working correctly, I'll start looking at improving the agentic performance even further.

I found the answer to my own question and figured I would share:

I ran llama-server with --verbose so I could see the prompt. I also modified the chat template to hardcode reasoning to high.

Here is some of it:
"prompt":"<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-11\n\nReasoning: high\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.\nCalls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions\n\nYou are Claude Code, Anthropic's official CLI for Claude.\n\nYou are an in

So it works as expected. A good way of putting it is that the system prompt is now the "super system prompt", and the developer prompt is the old system prompt. Both are correctly injected where they should be with my setup using llama-server with a LiteLLM proxy.

@createthis
Copy link

createthis commented Aug 11, 2025

I tried this branch on a larger agentic task with Open Hands and enabled --verbose in llama.cpp. Here's an excerpt from the transcript:

        "choices": [
            {
                "finish_reason": "stop",
                "index": 0,
                "message": {
                    "role": "assistant",
                    "reasoning_content": "We need to add unit tests for the Lambda at /workspace/scripts-utilities/LambdasESM/SFTPAlert-
redacted. Use existing test as template. We need to explore repository to see structure, code of lambda, existing tests, etc.\n\nF
irst, list directory.We need to explore the repository. Use execute_bash to list directories.",
                    "content": "We will run a bash command to list the workspace."
                }
            }
        ],
        "created": 1754936545,
        "model": "gpt-oss-120b-F16",
        "system_fingerprint": "b6130-6343a7f5",
        "object": "chat.completion",
        "usage": {
            "completion_tokens": 141,
            "prompt_tokens": 7839,
            "total_tokens": 7980
        },
        "id": "chatcmpl-1tA9GYCxIo1ofkzW6FS6IFA2jVjgOnQB",
        "__verbose": {
            "index": 0,
            "content": "<|channel|>analysis<|message|>We need to add unit tests for the Lambda at /workspace/scripts-utilities/LambdasESM/SFTPAlert-redacted. Use existing test as template. We need to explore repository to see structure, code of lambda, existing tests, etc.\n\nFirst, list directory.<|end|><|start|>assistant<|channel|>analysis<|message|>We need to explore the repository. Use execute_bash to list directories.<|end|><|start|>assistant<|channel|>commentary<|message|>We will run a bash command to list the workspace.<|end|><|start|>assistant<|channel|>commentary to=execute_bash <|constrain|>json<|message|>{\n  \"command\": \"ls -R /workspace/scripts-utilities/LambdasESM | head -n 200\"\n}",

It looks like it tries to run execute_bash (I haven't read the harmony spec yet, so I'm just taking a wild guess):

<|start|>assistant<|channel|>commentary to=execute_bash <|constrain|>json<|message|>{\n  \"command\": \"ls -R /workspace/scripts-utilities/LambdasESM | head -n 200\"\n}

However, it never generates the execute_bash command, so the conversation stops.

EDIT: Here's the error from the llama.cpp log:

First, list directory.<|end|><|start|>assistant<|channel|>analysis<|message|>We need to explore the repository. Use execute_bash to list directories.<|end|><|start|>assistant<|channel|>commentary<|message|>We will run a bash command to list the workspace.<|end|><|start|>assistant<|channel|>commentary to=execute_bash <|constrain|>json<|message|>{
  "command": "ls -R /workspace/scripts-utilities/LambdasESM | head -n 200"
}
Parse error: expected function name, got: execute_bash <|constrain|>json<|message|>{
  "command": "ls -R /workspace/scripts-utilities/LambdasESM | head -n 200"
}

Sorry, I can't publish this transcript in full because it has some proprietary data in it, but I'll do my best to provide any additional details from it that I can.

@aldehir
Copy link
Collaborator Author

aldehir commented Aug 11, 2025

@createthis thank you for that output; having the content from the verbose object is more than enough to help isolate the issue. It seems the grammar rules might need a bit more tweaking; the grammar doesn't appear to have triggered on the last commentary message.

@ggerganov
Copy link
Member

I can't tell if this is a valid tool call (ignore my comment if it is valid).

But just in case, @createthis could you try with the model from https://huggingface.co/ggml-org/gpt-oss-120b-GGUF - I don't think your model has the latest template fixes. (also, there is no point in using F16 models with gpt-oss)

@createthis
Copy link

@ggerganov This last output was generated with https://huggingface.co/unsloth/gpt-oss-120b-GGUF/resolve/main/gpt-oss-120b-F16.gguf downloaded yesterday.

Startup command was:

./build/bin/llama-server \
    --model /data/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
    --alias gpt-oss-120b-F16 \
    --no-webui \
    --numa numactl \
    --threads 32 \
    --ctx-size 131072 \
    --n-gpu-layers 37 \
    -ot "exps.*\.blk.*\.ffn_.*=CUDA0" \
    --no-op-offload \
    -ub 4096 -b 4096 \
    --seed 3407 \
    --temp 0.6 \
    --top-p 1.0 \
    --log-colors \
    --flash-attn \
    --host 0.0.0.0 \
    --jinja \
    --chat-template-kwargs '{"reasoning_effort": "high"}' \
    --port 11434 \
    --verbose

I've got the safetensors downloaded too. Can I just regenerate my bf16 from safetensors using this branch, or do I need your GGUF specifically?

@ggerganov
Copy link
Member

@createthis Can't say for sure - there are too many things to keep track of. To keep things simple, let's report results only with the models in https://huggingface.co/ggml-org/gpt-oss-120b-GGUF

@aldehir
Copy link
Collaborator Author

aldehir commented Aug 11, 2025

@createthis if it's not too much to ask for, can you provide the lines "Grammar awaiting trigger", starting from the last <|start|> token? They should be in the verbose logs. I'm curious how the message was tokenized.

@createthis
Copy link

Grammar awaiting trigger

This text does not appear in the transcript, sorry.

@createthis
Copy link

@createthis Can't say for sure - there are too many things to keep track of. To keep things simple, let's report results only with the models in https://huggingface.co/ggml-org/gpt-oss-120b-GGUF

Copy that. Downloading.

@aldehir
Copy link
Collaborator Author

aldehir commented Aug 11, 2025

Grammar awaiting trigger

This text does not appear in the transcript, sorry.

@createthis "Grammar still awaiting trigger", sorry missed a word there. If there are no such logs when -v, perhaps the issue is elsewhere.

@createthis
Copy link

Grammar

These are the only appearances of the word grammar in the transcript with --verbose:

(base) jesse@Jesses-MacBook-Pro open_hands_gpt_oss % mitmdump -nr full_traffic_no_commit.mitm --flow-detail 4 | grep -i grammar
                "grammar": "",
                "grammar_lazy": false,
                "grammar_triggers": [],
                "grammar": "",
                "grammar_lazy": false,
                "grammar_triggers": [],

@aldehir
Copy link
Collaborator Author

aldehir commented Aug 11, 2025

@createthis thank you. I'm assuming OpenHands does function calling via prompt and not natively? That would explain why there is no grammar. You may want to verify OpenHands is configured to use native function calling; from a quick search, it seems to be supported.
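
For background on why those fields matter: when a client sends tools natively, the server can build a lazy grammar along the lines of the grammar_lazy / grammar_triggers fields in the dump above, i.e. generation stays unconstrained until a trigger is seen, and only then is the tool-call grammar enforced. A toy sketch of that idea (the trigger string is just a plausible example, not necessarily the one used by this PR):

// lazy_grammar.cpp - toy illustration of the "grammar_lazy" / "grammar_triggers" idea.
// Sampling is unconstrained until a trigger string appears in the generated text,
// after which the tool-call grammar would be enforced. This sketch only flips a flag.
#include <iostream>
#include <string>
#include <vector>

struct lazy_grammar {
    std::vector<std::string> triggers;
    bool active = false;

    // Called as generated text accumulates; once a trigger appears, start constraining output.
    void observe(const std::string & generated) {
        if (active) return;
        for (const auto & t : triggers) {
            if (generated.find(t) != std::string::npos) {
                active = true;
                std::cout << "trigger hit: " << t << " -> enforcing tool-call grammar\n";
                return;
            }
        }
    }
};

int main() {
    lazy_grammar g;
    g.triggers = { "<|channel|>commentary to=" }; // assumption: a plausible harmony tool-call trigger
    g.observe("<|channel|>analysis<|message|>thinking...");
    g.observe("<|channel|>analysis<|message|>thinking...<|end|><|start|>assistant<|channel|>commentary to=functions.execute_bash");
    std::cout << (g.active ? "grammar active\n" : "grammar inactive\n");
}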

@createthis
Copy link

@createthis thank you. I'm assuming OpenHands does function calling via prompt and not natively? That would explain why there is no grammar.

I've never been able to get Open Hands <-> llama.cpp native function calling to work, but I can try it. Yes, this transcript was with native function calling turned off. Open Hands uses LiteLLM under the hood, for what it's worth.

@createthis
Copy link

@createthis Can't say for sure - there are too many things to keep track of. To keep things simple, let's report results only with the models in https://huggingface.co/ggml-org/gpt-oss-120b-GGUF

@ggerganov Seems like the same issue with these GGUFs.

@createthis
Copy link

@aldehir It seems to be working with native tool calling enabled. This is -e LLM_NATIVE_TOOL_CALLING=true in the Open Hands docker startup. I'll keep it running and let you know if it chokes.

@createthis
Copy link

@aldehir I can't parse harmony with my brain yet. I'm seeing this:

        "__verbose": {
            "index": 0,
            "content": "<|channel|>analysis<|message|>The user says: \"You have to give linesAfter if you give lineNumber and want more 
than one line\". This is a note about the read_file tool usage: they need to specify lines_before or lines_after to get multiple lines. 
The user is pointing out that the assistant's previous attempts to view file content used read_file with line_number only, which only re
turns a single line. The user is reminding that to get more lines, we need to specify lines_after or lines_before.\n\nNow we need to res
pond appropriately. The user hasn't asked a question; they gave a note. Possibly they want us to adjust our usage of read_file to includ
e lines_after. We should acknowledge and adjust.\n\nWe need to respond acknowledging the note and perhaps ask if they want us to view mo
re lines. Or we can proceed to view the file with more lines. The user hasn't asked a direct question, but they gave a tip. We can respo
nd: \"Thanks for the tip. I'll include lines_after when needed.\" Or we can ask what they want to see.\n\nGiven the instruction: \"If th
e user asks a question, like 'why is X happening', don't try to fix the problem. Just give an answer to the question.\" Here the user ga
ve a note, not a question. So we can respond acknowledging.\n\nThus we should respond: \"Got it, I'll include lines_after when using rea
d_file to view multiple lines.\" Possibly ask if they want to see any specific file.\n\nThus answer: \"Thanks for the reminder. I'll inc
lude lines_after when needed.\"\n\n<|end|><|start|>assistant<|channel|>final <|constrain|>json<|message|>{\n  \"message\": \"Got it \u20
13 I\u2019ll include `lines_after` (or `lines_before`) when using `read_file` to retrieve multiple lines. Let me know if you\u2019d like
 me to view any specific part of a file.\",\n  \"task_completed\": \"true\"\n}",

This seems to be resulting in:

        "choices": [
            {
                "finish_reason": "stop",
                "index": 0,
                "message": {
                    "role": "assistant",
                    "reasoning_content": "The user says: \"You have to give linesAfter if you give lineNumber and want more than one lin
e\". This is a note about the read_file tool usage: they need to specify lines_before or lines_after to get multiple lines. The user is 
pointing out that the assistant's previous attempts to view file content used read_file with line_number only, which only returns a sing
le line. The user is reminding that to get more lines, we need to specify lines_after or lines_before.\n\nNow we need to respond appropr
iately. The user hasn't asked a question; they gave a note. Possibly they want us to adjust our usage of read_file to include lines_afte
r. We should acknowledge and adjust.\n\nWe need to respond acknowledging the note and perhaps ask if they want us to view more lines. Or
 we can proceed to view the file with more lines. The user hasn't asked a direct question, but they gave a tip. We can respond: \"Thanks
 for the tip. I'll include lines_after when needed.\" Or we can ask what they want to see.\n\nGiven the instruction: \"If the user asks 
a question, like 'why is X happening', don't try to fix the problem. Just give an answer to the question.\" Here the user gave a note, n
ot a question. So we can respond acknowledging.\n\nThus we should respond: \"Got it, I'll include lines_after when using read_file to vi
ew multiple lines.\" Possibly ask if they want to see any specific file.\n\nThus answer: \"Thanks for the reminder. I'll include lines_a
fter when needed.\"\n\n",
                    "content": ""
                }
            }
        ],

Is that correct? Should content be empty? Open Hands hasn't merged the PR to add support for the reasoning_content property yet, so I only see content in the UI, and I never see any output from the model. I only see tool calls.

I don't know if it's relevant, but this part of the conversation is because it keeps failing to follow instructions regarding MCP tool calls. I'm manually instructing it to use linesAfter if it uses lineNumber. It's tending to just give lineNumber.

@aldehir
Copy link
Collaborator Author

aldehir commented Aug 11, 2025

@createthis as a temporary workaround, try using --reasoning-format none. This will put the reasoning in think tags and might yield better results. It looks like the model hallucinated a tool call; tool calls should be entirely in the commentary channel. The think tags are subject to change, but I'm curious whether this yields better results on clients that don't send reasoning_content.
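
For clients that only render content, a small client-side split of the think block is one way to cope until reasoning_content is supported. A minimal sketch, assuming the reasoning really is wrapped in <think>...</think> inside content as described above:

// strip_think.cpp - illustrative client-side handling, assuming --reasoning-format none
// wraps the analysis channel in <think>...</think> inside `content` (as described above).
#include <iostream>
#include <string>
#include <utility>

// Returns {reasoning, remaining_content}.
static std::pair<std::string, std::string> split_think(const std::string & content) {
    const std::string open = "<think>", close = "</think>";
    size_t a = content.find(open);
    size_t b = content.find(close);
    if (a == std::string::npos || b == std::string::npos || b < a) {
        return {"", content}; // no think block, pass through unchanged
    }
    std::string reasoning = content.substr(a + open.size(), b - a - open.size());
    std::string rest = content.substr(0, a) + content.substr(b + close.size());
    return {reasoning, rest};
}

int main() {
    auto [reasoning, rest] = split_think("<think>We should list the directory first.</think>Running ls now.");
    std::cout << "reasoning: " << reasoning << "\ncontent: " << rest << "\n";
}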

@createthis
Copy link

@aldehir --reasoning-format none does seem to help. Thanks!

I have a specialized MCP server available that gives the model the ability to use patch with unified diffs. This is the model's idea of a unified diff:

*** Begin Patch
*** Update File: /workspace/scripts-utilities/LambdasESM/SFTPAlert-redacted/tests/SFTPAlert-redacted.test.mts
@@
-    const call = cloudWatchMock.commandCalls(PutMetricDataCommand)[0];
-    const input = call.args[0];
-    expect(input.Namespace).toBe('SFTPNewFileAlert');
+    const call = cloudWatchMock.commandCalls(PutMetricDataCommand)[0];
+    // The mocked client receives the command object as the first argument to `send`.
+    // Access the underlying request payload via the `.input` property.
+    const input = call.args[0].input;
+    expect(input.Namespace).toBe('SFTPNewFileAlert');
*** End Patch

Either harmony syntax has special handling for unified diffs, or this model hasn't been fine-tuned to generate them, which isn't surprising since it's a small model. I think I just have to turn off my MCP server with this model.

@aldehir
Copy link
Collaborator Author

aldehir commented Aug 11, 2025

@createthis one thing to consider is that this model was most likely trained to use the tools in https://github.com/openai/codex. The diff looks awfully similar to their apply_patch tool.

I'm glad it helps.

@createthis
Copy link

@aldehir yup. That's certainly it: https://github.com/openai/codex/blob/5f8984aa7d550955eb5f894d5c29adc2b9901da2/codex-rs/apply-patch/apply_patch_tool_instructions.md

Cool, I may make an adapter on my end.
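
For anyone considering the same, a rough sketch of what such an adapter could start from, based only on the *** Begin Patch envelope shown above (the file path in the example is made up, and a real adapter would still need to locate and apply each hunk):

// apply_patch_adapter.cpp - hypothetical client-side adapter sketch: detect the codex-style
// "*** Begin Patch" envelope and split it into per-file hunks that a custom patching tool
// could then apply. The envelope markers are taken from the snippet quoted above.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct file_patch {
    std::string path;
    std::vector<std::string> hunk_lines; // "@@", " ctx", "-old", "+new" lines, kept verbatim
};

static std::vector<file_patch> parse_apply_patch(const std::string & text) {
    std::vector<file_patch> patches;
    std::istringstream in(text);
    std::string line;
    bool inside = false;
    while (std::getline(in, line)) {
        if (line == "*** Begin Patch") { inside = true;  continue; }
        if (line == "*** End Patch")   { inside = false; continue; }
        if (!inside) continue;
        if (line.rfind("*** Update File: ", 0) == 0) {
            patches.push_back({ line.substr(std::string("*** Update File: ").size()), {} });
        } else if (!patches.empty()) {
            patches.back().hunk_lines.push_back(line);
        }
    }
    return patches;
}

int main() {
    std::string patch =
        "*** Begin Patch\n"
        "*** Update File: /tmp/example.test.mts\n"
        "@@\n"
        "-    const input = call.args[0];\n"
        "+    const input = call.args[0].input;\n"
        "*** End Patch\n";
    for (const auto & p : parse_apply_patch(patch)) {
        std::cout << p.path << ": " << p.hunk_lines.size() << " hunk lines\n";
    }
}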

It successfully completed a moderately difficult agentic programming task. Here's the startup command I used:

./build/bin/llama-server \
    --model /data/gpt-oss-120b-GGUF/ggml-org/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
    --alias gpt-oss-120b-mxfp4 \
    --no-webui \
    --numa numactl \
    --threads 32 \
    --ctx-size 131072 \
    --n-gpu-layers 37 \
    -ot "exps.*\.blk.*\.ffn_.*=CUDA0" \
    --no-op-offload \
    -ub 4096 -b 4096 \
    --seed 3407 \
    --temp 0.6 \
    --top-p 1.0 \
    --log-colors \
    --flash-attn \
    --host 0.0.0.0 \
    --jinja \
    --chat-template-kwargs '{"reasoning_effort": "high"}' \
    --reasoning-format none \
    --port 11434

Performance is excellent. I look forward to using this more in the future.
