gpt-oss: implement harmony parsing #15181
base: master
Conversation
Thanks. It finally made it much easier to use tools in Cherry Studio. And it generates thinking boxes properly. |
With the PR: It's better, easily more usable, but there might be some issues around tool calling still. |
@dagbs try setting function calling to |
Force-pushed from d65e556 to 981886f
I tried this PR yesterday and compared it to #15158 (+ my own fixes on top of that PR). There were a couple of issues with this PR (that I was going to share this morning), but since da67163 was pushed, it seems to finally work better than that PR. In my (albeit limited) testing, tool calling and its formatting are working a lot better. Thanks a ton for this patch @aldehir! All the unit tests pass as well, compared to the other PR, and the code organization at a glance seems better too, though granted I'm no C++ expert, just a generalist. |
Hmm, seems to still be breaking sometimes; I tried to understand why but to no avail. Most of the time it works perfectly fine, but some edge case seems to break it. Running da67163 right now. If I repeatedly use the same weather example maybe 10 times, I end up getting a badly parsed response (on llama.cpp's side) maybe once. A good run looks like this: ChatCompletionResponse {
choices: [
Choice {
message: ResponseMessage {
content: Some(
"Here are the current conditions for the three cities, sorted by temperature (highest\u{202f}→\u{202f}lowest):\n\n- **Barcelona**: ☀\u{fe0f}\u{202f}+25\u{202f}°C \n- **Lima**: ⛅\u{fe0f}\u{202f}+16\u{202f}°C \n- **Stockholm**: ☀\u{fe0f}\u{202f}+13\u{202f}°C \n\n*(Temperatures are taken from the latest weather data at the time of the query.)*",
),
reasoning_content: Some(
"The user asks: \"What is the current weather in Barcelona, Stockholm, and Lima? And also, display them in a list sorted by their temperatures, highest first.\"\n\nWe have fetched weather for each location via the get_weather function. The function returns a JSON string with \"result\": \"Barcelona: ☀\u{fe0f} +25°C\\n\". Similarly for Stockholm: \"Stockholm: ☀\u{fe0f} +13°C\\n\". Lima: \"Lima: ⛅\u{fe0f} +16°C\\n\". We need to parse these results, extract the temperature values, sort them descending, and display them in a list.\n\nWe need to produce a final answer that includes the weather for each location sorted by temperature highest first. The user wants a list sorted by temperature, highest first. So we need to sort: Barcelona +25°C, Lima +16°C, Stockholm +13°C.\n\nThus the sorted list: Barcelona: ☀\u{fe0f} +25°C, Lima: ⛅\u{fe0f} +16°C, Stockholm: ☀\u{fe0f} +13°C.\n\nWe should present them as a list, maybe bullet points.\n\nWe need to ensure we include the weather icons and temperature values as given.\n\nThus answer: \n\n- Barcelona: ☀\u{fe0f} +25°C\n- Lima: ⛅\u{fe0f} +16°C\n- Stockholm: ☀\u{fe0f} +13°C\n\nWe could also include the original strings.\n\nThus final answer: a list sorted by temperature highest first.\n\nWe should also note that the data is from the function calls.\n\nThus answer: \"Here are the current weather conditions for the three cities, sorted by temperature (highest first): ...\"\n\nWe should also mention that the temperatures are approximate and may change.\n\nThus final answer.",
),
tool_calls: [],
},
},
],
}
Meanwhile, a bad run ends up with: ChatCompletionResponse {
choices: [
Choice {
message: ResponseMessage {
content: Some(
" to=functions.get_weather\u{a0}\u{200b}\u{200b}\u{a0}\u{a0}\n\n\n\n",
),
reasoning_content: None,
tool_calls: [],
},
},
],
} Full logs from bad run:
Seems to happen more often when |
@victorb maybe use temperature= 0 and/or top-k 1? If inference is the issue, making it deterministic would fix it. |
Running with these inference parameters for example: {
temperature: 0.0,
top_p: 1.0,
min_p: 0.0,
top_k: 0,
samplers: [
"top_k",
"top_p",
"min_p",
"temperature",
],
} Seems to correctly give me deterministic responses: once I get one good response it always works well, but the ones that break always break, so it's useful for testing at the very least. Here's one example of broken parsing I'm currently getting, even with ChatCompletionResponse {
choices: [
Choice {
message: ResponseMessage {
content: Some(
" to=function\u{a0}\u{a0}...",
),
reasoning_content: None,
tool_calls: [],
},
},
],
}
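For anyone wanting to replay this kind of run, here is a minimal sketch of sending those sampler settings to llama-server's OpenAI-compatible endpoint. The host, port, model alias, and prompt below are assumptions for illustration, not taken from the reports above.

```python
# Minimal sketch: replay the deterministic sampler settings quoted above against
# llama-server's OpenAI-compatible endpoint. Host, port, model alias, and the
# prompt are placeholders.
import requests

payload = {
    "model": "gpt-oss-20b",
    "messages": [
        {"role": "user", "content": "What is the current weather in Barcelona, Stockholm, and Lima?"},
    ],
    "temperature": 0.0,
    "top_p": 1.0,
    "min_p": 0.0,
    "top_k": 0,
    "samplers": ["top_k", "top_p", "min_p", "temperature"],
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=300)
msg = resp.json()["choices"][0]["message"]
# Inspect how the response was parsed: reasoning vs. content vs. tool calls.
print(msg.get("reasoning_content"))
print(msg.get("content"))
print(msg.get("tool_calls"))
```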
Tried setting |
@victorb thank you for that extensive testing. I can't seem to reproduce this on
That will help me better understand the problem. It appears the model is emitting unicode space characters, but I wasn't aware the |
I managed to get Looks like I missed a scenario where the model outputs the recipient (
I have yet to see the I updated the parsing and grammar rule to handle this. It should at least parse the tool calls now. I found performance degrades by the third call. I get queries to "Lima??", "Lima?", or some variation with garbage at the end. However, if I pass Give cf9a0d6 a shot. |
For those interested, I implemented a basic cache for reasoning content in my fork aldehir#1. Without prior reasoning content for tool calls, |
Awesome @aldehir, did a bunch of testing yesterday with 20b and 120b and tool parsing didn't fail once! 🎉 I do see the same inference quality degradation after a few messages, mainly hallucinations for the tool arguments (calling get_weather("...") or get_weather("?") for example) with both 20b and 120b. However, trying out the Overall, seems solid to me now. Since cf9a0d6, the parsing of Harmony seems complete in all the examples I've tried to run, everything goes into the right place and tool calls/responses all look correct now. |
@aldehir using your
If If I wonder if this is related to the grammar generation for the tool calls which is somehow constraining it to always use the first tool. BTW this is the first model I've tried with llama-server that can mix reasoning with tool calls, so it is definitely in the right direction! |
@tarruda good catch. I forgot to group up the tool calls when I reworked the grammar to account for the recipient in the role. I've updated both this PR and the one in my fork. |
Thanks a lot, seems to be working perfectly now! |
I've also been playing with calling tools in its CoT and confirm it is working correctly. For example, if I provide this tool to the LLM: async def arithmetic(code: str) -> str:
"""
Evaluates arithmetic expression and returns the result.
ANY arithmetic questions (no matter how trivial) should make use of this tool in your chain of thought. Always return this tool's response even if it is wrong!
"""
return f"{eval('5 + 5')}" Then it will always use it during reasoning. There's something I'm wondering though: Looking at the template, I can see it tells the LLM about 2 possible builtin tools it can use in its CoT ( |
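For completeness, such a helper is usually advertised to the server as an OpenAI-style tool definition. The sketch below is illustrative; the description text and parameter schema are assumptions derived from the function above, not taken from the comment.

```python
# Illustrative OpenAI-style tool definition for the arithmetic() helper above.
# Description and parameter schema are assumptions for the sketch.
arithmetic_tool = {
    "type": "function",
    "function": {
        "name": "arithmetic",
        "description": "Evaluates an arithmetic expression and returns the result. "
                       "ANY arithmetic questions (no matter how trivial) should use this tool.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Arithmetic expression to evaluate."},
            },
            "required": ["code"],
        },
    },
}
```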
@tarruda what I mean is that before, for example for DeepSeek, tool calls between |
So, to give an example: this was a valid tool call: <think>I will now call a search tool.</think>
<tool_call>
{"name": "search", "arguments": { "query": "foo" }}
</tool_call> This was not: <think>Let's call a search tool:
<tool_call>
{"name": "search", "arguments": { "query": "foo" }}
</tool_call>
</think>
I have successfully called a search tool and am expecting the results. |
I have been vibing a bit with Claude Code + this PR + |
What I meant was that on the OpenAI API (and its extensions), there's no nesting between the message parts, so it must all be parsed and exposed as a flat stream of message/parts by the server. In your example above, the server could parse the tool call within the thinking tags, and it would be exposed as the following pseudo stream:
AFAICT there's no ambiguity on the client side. Client will take all the tool call parts in a message, invoke the actual implementations, and put the tool response parts in the next request. |
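To make the "flat stream" point concrete, here is a rough sketch of the parsed message a client would receive in that scenario. The field names follow the responses quoted elsewhere in this thread; the id and values are placeholders.

```python
# Rough sketch of a flat, OpenAI-style assistant message in which the server has
# parsed both the reasoning and the tool call the model emitted inside it.
parsed_message = {
    "role": "assistant",
    "reasoning_content": "Let's call a search tool:",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",  # placeholder id
            "type": "function",
            "function": {"name": "search", "arguments": "{\"query\": \"foo\"}"},
        },
    ],
}
```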
The server could, but that's not what it did or what it was expected to do by the clients. See the following snippet from static void common_chat_parse_deepseek_r1(common_chat_msg_parser & builder) {
builder.try_parse_reasoning("<think>", "</think>");
if (!builder.syntax().parse_tool_calls) {
builder.add_content(builder.consume_rest());
return;
}
static const common_regex tool_calls_begin("(?:<｜tool▁calls▁begin｜>|<｜tool_calls_begin｜>|<｜tool calls begin｜>|<｜tool\\\\_calls\\\\_begin｜>|<｜tool▁calls｜>)");
static const common_regex tool_calls_end("<｜tool▁calls▁end｜>");
static const common_regex function_regex("(?:<｜tool▁call▁begin｜>)?function<｜tool▁sep｜>([^\n]+)\n```json\n");
static const common_regex close_regex("```[\\s\\r\\n]*<｜tool▁call▁end｜>");
parse_json_tool_calls(
builder,
/* block_open= */ tool_calls_begin,
/* function_regex_start_only= */ std::nullopt,
function_regex,
close_regex,
tool_calls_end);
} The reasoning block is consumed and only then tool call parsing is performed. This was the expected behavior pre-OSS for thinking models. OSS introduces a completely new structure that breaks this implicit contract that tool calls within thinking blocks get ignored. |
Whoa! I just compiled this and ran it in conjunction with my Open Hands AI "Reasoning model" switch patch. The conversation wasn't very long, but it seems to be working! Here's the transcript if anyone is interested: https://github.com/createthis/open_hands_gpt_oss?tab=readme-ov-file#harmony-patch This model is insanely fast on my system: 140 tok/s at 14k context. Wild. I look forward to seeing how competent it is now that we have a working agentic coding path forward. |
I have been doing the same thing. I am using a LiteLLM proxy to do so (perhaps you are as well). Do you know how the system prompt works in conjunction with Claude Code? Claude sets its own system prompt; does this overwrite the reasoning: high setting? Do they co-exist? |
I am using CC + CC Router as explained in #14758.
My guess would be that CC keeps default reasoning settings. Btw, you can change this with For the moment, I am just vibing and not looking at the actual requests in detail. But I think there is a lot that can be optimized in |
FYI, this doesn't work for me. See #15130 |
Thanks for that trick!
Claude Code doesn't seem very efficient in its system prompt handling, which might be fine for Claude but not very good for local models. Using mitmproxy, I extracted parts of the conversation into a nicely readable markdown format: https://gist.github.com/tarruda/edda27617da8e219e70eb4b9b9503a5e One of the issues is that they inject a bunch of information about the date, working directory, and repository/branches in the system prompt, which can mess a bit with kv-caching. Another issue is the number of tools available by default, which consume a lot of context and are rendered after the system prompt (which has variable parts such as branches), not to mention there are a lot of tools that are rarely used (such as notebook manipulation tools). I think they could make it much more efficient by splitting tasks into multiple sub-agents that only have access to a small subset of tools. |
I found the answer to my own question and figured I would share: I ran llama-server with verbose logging to allow me to see the prompt. I also modified the chat template to hardcode reasoning to high. Here is some of it: So it works as expected. A good way of putting it is that the system prompt is now the "super system prompt", and the developer prompt is the old system prompt. Both are correctly injecting what they should in the right place with my setup using llama-server with a LiteLLM proxy. |
I tried this branch on a larger agentic task with Open Hands and enabled "choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "We need to add unit tests for the Lambda at /workspace/scripts-utilities/LambdasESM/SFTPAlert-
redacted. Use existing test as template. We need to explore repository to see structure, code of lambda, existing tests, etc.\n\nF
irst, list directory.We need to explore the repository. Use execute_bash to list directories.",
"content": "We will run a bash command to list the workspace."
}
}
],
"created": 1754936545,
"model": "gpt-oss-120b-F16",
"system_fingerprint": "b6130-6343a7f5",
"object": "chat.completion",
"usage": {
"completion_tokens": 141,
"prompt_tokens": 7839,
"total_tokens": 7980
},
"id": "chatcmpl-1tA9GYCxIo1ofkzW6FS6IFA2jVjgOnQB",
"__verbose": {
"index": 0,
"content": "<|channel|>analysis<|message|>We need to add unit tests for the Lambda at /workspace/scripts-utilities/LambdasESM/SFTPAlert-redacted. Use existing test as template. We need to explore repository to see structure, code of lambda, existing tests, etc.\n\nFirst, list directory.<|end|><|start|>assistant<|channel|>analysis<|message|>We need to explore the repository. Use execute_bash to list directories.<|end|><|start|>assistant<|channel|>commentary<|message|>We will run a bash command to list the workspace.<|end|><|start|>assistant<|channel|>commentary to=execute_bash <|constrain|>json<|message|>{\n \"command\": \"ls -R /workspace/scripts-utilities/LambdasESM | head -n 200\"\n}", It looks like it tries to run
However, it never generates the EDIT: Here's the error from the
Sorry, I can't publish this transcript in full because it has some proprietary data in it, but I'll do my best to provide any additional details from it that I can. |
@createthis thank you for that output, having the content from the verbose object is more than enough to help isolate the issue. It seems the grammar rules might need a bit more tweaking, it doesn't appear to have triggered on the last commentary message. |
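For readers following along, here is a sketch of the shape being discussed, based on the Harmony output quoted above; the command string is a placeholder. A complete tool-call segment ends with a <|call|> terminator, while the problematic output stops at the JSON arguments.

```python
# Illustration only, not taken from the logs: the general shape of a Harmony
# commentary tool call. A complete segment ends with the <|call|> terminator.
complete_call = (
    "<|start|>assistant<|channel|>commentary to=execute_bash "
    '<|constrain|>json<|message|>{"command": "ls /workspace"}<|call|>'
)

# The problematic output above stops after the JSON arguments, leaving nothing
# that closes the tool call for the parser/grammar to trigger on.
truncated_call = (
    "<|start|>assistant<|channel|>commentary to=execute_bash "
    '<|constrain|>json<|message|>{"command": "ls /workspace"}'
)
```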
I can't tell if this is a valid tool call (ignore my comment if it is valid). But just in case, @createthis could you try with the model from https://huggingface.co/ggml-org/gpt-oss-120b-GGUF - I don't think your model has the latest template fixes. (also, there is no point in using F16 models with |
@ggerganov This last output was generated with https://huggingface.co/unsloth/gpt-oss-120b-GGUF/resolve/main/gpt-oss-120b-F16.gguf downloaded yesterday. Startup command was: ./build/bin/llama-server \
--model /data/gpt-oss-120b-GGUF/gpt-oss-120b-F16.gguf \
--alias gpt-oss-120b-F16 \
--no-webui \
--numa numactl \
--threads 32 \
--ctx-size 131072 \
--n-gpu-layers 37 \
-ot "exps.*\.blk.*\.ffn_.*=CUDA0" \
--no-op-offload \
-ub 4096 -b 4096 \
--seed 3407 \
--temp 0.6 \
--top-p 1.0 \
--log-colors \
--flash-attn \
--host 0.0.0.0 \
--jinja \
--chat-template-kwargs '{"reasoning_effort": "high"}' \
--port 11434 \
--verbose I've got the safetensors downloaded too. Can I just regenerate my |
@createthis Can't say for sure - there are too many things to keep track of. To keep things simple, let's report results only with the models in https://huggingface.co/ggml-org/gpt-oss-120b-GGUF |
@createthis if it's not too much to ask for, can you provide the lines "Grammar awaiting trigger", starting from the last |
This text does not appear in the transcript, sorry. |
Copy that. Downloading. |
@createthis "Grammar still awaiting trigger", sorry missed a word there. If there are no such logs when |
These are the only appearances of the word (base) jesse@Jesses-MacBook-Pro open_hands_gpt_oss % mitmdump -nr full_traffic_no_commit.mitm --flow-detail 4 | grep -i grammar
"grammar": "",
"grammar_lazy": false,
"grammar_triggers": [],
"grammar": "",
"grammar_lazy": false,
"grammar_triggers": [], |
@createthis thank you. |
I've never been able to get Open Hands <-> llama.cpp native function calling to work, but I can try it. Yes, this transcript was with native function calling turned off. Open Hands uses LiteLLM under the hood, for what it's worth. |
@ggerganov Seems like the same issue with these GGUFs. |
@aldehir It seems to be working with native tool calling enabled. This is |
@aldehir I can't parse harmony with my brain yet. I'm seeing this: "__verbose": {
"index": 0,
"content": "<|channel|>analysis<|message|>The user says: \"You have to give linesAfter if you give lineNumber and want more
than one line\". This is a note about the read_file tool usage: they need to specify lines_before or lines_after to get multiple lines.
The user is pointing out that the assistant's previous attempts to view file content used read_file with line_number only, which only re
turns a single line. The user is reminding that to get more lines, we need to specify lines_after or lines_before.\n\nNow we need to res
pond appropriately. The user hasn't asked a question; they gave a note. Possibly they want us to adjust our usage of read_file to includ
e lines_after. We should acknowledge and adjust.\n\nWe need to respond acknowledging the note and perhaps ask if they want us to view mo
re lines. Or we can proceed to view the file with more lines. The user hasn't asked a direct question, but they gave a tip. We can respo
nd: \"Thanks for the tip. I'll include lines_after when needed.\" Or we can ask what they want to see.\n\nGiven the instruction: \"If th
e user asks a question, like 'why is X happening', don't try to fix the problem. Just give an answer to the question.\" Here the user ga
ve a note, not a question. So we can respond acknowledging.\n\nThus we should respond: \"Got it, I'll include lines_after when using rea
d_file to view multiple lines.\" Possibly ask if they want to see any specific file.\n\nThus answer: \"Thanks for the reminder. I'll inc
lude lines_after when needed.\"\n\n<|end|><|start|>assistant<|channel|>final <|constrain|>json<|message|>{\n \"message\": \"Got it \u20
13 I\u2019ll include `lines_after` (or `lines_before`) when using `read_file` to retrieve multiple lines. Let me know if you\u2019d like
me to view any specific part of a file.\",\n \"task_completed\": \"true\"\n}", This seems to be resulting in: "choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "The user says: \"You have to give linesAfter if you give lineNumber and want more than one lin
e\". This is a note about the read_file tool usage: they need to specify lines_before or lines_after to get multiple lines. The user is
pointing out that the assistant's previous attempts to view file content used read_file with line_number only, which only returns a sing
le line. The user is reminding that to get more lines, we need to specify lines_after or lines_before.\n\nNow we need to respond appropr
iately. The user hasn't asked a question; they gave a note. Possibly they want us to adjust our usage of read_file to include lines_afte
r. We should acknowledge and adjust.\n\nWe need to respond acknowledging the note and perhaps ask if they want us to view more lines. Or
we can proceed to view the file with more lines. The user hasn't asked a direct question, but they gave a tip. We can respond: \"Thanks
for the tip. I'll include lines_after when needed.\" Or we can ask what they want to see.\n\nGiven the instruction: \"If the user asks
a question, like 'why is X happening', don't try to fix the problem. Just give an answer to the question.\" Here the user gave a note, n
ot a question. So we can respond acknowledging.\n\nThus we should respond: \"Got it, I'll include lines_after when using read_file to vi
ew multiple lines.\" Possibly ask if they want to see any specific file.\n\nThus answer: \"Thanks for the reminder. I'll include lines_a
fter when needed.\"\n\n",
"content": ""
}
}
], Is that correct? Should I don't know if it's relevant, but this part of the conversation is because it keeps failing to follow instructions regarding MCP tool calls. I'm manually instructing it to use |
@createthis as a temporary workaround, try using |
@aldehir I have a specialized MCP server available that gives the model the ability to use *** Begin Patch
*** Update File: /workspace/scripts-utilities/LambdasESM/SFTPAlert-redacted/tests/SFTPAlert-redacted.test.mts
@@
- const call = cloudWatchMock.commandCalls(PutMetricDataCommand)[0];
- const input = call.args[0];
- expect(input.Namespace).toBe('SFTPNewFileAlert');
+ const call = cloudWatchMock.commandCalls(PutMetricDataCommand)[0];
+ // The mocked client receives the command object as the first argument to `send`.
+ // Access the underlying request payload via the `.input` property.
+ const input = call.args[0].input;
+ expect(input.Namespace).toBe('SFTPNewFileAlert');
*** End Patch Either harmony syntax has special handling for unified diffs, or this model hasn't been fine tuned to generate them, which isn't surprising since it's a small model. I think I just have to turn off my MCP server with this model. |
@createthis one thing to consider is that this model was most likely trained to use the tools in https://github.com/openai/codex. The diff looks awfully similar to their I'm glad it helps. |
@aldehir yup. That's certainly it: https://github.com/openai/codex/blob/5f8984aa7d550955eb5f894d5c29adc2b9901da2/codex-rs/apply-patch/apply_patch_tool_instructions.md Cool, I may make an adapter on my end. It successfully completed a moderately difficult agentic programming task. Here's the startup command I used: ./build/bin/llama-server \
--model /data/gpt-oss-120b-GGUF/ggml-org/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
--alias gpt-oss-120b-mxfp4 \
--no-webui \
--numa numactl \
--threads 32 \
--ctx-size 131072 \
--n-gpu-layers 37 \
-ot "exps.*\.blk.*\.ffn_.*=CUDA0" \
--no-op-offload \
-ub 4096 -b 4096 \
--seed 3407 \
--temp 0.6 \
--top-p 1.0 \
--log-colors \
--flash-attn \
--host 0.0.0.0 \
--jinja \
--chat-template-kwargs '{"reasoning_effort": "high"}' \
--reasoning-format none \
--port 11434 Performance is excellent. I look forward to using this more in the future. |
This is my attempt at implementing a harmony parser for gpt-oss.
Implementation
auto and none are supported. When none, <|channel|>analysis<|message|>{reasoning content}<|end|> is added to the content. When parse_tool_calls == false, tool calls are added to the content verbatim--which aligns with other implementations.
Remaining Work
reasoning_content. However, none of the clients I tested send it. A simple workaround is to use reasoning_format = none, or add the reasoning to the content in tool calls.
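As a sketch of that workaround (the request shape below is an assumption for illustration, not part of this PR): a client can either run the server with --reasoning-format none, or echo the parsed reasoning_content back on the assistant message that carried the tool call, e.g.:

```python
# Sketch only: echoing reasoning_content back on the prior assistant tool-call
# message so the model sees its own analysis on the next turn. The ids, names,
# and values are placeholders.
followup_messages = [
    {"role": "user", "content": "What is the current weather in Lima?"},
    {
        "role": "assistant",
        "content": None,
        "reasoning_content": "Need to call get_weather for Lima.",  # echoed back to the server
        "tool_calls": [
            {
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_weather", "arguments": "{\"location\": \"Lima\"}"},
            },
        ],
    },
    {"role": "tool", "tool_call_id": "call_0", "content": "Lima: +16°C"},
]
```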