server : add Anthropic Messages API support #17425
Conversation
Apparently, when you do PRs from an organization, you can't allow maintainers to edit the source branch for some reason. If you want, I can close this PR and re-create it from a personal repository, allowing you to edit my branch.
Does this also support interleaved thinking? I know the official Anthropic API endpoint does, as Sonnet 4.5 uses interleaved thinking. But how would that work for llama.cpp? Some models do support interleaved thinking (e.g. Kimi K2 Thinking, gpt-oss, MiniMax-M2), while others don't, or at least aren't trained with it in mind (e.g. GLM-4.5/4.6 (Air), the Qwen thinking models).
No, it currently doesn't support interleaved thinking. I could perhaps try to implement it, though. From what I found after some searching, it's mainly very large models that support it, like Kimi K2. I don't really have the hardware to run models of that size, so testing it would be an issue. Do you know of any smaller (~30B) model that does interleaved thinking? Also, does llama.cpp support interleaved thinking?
So one of the smaller models that supports it is gpt-oss-20b, but I doubt it's a good candidate due to the Harmony format & parsing. But maybe it's still useful? As for interleaved thinking, there are two ways it's supported:
Ah ok, I'll try with gpt-oss-20b tomorrow and see how it behaves. Thanks for the explanation.
I think another alternative might be MiniMax-M2 (the REAP version also makes it even smaller), which is a lot smaller than Kimi K2.
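For reference, in the Anthropic Messages format extended thinking is requested through the thinking parameter. Below is a rough sketch of what such a request against a local llama-server could look like; the localhost URL, port, and model name are placeholders, and whether this PR honors budget_tokens is an assumption, not something confirmed in the thread.

```bash
# Sketch: Anthropic-format request with extended thinking enabled.
# Endpoint path follows Anthropic's /v1/messages; port and model name are placeholders.
curl -X POST http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "max_tokens": 4096,
    "thinking": {"type": "enabled", "budget_tokens": 2048},
    "messages": [
      {"role": "user", "content": "What is 17 * 24?"}
    ]
  }'
```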
ngxson left a comment:
This PR adds quite a lot of code, while the user demand is maybe not very high (I haven't seen many users asking about this feature). Therefore, I'm quite hesitant to merge it and possibly pollute the code base with rarely-used features.
Also, just to point out, there are already many projects that translate/proxy between the Anthropic and OpenAI formats:
@Mushoz IMO your comments can be a bit off-topic, as the current PR only introduces the "formatting" to make the API return the Anthropic format. The behavior stays the same. The reasoning parsing behavior is controlled by
Regarding interleaved thinking, I tested with quite a few models now and they seem to work fine, at least in Claude Code.
I did find a bug with streaming responses and tool calling, for which there's a fix now.
Yes, there are proxies. I kind of have the opposite take-away from that than you do, though: there are a lot of popular proxies because demand is high. I was kind of hoping to negate the need for those; it's certainly easier to just start llama-server + Claude Code. It also seems like the industry is moving towards supporting Anthropic's API, with e.g. Moonshot, MiniMax and DeepSeek providing first-party support. An added benefit is that an implementation inside llama-server makes it possible to properly implement parts of Anthropic's API such as count_tokens, top_k, the thinking parameter, etc. I'll update the code with the suggestions you provided.
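As an illustration of one of the endpoints mentioned above: Anthropic's token-counting endpoint lives at /v1/messages/count_tokens and returns an input_tokens count. A sketch of what that could look like against a local llama-server, assuming the PR exposes the same path; the port and model name are placeholders:

```bash
# Sketch: token counting in Anthropic's format against a local server.
# Path mirrors Anthropic's count_tokens endpoint; port and model name are placeholders.
curl -X POST http://localhost:8080/v1/messages/count_tokens \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [
      {"role": "user", "content": "Hello, how many tokens is this?"}
    ]
  }'
# Expected response shape in the Anthropic format: {"input_tokens": <n>}
```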
FWIW I'd support merging this as long as it's properly feature-separated, moved to separate files, etc. I think there's not that much pressure currently because there are workarounds, but it would really add to llama.cpp's marketing capabilities if it could be used "out of the box" with Claude Code (and other Anthropic-based apps) without the need for a proxy.
I agree that it would be cleaner and give better code separation to have it in separate files, but since
I can move it to separate files if that is the consensus.
Hi @noname22, thanks for the great work. When used with curl, the Anthropic API takes the key like this:

curl -X POST \
  https://api.anthropic.com/v1/messages \
  -H "x-api-key: $YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-3-opus-20240229",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Hello, Claude"}
    ]
  }'

llama-server expects them as:
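For reference, llama-server's --api-key option checks the standard Authorization: Bearer header rather than x-api-key, so the equivalent setup against a local server would look roughly like this (a sketch; the port, key, model path, and model name are placeholders):

```bash
# Sketch: start the server with an API key (placeholder model path and key).
llama-server -m model.gguf --port 8080 --api-key "$YOUR_API_KEY"

# Equivalent request: the key goes in the Authorization header, not x-api-key.
curl -X POST http://localhost:8080/v1/messages \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```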
@calvin2021y Oh ok! I didn't even think of API keys, to be honest. Nice catch. What's the best approach here? Any way to "convince" llama-server to also accept API keys in an x-api-key header?
@noname22 Maybe here? llama.cpp/tools/server/server-http.cpp, lines 120 to 166 (at d414db0)
I added support for x-api-key headers and verified that it does work with Claude Code like this:
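The exact command wasn't captured above. As a rough sketch, assuming Claude Code honors the ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY environment variables, pointing it at a local llama-server could look like this; the port and key are placeholders:

```bash
# Sketch: point Claude Code at a local llama-server instead of api.anthropic.com.
# Assumes Claude Code reads ANTHROPIC_BASE_URL / ANTHROPIC_API_KEY; port and key are placeholders.
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_API_KEY="$YOUR_API_KEY"
claude
```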
The conflicts with master were quite involved. It seems like there was quite a large refactoring effort on the server code. It will take a bit to fix.
@noname22 @ngxson was refactoring the code to split it into parts. I think at this point your best bet would be to take the final code changes from your PR and overlay them onto a clean master build; it's really hard to merge "artificial" merge conflicts that arise from refactoring and moving code between files.
Force-pushed 93868f9 to aa6192d
@pwilkin yep, that's what I ended up doing 👍
ngxson left a comment:
Some parts of the code (like the tools and messages handling) still internally convert the schema from the Anthropic format to the OpenAI format, which defeats the whole point of separating them into two functions, as I asked in the previous review.
And worse, this ends up duplicating a lot of code between anthropic_params_from_json and params_from_json_cmpl, which makes the code much more difficult to maintain.
So I'm now thinking it's better to keep the older version, which converts Anthropic to OpenAI, and try to improve it in the future. Just remember to name the function more intuitively, like convert_anthropic_to_oai(...).
…se64_with_multimodal_model in test_anthropic_api.py
…response handler for /v1/chat/completions and use unordered_set instead of set in to_json_anthropic_stream()
I believe everything has been addressed.
@noname22 I cannot push some commits to clean things up; can you open a new PR from a personal account?
Here's the new PR: #17570
Summary
This PR adds Anthropic Messages API compatibility to llama-server. The implementation converts Anthropic's format to the OpenAI-compatible internal format, reusing the existing inference pipeline.
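As a quick illustration of the request shape being targeted (a sketch, not the PR's documented example; the endpoint path follows Anthropic's /v1/messages, and the port and model name are placeholders):

```bash
# Sketch: minimal Anthropic-format request against a local llama-server.
# Port and model name are placeholders.
curl -X POST http://localhost:8080/v1/messages \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "max_tokens": 256,
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```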
Motivation
Features Implemented
Endpoints:
Functionality:
Testing