
Conversation

noname22 (Contributor) commented Nov 21, 2025

[image: claude-code]

Summary

This PR adds Anthropic Messages API compatibility to llama-server. The implementation converts Anthropic's format to the OpenAI-compatible internal format, reusing the existing inference pipeline.
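
As a rough illustration of the approach (a simplified sketch, not the exact code in this PR; the helper name and field handling here are assumptions, and `using json = nlohmann::json;` is assumed as elsewhere in the server), the request conversion is along these lines:

// Illustrative sketch only: map the core fields of an Anthropic /v1/messages
// request onto the OpenAI chat-completion shape used internally. The real
// implementation also handles content blocks, tools, images, thinking and
// streaming options.
static json convert_anthropic_to_oai(const json & body) {
    json oai;
    oai["model"]      = body.value("model", "");
    oai["max_tokens"] = body.value("max_tokens", 1024);

    json messages = json::array();

    // Anthropic carries the system prompt in a top-level "system" field;
    // OpenAI expects it as the first message.
    if (body.contains("system")) {
        messages.push_back({{"role", "system"}, {"content", body["system"]}});
    }

    // Anthropic message content may be a plain string or an array of content
    // blocks; only the plain-string case is shown here.
    for (const auto & msg : body.at("messages")) {
        messages.push_back({{"role", msg.at("role")}, {"content", msg.at("content")}});
    }

    oai["messages"] = messages;
    return oai;
}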

Motivation

  • Enables llama.cpp to serve as a local/self-hosted alternative to Anthropic's Claude API
  • Allows Claude Code and other Anthropic-compatible clients to work with llama-server

Features Implemented

Endpoints:

  • POST /v1/messages - Chat completions with streaming support
  • POST /v1/messages/count_tokens - Token counting for prompts

Functionality:

  • Streaming with proper Anthropic SSE event types (message_start, content_block_delta, etc.; a sketch of the event sequence follows this list)
  • Tool use (function calling) with tool_use/tool_result content blocks
  • Vision support with image content blocks (base64 and URL)
  • System prompts and multi-turn conversations
  • Extended thinking parameter support
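
For illustration, the Anthropic streaming format wraps each chunk in typed SSE events roughly in the order below. This is a simplified sketch: the emit helper and exact payloads are assumptions, and the real handler also emits ping events, tool_use blocks, and usage information.

// Illustrative sketch of the Anthropic SSE event order for a streamed text
// reply, assuming `using json = nlohmann::json;` and a caller-supplied
// emit(event_name, payload) helper that writes "event: ...\ndata: ...\n\n".
static void stream_anthropic_events(const std::string & model,
                                    const std::vector<std::string> & text_deltas,
                                    const std::function<void(const std::string &, const json &)> & emit) {
    emit("message_start", {
        {"type", "message_start"},
        {"message", {{"id", "msg_local"}, {"role", "assistant"}, {"model", model}}}
    });
    emit("content_block_start", {
        {"type", "content_block_start"}, {"index", 0},
        {"content_block", {{"type", "text"}, {"text", ""}}}
    });
    for (const auto & piece : text_deltas) {
        emit("content_block_delta", {
            {"type", "content_block_delta"}, {"index", 0},
            {"delta", {{"type", "text_delta"}, {"text", piece}}}
        });
    }
    emit("content_block_stop", {{"type", "content_block_stop"}, {"index", 0}});
    emit("message_delta",      {{"type", "message_delta"}, {"delta", {{"stop_reason", "end_turn"}}}});
    emit("message_stop",       {{"type", "message_stop"}});
}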

Testing

  • Tests in test_anthropic_api.py
  • Tests cover: basic messages, streaming, tools, vision, token counting, parameters, error handling, content block indices

noname22 (Contributor, Author) commented Nov 21, 2025

Apparently, when you do PRs from an organization, you can't allow maintainers to edit the source branch for some reason. If you want, I can close this PR and re-create it from a personal repository, allowing you to edit my branch.

Mushoz commented Nov 21, 2025

Does this also support interleaved thinking? I know the official Anthropic API endpoint does, as Sonnet 4.5 uses interleaved thinking. But how would that work for llama.cpp? Some models do support interleaved thinking (e.g. Kimi K2 Thinking, gpt-oss, MiniMax-M2), while others don't, or at least aren't trained with it in mind (e.g. GLM-4.5/4.6 (Air), the Qwen thinking models).

noname22 (Contributor, Author)

No, it currently doesn't support interleaved thinking. I could perhaps try to implement it, though. From what I found after some searching, it's mainly very large models that support it, like Kimi K2. I don't really have the hardware to run models of that size, so testing it would be an issue. Do you know of any smaller (~30B) model that does interleaved thinking?

Also, does llama.cpp support interleaved thinking?

Mushoz commented Nov 21, 2025

> No, it currently doesn't support interleaved thinking. I could perhaps try to implement it, though. From what I found after some searching, it's mainly very large models that support it, like Kimi K2. I don't really have the hardware to run models of that size, so testing it would be an issue. Do you know of any smaller (~30B) model that does interleaved thinking?
>
> Also, does llama.cpp support interleaved thinking?

So one of the smaller models that supports it is gpt-oss-20b, though I doubt it's a good candidate due to the Harmony format & parsing. But maybe it's still useful? As for interleaved thinking, there are two ways it can be supported:

  1. llama.cpp currently sends out the reasoning in the reasoning_content field. If the client sends the reasoning back in the same reasoning_content field, then with the proper chat template it can be embedded in the follow-up prompts. This requires client support (as it has to send back the reasoning) and support in the template. This is how it works in gpt-oss, for example.
  2. Another way to support it is to keep the <think>reasoning content</think> inside the normal response content, so it's automatically sent back by the client in follow-up requests. Models with the proper chat template can then split on these <think> and </think> tags, extract the reasoning, and add it to the prompt. The MiniMax-M2 chat template tries to extract the reasoning from reasoning_content if present, else it tries to parse the tags manually (a sketch of this splitting follows).
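
For concreteness, a minimal sketch of the second mechanism (not code from this PR or from any particular chat template; the first mechanism is simply echoing reasoning_content back unchanged):

// Illustrative sketch: split a "<think>...</think>" block out of a returned
// message so the reasoning can be re-inserted into the next prompt.
// Returns {reasoning, remaining_content}.
#include <string>
#include <utility>

static std::pair<std::string, std::string> split_think_tags(const std::string & content) {
    const std::string open  = "<think>";
    const std::string close = "</think>";
    const size_t b = content.find(open);
    const size_t e = content.find(close, b == std::string::npos ? 0 : b);
    if (b == std::string::npos || e == std::string::npos) {
        return {"", content}; // no reasoning block found
    }
    std::string reasoning = content.substr(b + open.size(), e - b - open.size());
    std::string rest      = content.substr(0, b) + content.substr(e + close.size());
    return {reasoning, rest};
}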

noname22 (Contributor, Author)

Ah ok, I'll try with gpt-oss-20b tomorrow and see how it behaves. Thanks for the explanation.

fernandaspets

> Ah ok, I'll try with gpt-oss-20b tomorrow and see how it behaves. Thanks for the explanation.

I think another alternative might be MiniMax-M2 (the REAP version also makes it even smaller), which is a lot smaller than Kimi K2.

ngxson (Collaborator) left a comment

This PR adds quite a lot of code, while user demand is maybe not very high (I haven't seen many users asking about this feature). Therefore, I'm quite hesitant to merge it, as it could pollute the code base with rarely-used features.

Also, just to point out, there are already many projects that can translate/proxy between the Anthropic and OpenAI formats:

ngxson (Collaborator) commented Nov 22, 2025

@Mushoz IMO your comments are a bit off-topic, as the current PR only introduces the "formatting" to make the API return the Anthropic format. The behavior stays the same.

The reasoning-parsing behavior is controlled by common/chat.cpp; it is unrelated to the server.

noname22 (Contributor, Author)

@Mushoz @fernandaspets

Regarding interleaved thinking: I've tested with quite a few models now and they seem to work fine, at least in Claude Code.

  • gpt-oss-20b
  • gpt-oss-120b
  • MiniMax-M2-UD-IQ1_M
  • Qwen3-Coder-30B-A3B-Instruct-Q4_K_M

I did find a bug regarding streaming responses and tool calling, for which there's a fix now.

noname22 (Contributor, Author) commented Nov 22, 2025

@ngxson

Yes, there are proxies. I have kind of the opposite takeaway from that, though: there are a lot of popular proxies because demand is high. I was hoping to remove the need for them; it's certainly easier to just start llama-server + Claude Code.

It also seems like the industry is moving towards supporting Anthropic's API, with e.g. Moonshot, MiniMax, and DeepSeek providing first-party support.

An added benefit is that an implementation inside llama-server makes it possible to properly support parts of Anthropic's API such as the count_tokens endpoint and the top_k and thinking parameters.

I'll update the code with the suggestions you provided.

pwilkin (Collaborator) commented Nov 23, 2025

> This PR adds quite a lot of code, while user demand is maybe not very high (I haven't seen many users asking about this feature). Therefore, I'm quite hesitant to merge it, as it could pollute the code base with rarely-used features.

FWIW I'd support merging this as long as it's properly feature-separated, moved to separate files, etc. I think there's not that much pressure currently because there are workarounds, but it would really add to llama.cpp's marketing capabilities if it could be used "out of the box" with Claude Code (and other Anthropic-based apps) without the need for a proxy.

noname22 (Contributor, Author)

> This PR adds quite a lot of code, while user demand is maybe not very high (I haven't seen many users asking about this feature). Therefore, I'm quite hesitant to merge it, as it could pollute the code base with rarely-used features.

> FWIW I'd support merging this as long as it's properly feature-separated, moved to separate files, etc. I think there's not that much pressure currently because there are workarounds, but it would really add to llama.cpp's marketing capabilities if it could be used "out of the box" with Claude Code (and other Anthropic-based apps) without the need for a proxy.

I agree that it would be cleaner, with better code separation, to have it in separate files, but since CONTRIBUTING.md has the following line, I didn't do that:

> Avoid adding third-party dependencies, extra files, extra headers, etc.

I can move it to separate files if that is the consensus.

calvin2021y

hi @noname22

Thanks for the great work.

When used with --api-key, it gets an Unauthorized error, because the Anthropic API passes the key like this:

curl -X POST \
  https://api.anthropic.com/v1/messages \
  -H "x-api-key: $YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-3-opus-20240229",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Hello, Claude"}
    ]
  }'

llama-server expects it as: Authorization: Bearer $YOUR_API_KEY

noname22 (Contributor, Author)

@calvin2021y

Oh ok! I didn't even think of API keys to be honest. Nice catch.

What's the best approach here? Any way to "convince" llama-server to also accept API keys as an x-api-key header?

hksdpc255 (Contributor)

@noname22 Maybe here?

auto middleware_validate_api_key = [api_keys = params.api_keys](const httplib::Request & req, httplib::Response & res) {
    static const std::unordered_set<std::string> public_endpoints = {
        "/health",
        "/v1/health",
        "/models",
        "/v1/models",
        "/api/tags"
    };

    // If API key is not set, skip validation
    if (api_keys.empty()) {
        return true;
    }

    // If path is public or is static file, skip validation
    if (public_endpoints.find(req.path) != public_endpoints.end() || req.path == "/") {
        return true;
    }

    // Check for API key in the header
    auto auth_header = req.get_header_value("Authorization");

    std::string prefix = "Bearer ";
    if (auth_header.substr(0, prefix.size()) == prefix) {
        std::string received_api_key = auth_header.substr(prefix.size());
        if (std::find(api_keys.begin(), api_keys.end(), received_api_key) != api_keys.end()) {
            return true; // API key is valid
        }
    }

    // API key is invalid or not provided
    res.status = 401;
    res.set_content(
        safe_json_to_str(json {
            {"error", {
                {"message", "Invalid API Key"},
                {"type",    "authentication_error"},
                {"code",    401}
            }}
        }),
        "application/json; charset=utf-8"
    );

    LOG_WRN("Unauthorized: Invalid API Key\n");

    return false;
};
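
For reference, one minimal way to extend this would be a small helper that accepts either header form and is used in place of the Authorization parsing above. This is an illustrative sketch, not necessarily the change that was actually made in the PR:

// Illustrative sketch: pull the key from either an OpenAI-style
// "Authorization: Bearer <key>" header or an Anthropic-style "x-api-key: <key>"
// header, so the same api_keys lookup above can validate both kinds of client.
// Assumes httplib.h is already included, as it is in the server code.
static std::string extract_api_key(const httplib::Request & req) {
    const std::string bearer = "Bearer ";
    const std::string auth   = req.get_header_value("Authorization");
    if (auth.compare(0, bearer.size(), bearer) == 0) {
        return auth.substr(bearer.size());        // OpenAI-style clients
    }
    if (req.has_header("x-api-key")) {
        return req.get_header_value("x-api-key"); // Anthropic-style clients
    }
    return "";
}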

noname22 (Contributor, Author)

I added support for x-api-key headers and verified that it works with Claude Code like this:

bin/llama-server --api-key mykey [...]
ANTHROPIC_API_KEY=mykey ANTHROPIC_BASE_URL=http://localhost:8080 claude

noname22 (Contributor, Author)

The conflicts with master were quite involved. It seems like there was a large refactoring effort on the server. It will take a bit to fix.

pwilkin (Collaborator) commented Nov 25, 2025

@noname22 @ngxson was refactoring the code to split it into parts. I think at this point your best bet would be to take the final code changes from your PR and overlay them onto a clean master; it's really hard to resolve "artificial" merge conflicts that arise from refactoring and moving code between files.

noname22 force-pushed the feature/anthropic-api-support branch from 93868f9 to aa6192d on November 25, 2025, 16:05
noname22 (Contributor, Author)

@pwilkin yep, that's what I ended up doing 👍

ngxson (Collaborator) left a comment

Some parts of the code (like tools and messages handling) still internally convert the schema from the Anthropic format to the OpenAI format, which defeats the whole point of separating them into two functions, as I asked for in the previous review.

And worse, this ends up duplicating a lot of code between anthropic_params_from_json and params_from_json_cmpl, which makes the code much more difficult to maintain.

So I'm now thinking it's better to keep the older version, converting Anthropic to OpenAI, and try to improve it in the future. Just remember to name the function more intuitively, like convert_anthropic_to_oai(...).

noname22 (Contributor, Author)

I believe everything has been addressed.

ngxson (Collaborator) commented Nov 27, 2025

@noname22 I cannot push commits to clean things up; can you open a new PR from a personal account?

noname22 (Contributor, Author) commented Nov 28, 2025

Here's the new PR: #17570

Labels: examples, python (python script changes), server