
Conversation

noname22 (Contributor) commented Nov 21, 2025

[image: claude-code]

Summary

This PR adds Anthropic Messages API compatibility to llama-server. The implementation converts Anthropic's format to the OpenAI-compatible internal format, reusing the existing inference pipeline.
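
As a rough illustration of the approach (a simplified sketch, not the exact code in this PR; the helper name and field handling here are assumptions, and `using json = nlohmann::json;` is assumed as elsewhere in the server), the request conversion is along these lines:

// Illustrative sketch only: map the core fields of an Anthropic /v1/messages
// request onto the OpenAI chat-completion shape used internally. The real
// implementation also handles content blocks, tools, images, thinking and
// streaming options.
static json convert_anthropic_to_oai(const json & body) {
    json oai;
    oai["model"]      = body.value("model", "");
    oai["max_tokens"] = body.value("max_tokens", 1024);

    json messages = json::array();

    // Anthropic carries the system prompt in a top-level "system" field;
    // OpenAI expects it as the first message.
    if (body.contains("system")) {
        messages.push_back({{"role", "system"}, {"content", body["system"]}});
    }

    // Anthropic message content may be a plain string or an array of content
    // blocks; only the plain-string case is shown here.
    for (const auto & msg : body.at("messages")) {
        messages.push_back({{"role", msg.at("role")}, {"content", msg.at("content")}});
    }

    oai["messages"] = messages;
    return oai;
}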

Motivation

  • Enables llama.cpp to serve as a local/self-hosted alternative to Anthropic's Claude API
  • Allows Claude Code and other Anthropic-compatible clients to work with llama-server

Features Implemented

Endpoints:

  • POST /v1/messages - Chat completions with streaming support
  • POST /v1/messages/count_tokens - Token counting for prompts

Functionality:

  • Streaming with proper Anthropic SSE event types (message_start, content_block_delta, etc.; a sketch of the event sequence follows this list)
  • Tool use (function calling) with tool_use/tool_result content blocks
  • Vision support with image content blocks (base64 and URL)
  • System prompts and multi-turn conversations
  • Extended thinking parameter support
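
For illustration, the Anthropic streaming format wraps each chunk in typed SSE events roughly in the order below. This is a simplified sketch: the emit helper and exact payloads are assumptions, and the real handler also emits ping events, tool_use blocks, and usage information.

// Illustrative sketch of the Anthropic SSE event order for a streamed text
// reply, assuming `using json = nlohmann::json;` and a caller-supplied
// emit(event_name, payload) helper that writes "event: ...\ndata: ...\n\n".
static void stream_anthropic_events(const std::string & model,
                                    const std::vector<std::string> & text_deltas,
                                    const std::function<void(const std::string &, const json &)> & emit) {
    emit("message_start", {
        {"type", "message_start"},
        {"message", {{"id", "msg_local"}, {"role", "assistant"}, {"model", model}}}
    });
    emit("content_block_start", {
        {"type", "content_block_start"}, {"index", 0},
        {"content_block", {{"type", "text"}, {"text", ""}}}
    });
    for (const auto & piece : text_deltas) {
        emit("content_block_delta", {
            {"type", "content_block_delta"}, {"index", 0},
            {"delta", {{"type", "text_delta"}, {"text", piece}}}
        });
    }
    emit("content_block_stop", {{"type", "content_block_stop"}, {"index", 0}});
    emit("message_delta",      {{"type", "message_delta"}, {"delta", {{"stop_reason", "end_turn"}}}});
    emit("message_stop",       {{"type", "message_stop"}});
}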

Testing

  • Tests in test_anthropic_api.py
  • Tests cover: basic messages, streaming, tools, vision, token counting, parameters, error handling, content block indices

noname22 (Contributor, Author) commented Nov 21, 2025

Apparently, when you do PRs from an organization, you can't allow maintainers to edit the source branch for some reason. If you want, I can close this PR and re-create it from a personal repository, allowing you to edit my branch.

Mushoz commented Nov 21, 2025

Does this also support interleaved thinking? I know the official Anthropic API endpoint does, as Sonnet 4.5 uses interleaved thinking. But how would that work for llama.cpp? Some models do support interleaved thinking (e.g. Kimi K2 Thinking, gpt-oss, MiniMax-M2), while others don't, or at least aren't trained with it in mind (e.g. GLM-4.5/4.6 (Air), the Qwen thinking models).

noname22 (Contributor, Author)

No, it currently doesn't support interleaved thinking. I could perhaps try to implement it, though. From what I found after some searching, it's mainly very large models that support it, like Kimi K2. I don't really have the hardware to run models of that size, so testing it would be an issue. Do you know of any smaller (~30B) model that does interleaved thinking?

Also, does llama.cpp support interleaved thinking?

Mushoz commented Nov 21, 2025

> No, it currently doesn't support interleaved thinking. I could perhaps try to implement it, though. From what I found after some searching, it's mainly very large models that support it, like Kimi K2. I don't really have the hardware to run models of that size, so testing it would be an issue. Do you know of any smaller (~30B) model that does interleaved thinking?
>
> Also, does llama.cpp support interleaved thinking?

So one of the smaller models that supports it is gpt-oss-20b, though I doubt it's a good candidate due to the Harmony format & parsing. But maybe it's still useful? As for interleaved thinking, there are two ways it can be supported:

  1. llama.cpp currently sends out the reasoning in the reasoning_content field. If the client sends the reasoning back in the same reasoning_content field, then with the proper chat template it can be embedded in the follow-up prompts. This requires client support (as it has to send back the reasoning) and support in the template. This is how it works in gpt-oss, for example.
  2. Another way to support it is to keep the <think>reasoning content</think> inside the normal response content, so it's automatically sent back by the client in follow-up requests. Models with the proper chat template can then split on these <think> and </think> tags, extract the reasoning, and add it to the prompt. The MiniMax-M2 chat template tries to extract the reasoning from reasoning_content if present, else it tries to parse the tags manually (a sketch of this splitting follows).
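
For concreteness, a minimal sketch of the second mechanism (not code from this PR or from any particular chat template; the first mechanism is simply echoing reasoning_content back unchanged):

// Illustrative sketch: split a "<think>...</think>" block out of a returned
// message so the reasoning can be re-inserted into the next prompt.
// Returns {reasoning, remaining_content}.
#include <string>
#include <utility>

static std::pair<std::string, std::string> split_think_tags(const std::string & content) {
    const std::string open  = "<think>";
    const std::string close = "</think>";
    const size_t b = content.find(open);
    const size_t e = content.find(close, b == std::string::npos ? 0 : b);
    if (b == std::string::npos || e == std::string::npos) {
        return {"", content}; // no reasoning block found
    }
    std::string reasoning = content.substr(b + open.size(), e - b - open.size());
    std::string rest      = content.substr(0, b) + content.substr(e + close.size());
    return {reasoning, rest};
}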

noname22 (Contributor, Author)

Ah ok, I'll try with gpt-oss-20b tomorrow and see how it behaves. Thanks for the explanation.

fernandaspets

> Ah ok, I'll try with gpt-oss-20b tomorrow and see how it behaves. Thanks for the explanation.

I think another alternative might be MiniMax-M2 (the REAP version also makes it even smaller), which is a lot smaller than Kimi K2.

ngxson (Collaborator) left a comment

This PR adds quite a lot of code, while user demand is maybe not very high (I haven't seen many users asking about this feature). Therefore, I'm quite hesitant to merge it, as it could pollute the code base with rarely-used features.

Also, just to point out, there are already many projects that can translate/proxy between the Anthropic and OpenAI formats:

ngxson (Collaborator) commented Nov 22, 2025

@Mushoz IMO your comments are a bit off-topic, as the current PR only introduces the "formatting" to make the API return the Anthropic format. The behavior stays the same.

The reasoning-parsing behavior is controlled by common/chat.cpp; it is unrelated to the server.

noname22 (Contributor, Author)

@Mushoz @fernandaspets

Regarding interleaved thinking: I've tested with quite a few models now and they seem to work fine, at least in Claude Code.

  • gpt-oss-20b
  • gpt-oss-120b
  • MiniMax-M2-UD-IQ1_M
  • Qwen3-Coder-30B-A3B-Instruct-Q4_K_M

I did find a bug regarding streaming responses and tool calling, for which there's a fix now.

noname22 (Contributor, Author) commented Nov 22, 2025

@ngxson

Yes, there are proxies. I have kind of the opposite takeaway from that, though: there are a lot of popular proxies because demand is high. I was hoping to remove the need for them; it's certainly easier to just start llama-server + Claude Code.

It also seems like the industry is moving towards supporting Anthropic's API, with e.g. Moonshot, MiniMax, and DeepSeek providing first-party support.

An added benefit is that an implementation inside llama-server makes it possible to properly support parts of Anthropic's API such as the count_tokens endpoint and the top_k and thinking parameters.

I'll update the code with the suggestions you provided.

pwilkin (Collaborator) commented Nov 23, 2025

> This PR adds quite a lot of code, while user demand is maybe not very high (I haven't seen many users asking about this feature). Therefore, I'm quite hesitant to merge it, as it could pollute the code base with rarely-used features.

FWIW I'd support merging this as long as it's properly feature-separated, moved to separate files, etc. I think there's not that much pressure currently because there are workarounds, but it would really add to llama.cpp's marketing capabilities if it could be used "out of the box" with Claude Code (and other Anthropic-based apps) without the need for a proxy.

noname22 (Contributor, Author)

> This PR adds quite a lot of code, while user demand is maybe not very high (I haven't seen many users asking about this feature). Therefore, I'm quite hesitant to merge it, as it could pollute the code base with rarely-used features.

> FWIW I'd support merging this as long as it's properly feature-separated, moved to separate files, etc. I think there's not that much pressure currently because there are workarounds, but it would really add to llama.cpp's marketing capabilities if it could be used "out of the box" with Claude Code (and other Anthropic-based apps) without the need for a proxy.

I agree that it would be cleaner, with better code separation, to have it in separate files, but since CONTRIBUTING.md has the following line, I didn't do that:

> Avoid adding third-party dependencies, extra files, extra headers, etc.

I can move it to separate files if that is the consensus.

calvin2021y

hi @noname22

Thanks for the great work.

When used with --api-key, it gets an Unauthorized error, because the Anthropic API passes the key like this:

curl -X POST \
  https://api.anthropic.com/v1/messages \
  -H "x-api-key: $YOUR_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-3-opus-20240229",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Hello, Claude"}
    ]
  }'

llama-server expects it as: Authorization: Bearer $YOUR_API_KEY

noname22 (Contributor, Author)

@calvin2021y

Oh ok! I didn't even think of API keys to be honest. Nice catch.

What's the best approach here? Any way to "convince" llama-server to also accept API keys as an x-api-key header?

hksdpc255 (Contributor)

@noname22 Maybe here?

auto middleware_validate_api_key = [api_keys = params.api_keys](const httplib::Request & req, httplib::Response & res) {
    static const std::unordered_set<std::string> public_endpoints = {
        "/health",
        "/v1/health",
        "/models",
        "/v1/models",
        "/api/tags"
    };

    // If API key is not set, skip validation
    if (api_keys.empty()) {
        return true;
    }

    // If path is public or is static file, skip validation
    if (public_endpoints.find(req.path) != public_endpoints.end() || req.path == "/") {
        return true;
    }

    // Check for API key in the header
    auto auth_header = req.get_header_value("Authorization");

    std::string prefix = "Bearer ";
    if (auth_header.substr(0, prefix.size()) == prefix) {
        std::string received_api_key = auth_header.substr(prefix.size());
        if (std::find(api_keys.begin(), api_keys.end(), received_api_key) != api_keys.end()) {
            return true; // API key is valid
        }
    }

    // API key is invalid or not provided
    res.status = 401;
    res.set_content(
        safe_json_to_str(json {
            {"error", {
                {"message", "Invalid API Key"},
                {"type",    "authentication_error"},
                {"code",    401}
            }}
        }),
        "application/json; charset=utf-8"
    );

    LOG_WRN("Unauthorized: Invalid API Key\n");

    return false;
};
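
For reference, one minimal way to extend this would be a small helper that accepts either header form and is used in place of the Authorization parsing above. This is an illustrative sketch, not necessarily the change that was actually made in the PR:

// Illustrative sketch: pull the key from either an OpenAI-style
// "Authorization: Bearer <key>" header or an Anthropic-style "x-api-key: <key>"
// header, so the same api_keys lookup above can validate both kinds of client.
// Assumes httplib.h is already included, as it is in the server code.
static std::string extract_api_key(const httplib::Request & req) {
    const std::string bearer = "Bearer ";
    const std::string auth   = req.get_header_value("Authorization");
    if (auth.compare(0, bearer.size(), bearer) == 0) {
        return auth.substr(bearer.size());        // OpenAI-style clients
    }
    if (req.has_header("x-api-key")) {
        return req.get_header_value("x-api-key"); // Anthropic-style clients
    }
    return "";
}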

noname22 (Contributor, Author)

I added support for x-api-key headers and verified that it works with Claude Code like this:

bin/llama-server --api-key mykey [...]
ANTHROPIC_API_KEY=mykey ANTHROPIC_BASE_URL=http://localhost:8080 claude

noname22 (Contributor, Author)

The conflicts with master were quite involved. It seems like there was a large refactoring effort on the server. It will take a bit to fix.

pwilkin (Collaborator) commented Nov 25, 2025

@noname22 @ngxson was refactoring the code to split it into parts. I think at this point your best bet would be to take the final code changes from your PR and overlay them onto a clean master; it's really hard to resolve "artificial" merge conflicts that arise from refactoring and moving code between files.

noname22 force-pushed the feature/anthropic-api-support branch from 93868f9 to aa6192d on November 25, 2025, 16:05
noname22 (Contributor, Author)

@pwilkin yep, that's what I ended up doing 👍

ngxson (Collaborator) left a comment

Some parts of the code (like tools and messages handling) still internally convert the schema from the Anthropic format to the OpenAI format, which defeats the whole point of separating them into two functions, as I asked for in the previous review.

And worse, this ends up duplicating a lot of code between anthropic_params_from_json and params_from_json_cmpl, which makes the code much more difficult to maintain.

So I'm now thinking it's better to keep the older version, converting Anthropic to OpenAI, and try to improve it in the future. Just remember to name the function more intuitively, like convert_anthropic_to_oai(...).

noname22 (Contributor, Author)

I believe everything has been addressed.

ngxson (Collaborator) commented Nov 27, 2025

@noname22 I cannot push commits to clean things up; can you open a new PR from a personal account?

noname22 (Contributor, Author) commented Nov 28, 2025

Here's the new PR: #17570

Labels: examples, python (python script changes), server