
Commit ddf9f94

noname22 and ngxson authored

server : add Anthropic Messages API support (#17570)

* server : add Anthropic Messages API support
* remove -@pytest.mark.slow from tool calling/jinja tests
* server : remove unused code and slow/skip on test_anthropic_vision_base64_with_multimodal_model in test_anthropic_api.py
* server : removed redundant n field logic in anthropic_params_from_json
* server : use single error object instead of error_array in streaming response handler for /v1/chat/completions and use unordered_set instead of set in to_json_anthropic_stream()
* server : refactor Anthropic API to use OAI conversion
* make sure basic test always go first
* clean up
* clean up api key check, add test

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

1 parent ff55414 commit ddf9f94

File tree

11 files changed (+1553, −70 lines)


tools/server/README.md

Lines changed: 72 additions & 0 deletions
@@ -7,6 +7,7 @@ Set of LLM REST APIs and a simple web front end to interact with llama.cpp.
 **Features:**
 * LLM inference of F16 and quantized models on GPU and CPU
 * [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes
+* [Anthropic Messages API](https://docs.anthropic.com/en/api/messages) compatible chat completions
 * Reranking endpoint (https://github.com/ggml-org/llama.cpp/pull/9510)
 * Parallel decoding with multi-user support
 * Continuous batching
@@ -1352,6 +1353,77 @@ See [OpenAI Embeddings API documentation](https://platform.openai.com/docs/api-r

### POST `/v1/messages`: Anthropic-compatible Messages API

Given a list of `messages`, returns the assistant's response. Streaming is supported via Server-Sent Events. While we make no strong claims of compatibility with the Anthropic API spec, in our experience it is sufficient to support many apps.

*Options:*

See the [Anthropic Messages API documentation](https://docs.anthropic.com/en/api/messages). Tool use requires the `--jinja` flag.

`model`: Model identifier (required)

`messages`: Array of message objects with `role` and `content` (required)

`max_tokens`: Maximum number of tokens to generate (default: 4096)

`system`: System prompt, as a string or an array of content blocks

`temperature`: Sampling temperature in the range 0-1 (default: 1.0)

`top_p`: Nucleus sampling (default: 1.0)

`top_k`: Top-k sampling

`stop_sequences`: Array of stop sequences

`stream`: Enable streaming (default: false)

`tools`: Array of tool definitions (requires `--jinja`); see the tool-use example below

`tool_choice`: Tool selection mode (`{"type": "auto"}`, `{"type": "any"}`, or `{"type": "tool", "name": "..."}`)

*Examples:*

```shell
curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: your-api-key" \
  -d '{
    "model": "gpt-4",
    "max_tokens": 1024,
    "system": "You are a helpful assistant.",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
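
Tool definitions use the Anthropic shape (`name`, `description`, `input_schema`). A sketch of a tool-use request, assuming a hypothetical `get_weather` tool and a server started with `--jinja`:

```shell
curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: your-api-key" \
  -d '{
    "model": "gpt-4",
    "max_tokens": 1024,
    "tools": [{
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }],
    "tool_choice": {"type": "auto"},
    "messages": [
      {"role": "user", "content": "What is the weather in Paris?"}
    ]
  }'
```

With `"stream": true`, the response arrives as Anthropic-style server-sent events, each an `event:` line followed by a `data:` line. An illustrative excerpt only (payloads abbreviated; the exact sequence depends on the model and request):

```
event: message_start
data: {"type":"message_start","message":{"id":"...","role":"assistant","content":[]}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: message_stop
data: {"type":"message_stop"}
```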

### POST `/v1/messages/count_tokens`: Token Counting

Counts the number of tokens in a request without generating a response.

Accepts the same parameters as `/v1/messages`; the `max_tokens` parameter is not required.

*Example:*

```shell
curl http://localhost:8080/v1/messages/count_tokens \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

*Response:*

```json
{"input_tokens": 10}
```
## More examples

### Interactive mode

tools/server/server-common.cpp

Lines changed: 240 additions & 2 deletions
@@ -725,7 +725,6 @@ std::vector<server_tokens> tokenize_input_prompts(const llama_vocab * vocab, mtm
     return result;
 }
-
 //
 // OAI utils
 //

@@ -1048,6 +1047,222 @@ json oaicompat_chat_params_parse(
     return llama_params;
 }

json convert_anthropic_to_oai(const json & body) {
    json oai_body;

    // Convert system prompt
    json oai_messages = json::array();
    auto system_param = json_value(body, "system", json());
    if (!system_param.is_null()) {
        std::string system_content;

        if (system_param.is_string()) {
            system_content = system_param.get<std::string>();
        } else if (system_param.is_array()) {
            for (const auto & block : system_param) {
                if (json_value(block, "type", std::string()) == "text") {
                    system_content += json_value(block, "text", std::string());
                }
            }
        }

        oai_messages.push_back({
            {"role", "system"},
            {"content", system_content}
        });
    }

    // Convert messages
    if (!body.contains("messages")) {
        throw std::runtime_error("'messages' is required");
    }
    const json & messages = body.at("messages");
    if (messages.is_array()) {
        for (const auto & msg : messages) {
            std::string role = json_value(msg, "role", std::string());

            if (!msg.contains("content")) {
                if (role == "assistant") {
                    continue;
                }
                oai_messages.push_back(msg);
                continue;
            }

            const json & content = msg.at("content");

            if (content.is_string()) {
                oai_messages.push_back(msg);
                continue;
            }

            if (!content.is_array()) {
                oai_messages.push_back(msg);
                continue;
            }

            json tool_calls = json::array();
            json converted_content = json::array();
            json tool_results = json::array();
            bool has_tool_calls = false;

            for (const auto & block : content) {
                std::string type = json_value(block, "type", std::string());

                if (type == "text") {
                    converted_content.push_back(block);
                } else if (type == "image") {
                    json source = json_value(block, "source", json::object());
                    std::string source_type = json_value(source, "type", std::string());

                    if (source_type == "base64") {
                        std::string media_type = json_value(source, "media_type", std::string("image/jpeg"));
                        std::string data = json_value(source, "data", std::string());
                        std::ostringstream ss;
                        ss << "data:" << media_type << ";base64," << data;

                        converted_content.push_back({
                            {"type", "image_url"},
                            {"image_url", {
                                {"url", ss.str()}
                            }}
                        });
                    } else if (source_type == "url") {
                        std::string url = json_value(source, "url", std::string());
                        converted_content.push_back({
                            {"type", "image_url"},
                            {"image_url", {
                                {"url", url}
                            }}
                        });
                    }
                } else if (type == "tool_use") {
                    tool_calls.push_back({
                        {"id", json_value(block, "id", std::string())},
                        {"type", "function"},
                        {"function", {
                            {"name", json_value(block, "name", std::string())},
                            {"arguments", json_value(block, "input", json::object()).dump()}
                        }}
                    });
                    has_tool_calls = true;
                } else if (type == "tool_result") {
                    std::string tool_use_id = json_value(block, "tool_use_id", std::string());

                    auto result_content = json_value(block, "content", json());
                    std::string result_text;
                    if (result_content.is_string()) {
                        result_text = result_content.get<std::string>();
                    } else if (result_content.is_array()) {
                        for (const auto & c : result_content) {
                            if (json_value(c, "type", std::string()) == "text") {
                                result_text += json_value(c, "text", std::string());
                            }
                        }
                    }

                    tool_results.push_back({
                        {"role", "tool"},
                        {"tool_call_id", tool_use_id},
                        {"content", result_text}
                    });
                }
            }

            if (!converted_content.empty() || has_tool_calls) {
                json new_msg = {{"role", role}};
                if (!converted_content.empty()) {
                    new_msg["content"] = converted_content;
                } else if (has_tool_calls) {
                    new_msg["content"] = "";
                }
                if (!tool_calls.empty()) {
                    new_msg["tool_calls"] = tool_calls;
                }
                oai_messages.push_back(new_msg);
            }

            for (const auto & tool_msg : tool_results) {
                oai_messages.push_back(tool_msg);
            }
        }
    }

    oai_body["messages"] = oai_messages;

    // Convert tools
    if (body.contains("tools")) {
        const json & tools = body.at("tools");
        if (tools.is_array()) {
            json oai_tools = json::array();
            for (const auto & tool : tools) {
                oai_tools.push_back({
                    {"type", "function"},
                    {"function", {
                        {"name", json_value(tool, "name", std::string())},
                        {"description", json_value(tool, "description", std::string())},
                        {"parameters", tool.contains("input_schema") ? tool.at("input_schema") : json::object()}
                    }}
                });
            }
            oai_body["tools"] = oai_tools;
        }
    }

    // Convert tool_choice
    if (body.contains("tool_choice")) {
        const json & tc = body.at("tool_choice");
        if (tc.is_object()) {
            std::string type = json_value(tc, "type", std::string());
            if (type == "auto") {
                oai_body["tool_choice"] = "auto";
            } else if (type == "any" || type == "tool") {
                oai_body["tool_choice"] = "required";
            }
        }
    }

    // Convert stop_sequences to stop
    if (body.contains("stop_sequences")) {
        oai_body["stop"] = body.at("stop_sequences");
    }

    // Handle max_tokens (required in Anthropic, but we're permissive)
    if (body.contains("max_tokens")) {
        oai_body["max_tokens"] = body.at("max_tokens");
    } else {
        oai_body["max_tokens"] = 4096;
    }

    // Pass through common params
    for (const auto & key : {"temperature", "top_p", "top_k", "stream"}) {
        if (body.contains(key)) {
            oai_body[key] = body.at(key);
        }
    }

    // Handle Anthropic-specific thinking param
    if (body.contains("thinking")) {
        json thinking = json_value(body, "thinking", json::object());
        std::string thinking_type = json_value(thinking, "type", std::string());
        if (thinking_type == "enabled") {
            int budget_tokens = json_value(thinking, "budget_tokens", 10000);
            oai_body["thinking_budget_tokens"] = budget_tokens;
        }
    }

    // Handle Anthropic-specific metadata param
    if (body.contains("metadata")) {
        json metadata = json_value(body, "metadata", json::object());
        std::string user_id = json_value(metadata, "user_id", std::string());
        if (!user_id.empty()) {
            oai_body["__metadata_user_id"] = user_id;
        }
    }

    return oai_body;
}
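
To make the mapping concrete, here is a worked example of what the conversion above produces (illustrative, not part of the commit; the `get_weather` tool call is hypothetical). An Anthropic-style body such as:

```json
{
  "system": "You are a helpful assistant.",
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": [
      {"type": "tool_use", "id": "toolu_01", "name": "get_weather", "input": {"city": "Paris"}}
    ]},
    {"role": "user", "content": [
      {"type": "tool_result", "tool_use_id": "toolu_01", "content": "18°C, cloudy"}
    ]}
  ]
}
```

comes out in OAI Chat Completions form along these lines:

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "", "tool_calls": [
      {"id": "toolu_01", "type": "function",
       "function": {"name": "get_weather", "arguments": "{\"city\":\"Paris\"}"}}
    ]},
    {"role": "tool", "tool_call_id": "toolu_01", "content": "18°C, cloudy"}
  ],
  "max_tokens": 4096
}
```

The `tool_use` block becomes an OAI `tool_calls` entry with stringified `arguments`, the `tool_result` block becomes a separate `role: "tool"` message, and a missing `max_tokens` falls back to 4096.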

json format_embeddings_response_oaicompat(const json & request, const json & embeddings, bool use_base64) {
    json data = json::array();
    int32_t n_tokens = 0;

@@ -1211,7 +1426,7 @@ std::string tokens_to_output_formatted_string(const llama_context * ctx, const l

 // format server-sent event (SSE), return the formatted string to send
 // note: if data is a json array, it will be sent as multiple events, one per item
-std::string format_sse(const json & data) {
+std::string format_oai_sse(const json & data) {
     std::ostringstream ss;
     auto send_single = [&ss](const json & data) {
         ss << "data: " <<

@@ -1230,6 +1445,29 @@ std::string format_sse(const json & data) {
     return ss.str();
 }

std::string format_anthropic_sse(const json & data) {
    std::ostringstream ss;

    auto send_event = [&ss](const json & event_obj) {
        if (event_obj.contains("event") && event_obj.contains("data")) {
            ss << "event: " << event_obj.at("event").get<std::string>() << "\n";
            ss << "data: " << safe_json_to_str(event_obj.at("data")) << "\n\n";
        } else {
            ss << "data: " << safe_json_to_str(event_obj) << "\n\n";
        }
    };

    if (data.is_array()) {
        for (const auto & event : data) {
            send_event(event);
        }
    } else {
        send_event(data);
    }

    return ss.str();
}
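
Given an event object of the form `{"event": ..., "data": ...}`, this helper emits the two-line SSE framing; a plain object falls back to a bare `data:` line. Illustrative output for one event (payload abbreviated):

```
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hi"}}
```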

bool is_valid_utf8(const std::string & str) {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(str.data());
    const unsigned char* end = bytes + str.length();
tools/server/server-common.h

Lines changed: 7 additions & 1 deletion
@@ -294,6 +294,9 @@ json oaicompat_chat_params_parse(
     const oaicompat_parser_options & opt,
     std::vector<raw_buffer> & out_files);

+// convert Anthropic Messages API format to OpenAI Chat Completions API format
+json convert_anthropic_to_oai(const json & body);
+
 // TODO: move it to server-task.cpp
 json format_embeddings_response_oaicompat(const json & request, const json & embeddings, bool use_base64 = false);

@@ -320,7 +323,10 @@ std::string tokens_to_output_formatted_string(const llama_context * ctx, const l

 // format server-sent event (SSE), return the formatted string to send
 // note: if data is a json array, it will be sent as multiple events, one per item
-std::string format_sse(const json & data);
+std::string format_oai_sse(const json & data);
+
+// format Anthropic-style SSE with event types
+std::string format_anthropic_sse(const json & data);

 bool is_valid_utf8(const std::string & str);