forked from ggml-org/llama.cpp
Feature/anthropic api support #1
Open
noname22 wants to merge 37 commits into master from feature/anthropic-api-support
Conversation
Force-pushed from 2f840f5 to b7c322b
* Detect GigaChat3-10B-A1.8B as deepseek lite
  Hardcodes checking the number of layers to detect the lite version of deepseek.
* Add comment identifying deepseek lite variants
  deepseek lite variants include DeepSeek-V2-Lite, GigaChat3-10B-A1.8B
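As a hedged sketch of the detection described above (the struct and the exact layer count are illustrative assumptions, not taken from the patch; real llama.cpp uses `llama_hparams`), a layer-count check might look like:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical hyperparameter struct, for illustration only.
struct hparams_t {
    uint32_t n_layer;
};

// Sketch: classify a model as a DeepSeek "lite" variant purely by its
// layer count. 27 is an assumed value for DeepSeek-V2-Lite-style models,
// which per the commit message also covers GigaChat3-10B-A1.8B.
static bool is_deepseek_lite(const hparams_t & hp) {
    return hp.n_layer == 27;
}
```

The trade-off noted in the commit message is that a hardcoded layer count is brittle: any future lite variant with a different depth would need the check extended.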
* mmf for rdna4
* align the padding for rdna4
* forbid mul_mat_f for rdna4
* fix as comment
* remove device kernels
* add constexpr for early return
* update based on review comment
* change based on the review comment
* pass compile error
* keep code consistency

Co-authored-by: zhang hui <you@example.com>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
…gml-org#17439) 26.04 provides these Signed-off-by: Eric Curtin <eric.curtin@docker.com>
* support non-contiguous i32 to i32 copy
* add tests
* rename cpy_flt to cpy_scalar and reindent params
ggml-org#17452)
* webui: added a dedicated 'Display' settings section that groups visualization options
* webui: added a Display setting to toggle automatic chat scrolling
* chore: update webui build output
…check (ggml-org#17212)
* hexagon: add buffer support checks for hexagon sessions
* refactor: simplify buffer support checks in hexagon operations
* hexagon: update buffer support checks to use tensor structure
* refactor: streamline buffer initialization for DSP queue in hexagon operations
* refactor: simplify buffer initialization in DSP queue for hexagon operations
* refactor: optimize hex_supported_buffer function using a fold expression
* wip
* refactor: simplify dspqueue_buffers_init function and its usage in hexagon operations
* fix: improve NaN handling in hvx_vec_fast_sigmoid_fp32_guard
* refactor: optimize hvx_vec_inverse_fp32_guard for better NaN handling
* refactor: update hvx_vec_fast_sigmoid_fp32_guard to use adjusted exponent limits
* refactor: modify hvx_vec_fast_sigmoid_fp32_guard to accept parameters for improved flexibility
* refactor: update hvx_vec_exp_fp32_guard to accept max_exp and inf parameters to save some instructions
* refactor: move hvx_vec_inverse_fp32_guard implementation to hvx-inverse.c for better perf
* ggml-hexagon: fix build error with GCC
  Add a stdexcept include to fix GCC build errors.
* ggml-hexagon: check VTCM acquire failures
* ggml-hexagon: disable destination bypass on older than v73
  v68 errors out if bypass is enabled when the VTCM is the destination. At least on v68 this made things actually work; not a proper fix though, so to look at later.
* ggml-hexagon: add initial v68/v69 support
  v68 is the Hexagon revision notably used on the Snapdragon 8cx Gen 3 and the QCM6490. Also add support for v69. 8MB isn't a supported page size, so relax the requested page-size constraint for HAP_compute_res_attr_set_vtcm_param_v2 to optimal.

Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
**Description of the problem**
`cann_graph_update_required` is redundantly defined and initialized to `false` inside two mutually exclusive macro branches.

**Proposed solution**
Define it once, right before the macro, so that a single definition serves both branches.
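A minimal sketch of the hoisting pattern described above (the function and flag names are illustrative, not the actual CANN backend symbols, and a runtime `if` stands in for the mutually exclusive preprocessor branches):

```cpp
#include <cassert>

// Illustrative stand-in for the backend's graph-execution entry point.
// In the real code, a macro selects one of two mutually exclusive branches;
// hoisting the flag above the branch point means it is defined exactly once.
static bool run_graph(bool acl_graph_enabled, bool graph_changed) {
    // Single definition, visible to both branches below.
    bool cann_graph_update_required = false;

    if (acl_graph_enabled) {
        // Graph-capture branch: rebuild the captured graph only if it changed.
        cann_graph_update_required = graph_changed;
    } else {
        // Eager branch: no captured graph, so no update is ever required.
        cann_graph_update_required = false;
    }
    return cann_graph_update_required;
}
```

The same shape applies with `#ifdef`/`#else`: declaring the variable immediately before the `#ifdef` lets both preprocessor branches assign it without each redeclaring it.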
)
* Converted RND1 model to GGUF weights
* RND1 llama.cpp support v1
* RND1 llama.cpp support v2: non-causal bug
* RND1 llama.cpp support v3: documentation
* RND1 llama.cpp support v4: clean code
* linting issues
* RND1 pr fixes v1
* RND1 pr fixes v2
* Diffusion documentation edits

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ggml: add RISC-V cpu-feats
* fix comment[1]

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
…ml-org#16739)
* Enabled q4_K_8x8_q8_K path on ARM
* wip: I8mm qs multiplication, pending bias
* cpu : arm : REPACK gemm q4_K8x8 implementation
* Guard gemm with proper features, improved superblock scale and min calc
* cpu: arm: Implemented REPACK gemv for Q4_K
* Removed completed TODO
* Fixed missing guards when selecting optimal repack type for Q4_K
* Fixed macro guard for gemv
* Fixed wrong comment in GEMV
* Fixed warning for unused variable
* vdotq_s32 -> ggml_vdotq_s32
* Clang-format issues
* Apply suggestions from code review
* Removed unnecessary GGML_UNUSED
* Fixed guards in q4_k gemm and gemv (repack)

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
This commit removes the "-dirty" suffix from the GGML version string.
The motivation for this change is to ensure that the version string
works with different ways of checking out ggml and using it in projects.
By removing the dirty flag from the version string, we avoid potential
artifacts like shared libraries getting a -dirty suffix in their names.
Instead, if the project is built from a dirty git state, the dirty flag
will be appended to the commit hash in the GGML_BUILD_COMMIT variable.
This will enable users to still identify that the build was made from
a modified/dirty state even though the version might match a "real"
version.
For example, the commit can be produced as follows:
```c++
printf("commit: %s\n", ggml_commit());
```
Which would print the following for a dirty build:
```console
commit: 781baf2a-dirty
```
Refs: ggml-org/ggml#1363 (comment)
This commit adds the --kv-unified flag to the usage example in the README.md file for the batched example. The motivation for this is that without this flag the example will fail with the following error:

```console
Hello my name is

split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 4
main: llama_decode() failed
```
…#17362)
* add server-task, server-common
* add server-queue
* rm redundant includes
* move enum stop_type to server-task
* server : headers cleanup

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* first commit: naive test to enable mmq for RDNA4
* adding appropriate WMMA instructions
* git rebase on top of master: fixing the correctness of the mat mul operations, updating layout mappings for RDNA4
* clean up merge conflicts
* add comments and code clean up
* PR clean up, addressed comments
* enable MMQ fallback on RDNA4
* addressed comments: add guards in load generic, separate wmma branch for use_mmq function
* Revert build-xcframework.sh
* Formatting: remove trailing whitespace
* revert CMake files
* clean up after rebase: remove duplicated change, revert cmake files
* clean up after rebase: revert changes from build-xcframework.sh
* clean up: remove extra space line in mma.cuh
* Revert "clean up: remove extra space line in mma.cuh" (reverts commit b39ed57)
This commit adds a check to skip the output reordering logic when n_outputs == 1. With a single output token, the data is trivially sorted and the reordering code is currently doing unnecessary work (resetting and rebuilding output_ids to the same values). The motivation for this change is improved code clarity and avoiding confusion when debugging. While the performance impact is probably negligible, this unnecessary work happens on every decode call in llama-server when processing batches with single-token outputs.
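The guard described above can be sketched as follows (the function name, the use of a plain id vector, and the `std::sort` stand-in for the real rebuild logic are all illustrative assumptions, not the actual llama.cpp internals):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch: reorder output ids into ascending order after a decode call.
// With a single output (or none) the data is trivially in order, so we
// return early and skip the needless reset-and-rebuild work.
// Returns true if a reorder pass actually ran.
static bool maybe_reorder_outputs(std::vector<int> & output_ids) {
    const size_t n_outputs = output_ids.size();
    if (n_outputs <= 1) {
        return false; // trivially sorted; avoid per-decode busywork
    }
    std::sort(output_ids.begin(), output_ids.end());
    return true;
}
```

This mirrors the commit's rationale: the early return changes no observable result, it only makes the single-output fast path explicit and easier to reason about when debugging.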
…l-org#17477)
* webui: add rehype plugin to restore HTML in Markdown table cells
  The remark/rehype pipeline neutralizes inline HTML as literal text (remarkLiteralHtml) so that XML/HTML snippets in LLM responses display as-is instead of being rendered. This causes <br> and <ul> markup in table cells to show as plain text. This plugin traverses the HAST post-conversion, parses whitelisted HTML patterns (<br>, <ul><li>) from text nodes, and replaces them with actual HAST element nodes. For lists, adjacent siblings must be combined first as the AST fragmentation breaks pattern matching. Strict validation rejects malformed markup, keeping it as raw text.
* chore: update webui build output
Co-authored-by: tianhao <tianhao42@huawei.com>
* Fix convert_hf_to_gguf.py script on s390x
  Assume converted model data is originally little-endian. On s390x, byteswap data after reading it to put values in the correct representation for any transformation needed, like calculating weight tensors. Then byteswap data to little-endian before passing it to GGUFWriter; GGUFWriter will byteswap data back to big-endian if big-endian output is requested. byteswap(inplace=True) calls don't work with lazy tensor and array wrappers, so byteswap with a copy of the data to work around this behaviour.
* Make GGUFWriter accept tensors in native endianness instead of little-endian
  With this change, if no byteswapping is actually needed, two excessive byteswaps can be omitted on s390x.
* Fix byteswapping in convert_hf_to_gguf.py for remote models
* ggml : add ggml_top_k
* cont : add ggml_argsort_top_k
* metal : add top_k support
* ggml : cleanup
* tests : add virtual err() function for test_case
* ggml : add comments
Force-pushed from 93868f9 to aa6192d
…se64_with_multimodal_model in test_anthropic_api.py
…response handler for /v1/chat/completions and use unordered_set instead of set in to_json_anthropic_stream()
Make sure to read the contributing guidelines before submitting a PR