mtmd: server: Support multimodal data prompt in /completions and /embeddings endpoint of server #15108
Conversation
The proposal looks ok but there will be some edge cases:
I think proper test cases are required for this PR, similar to the existing server tests.
Let me look at those. The token case is interesting; interested in your thoughts there. The current server uses null tokens, but already knows how many to insert, which seems hard on the client (it would have to know the multimodal embedding size before sending raw tokens to the completion endpoint). A magic token could work similarly. The multiple text prompt part is also interesting from a usability perspective. I'll think about these and come back. The multi-prompt case should be straightforward to add tests for; not sure how it ought to work yet.
I have an idea that might be usable, namely that prompt can now contain an array of JSON objects, mixing text prompts and multimodal data.
Rough draft for the idea here (it compiles and passes existing tests): https://pastebin.com/8zek7ium Not complete or properly indented, but the idea is to use server_tokens in more places, so that the input tokenizer can branch and use MTMD tokenization where it makes sense to do so. As a side effect, this probably gains multimodal support in embeddings. Infill needs more work, and rerank would work if I can get server_tokens::push_back(server_tokens) to work properly, I think. There are probably better ways to do some of this than I did; feedback welcome.
Improved version of the rough draft that actually works (ignore indentation): https://pastebin.com/R6NdKQPP This works locally for my use case, and I've started adding tests. There are a few TODOs to make doc ranking and embeddings support multimodal use cases, and I think the OAI case can also be streamlined. The general approach is as described previously: use server_tokens in more places, break out MTMD prompt parsing into a function, and change various input tokenization calls to handle server_tokens instead of llama_tokens. The request format for multiple prompts would be like this:
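Roughly like this (a sketch; the field names match what the draft uses, and the contents here are hypothetical):

```python
# Sketch of the proposed multi-prompt request body (hypothetical contents).
# Each entry of "prompt" is tokenized as its own subprompt; the JSON-object
# form pairs a text prompt with base64-encoded media for MTMD tokenization.
request_body = {
    "prompt": [
        "a plain text subprompt",
        [15339, 1917],  # a raw token list (illustrative ids)
        {
            "prompt_string": "describe this image",
            "multimodal_data": ["<base64-encoded image bytes>"],
        },
    ],
}
```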
The JSON entry only supports what the existing prompt handling supports, plus the multimodal field.
Force-pushed from 744d758 to 62f3bae
Added tests, including a vision test. Should be good for a review pass. There is some potential future work, including supporting multimodal prompts in document rerank and infill. Embeddings may already work (existing tests pass), but I didn't try it, and I'm not sure whether it's expected to provide a stable embedding. Further refactoring is possible to streamline the OAI chat path into the rest, but probably as a follow-up. @ngxson let me know what you think.
Cleaned up the code quite a bit, and fixed the TODO around server_tokens.push_back(server_tokens). Now the tokenize_inputs handling reads a lot cleaner, which is nice.
Force-pushed from 5359dda to 234531f
I have tested this PR and it worked perfectly ✅ Here is a simple test with an image prompt (screenshot and prompt omitted). The details of my UI integration are here: oobabooga/text-generation-webui#7027
Looking good, can be merged after my comments are all resolved.
Can we merge this file into test_vision_api.py? We don't have many tests atm, so we should reduce the number of files.
Done. Added some tests for other functionality discussed below, and updated test_completions.py (that server doesn't actually have MTMD, so I dropped the multimodal data).
tools/server/utils.hpp
Outdated
if (json_prompt.is_array() && !json_is_array_with_tokens(json_prompt)) {
    result.reserve(json_prompt.size());
    for (const auto & p : json_prompt) {
        result.push_back(tokenize_input_subprompt(vocab,mctx, p,add_special, parse_special));
Suggested change:
-        result.push_back(tokenize_input_subprompt(vocab,mctx, p,add_special, parse_special));
+        result.push_back(tokenize_input_subprompt(vocab, mctx, p, add_special, parse_special));
Done.
tools/server/utils.hpp
Outdated
    // array of tokens
    llama_tokens tmp = json_prompt.get<llama_tokens>();
    return server_tokens(tmp, false);
} else if (json_prompt.find("prompt") != json_prompt.end()) {
Suggested change:
-    } else if (json_prompt.find("prompt") != json_prompt.end()) {
+    } else if (json_prompt.contains("prompt")) {
Done.
tools/server/utils.hpp
Outdated
    return server_tokens(tmp, false);
} else if (json_prompt.find("prompt") != json_prompt.end()) {
    // JSON object with prompt key.
    if (has_mtmd && json_prompt.find("multimodal_data") != json_prompt.end()) {
Suggested change:
-    if (has_mtmd && json_prompt.find("multimodal_data") != json_prompt.end()) {
+    if (has_mtmd && json_prompt.contains("multimodal_data")) {
Or even better, like this:
if (json_prompt.contains("multimodal_data")) {
    if (has_mtmd) { ... do the thing ... }
    else throw std::runtime_error("multimodal is not supported by this server");
}
Done. However, this leaves a trap for clients who call us, because they don't know whether we will give them an error or not. Added a capability to /models and /v1/models so the client can tell if we support multimodal; this should be sufficient to support an error rather than silently dropping the data.
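For illustration, a client-side check might look like this (a minimal sketch; exactly where the capability appears in the /v1/models response is an assumption here):

```python
import requests

# Query the model list and look for a multimodal capability flag
# (response layout assumed for illustration).
models = requests.get("http://localhost:8080/v1/models").json()
first = models["data"][0]
if "multimodal" not in first.get("capabilities", []):
    raise RuntimeError("server/model does not accept multimodal_data")
```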
tools/server/server.cpp
Outdated
@@ -4750,22 +4709,22 @@ int main(int argc, char ** argv) {
         return;
     }

-    llama_tokens tokenized_query = tokenize_input_prompts(ctx_server.vocab, query, /* add_special */ false, true)[0];
+    server_tokens tokenized_query = std::move(tokenize_input_prompts(ctx_server.vocab, ctx_server.mctx, query, /* add_special */ false, true)[0]);
I think std::move is unnecessary here; the compiler should be good enough to optimize this.
Done. It was necessary, at least with my recent gcc on Linux (this was triggering an attempted copy), but I refactored this so we only std::move once, below this point.
tools/server/utils.hpp
Outdated
#define JSON_STRING_PROMPT_KEY "prompt_string"
#define JSON_MTMD_DATA_KEY "multimodal_data"
it's best to define these in local scope, not globally, using const char *
Done
tools/server/utils.hpp
Outdated
// JSON object with prompt and multimodal key.
std::vector<raw_buffer> files;
for (const auto& entry : json_prompt.at(JSON_MTMD_DATA_KEY)) {
Suggested change:
-    for (const auto& entry : json_prompt.at(JSON_MTMD_DATA_KEY)) {
+    for (const auto & entry : json_prompt.at(JSON_MTMD_DATA_KEY)) {
Done.
tools/server/utils.hpp
Outdated
if (tokenized != 0) {
    throw std::runtime_error("Failed to tokenize prompt");
}
auto result = server_tokens(chunks,true);
Suggested change:
-    auto result = server_tokens(chunks,true);
+    auto result = server_tokens(chunks, true);
Done.
tools/server/tests/utils.py
Outdated
 server.n_ctx = 1024
-server.n_batch = 32
+server.n_batch = 512
 server.n_slots = 2
 server.n_predict = 4
 server.seed = 42
 server.server_embeddings = True
 return server
define this in the local test, before server.start(...); see other test files for an example
Done.
A few notable updates:
@ngxson ready for re-review, I won't resolve your comments in case you want to discuss any of the changes.
Force-pushed from 398d0fe to 58b9c3e
@ngxson I believe the second round of comments is now addressed! @oobabooga I saw you were already testing this, thanks. Please note the API has changed slightly: the client should check whether multimodal is supported via the /models or /v1/models endpoint.
Thanks for the heads up @65a, I have updated the request! oobabooga/text-generation-webui@e6447cd
Can't reproduce the sanitizer test timeout failure locally; hopefully pushing an updated commit message can retrigger CI.
… format
- Use server_tokens in more places in server and util.cpp
- Convert most functions that used llama_tokens to server_tokens
- Modify input tokenizer to handle JSON objects as subprompts
- Break out MTMD prompt parsing into utility function
- Support JSON objects with multimodal_data arrays for MTMD prompts along with other existing types
- Add capability to model endpoint to indicate if client can send multimodal data
- Add tests.
Editing first comment to match current state:
This pull adds support for multimodal data in the /completions (and, in a similar fashion, /embeddings) API endpoint. Instead of a string, list of tokens, or a mixed string/token list as currently supported by that endpoint, this pull adds support for a JSON object containing both a prompt_string and a multimodal_data field. The client should check the result of /models or /v1/models for the multimodal capability; sending multimodal data to a non-multimodal model will result in a request error. A singular request example is like this:
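A minimal sketch with Python's requests (the server URL, the n_predict field, and the media placement via libmtmd's default <__media__> marker are assumptions here):

```python
import base64
import requests

# Base64-encode the media; the server hands the decoded bytes to libmtmd.
with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:8080/completions",  # assumed local llama-server
    json={
        "prompt": {
            "prompt_string": "What is in this image? <__media__>",
            "multimodal_data": [image_b64],
        },
        "n_predict": 64,
    },
)
print(resp.json()["content"])
```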
A multiple (and mixed-type) request would look like:
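Sketched the same way, under the same assumptions; the prompt array mixes a plain string with a multimodal object, and one result per subprompt is my assumption for the response shape:

```python
resp = requests.post(
    "http://localhost:8080/completions",
    json={
        "prompt": [
            "Describe the sky in one word.",  # plain text subprompt
            {
                "prompt_string": "What is in this image? <__media__>",
                "multimodal_data": [image_b64],  # from the example above
            },
        ],
        "n_predict": 32,
    },
)
for result in resp.json():  # one result per subprompt (assumed)
    print(result["content"])
```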
All existing tests pass, and new tests are added to cover both the prompt splitting and visual inference. If multimodal data is provided and the model does not support MTMD, only the text part will be used. The multimodal part should be base64-encoded media data supported by libmtmd.
With this approach, other server endpoints can become multimodal relatively easily in the future (rerank and infill are close, but would need additional work and testing). Feedback welcome!
Implement a basic way to include multimodal data in the completions endpoint. For now, this just supports directly included data, in base64-encoded format, provided as an array of strings under the JSON key multimodal_data. Documentation updated to match. Local testing shows no regression for the without-media case, and successful image processing with media provided from a custom client. Similar to #14016, but avoids importing the URL fetching logic at all; it could be added later when factored out of the OpenAI-emulation code, but this is simpler for now and avoids the need for URL parsing and remote fetch capabilities. Original referenced issue was #13872.
@ngxson ptal, hopefully this works for you.