server: implement GLM-style MTP #15225

Draft · wants to merge 5 commits into master

Conversation

@F1LM1 F1LM1 commented Aug 11, 2025

This is very much a draft/proof of concept I'm playing with, just one idea for an MTP implementation. I'm planning to test on GLM-4.5 because it's the only model out there for which we've preserved NextN tensors.

From what I can tell:

  • the three models with MTP implemented in vLLM right now are all "DeepseekV3-style,"
  • they only have one MTP head, which predicts the token at position n+2,
  • the MTP layers take as input the output embedding from the last conventional layer and their own input embedding (rough sketch of this dataflow below).
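
For reference, here is a rough graph-build outline of that dataflow, assuming GLM-4.5's NextN head follows the DeepseekV3 recipe. The ggml calls exist, but the tensor names (model.nextn_eh_proj, etc.), the concat order, and the surrounding builder context (ctx0, hparams) are assumptions, so treat this as a sketch rather than compiling code:

```cpp
// h_last   : output of the last conventional transformer layer   [n_embd, n_tokens]
// inp_embd : input embedding of the token the head conditions on [n_embd, n_tokens]
// ctx0, model.*, hparams stand in for whatever the real builder uses.

ggml_tensor * h = ggml_rms_norm(ctx0, h_last,   hparams.f_norm_rms_eps); // "hnorm"
ggml_tensor * e = ggml_rms_norm(ctx0, inp_embd, hparams.f_norm_rms_eps); // "enorm"

// concatenate the two normalized vectors along the embedding dim -> [2*n_embd, n_tokens]
ggml_tensor * eh = ggml_concat(ctx0, e, h, 0);

// project back down to n_embd ("eh_proj"); one regular transformer block plus the
// shared output norm + lm_head then run on the result to produce the n+2 logits
ggml_tensor * cur = ggml_mul_mat(ctx0, model.nextn_eh_proj, eh);
```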

So implementation-wise it seems like:

  • we should try to reuse the existing speculative decode functionality (including nice stuff like main-model KV cache management, the various samplers, etc.),
  • but a lot of the full draft-model functionality is redundant or even harmful here, e.g. context/cache management for the draft model and vocab matching,
  • it probably makes sense to write a new function like mtp_speculative_gen_draft in speculative.cpp that is vastly simplified, and branch into it in server.cpp when a slot has MTP (versus common_speculative_gen_draft); a rough sketch follows this list.
  • AFAICT it looks like the server.cpp loop currently alternates between conventional forward pass and draft, which in the MTP case will probably sabotage performance gains (since our max throughput is only 1.5 tok/pass assuming zero rejections, instead of 2 tok/pass). Let me know if this isn't the case! But if it is, we should probably avoid doing non-speculative decodes after the first response token.
  • It doesn't make sense to manage a distinct ctx_dft in this case either. It's a bit hacky, but I was thinking we could just have ctx_dft = ctx and then have both normal and MTP passes write over the shared ctx logits. I think this minimizes the required code changes elsewhere.
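
To make that concrete, here is a minimal sketch of what the MTP draft path could look like, assuming some accessor (a hypothetical llama_get_logits_mtp_ith(), not in llama.h) that exposes the MTP head's logits from the main context. It is an outline of the idea, not working code:

```cpp
#include "common.h" // llama_tokens (std::vector<llama_token>) and common helpers
#include "llama.h"

// Hypothetical MTP counterpart of common_speculative_gen_draft(): no ctx_dft,
// no draft-model KV cache, no vocab matching - just read the extra logits the
// main context already produced for the last accepted token.
llama_tokens mtp_speculative_gen_draft(llama_context * ctx, int32_t i_last) {
    const llama_vocab * vocab   = llama_model_get_vocab(llama_get_model(ctx));
    const int32_t       n_vocab = llama_vocab_n_tokens(vocab);

    // hypothetical accessor for the MTP head's logits - not part of llama.h (yet)
    const float * logits = llama_get_logits_mtp_ith(ctx, i_last);

    // a greedy pick is enough for a draft; the conventional pass verifies it anyway
    llama_token best = 0;
    for (int32_t v = 1; v < n_vocab; ++v) {
        if (logits[v] > logits[best]) {
            best = v;
        }
    }

    return llama_tokens{ best };
}
```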

This is my first time (1) working with ML stuff outside of Python and (2) attempting to contribute, so patience is appreciated :)

@ggerganov added the "hot" label on Aug 12, 2025
@ggerganov (Member)

AFAICT it looks like the server.cpp loop currently alternates between conventional forward pass and draft, which in the MTP case will probably sabotage performance gains (since our max throughput is only 1.5 tok/pass assuming zero rejections, instead of 2 tok/pass). Let me know if this isn't the case! But if it is, we should probably avoid doing non-speculative decodes after the first response token.

This is correct - we always alternate between conventional and speculative passes. It's definitely not optimal, but it improves flexibility for regular sampling: it allows changing the speculative parameters and even disabling speculation per request, while the logic stays quite simple.

It should be possible to improve this by keeping track of which slots are speculating on each iteration and skipping adding tokens to the conventional batch for them (rough sketch below). It might be a good idea to implement this separately, to avoid huge changes to the logic in a single PR.
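
A rough sketch of that idea against the slot loop in server.cpp (the is_speculating flag is hypothetical; the rest mirrors existing structures, abbreviated):

```cpp
// when building the conventional batch in the server's update loop
for (auto & slot : slots) {
    if (slot.state != SLOT_STATE_GENERATING) {
        continue;
    }

    // hypothetical new per-slot flag: set while the slot's next tokens are being
    // produced by the speculative/MTP path this iteration
    if (slot.is_speculating) {
        continue; // skip adding this slot's token to the conventional batch
    }

    common_batch_add(batch, slot.sampled, slot.n_past, { slot.id }, true);
    slot.n_past += 1;
}
```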

@ggerganov (Member)

Generally we should try to minimize the changes to llama.h, since changing/extending the public API requires a lot of effort.

On first look, I think the path that involves minimal changes is:

  • Add an int n_mtp flag to llama_context_params (default 1 = MTP disabled, 2 = predict logits for one additional token, 3 = predict logits for two additional tokens, etc.)
  • Use this flag during graph build to determine whether the MTP heads should be appended to the graph
  • Keep the conventional logits in the t_logits tensor in llm_graph_result
  • Add a new tensor t_logits_mtp (or whatever name is more appropriate) to llm_graph_result and use it to store the MTP results
  • In llama_decode(), extract the t_logits_mtp data when available, following the same logic as for t_logits (a header-level sketch of these additions follows this list)
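
A header-level sketch of those additions (everything marked "proposed" is new; the surrounding declarations are abbreviated and not the actual headers):

```cpp
// llama.h - proposed field in the context params
struct llama_context_params {
    // ... existing fields ...
    int32_t n_mtp; // 1 = MTP disabled (default), 2 = logits for 1 extra token, 3 = for 2, ...
};

// llama-graph.h - proposed tensor next to the existing t_logits
class llm_graph_result {
public:
    // ... existing members ...
    ggml_tensor * t_logits     = nullptr; // conventional logits (already there)
    ggml_tensor * t_logits_mtp = nullptr; // proposed: logits produced by the MTP head(s)
};
```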

Extracting the MTP logits during llama_decode() can be done in 2 ways:

  • Create a separate buffer in the llama_context to store them, and add a new llama_get_logits_mtp_ith() API that works with that buffer in a similar way to the existing llama_get_logits_ith()
  • Reuse the existing logits buffer by expanding it from [n_outputs][n_vocab] to [n_outputs][n_mtp*n_vocab]. This would avoid the need to add llama_get_logits_mtp_ith(), and we can generalize the existing llama_get_logits_ith() by taking the value of n_mtp into account (rough indexing sketch below).
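
For the second option, the indexing change is small. Roughly (a sketch of the proposed layout, not existing code):

```cpp
#include <cstdint>

// Logits buffer laid out as [n_outputs][n_mtp * n_vocab]: for output row i,
// k == 0 is the conventional distribution and k == 1 .. n_mtp-1 are the
// distributions predicted for the additional tokens.
const float * logits_ith(const float * buf, int64_t n_vocab, int64_t n_mtp,
                         int64_t i, int64_t k) {
    return buf + (i * n_mtp + k) * n_vocab;
}
```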

Currently, I am not sure which way is better. The first requires a new API call, while the second might break some existing assumptions (not sure if that's the case yet).

In any case, you can defer this decision until you get the implementation working with a reasonable speedup. After that, we can discuss how to best refactor the implementation.

@slaren (Member) commented Aug 13, 2025

Currently, I am not sure which way is better. The first requires a new API call, while the second might break some existing assumptions (not sure if that's the case yet).

I don't see an issue with adding a new API for this, and it would be easier to use.
