Dear mlx_lm developers,
Thank you for your wonderful work on this project. I especially want to thank @angeloskath for implementing "Better caching in the server" (I call it the "prompt_checkpoint" feature) in the 0.31.0 update. This is a fantastic improvement.
Since Qwen-Next models were released, I have been struggling with how to handle KV caches that cannot be trimmed. After reading the prompt_checkpoint implementation, I understood that it's possible to stop KV cache updates at the prefill stage. This is exactly what I needed.
So, how about adding a standalone function to mlx_lm that performs prefill only, without text generation? In other words, a function that just runs the prompt through the model to build the KV cache and returns that cache, generating no text.
I think this would be useful in several cases. Beyond handling models with non-trimmable KV caches, it would help applications that need to process large system prompts or document contexts ahead of time: they could pre-build caches and reuse them across multiple sessions.
In my project, I have currently forked parts of the `generate_step` function into a simpler function that does exactly the above. While this seems to work well, I believe the functionality would be valuable as an official API. (I have no experience manipulating LLM tensors directly, so I worry whether I'm doing it correctly. lol) (Just for your information, here is my code: https://github.com/gitkaz/mlx_gguf_server/blob/experimental_feature/prompt_checkpoint_based_stream_generate/worker/task/completions_stream/fork_from_mlx_lm.py)
Thank you,