Conversation
**Codex Review:** The new prefill-step-size override is wired through most generation paths, but it is still ignored for the mlx-vlm image prompt path.

**Will's response:** This is true. The old prefill step size default was also not applied in the mlx-vlm image prompt path. It is out of scope to add for now. mlx-vlm does have a …
```python
        ValueError: If the model configuration is invalid or unsupported
    """
    set_seed(seed)
    prefill_step_size = validate_prefill_step_size(prefill_step_size)
```
nit: could call this `resolve_prefill_step_size`, since it's not a pure validation function (it resolves the default if `None`).
Personal preference, totally non-blocking.
**Will's response:** I had this first, but Codex was unhappy b/c "it is also resolving, not just validating." It clearly does both, but the `resolve_and_validate_...` name is a mouthful.
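For context, a minimal sketch of what such a resolve-then-validate helper might look like. The function name matches the diff above, but the default value and the validation rule are assumptions, not taken from the PR:

```python
from typing import Optional

# Hypothetical default; the real value lives in the project's config.
DEFAULT_PREFILL_STEP_SIZE = 512


def validate_prefill_step_size(prefill_step_size: Optional[int]) -> int:
    """Resolve the default when None, then validate the value.

    Sketch of the helper discussed above; the default and the
    lower bound are assumptions for illustration.
    """
    if prefill_step_size is None:
        prefill_step_size = DEFAULT_PREFILL_STEP_SIZE
    if prefill_step_size < 1:
        raise ValueError(
            f"prefill_step_size must be >= 1, got {prefill_step_size}"
        )
    return prefill_step_size
```

This shape is why the name is contested: the `None` branch is resolution, the bounds check is validation, and the function does both.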
```python
        kv_bits (Optional[int]): Number of bits for KV cache quantization.
        kv_group_size (Optional[int]): Group size for KV cache quantization.
        quantized_kv_start (Optional[int]): Step to begin KV cache quantization when enabled.
        prefill_step_size (Optional[int]): Number of tokens to process per prefill chunk.
```
For your consideration: from what I can tell, `prefill_step_size` doesn't really have to be a load parameter; it could instead be a `create_generator`/inference-time parameter.
I think this would more directly model the capabilities of the underlying API at this time and enable more flexibility without needing to reload the model.
I acknowledge llama.cpp treats batch size as a load parameter, so there would be a conceptual divergence there.
Also non-blocking IMO.
**Will's response:** Yes, I made this choice b/c we already treat prefill step size as a Load-time parameter for llama.cpp. I opted to keep consistency, but agree that for this engine we could treat it as Prediction-time.
Going to defer for now as we don't have a use-case that would benefit from making it Prediction-time.
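To make the trade-off above concrete, here is a hedged sketch of the two placements being discussed. Both function signatures (`load_model`, `create_generator`) and the fallback default are illustrative assumptions, not the project's real API:

```python
from typing import Optional

# Load-time placement (the choice made in this PR, mirroring how
# llama.cpp treats batch size): the step size is fixed when the
# model is loaded.
def load_model(model_path: str,
               prefill_step_size: Optional[int] = None) -> dict:
    # Hypothetical: 512 stands in for the engine's real default.
    return {"path": model_path,
            "prefill_step_size": prefill_step_size or 512}


# Prediction-time alternative: the caller could vary the prefill
# chunk size per generation without reloading the model.
def create_generator(model: dict, prompt: str,
                     prefill_step_size: Optional[int] = None) -> int:
    # Fall back to the load-time value when no per-call override
    # is given; returns the effective step size for illustration.
    return prefill_step_size or model["prefill_step_size"]
```

The prediction-time shape is strictly more flexible, but as noted above, it diverges conceptually from llama.cpp's load-time batch size, which is why consistency won out for now.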
`prefill_step_size` as a parameter to `load_model`