
Add prefill_step_size as load param#295

Merged
will-lms merged 21 commits into main from will/prefill-step-size on Mar 25, 2026
Conversation

@will-lms
Contributor

  1. Add prefill_step_size as a parameter to load_model
  2. Increase the default step size to 2048 across the board. This was already the default for BatchedModelKit. Users (or the app) can set a lower value if desired.
  3. Add tests for new parameter.
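The shape of the change can be sketched as follows. This is a minimal illustration, not the actual mlx-engine code: the `load_model` signature and the validation helper here are assumptions; only the parameter name and the 2048 default come from this PR.

```python
from typing import Optional

# New default from this PR (previously only BatchedModelKit used 2048).
DEFAULT_PREFILL_STEP_SIZE = 2048


def load_model(model_path: str, prefill_step_size: Optional[int] = None) -> dict:
    """Illustrative load_model accepting prefill_step_size as a load param.

    None resolves to the default; explicit values are validated.
    """
    if prefill_step_size is None:
        prefill_step_size = DEFAULT_PREFILL_STEP_SIZE
    elif prefill_step_size < 1:
        raise ValueError("prefill_step_size must be a positive integer")
    # Real code would load weights here; we just return the resolved config.
    return {"model_path": model_path, "prefill_step_size": prefill_step_size}
```

A caller that wants lower peak memory during prompt processing would pass a smaller value, e.g. `load_model("some-model", prefill_step_size=512)`.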

@github-actions github-actions bot added the CLA signed label Mar 24, 2026
@will-lms
Contributor Author

Codex Review

The new prefill-step-size override is wired through most generation paths, but it is still ignored for
VisionModelKit-backed image requests. That leaves the advertised escape hatch ineffective for a real subset of
multimodal users.

Review comment:

  • [P2] Honor prefill_step_size on VisionModelKit image requests — mlx-engine/mlx_engine/vision_model_kit/vision_model_kit.py:141-144
    On VisionModelKit-backed multimodal models (for example Qwen2/2.5-VL), this path still returns a fake one-token
    prompt and does the real prompt prefill inside VisionModelWrapper.call() as a single full-sequence pass.
    That makes the new prefill_step_size escape hatch a silent no-op whenever images_b64 is non-empty, so users
    lowering it to work around the new 2048 default will still see the same peak-memory behavior on image requests.

Will's response

This is true. The old prefill step size default was also not applied in the mlx-vlm image prompt path. It is out of scope to add for now. mlx-vlm does have a prefill_step_size but does not have a prompt_progress_callback equivalent.
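The no-op described above comes down to the difference between chunked prefill and a single full-sequence pass: chunking caps peak memory at roughly one chunk's worth of activations and gives natural points to report progress. A generic sketch (the `process_chunk` and `progress_callback` names are illustrative, not the mlx-vlm API, which as noted lacks a progress-callback equivalent):

```python
from typing import Callable, List, Optional


def chunked_prefill(
    tokens: List[int],
    step_size: int,
    process_chunk: Callable[[List[int]], None],
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> None:
    """Feed the prompt to the model step_size tokens at a time.

    Peak memory scales with step_size rather than the full prompt length,
    and progress can be reported after each chunk.
    """
    n = len(tokens)
    for start in range(0, n, step_size):
        process_chunk(tokens[start:start + step_size])
        if progress_callback is not None:
            progress_callback(min(start + step_size, n), n)
```

The VisionModelKit image path instead behaves like `process_chunk(tokens)` called once on the whole prompt, so lowering `prefill_step_size` changes nothing there.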

@will-lms will-lms marked this pull request as ready for review March 24, 2026 20:59
ValueError: If the model configuration is invalid or unsupported
"""
set_seed(seed)
prefill_step_size = validate_prefill_step_size(prefill_step_size)
Member


nit: could call this resolve_prefill_step_size since not a pure validation function (resolves default if None)

Personal preference, totally non-blocking

Contributor Author


I had this first, but codex was unhappy because "it is also resolving, not just validating." It clearly does both, but the resolve_and_validate_... name is a mouthful.
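For illustration, a resolve-and-validate helper of the kind being debated might look like this. The body is a sketch under assumptions (the real function's internals are not shown in this thread); only the name `validate_prefill_step_size` and the resolve-None-to-default behavior come from the discussion.

```python
from typing import Optional

DEFAULT_PREFILL_STEP_SIZE = 2048


def validate_prefill_step_size(value: Optional[int]) -> int:
    """Resolve None to the default, then validate the result.

    Does both jobs behind one name, which is the naming nit above:
    it is not a pure validation function.
    """
    resolved = DEFAULT_PREFILL_STEP_SIZE if value is None else value
    if not isinstance(resolved, int) or resolved < 1:
        raise ValueError(
            f"prefill_step_size must be a positive integer, got {resolved!r}"
        )
    return resolved
```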

kv_bits (Optional[int]): Number of bits for KV cache quantization.
kv_group_size (Optional[int]): Group size for KV cache quantization.
quantized_kv_start (Optional[int]): Step to begin KV cache quantization when enabled.
prefill_step_size (Optional[int]): Number of tokens to process per prefill chunk.
Member


For your consideration - from what I can tell prefill_step_size doesn't really have to be a load parameter, but instead could be a create_generator/inference-time parameter.

I think this would more directly model the capabilities of the underlying API at this time and enable more flexibility without needing to reload the model.

I acknowledge llama.cpp treats batch size as a load parameter, so there would be a conceptual divergence there.

Also non-blocking IMO

Contributor Author


Yes, I made this choice because we already treat prefill step size as a Load-time parameter for llama.cpp. I opted to keep consistency, but I agree that for this engine we could treat it as Prediction-time.

Going to defer for now as we don't have a use-case that would benefit from making it Prediction-time.

@will-lms will-lms merged commit 8cc4a15 into main Mar 25, 2026
2 checks passed
@will-lms will-lms deleted the will/prefill-step-size branch March 25, 2026 14:53
@github-actions github-actions bot locked and limited conversation to collaborators Mar 25, 2026