
Add prefill_step_size as load param#295

Merged
will-lms merged 21 commits into main from will/prefill-step-size on Mar 25, 2026
Conversation

@will-lms
Contributor

  1. Add prefill_step_size as a parameter to load_model
  2. Increase the default step size to 2048 across the board. This was already the default for BatchedModelKit. Users (or the app) can set a lower value if desired.
  3. Add tests for new parameter.
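The shape of the change can be sketched as follows. This is a minimal illustration, not the actual mlx-engine code: the `load_model` signature and the validation helper here are assumptions; only the parameter name and the 2048 default come from this PR.

```python
from typing import Optional

# New default from this PR (previously only BatchedModelKit used 2048).
DEFAULT_PREFILL_STEP_SIZE = 2048


def load_model(model_path: str, prefill_step_size: Optional[int] = None) -> dict:
    """Illustrative load_model accepting prefill_step_size as a load param.

    None resolves to the default; explicit values are validated.
    """
    if prefill_step_size is None:
        prefill_step_size = DEFAULT_PREFILL_STEP_SIZE
    elif prefill_step_size < 1:
        raise ValueError("prefill_step_size must be a positive integer")
    # Real code would load weights here; we just return the resolved config.
    return {"model_path": model_path, "prefill_step_size": prefill_step_size}
```

A caller that wants lower peak memory during prompt processing would pass a smaller value, e.g. `load_model("some-model", prefill_step_size=512)`.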

@github-actions github-actions bot added the CLA signed label Mar 24, 2026
@will-lms
Contributor Author

Codex Review

The new prefill-step-size override is wired through most generation paths, but it is still ignored for
VisionModelKit-backed image requests. That leaves the advertised escape hatch ineffective for a real subset of
multimodal users.

Review comment:

  • [P2] Honor prefill_step_size on VisionModelKit image requests — mlx-engine/mlx_engine/vision_model_kit/vision_model_kit.py:141-144
    On VisionModelKit-backed multimodal models (for example Qwen2/2.5-VL), this path still returns a fake one-token
    prompt and does the real prompt prefill inside VisionModelWrapper.call() as a single full-sequence pass.
    That makes the new prefill_step_size escape hatch a silent no-op whenever images_b64 is non-empty, so users
    lowering it to work around the new 2048 default will still see the same peak-memory behavior on image requests.

Will's response

This is true. The old prefill step size default was also not applied in the mlx-vlm image prompt path. It is out of scope to add for now. mlx-vlm does have a prefill_step_size but does not have a prompt_progress_callback equivalent.
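The no-op described above comes down to the difference between chunked prefill and a single full-sequence pass: chunking caps peak memory at roughly one chunk's worth of activations and gives natural points to report progress. A generic sketch (the `process_chunk` and `progress_callback` names are illustrative, not the mlx-vlm API, which as noted lacks a progress-callback equivalent):

```python
from typing import Callable, List, Optional


def chunked_prefill(
    tokens: List[int],
    step_size: int,
    process_chunk: Callable[[List[int]], None],
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> None:
    """Feed the prompt to the model step_size tokens at a time.

    Peak memory scales with step_size rather than the full prompt length,
    and progress can be reported after each chunk.
    """
    n = len(tokens)
    for start in range(0, n, step_size):
        process_chunk(tokens[start:start + step_size])
        if progress_callback is not None:
            progress_callback(min(start + step_size, n), n)
```

The VisionModelKit image path instead behaves like `process_chunk(tokens)` called once on the whole prompt, so lowering `prefill_step_size` changes nothing there.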

@will-lms will-lms marked this pull request as ready for review March 24, 2026 20:59
ValueError: If the model configuration is invalid or unsupported
"""
set_seed(seed)
prefill_step_size = validate_prefill_step_size(prefill_step_size)
Member


nit: could call this resolve_prefill_step_size since not a pure validation function (resolves default if None)

Personal preference, totally non-blocking

Contributor Author


I had this first, but codex was unhappy because "it is also resolving, not just validating." It clearly does both, but the resolve_and_validate_... name is a mouthful.
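For illustration, a resolve-and-validate helper of the kind being debated might look like this. The body is a sketch under assumptions (the real function's internals are not shown in this thread); only the name `validate_prefill_step_size` and the resolve-None-to-default behavior come from the discussion.

```python
from typing import Optional

DEFAULT_PREFILL_STEP_SIZE = 2048


def validate_prefill_step_size(value: Optional[int]) -> int:
    """Resolve None to the default, then validate the result.

    Does both jobs behind one name, which is the naming nit above:
    it is not a pure validation function.
    """
    resolved = DEFAULT_PREFILL_STEP_SIZE if value is None else value
    if not isinstance(resolved, int) or resolved < 1:
        raise ValueError(
            f"prefill_step_size must be a positive integer, got {resolved!r}"
        )
    return resolved
```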

kv_bits (Optional[int]): Number of bits for KV cache quantization.
kv_group_size (Optional[int]): Group size for KV cache quantization.
quantized_kv_start (Optional[int]): Step to begin KV cache quantization when enabled.
prefill_step_size (Optional[int]): Number of tokens to process per prefill chunk.
Member


For your consideration - from what I can tell prefill_step_size doesn't really have to be a load parameter, but instead could be a create_generator/inference-time parameter.

I think this would more directly model the capabilities of the underlying API at this time and enable more flexibility without needing to reload the model.

I acknowledge llama.cpp treats batch size as a load parameter, so there would be a conceptual divergence there.

Also non-blocking IMO

Contributor Author


Yes, I made this choice because we already treat prefill step size as a Load-time parameter for llama.cpp. I opted to keep consistency, but I agree that for this engine we could treat it as Prediction-time.

Going to defer for now as we don't have a use-case that would benefit from making it Prediction-time.

@will-lms will-lms merged commit 8cc4a15 into main Mar 25, 2026
2 checks passed
@will-lms will-lms deleted the will/prefill-step-size branch March 25, 2026 14:53
@github-actions github-actions bot locked and limited conversation to collaborators Mar 25, 2026