Verified run-to-run determinism: 50 prompts × 20 runs, temperature=0, two models: 100% bit-identical #1017
NullPointerDepressiveDisorder started this conversation in Show and tell
I ran a systematic determinism test on mlx-lm and wanted to share results since I couldn't find this documented anywhere beyond the parameter description saying temperature 0.0 equals "deterministic."
Setup
Two 4-bit quantized models: mlx-community/Meta-Llama-3.1-8B-Instruct-4bit and mlx-community/Qwen3.5-4B-4bit. 50 prompts × 20 runs per model, temperature=0, single-request mode.
Results
Every prompt produced bit-identical output across all 20 runs for both models. Perfect run-to-run determinism in single-request mode.
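For anyone wanting to reproduce this kind of check without the CLI, the core comparison is simple: hash each run's output and count distinct hashes per prompt. A minimal sketch (the `runs` list here is placeholder data standing in for real model output, not actual generations):

```python
import hashlib

def run_hashes(outputs):
    """SHA-256 each generation so runs can be compared byte-for-byte."""
    return [hashlib.sha256(o.encode("utf-8")).hexdigest() for o in outputs]

# Placeholder: in the real test, each string would come from a separate
# generate() call at temperature=0 with the same prompt.
runs = ["The capital of France is Paris."] * 20
hashes = run_hashes(runs)

# Bit-identical outputs collapse to a single distinct hash.
print(len(set(hashes)))  # 1
```

Comparing hashes rather than raw strings makes it cheap to store results across many prompts and runs, and any single differing byte shows up immediately.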
Why tho?
An August 2024 study found that even with temperature=0 and fixed seeds, production serving engines show considerable output variation: Mixtral-8x7b had a 72 percentage-point accuracy range across 10 runs. The mlx-deterministic project documented that MLX inference can produce different outputs with different batch sizes, due to reduction-order changes in RMSNorm/MatMul/Softmax.
My results are consistent with this distinction: single-request, same batch size, same hardware produces perfect determinism. I have not tested batch-invariant determinism (whether outputs change when processed alongside other requests in a batch). That's a different and harder property.
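The reduction-order point is easy to demonstrate in isolation: float32 addition is not associative, so any kernel that changes its summation order (for example, under a different batch size) can legitimately produce different bits from the same values. A toy illustration:

```python
import numpy as np

a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)

# Same three values, two reduction orders:
left = (a + b) + c   # 0.0 + 1.0            -> 1.0
right = a + (b + c)  # -1e8 + 1 rounds back to -1e8 (ulp at 1e8 is 8.0),
                     # so 1e8 + (-1e8)      -> 0.0

print(left, right)  # 1.0 0.0
```

Same hardware, same inputs, different answer purely from grouping. This is why batch-invariant determinism is a much harder property than run-to-run determinism at a fixed batch size.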
Methodology
Testing done with infer-check (pip install infer-check), a CLI I built for inference correctness testing on MLX engines.
Full writeup: blog post link
Sample size is small (n=50 × 20 = 1,000 runs per model), so treat this as a positive signal rather than comprehensive proof. Would be interested to know if anyone's seen non-determinism in single-request mode under different conditions.
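For anyone running their own variant of this test, the aggregate metric is just the fraction of prompts whose runs all collapsed to one output. A sketch of that tally (the `results` dict here is placeholder data shaped like the test, not real generations):

```python
def determinism_rate(results):
    """results: {prompt: [output string per run]}.
    Returns the fraction of prompts whose runs were all identical."""
    identical = sum(1 for runs in results.values() if len(set(runs)) == 1)
    return identical / len(results)

# Placeholder shaped like this test: 50 prompts x 20 runs per prompt.
results = {f"prompt-{i}": ["same output"] * 20 for i in range(50)}

print(determinism_rate(results))  # 1.0 = perfect run-to-run determinism
```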