Benchmark: MLX on analytical (non-ML) workloads - 6 TPC-H queries on M4 #3139
sadopc
started this conversation in Show and tell
Hi MLX Team,
I wanted to share results from a research project benchmarking MLX on analytical database workloads: a use case outside the typical ML/deep-learning focus.
The project, Unified-DB-2, runs 6 TPC-H queries across three execution paths on Apple M4: DuckDB SQL, NumPy CPU kernels, and MLX GPU kernels. The goal was to quantify the unified memory advantage for GPU-accelerated analytics.
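As context for the bandwidth figures below, a minimal sketch of how effective transfer bandwidth can be measured (bytes moved divided by elapsed wall time). The helper name is illustrative and this is not the project's code; it times a plain NumPy copy as a stand-in, since in the actual benchmark the timed operation is `mx.array(np_array)`, which likewise materializes a full copy:

```python
import time
import numpy as np

def copy_bandwidth_gbs(nbytes: int = 1 << 28) -> float:
    """Illustrative helper (not from Unified-DB-2): effective bandwidth
    of a host-side buffer copy in GB/s. In the benchmark the timed call
    would be mx.array(np_array), which also produces a full copy."""
    src = np.ones(nbytes, dtype=np.uint8)
    t0 = time.perf_counter()
    dst = src.copy()  # stand-in for mx.array(src)
    elapsed = time.perf_counter() - t0
    return dst.nbytes / elapsed / 1e9
```

Running this a few times and taking the best result gives a reasonable effective-bandwidth estimate; single runs are noisy for small buffers.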
Key findings
`mx.array()` from NumPy operates at ~120 GB/s unified memory bandwidth (a copy, not zero-copy), which still vastly outperforms the 25-50 GB/s PCIe path on discrete-GPU systems.

MLX limitations encountered and workarounds
During development I hit several MLX constraints and documented workarounds that may be useful to others:
- No boolean mask indexing (`array[bool_mask]`): use `mx.where(mask, value, zero)` plus the overflow bin pattern
- No `argwhere`/`nonzero`: fall back to NumPy's `np.where(np_array > 0)[0]`
- `mx.array(numpy)` is a copy, not zero-copy: `del` intermediates early to minimize peak memory
- No `lexsort` for multi-key sorting

The overflow bin pattern was particularly effective: masked-out rows are routed to a discard group at index N during scatter-add, then sliced off with `[:N]`. This avoids boolean indexing entirely while keeping the computation fully on GPU.

Charts
MLX GPU vs NumPy CPU speedup (higher is better):
Three-baseline comparison at SF1 (lower is better):
MLX scaling across data sizes (lower is better):
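The overflow bin pattern described above can be sketched in NumPy as follows. The function name, shapes, and dtypes are illustrative (not the project's exact code), and the MLX version would replace `np.where`/`np.add.at` with the corresponding MLX operations:

```python
import numpy as np

def masked_group_sum(values, group_ids, mask, n_groups):
    """Overflow bin pattern (illustrative sketch): instead of
    values[mask] (boolean indexing, unsupported in MLX), route
    masked-out rows to a discard bin at index n_groups, scatter-add
    all rows, then slice the discard bin off the result."""
    idx = np.where(mask, group_ids, n_groups)         # masked-out rows -> bin N
    sums = np.zeros(n_groups + 1, dtype=values.dtype)
    np.add.at(sums, idx, values)                      # scatter-add, no branching
    return sums[:n_groups]                            # drop the overflow bin
```

For example, with `values=[1, 2, 3, 4]`, `group_ids=[0, 1, 0, 1]`, and `mask=[True, False, True, True]`, the masked-out second row lands in the discard bin and the result is `[4, 4]`.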
Links
All benchmarks are reproducible with `uv run python scripts/run_all.py`.

Happy to discuss any of the findings, methodology, or MLX workarounds.