Benchmark: MLX on analytical (non-ML) workloads - 6 TPC-H queries on M4 #3139
sadopc
started this conversation in Show and tell
Hi MLX Team,
I wanted to share results from a research project benchmarking MLX on analytical database workloads: a use case outside the typical ML/deep-learning focus.
The project, Unified-DB-2, runs 6 TPC-H queries across three execution paths on Apple M4: DuckDB SQL, NumPy CPU kernels, and MLX GPU kernels. The goal was to quantify the unified memory advantage for GPU-accelerated analytics.
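As context for the bandwidth figures below, a minimal sketch of how effective transfer bandwidth can be measured (bytes moved divided by elapsed wall time). The helper name is illustrative and this is not the project's code; it times a plain NumPy copy as a stand-in, since in the actual benchmark the timed operation is `mx.array(np_array)`, which likewise materializes a full copy:

```python
import time
import numpy as np

def copy_bandwidth_gbs(nbytes: int = 1 << 28) -> float:
    """Illustrative helper (not from Unified-DB-2): effective bandwidth
    of a host-side buffer copy in GB/s. In the benchmark the timed call
    would be mx.array(np_array), which also produces a full copy."""
    src = np.ones(nbytes, dtype=np.uint8)
    t0 = time.perf_counter()
    dst = src.copy()  # stand-in for mx.array(src)
    elapsed = time.perf_counter() - t0
    return dst.nbytes / elapsed / 1e9
```

Running this a few times and taking the best result gives a reasonable effective-bandwidth estimate; single runs are noisy for small buffers.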
Key findings
`mx.array()` from NumPy operates at ~120 GB/s unified memory bandwidth (a copy, not zero-copy), which still vastly outperforms the 25-50 GB/s PCIe path on discrete-GPU systems.

MLX limitations encountered and workarounds
During development I hit several MLX constraints and documented workarounds that may be useful to others:
- No boolean mask indexing (`array[bool_mask]`): use `mx.where(mask, value, zero)` plus the overflow bin pattern
- No `argwhere`/`nonzero`: fall back to NumPy's `np.where(np_array > 0)[0]`
- `mx.array(numpy)` is a copy, not zero-copy: `del` intermediates early to minimize peak memory
- No `lexsort` for multi-key sorting

The overflow bin pattern was particularly effective: masked-out rows are routed to a discard group at index N during scatter-add, then sliced off with `[:N]`. This avoids boolean indexing entirely while keeping the computation fully on GPU.

Charts
MLX GPU vs NumPy CPU speedup (higher is better):
Three-baseline comparison at SF1 (lower is better):
MLX scaling across data sizes (lower is better):
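The overflow bin pattern described above can be sketched in NumPy as follows. The function name, shapes, and dtypes are illustrative (not the project's exact code), and the MLX version would replace `np.where`/`np.add.at` with the corresponding MLX operations:

```python
import numpy as np

def masked_group_sum(values, group_ids, mask, n_groups):
    """Overflow bin pattern (illustrative sketch): instead of
    values[mask] (boolean indexing, unsupported in MLX), route
    masked-out rows to a discard bin at index n_groups, scatter-add
    all rows, then slice the discard bin off the result."""
    idx = np.where(mask, group_ids, n_groups)         # masked-out rows -> bin N
    sums = np.zeros(n_groups + 1, dtype=values.dtype)
    np.add.at(sums, idx, values)                      # scatter-add, no branching
    return sums[:n_groups]                            # drop the overflow bin
```

For example, with `values=[1, 2, 3, 4]`, `group_ids=[0, 1, 0, 1]`, and `mask=[True, False, True, True]`, the masked-out second row lands in the discard bin and the result is `[4, 4]`.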
Links
All benchmarks are reproducible with `uv run python scripts/run_all.py`.

Happy to discuss any of the findings, methodology, or MLX workarounds.