Slice update with operation by angeloskath · Pull Request #3266 · ml-explore/mlx

angeloskath · 2026-03-16T11:31:58Z

Adds slice_update_op variants. These allow a faster implementation of slice updates that doesn't fall back to scatter. The CPU and CUDA implementations are still missing (will add them before merging).

There are still optimizations to be done, but the first numbers (M3 Ultra GPU) are:

Dtype        Dst Shape                 Update Shape       MLX^- (ms)   MLX (ms)    Torch (ms)   MLX^- GB/s   MLX GB/s     Torch GB/s
-------------------------------------------------------------------------------------------------------------------------------------
float32      (10000000,)               (1000000,)         0.870        0.699       1.953        413.85       515.34       184.32
float32      (100000,)                 (10000,)           0.466        0.454       1.804        7.73         7.94         2.00
float32      (1000, 64)                (100, 64)          0.482        0.456       1.812        4.78         5.05         1.27
float32      (100, 100, 64)            (20, 100, 64)      0.541        0.478       1.857        85.17        96.46        24.81
float32      (2048, 2048, 128)         (1000, 1000, 64)   37.590       33.338      77.250       612.93       691.10       298.25
float32      (2048, 2048, 128)         (50, 100, 64)      1.301        1.218       1.080        88.54        94.55        106.70
float32      (2048, 2048, 128)         (10, 10, 64)       1.276        1.187       0.698        1.81         1.94         3.30
bfloat16     (10000000,)               (1000000,)         1.203        0.634       1.971        149.60       283.77       91.32
bfloat16     (100000,)                 (10000,)           0.524        0.489       1.935        3.44         3.68         0.93
bfloat16     (1000, 64)                (100, 64)          0.526        0.453       1.908        2.19         2.54         0.60
bfloat16     (100, 100, 64)            (20, 100, 64)      0.603        0.497       1.936        38.21        46.35        11.90
bfloat16     (2048, 2048, 128)         (1000, 1000, 64)   36.939       17.693      76.855       311.87       651.10       149.89
bfloat16     (2048, 2048, 128)         (50, 100, 64)      1.407        1.214       1.098        40.94        47.46        52.46
bfloat16     (2048, 2048, 128)         (10, 10, 64)       1.287        1.209       0.737        0.89         0.95         1.56

Here MLX^- means MLX before this PR, which converts the slices to index arrays and uses scatter.
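
To make the difference between the two paths concrete, here is a minimal pure-Python sketch. It is not the MLX implementation: the helper names slice_to_indices, scatter_add and slice_update_add are hypothetical and plain lists stand in for arrays. It only illustrates that expanding a slice into an index array and scattering gives the same result as computing each position directly from the slice's start and step.

```python
def slice_to_indices(s, length):
    """Expand a slice into the explicit index list a scatter would consume."""
    return list(range(*s.indices(length)))


def scatter_add(dst, indices, updates):
    """Index-array path: one explicit index is read per updated element."""
    out = list(dst)
    for i, u in zip(indices, updates):
        out[i] += u
    return out


def slice_update_add(dst, s, updates):
    """Direct path: positions come from start and step, no index array."""
    start, _stop, step = s.indices(len(dst))
    out = list(dst)
    for k, u in enumerate(updates):
        out[start + k * step] += u
    return out


dst = [0.0] * 10
upd = [1.0, 2.0, 3.0]
s = slice(2, 8, 2)

idx = slice_to_indices(s, len(dst))
via_scatter = scatter_add(dst, idx, upd)
via_slice = slice_update_add(dst, s, upd)

print(idx)                       # [2, 4, 6]
print(via_scatter == via_slice)  # True
```

The direct path never materializes idx, which is the extra work the scatter fallback has to do for every updated element.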

One of the main benefits of this PR is that changing code like x[idx] += 2 to x = x.at[idx].add(2) will almost certainly be significantly more efficient now since it will allow donating x.
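
As a language-level illustration (plain Python, not MLX code), an augmented indexed assignment such as x[idx] += 2 is evaluated as a read through __getitem__, an add, and a write-back through __setitem__. The sketch below uses a hypothetical Traced wrapper around a list to log those calls. The single-call form x.at[idx].add(2) instead expresses the whole update as one operation; how that interacts with donation is the PR author's claim and is not modeled here.

```python
class Traced:
    """Hypothetical list wrapper that logs indexing calls (not an MLX type)."""

    def __init__(self, data):
        self.data = list(data)
        self.calls = []

    def __getitem__(self, idx):
        self.calls.append("getitem")
        return self.data[idx]

    def __setitem__(self, idx, value):
        self.calls.append("setitem")
        self.data[idx] = value


x = Traced([1, 2, 3, 4])
x[1] += 2  # read x[1], add 2, then write the result back to x[1]

print(x.calls)  # ['getitem', 'setitem']
print(x.data)   # [1, 4, 3, 4]
```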

The CPU version gets a pretty big boost as it is much simpler to implement (and I added a small SIMD optimization). M3 Ultra numbers below:

Dtype        Dst Shape                 Update Shape         MLX^- (ms)   MLX (ms)   Torch (ms)   MLX^- GB/s   MLX GB/s   Torch GB/s
------------------------------------------------------------------------------------------------------------------------------------
float32      (10000000,)               (1000000,)           54.328       2.653      6.468        6.63         135.67     55.66
float32      (100000,)                 (10000,)             0.542        0.078      0.099        6.65         45.98      36.18
float32      (1000, 64)                (100, 64)            0.353        0.075      0.076        6.53         30.85      30.16
float32      (100, 100, 64)            (20, 100, 64)        5.633        0.376      6.129        8.18         122.47     7.52
bfloat16     (10000000,)               (1000000,)           52.641       4.251      6.469        3.42         42.35      27.82
bfloat16     (100000,)                 (10000,)             0.441        0.101      0.179        4.08         17.89      10.05
bfloat16     (1000, 64)                (100, 64)            0.308        0.086      0.134        3.74         13.33      8.57
bfloat16     (100, 100, 64)            (20, 100, 64)        4.720        0.596      6.117        4.88         38.68      3.77

angeloskath · 2026-03-17T23:41:20Z

Fixed CUDA. This is the benchmark on an H100 now; pretty clearly faster than PyTorch.

Dtype        Dst Shape                 Update Shape         MLX (ms)     MLX GB/s     Torch (ms)   Torch GB/s
--------------------------------------------------------------------------------------------------------------
float32      (10000000,)               (1000000,)           0.504        952.71       0.899        534.16
float32      (100000,)                 (10000,)             0.333        14.39        0.793        6.05
float32      (1000, 64)                (100, 64)            0.328        9.36         0.802        3.83
float32      (100, 100, 64)            (20, 100, 64)        0.342        179.84       0.811        75.73
float32      (2048, 2048, 128)         (1000, 1000, 64)     16.224       1893.50      18.634       1648.64
float32      (2048, 2048, 128)         (50, 100, 64)        0.437        351.49       0.935        164.31
float32      (2048, 2048, 128)         (10, 10, 64)         0.376        8.16         0.924        3.33
bfloat16     (10000000,)               (1000000,)           0.379        634.03       0.853        281.47
bfloat16     (100000,)                 (10000,)             0.307        7.83         0.799        3.00
bfloat16     (1000, 64)                (100, 64)            0.299        5.14         0.805        1.91
bfloat16     (100, 100, 64)            (20, 100, 64)        0.307        100.21       0.807        38.09
bfloat16     (2048, 2048, 128)         (1000, 1000, 64)     9.292        1653.09      13.671       1123.53
bfloat16     (2048, 2048, 128)         (50, 100, 64)        1.900        40.42        0.883        86.94
bfloat16     (2048, 2048, 128)         (10, 10, 64)         0.325        4.72         0.886        1.73

zcbenz

Beautiful change

angeloskath added 11 commits March 16, 2026 00:41

Make slice update op

84d316b

Add a rudimentary test

aa75533

Connect slice update to python

3a8fb3d

Small perf tune

db3d2cb

Add gradients for SliceUpdate

2e49c1d

Fix typo in vjp

1f14e72

Add some tests

6ebdb12

More tests

3c4da58

More slice only tests

00443a2

Add a benchmark

9c86773

Improve the benchmark

f18f874

angeloskath requested a review from nastya236 March 16, 2026 11:32

angeloskath added 7 commits March 16, 2026 13:54

Add a cpu implementation of slice_update_op

ed8b0a9

Appease the compiler

83755d4

Enable the benchmark in CUDA

b117de9

Add slice_update_op in CUDA

4ad29cd

Fix strides in cpu slice_update_op

1b04515

Optimize the Metal kernel a bit

4d9867f

Add some more tests

51b37f9

angeloskath mentioned this pull request Mar 17, 2026

Fix the rope mutation in a more natural way ml-explore/mlx-lm#1014

Open

Fix CUDA implementation

13126d6

zcbenz approved these changes Mar 18, 2026

angeloskath merged commit 7bc61cc into main Mar 18, 2026
16 checks passed

angeloskath deleted the slice-update branch March 18, 2026 13:18

angeloskath merged 19 commits into main from slice-update
