
[CUDA] Reduce use of managed memory#2725

Merged
awni merged 10 commits into main from async_cuda_malloc
Nov 6, 2025

Conversation

Member

@awni awni commented Nov 1, 2025

This is an attempt to reduce the use of managed memory in our CUDA backend. There are still a few details to figure out and ideas to explore, but for the common use cases of LLM training and inference it works quite well.

Some key differences:

  • malloc is still the same as before: it gives you managed memory and caches it in the buffer cache.
  • cu::malloc_async gives you async-allocated, non-managed memory and caches it in the buffer cache.
  • eval_gpu should always prefer cu::malloc_async(size, stream); to get the faster non-managed memory.
  • Pointers passed to kernels should be obtained with gpu_ptr<T>(array).
  • Accessing arr.data<T>() can trigger a copy if the data is not in the right place, so don't access it unless you need the data to be accessible from the CPU.
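The allocation split above can be illustrated with a minimal mock (plain C++, no CUDA; MemKind, Buffer, and the *_mock functions are hypothetical stand-ins for the real allocator, not MLX code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the two memory kinds: Managed mimics cudaMallocManaged
// memory (CPU-accessible), DeviceAsync mimics stream-ordered device memory.
enum class MemKind { Managed, DeviceAsync };

struct Buffer {
  MemKind kind;
  std::vector<char> bytes;  // stand-in for the actual allocation
};

// Plain malloc: same as before, hands back managed memory.
Buffer malloc_mock(std::size_t size) {
  return {MemKind::Managed, std::vector<char>(size)};
}

// malloc_async: non-managed, device-only memory; preferred in eval_gpu.
Buffer malloc_async_mock(std::size_t size) {
  return {MemKind::DeviceAsync, std::vector<char>(size)};
}

// Mirrors the arr.data<T>() caveat: a CPU access of device-only memory
// forces a (potentially slow) migration to managed memory first.
char* cpu_data(Buffer& b) {
  if (b.kind == MemKind::DeviceAsync) {
    b.kind = MemKind::Managed;  // simulate copy-to-managed on CPU access
  }
  return b.bytes.data();
}
```

The point of the mock is the asymmetry: kernels see device memory for free, while CPU access pays a migration cost.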

Some improvements to think about:

  • All allocations still go in the buffer cache. When doing malloc_async or malloc we first check the buffer cache. It might be good to prefer the appropriate type of memory (managed or not) when pulling from the cache.

  • When doing the copy to managed memory, we rely on the fact that it does a device synchronize. This can be slow. So it may be good to make it async when possible.
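One way the first improvement could look, sketched with hypothetical stand-in types (a multimap keyed on size and memory kind, not MLX's actual buffer cache):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>

// Key the cache on both size and memory kind, so malloc_async only reuses
// async buffers and malloc only reuses managed ones.
enum class MemKind { Managed, DeviceAsync };

struct CachedBuf {
  MemKind kind;
  std::size_t size;
};

using CacheKey = std::pair<std::size_t, MemKind>;
std::multimap<CacheKey, CachedBuf> cache;

// Returns true and pops a matching buffer on a hit; a miss means the
// caller falls through to a fresh allocation of the right kind.
bool try_pop(std::size_t size, MemKind kind, CachedBuf* out) {
  auto it = cache.find({size, kind});
  if (it == cache.end()) return false;
  *out = it->second;
  cache.erase(it);
  return true;
}
```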

Benchmarks

Command:

mlx_lm.benchmark --model meta-llama/Meta-Llama-3.1-8B --p 2049 -g 128 -b 1 -n 4

DGX Spark
Pre: Averages: prompt_tps=744.025, generation_tps=10.301, peak_memory=21.707
Post: Averages: prompt_tps=2093.590, generation_tps=14.553, peak_memory=21.707

B200
Pre: Averages: prompt_tps=4780.945, generation_tps=207.900, peak_memory=21.707
Post: Averages: prompt_tps=37887.799, generation_tps=217.166, peak_memory=21.707

Note: the prompt length of 2049 is chosen to get an aligned size (2048) for prompt processing; otherwise it's quite a bit slower.

@awni awni requested review from angeloskath and zcbenz November 1, 2025 19:42
@awni awni force-pushed the async_cuda_malloc branch from 5313209 to c27a064 Compare November 1, 2025 20:19
Collaborator

zcbenz commented Nov 2, 2025

A future improvement we can do is to use cudaFreeAsync to free buffers.

@awni awni force-pushed the async_cuda_malloc branch 3 times, most recently from 74c6ceb to 6adde62 Compare November 3, 2025 23:05
Member Author

awni commented Nov 3, 2025

Ok I think this is good to go. @zcbenz, thanks for the review. I addressed your comments. The main thing was the removal of the cuda pool which simplified stuff nicely.

@awni awni force-pushed the async_cuda_malloc branch from 6adde62 to cc6df9f Compare November 3, 2025 23:07
Collaborator

@zcbenz zcbenz left a comment


Awesome improvement!

Collaborator

zcbenz commented Nov 3, 2025

It would be really nice if we could replace the MLX buffer cache with CUDA memory pools. In CUDA 13 it's possible to use a pool with managed memory, but not in CUDA 12.

What do you think about using the MLX buffer cache for managed memory and a CUDA memory pool for everything else (as future work)?

note the prompt length of 2049 is to get an aligned size (2048) for prompt processing, otherwise it's quite a bit slower.

Can you elaborate on this? I'm trying to understand why an extra token is needed to get an aligned size.

Member Author

awni commented Nov 4, 2025

What do you think if we use MLX buffer cache for managed memory and CUDA memory pool for others (as a future work)?

Worth exploring. We might also want to give the cache a similar API, to make it easy to swap in a CUDA pool even for managed memory when you are on CUDA toolkit 13 and up.

Member Author

awni commented Nov 4, 2025

Can you elaborate on this? I'm trying to understand why an extra token is needed to get an aligned size.

It's an artifact of how we process the prompt to avoid computing the LM head for all but the last token. So basically we process the prompt in two steps: the first n-1 tokens without the LM head, then the last token with the LM head.
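Concretely, with a 2049-token prompt the split yields an aligned 2048-token chunk plus one final token. A toy helper (hypothetical, not mlx-lm code) showing the arithmetic:

```cpp
#include <utility>

// Two-step prompt processing: the first n-1 tokens are processed without
// the LM head, then the final token is processed with it. Passing 2049
// tokens makes the bulk chunk an aligned 2048.
std::pair<int, int> split_prompt(int prompt_len) {
  int bulk = prompt_len - 1;  // processed without the LM head
  int last = 1;               // processed with the LM head
  return {bulk, last};
}
```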

@awni awni force-pushed the async_cuda_malloc branch from 4f8574e to fc00f16 Compare November 4, 2025 15:57

@awni awni force-pushed the async_cuda_malloc branch from 234000e to 7eaa504 Compare November 4, 2025 17:32
Member

@angeloskath angeloskath left a comment


This looks awesome!

I still think we need to add a check in CudaAllocator::malloc_impl for zero-sized buffers and return early with {nullptr, 0, -1}.

Also the corresponding check in CudaAllocator::free should be if (!buf || !buf->data). Or the latter can be moved into CudaAllocator::cuda_free; dealer's choice.
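A sketch of the suggested guards (the Buffer struct and mock functions are hypothetical stand-ins for CudaAllocator's actual types and allocation calls):

```cpp
#include <cstddef>

// Hypothetical stand-in for the allocator's buffer record.
struct Buffer {
  void* data;
  std::size_t size;
  int device;
};

// malloc_impl guard: return early for zero-sized buffers instead of
// hitting the real allocation path.
Buffer malloc_impl_mock(std::size_t size) {
  if (size == 0) {
    return {nullptr, 0, -1};
  }
  return {::operator new(size), size, 0};  // stand-in for the real alloc
}

// free guard: a null buffer, or a buffer with no data (e.g. the
// zero-sized case above), is a no-op.
void free_mock(Buffer* buf) {
  if (!buf || !buf->data) {
    return;
  }
  ::operator delete(buf->data);
  buf->data = nullptr;
}
```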

@awni awni force-pushed the async_cuda_malloc branch 8 times, most recently from c77269b to c4c8690 Compare November 5, 2025 22:12
@awni awni force-pushed the async_cuda_malloc branch from c4c8690 to b741d8b Compare November 5, 2025 23:07
@awni awni merged commit df58b41 into main Nov 6, 2025
7 checks passed
@awni awni deleted the async_cuda_malloc branch November 6, 2025 00:05
@yiakwy-xpu-ml-framework-team

Hi @awni, I want to help support this framework on H100/H800 DGX machines with standard IB and NCCL support.

How can I start?

@yiakwy-xpu-ml-framework-team

@awni Is there anybody working on VMM? Maybe that can be the first topic I contribute to:

https://github.com/ruizhang1230/vTensor
