Conversation
A future improvement we can do is to use
Ok, I think this is good to go. @zcbenz, thanks for the review. I addressed your comments. The main thing was the removal of the CUDA pool, which simplified things nicely.
What do you think about using the MLX buffer cache for managed memory and the CUDA memory pool for everything else (as future work)?
Can you elaborate on this? I'm trying to understand why an extra token is needed to get an aligned size.
Worth exploring. We might also want to give the cache a similar API to make it easy to swap in a CUDA pool even for managed memory when you are on CUDA toolkit 13 and up.
It's an artifact of how we process the prompt to avoid computing the LM head for all but the last prompt token. So basically we process the prompt in two steps: the first n-1 tokens with no LM head, then the last token with the LM head.
No |
angeloskath left a comment:
This looks awesome!
I still think we need to add a check in `CudaAllocator::malloc_impl` for 0-sized buffers and return early with `{nullptr, 0, -1}`.
Also the corresponding check in `CudaAllocator::free` should be `if (!buf || !buf->data)`. Or the latter can be moved into `CudaAllocator::cuda_free`; dealer's choice.
Hi @awni, I want to help support this framework on H100/H800 DGX machines with standard IB and NCCL support. How can I get started?
@awni Is anybody working on VMM? Maybe this can be the first topic I contribute to:
This is an attempt to reduce the use of managed memory in our CUDA back-end. There are still a few details to figure out and ideas to explore, but for the common use cases of LLM training and inference it works quite well.
Some key differences:

- `malloc` is still the same as before. It gives you managed memory and caches it in the buffer cache.
- `cu::malloc_async` gives you async-allocated, non-managed memory and caches it in the buffer cache.
- `eval_gpu` should always prefer `cu::malloc_async(size, stream);` to get the good memory.
- `gpu_ptr<T>(array)`: `arr.data<T>()` can do a copy if the data is not in the right place, so don't access it unless you need the data accessible to the CPU.

Some improvements to think about:
- All allocations still go in the buffer cache. When doing `malloc_async` or `malloc` we first check the buffer cache. It might be good to prefer the appropriate type of memory (managed or not) when pulling from the cache.
- When doing the copy to managed memory, we rely on the fact that it does a device synchronize. This can be slow, so it may be good to make it async when possible.
Benchmarks
Command:
DGX Spark
Pre: Averages: prompt_tps=744.025, generation_tps=10.301, peak_memory=21.707
Post: Averages: prompt_tps=2093.590, generation_tps=14.553, peak_memory=21.707
B200
Pre: Averages: prompt_tps=4780.945, generation_tps=207.900, peak_memory=21.707
Post: Averages: prompt_tps=37887.799, generation_tps=217.166, peak_memory=21.707
Note: the prompt length of 2049 is chosen to get an aligned size (2048) for prompt processing; otherwise it's quite a bit slower.