Question about the recommended local setup for H100 #30
Hi, thank you for open-sourcing this project.
I’m trying to set up a local environment on an H100 machine, and I wanted to ask about the recommended dependency combination for the default training workflow.
While reading the repo, I saw that:
- the README suggests `torch==2.5.1` with `cu124` for Ampere/Hopper
- the default config seems to use `vllm` rollout
- `setup.py` allows `vllm>=0.8.5,<=0.12.0` and requires `numpy<2.0.0`
- `requirements.txt` pins `numpy==2.1.0`
- the helper install script pins `vllm==0.11.0`
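To make the mismatch concrete, here is a minimal sketch (my own illustration, not code from the repo) that checks the pins quoted above against the `setup.py` ranges. The `parse` helper is hypothetical and only handles plain numeric versions; a real check would more likely use a proper version-parsing library.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '2.1.0' into (2, 1, 0); handles plain numeric versions only."""
    return tuple(int(part) for part in version.split("."))


# Constraints as quoted from the repo files in the list above.
numpy_pin = parse("2.1.0")    # requirements.txt: numpy==2.1.0
numpy_upper = parse("2.0.0")  # setup.py: numpy<2.0.0 (exclusive bound)
vllm_pin = parse("0.11.0")    # helper install script: vllm==0.11.0
vllm_low = parse("0.8.5")     # setup.py: vllm>=0.8.5
vllm_high = parse("0.12.0")   # setup.py: vllm<=0.12.0

# Tuples compare element by element, which matches numeric version ordering.
numpy_ok = numpy_pin < numpy_upper
vllm_ok = vllm_low <= vllm_pin <= vllm_high

print(f"numpy==2.1.0 satisfies numpy<2.0.0: {numpy_ok}")
print(f"vllm==0.11.0 satisfies vllm>=0.8.5,<=0.12.0: {vllm_ok}")
```

If I read the files correctly, the `vllm` pin fits inside the `setup.py` range, but the `numpy` pin in `requirements.txt` does not satisfy the `setup.py` bound, so the two files cannot both be honored as written.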
I may be misunderstanding the intended setup, but it seems possible to end up with a local torch 2.5.1 + cu124 environment first, and then hit version changes again once vllm is needed by the default rollout path.
Could you please clarify what the recommended local setup is for H100, especially for the default PPO/SDPO training path?
In particular, it would be very helpful to know the suggested versions for:
- `torch`
- CUDA
- `numpy`
- `vllm`
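In case it helps when comparing environments, this is a small sketch of how the installed versions could be reported. It uses only the standard library's `importlib.metadata`, plus `torch.version.cuda` when `torch` happens to be importable. The `installed_versions` function name is my own and is not part of the repo.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["torch", "numpy", "vllm"]).items():
        print(f"{name}: {version or 'not installed'}")

    # The CUDA version torch was built against; skipped if torch is missing.
    try:
        import torch
        print(f"CUDA (torch build): {torch.version.cuda}")
    except ImportError:
        print("CUDA (torch build): torch not importable")
```

Running this before and after installing `vllm` should show whether that step replaces the `torch` or `numpy` versions that were installed first.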
Also, which source should I follow as the main reference for local installation: the README, setup.py, requirements.txt, or the helper install script?
Thanks a lot for your help.