Question about the recommended local setup for H100

Hi, thank you for open-sourcing this project.

I’m trying to set up a local environment on an H100 machine, and I wanted to ask about the recommended dependency combination for the default training workflow.

While reading the repo, I saw that:

- the README suggests `torch==2.5.1` with `cu124` for Ampere/Hopper
- the default config seems to use `vllm` rollout
- `setup.py` allows `vllm>=0.8.5,<=0.12.0` and requires `numpy<2.0.0`
- `requirements.txt` pins `numpy==2.1.0`
- the helper install script pins `vllm==0.11.0`

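I may be misunderstanding the intended setup, but two things look inconsistent to me. First, the `numpy` constraints conflict: `setup.py` requires `numpy<2.0.0`, while `requirements.txt` pins `numpy==2.1.0`. Second, it seems possible to set up a local `torch 2.5.1 + cu124` environment by following the README and then have the `torch` version changed again once `vllm` is installed for the default rollout path.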
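As a quick sanity check on the constraints listed above, here is a minimal sketch that compares the pinned versions against the ranges. The `satisfies` helper is hypothetical and only handles plain `X.Y.Z` versions with the operators shown; it is not how `pip` resolves dependencies, just a way to make the mismatch explicit.

```python
import operator

# Supported comparison operators; two-character symbols are matched first.
OPS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    "<": operator.lt,
    ">": operator.gt,
}


def parse_version(text):
    """Turn '2.1.0' into (2, 1, 0). Assumes purely numeric components."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, spec):
    """Return True if `version` meets every comma-separated clause in `spec`."""
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol in ("==", ">=", "<=", "<", ">"):
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                if not OPS[symbol](parse_version(version), bound):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


# Values copied from the files mentioned in this issue.
print("numpy 2.1.0 vs setup.py (<2.0.0):", satisfies("2.1.0", "<2.0.0"))
print("vllm 0.11.0 vs setup.py (>=0.8.5,<=0.12.0):", satisfies("0.11.0", ">=0.8.5,<=0.12.0"))
```

With these inputs, the `numpy` pin from `requirements.txt` falls outside the range in `setup.py`, while the `vllm` pin from the helper script falls inside it.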

Could you please clarify what the recommended local setup is for H100, especially for the default PPO/SDPO training path?

In particular, it would be very helpful to know the suggested versions for:
- `torch`
- `CUDA`
- `numpy`
- `vllm`
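
In case it helps with the answer, this is a minimal sketch of how I would report what is actually installed in the environment. The helper names `installed_versions` and `torch_cuda_version` are my own; the script only relies on the standard-library `importlib.metadata` and, if `torch` happens to be importable, on `torch.version.cuda`.

```python
from importlib import metadata


def installed_versions(packages=("torch", "numpy", "vllm")):
    """Map each package name to its installed version, or None if missing."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


def torch_cuda_version():
    """Return the CUDA version torch was built against, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
    print(f"CUDA (torch build): {torch_cuda_version() or 'unknown'}")
```

I can post the output of this from my H100 machine if that is useful.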

Also, which source should I follow as the main reference for local installation: the README, `setup.py`, `requirements.txt`, or the helper install script?

Thanks a lot for your help.

Question about the recommended local setup for H100 #30
