Question about the recommended local setup for H100 #30

@Doow501

Hi, thank you for open-sourcing this project.

I’m trying to set up a local environment on an H100 machine, and I wanted to ask about the recommended dependency combination for the default training workflow.

While reading the repo, I saw that:

  • the README suggests torch==2.5.1 with cu124 for Ampere/Hopper
  • the default config seems to use vllm rollout
  • setup.py allows vllm>=0.8.5,<=0.12.0 and requires numpy<2.0.0
  • requirements.txt pins numpy==2.1.0
  • the helper install script pins vllm==0.11.0
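
To make the comparison concrete, here is a minimal sketch that checks the pins quoted above against the setup.py ranges. It only handles plain numeric versions and the basic comparison operators, and the helper names `parse_version` and `satisfies` are my own for illustration, not anything from the repo.

```python
import operator

# Comparison operators supported by this simplified checker.
_OPS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text):
    """Turn '2.1.0' into (2, 1, 0); only plain numeric versions are handled."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, spec):
    """Return True if `version` meets every comma-separated clause in `spec`."""
    for clause in spec.split(","):
        clause = clause.strip()
        # Try two-character operators first so '>=' is not read as '>'.
        for symbol in (">=", "<=", "==", ">", "<"):
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                if not _OPS[symbol](parse_version(version), bound):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


# Constraints as quoted in the list above.
print("numpy==2.1.0 fits setup.py:", satisfies("2.1.0", "<2.0.0"))
print("vllm==0.11.0 fits setup.py:", satisfies("0.11.0", ">=0.8.5,<=0.12.0"))
```

If I read the constraints correctly, the numpy pin in requirements.txt falls outside the setup.py range, while the vllm pin in the helper script falls inside it.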

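I may be misunderstanding the intended setup, but these sources seem to conflict. setup.py requires numpy<2.0.0 while requirements.txt pins numpy==2.1.0, so both cannot be satisfied at once. It also seems possible to follow the README, end up with a local torch 2.5.1 + cu124 environment, and then have those versions change again once vllm is installed for the default rollout path.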

Could you please clarify what the recommended local setup is for H100, especially for the default PPO/SDPO training path?

In particular, it would be very helpful to know the suggested versions for:

  • torch
  • CUDA
  • numpy
  • vllm

Also, which source should I follow as the main reference for local installation: the README, setup.py, requirements.txt, or the helper install script?

Thanks a lot for your help.
