
Resource needed for multi-node training #106 #66



Open
Some-random opened this issue May 23, 2025 · 1 comment

Comments

@Some-random

How many GPUs/nodes do we need to train the 32B model? We have tried two 8×H100 nodes (80 GB per GPU), but we're getting OOM errors.
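
For context, a rough back-of-envelope for model states under standard mixed-precision training (bf16 weights and gradients, fp32 Adam master weights and moments; these assumptions are mine, not taken from this repo):

$$
32\text{B params} \times \underbrace{(2 + 2 + 4 + 4 + 4)}_{\text{weights, grads, fp32 master, Adam } m,\, v}\ \text{bytes} \approx 512\ \text{GB}
$$

Spread over 16 GPUs, that is roughly 32 GB per GPU for model states alone, before activations, the KV cache, and the rollout engine's separate copy of the weights. So 2×8×H100 (1280 GB total) is tight unless sharding and offloading are tuned carefully.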

@AnselCmy
Collaborator

Could you please share your training configuration parameters? To address the OOM errors, there are a few potential solutions (see the sketch after this list):

  1. Increase the tensor model parallel size (rollout.tensor_model_parallel_size) to shard the model across more GPUs
  2. Decrease the training batch size to reduce memory usage
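
A minimal sketch of both mitigations as Hydra-style CLI overrides. The exact config paths (`actor_rollout_ref.rollout.tensor_model_parallel_size`, `data.train_batch_size`) and the entry point are assumptions based on verl's layout, not confirmed against this repo, so check the example run scripts for the real names:

```bash
# Sketch only: config keys and entry point assumed from verl's layout.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
    data.train_batch_size=128
```

Raising the rollout TP size to 8 shards the inference copy of the 32B model across all GPUs of a node instead of replicating it, which is usually the bigger memory win; shrinking the batch size then bounds activation memory during the training update.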
