Fix DreamZero full finetune h100x8 checkpoint disk overflow#374

Open
vertix wants to merge 1 commit into Positronic-Robotics:main from vertix:dreamzero-training
Conversation

Contributor

@vertix vertix commented Mar 24, 2026

Summary

  • Make save_total_limit configurable (default 10, overridable per preset)
  • Use +training_args.save_only_model=true for the full finetune h100x8 presets — DeepSpeed checkpoints are ~200 GB each (model weights plus optimizer state sharded across 8 ranks), so retaining 10 of them (~2 TB) exceeds available disk
  • LoRA h100x8 presets are unchanged — LoRA checkpoints are small enough to keep full DeepSpeed saves
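The preset logic described above can be sketched as a small helper. This is an illustrative sketch, not the repo's actual code: the function name, the preset-name matching, and the dict-based args are assumptions; only the two setting names (save_total_limit, save_only_model) come from the PR.

```python
def checkpoint_args(preset: str, save_total_limit: int = 10) -> dict:
    """Hypothetical helper mirroring the PR's preset behavior.

    Full finetune on 8 GPUs saves only model weights, because DeepSpeed
    optimizer state makes each checkpoint ~200 GB; LoRA and single-GPU
    presets keep full checkpoints. save_total_limit defaults to 10 but
    is overridable per preset.
    """
    args = {"save_total_limit": save_total_limit}
    if "full" in preset and "h100x8" in preset:
        # ~200 GB x 10 retained checkpoints would be ~2 TB of disk;
        # dropping optimizer state keeps only the model weights.
        args["save_only_model"] = True
    return args


# Full finetune 8-GPU preset drops optimizer state:
full_args = checkpoint_args("full_finetune_h100x8")
# LoRA preset keeps full DeepSpeed checkpoints:
lora_args = checkpoint_args("lora_h100x8")
```

Note that save_only_model checkpoints cannot resume optimizer state; the trade-off here is disk space over mid-run resumability.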

Test plan

  • Tested wan2.2 full finetune on 8×H100: training runs past checkpoint saves without disk-full crash
  • Inference tested with checkpoint-2000 from w22f8_230326 — server loads and serves correctly

…h100x8

Full finetune DeepSpeed checkpoints are ~200 GB each on 8 GPUs. With
save_total_limit=10 this exceeds available disk space. Use
save_only_model=true for full h100x8 presets (as the h100x1 presets
already do) to save only model weights.