Fix DreamZero full finetune h100x8 checkpoint disk overflow by vertix · Pull Request #374 · Positronic-Robotics/positronic

vertix · 2026-03-24T08:03:06Z

Summary

Make save_total_limit configurable (default 10, overridable per preset).
Use +training_args.save_only_model=true for full finetune h100x8 presets. Full DeepSpeed checkpoints are ~200GB each (model + optimizer state across 8 ranks), so retaining 10 of them under save_total_limit=10 needs roughly 2TB and exceeds the available disk.
LoRA h100x8 presets are unchanged: LoRA checkpoints are small enough for full DeepSpeed saves.
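
The disk pressure described above is simple arithmetic, and a minimal sketch makes it concrete. The ~200GB checkpoint size and the default save_total_limit of 10 come from this PR; the helper names and the 1000GB disk size are hypothetical, chosen only for illustration, and the estimate ignores any transient extra space used while a new checkpoint is being written.

```python
def checkpoint_disk_gb(size_per_checkpoint_gb: float, save_total_limit: int) -> float:
    """Steady-state disk used when `save_total_limit` checkpoints are retained."""
    if save_total_limit < 1:
        raise ValueError("save_total_limit must be at least 1")
    return size_per_checkpoint_gb * save_total_limit


def max_checkpoints_that_fit(disk_gb: float, size_per_checkpoint_gb: float) -> int:
    """Largest number of retained checkpoints that fits on a disk of `disk_gb`."""
    return int(disk_gb // size_per_checkpoint_gb)


# Figures stated in the PR description (approximate).
FULL_DEEPSPEED_CKPT_GB = 200.0
DEFAULT_SAVE_TOTAL_LIMIT = 10

# Hypothetical disk size, NOT from the PR; substitute the real volume size.
HYPOTHETICAL_DISK_GB = 1000.0

needed = checkpoint_disk_gb(FULL_DEEPSPEED_CKPT_GB, DEFAULT_SAVE_TOTAL_LIMIT)
print(f"Full DeepSpeed saves at the default limit need about {needed:.0f} GB")

fits = max_checkpoints_that_fit(HYPOTHETICAL_DISK_GB, FULL_DEEPSPEED_CKPT_GB)
print(f"A {HYPOTHETICAL_DISK_GB:.0f} GB disk holds at most {fits} such checkpoints")
```

Either lowering save_total_limit or shrinking each checkpoint (as save_only_model=true does by dropping optimizer state) brings the product back under the disk size; this PR makes both available per preset.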

Test plan

Tested wan2.2 full finetune on 8×H100: training runs past checkpoint saves without disk-full crash
Inference tested with checkpoint-2000 from w22f8_230326 — server loads and serves correctly

…h100x8 Full finetune DeepSpeed checkpoints are ~200GB each on 8 GPUs. With save_total_limit=10 this exceeds the disk. Use save_only_model=true for full h100x8 presets (same as h100x1) to save only model weights.

Make save_total_limit configurable, use save_only_model for full …

c13d898


vertix wants to merge 1 commit into Positronic-Robotics:main from vertix:dreamzero-training
