
Conversation

@theoschiff commented on Nov 3, 2025

This PR introduces resume semantics for both Hugging Face training and Weights & Biases logging, fully controlled by config (no auto-discovery, no env hacks).

What changed

  • New config flags (top-level YAML):

    • resume_from_checkpoint: true|false
    • wandb_run_id: <string or null>
  • Training resume rule (HF Trainer):

    • If resume_from_checkpoint: true → always resume from base_model (which must point to the desired checkpoint dir or hub ref).
    • If resume_from_checkpoint: false → start fresh (no resume_from_checkpoint passed to Trainer).
  • W&B resume rule:

    • If resume_from_checkpoint: true and wandb_run_id is set → resume the exact same run via wandb.init(id=..., resume="allow").
    • If wandb_run_id is null → a new run is created (even if resuming training). Both rules are sketched in code below.
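
For concreteness, here is a minimal sketch of how the two rules compose. This is illustrative, not the PR's actual train.py: the cfg dict, the project name, and the launch() helper are assumptions.

```python
import wandb

def launch(cfg, trainer):
    """Hypothetical wrapper showing only the resume gating."""
    run_id = cfg.get("wandb_run_id")  # None/null -> W&B assigns a fresh id
    # Resume the same W&B run only when BOTH flags line up.
    wandb.init(
        project="multimeditron",  # project name is an assumption
        id=run_id,
        resume="allow" if (cfg.get("resume_from_checkpoint") and run_id) else None,
    )
    if cfg.get("resume_from_checkpoint"):
        # base_model doubles as the Trainer checkpoint path here.
        trainer.train(resume_from_checkpoint=cfg["base_model"])
    else:
        trainer.train()  # fresh start; no resume path passed to the Trainer
```

Note the asymmetry by design: the Trainer side is gated only by resume_from_checkpoint, while the W&B side additionally requires a run id, which is exactly what the behavior matrix below enumerates.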

Why

  • Previously, relaunches after failures or timeouts retrained on already-seen data and opened a new W&B run, fragmenting the logs.
  • This update makes resumes deterministic: training restarts from the user-specified checkpoint and, when wandb_run_id is set, the same W&B run continues, yielding continuous metrics without reprocessing data the model has already seen.

Behavior matrix

| resume_from_checkpoint | wandb_run_id | Trainer resumes from | W&B run behavior |
| --- | --- | --- | --- |
| false | null / unset | none (fresh) | new run |
| false | `<id>` | none (fresh) | attach to run `<id>` (no resume) |
| true | null / unset | base_model | new run |
| true | `<id>` | base_model | resume run `<id>` |

How to use

Resume training and resume the same W&B run

```yaml
base_model: /path/to/checkpoint-2582
resume_from_checkpoint: true
wandb_run_id: 92v4qfkb
```

Start fresh training and a new W&B run

```yaml
base_model: /path/to/base-or-checkpoint
resume_from_checkpoint: false
wandb_run_id: null
```

Note: base_model is used both for weight loading and, when resume_from_checkpoint: true, as the checkpoint path passed to trainer.train(resume_from_checkpoint=...).
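
To make that dual role concrete, here is a hypothetical helper (the function name and cfg layout are ours; it assumes the YAML is parsed with PyYAML):

```python
import yaml

def resolve_paths(cfg_path: str):
    """Illustrative only: derive the two uses of base_model from a config."""
    with open(cfg_path) as f:
        cfg = yaml.safe_load(f)
    weights_source = cfg["base_model"]  # always: where weights are loaded from
    resume_target = (
        cfg["base_model"] if cfg.get("resume_from_checkpoint") else None
    )                                   # only set when resuming
    return weights_source, resume_target
```

Under this rule a fresh run never passes a resume path, which matches the behavior matrix above.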

Backward compatibility

  • Existing configs continue to work (default is fresh training if resume_from_checkpoint is absent or false).
  • Users migrating to this behavior should set resume_from_checkpoint: true and wandb_run_id: <existing-id> to continue a previous run.

Risks & mitigations

  • If base_model doesn’t point to a valid Trainer checkpoint, HF Trainer raises an error; this is intentional, to prevent silent mis-resumes. A pre-flight check is sketched below.
  • wandb_run_id must be correct; if it doesn’t match an existing run, W&B simply starts a new one (expected behavior).
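
If you want the failure even earlier than trainer.train(), a small pre-flight guard is possible. The helper name is ours; the trainer_state.json marker is what HF Trainer writes into every checkpoint directory:

```python
import os

def assert_valid_checkpoint(path: str) -> None:
    """Hypothetical guard; HF Trainer would raise on its own later anyway."""
    marker = os.path.join(path, "trainer_state.json")
    if not os.path.isfile(marker):
        raise ValueError(
            f"{path} is not a Trainer checkpoint (missing trainer_state.json); "
            "refusing to resume."
        )
```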

Files changed

  • src/multimeditron/train/train.py
  • All 12 config files across alignment, end2end, and full

@MichelDucartier merged commit adbadb7 into master on Dec 4, 2025 (1 check failed)
@MichelDucartier deleted the restart-from-checkpoint branch on Dec 4, 2025 at 16:13