
Conversation

@theoschiff commented on Nov 3, 2025

This PR introduces resume semantics for both Hugging Face training and Weights & Biases logging, fully controlled by config (no auto-discovery, no env hacks).

What changed

  • New config flags (top-level YAML):

    • resume_from_checkpoint: true|false
    • wandb_run_id: <string or null>
  • Training resume rule (HF Trainer):

    • If resume_from_checkpoint: true → always resume from base_model (which must point to the desired checkpoint dir or hub ref).
    • If resume_from_checkpoint: false → start fresh (no resume_from_checkpoint passed to Trainer).
  • W&B resume rule:

    • If resume_from_checkpoint: true and wandb_run_id is set → resume the exact same run via wandb.init(id=..., resume="allow").
    • If wandb_run_id is null → a new run is created (even if resuming training). Both rules are sketched in code below.
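
For concreteness, here is a minimal sketch of how the two rules compose. This is illustrative, not the PR's actual train.py: the cfg dict, the project name, and the launch() helper are assumptions.

```python
import wandb

def launch(cfg, trainer):
    """Hypothetical wrapper showing only the resume gating."""
    run_id = cfg.get("wandb_run_id")  # None/null -> W&B assigns a fresh id
    # Resume the same W&B run only when BOTH flags line up.
    wandb.init(
        project="multimeditron",  # project name is an assumption
        id=run_id,
        resume="allow" if (cfg.get("resume_from_checkpoint") and run_id) else None,
    )
    if cfg.get("resume_from_checkpoint"):
        # base_model doubles as the Trainer checkpoint path here.
        trainer.train(resume_from_checkpoint=cfg["base_model"])
    else:
        trainer.train()  # fresh start; no resume path passed to the Trainer
```

Note the asymmetry by design: the Trainer side is gated only by resume_from_checkpoint, while the W&B side additionally requires a run id, which is exactly what the behavior matrix below enumerates.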

Why

  • Previously, relaunches after failures or timeouts retrained on already-seen data and opened a new W&B run, fragmenting the logs.
  • This update makes resumes deterministic: training restarts from the user-specified checkpoint and, when wandb_run_id is set, the same W&B run continues, yielding continuous metrics without reprocessing data the model has already seen.

Behavior matrix

| resume_from_checkpoint | wandb_run_id | Trainer resumes from | W&B run behavior |
| --- | --- | --- | --- |
| false | null / unset | none (fresh) | new run |
| false | `<id>` | none (fresh) | attach to run `<id>` (no resume) |
| true | null / unset | base_model | new run |
| true | `<id>` | base_model | resume run `<id>` |

How to use

Resume training and resume the same W&B run

```yaml
base_model: /path/to/checkpoint-2582
resume_from_checkpoint: true
wandb_run_id: 92v4qfkb
```

Start fresh training and a new W&B run

```yaml
base_model: /path/to/base-or-checkpoint
resume_from_checkpoint: false
wandb_run_id: null
```

Note: base_model is used both for weight loading and, when resume_from_checkpoint: true, as the checkpoint path passed to trainer.train(resume_from_checkpoint=...).
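
To make that dual role concrete, here is a hypothetical helper (the function name and cfg layout are ours; it assumes the YAML is parsed with PyYAML):

```python
import yaml

def resolve_paths(cfg_path: str):
    """Illustrative only: derive the two uses of base_model from a config."""
    with open(cfg_path) as f:
        cfg = yaml.safe_load(f)
    weights_source = cfg["base_model"]  # always: where weights are loaded from
    resume_target = (
        cfg["base_model"] if cfg.get("resume_from_checkpoint") else None
    )                                   # only set when resuming
    return weights_source, resume_target
```

Under this rule a fresh run never passes a resume path, which matches the behavior matrix above.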

Backward compatibility

  • Existing configs continue to work (default is fresh training if resume_from_checkpoint is absent or false).
  • Users migrating to this behavior should set resume_from_checkpoint: true and wandb_run_id: <existing-id> to continue a previous run.

Risks & mitigations

  • If base_model doesn’t point to a valid Trainer checkpoint, HF Trainer raises an error; this is intentional, to prevent silent mis-resumes. A pre-flight check is sketched below.
  • wandb_run_id must be correct; if it doesn’t match an existing run, W&B simply starts a new one (expected behavior).
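
If you want the failure even earlier than trainer.train(), a small pre-flight guard is possible. The helper name is ours; the trainer_state.json marker is what HF Trainer writes into every checkpoint directory:

```python
import os

def assert_valid_checkpoint(path: str) -> None:
    """Hypothetical guard; HF Trainer would raise on its own later anyway."""
    marker = os.path.join(path, "trainer_state.json")
    if not os.path.isfile(marker):
        raise ValueError(
            f"{path} is not a Trainer checkpoint (missing trainer_state.json); "
            "refusing to resume."
        )
```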

Files changed

  • src/multimeditron/train/train.py
  • All 12 config files across alignment, end2end, and full

@MichelDucartier merged commit adbadb7 into master on Dec 4, 2025 (1 check failed)
@MichelDucartier deleted the restart-from-checkpoint branch on Dec 4, 2025 at 16:13