Resume training from checkpoint #23
Merged
This PR introduces resume semantics for both Hugging Face training and Weights & Biases logging, fully controlled by config (no auto-discovery, no env hacks).
What changed
New config flags (top-level YAML):
- resume_from_checkpoint: true|false
- wandb_run_id: <string or null>
Training resume rule (HF Trainer):
- resume_from_checkpoint: true → always resume from base_model (which must point to the desired checkpoint dir or hub ref).
- resume_from_checkpoint: false → start fresh (no resume_from_checkpoint passed to Trainer).
W&B resume rule:
- resume_from_checkpoint: true and wandb_run_id is set → resume the exact same run via wandb.init(id=..., resume="allow").
- wandb_run_id is null → a new run is created (even if resuming training).
Why
- If wandb_run_id is set, continue the same W&B run, yielding continuous metrics without reprocessing the same data.
Behavior matrix
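| resume_from_checkpoint | wandb_run_id | Training | W&B |
| --- | --- | --- | --- |
| false | null | fresh start | new run |
| false | <id> | fresh start | <id> (no resume) |
| true | null | resume from base_model | new run |
| true | <id> | resume from base_model | resume run <id> |
How to use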
Resume training and resume the same W&B run
Start fresh training and a new W&B run
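The two cases above, together with the resume rules, can be sketched in Python as follows. This is a minimal illustration, not the code in train.py: the helper name resolve_resume, the dict-based config, and the placeholder values path/to/checkpoint, some/base-model and abc123 are all assumptions made for the example. The sketch also assumes that wandb_run_id is ignored when resume_from_checkpoint is false, which the rules do not state explicitly.

```python
from typing import Any, Dict, Optional


def resolve_resume(config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: map the two config flags to Trainer / W&B kwargs."""
    resume = bool(config.get("resume_from_checkpoint", False))
    run_id: Optional[str] = config.get("wandb_run_id")

    # Training: pass base_model as the checkpoint only when the flag is true;
    # otherwise nothing is passed and Trainer.train() starts fresh.
    train_kwargs: Dict[str, Any] = {}
    if resume:
        train_kwargs["resume_from_checkpoint"] = config["base_model"]

    # W&B: continue the same run only when resuming AND an id is given.
    # Assumption: an id without the resume flag is ignored (new run).
    wandb_kwargs: Dict[str, Any] = {}
    if resume and run_id is not None:
        wandb_kwargs = {"id": run_id, "resume": "allow"}

    return {"train_kwargs": train_kwargs, "wandb_kwargs": wandb_kwargs}


# Resume training and resume the same W&B run (placeholder values).
resumed = resolve_resume({
    "base_model": "path/to/checkpoint",
    "resume_from_checkpoint": True,
    "wandb_run_id": "abc123",
})

# Start fresh training and a new W&B run: both flags absent.
fresh = resolve_resume({"base_model": "some/base-model"})
```

In a training script the two dicts would typically be unpacked into trainer.train(**train_kwargs) and wandb.init(**wandb_kwargs); empty dicts leave both libraries on their default (fresh) behavior.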
Backward compatibility
- Behavior is unchanged when resume_from_checkpoint is absent or false.
- Set resume_from_checkpoint: true and wandb_run_id: <existing-id> to continue a previous run.
Risks & mitigations
- If base_model doesn’t point to a valid Trainer checkpoint, HF Trainer will raise; this is intentional to prevent silent mis-resumes.
- wandb_run_id must be correct; otherwise W&B will start a new run (expected).
Files changed
- src/multimeditron/train/train.py
- alignment, end2end and full
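For the first risk listed under "Risks & mitigations", a local pre-flight check could look roughly like the sketch below. It is not part of this PR: the function name looks_like_trainer_checkpoint is hypothetical, and the check relies only on the fact that HF Trainer writes a trainer_state.json file into each checkpoint directory it saves. It covers local directories only; a hub ref would need different handling.

```python
import os
import tempfile


def looks_like_trainer_checkpoint(path: str) -> bool:
    """Hypothetical check: a local dir that contains trainer_state.json."""
    return os.path.isdir(path) and os.path.isfile(
        os.path.join(path, "trainer_state.json")
    )


# Demonstration on a temporary directory (no real checkpoint involved).
with tempfile.TemporaryDirectory() as tmp:
    # An empty directory is not accepted as a checkpoint.
    before = looks_like_trainer_checkpoint(tmp)

    # Create an empty marker file with the name Trainer uses.
    open(os.path.join(tmp, "trainer_state.json"), "w").close()
    after = looks_like_trainer_checkpoint(tmp)
```

Such a check only gives an earlier, friendlier error; Trainer itself remains the authority on whether the checkpoint can actually be loaded.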