Unable to Resume Training from Checkpoints – Possible Issue in Checkpoint Loading Logic #160
Description
Hi, thanks for your amazing work on Search-R1!
I am facing an issue when trying to resume training from saved checkpoints. During training, checkpoints are successfully generated under the checkpoint-* directories. However, when I restart the training script after an interruption, the model does not properly load the previous checkpoint and instead initializes some weights from scratch.
Here is the warning message I consistently get:
(WorkerDict pid=3488384) Some weights of Qwen2ForTokenClassification were not initialized from the model checkpoint at /data/USTC_Kevin/yantai/search_r1_project/models/Qwen2.5-1.5B-Instruct and are newly initialized: ['score.bias'] [repeated 2x across cluster]
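This warning typically means the checkpoint's state dict lacks keys that the instantiated model class expects, so those parameters get fresh random values. Here, Qwen2ForTokenClassification adds a classification head (score.*) that a checkpoint saved from the plain instruct model would not contain. A minimal sketch of this diagnosis, using illustrative key names rather than Search-R1's actual loading code:

```python
# Hypothetical diagnostic: compare the parameter names the model class
# expects against the keys present in a checkpoint's state dict. Any
# expected key missing from the checkpoint is the kind of parameter
# HF transformers reports as "newly initialized".
# Key names below are illustrative, not read from a real checkpoint.

def find_missing_keys(expected_keys, checkpoint_keys):
    """Return expected parameter names absent from the checkpoint."""
    return sorted(set(expected_keys) - set(checkpoint_keys))

# Qwen2ForTokenClassification expects head parameters like 'score.bias',
# but a checkpoint from the base/instruct model may not include them.
expected = ["model.embed_tokens.weight", "score.weight", "score.bias"]
in_checkpoint = ["model.embed_tokens.weight", "score.weight"]

missing = find_missing_keys(expected, in_checkpoint)
print(missing)  # -> ['score.bias']  (would be randomly initialized)
```

If the missing keys are only the task head (as in the warning above), the base weights did load correctly; whether that is expected depends on whether the head is meant to be trained from scratch or restored from a prior run.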
🙏 My Question
Could you please help confirm:
1. Which file/function in the repository handles loading existing checkpoints when training restarts?
2. Is there any additional flag or configuration required to ensure resume-from-checkpoint works properly?
3. Why are some parameters, such as score.bias, newly initialized instead of loaded? Does this indicate a mismatch between the model head and the saved state dict?