
Unable to Resume Training from Checkpoints – Possible Issue in Checkpoint Loading Logic #160

@xytsakura

Description


Hi, thanks for your amazing work on Search-R1!

I am facing an issue when trying to resume training from saved checkpoints. During training, the checkpoints are successfully generated under the checkpoint-* directories. However, when I restart the training script after interruption, the model does not properly load the previous checkpoint and instead initializes some weights from scratch.

Here is the warning message I consistently get:

```
(WorkerDict pid=3488384) Some weights of Qwen2ForTokenClassification were not initialized from the model checkpoint at /data/USTC_Kevin/yantai/search_r1_project/models/Qwen2.5-1.5B-Instruct and are newly initialized: ['score.bias'] [repeated 2x across cluster]
```

🙏 My Questions

Could you please help confirm:

1. Which file/function in the repository handles loading existing checkpoints when training restarts?

2. Is there any additional flag or configuration required for resume-from-checkpoint to work properly?

3. Why would some parameters, like `score.bias`, be initialized instead of loaded? Does this indicate a mismatch between the model head and the saved state dict?
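Regarding questions 1 and 2, I am not sure which file in this repo implements it, but a common resume pattern in training frameworks is to scan the output directory for the highest-numbered `checkpoint-*` subdirectory and restart from there. A generic sketch of that pattern (not Search-R1's actual code; the function name is my own):

```python
import os
import re


def find_latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-* subdirectory,
    or None if there is no checkpoint to resume from.

    Generic illustration of a common resume-from-checkpoint pattern,
    not Search-R1's actual loading logic.
    """
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(output_dir, name)
    return best_path
```

If the training script never calls something like this (or a config flag disabling it is set), it would always start from the base model path instead of the latest checkpoint.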
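On question 3: as I understand it, loaders emit the "newly initialized" warning when the model architecture defines a parameter that the saved state dict does not contain, so that parameter falls back to fresh initialization. A minimal sketch of that key comparison (the key names are hypothetical examples, not the actual Qwen2/Search-R1 state dict):

```python
# Hypothetical checkpoint contents: the classification head's weight was
# saved, but its bias was not.
checkpoint_keys = {
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "score.weight",
}

# Parameters the model architecture defines, including the head bias.
model_keys = checkpoint_keys | {"score.bias"}

# Anything the model expects but the checkpoint lacks gets re-initialized
# from scratch instead of loaded -- which matches the warning above.
newly_initialized = sorted(model_keys - checkpoint_keys)
print("newly initialized:", newly_initialized)  # → ['score.bias']
```

So a `score.bias` in this list would indeed suggest a head/state-dict mismatch, e.g. the checkpoint was saved from a model whose scoring head had no bias (or a different head class) than the one being constructed on restart.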
