How to reproduce the same training process when using "train_from" #2006

@lemon234071

Dear,

When model training was interrupted unexpectedly, I used the "train_from" option to continue training from the checkpoint. However, the result differs from a run that trains from start to finish without stopping:

  1. The patience counter for "early stop" is not saved in the checkpoint.
  2. The order of the data batches provided by train_iter is different when using "train_from". (On resume, the iterator starts over from the beginning of the dataset, so the batches differ from those that would have followed the step at which the checkpoint was saved.)

Note that I fixed all random seeds.

It would be very convenient if a reproduction mechanism could be added to the code base.
Any help would be greatly appreciated.
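
To make the request concrete, here is a minimal sketch of the two pieces of state that would need to survive a restart. It is plain Python and does not use the library's actual classes: `EarlyStopper`, `batches`, and the `start_step` parameter are hypothetical names chosen for illustration. The idea is that the early-stop counter is stored alongside the checkpoint, and that the batch order depends only on the seed and the step number, so a resumed run can jump to the same position in the data.

```python
import random


class EarlyStopper:
    """Hypothetical early-stop tracker whose state can be checkpointed."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_steps = 0  # validations since the last improvement

    def update(self, val_loss):
        """Record a validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
        return self.bad_steps >= self.patience

    def state_dict(self):
        # This dict would be stored next to the model weights.
        return {"best": self.best, "bad_steps": self.bad_steps}

    def load_state_dict(self, state):
        self.best = state["best"]
        self.bad_steps = state["bad_steps"]


def batches(data, batch_size, seed, start_step=0):
    """Yield (step, batch) forever.

    The shuffle for each epoch is derived from (seed, epoch) only, so
    starting at `start_step` reproduces exactly the batches that an
    uninterrupted run would have produced from that step onward.
    """
    per_epoch = (len(data) + batch_size - 1) // batch_size
    step = start_step
    while True:
        epoch, offset = divmod(step, per_epoch)
        order = list(data)
        random.Random(seed + epoch).shuffle(order)
        # Skip the batches of this epoch that were already consumed.
        for i in range(offset, per_epoch):
            yield step, order[i * batch_size:(i + 1) * batch_size]
            step += 1
```

With something like this, a checkpoint would hold the step number and `stopper.state_dict()` in addition to the model and optimizer state, and resuming would call `batches(..., start_step=saved_step)` after `load_state_dict`. Recomputing the shuffle per epoch is a simplification; a real data pipeline with bucketing or sharding would need its own way to restore the iterator position.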
