Since the LR scheduler state and the grad scaler state are not stored with the rest of the training run data, they are re-initialized on resume, which leads to significantly different results than if training had never been interrupted.
NB: There is also a typo: "optimizer_state_dit" should probably be "optimizer_state_dict".
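A minimal sketch of what a complete checkpoint could look like, assuming a typical PyTorch setup (the model, optimizer, scheduler, and file name here are placeholders, not the project's actual code): the scheduler and scaler state dicts are saved alongside the model and optimizer, so resuming does not re-initialize them.

```python
import torch

# Hypothetical stand-ins for the real training objects.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
scaler = torch.cuda.amp.GradScaler(enabled=False)  # enabled=False so it runs on CPU

# Simulate a few training steps that advance the scheduler state.
for _ in range(3):
    optimizer.step()
    scheduler.step()

# Save ALL mutable training state, not just model/optimizer.
checkpoint = {
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),  # note the correct spelling
    "scheduler_state_dict": scheduler.state_dict(),
    "scaler_state_dict": scaler.state_dict(),
}
torch.save(checkpoint, "ckpt.pt")

# On resume: rebuild the objects, then restore their saved state.
ckpt = torch.load("ckpt.pt")
model.load_state_dict(ckpt["model_state_dict"])
optimizer.load_state_dict(ckpt["optimizer_state_dict"])
scheduler.load_state_dict(ckpt["scheduler_state_dict"])
scaler.load_state_dict(ckpt["scaler_state_dict"])
```

With this roundtrip the scheduler resumes at the step count it had reached (and the learning rate it had decayed to), rather than restarting its schedule from zero.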