In PR #281 we are stopping training with a message to the user if three consecutive NaNs are encountered: https://github.com/skinniderlab/CLM/pull/281/files#diff-40d1257aa23fce729a527b8214cde30f9c11062479becfb68cc00d20e0e91df9R42
Would it be more appropriate to stop training altogether and remove the checkpoint, given that something has probably gone quite wrong for three NaN losses in a row to occur?
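
To make the proposal concrete, here is a rough sketch of what "stop and remove the checkpoint" could look like. It is not the code from the PR: `NaNLossGuard`, `abort_training`, the `patience` parameter, and the single-file checkpoint path are all hypothetical names and assumptions chosen for illustration.

```python
import math
import os


class NaNLossGuard:
    """Counts consecutive NaN losses (hypothetical helper, not part of CLM)."""

    def __init__(self, patience=3):
        # Assumption: three NaN losses in a row is the threshold, as in the PR.
        self.patience = patience
        self.consecutive_nans = 0

    def should_abort(self, loss):
        # Any non-NaN loss resets the streak; only back-to-back NaNs count.
        if math.isnan(loss):
            self.consecutive_nans += 1
        else:
            self.consecutive_nans = 0
        return self.consecutive_nans >= self.patience


def abort_training(checkpoint_path):
    """Delete the checkpoint file if it exists, then raise to halt training."""
    # Assumption: the checkpoint is a single file on disk.
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)
    raise RuntimeError(
        "Training aborted after repeated NaN losses; "
        f"checkpoint removed: {checkpoint_path}"
    )
```

In a training loop this would be called as `if guard.should_abort(loss_value): abort_training(path)`, where `loss_value` is a plain float. Raising an exception rather than returning early means the run fails loudly and no downstream step picks up a stale or corrupted checkpoint, though whether deleting the file is desirable is exactly the question above.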