Stop with error when loss is NaN? #284

PR #281 stops training with a message to the user when three consecutive NaN losses are encountered: https://github.com/skinniderlab/CLM/pull/281/files#diff-40d1257aa23fce729a527b8214cde30f9c11062479becfb68cc00d20e0e91df9R42
Would it be more appropriate to stop training altogether with an error and remove the checkpoint? Three NaN losses in a row suggest that something has gone quite wrong.
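
To make the proposal concrete, here is a minimal sketch of the stricter behaviour. It is not the code from PR #281: `NaNGuard`, `NaNLossError`, and the `checkpoint_path` and `patience` parameters are hypothetical names used only for illustration, and it assumes the loss is available as a plain Python float at each step.

```python
import math
import os


class NaNLossError(RuntimeError):
    """Hypothetical error type raised when training is aborted due to NaN losses."""


class NaNGuard:
    """Counts consecutive NaN losses and aborts once `patience` is reached.

    All names here are illustrative assumptions, not part of the CLM codebase.
    """

    def __init__(self, checkpoint_path, patience=3):
        self.checkpoint_path = checkpoint_path
        self.patience = patience
        self.consecutive = 0

    def check(self, loss):
        # A finite loss resets the streak; only consecutive NaNs count.
        if math.isnan(loss):
            self.consecutive += 1
        else:
            self.consecutive = 0

        if self.consecutive >= self.patience:
            # Remove the checkpoint so a broken model is not reused downstream.
            if os.path.exists(self.checkpoint_path):
                os.remove(self.checkpoint_path)
            raise NaNLossError(
                f"{self.consecutive} consecutive NaN losses; "
                f"training stopped and checkpoint removed: {self.checkpoint_path}"
            )
```

In a training loop this would be called once per step, for example `guard.check(float(loss))`. Raising an exception rather than printing a message gives the process a non-zero exit status, so a workflow manager would not carry on to downstream steps with a missing or invalid checkpoint.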
