Ideas to debug NaN #636
-
It's quite difficult to debug NaNs in general, especially when they show up late in training. Some frameworks have a debug mode that checks the output of every computation and throws if a NaN is encountered. In theory we could do something like that, and it's probably better than nothing. But in practice it's still quite tricky to pinpoint exactly where in the computation a NaN occurred, because mapping the low-level operation back to a high-level one is not so straightforward. For example, if it was an operation in the gradient computation, it might be hard to know exactly where that operation came from and what to do with that information. I am curious to know more about the NaNs you are seeing. Usually with exploding gradients you will see the gradient spike and the loss diverge before you see actual NaNs. Other types of NaNs are not really expected in typical Transformer training (though that somewhat depends on the precision), so it could be a low-level bug.
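The debug-mode idea described above can be sketched in a framework-agnostic way. This is only an illustration in plain Python, not an MLX feature: `run_checked`, `NonFiniteError`, and the three-stage pipeline are hypothetical names made up for the example, and values are plain lists of floats. With MLX arrays, the per-stage check could presumably be swapped for something like `mx.isnan`.

```python
import math


class NonFiniteError(FloatingPointError):
    """Raised when a stage produces a NaN or an infinity (hypothetical helper)."""


def run_checked(stages, x):
    """Run (name, fn) stages in order and check every output.

    Raises NonFiniteError naming the first stage whose output
    contains a NaN or an infinity.
    """
    for name, fn in stages:
        x = fn(x)
        bad = [i for i, v in enumerate(x) if not math.isfinite(v)]
        if bad:
            raise NonFiniteError(f"stage {name!r}: non-finite at indices {bad}")
    return x


# Hypothetical pipeline: "square" overflows to inf. Without the check,
# "center" would then compute inf - inf and the NaN would only surface there.
stages = [
    ("scale", lambda xs: [v * 1e200 for v in xs]),
    ("square", lambda xs: [v * v for v in xs]),
    ("center", lambda xs: [v - max(xs) for v in xs]),
]

try:
    run_checked(stages, [1.0, 2.0])
except NonFiniteError as err:
    message = str(err)
    print(message)
```

The check reports the stage where the value first stopped being finite, not the later stage where the NaN would become visible. As noted above, this only helps when the stages are coarse enough to map back to something meaningful in the model.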
-
I am using Gumbel noise in my MoE softmax. Because I want to gradually attenuate the noise over many training episodes, I multiplied the Gumbel noise by a gradually decaying temperature. However, the moment I did so, NaNs appeared many layers downstream. As soon as I removed the temperature, the NaNs went away. They also went away when I used the interpreted (eager) version. The phenomenon is very reproducible. It turns out to be caused by some kernel optimization during compilation. I solved that issue like so:
-
Once in a while I encounter NaNs during training with MLX. It usually has to do with exploding values, which just requires a simple fix, but the difficult part is finding out where it occurs. Sometimes the NaN gets too deep downstream and pollutes all computations. Is there a better, more efficient way to debug NaNs and trace their location, so that they are easier to fix at the source?
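To make "trace its location" concrete, here is a minimal sketch of one generic approach: walk a nested tree of values (for example parameters or gradients) and report the path of every non-finite leaf. It is plain Python on nested dicts and lists of floats, not an MLX API; `find_nonfinite` and the `grads` tree are hypothetical names invented for the illustration. With real MLX arrays, a helper such as `mlx.utils.tree_flatten` together with `mx.isnan` could likely play the same role.

```python
import math


def find_nonfinite(tree, prefix=""):
    """Return dotted paths of every NaN/inf leaf in a nested dict/list of floats."""
    hits = []
    if isinstance(tree, dict):
        items = tree.items()
    elif isinstance(tree, (list, tuple)):
        items = enumerate(tree)
    else:
        # Leaf value: record its path if it is NaN or +/-inf.
        if not math.isfinite(tree):
            hits.append(prefix)
        return hits
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        hits.extend(find_nonfinite(value, path))
    return hits


# Hypothetical gradient tree, for illustration only.
grads = {
    "embed": [0.1, -0.2],
    "layers": [
        {"attn": [0.3, 0.4], "moe_gate": [float("nan"), 0.5]},
        {"attn": [float("inf"), 0.0], "moe_gate": [0.1, 0.2]},
    ],
}

bad_paths = find_nonfinite(grads)
print(bad_paths)
```

Running a scan like this on the gradients after each step shows which parameters are affected first, which narrows the search even when the NaN has already spread further downstream.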