Ideas to debug NaN #636

tianenchong · 2025-11-30T01:30:24Z

Once in a while I encounter NaNs during training with MLX. They usually come from exploding values, which only need a simple fix, but the hard part is finding out where they occur. Sometimes the NaN propagates far downstream and pollutes all subsequent computations. Is there an efficient, better way to debug NaNs and trace their location, so that they are easier to fix at the source?

awni · 2025-12-01T20:07:45Z

Maintainer

> Is there an efficient and a better way to debug NaN, trace its location so it is easier to fix at its source?

It's quite difficult to debug NaNs in general, especially when they show up late in training. Some frameworks have a debug mode that checks the output of every computation and throws if a NaN is encountered. In theory we could do something like that and it's probably better than nothing.
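In the meantime, a check like this can be approximated by hand at a coarser granularity, for example on the gradient tree after each step. The sketch below is framework-agnostic: nested dicts and lists of plain Python floats stand in for a parameter or gradient tree, and `first_non_finite` is a hypothetical helper, not an MLX API. With real arrays, the leaf test would be swapped for an array-level finiteness check.

```python
import math


def first_non_finite(tree, path=""):
    """Return the path of the first NaN/Inf leaf in a nested dict/list, or None."""
    if isinstance(tree, dict):
        for key, value in tree.items():
            child = f"{path}.{key}" if path else str(key)
            found = first_non_finite(value, child)
            if found is not None:
                return found
        return None
    if isinstance(tree, (list, tuple)):
        for index, value in enumerate(tree):
            found = first_non_finite(value, f"{path}[{index}]")
            if found is not None:
                return found
        return None
    # Leaf: a plain number stands in for an array in this sketch.
    return None if math.isfinite(tree) else path


# Hypothetical gradient tree with one bad entry.
grads = {
    "layer0": {"weight": [0.1, -0.2], "bias": [0.0]},
    "layer1": {"weight": [0.3, float("nan")], "bias": [0.5]},
}
bad = first_non_finite(grads)
print(bad)
```

This only names the first tensor that has gone bad, not the operation that produced it, so it narrows the search rather than solving the mapping problem.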

But in practice it's still quite tricky to pinpoint exactly where in the computation a NaN occurred, because mapping a low-level operation back to a high-level one is not so straightforward. For example, if it was in an operation in the gradient computation, it might be hard to know exactly where that operation came from and what to do with that information.

I am curious to know more about the NaNs you are seeing. Usually with exploding gradients you will see the gradient spike and the loss diverge before you see actual NaNs. Other types of NaNs are not really expected in typical Transformer training (this depends somewhat on the precision), so it could be a low-level bug.
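One way to catch that spike before it turns into a NaN is to track the gradient norm against a running average. This is a minimal sketch, assuming the training loop already produces a scalar gradient norm per step; `SpikeMonitor`, its `factor` threshold, and its `decay` value are hypothetical choices for illustration, not recommended settings.

```python
import math


class SpikeMonitor:
    """Flag gradient norms that jump far above their running average."""

    def __init__(self, factor=5.0, decay=0.9):
        self.factor = factor  # hypothetical threshold multiplier
        self.decay = decay    # smoothing for the running average
        self.average = None

    def update(self, grad_norm):
        """Return True if grad_norm is non-finite or a spike; otherwise track it."""
        if not math.isfinite(grad_norm):
            return True
        if self.average is None:
            self.average = grad_norm
            return False
        if grad_norm > self.factor * self.average:
            return True  # do not fold the spike into the average
        self.average = self.decay * self.average + (1 - self.decay) * grad_norm
        return False


monitor = SpikeMonitor()
norms = [1.0, 1.1, 0.9, 1.2, 30.0, float("nan")]
flags = [monitor.update(n) for n in norms]
print(flags)
```

Saving the batch and parameters on the first flagged step gives something to replay, which is usually easier than working backwards from a NaN many steps later.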

tianenchong · 2025-12-01T21:55:04Z

Author

I am using Gumbel noise in my MoE softmax. Because I want to gradually attenuate the noise over many episodes of training, I multiplied the Gumbel noise by a gradually decaying temperature. However, the moment I did so, NaNs appeared many layers downstream. As soon as I removed the temperature, the NaNs went away. They also went away when I used the interpreted (eager) version. The behavior is very reproducible. It turns out to be caused by some kernel optimization during compilation. I solved the issue like this:

    g = mx.stop_gradient(gumbel * temperature)
    g = g + 0.0 * temperature  # breaks fusion

2 replies

awni Dec 1, 2025
Maintainer

There should never be a NaN that appears in compiled mode but not in eager mode. That's a bit suspicious. Could you share more details, such as the Gumbel code that causes a NaN, run both with and without compilation?

tianenchong Dec 1, 2025
Author

I am still using MLX version 0.26.2. This is the working version:

            if exploration_enabled:
                # gumbel = -mx.log(-mx.log(mx.random.uniform(shape=gated_x_seq.shape) + 1e-8) + 1e-8)
                gumbel = mx.stop_gradient(mx.random.gumbel(shape=gated_x_seq.shape) * temperature)
                gumbel = gumbel + 0.0 * temperature  # break fusion
                gate_probs = mx.softmax((logits/alpha + gumbel), axis=-1)
            else:
                gate_probs = mx.softmax((logits/alpha), axis=-1)

Neither of the following works on its own in compiled mode (they produce NaNs downstream, not in the current layer; I checked the gradient output of each layer):

    gumbel = mx.stop_gradient(-mx.log(-mx.log(mx.random.uniform(shape=gated_x_seq.shape) + 1e-8) + 1e-8) * temperature)

or

    gumbel = mx.stop_gradient(mx.random.gumbel(shape=gated_x_seq.shape) * temperature)

Each one only works with this line added:

    gumbel = gumbel + 0.0 * temperature
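For cross-checking a suspicious compiled result, a slow reference version of the same gate can be useful. The sketch below restates the computation above, `softmax(logits / alpha + temperature * gumbel)`, in plain Python for a single row. The helper names, the clamp value `tiny`, and the sample inputs are assumptions made for illustration; it uses Python's `random` module in place of `mx.random.gumbel`, so it checks the math only, not the MLX kernels.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(rng, tiny=1e-12):
    """Draw one Gumbel(0, 1) sample, keeping u strictly inside (0, 1)."""
    u = min(max(rng.random(), tiny), 1.0 - tiny)
    return -math.log(-math.log(u))


def gate_probs(logits, alpha, temperature, rng):
    """Compute softmax(logits / alpha + temperature * gumbel) for one row."""
    noisy = [x / alpha + temperature * sample_gumbel(rng) for x in logits]
    return softmax(noisy)


rng = random.Random(0)
logits = [2.0, -1.0, 0.5, 0.0]  # hypothetical gate logits for four experts
noisy_probs = gate_probs(logits, alpha=1.0, temperature=0.5, rng=rng)
plain_probs = gate_probs(logits, alpha=1.0, temperature=0.0, rng=rng)
```

With the temperature at zero the noise term vanishes and the result matches the plain softmax branch, so any finite temperature should give finite probabilities in this reference as well.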
