I think this is the original source of the aux loss: https://transformer-circuits.pub/2024/jan-update/index.html#dict-learning-resampling. They mention stopping the gradient to the residual. Otherwise the aux loss fights against the main reconstruction loss, since it can lower itself by changing the residual instead of by predicting it better.
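To make the "fights against" part concrete, here is a toy scalar sketch of my own, not code from the linked post. All the names and numbers (`X`, `F`, `G`, `W_AUX`, `w_main`) are made up for illustration, and I use finite differences instead of an autodiff library so it runs with plain Python. Freezing the residual stands in for a stop-gradient such as `.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX.

```python
# Toy scalar setup; every name and number here is hypothetical.
X = 2.0        # target activation
F = 1.0        # activation of a live latent (held fixed)
G = 1.0        # activation of a dead latent used by the aux loss (held fixed)
W_AUX = 3.0    # aux decoder weight, deliberately overshooting the residual


def residual(w_main):
    """Error left over by the main reconstruction."""
    return X - w_main * F


def main_loss(w_main):
    """Squared error of the main reconstruction."""
    return residual(w_main) ** 2


def aux_loss_leaky(w_main):
    """Aux loss where the residual still depends on w_main (no stop-gradient)."""
    return (residual(w_main) - W_AUX * G) ** 2


def make_aux_loss_stopped(w_main_now):
    """Aux loss with the residual frozen at its current value.

    Freezing plays the role of a stop-gradient: the residual becomes a
    constant target, so the aux loss no longer depends on w_main.
    """
    frozen = residual(w_main_now)
    return lambda w_main: (frozen - W_AUX * G) ** 2


def grad(loss_fn, w, eps=1e-6):
    """Central finite-difference derivative of loss_fn at w."""
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)


w_main = 1.0
g_main = grad(main_loss, w_main)
g_aux_leaky = grad(aux_loss_leaky, w_main)
g_aux_stopped = grad(make_aux_loss_stopped(w_main), w_main)

print("main loss grad:     ", g_main)
print("aux grad (no stop): ", g_aux_leaky)
print("aux grad (stopped): ", g_aux_stopped)
```

In this setup the main-loss gradient on `w_main` is negative while the unstopped aux gradient is positive, so the two pull the main weight in opposite directions. With the residual frozen, the aux loss contributes nothing to `w_main` and only the aux weights would be trained by it. This is just one contrived case where the aux prediction overshoots; in other cases the two gradients can agree.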