Do the authors have any insight into whether nGPT quantizes better than standard GPT?
The faster convergence in FP16, together with all weights and activations being normalized, suggests that it likely would.
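To illustrate the intuition (this is my own toy sketch, not the authors' code, and the outlier pattern is a hypothetical stand-in for the outlier channels often seen in standard transformers): with per-tensor int8 quantization, a weight matrix whose rows have wildly different norms suffers badly, because a few large rows dominate the quantization scale, whereas unit-norm rows, as nGPT maintains, keep the dynamic range tight.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random weight matrix with a couple of large-norm "outlier" rows,
# a hypothetical stand-in for unnormalized GPT weights.
w = rng.normal(size=(64, 256))
w[:2] *= 100.0
# nGPT-style: every row rescaled to unit L2 norm.
w_unit = w / np.linalg.norm(w, axis=1, keepdims=True)

def quantize_int8(x):
    # Symmetric per-tensor int8 quantization: one scale for the whole tensor.
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).clip(-127, 127) * scale

def mean_row_rel_err(x):
    # Mean per-row relative reconstruction error after quantization.
    xq = quantize_int8(x)
    return float(np.mean(np.linalg.norm(x - xq, axis=1) / np.linalg.norm(x, axis=1)))

err_raw = mean_row_rel_err(w)
err_unit = mean_row_rel_err(w_unit)
print(f"outlier rows: {err_raw:.3f}  unit-norm rows: {err_unit:.4f}")
```

In this toy setting the small rows of the unnormalized matrix are nearly destroyed by the coarse scale, while the row-normalized matrix quantizes with negligible error; of course, whether this carries over to trained nGPT checkpoints is exactly the question.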
Did the authors try this with any of their trained models?