This repository contains implementations of the Simplified Transformer block, as well as of an architecture that shares a single wide MLP across all transformer blocks. I intend to add further implementations, such as multi-head latent attention (MLA), as well as empirical results on small datasets, in the future. Obviously, improving transformer architecture efficiency is a very wide and deep field of research.
This paper proposes major simplifications of the transformer block, resulting in fewer parameters while maintaining performance. Specifically:
Instead of three projections (Q, K, V), only two are used (Q, K). The scaled dot-product attention operation is performed without a value projection: the softmax scores are applied directly to the input, rather than being multiplied by V at the end:
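A minimal sketch of this step (the function and weight names are mine for illustration, not from the paper's code; single-head, no causal mask, for brevity):

```python
import math
import torch

def qk_only_attention(x, w_q, w_k):
    """Attention with only query/key projections. With V fixed to the
    identity, the softmax scores weight the raw input sequence directly."""
    q, k = x @ w_q, x @ w_k
    scores = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(k.size(-1)), dim=-1)
    return scores @ x  # weighted sum of the inputs themselves, no value projection
```

Note that this removes both the value projection and (in the full simplified block) the output projection, which is where the parameter savings in the attention sublayer come from.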
The final attention output combines the data-dependent score matrix with two fixed matrices, the identity and a uniform "centering" matrix, via learnable scalar weights (known as shaped attention):
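The shaped-attention combination can be sketched as follows. This is my own illustrative code, assuming a causal uniform matrix as the fixed centering matrix and plain floats in place of the paper's learnable scalars:

```python
import torch

def shaped_attention_matrix(a, alpha, beta, gamma):
    """Combine a data-dependent attention matrix `a` (T x T, rows sum to 1)
    with the identity and the causal uniform ("centering") matrix."""
    T = a.size(-1)
    eye = torch.eye(T)
    causal = torch.tril(torch.ones(T, T))
    center = causal / causal.sum(-1, keepdim=True)  # row i averages positions 0..i
    return alpha * eye + beta * a - gamma * center
```

At initialization, with the scores close to uniform, the beta and gamma terms roughly cancel and the operation is close to the identity, which is the behavior the shaping is designed to produce.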
Finally, the shaped attention operation and the MLP are not arranged sequentially, but rather in parallel:
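The parallel arrangement can be sketched as below; the class and attribute names are mine, and the attention and MLP submodules are passed in as black boxes:

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Attention and MLP read the same normalized input and their outputs
    are summed, instead of the MLP consuming the attention output."""
    def __init__(self, dim, attn, mlp):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = attn  # e.g. the shaped-attention operation
        self.mlp = mlp

    def forward(self, x):
        h = self.norm(x)
        return self.attn(h) + self.mlp(h)
```

A practical consequence is that the two sublayers can be computed concurrently, since neither depends on the other's output.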
The main idea of this paper is to share a single MLP across all the transformer blocks in a model. Typically, each block's MLP has a hidden layer of width 4 * embedding_dim. The authors suggest using a shared MLP with a hidden layer of width 4 * embedding_dim * n_layers, which maintains performance while leaving the total parameter count roughly unchanged. The highlight result is Pareto-optimal parameter reduction when the shared MLP is made narrower than this.
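A minimal sketch of the sharing idea, assuming the one-wide-MLP width from above (class and attribute names are mine; attention and residual details are omitted for brevity):

```python
import torch
import torch.nn as nn

class SharedMLPStack(nn.Module):
    """Every layer reuses the same wide MLP instead of owning its own.
    With hidden width 4 * dim * n_layers, the MLP parameter count matches
    that of n_layers separate MLPs of width 4 * dim."""
    def __init__(self, dim, n_layers, widen_factor=4):
        super().__init__()
        hidden = widen_factor * dim * n_layers  # one wide hidden layer
        self.shared_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(n_layers)])

    def forward(self, x):
        for norm in self.norms:  # attention sublayers omitted in this sketch
            x = x + self.shared_mlp(norm(x))
        return x
```

Only the LayerNorms are per-layer here, so shrinking `hidden` below `4 * dim * n_layers` directly trades parameters against quality, which is where the Pareto-optimal configurations come from.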