Hello world 👋
I often spend my weekends exploring what transformer-based architectures are doing under the hood, and in this repo I'm documenting my findings:
- First embedding layer (the embedding table)
- Embeddings from summing the last 4 layers (a minimal sketch follows this list)
- Comparing the embeddings from the two different strategies
- Sentence embeddings
- Animated view of how embedding values change as a token passes through each of the 12 layers
- Weight distributions for the query, key, and value matrices (see the histogram sketch after this list)
- Weight distributions for the other layers
- Attention score vs. token for each attention head (see the attention sketch after this list)
- Attention score vs. token for each layer
- Observations
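
To make the first two embedding strategies concrete, here's a minimal sketch of pulling out the hidden states, assuming `bert-base-uncased` and the Hugging Face `transformers` library (the model choice and the mean pooling at the end are my assumptions, not the only options):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed model: any 12-layer BERT-style encoder behaves the same way here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("Hello world", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of 13 tensors: the embedding-layer output
# followed by the output of each of the 12 encoder layers.
hidden_states = outputs.hidden_states

# Strategy 1: the first embedding layer (embedding table + position/type embeddings).
first_layer = hidden_states[0]                      # (batch, seq_len, 768)

# Strategy 2: sum the hidden states of the last 4 layers.
last4_sum = torch.stack(hidden_states[-4:]).sum(0)  # (batch, seq_len, 768)

# One common way to get a sentence embedding: mean-pool over tokens.
sentence_embedding = last4_sum.mean(dim=1)          # (batch, 768)
```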
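
For the weight distributions, a sketch of one way to pull the query/key/value matrices out for plotting; the attribute path below is specific to Hugging Face's `BertModel`, which I'm assuming here:

```python
import matplotlib.pyplot as plt
from transformers import AutoModel

# Assumed model: the path encoder.layer[i].attention.self.{query,key,value}
# is specific to Hugging Face's BertModel.
model = AutoModel.from_pretrained("bert-base-uncased")

layer = model.encoder.layer[0]
for name, linear in [("query", layer.attention.self.query),
                     ("key", layer.attention.self.key),
                     ("value", layer.attention.self.value)]:
    # Flatten each weight matrix into a 1-D array and histogram it.
    weights = linear.weight.detach().flatten().numpy()
    plt.hist(weights, bins=100, alpha=0.5, label=name)
plt.legend()
plt.xlabel("weight value")
plt.title("Layer 0 Q/K/V weight distributions")
plt.show()
```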
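
And for the attention scores, a sketch of reading the per-layer, per-head attention matrices, again assuming the Hugging Face API (`output_attentions=True`):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("Hello world", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# attentions is a tuple of 12 tensors, one per layer,
# each shaped (batch, num_heads, seq_len, seq_len).
attentions = outputs.attentions

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
layer_idx, head_idx = 0, 0  # which layer/head to inspect
scores = attentions[layer_idx][0, head_idx]  # (seq_len, seq_len)

# Attention paid by the first token ([CLS]) to every token in the sequence.
for token, score in zip(tokens, scores[0]):
    print(f"{token:>12s}  {score.item():.3f}")
```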