PoPE (Polar Coordinate Positional Embeddings) is an alternative method for positional embeddings in transformer models. It promises better generalization across sequence lengths and, to some degree, better performance in general. We investigate the effect of switching pretrained Pythia models from RoPE to PoPE. After patching, we recalibrate for ~2% of the pretraining budget (inspired by DroPE).
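For context, RoPE (the scheme being replaced) encodes position by rotating consecutive query/key dimension pairs through position-dependent angles. A minimal sketch of this rotation, purely illustrative and not the repo's implementation:

```python
import math

def rope_angles(pos, head_dim, base=10000.0):
    # One angle per 2-D pair: theta_i = pos / base**(2i / head_dim)
    return [pos / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, pos, base=10000.0):
    # Rotate each consecutive (even, odd) pair of `vec` by its angle.
    out = []
    for i, theta in enumerate(rope_angles(pos, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out
```

Because the rotation is norm-preserving and relative (attention scores depend only on position differences), swapping it for a different parameterization such as PoPE leaves the rest of the attention computation unchanged.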
We show that PoPE can match RoPE performance after recalibration while providing better length generalization. After 6.5B tokens of continued training:
| Model | Method | PPL @512 | PPL @1024 | PPL @2048 | PPL @3072 | PPL @4096 |
|---|---|---|---|---|---|---|
| Pythia-70m | RoPE (original) | 28.4 | 26.1 | 24.8 | 25.0 | 104.6 |
| Pythia-70m | PoPE (recalibrated) | 27.0 | 25.1 | 24.1 | 23.8 | 24.2 |
| Pythia-160m | RoPE (original) | 17.6 | 16.0 | 15.1 | 26.8 | 1101.9 |
| Pythia-160m | PoPE (recalibrated) | 17.7 | 16.3 | 15.5 | 15.1 | 15.4 |
Key observations:
- PoPE achieves perplexity similar to RoPE's at the training sequence length (2048)
- PoPE maintains stable perplexity at longer sequence lengths (3072, 4096), while RoPE degrades significantly
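The per-length perplexities above can be computed by chunking the evaluation stream into non-overlapping windows of the target length and exponentiating the mean per-token NLL. A hypothetical sketch, where `nll_fn` stands in for a model forward pass returning the summed NLL of one window (an assumption, not the repo's API):

```python
import math

def ppl_at_length(tokens, seq_len, nll_fn):
    # Chunk into non-overlapping windows; drop the final partial window
    # so every window is scored at exactly `seq_len` tokens of context.
    windows = [tokens[i : i + seq_len]
               for i in range(0, len(tokens) - seq_len + 1, seq_len)]
    total_nll = sum(nll_fn(w) for w in windows)      # summed NLL per window
    total_tokens = sum(len(w) for w in windows)
    return math.exp(total_nll / total_tokens)
```

Evaluating the same stream at several `seq_len` values yields a row of the table above.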
Further insights/notes:
- Replacing RoPE with PoPE seems more effective than dropping RoPE (see related DroPE Pythia replication)
- With the current setup, continued pretraining without patching unexpectedly yields a slight improvement in loss/perplexity at long context lengths, potentially due to a different weight decay or quantization setup
```shell
python pope_pythia.py
```
We train Pythia models further on their pretraining data, The Pile, using AdamW. Before training, we pre-tokenize the text data and cache it. Before and after training, we evaluate perplexity at different sequence lengths. The script supports multi-GPU training and logging to wandb.
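The pre-tokenize-and-cache step described above can be sketched as: tokenize each document once, join documents with an EOS separator, and pack the stream into fixed-length training blocks. `tokenize` and `eos_id` here are stand-ins for a real tokenizer, not the repo's actual interface:

```python
def pack_documents(docs, tokenize, block_size, eos_id):
    # Concatenate all documents into one token stream, EOS-separated.
    stream = []
    for doc in docs:
        stream.extend(tokenize(doc))
        stream.append(eos_id)
    # Drop the trailing partial block so every cached example has equal length.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size : (i + 1) * block_size]
            for i in range(n_blocks)]
```

Packing once up front keeps the training loop a simple iteration over equally sized blocks, which also makes sharding across GPUs straightforward.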
Open questions:
- Does this transfer to larger models and other model families?
- How is the performance on benchmarks and in real world use?
