
# Patching Pythia Models to use PoPE instead of RoPE

PoPE (Polar Coordinate Positional Embeddings) is an alternative positional-embedding method for transformer models. It promises better generalization across sequence lengths and, to some degree, better performance overall. We investigate the effect of switching pretrained Pythia models from RoPE to PoPE. After patching, we recalibrate for ~2% of the pretraining budget (inspired by DroPE).
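For reference, RoPE encodes position by rotating consecutive (even, odd) pairs of the query/key vectors by a position-dependent angle, so that attention scores depend on relative position. A minimal, framework-free sketch of that rotation (the mechanism PoPE replaces; `rope_rotate` is an illustrative helper, not code from this repo):

```python
import math

def rope_rotate(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive (even, odd) pairs of `vec` by a position-dependent
    angle, as in standard RoPE. `vec` must have even length."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Rotation frequency decreases with the dimension index.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out
```

Since each pair is rotated by a pure rotation, the vector norm is preserved and position 0 is the identity; the dot product between a rotated query and key then depends only on their relative offset.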

## Results

We show that PoPE can match RoPE's performance after recalibration while generalizing better to longer sequences. After 6.5B tokens of continued training:

| Model | Method | PPL @512 | PPL @1024 | PPL @2048 | PPL @3072 | PPL @4096 |
|---|---|---|---|---|---|---|
| Pythia-70m | RoPE (original) | 28.4 | 26.1 | 24.8 | 25.0 | 104.6 |
| Pythia-70m | PoPE (recalibrated) | 27.0 | 25.1 | 24.1 | 23.8 | 24.2 |
| Pythia-160m | RoPE (original) | 17.6 | 16.0 | 15.1 | 26.8 | 1101.9 |
| Pythia-160m | PoPE (recalibrated) | 17.7 | 16.3 | 15.5 | 15.1 | 15.4 |

Key observations:

  • PoPE achieves perplexity similar to RoPE at the training sequence length (2048)
  • PoPE maintains stable perplexity at longer sequence lengths (3072, 4096), while RoPE degrades significantly
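The perplexity numbers above are, by the standard definition, the exponential of the mean per-token negative log-likelihood over held-out text at each context length. A minimal sketch of that reduction (a generic helper, not the repo's evaluation code):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_nlls` holds the per-token NLLs (natural log) produced by the
    model on held-out text at a fixed context length.
    """
    return math.exp(sum(token_nlls) / len(token_nlls))
```

Evaluating this at several context lengths (512 through 4096) on the same held-out data yields the per-length columns in the table above.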

Further insights/notes:

  • Replacing RoPE with PoPE seems more effective than dropping RoPE (see related DroPE Pythia replication)
  • With the current setup, continued pretraining without patching unexpectedly improves loss/perplexity slightly at long context lengths, potentially due to a different weight-decay or quantization setup

## Training Setup

```shell
python pope_pythia.py
```

We continue training Pythia models on their pretraining data, The Pile, using AdamW. Before training, we pre-tokenize the text data and cache it. Before and after training, we evaluate perplexity at different sequence lengths. The script supports multi-GPU training and logging to wandb.
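One common way to pre-tokenize and cache pretraining data is to concatenate the token stream and pack it into fixed-length training sequences. A minimal sketch of that packing step, assuming the incomplete tail is dropped (an illustrative detail, not necessarily what `pope_pythia.py` does):

```python
def chunk_tokens(token_ids: list[int], seq_len: int) -> list[list[int]]:
    """Pack a flat token stream into full-length training sequences,
    dropping the incomplete tail. The resulting chunks can be cached
    to disk and streamed during continued pretraining."""
    n_full = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
```

With a training sequence length of 2048 (matching the Pythia pretraining context), each cached chunk is one training example.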

## Open Questions

  • Does this transfer to larger models and other model families?
  • How does it perform on benchmarks and in real-world use?
