AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation (ACL2025)

Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen*, Jinan Xu

Our code is based on OpenRLHF v0.5.2.post2 (we are also updating our codebase to the latest OpenRLHF).

Why use AlignDistil?

  • Token-level reward optimization
  • Solid performance
  • Stable optimization and faster convergence
  • Supports both on-policy training (like RL) and off-policy training (like DPO)

Method Framework

alignDistil framework

AlignDistil is easy to use and consists of three steps:

  • Train a DPO model on your preference data
  • Train a reverse DPO model on your reversed preference data (swapping chosen and rejected)
  • AlignDistil: compose a synthetic distribution from these two models and distill it into the current policy model, either on your preference data (off-policy) or on model-sampled data (on-policy)
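The three steps above can be sketched in plain Python. This is a minimal, hypothetical illustration only, not the paper's exact formulation or the repo's implementation (which lives in the modified OpenRLHF code): the logit-space contrast between the DPO and reverse-DPO models, the weight `alpha`, and the forward-KL distillation loss are all assumptions made for this sketch.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def synthetic_teacher(dpo_logits, rdpo_logits, alpha=1.0):
    """Compose a synthetic teacher by contrasting the DPO model
    against the reverse-DPO model in logit space.

    `alpha` (hypothetical) controls how strongly the teacher moves
    away from the reverse-DPO model, amplifying the preference signal.
    """
    combined = [d + alpha * (d - r) for d, r in zip(dpo_logits, rdpo_logits)]
    return softmax(combined)

def kl_div(p, q):
    """Forward KL(p || q): per-token distillation loss."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token logits over a 4-word vocabulary
dpo_logits = [2.0, 0.5, -1.0, 0.0]    # DPO model
rdpo_logits = [0.5, 1.5, -0.5, 0.0]   # reverse-DPO model
policy_probs = softmax([1.0, 1.0, 0.0, 0.0])  # current policy

teacher = synthetic_teacher(dpo_logits, rdpo_logits, alpha=1.0)
loss = kl_div(teacher, policy_probs)
print([round(p, 3) for p in teacher], round(loss, 3))
```

Minimizing this token-level KL pulls the policy toward the synthetic teacher at every position, which is what makes the objective token-level rather than sequence-level.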

Prepare

Please run the following commands to install the specific OpenRLHF version bundled in this repo:

git clone https://github.com/songmzhang/AlignDistil
cd AlignDistil/OpenRLHF
pip install -e ./

You also need to install vLLM (pip install vllm) to run on-policy AlignDistil.

Start

DPO training example:

bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/dpo_01.sh

Reverse DPO training example:

bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/reverse_dpo_01.sh

Off-policy AlignDistil training example:

bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/aligndistil_off_policy.sh

On-policy AlignDistil training example:

bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/aligndistil_on_policy.sh

Citation

If you find this repo helpful, please cite our paper:

@article{zhang2025aligndistil,
  title={Aligndistil: Token-level language model alignment as adaptive policy distillation},
  author={Zhang, Songming and Zhang, Xue and Zhang, Tong and Hu, Bojie and Chen, Yufeng and Xu, Jinan},
  journal={arXiv preprint arXiv:2503.02832},
  year={2025}
}
