Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen*, Jinan Xu
Our code is based on OpenRLHF v0.5.2.post2 (we are also updating our codebase to the latest OpenRLHF).
- Token-level reward optimization
- Solid performance
- Stable optimization and faster convergence
- Supports both on-policy (RL-like) and off-policy (DPO-like) training
AlignDistil is easy to use and consists of three steps:
- Train a DPO model on your preference data
- Train a reverse DPO model on your reversed preference data (swapping chosen and rejected)
- AlignDistil: compose a synthetic distribution from the two models above and distill it into the current policy model, either on your preference data (off-policy) or on model-sampled data (on-policy)
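The third step can be sketched as follows. This is a minimal, hypothetical illustration, not the repo's actual implementation (the exact adaptive weighting in AlignDistil differs; see the paper and `train_scripts` for details): at each token position, a teacher distribution is built by contrasting the DPO and reverse-DPO logits with an illustrative coefficient `alpha`, and the policy is trained to match it with a token-level KL.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def synthetic_teacher(dpo_logits, reverse_logits, alpha=0.5):
    """Illustrative combination (alpha is a made-up knob, not the paper's
    adaptive weight): extrapolate from the reverse-DPO model toward the
    DPO model to sharpen the alignment signal, then normalize."""
    combined = [d + alpha * (d - r) for d, r in zip(dpo_logits, reverse_logits)]
    return softmax(combined)

def token_kl(teacher_probs, policy_logits):
    """Forward KL(teacher || policy) at a single token position; this is
    the per-token distillation loss the policy would minimize."""
    policy_probs = softmax(policy_logits)
    return sum(t * math.log(t / p)
               for t, p in zip(teacher_probs, policy_probs) if t > 0)
```

In the on-policy setting these token positions come from model-sampled responses; in the off-policy setting they come from the fixed preference data.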
Please run the following commands to install the specific OpenRLHF version in our repo:

```bash
git clone https://github.com/songmzhang/AlignDistil
cd AlignDistil/OpenRLHF
pip install -e .
```

You also need to install vLLM to run on-policy AlignDistil.
DPO training example:

```bash
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/dpo_01.sh
```

Reverse DPO training example:

```bash
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/reverse_dpo_01.sh
```

Off-policy AlignDistil training example:

```bash
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/aligndistil_off_policy.sh
```

On-policy AlignDistil training example:

```bash
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/aligndistil_on_policy.sh
```

If you find this repo helpful, please cite our paper:
```bibtex
@article{zhang2025aligndistil,
  title={{AlignDistil}: Token-Level Language Model Alignment as Adaptive Policy Distillation},
  author={Zhang, Songming and Zhang, Xue and Zhang, Tong and Hu, Bojie and Chen, Yufeng and Xu, Jinan},
  journal={arXiv preprint arXiv:2503.02832},
  year={2025}
}
```
