Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen*, Jinan Xu
Our code is based on OpenRLHF v0.5.2.post2 (we are also updating our codebase to the latest OpenRLHF).
- Token-level reward optimization
- Solid performance
- Stable optimization and faster convergence
- Supports both on-policy (RL-like) and off-policy (DPO-like) training
AlignDistil is easy to use and consists of three steps:
- Train a DPO model on your preference data
- Train a reverse DPO model on your reversed preference data (swapping chosen and rejected)
- AlignDistil: compose a synthetic distribution from the two models above and distill it into the current policy model, either on your preference data (off-policy) or on model-sampled data (on-policy)
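The third step can be sketched as follows. This is a minimal, hypothetical illustration, not the repo's actual implementation (the exact adaptive weighting in AlignDistil differs; see the paper and `train_scripts` for details): at each token position, a teacher distribution is built by contrasting the DPO and reverse-DPO logits with an illustrative coefficient `alpha`, and the policy is trained to match it with a token-level KL.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def synthetic_teacher(dpo_logits, reverse_logits, alpha=0.5):
    """Illustrative combination (alpha is a made-up knob, not the paper's
    adaptive weight): extrapolate from the reverse-DPO model toward the
    DPO model to sharpen the alignment signal, then normalize."""
    combined = [d + alpha * (d - r) for d, r in zip(dpo_logits, reverse_logits)]
    return softmax(combined)

def token_kl(teacher_probs, policy_logits):
    """Forward KL(teacher || policy) at a single token position; this is
    the per-token distillation loss the policy would minimize."""
    policy_probs = softmax(policy_logits)
    return sum(t * math.log(t / p)
               for t, p in zip(teacher_probs, policy_probs) if t > 0)
```

In the on-policy setting these token positions come from model-sampled responses; in the off-policy setting they come from the fixed preference data.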
Please run the following commands to install the specific OpenRLHF version in our repo:

```bash
git clone https://github.com/songmzhang/AlignDistil
cd AlignDistil/OpenRLHF
pip install -e .
```

You also need to install vLLM to run on-policy AlignDistil.
DPO training example:

```bash
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/dpo_01.sh
```

Reverse DPO training example:

```bash
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/reverse_dpo_01.sh
```

Off-policy AlignDistil training example:

```bash
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/aligndistil_off_policy.sh
```

On-policy AlignDistil training example:

```bash
bash ./train_scripts/ultrafeedback/qwen2.5-1.5b/aligndistil_on_policy.sh
```

If you find this repo helpful, please cite our paper:
```bibtex
@article{zhang2025aligndistil,
  title={{AlignDistil}: Token-Level Language Model Alignment as Adaptive Policy Distillation},
  author={Zhang, Songming and Zhang, Xue and Zhang, Tong and Hu, Bojie and Chen, Yufeng and Xu, Jinan},
  journal={arXiv preprint arXiv:2503.02832},
  year={2025}
}
```
