| arXiv | OpenReview | Poster |
This repository contains an official implementation of the paper AMiD: Knowledge Distillation for LLMs with α-mixture Assistant Distribution (ICLR 2026).
Donghyeok Shin, Yeongmin Kim, Suhyeon Jo, Byeonghu Na, and Il-Chul Moon
Abstract: Autoregressive large language models (LLMs) have achieved remarkable improvements across many tasks but incur high computational and memory costs. Knowledge distillation (KD) mitigates this issue by transferring knowledge from a large teacher to a smaller student through distributional alignment. Previous studies have proposed various discrepancy metrics, but the capacity gap and training instability caused by near-zero probabilities, stemming from the high-dimensional output of LLMs, remain fundamental limitations. To overcome these challenges, several approaches that implicitly or explicitly incorporate an assistant distribution have recently been proposed. However, past proposals of assistant distributions have been fragmented, lacking a systematic investigation of the interpolation path and the divergence. This paper proposes the α-mixture assistant distribution, a novel generalized family of assistant distributions, and α-mixture distillation, coined AMiD, a unified framework for KD using the assistant distribution. The α-mixture assistant distribution provides a continuous extension of the assistant distribution by introducing a new distribution design variable α, which has been fixed in all previous approaches. Furthermore, AMiD generalizes the family of divergences used with the assistant distributions based on optimality, which has also been restricted in previous works. Through extensive experiments, we demonstrate that AMiD offers superior performance and training stability by leveraging a broader and theoretically grounded assistant distribution space.
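For intuition, the assistant distribution interpolates between the teacher distribution p and the student distribution q with a mixing weight λ. One natural way to parameterize such a family, shown here purely as an illustration of the idea (the notation below is ours, not the paper's exact definition), is a weighted power mean:

$$
a_{\alpha,\lambda}(y \mid x) \;\propto\; \bigl((1-\lambda)\, p(y \mid x)^{\alpha} + \lambda\, q(y \mid x)^{\alpha}\bigr)^{1/\alpha}, \qquad \alpha > 0,\ \lambda \in [0, 1],
$$

which recovers the arithmetic mixture at α = 1 and the (normalized) geometric mixture in the limit α → 0.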
Create a new virtual environment and install the required dependencies using the install.sh file:
bash install.sh
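For example, a minimal setup might look like the following; the environment name is illustrative, and the last command is only needed if install.sh does not already provide the huggingface-cli tool used by the download commands below:

python3 -m venv .venv                      # create an isolated environment (name is arbitrary)
source .venv/bin/activate                  # activate it
bash install.sh                            # install the repository's dependencies
pip install -U "huggingface_hub[cli]"      # only if huggingface-cli is not already available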
- The training/evaluation instruction-response data before processing can be downloaded from MiniLLM.
huggingface-cli download MiniLLM/dolly --repo-type dataset --local-dir ./data/dolly/
huggingface-cli download MiniLLM/self-inst --repo-type dataset --local-dir ./data/self-inst/
huggingface-cli download MiniLLM/Vicuna --repo-type dataset --local-dir ./data/vicuna/
huggingface-cli download MiniLLM/sinst --repo-type dataset --local-dir ./data/sinst/
huggingface-cli download MiniLLM/uinst --repo-type dataset --local-dir ./data/uinst/
- The processed data can be downloaded from MiniLLM.
huggingface-cli download MiniLLM/dolly-processed --repo-type dataset --local-dir ./processed_data/dolly/
huggingface-cli download MiniLLM/openwebtext-processed --repo-type dataset --local-dir ./processed_data/openwebtext/gpt2/512/10M/
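As a quick sanity check after the downloads, you can confirm that the directories populated by the --local-dir arguments above exist (the exact file names inside each directory may differ):

ls ./data/dolly/ ./data/self-inst/ ./data/vicuna/ ./data/sinst/ ./data/uinst/
ls ./processed_data/dolly/ ./processed_data/openwebtext/gpt2/512/10M/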
To run fine-tuning or the standard KD baselines, you need to download the model checkpoints from the Hugging Face Model Hub and place them in checkpoints/. For example, for gpt2-large, download the model and place it in checkpoints/gpt2-large.
huggingface-cli download gpt2 --repo-type model --local-dir ./checkpoints/gpt2-base
huggingface-cli download gpt2-medium --repo-type model --local-dir ./checkpoints/gpt2-medium
huggingface-cli download gpt2-large --repo-type model --local-dir ./checkpoints/gpt2-large
huggingface-cli download gpt2-xl --repo-type model --local-dir ./checkpoints/gpt2-xlarge
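The same downloads can be written as a small convenience loop; note that the local directory names gpt2-base and gpt2-xlarge intentionally differ from the Hub model ids gpt2 and gpt2-xl, matching the paths expected by the scripts:

for pair in gpt2:gpt2-base gpt2-medium:gpt2-medium gpt2-large:gpt2-large gpt2-xl:gpt2-xlarge; do
  repo=${pair%%:*}; dir=${pair##*:}   # Hub model id and local checkpoint directory
  huggingface-cli download ${repo} --repo-type model --local-dir ./checkpoints/${dir}
done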
The main hyperparameters of AMiD are as follows:
- amid-div-name: Name of divergence
- amid-div-order: Order of distributions in divergence
- amid-alpha: Value of alpha
- amid-lam: Value of lambda
Detailed values for these hyperparameters can be found in our paper. Please refer to the provided bash scripts for detailed arguments when running experiments.
We provide example commands for gpt2-base. Please refer to scripts/ for other model architectures. The final checkpoints are selected based on ROUGE-L scores. Please refer to MiniLLM, DistiLLM, and ABKD for instructions on how to run the baselines.
bash ./scripts/gpt2/sft/sft_xlarge.sh ${/PATH/TO/AMiD} ${MASTER_PORT} ${GPU_NUM}
or download from MiniLLM.
huggingface-cli download MiniLLM/teacher-gpt2-1.5B --repo-type model --local-dir ./results/gpt2/train/sft/gpt2-xlarge
The final checkpoints are selected by the validation loss.
bash ./scripts/gpt2/init/init_base.sh ${/PATH/TO/AMiD} ${MASTER_PORT} ${GPU_NUM}
or download from MiniLLM.
huggingface-cli download MiniLLM/init-gpt2-120M --repo-type model --local-dir ./results/gpt2/train/init/gpt2-base
bash ./scripts/gpt2/amid/train_0.1B_1.5B.sh ${/PATH/TO/AMiD} ${MASTER_PORT} ${GPU_NUM} ${amid-div-name} ${amid-div-order} ${amid-alpha} ${amid-lam} ${batch-size} ${lr}
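For instance, a concrete invocation might look like the following; only the argument order follows the command above, while the divergence name and all numeric values are illustrative placeholders (see the paper and the provided bash scripts for the accepted names and tuned settings):

bash ./scripts/gpt2/amid/train_0.1B_1.5B.sh /path/to/AMiD 29500 4 kld 2 0.5 0.1 8 5e-4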
bash ./scripts/gpt2/eval/run_eval.sh ${/PATH/TO/AMiD_CKPT} ${MASTER_PORT}
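For example, with an illustrative checkpoint path (point it at whichever trained student checkpoint you want to evaluate):

bash ./scripts/gpt2/eval/run_eval.sh ./results/gpt2/train/amid/gpt2-base 29500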
If you find the code useful for your research, please consider citing our paper.
@inproceedings{
shin2026amid,
title={{AM}iD: Knowledge Distillation for {LLM}s with $\alpha$-mixture Assistant Distribution},
author={Donghyeok Shin and Yeongmin Kim and Suhyeon Jo and Byeonghu Na and Il-chul Moon},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=7WPJ0EgPdW}
}

This work is heavily built upon the code from MiniLLM, DistiLLM, and ABKD.

