Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding

| Paper (GTO) |


Acceleration demo of GTO for Llama-3.1-8B on an NVIDIA RTX 4090 GPU.


Overview

GTO is a novel framework designed to address draft policy misalignment in speculative decoding. This repository provides the implementation of GTO, including its Draft Tree Reward and Group-based Draft Policy Training.

GTO is:
  • 5.6x faster than vanilla decoding.
  • 7% faster than EAGLE-3.

Setup & Installation

To set up the environment, follow these steps:

  1. Clone the repository and navigate to the GTO directory:
git clone https://github.com/hsj576/GTO.git
cd GTO
  2. Install the required dependencies:
pip install -r requirements.txt
  3. Update the paths in the code: replace placeholders such as "your-model-paths" and "your-datasets-path" with the actual paths to your models and datasets (one way to locate every occurrence is shown below).
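A quick way to list every file that still contains these placeholders (assuming the placeholder strings are exactly the two named above) is:

grep -rnE "your-model-paths|your-datasets-path" .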

GTO Weights

| Base Model | GTO on Hugging Face |
| --- | --- |
| Llama-3.1-8B-Instruct | husj576/GTO-llama31-instruct-8B |
| Llama-3.3-70B-Instruct | husj576/GTO-llama33-instruct-70B |
| vicuna-13b-v1.3 | husj576/GTO-vicuna-13b |
| Qwen3-8B | husj576/GTO-qwen3-8B |
| DeepSeek-R1-Distill-Llama-8B | husj576/GTO-deepseek-8B |

Inference

The inference code we provide automatically allocates model weights (loading a model across multiple GPUs), allowing you to run models that exceed the memory of a single GPU.

We also provide a suggested web interface, which you can launch with the following command. Once the model is fully loaded, a URL is printed to the terminal; open it in your browser to access the interface.

python -m application.webui --ea-model-path [path of GTO weight] \
    --base-model-path [path of the original model] \
    --model-type [vicuna/llama3/qwen] \
    --total-token [int]

The --total-token argument sets the number of draft tokens. For smaller models and more capable GPUs, this value can be set larger; tuning it to your specific device and model gives better results.
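As a concrete illustration, serving the Llama-3.1-8B weights from the table above might look like the following (the Hugging Face IDs and the total-token value are placeholders for this example; substitute local paths or a value tuned to your GPU as needed):

python -m application.webui --ea-model-path husj576/GTO-llama31-instruct-8B \
    --base-model-path meta-llama/Llama-3.1-8B-Instruct \
    --model-type llama3 \
    --total-token 60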

Training

To train GTO's draft model, you first need to generate the training data and then proceed with the two-phase training process.

Generate Train Data

You can run the following command to generate the training data.

python -m ge_data.regeneratedata

Train the Draft Model

GTO's training involves two phases.

For training phase I (optional)

Run the following command for training phase I:

cd traingto

export PYTHONPATH="/your-GTO-path/GTO:$PYTHONPATH"

deepspeed main.py \
--deepspeed_config ds_config.json \
--trainpath [path to training data] \
--savedir [path to save checkpoints] 

This phase provides a baseline model that stabilizes the subsequent group-based updates; it can be skipped when a sufficiently strong draft model already exists, e.g., a draft model well trained with EAGLE-3 or GRIFFIN.

For training phase II

Use the following command:

cd traingto

export PYTHONPATH="/your-GTO-path/GTO:$PYTHONPATH"

deepspeed main_gto.py \
--deepspeed_config ds_config.json \
--trainpath [path to training data] \
--savedir [path to save checkpoints] \
--load_path [path to phase I model checkpoint]  

Example configuration files can be found in the GTO/traingto directory.
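Prefer the ds_config.json shipped in GTO/traingto. Purely as an illustration of what such a DeepSpeed configuration typically contains, a minimal sketch might look like this (all values are placeholders, not GTO's actual hyperparameters):

{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 4,
  "gradient_clipping": 1.0,
  "bf16": { "enabled": true },
  "zero_optimization": { "stage": 2 },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 1e-4, "weight_decay": 0.01 }
  }
}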

Evaluation

To evaluate the performance and speed of GTO, use the provided scripts for different models. Run the following commands:

./scripts/llama31_test_8b.sh
./scripts/llama33_test_70b.sh
./scripts/deepseek_test_8b.sh
./scripts/vicuna_test_13b.sh
./scripts/qwen3_test_8b.sh
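If the scripts are not marked executable in your checkout, you can either add the execute bit or run them through bash directly:

chmod +x scripts/*.sh
bash scripts/llama31_test_8b.sh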

Reference

For technical details and full experimental results, please refer to the GTO paper.

@article{hu2025bridging,
  title={Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding},
  author={Hu, Shijing and Li, Jingyang and Lu, Zhihui and Zhou, Pan},
  journal={arXiv preprint arXiv:2509.22134},
  year={2025}
}

Acknowledgements

Our implementation is based on the open-source repository of EAGLE. This project has also been influenced by many excellent projects in the LLM community, such as HASS and GRIFFIN.

About

Official Implementation of "Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding" [ICLR 2026]
