
PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models

Mouxiao Huang*, Borui Jiang*✉, Dehua Zheng, Hailin Hu✉, Kai Han, Xinghao Chen✉

* Equal contribution; ✉ Corresponding authors

arXiv:2510.22936 | ICLR 2026 Poster

⭐ If you find this useful, a star would be appreciated

PPE Framework Overview

🗺 Roadmap

  • ✅ Paper accepted at ICLR 2026
  • ✅ Core PPE implementation released
  • ✅ Training and inference pipeline
  • 🔜 Additional benchmark support
  • 🔜 Cascade compression for image inputs
  • 🔜 Finetuned checkpoints
  • 🔜 Extended backbone support
  • 🔜 HuggingFace integration


🌟 Highlights

  • Plug-and-Play & Parameter-Free: Works in a plug-and-play manner, without modifying original token selection or aggregation mechanisms.
  • Preserve Positional Information: Preserves richer positional cues under the same reduction ratio.
  • Training-Free & SFT: Works training-free on top of the underlying compression method, and improves further when supervised fine-tuning (SFT) is applied.
  • Broad Compatibility: Easily combines with various token compression methods.
  • Cascade Clustering Support: Facilitates multi-stage compression within the LLM while maintaining performance.

📦 Installation

✅ Requirements

  • transformers ≥ 4.50 (required for Qwen2.5-VL support)
  • liger_kernel
  • deepspeed
  • flash-attn
  • accelerate==1.1.1
  • torch_npu (optional, only needed for Ascend NPU usage)
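
A quick sanity check of the core requirement above (a minimal sketch; it assumes the packaging helper is installed alongside transformers):

    # Verify that the installed transformers version supports Qwen2.5-VL (sketch only).
    from packaging import version
    import transformers

    assert version.parse(transformers.__version__) >= version.parse("4.50.0"), \
        "transformers >= 4.50 is required for Qwen2.5-VL support"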

🚀 Usage

Pretrained Model

We conduct experiments primarily on Qwen2.5-VL-3B-Instruct. You can download the official pretrained model from Hugging Face (Qwen/Qwen2.5-VL-3B-Instruct).
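
For example, the checkpoint can be fetched with the huggingface_hub client (a sketch; the local_dir below is an arbitrary choice):

    # Download Qwen2.5-VL-3B-Instruct from Hugging Face (sketch; pick any local directory).
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="Qwen/Qwen2.5-VL-3B-Instruct",
        local_dir="./checkpoints/Qwen2.5-VL-3B-Instruct",
    )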

📁 Dataset

Training Data:

Due to computational limitations, our supervised fine-tuning (SFT) dataset is constructed from publicly available sources. You may use those datasets in full or customize your own.

For the expected data structure, please refer to ./data/demo.json or Qwen-VL-Series-Finetune for more information.
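
A quick way to inspect that format (a minimal sketch; it assumes ./data/demo.json is a JSON array of samples and is run from the repository root):

    import json

    # Peek at the expected SFT data layout; ./data/demo.json is the authoritative reference.
    with open("data/demo.json", "r", encoding="utf-8") as f:
        samples = json.load(f)

    print(f"{len(samples)} demo samples")
    print(json.dumps(samples[0], indent=2, ensure_ascii=False))  # mirror these fields in your own data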

Evaluation Benchmarks:

  • Image Tasks: MMBench, SQA, TextVQA, ChartQA, DocVQA, OCRBench
  • Video Tasks: VideoMME, NeXT-QA, SEED-Bench-Video, MVBench

All benchmarks can be downloaded from their official sources by following the original instructions. Many of the original datasets are provided in .parquet format; however, we convert most of them into .json files with images stored separately (personal preference).
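
As an illustration, such a conversion might look like the sketch below (the column names "question", "answer", "image" and the Hugging Face-style image dict are assumptions; adapt them to each benchmark):

    import io
    import json
    import os

    import pandas as pd
    from PIL import Image

    # Convert one .parquet benchmark split into a .json annotation file,
    # storing the images separately on disk (sketch with hypothetical column names).
    df = pd.read_parquet("benchmark.parquet")
    os.makedirs("images", exist_ok=True)

    records = []
    for i, row in df.iterrows():
        img_file = f"images/{i:06d}.png"
        # Hugging Face-style parquet files often store images as {"bytes": ..., "path": ...}.
        Image.open(io.BytesIO(row["image"]["bytes"])).save(img_file)
        records.append({
            "index": int(i),
            "question": row["question"],
            "answer": row["answer"],
            "image_path": img_file,
        })

    with open("benchmark.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)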

We also provide several example annotation files under ./data/XXX_benchmark to illustrate the expected data format and directory structure.

Custom Benchmarks:

We develop a simple, user-friendly pipeline that ensures inference is fully compatible with the training forward pass. To add a new benchmark, you can follow the implementation of existing ones:

  1. Implement the benchmark logic in ./src/evaluate/benchmarks/NEW_BENCH.py

    class CustomDataset(object):
        modality = "image" # or "video"
        
        def __init__(self, image_path="", anno_path="", pre_prompt="", post_prompt=""):
            # Load your annotations here
            self.data = [] 
    
        def __len__(self):
            return len(self.data)
    
        def __getitem__(self, idx):     
            # 1. Prepare common fields
            res = {
                "index": idx,
                "prompt": "Your formatted prompt",
                "GT": "Ground truth answer"
            }
    
            # 2. Add Media: Supports Image (PIL/Path) or Video (Path)
            # For Image Benchmarks:
            res.update({
                "image": image, # Supports PIL.Image object OR image_path
                "image_path": image_path 
            })
    
            # OR For Video Benchmarks:
            # res.update({
            #     "video_path": video_path
            # })
    
            return res
  2. Implement the corresponding evaluation metrics in ./src/evaluate/benchmarks/metrics/eval_NEW_BENCH.py (a sketch is shown after this list)

  3. Update ./src/evaluate/benchmarks/benchmarks_config.py to register the new benchmark (see the sketch below)
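
The sketches below illustrate steps 2 and 3. The evaluate() entry point, the "prediction" field, and the DATASET_CONFIG keys are assumptions; align them with the existing benchmarks in ./src/evaluate/benchmarks/.

    # ./src/evaluate/benchmarks/metrics/eval_NEW_BENCH.py (hypothetical sketch)
    def evaluate(results):
        """Toy accuracy metric over inference results.

        `results` is assumed to be a list of dicts carrying the fields produced by
        CustomDataset (e.g. "GT") plus the model output, called "prediction" here.
        """
        correct = sum(
            r["GT"].strip().lower() in r["prediction"].strip().lower() for r in results
        )
        return {"accuracy": 100.0 * correct / max(len(results), 1)}

    # ./src/evaluate/benchmarks/benchmarks_config.py (hypothetical entry)
    DATASET_CONFIG = {
        "NEW_BENCH": {
            "anno_path": "/path/to/NEW_BENCH/annotations.json",
            "image_path": "/path/to/NEW_BENCH/images",
        },
    }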


πŸ‹οΈβ€β™€οΈ Training

Run Training

Before launching, set the following variables (see script/run_sft.sh):

  • MODEL_PATH: path to the pretrained model
  • DATA_ROOT: root directory of your training data
  • DATA_JSON: JSON file describing the dataset (examples in ./data/demo.json)

bash script/run_sft.sh
# For debugging (single GPU/NPU, no deepspeed, supports breakpoints):
# bash script/run_sft.sh debug

🧪 Evaluation

  • MODEL_PATH: path to the model checkpoint
  • BENCHMARKS: list of benchmarks for evaluation
  • PPE_CONFIG: configuration options for different compression settings
  • ⚠️ Reminder: Edit DATASET_CONFIG in ./src/evaluate/benchmarks_config.py according to your local setup.

Run Inference

MODEL_PATH=/path/to/model bash script/run_infer.sh
# For debugging (single GPU/NPU, supports breakpoints):
# bash script/run_infer.sh debug

📊 Benchmarks

🖼️ Image Tasks

1. Experiments on Qwen2.5-VL-3B-Instruct

| Qwen2.5-VL-3B-Instruct | Method | MMBench (EN) | MMBench (CN) | SQA* | TextVQA | DocVQA | OCRBench | ChartQA | Red. Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Training-Free | Vanilla (report) | 79.10 | 78.10 | 76.14 | 79.30 | 93.90 | 797 | 84.00 | 0% |
| | Chat-UniVi | 81.50 | 80.06 | 74.35 | 37.60 | 19.58 | 307 | 18.72 | 55% |
| | Chat-UniVi + PPE | 82.28 (+0.78) | 81.43 (+1.37) | 74.58 (+0.23) | 73.78 (+36.18) | 66.16 (+46.58) | 598 (+291) | 67.08 (+48.36) | 55% |
| SFT | Dense | 85.89 | 86.07 | 79.39 | 79.50 | 89.44 | 761 | 79.96 | 0% |
| | Chat-UniVi | 84.92 | 83.71 | 77.48 | 57.66 | 52.48 | 535 | 49.60 | 55% |
| | Chat-UniVi + PPE | 84.73 (-0.19) | 84.87 (+1.16) | 78.34 (+0.86) | 77.14 (+19.48) | 76.79 (+24.31) | 691 (+156) | 74.52 (+24.92) | 55% |

* denotes reproduction results, as these benchmarks are not reported in the original paper.

2. Experiments on Qwen2.5-VL-7B-Instruct

We further extended our experiments to the 7B model. However, due to time and resource constraints, we trained it on only 1/5 of the data used for the 3B model.

| Qwen2.5-VL-7B-Instruct | Method | MMBench (EN) | MMBench (CN) | SQA* | TextVQA | DocVQA | OCRBench | ChartQA | Red. Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Training-Free | Vanilla (report) | 83.50 | 83.40 | 85.52 | 84.90 | 95.70 | 864 | 87.30 | 0% |
| | Chat-UniVi | 83.23 | 80.18 | 80.49 | 35.82 | 27.06 | 479 | 19.92 | 55% |
| | Chat-UniVi + PPE | 83.58 (+0.35) | 82.35 (+2.17) | 81.59 (+1.10) | 63.44 (+27.62) | 66.42 (+39.36) | 577 (+98) | 46.72 (+26.80) | 55% |
| SFT | Dense | 86.90 | 85.35 | 84.83 | 87.20 | 92.97 | 826 | 86.32 | 0% |
| | Chat-UniVi | 86.23 | 84.25 | 82.40 | 54.92 | 50.01 | 584 | 43.96 | 55% |
| | Chat-UniVi + PPE | 86.26 (+0.03) | 84.85 (+0.60) | 83.56 (+1.16) | 82.46 (+27.54) | 85.84 (+35.83) | 764 (+180) | 78.88 (+34.92) | 55% |

* denotes reproduction results, as these benchmarks are not reported in the original paper.

🎬 Video Tasks

| Qwen2.5-VL-3B-Instruct | Method | VideoMME (w/o subs) | VideoMME (w/ subs) | NeXT-QA (MC) | NeXT-QA (OE) | SEED-Bench-Video | MVBench | Avg. | Red. Ratio |
|---|---|---|---|---|---|---|---|---|---|
| SFT | Dense | 57.81 | 57.96 | 78.20 | 31.65 | 57.60 | 67.90 | 58.52 | 0% |
| | Chat-UniVi | 57.22 | 57.22 | 77.63 | 25.37 | 56.08 | 66.90 | 56.74 | 55% |
| | Chat-UniVi + PPE | 58.70 (+1.48) | 59.07 (+1.85) | 78.42 (+0.79) | 32.61 (+7.24) | 55.98 (-0.10) | 67.38 (+0.48) | 58.69 (+1.95) | 55% |
| | + PPE Cascade | 58.48 | 58.52 | 78.20 | 32.20 | 56.11 | 67.35 | 58.48 | 90% |

📌 Citation

If you find this work helpful, please consider citing us:

@article{huang2025ppe,
  title={PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models},
  author={Mouxiao Huang and Borui Jiang and Dehua Zheng and Hailin Hu and Kai Han and Xinghao Chen},
  journal={arXiv preprint arXiv:2510.22936},
  year={2025}
}

❓ FAQ

Q1: Why are some baseline comparisons missing from this repo?

A: For convenience, we conducted those comparisons directly with the official implementations of PACT, ToMe, VisionZip, and other baselines.

Q2: Why are some benchmarks missing from this repo?

A: Due to internal compliance and the lengthy review process required for exporting code from our corporate environment, some benchmark implementations are currently unavailable. Even though these are based on open-source standards, the export process remains restrictive. However, we have designed the pipeline to be highly extensible. We encourage you to implement your own benchmarks using our straightforward template; it is designed for a seamless, plug-and-play experience.

Q3: Why is K=8 by default?

A: This version adapts to Qwen2.5-VL, which originally uses 3D-MRoPE (mrope_section=[16, 24, 24]). K=8 works well for both video and image experiments. For experiments strictly aligned with the paper's image-only results, please manually switch to 2D-MRoPE.

Q4: What if we set K=32 when using PPE with mrope_section=[16, 24, 24]?

A: It falls back to a repeating [1(T), 1(H), 1(W), ...] pattern. Since the $64$ rotary sections in total ($16 + 24 + 24$) are not divisible by $3$, this only yields $21$ complete $T/H/W$ triplets, leaving the remainder incomplete. Although not the intended implementation, the performance remains decent because the compressed token still captures multiple positional cues rather than just one.

Q5: Is this the full implementation of the paper?

A: No. The currently released code is a cleaned and re-implemented version optimized for readability. Fully migrating and organizing every single experiment involves a significant amount of redundant manual labor. More importantly, the core idea of PPE is elegantly simple and easy to implement: compressed token RoPE embeddings should represent multiple original positions rather than a single point. Our goal is to provide this key insight to the community to foster further discussion and collaborative exploration.
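
To make that intuition concrete, here is a minimal sketch (illustrative only, not the repository implementation; the function name, the even-sampling strategy, and the default of 8 chunks are assumptions):

    import torch

    def ppe_position_ids(merged_positions: torch.Tensor, num_chunks: int = 8) -> torch.Tensor:
        """Sketch of the PPE intuition: a compressed token keeps several of the original
        position ids, one per chunk of RoPE channels, instead of a single collapsed position."""
        n = merged_positions.numel()
        # Evenly sample `num_chunks` representative positions from the merged span.
        idx = torch.linspace(0, n - 1, steps=num_chunks).round().long()
        return merged_positions[idx]  # shape: (num_chunks,)

    # Example: 12 visual tokens collapsed into one compressed token.
    positions = torch.arange(100, 112)
    print(ppe_position_ids(positions))  # each RoPE chunk is then rotated with its own position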


🤝 Contributing

We welcome contributions from the community! Here's how to get started:

  1. Fork this repository
  2. Create a new feature branch: git checkout -b feature/your-feature-name
  3. Make your changes and commit them
  4. Push your branch and open a pull request

📬 Contact

  • 💬 For questions, suggestions, or bug reports, please open an issue on GitHub or email us.

⚖️ License

📄 This project is licensed under the Apache License 2.0.


🙏 Acknowledgements

We build upon the inspiring work of the open-source community.
