Unified Multimodal Understanding via Byte-Pair Visual Encoding


Being-VL-0.5 is a multimodal large language model (MLLM) that unifies text and image understanding through a novel approach called Visual Byte-Pair Encoding (vBPE). Instead of treating images and text as completely separate modalities, our method applies BPE tokenization directly to visual tokens, producing a more unified representation that helps the model better capture relationships between vision and language.

For more details, please refer to our paper: Unified Multimodal Understanding via Byte-Pair Visual Encoding (ICCV'25).
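As background, BPE over visual tokens can be pictured as a greedy pair-merging loop over sequences of VQ token IDs. The sketch below is a minimal illustration of that idea under simplifying assumptions (1D sequences, a toy greedy loop); it is not the actual Being-VL implementation, and all function names and token IDs are hypothetical.

```python
# Minimal, hypothetical sketch of BPE over VQ token IDs: repeatedly
# merge the most frequent adjacent pair into a new composite token.
# Illustrative only; not the Being-VL implementation.
from collections import Counter

def apply_merge(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_vbpe(sequences, num_merges, first_new_id):
    """Learn greedy merge rules over sequences of VQ token IDs."""
    merges, seqs, next_id = [], [list(s) for s in sequences], first_new_id
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best, _ = pairs.most_common(1)[0]
        merges.append((best, next_id))
        seqs = [apply_merge(seq, best, next_id) for seq in seqs]
        next_id += 1
    return merges

def encode(seq, merges):
    """Apply learned merges in order to a fresh token sequence."""
    for pair, new_id in merges:
        seq = apply_merge(list(seq), pair, new_id)
    return seq

# Toy example: VQ IDs 1 and 2 co-occur most often, so they merge into 100.
merges = train_vbpe([[1, 2, 1, 2, 3], [1, 2, 4]], num_merges=1, first_new_id=100)
encoded = encode([1, 2, 3], merges)
```

Merged tokens then act as ordinary vocabulary entries, which is what lets frequently co-occurring visual patterns be represented by a single ID.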

News

  • [2025-12-22]: To make development more convenient, we provide an experimental version of the vBPE dictionary. With it, you can skip vBPE training and proceed directly to model development. Please download it via this link.
  • [2025-07-24]: 🎉🎉 Our paper is selected as ICCV Highlight!
  • [2025-07-12]: 🔥🔥 We release the code and training scripts!
  • [2025-06-26]: 🎉🎉 We release Being-VL-0.5, which has been accepted by ICCV 2025! Check our paper here. The code and training scripts will be released soon.
  • [2025-01-23]: 🎉🎉 Being-VL-0 is accepted by ICLR 2025! Check our paper here.
  • [2024-10-03]: We publish Being-VL-0, the first version of the Being-VL series.

Quick Start

1. Installation

pip install -e .
pip install flash-attn --no-build-isolation
pip install -e ./transformers

We have made some modifications to the transformers library to support our model, so please install from the provided transformers folder as shown above.

2. Directory Setup

Create a workspace with the following structure:

/path/to/your/workspace/
├── models/
│   ├── Llama-3.1-8B/             # Base LLaMA model
│   ├── BeingVL-VQ-8K/            # Being VQ-GAN model
│   ├── being-tokenizer/          # Being tokenizer
│   └── beingvl/                  # Output models
│       ├── base/                 # Initialized model
│       ├── stage-1/              # Stage 1 output
│       ├── stage-2/              # Stage 2 output
│       └── stage-3/              # Final model
├── data/
│   ├── images/                   # Raw images
│   ├── annotations/              # JSON annotations
│   ├── vq_tokens/                # VQ encoded tokens (.npy)
│   ├── vbpe/                     # vBPE tokenizer (.pkl)
│   └── tokenized/                # Tokenized datasets (.jsonl)
│       ├── pt/                   # Stage 1 PT data
│       ├── sft_stage2/           # Stage 2 SFT data
│       └── sft_stage3/           # Stage 3 SFT data
└── logs/                         # Training logs
    ├── stage-1/
    ├── stage-2/
    └── stage-3/
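The layout above can be scaffolded in one command. This is a convenience sketch, not part of the repository's tooling; it uses a relative `./workspace` path as a placeholder, which you should replace with your actual workspace path.

```shell
# Hypothetical scaffold for the workspace layout above (bash brace
# expansion). Replace WORKSPACE with your actual path.
WORKSPACE=./workspace
mkdir -p "$WORKSPACE"/models/{Llama-3.1-8B,BeingVL-VQ-8K,being-tokenizer} \
         "$WORKSPACE"/models/beingvl/{base,stage-1,stage-2,stage-3} \
         "$WORKSPACE"/data/{images,annotations,vq_tokens,vbpe} \
         "$WORKSPACE"/data/tokenized/{pt,sft_stage2,sft_stage3} \
         "$WORKSPACE"/logs/{stage-1,stage-2,stage-3}
```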

3. Base Model Initialization

Requirements:

  • A downloaded Llama-3.1-8B checkpoint. You can also use any other text-only LLM, but this requires additional configuration (e.g., dimensions, processing code).
  • Pretrained VQ-GAN checkpoint. This is extracted from Meta's Chameleon weights and converted to adapt to Being-VL. You can also use your own VQ-GAN models.
  • Being tokenizer config: beingvl/config/being-tokenizer-config
  • (Optional) Download our experimental version of the vBPE dictionary via this link. Please note that this is a visual BPE dictionary pre-trained on a small image dataset, intended only for rapid development and verification. For better performance, we still recommend that you train your own vBPE based on your datasets.
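The directory layout above stores the vBPE tokenizer as a .pkl file under data/vbpe/. A .pkl file is simply a Python pickle; the round-trip below shows how such a file can be written and read back. The dictionary structure used here (a mapping from an ID pair to a merged ID) is an illustrative assumption, not the actual file format.

```python
# Illustrative round-trip of a toy vBPE dictionary to/from a .pkl file.
# The structure {(id, id): merged_id} is an assumption for illustration.
import os
import pickle
import tempfile

toy_vbpe = {(101, 102): 8192, (8192, 103): 8193}

path = os.path.join(tempfile.gettempdir(), "vbpe_toy.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_vbpe, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)
```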

Initialize the Being-VL base model from Llama-3.1-8B using the provided tokenizer configuration:

# Download Llama-3.1-8B model (if not already available)
# Place it in /path/to/your/workspace/models/Llama-3.1-8B/

# Initialize Being-VL base model
python beingvl/utils/convert_llama_beingvl.py \
    --llama_path /path/to/your/workspace/models/Llama-3.1-8B \
    --being_tokenizer_config_path beingvl/config/being-tokenizer-config \
    --being_vq_path /path/to/your/workspace/models/BeingVL-VQ-8K \
    --output_path /path/to/your/workspace/models/beingvl/base \
    --verify_loading

This creates and initializes a Being-VL base model with extended vocabulary for VQ and vBPE tokens, ready for 3-stage training.
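Conceptually, extending the vocabulary means appending freshly initialized embedding rows for the new VQ and vBPE token IDs after the pretrained text vocabulary. The sketch below illustrates this with NumPy; all sizes and the initialization scale are illustrative assumptions, and this is not the logic of convert_llama_beingvl.py.

```python
# Conceptual sketch of vocabulary extension: keep the pretrained text
# embedding rows and append newly initialized rows for VQ + vBPE IDs.
# All sizes below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
text_vocab, vq_tokens, vbpe_tokens, dim = 1000, 64, 32, 16

text_emb = rng.normal(size=(text_vocab, dim))                    # pretrained rows
new_rows = rng.normal(scale=0.02, size=(vq_tokens + vbpe_tokens, dim))  # new rows

extended = np.concatenate([text_emb, new_rows], axis=0)
```

After extension, text token IDs keep their original rows, while IDs at and above text_vocab index the new visual entries.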

4. Training Pipeline

Being-VL uses a three-stage training pipeline:

# Stage 1
bash beingvl/scripts/train-stage-1.sh 0

# Stage 2
bash beingvl/scripts/train-stage-2.sh 0

# Stage 3
bash beingvl/scripts/train-stage-3.sh 0

5. Documentation

For detailed instructions, see:

  • Data.md: Data preparation and VQ encoding
  • Train.md: Training configuration and commands
  • Inference.md: Using the trained model for inference

Acknowledgements

We thank the open-source projects Chameleon and Transformers, on which our code is built.

Disclaimer

The code has been refactored from our development version and may not be fully tested in the new codebase. Some minor functions are still in progress and will be included later. If you encounter any issues, please open an issue to help us improve the project.

Citation

If you find our work useful, please consider citing us and giving our repository a star! 🌟🌟🌟

Being-VL-0.5

@inproceedings{zhang2025beingvl05,
  title={Unified Multimodal Understanding via Byte-Pair Visual Encoding},
  author={Zhang, Wanpeng and Feng, Yicheng and Luo, Hao and Li, Yijiang and Yue, Zihao and Zheng, Sipeng and Lu, Zongqing},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}

Being-VL-0

@inproceedings{zhang2025beingvl0,
  title={From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities},
  author={Zhang, Wanpeng and Xie, Zilong and Feng, Yicheng and Li, Yijiang and Xing, Xingrun and Zheng, Sipeng and Lu, Zongqing},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=3TnLGGHhNx}
}
