Unified Multimodal Understanding via Byte-Pair Visual Encoding


Being-VL-0.5 is a multimodal large language model (MLLM) that unifies text and image understanding through a novel approach called Visual Byte-Pair Encoding (vBPE). Instead of treating images and text as completely separate modalities, our method applies BPE tokenization directly to visual tokens, producing a more unified representation that helps the model better capture relationships between vision and language.

For more details, please refer to our paper: Unified Multimodal Understanding via Byte-Pair Visual Encoding (ICCV'25).
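As background, BPE over visual tokens can be pictured as a greedy pair-merging loop over sequences of VQ token IDs. The sketch below is a minimal illustration of that idea under simplifying assumptions (1D sequences, a toy greedy loop); it is not the actual Being-VL implementation, and all function names and token IDs are hypothetical.

```python
# Minimal, hypothetical sketch of BPE over VQ token IDs: repeatedly
# merge the most frequent adjacent pair into a new composite token.
# Illustrative only; not the Being-VL implementation.
from collections import Counter

def apply_merge(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_vbpe(sequences, num_merges, first_new_id):
    """Learn greedy merge rules over sequences of VQ token IDs."""
    merges, seqs, next_id = [], [list(s) for s in sequences], first_new_id
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best, _ = pairs.most_common(1)[0]
        merges.append((best, next_id))
        seqs = [apply_merge(seq, best, next_id) for seq in seqs]
        next_id += 1
    return merges

def encode(seq, merges):
    """Apply learned merges in order to a fresh token sequence."""
    for pair, new_id in merges:
        seq = apply_merge(list(seq), pair, new_id)
    return seq

# Toy example: VQ IDs 1 and 2 co-occur most often, so they merge into 100.
merges = train_vbpe([[1, 2, 1, 2, 3], [1, 2, 4]], num_merges=1, first_new_id=100)
encoded = encode([1, 2, 3], merges)
```

Merged tokens then act as ordinary vocabulary entries, which is what lets frequently co-occurring visual patterns be represented by a single ID.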

News

  • [2025-12-22]: To make development more convenient, we provide an experimental version of the vBPE dictionary. With it, you can skip vBPE training and proceed directly to model development. Please download it via this link.
  • [2025-07-24]: 🎉🎉 Our paper is selected as ICCV Highlight!
  • [2025-07-12]: 🔥🔥 We release the code and training scripts!
  • [2025-06-26]: 🎉🎉 We release Being-VL-0.5, which has been accepted by ICCV 2025! Check our paper here. The code and training scripts will be released soon.
  • [2025-01-23]: 🎉🎉 Being-VL-0 is accepted by ICLR 2025! Check our paper here.
  • [2024-10-03]: We publish Being-VL-0, the first version of the Being-VL series.

Quick Start

1. Installation

pip install -e .
pip install flash-attn --no-build-isolation
pip install -e ./transformers

We have made some modifications to the transformers library to support our model, so please install from the provided transformers folder as shown above.

2. Directory Setup

Create a workspace with the following structure:

/path/to/your/workspace/
├── models/
│   ├── Llama-3.1-8B/             # Base LLaMA model
│   ├── BeingVL-VQ-8K/            # Being VQ-GAN model
│   ├── being-tokenizer/          # Being tokenizer
│   └── beingvl/                  # Output models
│       ├── base/                 # Initialized model
│       ├── stage-1/              # Stage 1 output
│       ├── stage-2/              # Stage 2 output
│       └── stage-3/              # Final model
├── data/
│   ├── images/                   # Raw images
│   ├── annotations/              # JSON annotations
│   ├── vq_tokens/                # VQ encoded tokens (.npy)
│   ├── vbpe/                     # vBPE tokenizer (.pkl)
│   └── tokenized/                # Tokenized datasets (.jsonl)
│       ├── pt/                   # Stage 1 PT data
│       ├── sft_stage2/           # Stage 2 SFT data
│       └── sft_stage3/           # Stage 3 SFT data
└── logs/                         # Training logs
    ├── stage-1/
    ├── stage-2/
    └── stage-3/
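The layout above can be scaffolded in one command. This is a convenience sketch, not part of the repository's tooling; it uses a relative `./workspace` path as a placeholder, which you should replace with your actual workspace path.

```shell
# Hypothetical scaffold for the workspace layout above (bash brace
# expansion). Replace WORKSPACE with your actual path.
WORKSPACE=./workspace
mkdir -p "$WORKSPACE"/models/{Llama-3.1-8B,BeingVL-VQ-8K,being-tokenizer} \
         "$WORKSPACE"/models/beingvl/{base,stage-1,stage-2,stage-3} \
         "$WORKSPACE"/data/{images,annotations,vq_tokens,vbpe} \
         "$WORKSPACE"/data/tokenized/{pt,sft_stage2,sft_stage3} \
         "$WORKSPACE"/logs/{stage-1,stage-2,stage-3}
```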

3. Base Model Initialization

Requirements:

  • A downloaded Llama-3.1-8B checkpoint. You can also use any other text-only LLM, but this requires additional configuration (e.g., dimensions, processing code).
  • Pretrained VQ-GAN checkpoint. This is extracted from Meta's Chameleon weights and converted to adapt to Being-VL. You can also use your own VQ-GAN models.
  • Being tokenizer config: beingvl/config/being-tokenizer-config
  • (Optional) Download our experimental version of the vBPE dictionary via this link. Please note that this is a visual BPE dictionary pre-trained on a small image dataset, intended only for rapid development and verification. For better performance, we still recommend that you train your own vBPE based on your datasets.
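The directory layout above stores the vBPE tokenizer as a .pkl file under data/vbpe/. A .pkl file is simply a Python pickle; the round-trip below shows how such a file can be written and read back. The dictionary structure used here (a mapping from an ID pair to a merged ID) is an illustrative assumption, not the actual file format.

```python
# Illustrative round-trip of a toy vBPE dictionary to/from a .pkl file.
# The structure {(id, id): merged_id} is an assumption for illustration.
import os
import pickle
import tempfile

toy_vbpe = {(101, 102): 8192, (8192, 103): 8193}

path = os.path.join(tempfile.gettempdir(), "vbpe_toy.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_vbpe, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)
```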

Initialize the Being-VL base model from Llama-3.1-8B using the provided tokenizer configuration:

# Download Llama-3.1-8B model (if not already available)
# Place it in /path/to/your/workspace/models/Llama-3.1-8B/

# Initialize Being-VL base model
python beingvl/utils/convert_llama_beingvl.py \
    --llama_path /path/to/your/workspace/models/Llama-3.1-8B \
    --being_tokenizer_config_path beingvl/config/being-tokenizer-config \
    --being_vq_path /path/to/your/workspace/models/BeingVL-VQ-8K \
    --output_path /path/to/your/workspace/models/beingvl/base \
    --verify_loading

This creates and initializes a Being-VL base model with extended vocabulary for VQ and vBPE tokens, ready for 3-stage training.
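Conceptually, extending the vocabulary means appending freshly initialized embedding rows for the new VQ and vBPE token IDs after the pretrained text vocabulary. The sketch below illustrates this with NumPy; all sizes and the initialization scale are illustrative assumptions, and this is not the logic of convert_llama_beingvl.py.

```python
# Conceptual sketch of vocabulary extension: keep the pretrained text
# embedding rows and append newly initialized rows for VQ + vBPE IDs.
# All sizes below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
text_vocab, vq_tokens, vbpe_tokens, dim = 1000, 64, 32, 16

text_emb = rng.normal(size=(text_vocab, dim))                    # pretrained rows
new_rows = rng.normal(scale=0.02, size=(vq_tokens + vbpe_tokens, dim))  # new rows

extended = np.concatenate([text_emb, new_rows], axis=0)
```

After extension, text token IDs keep their original rows, while IDs at and above text_vocab index the new visual entries.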

4. Training Pipeline

Being-VL uses a three-stage training pipeline:

# Stage 1
bash beingvl/scripts/train-stage-1.sh 0

# Stage 2
bash beingvl/scripts/train-stage-2.sh 0

# Stage 3
bash beingvl/scripts/train-stage-3.sh 0

5. Documentation

For detailed instructions, see:

  • Data.md: Data preparation and VQ encoding
  • Train.md: Training configuration and commands
  • Inference.md: Using the trained model for inference

Acknowledgements

We thank the open-source projects Chameleon and Transformers, on which our code is built.

Disclaimer

The code has been refactored from our development version and may not be fully tested in the new codebase. Some minor functions are still in progress and will be included later. If you encounter any issues, please open an issue to help us improve the project.

Citation

If you find our work useful, please consider citing us and giving our repository a star! 🌟🌟🌟

Being-VL-0.5

@inproceedings{zhang2025beingvl05,
  title={Unified Multimodal Understanding via Byte-Pair Visual Encoding},
  author={Zhang, Wanpeng and Feng, Yicheng and Luo, Hao and Li, Yijiang and Yue, Zihao and Zheng, Sipeng and Lu, Zongqing},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}

Being-VL-0

@inproceedings{zhang2025beingvl0,
  title={From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities},
  author={Zhang, Wanpeng and Xie, Zilong and Feng, Yicheng and Li, Yijiang and Xing, Xingrun and Zheng, Sipeng and Lu, Zongqing},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=3TnLGGHhNx}
}
