Authors: Lingjun Mao, Rodolfo Corona, Xin Liang, Wenhao Yan, Zineng Tang†
Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a Dynamic Output Vision Encoder that produces a variable number of tokens to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. In several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods while using far fewer tokens, capturing more expressive semantic features than fixed-length encoding. We further extend DOVE with query-conditioned tokenization: by guiding the model to focus on query-relevant regions, it achieves more efficient and targeted semantic extraction.
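As a conceptual illustration only (not the DOVE architecture or its actual API), variable-length encoding can be thought of as decoding an image from a per-image prefix of its token sequence; `encoder`, `decoder`, and the tensor shapes below are hypothetical placeholders:

```python
# Conceptual sketch only; `encoder` and `decoder` are hypothetical placeholders,
# not the DOVE modules.
def reconstruct_with_budget(encoder, decoder, image, num_tokens):
    tokens = encoder(image)              # e.g. a (1, max_tokens, dim) tensor
    prefix = tokens[:, :num_tokens, :]   # keep only the first `num_tokens` tokens
    return decoder(prefix)               # decode the shorter sequence back to pixels

# A cluttered scene may need a long prefix for a faithful reconstruction,
# while a blank wall can be reconstructed from just a few tokens.
```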
git clone https://github.com/mao1207/DOVE.git
cd DOVE
conda create -n dove python=3.10
conda activate dove
pip install -r requirements.txt
Key dependencies:
- PyTorch ≥ 2.0 (tested on 2.5.1)
- torchvision ≥ 0.15 (tested on 0.20.1)
- HuggingFace Transformers ≥ 4.30 (tested on 4.51.3)
- Diffusers ≥ 0.30 (tested on 0.33.1)
- accelerate, einops, timm, scikit-learn, matplotlib, pandas
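As an optional sanity check (not part of the repository), you can confirm that the installed versions match the tested ones:

```python
# Optional environment check: report installed versions of the key dependencies.
import torch, torchvision, transformers, diffusers

print("PyTorch:", torch.__version__)               # tested on 2.5.1
print("torchvision:", torchvision.__version__)     # tested on 0.20.1
print("Transformers:", transformers.__version__)   # tested on 4.51.3
print("Diffusers:", diffusers.__version__)         # tested on 0.33.1
print("CUDA available:", torch.cuda.is_available())
```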
Our model is initialized from pretrained VQGAN and Pythia language models, then jointly finetuned.
Before training or evaluation, download the following components from HuggingFace and place them in the root directory:
| Component | Description | HuggingFace Link |
|---|---|---|
| `vqvae-amused` | VQGAN visual tokenizer | mao1207/vqvae-amused |
| `pythia-14m` (optional) | Small language model backbone | EleutherAI/pythia-14m |
| `pythia-70m` (optional) | Larger language model backbone | EleutherAI/pythia-70m |
You can also download pretrained DOVE and Q-DOVE checkpoints:
| Model Variant | HuggingFace Link |
|---|---|
| DOVE | mao1207/DOVE |
| Q-DOVE | mao1207/Q-DOVE |
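One convenient way to fetch these is `snapshot_download` from `huggingface_hub`; the repository IDs come from the tables above, while the `local_dir` layout below is only illustrative:

```python
# Download the pretrained components and checkpoints from HuggingFace.
# The local_dir layout is illustrative; place them wherever your paths expect.
from huggingface_hub import snapshot_download

for repo_id in ["mao1207/vqvae-amused", "EleutherAI/pythia-14m", "mao1207/DOVE"]:
    snapshot_download(repo_id=repo_id, local_dir=repo_id.split("/")[-1])
```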
You can visualize how a single image is reconstructed under different token lengths:
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch \
--nproc_per_node=1 \
--use_env \
--master_port=29500 \
evaluation/image_reconstruction/test_single_image.py \
--image_path path/to/image.png \
--question "null" \
--model_path path/to/dove_checkpoint.pth \
--output_path path/to/output_image.png
This produces a single output image showing side-by-side reconstructions with varying token lengths.
If using a Q-DOVE checkpoint, you may provide a text prompt via --question.
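If you want to check a checkpoint before launching the script, a plain `torch.load` is enough to inspect its contents (this assumes nothing beyond a standard torch-serialized `.pth` file):

```python
# Inspect a downloaded checkpoint; assumes a standard torch-serialized .pth file.
import torch

state = torch.load("path/to/dove_checkpoint.pth", map_location="cpu")
keys = state.keys() if isinstance(state, dict) else []
for k in list(keys)[:20]:   # print the first few top-level keys
    print(k)
```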
You can also run reconstruction over an entire directory of images:
CUDA_VISIBLE_DEVICES=0 python evaluation/image_reconstruction/test_batch_images.py \
--image_dir path/to/images \
--model_path path/to/dove_checkpoint.pth \
--output_folder path/to/output_folder
Each subfolder inside the output directory will correspond to a token length (e.g., 8, 16, 32) and contain the reconstructed images.
For Q-DOVE, you may additionally pass --question_json for query-aware generation.
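A small helper like the following (not part of the repo; the paths and per-token-length folder names are illustrative) can summarize the resulting directory structure:

```python
# Summarize the reconstruction output: one subfolder per token length (e.g., 8, 16, 32).
from pathlib import Path

output_folder = Path("path/to/output_folder")  # the path passed as --output_folder
for sub in sorted(output_folder.iterdir()):
    if sub.is_dir():
        n = sum(1 for f in sub.iterdir() if f.suffix.lower() in {".png", ".jpg", ".jpeg"})
        print(f"token length {sub.name}: {n} reconstructed images")
```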
To evaluate reconstruction quality with FID, use the `pytorch-fid` package (`pip install pytorch-fid`):
pytorch-fid path/to/real_images path/to/reconstructed_images
This can be used to compare different token budgets or model variants.
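FID can also be computed programmatically, e.g. to sweep over token budgets in one script; this sketch assumes a recent `pytorch-fid` release and that the output subfolders are named after their token lengths, which may need adjusting:

```python
# Compute FID for several token budgets; the subfolder names (8, 16, 32) are assumed.
import torch
from pytorch_fid.fid_score import calculate_fid_given_paths

device = "cuda" if torch.cuda.is_available() else "cpu"
real_dir = "path/to/real_images"
for budget in [8, 16, 32]:
    fid = calculate_fid_given_paths(
        [real_dir, f"path/to/output_folder/{budget}"],
        batch_size=50, device=device, dims=2048,
    )
    print(f"{budget} tokens: FID = {fid:.2f}")
```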
We provide a script for evaluating how well DOVE embeddings support classification. The default setting uses CIFAR-100:
CUDA_VISIBLE_DEVICES=0 python evaluation/linear_probe/DOVE_prob.py \
--data_root /path/to/cifar100 \
--ckpt_path ./checkpoints/DVT_epoch_215.pth \
--log_path ./results/linear_probe_cifar100.csv \
--batch_size 128 \
--epochs 100 \
--num_classes 100 \
--token_dim 512 \
--num_tokens 32
This trains a linear classifier on top of frozen DOVE tokens.
You can replace the dataset and adjust --num_classes accordingly.
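For reference, linear probing simply fits a linear classifier on frozen features; the generic sketch below (independent of `DOVE_prob.py`, with random placeholder features standing in for pooled DOVE tokens) shows the idea:

```python
# Generic linear-probing sketch: a linear classifier on top of frozen features.
# The random features below are placeholders for pooled DOVE token embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(train_x, train_y, test_x, test_y):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_x, train_y)
    return accuracy_score(test_y, clf.predict(test_x))

rng = np.random.default_rng(0)
acc = linear_probe(rng.normal(size=(1000, 512)), rng.integers(0, 100, 1000),
                   rng.normal(size=(200, 512)), rng.integers(0, 100, 200))
print(f"probe accuracy: {acc:.3f}")   # ~chance level for random features
```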
You can train DOVE from scratch or resume from a checkpoint:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=29501 \
training/train_dove.py \
--batch_size 24 \
--checkpoint_path path/to/pretrained_or_resume_checkpoint.pth \
--output_images_dir ./data/train_images \
--output_model_dir ./checkpoints/dove \
--epochs 20 \
> train_log.txt 2>&1
Notes:
- Adjust `CUDA_VISIBLE_DEVICES` and `--nproc_per_node` to match your hardware.
- The `--output_images_dir` should point to a folder containing all training images (flat directory, no subfolders).
- If `--checkpoint_path` is omitted, training will start from scratch.
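Since the training images are expected in a single flat folder, a minimal dataset matching that layout could look like this (a sketch, not the repo's own data-loading code):

```python
# Flat-directory image dataset matching the --output_images_dir layout
# (all training images in one folder, no subfolders). Illustrative only.
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class FlatImageFolder(Dataset):
    def __init__(self, root, transform=None):
        self.paths = sorted(p for p in Path(root).iterdir()
                            if p.suffix.lower() in {".png", ".jpg", ".jpeg"})
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img) if self.transform else img
```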
To enable query-conditioned tokenization:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=29502 \
training/train_qdove.py \
--batch_size 24 \
--checkpoint_path path/to/q_dove_checkpoint.pth \
--output_images_dir ./data/train_images \
--output_model_dir ./checkpoints/qdove \
--annotation_json ./data/train_annotations.json \
--dataset_dir ./data/train_images \
--epochs 20 \
> train_qdove_log.txt 2>&1
Requirements:
- `--dataset_dir`: directory of training images.
- `--annotation_json`: JSON file with query annotations per image.
Each JSON entry should look like:
[
{
"image_id": "image_0001",
"question": "What is the dog doing?",
"answer": "Running",
"bounding_box": [[100, 150, 50, 60], [20, 40, 70, 85]]
},
{
"image_id": "image_0002",
"question": "Where is the person?",
"answer": null,
"bounding_box": [[0, 0, 256, 256]]
}
]
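Before training, you can sanity-check the annotation file against this format with a small helper (it assumes nothing beyond the fields shown above):

```python
# Validate that each annotation entry has the documented fields.
import json

with open("./data/train_annotations.json") as f:
    annotations = json.load(f)

for entry in annotations:
    assert {"image_id", "question", "bounding_box"} <= entry.keys(), entry
    for box in entry["bounding_box"]:
        assert len(box) == 4, f"expected 4 values per box, got {box}"
print(f"{len(annotations)} annotations look well-formed")
```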
