
Images are Worth Variable Length of Representations

Authors: Lingjun Mao, Rodolfo Corona, Xin Liang, Wenhao Yan, Zineng Tang†


[Figure: DOVE Framework Overview]

Abstract

Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a Dynamic Output Vision Encoder that produces a variable number of tokens to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. In several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods while using far fewer tokens, capturing more expressive semantic features than fixed-length encoding. We further extend DOVE with query-conditioned tokenization. By guiding the model to focus on query-relevant regions, it achieves more efficient and targeted semantic extraction.


Installation

git clone https://github.com/mao1207/DOVE.git
cd DOVE
conda create -n dove python=3.10
conda activate dove
pip install -r requirements.txt

Key dependencies:

  • PyTorch ≥ 2.0 (tested on 2.5.1)
  • torchvision ≥ 0.15 (tested on 0.20.1)
  • HuggingFace Transformers ≥ 4.30 (tested on 4.51.3)
  • Diffusers ≥ 0.30 (tested on 0.33.1)
  • accelerate, einops, timm, scikit-learn, matplotlib, pandas
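
As an optional sanity check (not part of the repo's scripts), you can print the installed versions and compare them against the tested versions listed above:

python -c "import torch, torchvision, transformers, diffusers; print(torch.__version__, torchvision.__version__, transformers.__version__, diffusers.__version__)"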

Preparation

Our model is initialized from a pretrained VQGAN visual tokenizer and a pretrained Pythia language model, then jointly finetuned.

Before training or evaluation, download the following components from HuggingFace and place them in the root directory:

Component               Description                       HuggingFace Link
vqvae-amused            VQGAN visual tokenizer            mao1207/vqvae-amused
pythia-14m (optional)   Small language model backbone     EleutherAI/pythia-14m
pythia-70m (optional)   Larger language model backbone    EleutherAI/pythia-70m

You can also download pretrained DOVE and Q-DOVE checkpoints:

Model Variant   HuggingFace Link
DOVE            mao1207/DOVE
Q-DOVE          mao1207/Q-DOVE
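
If you prefer the command line to the web UI, one way to fetch these components is with the huggingface_hub CLI (a sketch, assuming huggingface_hub is installed; the local directory names are placeholders, not paths the repo mandates):

pip install -U huggingface_hub
# Download the VQGAN tokenizer and a pretrained DOVE checkpoint into the repo root
huggingface-cli download mao1207/vqvae-amused --local-dir ./vqvae-amused
huggingface-cli download mao1207/DOVE --local-dir ./DOVE-checkpoint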

Evaluation: Image Reconstruction

Single Image Reconstruction

You can visualize how a single image is reconstructed under different token lengths:

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch \
  --nproc_per_node=1 \
  --use_env \
  --master_port=29500 \
  evaluation/image_reconstruction/test_single_image.py \
  --image_path path/to/image.png \
  --question "null" \
  --model_path path/to/dove_checkpoint.pth \
  --output_path path/to/output_image.png

This produces a single output image showing side-by-side reconstructions with varying token lengths. If using a Q-DOVE checkpoint, you may provide a text prompt via --question.
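
For example, a query-conditioned run with a Q-DOVE checkpoint might look like this (the paths and the question are placeholders, shown only to illustrate the flags):

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch \
  --nproc_per_node=1 \
  --use_env \
  --master_port=29500 \
  evaluation/image_reconstruction/test_single_image.py \
  --image_path path/to/image.png \
  --question "What is the dog doing?" \
  --model_path path/to/q_dove_checkpoint.pth \
  --output_path path/to/output_image.png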


Batch Reconstruction

You can also run reconstruction over an entire directory of images:

CUDA_VISIBLE_DEVICES=0 python evaluation/image_reconstruction/test_batch_images.py \
  --image_dir path/to/images \
  --model_path path/to/dove_checkpoint.pth \
  --output_folder path/to/output_folder

Each subfolder inside the output directory will correspond to a token length (e.g., 8, 16, 32) and contain the reconstructed images. For Q-DOVE, you may additionally pass --question_json for query-aware generation.
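
A query-aware batch run adds that flag to the same command (a sketch; the JSON path is a placeholder, and its exact format is not documented here, though it presumably pairs each image with a question, similar to the training annotations shown further below):

CUDA_VISIBLE_DEVICES=0 python evaluation/image_reconstruction/test_batch_images.py \
  --image_dir path/to/images \
  --question_json path/to/questions.json \
  --model_path path/to/q_dove_checkpoint.pth \
  --output_folder path/to/output_folder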


FID Evaluation

To evaluate reconstruction quality with FID:

pytorch-fid path/to/real_images path/to/reconstructed_images

This can be used to compare different token budgets or model variants.
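
pytorch-fid is a separate package (pip install pytorch-fid). Since batch reconstruction writes one subfolder per token length, a simple loop lets you compare token budgets (the folder names 8, 16, 32 follow the example above and may differ in your run):

pip install pytorch-fid
for n in 8 16 32; do
  pytorch-fid path/to/real_images path/to/output_folder/$n
done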


Evaluation: Linear Probe

We provide a script for evaluating how well DOVE embeddings support classification. The default setting uses CIFAR-100:

CUDA_VISIBLE_DEVICES=0 python evaluation/linear_probe/DOVE_prob.py \
  --data_root /path/to/cifar100 \
  --ckpt_path ./checkpoints/DVT_epoch_215.pth \
  --log_path ./results/linear_probe_cifar100.csv \
  --batch_size 128 \
  --epochs 100 \
  --num_classes 100 \
  --token_dim 512 \
  --num_tokens 32

This trains a linear classifier on top of frozen DOVE tokens. You can replace the dataset and adjust --num_classes accordingly.
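
For instance, to probe on CIFAR-10 instead, only the data path, log path, and class count change (this assumes the script accepts other datasets via --data_root in the same way, which is not verified here):

CUDA_VISIBLE_DEVICES=0 python evaluation/linear_probe/DOVE_prob.py \
  --data_root /path/to/cifar10 \
  --ckpt_path ./checkpoints/DVT_epoch_215.pth \
  --log_path ./results/linear_probe_cifar10.csv \
  --batch_size 128 \
  --epochs 100 \
  --num_classes 10 \
  --token_dim 512 \
  --num_tokens 32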


Training DOVE

You can train DOVE from scratch or resume from a checkpoint:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=29501 \
  training/train_dove.py \
  --batch_size 24 \
  --checkpoint_path path/to/pretrained_or_resume_checkpoint.pth \
  --output_images_dir ./data/train_images \
  --output_model_dir ./checkpoints/dove \
  --epochs 20 \
  > train_log.txt 2>&1

Notes:

  • Adjust CUDA_VISIBLE_DEVICES and --nproc_per_node to match your hardware (see the example after this list).
  • The --output_images_dir should point to a folder containing all training images (flat directory, no subfolders).
  • If --checkpoint_path is omitted, training will start from scratch.
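
As an illustration of the first and last notes, a four-GPU run started from scratch simply drops --checkpoint_path and raises --nproc_per_node (the GPU count and batch size here are placeholders; tune them to your memory budget):

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=29501 \
  training/train_dove.py \
  --batch_size 24 \
  --output_images_dir ./data/train_images \
  --output_model_dir ./checkpoints/dove \
  --epochs 20 \
  > train_log.txt 2>&1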

Training Q-DOVE

To enable query-conditioned tokenization:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=29502 \
  training/train_qdove.py \
  --batch_size 24 \
  --checkpoint_path path/to/q_dove_checkpoint.pth \
  --output_images_dir ./data/train_images \
  --output_model_dir ./checkpoints/qdove \
  --annotation_json ./data/train_annotations.json \
  --dataset_dir ./data/train_images \
  --epochs 20 \
  > train_qdove_log.txt 2>&1

Requirements:

  • --dataset_dir: directory of training images.
  • --annotation_json: JSON file with query annotations per image.

Each JSON entry should look like:

[
  {
    "image_id": "image_0001",
    "question": "What is the dog doing?",
    "answer": "Running",
    "bounding_box": [[100, 150, 50, 60], [20, 40, 70, 85]]
  },
  {
    "image_id": "image_0002",
    "question": "Where is the person?",
    "answer": null,
    "bounding_box": [[0, 0, 256, 256]]
  }
]

About

Code for the paper "Images are Worth Variable Length of Representations".
