Authors: Lingjun Mao, Rodolfo Corona, Xin Liang, Wenhao Yan, Zineng Tang†
Most existing vision encoders map images into a fixed-length sequence of tokens, overlooking the fact that different images contain varying amounts of information. For example, a visually complex image (e.g., a cluttered room) inherently carries more information and thus deserves more tokens than a simple image (e.g., a blank wall). To address this inefficiency, we propose DOVE, a Dynamic Output Vision Encoder that produces a variable number of tokens to reconstruct each image. Our results show that DOVE significantly reduces the average number of tokens while maintaining high reconstruction quality. In several linear probing and downstream multimodal tasks, it outperforms existing autoencoder-based tokenization methods while using far fewer tokens, capturing more expressive semantic features than fixed-length encoding. We further extend DOVE with query-conditioned tokenization: by guiding the model to focus on query-relevant regions, it achieves more efficient and targeted semantic extraction.
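As a conceptual illustration only (not the DOVE architecture or its actual API), variable-length encoding can be thought of as decoding an image from a per-image prefix of its token sequence; `encoder`, `decoder`, and the tensor shapes below are hypothetical placeholders:

```python
# Conceptual sketch only; `encoder` and `decoder` are hypothetical placeholders,
# not the DOVE modules.
def reconstruct_with_budget(encoder, decoder, image, num_tokens):
    tokens = encoder(image)              # e.g. a (1, max_tokens, dim) tensor
    prefix = tokens[:, :num_tokens, :]   # keep only the first `num_tokens` tokens
    return decoder(prefix)               # decode the shorter sequence back to pixels

# A cluttered scene may need a long prefix for a faithful reconstruction,
# while a blank wall can be reconstructed from just a few tokens.
```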
git clone https://github.com/mao1207/DOVE.git
cd DOVE
conda create -n dove python=3.10
conda activate dove
pip install -r requirements.txt
Key dependencies:
- PyTorch ≥ 2.0 (tested on 2.5.1)
- torchvision ≥ 0.15 (tested on 0.20.1)
- HuggingFace Transformers ≥ 4.30 (tested on 4.51.3)
- Diffusers ≥ 0.30 (tested on 0.33.1)
- accelerate, einops, timm, scikit-learn, matplotlib, pandas
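As an optional sanity check (not part of the repository), you can confirm that the installed versions match the tested ones:

```python
# Optional environment check: report installed versions of the key dependencies.
import torch, torchvision, transformers, diffusers

print("PyTorch:", torch.__version__)               # tested on 2.5.1
print("torchvision:", torchvision.__version__)     # tested on 0.20.1
print("Transformers:", transformers.__version__)   # tested on 4.51.3
print("Diffusers:", diffusers.__version__)         # tested on 0.33.1
print("CUDA available:", torch.cuda.is_available())
```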
Our model is initialized from pretrained VQGAN and Pythia language models, then jointly finetuned.
Before training or evaluation, download the following components from HuggingFace and place them in the root directory:
| Component | Description | HuggingFace Link |
|---|---|---|
| `vqvae-amused` | VQGAN visual tokenizer | mao1207/vqvae-amused |
| `pythia-14m` (optional) | Small language model backbone | EleutherAI/pythia-14m |
| `pythia-70m` (optional) | Larger language model backbone | EleutherAI/pythia-70m |
You can also download pretrained DOVE and Q-DOVE checkpoints:
| Model Variant | HuggingFace Link |
|---|---|
| DOVE | mao1207/DOVE |
| Q-DOVE | mao1207/Q-DOVE |
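One convenient way to fetch these is `snapshot_download` from `huggingface_hub`; the repository IDs come from the tables above, while the `local_dir` layout below is only illustrative:

```python
# Download the pretrained components and checkpoints from HuggingFace.
# The local_dir layout is illustrative; place them wherever your paths expect.
from huggingface_hub import snapshot_download

for repo_id in ["mao1207/vqvae-amused", "EleutherAI/pythia-14m", "mao1207/DOVE"]:
    snapshot_download(repo_id=repo_id, local_dir=repo_id.split("/")[-1])
```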
You can visualize how a single image is reconstructed under different token lengths:
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch \
--nproc_per_node=1 \
--use_env \
--master_port=29500 \
evaluation/image_reconstruction/test_single_image.py \
--image_path path/to/image.png \
--question "null" \
--model_path path/to/dove_checkpoint.pth \
--output_path path/to/output_image.png
This produces a single output image showing side-by-side reconstructions with varying token lengths.
If using a Q-DOVE checkpoint, you may provide a text prompt via --question.
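If you want to check a checkpoint before launching the script, a plain `torch.load` is enough to inspect its contents (this assumes nothing beyond a standard torch-serialized `.pth` file):

```python
# Inspect a downloaded checkpoint; assumes a standard torch-serialized .pth file.
import torch

state = torch.load("path/to/dove_checkpoint.pth", map_location="cpu")
keys = state.keys() if isinstance(state, dict) else []
for k in list(keys)[:20]:   # print the first few top-level keys
    print(k)
```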
You can also run reconstruction over an entire directory of images:
CUDA_VISIBLE_DEVICES=0 python evaluation/image_reconstruction/test_batch_images.py \
--image_dir path/to/images \
--model_path path/to/dove_checkpoint.pth \
--output_folder path/to/output_folder
Each subfolder inside the output directory will correspond to a token length (e.g., 8, 16, 32) and contain the reconstructed images.
For Q-DOVE, you may additionally pass --question_json for query-aware generation.
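A small helper like the following (not part of the repo; the paths and per-token-length folder names are illustrative) can summarize the resulting directory structure:

```python
# Summarize the reconstruction output: one subfolder per token length (e.g., 8, 16, 32).
from pathlib import Path

output_folder = Path("path/to/output_folder")  # the path passed as --output_folder
for sub in sorted(output_folder.iterdir()):
    if sub.is_dir():
        n = sum(1 for f in sub.iterdir() if f.suffix.lower() in {".png", ".jpg", ".jpeg"})
        print(f"token length {sub.name}: {n} reconstructed images")
```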
To evaluate reconstruction quality with FID, use the `pytorch-fid` package (`pip install pytorch-fid`):
pytorch-fid path/to/real_images path/to/reconstructed_images
This can be used to compare different token budgets or model variants.
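FID can also be computed programmatically, e.g. to sweep over token budgets in one script; this sketch assumes a recent `pytorch-fid` release and that the output subfolders are named after their token lengths, which may need adjusting:

```python
# Compute FID for several token budgets; the subfolder names (8, 16, 32) are assumed.
import torch
from pytorch_fid.fid_score import calculate_fid_given_paths

device = "cuda" if torch.cuda.is_available() else "cpu"
real_dir = "path/to/real_images"
for budget in [8, 16, 32]:
    fid = calculate_fid_given_paths(
        [real_dir, f"path/to/output_folder/{budget}"],
        batch_size=50, device=device, dims=2048,
    )
    print(f"{budget} tokens: FID = {fid:.2f}")
```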
We provide a script for evaluating how well DOVE embeddings support classification. The default setting uses CIFAR-100:
CUDA_VISIBLE_DEVICES=0 python evaluation/linear_probe/DOVE_prob.py \
--data_root /path/to/cifar100 \
--ckpt_path ./checkpoints/DVT_epoch_215.pth \
--log_path ./results/linear_probe_cifar100.csv \
--batch_size 128 \
--epochs 100 \
--num_classes 100 \
--token_dim 512 \
--num_tokens 32
This trains a linear classifier on top of frozen DOVE tokens.
You can replace the dataset and adjust --num_classes accordingly.
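For reference, linear probing simply fits a linear classifier on frozen features; the generic sketch below (independent of `DOVE_prob.py`, with random placeholder features standing in for pooled DOVE tokens) shows the idea:

```python
# Generic linear-probing sketch: a linear classifier on top of frozen features.
# The random features below are placeholders for pooled DOVE token embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(train_x, train_y, test_x, test_y):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_x, train_y)
    return accuracy_score(test_y, clf.predict(test_x))

rng = np.random.default_rng(0)
acc = linear_probe(rng.normal(size=(1000, 512)), rng.integers(0, 100, 1000),
                   rng.normal(size=(200, 512)), rng.integers(0, 100, 200))
print(f"probe accuracy: {acc:.3f}")   # ~chance level for random features
```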
You can train DOVE from scratch or resume from a checkpoint:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=29501 \
training/train_dove.py \
--batch_size 24 \
--checkpoint_path path/to/pretrained_or_resume_checkpoint.pth \
--output_images_dir ./data/train_images \
--output_model_dir ./checkpoints/dove \
--epochs 20 \
> train_log.txt 2>&1
Notes:
- Adjust `CUDA_VISIBLE_DEVICES` and `--nproc_per_node` to match your hardware.
- The `--output_images_dir` should point to a folder containing all training images (flat directory, no subfolders).
- If `--checkpoint_path` is omitted, training will start from scratch.
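Since the training images are expected in a single flat folder, a minimal dataset matching that layout could look like this (a sketch, not the repo's own data-loading code):

```python
# Flat-directory image dataset matching the --output_images_dir layout
# (all training images in one folder, no subfolders). Illustrative only.
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class FlatImageFolder(Dataset):
    def __init__(self, root, transform=None):
        self.paths = sorted(p for p in Path(root).iterdir()
                            if p.suffix.lower() in {".png", ".jpg", ".jpeg"})
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img) if self.transform else img
```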
To enable query-conditioned tokenization:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=29502 \
training/train_qdove.py \
--batch_size 24 \
--checkpoint_path path/to/q_dove_checkpoint.pth \
--output_images_dir ./data/train_images \
--output_model_dir ./checkpoints/qdove \
--annotation_json ./data/train_annotations.json \
--dataset_dir ./data/train_images \
--epochs 20 \
> train_qdove_log.txt 2>&1
Requirements:
- `--dataset_dir`: directory of training images.
- `--annotation_json`: JSON file with query annotations per image.
Each JSON entry should look like:
[
{
"image_id": "image_0001",
"question": "What is the dog doing?",
"answer": "Running",
"bounding_box": [[100, 150, 50, 60], [20, 40, 70, 85]]
},
{
"image_id": "image_0002",
"question": "Where is the person?",
"answer": null,
"bounding_box": [[0, 0, 256, 256]]
}
]
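Before training, you can sanity-check the annotation file against this format with a small helper (it assumes nothing beyond the fields shown above):

```python
# Validate that each annotation entry has the documented fields.
import json

with open("./data/train_annotations.json") as f:
    annotations = json.load(f)

for entry in annotations:
    assert {"image_id", "question", "bounding_box"} <= entry.keys(), entry
    for box in entry["bounding_box"]:
        assert len(box) == 4, f"expected 4 values per box, got {box}"
print(f"{len(annotations)} annotations look well-formed")
```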
