Cheng Shi · Yizhou Yu · Sibei Yang
The University of Hong Kong & Sun Yat-sen University
Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across supervision paradigms and downstream tasks, and their fundamental mechanisms have yet to be sufficiently elucidated. Through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: the ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior.
LAST-ViT replaces the standard CLS token with a frequency-domain token selection mechanism:
Before (standard ViT):

```python
# Standard torchvision-style forward: global semantics are read off the CLS token.
x = self._process_input(x)
x = torch.cat([batch_class_token, x], dim=1)
x = self.encoder(x)
cls_token = x[:, 0:1]
return cls_token
```

With LAST-ViT:

```python
x = self.encoder(x)
x_detach = x[:, 1:]  # patch tokens (B, N, D), kept as the raw reference
# Low-pass filter each patch feature along the channel dimension in the
# frequency domain; kernel_size and sigma are the Gaussian filter's hyperparameters.
x = torch.fft.fft(x[:, 1:], dim=-1)
gs_k = self.gaussian_kernel_1d(kernel_size, sigma)
x = torch.fft.fftshift(x, dim=-1)
x = x * gs_k
x = torch.fft.ifftshift(x, dim=-1)
x = torch.fft.ifft(x, dim=-1).real
# Patches least perturbed by smoothing score highest; keep the top-1 per channel.
diff = x_detach / torch.abs(x - x_detach)
_, indices = torch.topk(diff, k=1, dim=1, largest=True)
sel_p = torch.gather(x_detach, 1, indices)
cls_token = torch.mean(sel_p, dim=1)
return cls_token
```

| Training Scenario | Training Script | LAST-ViT Weight |
|---|---|---|
| Self-supervised (DINO) | facebookresearch/dino | Download |
| Text-supervised (CLIP) | mlfoundations/open_clip | Download |
| Label-supervised (ViT-B/16) | cls_pretrain/ | Download |
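The `gaussian_kernel_1d` helper used in the forward pass above is not shown; here is a minimal sketch. The exact window shape and normalization are assumptions, chosen to match an `fftshift`-centered spectrum (DC bin in the middle, high frequencies attenuated toward the edges):

```python
import torch

def gaussian_kernel_1d(kernel_size: int, sigma: float) -> torch.Tensor:
    """1-D Gaussian window centered on the middle bin, peak normalized to 1.

    Intended to multiply an fftshift-ed spectrum, so low frequencies (center)
    pass through while high frequencies (edges) are suppressed.
    """
    coords = torch.arange(kernel_size, dtype=torch.float32)
    coords -= (kernel_size - 1) / 2.0            # center the window
    kernel = torch.exp(-(coords ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.max()                 # keep the center gain at 1

# Example: smooth a (B, N, D) patch-feature tensor along the channel dimension.
x = torch.randn(2, 196, 768)
spec = torch.fft.fftshift(torch.fft.fft(x, dim=-1), dim=-1)
smoothed = torch.fft.ifft(
    torch.fft.ifftshift(spec * gaussian_kernel_1d(768, 32.0), dim=-1), dim=-1
).real
```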
```shell
pip install torch torchvision detectron2 fvcore omegaconf tqdm matplotlib scipy
```

```shell
python cls_pretrain/lazy_train.py \
  --config-file cls_pretrain/conf.py \
  --num-gpus 8 \
  dataloader.train.dataset.root=/path/to/imagenet
```

Evaluate whether the highest-scoring patch from LAST-ViT falls inside the ground-truth object bounding box:
```shell
python visualization/evaluate_patch_hit.py \
  --checkpoint ViT_190k.pth \
  --imagenet-root /path/to/imagenet \
  --batch-size 32
```

- [2026-03] Initial release: code, pre-trained weights (DINO / CLIP / ViT-B16), and patch-bbox evaluation. Sanity check coming soon.
- [2026-04] Thanks to kylin0421 for providing the visualization code.
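For reference, the patch-bbox criterion that `evaluate_patch_hit.py` tests (does the selected patch land inside the ground-truth box?) reduces to a small geometric check. This sketch assumes a 14×14 grid of 16-pixel patches on a 224×224 input and hypothetical helper names; it is not the repository's actual code:

```python
def patch_center(index: int, grid_w: int = 14, patch: int = 16) -> tuple[float, float]:
    """Pixel-space center of patch `index` on a grid_w-wide patch grid."""
    row, col = divmod(index, grid_w)
    return (col + 0.5) * patch, (row + 0.5) * patch

def patch_hit(index: int, bbox: tuple[float, float, float, float],
              grid_w: int = 14, patch: int = 16) -> bool:
    """True if the selected patch's center falls inside bbox = (x0, y0, x1, y1)."""
    cx, cy = patch_center(index, grid_w, patch)
    x0, y0, x1, y1 = bbox
    return x0 <= cx <= x1 and y0 <= cy <= y1

# Patch 0 is the top-left 16x16 block, so its center (8.0, 8.0) hits this box.
print(patch_hit(0, (0, 0, 32, 32)))  # True
```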
```bibtex
@article{shi2026vision,
  title={Vision Transformers Need More Than Registers},
  author={Shi, Cheng and Yu, Yizhou and Yang, Sibei},
  journal={arXiv preprint arXiv:2602.22394},
  year={2026}
}
```