Overview of our proposed DINO-QPM. The pipeline processes (a) the input image with the frozen backbone to produce patch embeddings, which the interpretability adapter transforms to obtain a globally interpretable image classification. We compare (b) the diffuse saliency map of DINO GradCAM, extracted from a linear-probed DINO model, with (c) our DINO-QPM local explanation. The local explanation can be further decomposed into its (d) class-independent, diverse features. Compared to the baseline, we observe a drastic increase in localisation quality, showing that our interpretability adapter successfully isolates semantically meaningful features.
Although visual foundation models like DINOv2 provide state-of-the-art performance as feature extractors, their complex, high-dimensional representations create substantial hurdles for interpretability. This work proposes DINO-QPM, which converts these powerful but entangled features into contrastive, class-independent representations that are interpretable by humans. DINO-QPM is a lightweight interpretability adapter that pursues globally interpretable image classification, adapting the Quadratic Programming Enhanced Model (QPM) to operate on strictly frozen DINO backbones. While classification with visual foundation models typically relies on the CLS token, we deliberately diverge from this standard: by average-pooling the patch embeddings, we directly connect them to the model's features and thereby enable spatial localisation of DINO-QPM's globally interpretable features in the input space. Furthermore, we apply a sparsity loss that minimises spatial scatter and background noise, ensuring that explanations are grounded in relevant object parts. With DINO-QPM, we make the level of interpretability of QPM available as an adapter while exceeding the accuracy of a DINOv2 linear probe. In extensive experiments, evaluated with our newly introduced Plausibility metric alongside established interpretability metrics, DINO-QPM outperforms other methods applicable to frozen visual foundation models in both classification accuracy and explanation quality.
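For intuition, here is a minimal PyTorch sketch of the adapter idea, assuming a per-token MLP over frozen patch embeddings followed by average pooling and a linear head; all class, module, and parameter names are illustrative and do not reflect the repository's actual implementation.

import torch
import torch.nn as nn

class AdapterSketch(nn.Module):
    """Illustrative adapter: per-token MLP on frozen patch embeddings,
    average pooling to a feature vector, linear classification head."""

    def __init__(self, embed_dim: int, n_features: int, n_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(embed_dim, n_features), nn.ReLU())
        self.head = nn.Linear(n_features, n_classes, bias=False)

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (B, N, D) embeddings from the frozen backbone
        feature_maps = self.mlp(patch_tokens)   # (B, N, F): one response per patch and feature
        feature_vec = feature_maps.mean(dim=1)  # (B, F): average pooling over all patch tokens
        return self.head(feature_vec), feature_maps

Because the pooling is linear, every pooled feature can be traced back to individual patches via feature_maps, which is what enables the local explanations in (c) and (d); a sparsity penalty on these maps (e.g. an L1 term) is one way to discourage spatial scatter, though the exact loss follows the paper.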
Comparison with state-of-the-art interpretable models. We report Accuracy, Plausibility, SID@5, Class-Independence, and Contrastiveness (all metrics in %). Features of a model are localised if they have a direct connection to the feature vector used for classification. The Plausibility metric is evaluated only on CUB-200-2011 due to the availability of segmentation masks. Dense CLS representation. For DINO-SLDD and DINO-QSENN, we employ a pipeline closely resembling our proposed method, with the exception of the feature selection mechanisms, which follow prior work.[^1][^2][^3]
Comparison of a Brewer's Blackbird image with a Rusty Blackbird image. From the selected features
Before running the installation, a Conda distribution (Anaconda or Miniconda) must be installed.
From the repository root:
conda env create -f environment.yml
conda activate DINO-QPM
python -m pip install --upgrade pip
python -m pip install -e .
By default, data is expected under:
~/tmp/Datasets
Expected dataset folders:
~/tmp/Datasets/
├── CUB200
│ └── CUB_200_2011
│ ├── attributes
│ ├── class_sim_gts
│ ├── images
│ ├── parts
│ └── segmentations
├── StanfordCars
│ ├── car_devkit
│ ├── cars_test
│ └── cars_train
└── dino_data
├── CUB2011
│ └── ...
└── StanfordCars
└── ...
Download CUB_200_2011.tgz from the official Caltech project page (mirror: Caltech Data) and extract it into ~/tmp/Datasets/CUB200/:
mkdir -p ~/tmp/Datasets/CUB200
tar -xzf CUB_200_2011.tgz -C ~/tmp/Datasets/CUB200
The archive already contains the expected CUB_200_2011/ directory with images/, images.txt, image_class_labels.txt, train_test_split.txt, attributes/, parts/ and segmentations/.
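As a quick sanity check after extraction, the metadata files can be parsed directly. A small illustrative snippet, assuming the default data root shown above:

from pathlib import Path

# CUB-200-2011 metadata: images.txt holds "<id> <path>" lines,
# train_test_split.txt holds "<id> <is_train>" lines (1 = train).
cub = Path.home() / "tmp" / "Datasets" / "CUB200" / "CUB_200_2011"
with open(cub / "images.txt") as f:
    n_images = sum(1 for _ in f)
with open(cub / "train_test_split.txt") as f:
    n_train = sum(1 for line in f if line.split()[1] == "1")
print(f"{n_images} images total, {n_train} in the train split")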
The dataset is no longer hosted at the original Stanford URLs. The easiest way is to let the StanfordCarsClass loader fetch it from the Kaggle mirror — it will pull and unpack the archive into ~/tmp/Datasets/StanfordCars/ automatically on first use (requires Kaggle credentials in ~/.kaggle/kaggle.json):
pip install kaggle
# place your kaggle.json under ~/.kaggle/ (chmod 600)
python -c "from dino_qpm.dataset_classes.stanfordcars import StanfordCarsClass; StanfordCarsClass(train=True, transform=None, download=True)"Alternatively, download the archives manually (e.g. from the Kaggle mirror) and arrange them as:
~/tmp/Datasets/StanfordCars/
├── car_devkit/
│ ├── cars_meta.mat
│ └── cars_train_annos.mat
├── cars_train/
├── cars_test/
└── cars_test_annos_withlabels.mat
The test annotations (cars_test_annos_withlabels.mat) are not bundled with the original devkit and must be added separately; one mirror is linked in stanfordcars.py:70.
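To verify a manual arrangement, including the separately added annotation file, a check along these lines can be used (an illustrative snippet; paths follow the tree above):

from pathlib import Path

# Verify the manually arranged Stanford Cars layout.
root = Path.home() / "tmp" / "Datasets" / "StanfordCars"
expected = [
    root / "car_devkit" / "cars_meta.mat",
    root / "car_devkit" / "cars_train_annos.mat",
    root / "cars_train",
    root / "cars_test",
    root / "cars_test_annos_withlabels.mat",  # not bundled with the original devkit
]
missing = [str(p) for p in expected if not p.exists()]
if missing:
    raise FileNotFoundError(f"Missing Stanford Cars files: {missing}")
print("Stanford Cars layout looks complete.")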
The dino_data/ folder shown above is generated automatically the first time precomputed feature maps/vectors are written — it does not need to be downloaded.
Pretrained backbone weights are expected under:
~/tmp/model_weights/
The exact filename depends on the selected backbone (arch and model_type). Examples:
~/tmp/model_weights/
├── dinov2_vitl14_pretrain.pth # DINOv2 large
├── dinov2_vitb14_reg4_pretrain.pth # DINOv2 base with registers
└── dino_vitbase16_pretrain.pth # DINO base
Weights must be downloaded from the upstream repositories (DINO, DINOv2) and placed in this folder. A missing file raises a FileNotFoundError pointing at the expected path.
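To confirm the weights are in place before training, a check along these lines works (illustrative; the actual filename resolution for arch and model_type lives in the repository):

from pathlib import Path

# Example for the DINOv2 large backbone; substitute the filename for your backbone.
weight_file = Path.home() / "tmp" / "model_weights" / "dinov2_vitl14_pretrain.pth"
if not weight_file.is_file():
    raise FileNotFoundError(
        f"Expected pretrained backbone weights at {weight_file}; "
        "download them from the upstream DINO/DINOv2 repositories."
    )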
Entry point:
main.py
Supported subcommands:
train, inference, evaluate
An overview of the DINO-QPM training pipeline
Note: train runs evaluation by default.
It evaluates the dense model after dense training and, when finetuning is enabled, evaluates the finetuned model as well.
Minimal run using defaults:
python main.py train
Configuration is resolved in two steps:
- General config (dino_qpm/configs/main_training.yaml)
- Model config in dino_qpm/configs/models, selected by (sldd_mode, arch[, mlp]), e.g. qpm/dinov2.yaml
The default configuration processes patch embeddings through the MLP token-by-token and derives the feature vector by average-pooling over all patch tokens (model.feat_vec_type=avg_pooling). An alternative no-MLP baseline (mlp: false, config qpm/dinov2_no_mlp.yaml) skips the MLP entirely and uses the CLS token as the feature vector (model.feat_vec_type=normal).
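Conceptually, the two modes differ as in the following sketch (function and argument names are illustrative, not the repository's API):

def build_feature_vector(cls_token, patch_tokens, mlp, feat_vec_type="avg_pooling"):
    """Illustrative contrast of the two model.feat_vec_type settings."""
    if feat_vec_type == "avg_pooling":
        # default: MLP applied token-by-token, then average pooling over all patches
        return mlp(patch_tokens).mean(dim=1)
    if feat_vec_type == "normal":
        # no-MLP baseline: the CLS token is used directly as the feature vector
        return cls_token
    raise ValueError(f"unknown feat_vec_type: {feat_vec_type}")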
python main.py inference \
--model-path /path/to/model_checkpoint.pth \
--image-path /path/to/image_or_folder \
--output-json /path/to/predictions.json
python main.py evaluate \
--model-path /path/to/model_checkpoint.pth \
--mode finetune \
--eval-mode all \
--output-json /path/to/eval_results.json
If you use this work, please cite:
@inproceedings{zimmermann2026dino-qpm,
title = {{DINO-QPM}: Adapting Visual Foundation Models for Globally Interpretable
Image Classification},
author = {Zimmermann, Robert and Norrenbrock, Thomas and Rosenhahn, Bodo},
booktitle = {2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
year = {2026}
}
Footnotes
[^1]: Norrenbrock, Thomas, Marco Rudolph, and Bodo Rosenhahn. "Q-SENN: Quantized Self-Explaining Neural Networks." Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 19, 2024.
[^2]: Norrenbrock, Thomas, Marco Rudolph, and Bodo Rosenhahn. "Take 5: Interpretable Image Classification with a Handful of Features."
[^3]: Norrenbrock, Thomas, et al. "QPM: Discrete Optimization for Globally Interpretable Image Classification." The Thirteenth International Conference on Learning Representations, 2025.
[^4]: "Rusty Blackbird Identification." All About Birds, Cornell Lab of Ornithology. https://www.allaboutbirds.org/guide/Rusty_Blackbird/id
[^5]: Carl Savignac. COSEWIC Assessment and Status Report on the Rusty Blackbird, Euphagus carolinus, in Canada. Committee on the Status of Endangered Wildlife in Canada, Ottawa, 2006.