This repository contains code and experiments for automatic keypoint description and identification in ViT-based models using vision-language models, aiming to enhance explainability and interpretability in fine-grained image classification tasks. The approach is evaluated on the CUB-200-2011 dataset.
Before running the code, download the CUB-200-2011 dataset from Kaggle:
🔗 https://www.kaggle.com/datasets/wenewone/cub2002011
After downloading, extract the dataset and place it in a known location such as:
`./CUB_200_data/CUB_200_2011/`
💡 Make sure to update the paths in the notebooks or scripts based on your local setup.
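Below is a minimal sketch of pointing the code at your local copy of the dataset. The variable name `DATA_ROOT` and the exact layout checks are illustrative, not taken from the notebooks; adapt them to whatever path variables the notebooks actually define. The subpaths shown follow the standard CUB-200-2011 layout.

```python
from pathlib import Path

# Illustrative path variable -- adjust to your local setup (or mounted Google Drive).
DATA_ROOT = Path("./CUB_200_data/CUB_200_2011")

# Fail early with a clear message if the dataset is not where we expect it.
assert DATA_ROOT.exists(), f"CUB-200-2011 not found at {DATA_ROOT.resolve()}"

images_dir = DATA_ROOT / "images"                    # one folder per bird class
labels_file = DATA_ROOT / "image_class_labels.txt"   # image-id -> class-id mapping
```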
We recommend using a virtual environment for dependency management.
```bash
git clone https://github.com/chensy618/SuperpixelCUB.git
cd SuperpixelCUB   # enter the cloned repository
python3 -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
pip install -r clean_requirements_updated.txt
```

You will need the following pre-trained checkpoints:
- VLPart – Pascal Part AP / AP50
- Prototype Representations

💡 Place these `.pth` files under `./checkpoints/` or update the paths in the scripts accordingly.
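As a minimal sketch, the checkpoints can be loaded with PyTorch as shown below. The file names are illustrative assumptions, not the repository's exact naming; match them to the checkpoints you actually downloaded.

```python
import torch

# Illustrative file names -- rename to match your downloaded checkpoints.
VLPART_CKPT = "./checkpoints/vlpart_pascal_part.pth"
PROTOTYPES_CKPT = "./checkpoints/prototypes.pth"

# map_location="cpu" keeps loading device-agnostic; move tensors to GPU later.
vlpart_state = torch.load(VLPART_CKPT, map_location="cpu")
prototypes = torch.load(PROTOTYPES_CKPT, map_location="cpu")
```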
The primary experiments and visualizations can be found in the following Jupyter Notebooks:
- `KCConCUB_update.ipynb` → Displays classification accuracy.
- `SuperpixelInvestigationCUB_vlpart.ipynb` → Explores superpixel and VLPart segments, keypoint discovery and matching, and semantic alignment.

Note: Since the notebook files are large, please download them first to view the results locally.

💡 Alternatively, check out the generated PDF versions of the notebooks: SuperpixelInvestigationCUB_vlpart.pdf and KCConCUB_update.pdf.
You may encounter errors such as:
```
FileNotFoundError: [Errno 2] No such file or directory
```
These are usually caused by incorrect file paths. Make sure all paths reflect your environment (e.g., local folder or mounted Google Drive).
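A quick hedged sanity check like the one below (the paths listed are the defaults assumed in this README) can catch these problems before running a long notebook cell:

```python
import os

# Verify the expected data and checkpoint directories before running anything.
for path in ["./CUB_200_data/CUB_200_2011", "./checkpoints"]:
    status = "OK" if os.path.exists(path) else "MISSING"
    print(f"{status:>7}  {os.path.abspath(path)}")
```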
- No training is required on CUB to achieve strong keypoint and semantic alignment, turning ViT-based models into Self-Explainable Models (SEMs).
- Achieves 82.7% classification accuracy with only 3 prototypes per class (a sketch of prototype-based classification follows this list).
- Provides interpretable keypoint visualizations that help explain the model’s decision-making process for object recognition and classification.
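For intuition, here is a minimal, hypothetical sketch of nearest-prototype classification. This is not the repository's exact implementation; the names, shapes (e.g., `prototypes` of shape `[C, K, D]` with K = 3 prototypes per class), and the max-over-prototypes scoring rule are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def classify(features: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """features: [B, D] image embeddings; prototypes: [C, K, D] (K prototypes per class).
    Returns predicted class indices of shape [B]."""
    feats = F.normalize(features, dim=-1)     # [B, D], unit-norm embeddings
    protos = F.normalize(prototypes, dim=-1)  # [C, K, D], unit-norm prototypes
    # Cosine similarity of every image to every prototype: [B, C, K]
    sims = torch.einsum("bd,ckd->bck", feats, protos)
    # Score each class by its best-matching prototype, then take the argmax.
    class_scores = sims.max(dim=-1).values    # [B, C]
    return class_scores.argmax(dim=-1)        # [B]
```

Because each class is represented by only a few prototype vectors, the best-matching prototype for a prediction can be traced back to the image regions it corresponds to, which is what makes the keypoint visualizations interpretable.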
🙏 Acknowledgements
This project leverages VLPart by Facebook Research for semantic segmentation.
We thank the authors for making their work publicly available.