KLCs - Keypoint Labeling Classifiers

This repository contains code and experiments for automatic keypoint identification and description in ViT-based models using vision-language models, aiming to enhance explainability and interpretability in fine-grained image classification tasks. The approach is evaluated on the CUB-200-2011 dataset.

📁 Dataset

Before running the code, download the CUB-200-2011 dataset from Kaggle:

🔗 https://www.kaggle.com/datasets/wenewone/cub2002011

After downloading, extract the dataset and place it in a known location such as:

./CUB_200_data/CUB_200_2011/

💡 Make sure to update the paths in the notebooks or scripts based on your local setup.
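As a quick sanity check, the sketch below assumes the dataset sits at the path above; CUB_ROOT is just an illustrative variable name, not something defined in this repository.

from pathlib import Path

# Illustrative variable; point it at wherever you extracted the archive.
CUB_ROOT = Path("./CUB_200_data/CUB_200_2011")

# A few files/folders the CUB-200-2011 release is known to contain.
for name in ("images", "images.txt", "classes.txt", "train_test_split.txt"):
    path = CUB_ROOT / name
    print(f"{path}: {'found' if path.exists() else 'MISSING'}")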


⚙️ Environment Setup

We recommend using a virtual environment for dependency management.

1. Clone the Repository

git clone https://github.com/chensy618/SuperpixelCUB.git
cd SuperpixelCUB

2. Create and Activate Virtual Environment

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install Dependencies

pip install -r clean_requirements_updated.txt

🔽 4. Download .pth Files

You will need the following pre-trained checkpoints:

  1. VLPart - Pascal Part AP / AP50

  2. Prototype Representations

💡 Place these .pth files under ./checkpoints/ or update paths in the scripts accordingly.
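As a rough sketch of loading the checkpoints with PyTorch (the filenames below are placeholders, not the actual checkpoint names):

import torch

# Placeholder filenames; substitute the files you downloaded.
vlpart_ckpt = torch.load("./checkpoints/vlpart_pascal_part.pth", map_location="cpu")
prototypes = torch.load("./checkpoints/prototypes.pth", map_location="cpu")

print(type(vlpart_ckpt), type(prototypes))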


📈 Main Results

The primary experiments and visualizations are provided in the Jupyter notebooks included in this repository.


⚠️ Execution Notes

You may encounter errors such as:

FileNotFoundError: [Errno 2] No such file or directory

These are usually caused by incorrect file paths. Make sure all paths reflect your environment (e.g., local folder or mounted Google Drive).
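One way to keep paths manageable, sketched below with an assumed layout, is to derive everything from a single base directory so that switching between a local checkout and a mounted Google Drive is a one-line change:

from pathlib import Path

BASE_DIR = Path(".")  # local checkout
# BASE_DIR = Path("/content/drive/MyDrive/SuperpixelCUB")  # example Colab mount

DATA_DIR = BASE_DIR / "CUB_200_data" / "CUB_200_2011"
CKPT_DIR = BASE_DIR / "checkpoints"

for p in (DATA_DIR, CKPT_DIR):
    if not p.exists():
        raise FileNotFoundError(f"{p} not found; adjust BASE_DIR for your setup.")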


✨ Highlights

  • No training on CUB is required to achieve strong keypoint and semantic alignment, turning ViT-based models into Self-Explainable Models (SEMs).
  • Achieves 82.7% classification accuracy with only 3 prototypes per class (a toy sketch of prototype-based classification follows this list).
  • Provides interpretable keypoint visualizations that help explain the model’s decision-making process for object recognition and classification.
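For intuition only, the toy sketch below shows one way nearest-prototype classification over ViT embeddings can work; the tensor shapes, cosine similarity, and function name are assumptions for illustration and do not reproduce this repository's implementation.

import torch
import torch.nn.functional as F

def classify_with_prototypes(features, prototypes):
    # features:   (N, D) ViT image embeddings
    # prototypes: (C, K, D) prototypes, K per class (e.g. K = 3)
    feats = F.normalize(features, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    sims = torch.einsum("nd,ckd->nck", feats, protos)   # cosine similarity to every prototype
    class_scores = sims.max(dim=-1).values              # best-matching prototype per class
    return class_scores.argmax(dim=-1)                  # predicted class per image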

🙏 Acknowledgements

This project leverages VLPart by Facebook Research for semantic segmentation.

We thank the authors for making their work publicly available.
