This is an implementation of Simple Masked Image Modelling (SimMIM), a self-supervised learning framework for pre-training vision transformers for downstream tasks such as classification or segmentation. A VisionTransformer model is pre-trained on the ImageNet-1K dataset and fine-tuned on the Oxford-IIIT Pets dataset for segmentation. Additionally, we investigate the effects of intermediate fine-tuning on the Intel Image Classification dataset, performed before segmentation training. See the technical report for details.
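For intuition, the pre-training objective can be sketched as follows: random patches of the input image are masked and the model is trained to reconstruct the masked pixels under an L1 loss. Below is a minimal NumPy illustration of that loss; the function name, patch size and masking ratio are placeholders, not this project's actual API:

```python
import numpy as np

def simmim_l1_loss(original, reconstruction, mask_ratio=0.6, patch=32, seed=0):
    """Illustrative SimMIM objective: L1 loss on masked patches only.

    `original` / `reconstruction` are (H, W, C) arrays with H and W
    divisible by `patch`. All names/defaults here are for illustration.
    """
    h, w, _ = original.shape
    gh, gw = h // patch, w // patch
    rng = np.random.default_rng(seed)
    # Randomly mask a fraction of the patch grid (True = masked patch).
    mask = rng.random((gh, gw)) < mask_ratio
    # Expand the patch-level mask to a per-pixel boolean mask.
    pixel_mask = np.repeat(np.repeat(mask, patch, axis=0), patch, axis=1)
    # Mean absolute error, computed over masked pixels only.
    diff = np.abs(original - reconstruction)[pixel_mask]
    return diff.mean() if diff.size else 0.0
```

A perfect reconstruction gives zero loss, and unmasked regions do not contribute to the gradient signal, which is what encourages the encoder to infer content from visible context.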
Four random reconstruction samples from the pre-trained encoder (from left to right: original image, masked image and reconstruction):

Four random samples of segmentation predictions from the fine-tuned model (from left to right: original image, ground-truth segmentation map and predicted map):

This project requires Python 3.10 and the packages listed in `requirements.txt`. To install them in your virtual environment, run:

```bash
pip install -r requirements.txt
```
Run the following to start pre-training (with the default settings used in the report, on a smaller subset of the data):

```bash
python main_pretrain.py \
    --config vit_4M_pretrain \
    --train_size 10000 \
    --download_imagenet
```
This will first download and extract the compressed ImageNet files, then start printing training statistics and save the model weights as a `.pth` file every epoch. Use the `--run_plots` flag to save reconstructions during training, and the `--val_set` flag to use only the smaller (validation) set for quicker testing. Change the train size between 45k, 100k and 200k to reproduce the results from the report.
Note that the full download may take upwards of 4 hours, depending on your connection. You may choose to download only the validation set and the devkit files and train on this smaller subset, in which case use the `--val_set` flag.
[Optional] Manual downloads: Navigate to the project's `/data/` folder and download ImageNet-1K, either by running the commands below in a bash shell or manually via the links to these 3 files (devkit, validation, train):

```bash
cd data
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_devkit_t12.tar.gz --no-check-certificate
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar --no-check-certificate
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar --no-check-certificate
```
With the pre-trained encoder weights in the `/weights/` folder, run this command for fine-tuning, which will download the Oxford-IIIT Pets dataset and start training initialised with the given weights:

```bash
python main_finetune.py \
    --config vit_4M_finetune \
    --train_size 6000 \
    --test_size 1000 \
    --weights weights/encoder_vit_4M_pretrain_200K.pth
```
Loss is printed every epoch, while test-set pixel accuracy and mean IoU are calculated after training is complete. Segmentation predictions will be saved under `/figures/`. Change the train size to reproduce results from the report.
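For reference, the two reported metrics can be sketched in a few lines of NumPy (a minimal version for intuition; the project's own evaluation code may differ in details such as how absent classes are averaged):

```python
import numpy as np

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class matches the ground truth."""
    return (pred == target).mean()

def mean_iou(pred, target, num_classes):
    """Mean IoU: per-class intersection-over-union, averaged over the
    classes that appear in either the prediction or the target."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

Pixel accuracy alone can be misleading when classes are imbalanced (e.g. large background regions), which is why mean IoU is reported alongside it.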
To run a baseline model with no pre-training, omit the `--weights` argument, i.e. use the following command:

```bash
python main_finetune.py \
    --config vit_4M_finetune \
    --train_size 6000 \
    --test_size 1000
```
With the pre-trained encoder weights in the /weights/ folder, run the following command to perform intermediate fine-tuning on this dataset, followed by segmentation fine-tuning on Oxford-IIIT Pets:
```bash
python main_finetune.py \
    --config vit_4M_finetune \
    --train_size 6000 \
    --test_size 1000 \
    --weights weights/encoder_vit_4M_pretrain_200K.pth \
    --int_finetune
```

First, the Intel Image Classification dataset will be automatically downloaded. You may choose to do this manually via:
```bash
cd data
wget "https://huggingface.co/datasets/miladfa7/Intel-Image-Classification/resolve/main/archive.zip?download=true"
```

To plot reconstructions from pre-trained models on the ImageNet validation set (download above):
```bash
python evaluation.py \
    --config vit_4M_pretrain \
    --weights weights/mim_vit_4M_pretrain_200K.pth
```

To evaluate a fine-tuned segmentation model on the Oxford-IIIT Pets test set, use a command like the following, replacing the weights with those saved after fine-tuning (see above):
```bash
python evaluation.py \
    --config vit_4M_finetune \
    --weights weights/vit_4M_finetune_data_250.pth \
    --test_size 1000 \
    --train_size 250
```