The training and evaluation code requires PyTorch 2.0 and xFormers 0.0.18, as well as a number of other third-party packages. Note that the code has only been tested with the specified versions and also expects a Linux environment. To set up all the required dependencies for training and evaluation, please follow the instructions below:
Clone the repository, then create and activate a `dinov2` conda environment using the provided environment definition:

```shell
conda env create -f conda.yaml
conda activate dinov2
```

For dense tasks (depth estimation and semantic segmentation), there are additional dependencies (specific versions of `mmcv` and `mmsegmentation`) which are captured in the extras dependency specifications:

```shell
conda env create -f conda-extras.yaml
conda activate dinov2-extras
```
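Optionally, you can quickly check that the environment resolved to the versions the code was tested with. This is a minimal sketch; only the PyTorch and xFormers versions mentioned above are pinned, anything else is untested:

```python
import torch
import xformers

# The code has only been tested with PyTorch 2.0 and xFormers 0.0.18 on Linux.
print(f"torch {torch.__version__}")
print(f"xformers {xformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```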
You need to wrap up your data in a tarball file:
- Ensure images are all in one directory

- Create a single large tarball file that contains all images and name it `pretrain_dataset.tar`:

  ```shell
  tar -chf pretrain_dataset.tar /path/to/image/folder
  ```

- Infer the auxiliary files `pretrain_entries.npy` and `pretrain_file_indices.npy`:

  ```shell
  python scripts/infer_entries.py \
      --tarball_path /path/to/pretrain_dataset.tar \
      --output_root /path/to/output/folder \
      --name pretrain
  ```

  The `pretrain_entries.npy` file will record:
  - a dummy class index (we set it to 0 for all images since we're not using classes)
  - a unique filename index for each image
  - the start and end offsets of each image within the tarball file

  The `pretrain_file_indices.npy` file consists of a dictionary mapping each filename index to the corresponding filename (see the sanity-check sketch after this list).

- Dump `pretrain_dataset.tar`, `pretrain_entries.npy` and `pretrain_file_indices.npy` in a common folder (e.g. `/root/data`)
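Once the auxiliary files have been generated, a quick sanity check can confirm they line up with the tarball. This is a minimal sketch, not part of the repository: it assumes the entries array stores one row per image as (class index, filename index, start offset, end offset) and that `pretrain_file_indices.npy` holds a pickled Python dictionary, as described above.

```python
import numpy as np

# Assumed location: the common folder where the auxiliary files were dumped.
entries = np.load("/root/data/pretrain_entries.npy")
# Assumes the index -> filename dictionary was saved with np.save (hence allow_pickle).
file_indices = np.load("/root/data/pretrain_file_indices.npy", allow_pickle=True).item()

print(f"{len(entries)} entries, {len(file_indices)} filenames")

# Assumed column order: (class index, filename index, start offset, end offset).
class_index, filename_index, start, end = entries[0]
print(f"first entry: {file_indices[int(filename_index)]} "
      f"(class {class_index}, bytes {start}-{end} in pretrain_dataset.tar)")
```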
You may not want to use all the patches of a cohort, but only a subset of them (e.g. the cohort comes with a train/tune/test split and you only want to use the patches belonging to slides in the train partition). In that case, follow these steps:
- Dump the image filenames (e.g. `patch1.jpg`) of the subset of interest in a `.txt` file (e.g. `{subset}.txt`); see the sketch after this list for one way to produce it

- Infer the corresponding auxiliary file `pretrain_entries_{subset}.npy`:

  ```shell
  python scripts/infer_entries.py \
      --tarball_path /path/to/pretrain_dataset.tar \
      --output_root /path/to/output/folder \
      --keep /path/to/{subset}.txt \
      --name pretrain \
      --suffix {subset}
  ```

  The `pretrain_entries_{subset}.npy` file will record:
  - a dummy class index (we set it to 0 for all images since we're not using classes)
  - a unique filename index for each image listed in `{subset}.txt`
  - the start and end offsets of each image within the tarball file

  A generic `pretrain_file_indices.npy` file will be saved the first time you run this command. It consists of a dictionary mapping each filename index to the corresponding filename for the entire tarball file.

- Dump `pretrain_dataset.tar`, `pretrain_entries_{subset}.npy` and `pretrain_file_indices.npy` in a common folder (e.g. `/root/data`)
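For reference, here is one way to produce `{subset}.txt` when the cohort split lives in a spreadsheet. This is a hypothetical sketch: `cohort_split.csv`, its `filename` and `partition` columns, and the one-filename-per-line format are assumptions to adapt to your own data.

```python
import pandas as pd

# Hypothetical split file with one row per patch, e.g. columns: filename, partition.
df = pd.read_csv("/path/to/cohort_split.csv")

# Keep only patches belonging to slides in the train partition.
train_filenames = df.loc[df["partition"] == "train", "filename"]

# One filename per line, to be passed to --keep (together with --suffix train).
with open("/path/to/train.txt", "w") as f:
    f.write("\n".join(train_filenames) + "\n")
```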
This section describes the steps to follow in case you want to run tuning on a downstream task dataset with patch-level labels.
- Create a `.csv` file containing the downstream patches' filenames and labels (see the sketch after this list for one way to produce it):

  ```
  filename,label
  downstream_patch_1.jpg,3
  downstream_patch_2.jpg,1
  ...
  ```

- Create a single tarball file that contains all downstream tuning patches and name it `downstream_dataset.tar`:

  ```shell
  tar -chf downstream_dataset.tar /path/to/downstream/dataset/image/folder
  ```

- Infer the auxiliary files `query_entries.npy` and `query_file_indices.npy`:

  ```shell
  python3 scripts/infer_entries.py \
      --tarball_path /path/to/downstream_dataset.tar \
      --output_root /path/to/output/folder \
      --csv /path/to/csv/file.csv \
      --keep /path/to/output/query.txt \
      --prefix query
  ```

  `/path/to/csv/file.csv` should point to the `.csv` file created in the first step above.
  `/path/to/output/query.txt` should contain the list of filenames for the patches in the query subset of the downstream dataset.

- Infer the auxiliary files `test_entries.npy` and `test_file_indices.npy`:

  ```shell
  python3 scripts/infer_entries.py \
      --tarball_path /path/to/downstream_dataset.tar \
      --output_root /path/to/output/folder \
      --csv /path/to/csv/file.csv \
      --keep /path/to/output/test.txt \
      --prefix test
  ```

  `/path/to/csv/file.csv` should point to the `.csv` file created in the first step above.
  `/path/to/output/test.txt` should contain the list of filenames for the patches in the test subset of the downstream dataset.

- Dump the `.tar` file and the `.npy` files in a common folder (e.g. `/root/data`)
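The snippet below shows one way to produce the `.csv` file and the `query.txt` / `test.txt` lists from a single table of downstream patches. It is a hypothetical sketch: `downstream_patches.csv` and its `filename`, `label` and `split` columns are assumptions, as is the one-filename-per-line format expected by `--keep`.

```python
import pandas as pd

# Hypothetical table of downstream patches, e.g. columns: filename, label, split.
df = pd.read_csv("/path/to/downstream_patches.csv")

# Labels file passed to --csv (same header as the example above: filename,label).
df[["filename", "label"]].to_csv("/path/to/csv/file.csv", index=False)

# One filename per line per subset, passed to --keep.
for split, path in [("query", "/path/to/output/query.txt"),
                    ("test", "/path/to/output/test.txt")]:
    filenames = df.loc[df["split"] == split, "filename"]
    with open(path, "w") as f:
        f.write("\n".join(filenames) + "\n")
```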
Make sure the `dinov2` package is included in the Python module search path:

```shell
export PYTHONPATH="${PYTHONPATH}:/path/to/your/dinov2"
```

Update `dinov2/configs/train/vitl14.yaml` if you want to change some parameters (e.g. enabling early stopping).
Then run:

```shell
python -m torch.distributed.run --nproc_per_node=gpu dinov2/train/train.py \
    --config-file dinov2/configs/train/vitl14.yaml \
    train.dataset_path=Pathology:root={path/to/tarball/root}:extra={path/to/entry/root}:subset={subset}
```

Replace `{path/to/tarball/root}` with the root folder where tarballs are saved, and `{path/to/entry/root}` with the root folder where numpy entry files are saved (e.g. `Pathology:root=/root/data:extra=/root/data`).
Leave out `:subset={subset}` if you didn't restrict the dataset to a specific subset when preparing the data.
Otherwise, replace `{subset}` with the suffix you chose for `--suffix` during data preparation (e.g. `Pathology:root=/root/data:extra=/root/data:subset=train`).
In case you want to run downstream tuning, make sure to update the following two parameters in your config:
```yaml
tune:
  query_dataset_path: KNN:root={path/to/data/root}:extra={path/to/entry/root}:split=query
  test_dataset_path: KNN:root={path/to/data/root}:extra={path/to/entry/root}:split=test
```

Replace `{path/to/data/root}` with the folder where you dumped the downstream `.tar` file.
Replace `{path/to/entry/root}` with the folder where you dumped the downstream `.npy` entry files.