
PHENOMICL: a machine learning model for PHENOtyping of whole-slide images using Multiple InstanCe Learning


Francesco Cisternino1*, Yipei Song2*, Tim S. Peters3*, Roderick Westerman3, Gert J. de Borst4, Ernest Diez Benavente3, Noortje A.M. van den Dungen3, Petra Homoed-van der Kraak5, Dominique P.V. de Kleijn4, Joost Mekke4, Michal Mokry3, Gerard Pasterkamp3, Hester M. den Ruijter3,6, Evelyn Velema6, Clint L. Miller2*, Craig A. Glastonbury1,7*, S.W. van der Laan2,3*.

* these authors contributed equally

Affiliations
1 Human Technopole, Viale Rita Levi-Montalcini 1, 20157, Milan, Italy; 2 Department of Genome Sciences, University of Virginia, Charlottesville, VA, USA; 3 Central Diagnostic Laboratory, Division Laboratories, Pharmacy, and Biomedical genetics, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands; 4 Vascular surgery, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands; 5 Pathology, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands; 6 Experimental Cardiology, Department Cardiology, Division Heart & Lungs, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands; 7 Nuffield Department of Medicine, University of Oxford, Oxford, UK.

Background

Despite tremendous medical progress, cardiovascular diseases (CVD) still top global charts of morbidity and mortality. Atherosclerosis is the major underlying cause of CVD and results in atherosclerotic plaque formation. The extent and type of atherosclerosis are manually assessed through histological analysis, and histological characteristics are linked to major acute cardiovascular events (MACE). However, conventional means of assessing plaque characteristics suffer major limitations that directly impact their predictive power. PHENOMICL uses a machine learning method, multiple instance learning (MIL), to develop an internal representation of the 2-dimensional plaque images, allowing the model to learn position- and scale-invariant structures in the data. We created a powerful model for image recognition problems using whole-slide images from stained atherosclerotic plaques to predict relevant phenotypes, for example, intraplaque haemorrhage.
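To give a sense of how attention-based MIL aggregates many patch-level embeddings into one slide-level representation, here is a minimal, self-contained sketch in pure Python. This is an illustration of the general technique only; the scorer weights, embedding sizes, and patch values are toy inputs, not the trained PHENOMICL model.

```python
import math

def attention_mil_pool(instances, w_score):
    """Aggregate per-patch feature vectors into one slide-level vector
    using softmax attention weights (illustrative sketch only)."""
    # Score each instance (patch embedding) with a simple linear scorer.
    scores = [sum(w * x for w, x in zip(w_score, inst)) for inst in instances]
    # Softmax over instances -> attention weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    # Attention-weighted sum of instance features = slide representation.
    dim = len(instances[0])
    pooled = [sum(a * inst[d] for a, inst in zip(attn, instances))
              for d in range(dim)]
    return pooled, attn

# Toy example: 3 patches, each with a 2-dimensional embedding.
patches = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]]
pooled, attn = attention_mil_pool(patches, w_score=[1.0, 1.0])
```

The attention weights double as an interpretability signal: patches with high weight contribute most to the slide-level prediction, which is what the heatmaps produced later in this README visualise.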

This work is associated with the PHENOMICL_downstream project.

Where do I start?

Folder structure

  • ./scripts/: Example bash (driver) scripts to run the pre-processing, training and evaluation.
  • ./examples/: Example input files per stain for the Usage example code. Also contains model checkpoints.
  • ./wsi_preprocessing/: Pre-processing scripts (segmentation/feature extraction).
  • ./AtheroExpressCLAM/: Code to run (and train) model.
  • iph.py: Script for generating the IPH heatmap visualisations.
    • main.py: Main script to train the model.

Installation

To get started with PHENOMICL, follow these steps to set up your environment and install the required dependencies.

Note

Expected installation time in a typical Linux/macOS environment: ~15-20 minutes. Installation steps verified on a MacBook Air M1 (CPU version) and a Linux server running Rocky 8 (CUDA version).

Step 1: Clone the Repository

Clone the repository to your local machine:

git clone https://github.com/CirculatoryHealth/PHENOMICL.git
cd PHENOMICL

Step 2: Set Up a Conda Environment

Depending on your hardware, choose one of the following options:

For CPU-only Environments:

conda env create -f phenomicl.yml
conda activate phenomicl

For GPU (CUDA) Environments:

conda env create -f phenomicl_cuda.yml
conda activate phenomicl

Ensure CUDA is installed correctly:

python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

Step 3: Install OpenSlide

OpenSlide (>v3.1.0) is required for processing whole-slide images. Install it using your preferred method:

macOS (via Homebrew):

brew install openslide

Ubuntu/Debian:

sudo apt-get install openslide-tools

Windows:

Download and install OpenSlide from the official website.

Step 4: Verify Installation

Ensure OpenSlide is installed correctly:

python -c "import openslide; print('OpenSlide version:', openslide.__version__)"

Usage

Once the installation is complete, you can start using PHENOMICL for pre-processing and running machine learning models on whole-slide images.

Download Segmentation model (Unet) checkpoint

Download the Unet segmentation checkpoint and place it in the ./examples/ folder.

Download Example Whole slide images

We made 3 WSIs (NDPI/TIF) available for each of the 9 stain models to test and run the code. These can be downloaded from the DataverseNL website. To use these WSIs without too much configuration, please place them in the folders specified in Step 1 of pre-processing (the stain type is part of each WSI's file name).

Besides the WSIs, other files like macro images, WSI metadata, and slide thumbnails can be downloaded. These are not necessary to run the following example.

On the DataverseNL website you can also download example results for the HE stain. These correspond to the results you will generate when running the code with the example WSIs.

Important

If you are using the example WSIs, please update the following files to contain the path to this repository directory.

  • ./examples/*/phenomicl_test_set.csv: These files are pre-prepared for use with example WSIs. For each stain, please edit the phenomicl_test_set.csv file such that the full_path column has the full directory path to this repository directory.
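Editing the full_path column by hand for every stain is error-prone, so a small helper can rewrite it for you. This is a sketch using only the Python standard library; it assumes full_path holds the absolute path to each slide file (adjust if your copy of phenomicl_test_set.csv stores only a directory).

```python
import csv
from pathlib import Path

def set_full_path(csv_file, repo_dir):
    """Point the full_path column of a phenomicl_test_set.csv at the
    local clone of the repository. Assumes full_path stores the absolute
    path to each WSI file; only the directory part is replaced."""
    csv_file = Path(csv_file)
    with csv_file.open(newline="") as fh:
        reader = csv.DictReader(fh)
        fields = reader.fieldnames
        rows = list(reader)
    for row in rows:
        # Keep the original file name, swap in the local directory.
        row["full_path"] = str(Path(repo_dir) / Path(row["full_path"]).name)
    with csv_file.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

For example, `set_full_path("examples/HE/phenomicl_test_set.csv", "/full/path/to/PHENOMICL/examples/HE")` would be run once per stain folder.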

Pre-processing Workflow

Note

Time required:

  • MacBook Air M1, CPU (no CUDA), processing 3 example WSIs: ~1.25 hours.
  • Linux server, Rocky 8, Tesla V100-PCIE-16GB GPU (CUDA), processing 3 example WSIs: ~6 minutes.

Important

After installation, all commands mentioned below can be executed through the command line from the GitHub repository directory. To get there, please use cd <full/path/to>/PHENOMICL.

Step 1: Organise Whole-Slide Images by Stain

Before proceeding, ensure your whole-slide images (WSIs) (ndpi/TIF) are organised into separate folders based on their stain type. For example:

PHENOMICL/examples/HE/
PHENOMICL/examples/SMA/
...

Each folder should contain WSIs corresponding to the specific stain (e.g., all HE-stained images in the HE folder and all SMA-stained images in the SMA folder).
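Since the stain type is part of each WSI's file name, the sorting can be scripted. The sketch below uses only the Python standard library; apart from HE and SMA, which appear in this README, any stain names you pass in are your own, and the filename-token matching is an assumption about how your files are named.

```python
import re
import shutil
from pathlib import Path

def sort_by_stain(src_dir, dest_root, stains=("HE", "SMA")):
    """Move WSIs into per-stain folders based on the stain token in the
    file name (e.g. AE1.HE.ndpi -> <dest_root>/HE/). The token match on
    '.', '_', '-' and spaces is an assumption about your naming scheme."""
    src_dir, dest_root = Path(src_dir), Path(dest_root)
    for wsi in sorted(src_dir.iterdir()):
        if wsi.suffix.lower() not in {".ndpi", ".tif", ".tiff"}:
            continue  # skip non-WSI files
        tokens = re.split(r"[._\- ]", wsi.stem)
        for stain in stains:
            if stain in tokens:
                target = dest_root / stain
                target.mkdir(parents=True, exist_ok=True)
                shutil.move(str(wsi), str(target / wsi.name))
                break
```

A call such as `sort_by_stain("downloads/", "PHENOMICL/examples/")` would then produce the folder layout shown above; files whose names match no stain are left in place for manual review.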

Step 2: Set the Working Directory

Define the directory containing your organised WSIs for the stain you want to process. Execute:

Note

We process each stain separately. If you want to process all 9 stains, repeat the steps below once for each stain folder.

SLIDE_DIR="/full_path_to_where_the_wsi_are_for_stain/"
# Examples:
# SLIDE_DIR="</full/path/to>/PHENOMICL/examples/HE/"
# SLIDE_DIR="</full/path/to>/PHENOMICL/examples/SMA/"
# ...

Step 3: Segmentation and Patch Extraction

Run the segmentation script to extract patches from WSIs:

python ./wsi_preprocessing/segmentation.py \
--slide_dir="${SLIDE_DIR}" \
--output_dir="${SLIDE_DIR}/PROCESSED" \
--masks_dir="${SLIDE_DIR}/PROCESSED/masks/" \
--model ./examples/checkpoint_ts.pth

Step 4: Feature Extraction

Extract features from the processed patches:

python ./wsi_preprocessing/extract_features.py \
-h5_data="${SLIDE_DIR}/PROCESSED/patches/" \
-slide_folder="${SLIDE_DIR}" \
-output_dir="${SLIDE_DIR}/PROCESSED/features/"

Tip

If you encounter a Permission denied error related to Torch cache, manually create the directory:

mkdir -p ~/.cache/torch

Running the Model

Step 1: Set the Working Directory

Define the directory containing your organised WSIs for the stain you want to process. Execute:

Note

We process each stain separately. If you want to process all 9 stains, repeat the steps below once for each stain folder.

SLIDE_DIR="/full_path_to_where_the_wsi_are_for_stain/"
# Examples:
# SLIDE_DIR="</full/path/to>/PHENOMICL/examples/HE/"
# SLIDE_DIR="</full/path/to>/PHENOMICL/examples/SMA/"
# ...

Step 2: Run the Model

Run the model on the pre-processed slides to generate predictions and heatmaps:

python3 ./AtheroExpressCLAM/iph.py \
--h5_dir="${SLIDE_DIR}/PROCESSED/features/h5_files/" \
--csv_in="${SLIDE_DIR}/phenomicl_test_set.csv" \
--csv_out="${SLIDE_DIR}/phenomicl_test_results.csv" \
--out_dir="${SLIDE_DIR}/heatmaps/" \
--model_checkpoint="${SLIDE_DIR}/MODEL_CHECKPOINT.pt"

Step 3: Check results

After running the model, the results can be found in the corresponding stain folder (e.g. for HE, the results will be in ./examples/HE/).

  • You will find the heatmaps in ./examples/<STAIN>/heatmaps/.
  • You will find the model results (prediction, probability, area) in ./examples/<STAIN>/phenomicl_test_results.csv.
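A quick way to inspect the output is to read the results CSV back in and print one line per slide. This sketch uses only the Python standard library; the column names are assumed from the "prediction, probability, area" description above, so adjust them if your file differs.

```python
import csv

def summarise_results(results_csv):
    """Print per-slide prediction and probability from a
    phenomicl_test_results.csv. Column names (slide_id, prediction,
    probability) are assumptions based on the README's description."""
    with open(results_csv, newline="") as fh:
        rows = list(csv.DictReader(fh))
    for row in rows:
        print(f"{row.get('slide_id', '?')}: "
              f"prediction={row.get('prediction')}, "
              f"probability={row.get('probability')}")
    return rows
```

For example, `summarise_results("examples/HE/phenomicl_test_results.csv")` lets you eyeball the predictions before comparing them with the example results from DataverseNL.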

Additional Notes

  • Input Data: Ensure your input WSIs are in the correct format and stored in the specified directory.
  • Output Data: Processed data, features, and results will be saved in the respective subdirectories under SLIDE_DIR/PROCESSED.
  • Hardware Requirements: For optimal performance, use a GPU-enabled environment for large datasets.

Project structure

| File            | Description                 | Usage          |
| --------------- | --------------------------- | -------------- |
| README.md       | Description of project      | Human editable |
| PHENOMICL.Rproj | Project file                | Loads project  |
| LICENSE         | User permissions            | Read only      |
| .worcs          | WORCS metadata YAML         | Read only      |
| renv.lock       | Reproducible R environment  | Read only      |
| images          | Images used in readme, etc. | Human editable |
| scripts         | Scripts to process data     | Human editable |

Reproducibility

This project uses the Workflow for Open Reproducible Code in Science (WORCS) to ensure transparency and reproducibility. The workflow is designed to meet the principles of Open Science throughout a research project.

To learn how WORCS helps researchers meet the TOP-guidelines and FAIR principles, read the preprint at https://osf.io/zcvbs/

WORCS: Advice for authors

WORCS: Advice for readers

Please refer to the vignette on reproducing a WORCS project for step by step advice.

Questions and issues

Do you have burning questions or do you want to discuss usage with other users? Do you want to report an issue? Or do you have an idea for improvement or adding new features to our method and tool? Please use the Issues tab.

Citations

Using our PHENOMICL method? Please cite our work:

Intraplaque haemorrhage quantification and molecular characterisation using attention based multiple instance learning
Francesco Cisternino, Yipei Song, Tim S. Peters, Roderick Westerman, Gert J. de Borst, Ernest Diez Benavente, Noortje A.M. van den Dungen, Petra Homoed-van der Kraak, Dominique P.V. de Kleijn, Joost Mekke, Michal Mokry, Gerard Pasterkamp, Hester M. den Ruijter, Evelyn Velema, Clint L. Miller, Craig A. Glastonbury, S.W. van der Laan.
medRxiv 2025.03.04.25323316; doi: https://doi.org/10.1101/2025.03.04.25323316.

Data availability

The whole-slide images used in this project are available through a DataverseNL repository. There are restrictions on use by commercial parties, and on sharing openly based on (inter)national laws, regulations and the written informed consent. Therefore these data (and additional clinical data) are only available upon discussion and signing a Data Sharing Agreement (see Terms of Access) and within a specially designed UMC Utrecht provided environment.

Acknowledgements

We are thankful for the support of the Netherlands CardioVascular Research Initiative of the Netherlands Heart Foundation (CVON 2011/B019 and CVON 2017-20: Generating the best evidence-based pharmaceutical targets for atherosclerosis [GENIUS I&II]), the ERA-CVD program 'druggable-MI-targets' (grant number: 01KL1802), the Leducq Fondation 'PlaqOmics' and ‘AtheroGen’, and the Chan Zuckerberg Initiative ‘MetaPlaq’. The research for this contribution was made possible by the AI for Health working group of the EWUU alliance. The collaborative project ‘Getting the Perfect Image’ was co-financed through use of PPP Allowance awarded by Health~Holland, Top Sector Life Sciences & Health, to stimulate public-private partnerships.

Funding for this research was provided by National Institutes of Health (NIH) grant nos. R00HL125912 and R01HL14823 (to Clint L. Miller), a Leducq Foundation Transatlantic Network of Excellence ('PlaqOmics') grant no. 18CVD02 (to Dr. Clint L. Miller and Dr. Sander W. van der Laan), the CZI funded 'MetaPlaq' (to Dr. Clint L. Miller and Dr. Sander W. van der Laan), EU HORIZON NextGen (grant number: 101136962, to Dr. Sander W. van der Laan), EU HORIZON MIRACLE (grant number: 101115381, to Dr. Sander W. van der Laan), and Health~Holland PPP Allowance ‘Getting the Perfect Image’ (to Dr. Sander W. van der Laan).

Dr Craig A. Glastonbury has stock options in BenevolentAI and is a paid consultant for BenevolentAI, unrelated to this work. Dr. Sander W. van der Laan was funded by Roche Diagnostics, as part of 'Getting the Perfect Image', however Roche was not involved in the conception, design, execution or in any other way, shape or form of this project.

The framework was based on the WORCS package.

Changes log

Version:      v1.2.0
Last update:  2025-04-25
Written by:   Francesco Cisternino; Craig Glastonbury; Sander W. van der Laan; Clint L. Miller; Yipei Song; Tim S. Peters.
Description:  CONVOCALS repository: classification of atherosclerotic histological whole-slide images
Minimum requirements: R version 3.4.3 (2017-06-30) -- 'Single Candle', Mac OS X El Capitan

Changes log
* v1.2.0 Major Installation/Usage instruction ReadMe update.
* v1.1.0 Major overhaul, updates and re-organization prior to submission.
* v1.0.1 Updates and re-organization.
* v1.0.0 Initial version. 

Creative Commons BY-NC-ND 4.0

This is a human-readable summary of (and not a substitute for) the license.
You are free to share, copy and redistribute the material in any medium or format. The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- NonCommercial — You may not use the material for commercial purposes.
- NoDerivatives — If you remix, transform, or build upon the material, you may not distribute the modified material.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Notices:
You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation. No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.
