
GENESIS

Code repository for the Graph nEural Networks for pulmonary EmboliSm rIsk Stratification (GENESIS) project.


Publications

Description

A project applying tabular models and graph neural networks to the task of pulmonary embolism risk stratification.

For tabular models, it is assumed that a CSV of global features (i.e., medical records, cardiac biomarkers, vascular biomarkers) for all patients is available. This code repository can then fit tabular models to predict the risk of pulmonary embolism from these features.

For graph neural networks (GNNs), it is assumed that an image processing pipeline previously segmented and extracted the graph of the vascular tree from 3D CTPA images. This code repository then takes these graphs as input and trains GNNs to predict the risk of pulmonary embolism from the vascular graphs and global features.

Important

Using this project requires a basic understanding of PyTorch Lightning and Hydra. If you do not know at least what these libraries do and how they work at a high level, you should familiarize yourself with them. We refer you to the PyTorch Lightning documentation and the Hydra documentation.

Table of Contents

  1. Installation
    1. uv
    2. pip
    3. Extras
    4. Weights & Biases configuration
  2. Reproduce published experiments
  3. Run custom experiments
    1. Basics
    2. Preset configs
    3. Track experiments
    4. Launch multiple experiments simultaneously
    5. Hyperparameter search with Optuna
  4. Run tests

Installation

uv (recommended)

Note

uv is a Python package and project manager. It allows you to manage Python interpreters, dependencies, and project configuration in a single tool. If you don't have it installed already, you can install it (on Linux and macOS) by running:

curl -LsSf https://astral.sh/uv/install.sh | sh
  1. Download the repository.
    git clone https://github.com/creatis-myriad/GENESIS
    cd GENESIS
  2. Create a virtual environment and install the project and its dependencies. You must specify as an extra the desired compute platform for PyTorch (i.e. CPU/CUDA). Supported values are: cpu, cu129, cu128, cu126.
    # e.g. to install the project with the PyTorch version built for CPU
    uv sync --extra cpu
    
    # e.g. to install the project with the PyTorch version built for CUDA 12.8
    uv sync --extra cu128
    [OPTIONAL] You can also specify other extras for additional functionalities:
    # e.g. to install the `wandb` extra for W&B integration
    uv sync --extra cpu --extra wandb
    
    # e.g. to install all extra functionalities at once
    uv sync --extra cpu --extra all
  3. Activate the virtual environment created by uv.
    source .venv/bin/activate

Pip

  1. Download the repository.
    git clone https://github.com/creatis-myriad/GENESIS
    cd GENESIS
  2. Create a virtual environment and activate it.
    python -m venv .venv
    source .venv/bin/activate
  3. Install PyTorch according to the official instructions. Follow the instructions for pip and the compute platform compatible with your system.
    # e.g. to install the PyTorch version built for CPU
    pip install torch --index-url https://download.pytorch.org/whl/cpu
    
    # e.g. to install the PyTorch version built for CUDA 12.8
    pip install torch --index-url https://download.pytorch.org/whl/cu128
  4. Install PyG and its torch_scatter dependency according to the official instructions. Follow the instructions for pip and the compute platform compatible with your system.
    # install PyG
    pip install torch_geometric
    
    # install `torch_scatter` optional dependency (e.g. for CUDA 12.8)
    pip install torch_scatter -f https://data.pyg.org/whl/torch-2.8.0+cu128.html
  5. Install the project in editable mode.
    pip install -e .
    [OPTIONAL] You can also specify other extras for additional functionalities:
    # e.g. to install the `wandb` extra for W&B integration
    pip install -e .[wandb]
    
    # e.g. to install all extra functionalities at once
    pip install -e .[all]

List of available extras

  • [cpu|cu129|cu128|cu126]: Required mutually exclusive extras to install the project with a PyTorch version built for CPU or a specific CUDA version (only available when using uv, not pip).
  • all: Install all (non-mutually exclusive) extras at once.
  • baselines: Extra dependencies required to run the baselines.
  • totalsegmentator: For using the pretrained TotalSegmentator model for segmenting heart ventricles to preprocess images.
  • wandb: For experiment tracking with Weights & Biases.

Set up Weights & Biases

Create an account

Follow the instructions on the Weights & Biases website to create an account.

Install W&B

Make sure that you install the wandb extra when installing the project, as shown in the installation instructions.

Configure your credentials

The recommended way to configure your W&B credentials is to expose them as environment variables (see W&B's documentation on this). You can do this by copying configs/local/example.yaml to a new default.yaml in the same directory (which will be ignored by Git) and filling in your W&B credentials.

You don't have to do anything more than that, as the project is configured to automatically load keys under hydra.job.env_set as environment variables when executing the scripts.
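
As a rough illustration, a configs/local/default.yaml exposing your credentials through hydra.job.env_set might look like the sketch below. The exact keys are an assumption; mirror the keys found in configs/local/example.yaml, which is the authoritative template. WANDB_API_KEY and WANDB_ENTITY are standard W&B environment variables.

```yaml
# Hypothetical sketch — adapt the keys from configs/local/example.yaml.
# Entries under `hydra.job.env_set` are exported as environment variables
# when the project's scripts are executed.
hydra:
  job:
    env_set:
      WANDB_API_KEY: <your-api-key>
      WANDB_ENTITY: <your-wandb-username-or-team>
```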

Use the wandb logger

Follow the instructions provided in the Track experiments section to enable experiment tracking via W&B.

Reproduce published experiments

The commands below are meant to reproduce the experiments described in the paper. They will run different combinations of models and data configurations in a 10-fold cross-validation setting.

Results are logged both locally and online to W&B (see the previous section for instructions on setting up W&B). Each run corresponds to a configuration trained on a specific cross-validation fold; to facilitate analysis, runs of the same configuration across folds are grouped together in W&B.

Depending on the type of model (tabular or GNN), different Python entry-point scripts are called.

Warning

Running the experiments below requires access to the PERSEVERE dataset, which is not publicly available. Thus, the scripts should not be expected to run as-is without the dataset. Rather, the scripts and code are provided for reference.

Tip

Since models are implemented in a dataset-agnostic way, implementing PyG datasets and providing corresponding configs should be all that is needed to test the models on other datasets.
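
As a rough illustration of what such a dataset needs to provide, PyG represents each graph with a node-feature matrix and an edge index in COO format (two rows: source nodes and target nodes), which in practice are wrapped as tensors in a torch_geometric.data.Data object. A minimal, framework-free sketch of that layout (the feature values and edges below are made up):

```python
# Hypothetical sketch of the per-graph arrays a PyG dataset must provide.
# In practice these become torch tensors inside a torch_geometric.data.Data object.

# One feature row per node (here: 3 nodes, 2 features each).
node_features = [
    [0.5, 1.2],
    [0.3, 0.9],
    [0.7, 1.1],
]

# Edges in COO format: row 0 holds source nodes, row 1 holds target nodes.
# An undirected edge is stored as two directed edges.
edge_index = [
    [0, 1, 1, 2],  # sources
    [1, 0, 2, 1],  # targets
]

num_nodes = len(node_features)
num_edges = len(edge_index[0])
```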

Ablation study of global features with tabular models for risk stratification

To run tabular models (TabPFN, XGBoost) on combinations of global features (medical records, cardiac biomarkers, vascular biomarkers):

scripts/train-persevere-tabular.sh

Benchmark of GNNs on vascular graph and global features for risk stratification

To run GNN backbones (GCN, GAT, GIN, GPS), with and without Virtual Nodes (VN) for MPNN backbones, and with different strategies to combine global features (early fusion (EF), late fusion (LF), virtual node (VN), Feature Tokenizer with cross-attention (FTxA)):

scripts/train-persevere-gnn.sh

Vascular biomarkers regression as sanity check on GNNs

To compare the best tabular and GNN backbones for the prediction of vascular biomarkers that are derived from local graph features:

  • Runs TabPFN on global features (medical records, cardiac biomarkers);
  • Runs GIN and GPS on the vascular graphs.
scripts/run-persevere-sanity-check-targets.sh

Ablation study of data and graph representations on the best GNN configuration

To run the best GNN configuration with alternative graph and global features representations:

# Test the primal graph representation.
# The default config uses the dual (i.e. line graph) representation.
scripts/run-persevere-gnn-ablation.sh graph_representation

# Test linear and TabPFN embedding of global features.
# The default config uses the Feature Tokenizer embedding.
scripts/run-persevere-gnn-ablation.sh global_features_embedding

# Test using a CLS token on global features as readout, i.e. graph-level representation.
# The default configuration uses global graph pooling (i.e., mean or sum depending on the config).
scripts/run-persevere-gnn-ablation.sh readout

Important

The results of these runs are meant to be compared to runs launched with the best GNN configuration, GPS + Feature Tokenizer with Cross-Attention (gps+ftxa), run as part of the GNN benchmark.

Tip

Calling the run-persevere-gnn-ablation.sh script with the name of one of the folders in configs/experiment/ablation will run all the experiment configs in that folder using the train.py script.

Run custom experiments

This section describes how to configure individual experiments, e.g., to change hyperparameters, models, datasets, etc., if you want more control over the configuration than the predefined batch of experiments described in the previous section.

The basics

Train a model with the default configuration (on the small MUTAG dataset).

# train on CPU
gnn-train trainer=cpu

# train on GPU
gnn-train trainer=gpu

Override any individual parameter in the config files from the command line like this:

# override the number of epochs and batch size
gnn-train trainer.max_epochs=20 data.batch_size=64 ...

# train default model on your dataset
gnn-train data/dataset=<YOUR_DATASET_CONFIG> ...

# train your model on the default dataset
gnn-train model=<YOUR_MODEL_CONFIG> ...

To evaluate a trained model, use the gnn-eval script.

# evaluate a trained model on your dataset's test set
gnn-eval data=<DATAMODULE_CONFIG> data/dataset=<DATASET_CONFIG> model=<MODEL_CONFIG> ckpt_path=<PATH_TO_CHECKPOINT>

Use preset configs

Train a model with a chosen experiment configuration from configs/experiment/.

Tip

This allows you to provide (complete) presets on top of the default configuration, typically for experiments you want to run regularly.

gnn-train experiment=<YOUR_EXPERIMENT_CONFIG>

Track experiments

Experiment tracking is implemented with Weights & Biases, using W&B's integration in PyTorch Lightning.

Warning

You must have followed the W&B setup instructions to use this feature.

# track experiment online w/ W&B
gnn-train logger=wandb

# track experiment offline w/ W&B
gnn-train logger=wandb logger.wandb.offline=True

Run multiple experiments

Launch multiple experiments at once using the multirun (-m) option.

# run multiple experiments sequentially, here w/ 5 different seeds
gnn-train -m seed=0,1,2,3,4

Launch multiple experiments at once in parallel using the Joblib launcher for Hydra.

Note

The hydra-joblib-launcher plugin required for this feature is installed by default with the project, so there is no need to install it yourself.

# run multiple experiments in parallel, here w/ 5 different seeds
gnn-train -m hydra/launcher=joblib seed=0,1,2,3,4

Run automatic hyperparameter search with Optuna

Launch an automatic hyperparameter search using the Optuna sweeper for Hydra.

Warning

You have to make sure that the hparams_search config you use is compatible with the model, since hparams_search defines how to sweep over model-dependent config options.

# Example of a predefined Optuna config for graph-level models compatible with the default experiment
gnn-train hparams_search=graph_classification_optuna

Tip

Optuna can be used in a cross-validation setting, by evaluating each sampling of hyperparameters on the different dataset folds and reporting the average performance. However, this approach is not compatible with the default Optuna sweeper plugin, where each trial corresponds to one Hydra run, i.e. one model trained/evaluated on a specific partition of the dataset.

To support this feature, we rely on our custom serial_sweeper, designed to run multiple jobs in sequence within the same Hydra run and then aggregate the results of these jobs. By sweeping over the different folds with this sweeper, we support cross-validation with Optuna.

This is all handled already in the predefined Optuna config graph_classification_optuna for graph-level models. If you want to support this in your own Optuna config, all you have to do is use the predefined splits config for serial_sweeper and make sure that data/split=k_fold is used to split the data into multiple folds.

gnn-train [...] hparams_search=<YOUR_OPTUNA_CONFIG> data/split=k_fold serial_sweeper=splits

Run tests

Run the tests using Pytest.

# run all tests
pytest

# run a test package
pytest tests/integration

# run tests from a specific file
pytest tests/integration/test_train.py

# run all tests except those matching "slow" in their name
pytest -k "not slow"
