Welcome to the repository of HYFA (Hypergraph Factorisation for Multi-Tissue Gene Expression Imputation).
## Overview of HYFA
HYFA processes gene expression from a number of collected tissues (e.g. accessible tissues) and infers the transcriptomes of uncollected tissues.
## HYFA Workflow
- The model receives as input a variable number of gene expression samples $x^{(k)}_i$ corresponding to the collected tissues $k \in \mathcal{T}(i)$ of a given individual $i$. The samples $x^{(k)}_i$ are fed through an encoder that computes low-dimensional representations $e^{(k)}_{ij}$ for each metagene $j \in 1..M$. A metagene is a latent, low-dimensional representation that captures certain gene expression patterns of the high-dimensional input sample.
- These representations are then used as hyperedge features in a message passing neural network that operates on a hypergraph. In the hypergraph representation, a hyperedge labelled with $e^{(k)}_{ij}$ connects an individual $i$ with metagene $j$ and tissue $k$ if tissue $k$ was collected for individual $i$, i.e. $k \in \mathcal{T}(i)$. Through message passing, HYFA learns factorised representations of individual, tissue, and metagene nodes.
- To infer the gene expression of an uncollected tissue $u$ of individual $i$, the corresponding factorised representations are fed through a multilayer perceptron (MLP) that predicts low-dimensional features $e^{(u)}_{ij}$ for each metagene $j \in 1..M$. HYFA finally processes these latent representations through a decoder that recovers the uncollected gene expression sample $\hat{x}^{(u)}_i$.
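The hyperedge construction described above can be sketched in a few lines of plain Python. This is a hypothetical illustration, not the repository's actual API: the variable names, the toy random "encoder", and the dimensions are all assumptions. It only shows how each collected sample $x^{(k)}_i$ yields $M$ hyperedges $(i, k, j)$, each carrying a feature vector $e^{(k)}_{ij}$.

```python
import random

# Hypothetical sketch (names and shapes are assumptions, not HYFA's code).
# Each hyperedge connects an individual i, a collected tissue k, and a
# metagene j, and carries the encoder output e^{(k)}_{ij} as its feature.
random.seed(0)
NUM_GENES, NUM_METAGENES, LATENT_DIM = 50, 4, 8

collected = {"IND1": ["heart", "lung"], "IND2": ["kidney"]}  # T(i) per individual
tissue_index = {"heart": 0, "lung": 1, "kidney": 2}

# Toy per-metagene "encoder": a random linear map from genes to latent space.
W = [[[random.gauss(0, 1) for _ in range(LATENT_DIM)]
      for _ in range(NUM_GENES)] for _ in range(NUM_METAGENES)]

def encode(x):
    """Return one LATENT_DIM feature vector per metagene."""
    return [[sum(x[g] * W[j][g][d] for g in range(NUM_GENES))
             for d in range(LATENT_DIM)] for j in range(NUM_METAGENES)]

hyperedges, features = [], []
for individual, tissues in collected.items():
    for tissue in tissues:
        x = [random.gauss(0, 1) for _ in range(NUM_GENES)]  # sample x^{(k)}_i
        e = encode(x)
        for j in range(NUM_METAGENES):
            hyperedges.append((individual, tissue_index[tissue], j))
            features.append(e[j])

print(len(hyperedges))  # 3 samples x 4 metagenes = 12 hyperedges
```

In the real model, the per-hyperedge features are then refined by message passing over the individual, tissue, and metagene nodes they connect.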
- Clone this repository:

  ```bash
  git clone https://github.com/rvinas/HYFA.git
  ```

- Install the dependencies via the following command:

  ```bash
  pip install -r requirements.txt
  ```
The installation typically takes a few minutes.
To download the processed GTEx data, please follow these steps:
```bash
wget -O data/GTEx_data.csv.zip https://figshare.com/ndownloader/files/40208074
wget -O data/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt https://storage.googleapis.com/adult-gtex/annotations/v8/metadata-files/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt
unzip data/GTEx_data.csv.zip -d data
```
To download the pre-trained model, please run this command:
```bash
wget -O data/normalised_model_default.pth https://figshare.com/ndownloader/files/40208551
```
- Prepare your dataset:
  - By default, the script `train_gtex.py` loads a dataset from a CSV file (`GTEX_FILE`) with the following format:
    - Columns are genes and rows are samples.
    - Entries correspond to normalised gene expression values.
    - The first row contains gene identifiers.
    - The first column contains donor identifiers. The file might contain multiple rows per donor.
    - An extra column `tissue` denotes the tissue from which the sample was collected. The combination of donor and tissue identifiers is unique.
  - The metadata is loaded from a separate CSV file (`METADATA_FILE`; see function `GTEx_metadata` in `train_gtex.py`). Rows correspond to donors and columns to covariates. By default, the script expects at least two columns: `AGE` (integer) and `SEX` (integer).

  Example of gene expression CSV file:

  ```
  , GENE1, GENE2, GENE3, tissue
  INDIVIDUAL1, 0.0, 0.1, 0.2, heart
  INDIVIDUAL1, 0.0, 0.1, 0.2, lung
  INDIVIDUAL1, 0.0, 0.1, 0.2, breast
  INDIVIDUAL2, 0.0, 0.1, 0.2, kidney
  INDIVIDUAL3, 0.0, 0.1, 0.2, kidney
  ```

  Example of metadata CSV file:

  ```
  , AGE, SEX
  INDIVIDUAL1, 34, 0
  INDIVIDUAL2, 55, 1
  INDIVIDUAL3, 49, 1
  ```

  See the notebook `hyfa_tutorial.ipynb` for an overview of the data format and main features of HYFA.
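A quick way to sanity-check a dataset against this layout is a short stand-alone script. The snippet below is only an illustration of the documented format (the real loader lives in `train_gtex.py`); the toy CSV content and the variable names are assumptions.

```python
import csv
import io

# Toy gene-expression CSV in the format described above: genes as columns,
# samples as rows, donor IDs in the first column, and an extra tissue column.
expression_csv = """\
,GENE1,GENE2,GENE3,tissue
INDIVIDUAL1,0.0,0.1,0.2,heart
INDIVIDUAL1,0.0,0.1,0.2,lung
INDIVIDUAL2,0.0,0.1,0.2,kidney
"""

rows = list(csv.reader(io.StringIO(expression_csv)))
header, body = rows[0], rows[1:]

# The last column must be the tissue label.
assert header[-1] == "tissue"

# The combination of donor and tissue identifiers must be unique.
donor_tissue = [(r[0], r[-1]) for r in body]
assert len(donor_tissue) == len(set(donor_tissue))

# Everything between the donor column and the tissue column is a gene.
print("format OK:", len(body), "samples,", len(header) - 2, "genes")
```

The same pattern (check the `tissue` column, then check donor/tissue uniqueness) applies to a real `GTEX_FILE` read from disk.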
- Run the script `train_gtex.py` to train HYFA. This uses the default hyperparameters from `config/default.yaml`. After training, the model will be stored in your current working directory. We recommend training the model on a GPU machine (training takes between 15 and 30 minutes on an NVIDIA TITAN Xp).
- Once the model is trained, evaluate your results via the notebook `evaluate_GTEx_v8_normalised.ipynb`.
- `hyfa_tutorial.ipynb`: Tutorial of the main features of HYFA.
- `train_gtex.py`: Main script to train the multi-tissue imputation model on normalised GTEx data.
- `evaluate_GTEx_v8_normalised.ipynb`: Analysis of multi-tissue imputation quality on normalised data (i.e. model trained via `train_gtex.py`).
- `evaluate_GTEx_v9_signatures_normalised.ipynb`: Analysis of cell-type signature imputation (i.e. fine-tunes model on GTEx-v9).
- `src/data.py`: Data object encapsulating multi-tissue gene expression.
- `src/dataset.py`: Dataset that takes care of processing the data.
- `src/data_utils.py`: Data utilities.
- `src/hnn.py`: Hypergraph neural network.
- `src/hypergraph_layer.py`: Message passing on hypergraphs.
- `src/hnn_utils.py`: Hypergraph model utilities.
- `src/metagene_encoders.py`: Model transforming gene expression to metagene values.
- `src/metagene_decoders.py`: Model transforming metagene values to gene expression.
- `src/train_utils.py`: Train/eval loops.
- `src/distribions.py`: Count data distributions.
- `src/losses.py`: Loss functions for different data likelihoods.
- `src/pathway_utils.py`: Utilities to retrieve KEGG pathways.
- `src/ct_signature_utils.py`: Utilities for inferring cell-type signatures.
If you use this code for your research, please cite our paper:
```bibtex
@article{vinas2023hypergraph,
  title={Hypergraph factorization for multi-tissue gene expression imputation},
  author={Vi{\~n}as, Ramon and Joshi, Chaitanya K and Georgiev, Dobrik and Lin, Phillip and Dumitrascu, Bianca and Gamazon, Eric R and Li{\`o}, Pietro},
  journal={Nature Machine Intelligence},
  pages={1--15},
  year={2023},
  publisher={Nature Publishing Group UK London}
}
```

