Welcome to the repository of HYFA (Hypergraph Factorisation for Multi-Tissue Gene Expression Imputation).
## Overview of HYFA
HYFA processes gene expression from a number of collected tissues (e.g. accessible tissues) and infers the transcriptomes of uncollected tissues.
## HYFA Workflow
- The model receives as input a variable number of gene expression samples $x^{(k)}_i$ corresponding to the collected tissues $k \in \mathcal{T}(i)$ of a given individual $i$. The samples $x^{(k)}_i$ are fed through an encoder that computes low-dimensional representations $e^{(k)}_{ij}$ for each metagene $j \in 1..M$. A metagene is a latent, low-dimensional representation that captures certain gene expression patterns of the high-dimensional input sample.
- These representations are then used as hyperedge features in a message passing neural network that operates on a hypergraph. In the hypergraph representation, a hyperedge labelled with $e^{(k)}_{ij}$ connects an individual $i$ with metagene $j$ and tissue $k$ if tissue $k$ was collected for individual $i$, i.e. $k \in \mathcal{T}(i)$. Through message passing, HYFA learns factorised representations of individual, tissue, and metagene nodes.
- To infer the gene expression of an uncollected tissue $u$ of individual $i$, the corresponding factorised representations are fed through a multilayer perceptron (MLP) that predicts low-dimensional features $e^{(u)}_{ij}$ for each metagene $j \in 1..M$. HYFA finally processes these latent representations through a decoder that recovers the uncollected gene expression sample $\hat{x}^{(u)}_i$.
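The hyperedge construction described above can be sketched in a few lines of plain Python. This is a hypothetical illustration, not the repository's actual API: the variable names, the toy random "encoder", and the dimensions are all assumptions. It only shows how each collected sample $x^{(k)}_i$ yields $M$ hyperedges $(i, k, j)$, each carrying a feature vector $e^{(k)}_{ij}$.

```python
import random

# Hypothetical sketch (names and shapes are assumptions, not HYFA's code).
# Each hyperedge connects an individual i, a collected tissue k, and a
# metagene j, and carries the encoder output e^{(k)}_{ij} as its feature.
random.seed(0)
NUM_GENES, NUM_METAGENES, LATENT_DIM = 50, 4, 8

collected = {"IND1": ["heart", "lung"], "IND2": ["kidney"]}  # T(i) per individual
tissue_index = {"heart": 0, "lung": 1, "kidney": 2}

# Toy per-metagene "encoder": a random linear map from genes to latent space.
W = [[[random.gauss(0, 1) for _ in range(LATENT_DIM)]
      for _ in range(NUM_GENES)] for _ in range(NUM_METAGENES)]

def encode(x):
    """Return one LATENT_DIM feature vector per metagene."""
    return [[sum(x[g] * W[j][g][d] for g in range(NUM_GENES))
             for d in range(LATENT_DIM)] for j in range(NUM_METAGENES)]

hyperedges, features = [], []
for individual, tissues in collected.items():
    for tissue in tissues:
        x = [random.gauss(0, 1) for _ in range(NUM_GENES)]  # sample x^{(k)}_i
        e = encode(x)
        for j in range(NUM_METAGENES):
            hyperedges.append((individual, tissue_index[tissue], j))
            features.append(e[j])

print(len(hyperedges))  # 3 samples x 4 metagenes = 12 hyperedges
```

In the real model, the per-hyperedge features are then refined by message passing over the individual, tissue, and metagene nodes they connect.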
- Clone this repository:

  ```bash
  git clone https://github.com/rvinas/HYFA.git
  ```

- Install the dependencies via the following command:

  ```bash
  pip install -r requirements.txt
  ```
The installation typically takes a few minutes.
To download the processed GTEx data, please follow these steps:
```bash
wget -O data/GTEx_data.csv.zip https://figshare.com/ndownloader/files/40208074
wget -O data/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt https://storage.googleapis.com/adult-gtex/annotations/v8/metadata-files/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt
unzip data/GTEx_data.csv.zip -d data
```
To download the pre-trained model, please run this command:
```bash
wget -O data/normalised_model_default.pth https://figshare.com/ndownloader/files/40208551
```
- Prepare your dataset:
  - By default, the script `train_gtex.py` loads a dataset from a CSV file (`GTEX_FILE`) with the following format:
    - Columns are genes and rows are samples.
    - Entries correspond to normalised gene expression values.
    - The first row contains gene identifiers.
    - The first column contains donor identifiers. The file might contain multiple rows per donor.
    - An extra column `tissue` denotes the tissue from which the sample was collected. The combination of donor and tissue identifiers is unique.
  - The metadata is loaded from a separate CSV file (`METADATA_FILE`; see function `GTEx_metadata` in `train_gtex.py`). Rows correspond to donors and columns to covariates. By default, the script expects at least two columns: `AGE` (integer) and `SEX` (integer).

  Example of gene expression CSV file:

  ```
  , GENE1, GENE2, GENE3, tissue
  INDIVIDUAL1, 0.0, 0.1, 0.2, heart
  INDIVIDUAL1, 0.0, 0.1, 0.2, lung
  INDIVIDUAL1, 0.0, 0.1, 0.2, breast
  INDIVIDUAL2, 0.0, 0.1, 0.2, kidney
  INDIVIDUAL3, 0.0, 0.1, 0.2, kidney
  ```

  Example of metadata CSV file:

  ```
  , AGE, SEX
  INDIVIDUAL1, 34, 0
  INDIVIDUAL2, 55, 1
  INDIVIDUAL3, 49, 1
  ```

  See the notebook `hyfa_tutorial.ipynb` for an overview of the data format and main features of HYFA.
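A quick way to sanity-check a dataset against this layout is a short stand-alone script. The snippet below is only an illustration of the documented format (the real loader lives in `train_gtex.py`); the toy CSV content and the variable names are assumptions.

```python
import csv
import io

# Toy gene-expression CSV in the format described above: genes as columns,
# samples as rows, donor IDs in the first column, and an extra tissue column.
expression_csv = """\
,GENE1,GENE2,GENE3,tissue
INDIVIDUAL1,0.0,0.1,0.2,heart
INDIVIDUAL1,0.0,0.1,0.2,lung
INDIVIDUAL2,0.0,0.1,0.2,kidney
"""

rows = list(csv.reader(io.StringIO(expression_csv)))
header, body = rows[0], rows[1:]

# The last column must be the tissue label.
assert header[-1] == "tissue"

# The combination of donor and tissue identifiers must be unique.
donor_tissue = [(r[0], r[-1]) for r in body]
assert len(donor_tissue) == len(set(donor_tissue))

# Everything between the donor column and the tissue column is a gene.
print("format OK:", len(body), "samples,", len(header) - 2, "genes")
```

The same pattern (check the `tissue` column, then check donor/tissue uniqueness) applies to a real `GTEX_FILE` read from disk.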
- Run the script `train_gtex.py` to train HYFA. This uses the default hyperparameters from `config/default.yaml`. After training, the model will be stored in your current working directory. We recommend training the model on a GPU machine (training takes between 15 and 30 minutes on an NVIDIA TITAN Xp).
- Once the model is trained, evaluate your results via the notebook `evaluate_GTEx_v8_normalised.ipynb`.
- `hyfa_tutorial.ipynb`: Tutorial of the main features of HYFA.
- `train_gtex.py`: Main script to train the multi-tissue imputation model on normalised GTEx data.
- `evaluate_GTEx_v8_normalised.ipynb`: Analysis of multi-tissue imputation quality on normalised data (i.e. model trained via `train_gtex.py`).
- `evaluate_GTEx_v9_signatures_normalised.ipynb`: Analysis of cell-type signature imputation (i.e. fine-tunes model on GTEx-v9).
- `src/data.py`: Data object encapsulating multi-tissue gene expression.
- `src/dataset.py`: Dataset that takes care of processing the data.
- `src/data_utils.py`: Data utilities.
- `src/hnn.py`: Hypergraph neural network.
- `src/hypergraph_layer.py`: Message passing on hypergraphs.
- `src/hnn_utils.py`: Hypergraph model utilities.
- `src/metagene_encoders.py`: Model transforming gene expression to metagene values.
- `src/metagene_decoders.py`: Model transforming metagene values to gene expression.
- `src/train_utils.py`: Train/eval loops.
- `src/distribions.py`: Count data distributions.
- `src/losses.py`: Loss functions for different data likelihoods.
- `src/pathway_utils.py`: Utilities to retrieve KEGG pathways.
- `src/ct_signature_utils.py`: Utilities for inferring cell-type signatures.
If you use this code for your research, please cite our paper:
```bibtex
@article{vinas2023hypergraph,
  title={Hypergraph factorization for multi-tissue gene expression imputation},
  author={Vi{\~n}as, Ramon and Joshi, Chaitanya K and Georgiev, Dobrik and Lin, Phillip and Dumitrascu, Bianca and Gamazon, Eric R and Li{\`o}, Pietro},
  journal={Nature Machine Intelligence},
  pages={1--15},
  year={2023},
  publisher={Nature Publishing Group UK London}
}
```

