EMP-FM is a foundation model that can classify epithelial-mesenchymal transition (EMT) states in single cell RNA-seq data.
Epithelial–mesenchymal plasticity (EMP) plays a significant role in various biological processes including tumour progression and chemoresistance. However, the expression programmes underlying the epithelial–mesenchymal transition (EMT) in cancer are diverse, and accurately defining the EMT status of tumour cells remains a challenging task. In this study, we employed a pre-trained single-cell foundation model (scFM) to develop an EMP-foundation model (EMP-FM) that allows us to capture discrete states within the EMT continuum in single cell cancer data. In capturing EMP states, we achieved an average Area Under the Receiver Operating Characteristic curve (AUROC) of 90% across multiple cancer types. We propose a new metric, ADESI, to aid the biological interpretability of our model, and derive EMP signatures linked with energy metabolism and motility reprogramming underlying these state switches. Our study provides a proof of concept that scFMs can be applied to characterise cell states in single cell data, and proposes a generalisable framework to predict EMP in single cell RNA-seq that can be adapted and expanded to characterise other cellular states.
The preprint presenting this tool, "Classifying epithelial-mesenchymal transition (EMT) states in single cell cancer data using large language models", is available on bioRxiv.
To set up the environment, you can use either Conda or pip:
Run the following command to recreate the environment using the saved conda environment file:
conda env create -f environment.yml
Run the following command to recreate the environment using the saved pip environment file:
pip install -r requirements.txt
The code of the scMultiNet generic classifier is included in the scFM folder. All the code for training, validating and applying the EMP-FM model is included in the Experiment folder.
The Experiment folder is structured as follows:
All the code for preprocessing the raw data in our manuscript, including the generation of the count matrix and the annotation file. To generate the count matrix and the annotation file for your own dataset, use the "0_preprocess_example.ipynb" notebook. Before running it, create a Data folder inside the Step_0_preprocess_raw_data folder to store the processed data.
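As a rough illustration of what a count matrix and an annotation file contain, the sketch below pivots toy (cell, gene, count) records into a cells x genes table and writes both outputs as CSV. The file layout, the column names `cell_id` and `emt_state`, the state labels and all function names are assumptions made for this example; they are not taken from "0_preprocess_example.ipynb", which remains the reference for the format EMP-FM expects.

```python
import csv
import io


def build_count_matrix(triplets):
    """Pivot (cell, gene, count) records into a dense cells x genes table."""
    cells = sorted({cell for cell, _, _ in triplets})
    genes = sorted({gene for _, gene, _ in triplets})
    cell_index = {cell: i for i, cell in enumerate(cells)}
    gene_index = {gene: j for j, gene in enumerate(genes)}
    matrix = [[0] * len(genes) for _ in cells]
    for cell, gene, count in triplets:
        # Sum duplicates so repeated records for one cell/gene pair are kept.
        matrix[cell_index[cell]][gene_index[gene]] += count
    return cells, genes, matrix


def write_counts_csv(cells, genes, matrix, handle):
    """Write one row per cell and one column per gene (hypothetical layout)."""
    writer = csv.writer(handle)
    writer.writerow(["cell_id"] + genes)
    for cell, row in zip(cells, matrix):
        writer.writerow([cell] + row)


def write_annotation_csv(labels, handle):
    """Write the per-cell label table (hypothetical column names)."""
    writer = csv.writer(handle)
    writer.writerow(["cell_id", "emt_state"])
    for cell in sorted(labels):
        writer.writerow([cell, labels[cell]])


# Toy input: CDH1 is an epithelial marker, VIM a mesenchymal marker.
toy_triplets = [
    ("cell_1", "CDH1", 5),
    ("cell_1", "VIM", 0),
    ("cell_2", "VIM", 7),
    ("cell_2", "CDH1", 1),
]
toy_labels = {"cell_1": "epithelial", "cell_2": "mesenchymal"}

cells, genes, matrix = build_count_matrix(toy_triplets)

# In-memory buffers stand in for files inside the Data folder.
counts_buffer = io.StringIO()
annotation_buffer = io.StringIO()
write_counts_csv(cells, genes, matrix, counts_buffer)
write_annotation_csv(toy_labels, annotation_buffer)
```

To write real files, replace the `io.StringIO()` buffers with handles opened via `open(path, "w", newline="")`.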
All the code for training the EMP-FM model in phase 1 in our manuscript.
All the code for training the EMP-FM model in phase 2 in our manuscript.
baseline_roc_confusion.ipynb: visualise the ROC curve and the confusion matrix of the baseline models. It provides a comparison between the baseline models and the EMP-FM model.
plot_ROC_confusion.ipynb: visualise the ROC curve and the confusion matrix of the EMP-FM model for different tissue types.
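For readers who want to see what these two evaluation outputs measure, here is a minimal pure-Python sketch of a confusion matrix and a macro-averaged one-vs-rest AUROC. It is not the code in the notebooks: the class names, the toy probabilities and the choice of macro one-vs-rest averaging are assumptions for illustration, and the averaging scheme behind the reported AUROC may differ.

```python
def binary_auroc(scores, is_positive):
    """AUROC as the chance a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_ovr_auroc(probabilities, true_labels, classes):
    """Average the one-vs-rest AUROC over all classes (unweighted)."""
    aurocs = []
    for k, cls in enumerate(classes):
        scores = [row[k] for row in probabilities]
        is_positive = [label == cls for label in true_labels]
        aurocs.append(binary_auroc(scores, is_positive))
    return sum(aurocs) / len(aurocs)


def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {cls: i for i, cls in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, pred in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[pred]] += 1
    return matrix


# Hypothetical state names and predicted probabilities for six cells.
classes = ["epithelial", "hybrid", "mesenchymal"]
true_labels = ["epithelial", "epithelial", "hybrid",
               "hybrid", "mesenchymal", "mesenchymal"]
probabilities = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.25, 0.25],  # a hybrid cell that the toy model calls epithelial
    [0.1, 0.2, 0.7],
    [0.1, 0.3, 0.6],
]

# Predicted label = class with the highest probability.
predicted_labels = [
    classes[max(range(len(row)), key=row.__getitem__)] for row in probabilities
]

toy_confusion = confusion_matrix(true_labels, predicted_labels, classes)
toy_macro_auroc = macro_ovr_auroc(probabilities, true_labels, classes)
```

The pairwise AUROC loop is quadratic in the number of cells, which is fine for a toy example; for real datasets a library routine such as scikit-learn's `roc_auc_score` is the more practical choice.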
All of the code for validating the EMP-FM model on the unseen dataset in our paper.
Visualise the embedding space of the EMP-FM model and plot the trajectory of the EMP states in the embedding space.
Visualise the ADESI score of the EMP-FM model.
If you find a bug or want to suggest a new feature for EMP-FM, please open a GitHub issue in this repository. Pull requests are also welcome!
EMP-FM is released under the GNU-GPL License. This code is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY.
