This repository contains the implementation of an ncRNA detection method that combines an RNA-MSM pre-trained model with transfer learning to identify non-coding RNAs from sequence and structural information.
The pipeline follows a three-step process:
- Feature Extraction: Using the RNA-MSM pre-trained model to extract high-dimensional embeddings (768-dimensional vectors) from RNA Multiple Sequence Alignments (MSA)
- Transfer Learning Classification: Applying machine learning models (Random Forest, XGBoost, MLP) trained on the Rfam database using the RNA-MSM embeddings
- Binary Classification: Determining whether an RNA sequence is coding or non-coding with detailed confidence metrics
┌─────────────────────────────────────────────────────────────────────────┐
│ ncRNA Classification Pipeline │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Input: RNA Multiple Sequence Alignment (MSA) │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ GGAAAUUCGGCA... │ │
│ │ GGAAACUCGG-A... │ │
│ │ GGAGAUUCGGCA... │ │
│ └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Step 1: Feature Extraction using RNA-MSM │
│ ┌───────────────────┐ │
│ │ │ │
│ │ RNAMSM Model │ ─► Pretrained model: RNA_MSM_pretrained.ckpt │
│ │ │ │
│ └───────────────────┘ │
│ │
│ Output: RNA embeddings [768-dimensional vector] │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Step 2: Transfer Learning-based Classification │
│ │
│ Models trained on Rfam database using RNA-MSM embeddings │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Random Forest │ │ XGBoost │ │ MLP │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Probability: │ │ Probability: │ │ Probability: │ │
│ │ 0.0 - 1.0 │ │ 0.0 - 1.0 │ │ 0.0 - 1.0 │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Step 3: Results Aggregation and Binary Classification │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Model Probability Classification Confidence │ │
│ │ -------------- ------------ ----------------- ------------- │ │
│ │ random_forest 0.873 Non-coding RNA 0.746 │ │
│ │ xgboost 0.895 Non-coding RNA 0.790 │ │
│ │ MLP 0.912 Non-coding RNA 0.824 │ │
│ │ ... │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
│ Final Consensus: Non-coding RNA (Confidence: High, 0.79/1.0) │
└─────────────────────────────────────────────────────────────────────────┘
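The aggregation in Step 3 can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the repository's actual code: it assumes per-model confidence is derived as 2·p − 1 (the probability's distance from the 0.5 decision boundary, rescaled to [0, 1]) and that the consensus averages those per-model confidences. Under those assumptions, the probabilities shown above reproduce the table's confidence values and the 0.79 consensus.

```python
# Hypothetical aggregation sketch -- the pipeline's actual formula may differ.
# Assumes confidence = 2 * p - 1 and a consensus that averages confidences.

probabilities = {"random_forest": 0.873, "xgboost": 0.895, "mlp": 0.912}

results = {
    name: {
        "classification": "Non-coding RNA" if p >= 0.5 else "Coding RNA",
        "confidence": round(2 * p - 1, 3),
    }
    for name, p in probabilities.items()
}

# Consensus label by majority vote; consensus confidence as the mean.
votes = [r["classification"] for r in results.values()]
consensus_label = max(set(votes), key=votes.count)
consensus_confidence = sum(r["confidence"] for r in results.values()) / len(results)

print(f"Final Consensus: {consensus_label} "
      f"(Confidence: {consensus_confidence:.2f}/1.0)")
```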
This project uses Poetry for dependency management and requires Python 3.12.1. Follow these steps to set up the development environment.
- Clone the repository:

```shell
git clone https://github.com/hmdlab/MSM-TLncRNA.git
cd MSM-TLncRNA
```

- Ensure you're using Python 3.12.1:

```shell
# If using pyenv
pyenv install 3.12.1
pyenv local 3.12.1
poetry env use 3.12.1

# Verify Python version
poetry env info  # Should output Python 3.12.1
```

- Install dependencies using Poetry:

```shell
poetry install --all-extras
```

This will create a virtual environment and install all required dependencies.
This project relies on several key libraries:
- PyTorch & PyTorch Lightning: For deep learning models and training
- Machine Learning Libraries: scikit-learn, XGBoost, LightGBM, CatBoost
- Bioinformatics Tools: Biopython, tape-proteins, multimolecule
The project includes several tools for development:
- Ruff: For linting and code formatting
- MyPy: For static type checking
- Pytest: For unit testing
- Pre-commit: For git hooks to ensure code quality
```shell
poetry run pre-commit install
```
The pipeline requires pretrained model files to function properly. Please ensure the following files are in place:

- RNA-MSM Pretrained Model:
  - Follow the instructions in the RNA-MSM GitHub repository to download the pretrained RNA-MSM model file (`RNA_MSM_pretrained.ckpt`)
  - Place it in `artifacts/rnamsm/RNA_MSM_pretrained.ckpt`:

```
artifacts
└── rnamsm
    └── RNA_MSM_pretrained.ckpt
```

- Pretrained Classifier Models:
  - Download the trained classifier models here
  - Place them in the `artifacts/ncrna_classifiers/` directory:

```
artifacts
├── ncrna_classifiers
│   ├── models
│   │   ├── catboost.joblib
│   │   ├── lightgbm.joblib
│   │   ├── logistic_regression.joblib
│   │   ├── mlp.joblib
│   │   ├── naive_bayes.joblib
│   │   ├── random_forest.joblib
│   │   ├── svm.joblib
│   │   └── xgboost.joblib
│   └── scaler.joblib
```
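Once the artifacts are in place, the classifiers can be loaded in the usual joblib fashion. This is a minimal sketch, not the repository's own code: it assumes each `.joblib` file is a scikit-learn-style estimator exposing `predict_proba`, that `scaler.joblib` is a fitted scaler with `transform`, and that class index 1 corresponds to "non-coding" (none of these details are confirmed by this README).

```python
from pathlib import Path

# Paths and model names taken from the directory tree above.
CLASSIFIER_DIR = Path("artifacts/ncrna_classifiers")
MODEL_NAMES = [
    "catboost", "lightgbm", "logistic_regression", "mlp",
    "naive_bayes", "random_forest", "svm", "xgboost",
]

def classify_embedding(embedding: list[float]) -> dict[str, float]:
    """Score one 768-dim RNA-MSM embedding with every classifier.

    Sketch only: assumes scikit-learn-style estimators and that
    predict_proba's column 1 is P(non-coding).
    """
    import joblib  # imported lazily; used here purely for deserialization

    scaler = joblib.load(CLASSIFIER_DIR / "scaler.joblib")
    x = scaler.transform([embedding])  # shape (1, 768)
    probs = {}
    for name in MODEL_NAMES:
        model = joblib.load(CLASSIFIER_DIR / "models" / f"{name}.joblib")
        probs[name] = float(model.predict_proba(x)[0, 1])
    return probs
```

Note that joblib files execute arbitrary code on load, so only use artifacts obtained from a trusted source.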
To reproduce the training process and experiments described in this repository, please refer to the `train` branch, available here.
The ncRNA detection pipeline predicts whether RNA sequences are non-coding by extracting features with RNA-MSM and applying trained classifiers.
The simplest way to use the pipeline is through the command-line interface:
```shell
poetry run python main.py --input examples/demo_data --output results.csv --verbose
```

| Argument | Description | Default |
|---|---|---|
| `--input` | Path to the input RNA MSA data directory (containing `.a2m_msa2` files) | N/A |
| `--output` | Path to save the results CSV file | None |
| `--msm-model` | Path to the pre-trained RNA-MSM model | `artifacts/rnamsm/RNA_MSM_pretrained.ckpt` |
| `--classifier-path` | Path to the directory containing classifier models | `artifacts/ncrna_classifiers` |
| `--verbose` | Enable detailed logging | False |
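The results CSV can be post-processed with the standard library. A small sketch with inline demo data follows; the column names (`Model`, `Probability`, `Classification`, `Confidence`) are assumptions inferred from the Step 3 table, so check the actual header of your `results.csv` before relying on them.

```python
import csv
import io

# Inline demo data mirroring the Step 3 table; in practice,
# open("results.csv") would replace the StringIO below.
demo_csv = """\
Model,Probability,Classification,Confidence
random_forest,0.873,Non-coding RNA,0.746
xgboost,0.895,Non-coding RNA,0.790
mlp,0.912,Non-coding RNA,0.824
"""

rows = list(csv.DictReader(io.StringIO(demo_csv)))
best = max(rows, key=lambda r: float(r["Probability"]))
print(f"Highest-probability model: {best['Model']} ({best['Probability']})")
```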
Note: For detailed information about usage, please refer to `examples/README.md`.
This project is licensed under the MIT License - see the LICENSE file for details.
This work is heavily based on the RNA-MSM project which is also licensed under the MIT License.
```bibtex
@article{10.1093/nar/gkad1031,
    author = {Zhang, Yikun and Lang, Mei and Jiang, Jiuhong and Gao, Zhiqiang and Xu, Fan and Litfin, Thomas and Chen, Ke and Singh, Jaswinder and Huang, Xiansong and Song, Guoli and Tian, Yonghong and Zhan, Jian and Chen, Jie and Zhou, Yaoqi},
    title = "{Multiple sequence alignment-based RNA language model and its application to structural inference}",
    journal = {Nucleic Acids Research},
    volume = {52},
    number = {1},
    pages = {e3-e3},
    year = {2023},
    month = {11},
    abstract = "{Compared with proteins, DNA and RNA are more difficult languages to interpret because four-letter coded DNA/RNA sequences have less information content than 20-letter coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because, unlike proteins, RNA sequences are less conserved. Here, we have developed an unsupervised multiple sequence alignment-based RNA language model (RNA-MSM) by utilizing homologous sequences from an automatic pipeline, RNAcmap, as it can provide significantly more homologous sequences than manually annotated Rfam. We demonstrate that the resulting unsupervised, two-dimensional attention maps and one-dimensional embeddings from RNA-MSM contain structural information. In fact, they can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks compared with existing state-of-the-art techniques including SPOT-RNA2 and RNAsnap2. By comparison, RNA-FM, a BERT-based RNA language model, performs worse than one-hot encoding with its embedding in base pair and solvent-accessible surface area prediction. We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.}",
    issn = {0305-1048},
    doi = {10.1093/nar/gkad1031},
    url = {https://doi.org/10.1093/nar/gkad1031},
    eprint = {https://academic.oup.com/nar/article-pdf/52/1/e3/55443207/gkad1031.pdf},
}
```