This repository contains the implementation of an ncRNA detection method that combines an RNA-MSM pre-trained model with transfer learning to identify non-coding RNAs from sequence and structural information.
The pipeline follows a three-step process:
- Feature Extraction: Using the RNA-MSM pre-trained model to extract high-dimensional embeddings (768-dimensional vectors) from RNA Multiple Sequence Alignments (MSA)
- Transfer Learning Classification: Applying machine learning models (Random Forest, XGBoost, MLP) trained on the Rfam database using the RNA-MSM embeddings
- Binary Classification: Determining whether an RNA sequence is coding or non-coding with detailed confidence metrics
┌─────────────────────────────────────────────────────────────────────────┐
│ ncRNA Classification Pipeline │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Input: RNA Multiple Sequence Alignment (MSA) │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ GGAAAUUCGGCA... │ │
│ │ GGAAACUCGG-A... │ │
│ │ GGAGAUUCGGCA... │ │
│ └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Step 1: Feature Extraction using RNA-MSM │
│ ┌───────────────────┐ │
│ │ │ │
│ │ RNAMSM Model │ ─► Pretrained model: RNA_MSM_pretrained.ckpt │
│ │ │ │
│ └───────────────────┘ │
│ │
│ Output: RNA embeddings [768-dimensional vector] │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Step 2: Transfer Learning-based Classification │
│ │
│ Models trained on Rfam database using RNA-MSM embeddings │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Random Forest │ │ XGBoost │ │ MLP │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Probability: │ │ Probability: │ │ Probability: │ │
│ │ 0.0 - 1.0 │ │ 0.0 - 1.0 │ │ 0.0 - 1.0 │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Step 3: Results Aggregation and Binary Classification │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Model Probability Classification Confidence │ │
│ │ -------------- ------------ ----------------- ------------- │ │
│ │ random_forest 0.873 Non-coding RNA 0.746 │ │
│ │ xgboost 0.895 Non-coding RNA 0.790 │ │
│ │ MLP 0.912 Non-coding RNA 0.824 │ │
│ │ ... │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
│ Final Consensus: Non-coding RNA (Confidence: High, 0.79/1.0) │
└─────────────────────────────────────────────────────────────────────────┘
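The aggregation in Step 3 can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the repository's actual code: it assumes per-model confidence is derived as 2·p − 1 (the probability's distance from the 0.5 decision boundary, rescaled to [0, 1]) and that the consensus averages those per-model confidences. Under those assumptions, the probabilities shown above reproduce the table's confidence values and the 0.79 consensus.

```python
# Hypothetical aggregation sketch -- the pipeline's actual formula may differ.
# Assumes confidence = 2 * p - 1 and a consensus that averages confidences.

probabilities = {"random_forest": 0.873, "xgboost": 0.895, "mlp": 0.912}

results = {
    name: {
        "classification": "Non-coding RNA" if p >= 0.5 else "Coding RNA",
        "confidence": round(2 * p - 1, 3),
    }
    for name, p in probabilities.items()
}

# Consensus label by majority vote; consensus confidence as the mean.
votes = [r["classification"] for r in results.values()]
consensus_label = max(set(votes), key=votes.count)
consensus_confidence = sum(r["confidence"] for r in results.values()) / len(results)

print(f"Final Consensus: {consensus_label} "
      f"(Confidence: {consensus_confidence:.2f}/1.0)")
```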
This project uses Poetry for dependency management and requires Python 3.12.1. Follow these steps to set up the development environment.
- Clone the repository:

```shell
git clone https://github.com/hmdlab/MSM-TLncRNA.git
cd MSM-TLncRNA
```

- Ensure you're using Python 3.12.1:

```shell
# If using pyenv
pyenv install 3.12.1
pyenv local 3.12.1
poetry env use 3.12.1

# Verify Python version
poetry env info  # Should output Python 3.12.1
```

- Install dependencies using Poetry:

```shell
poetry install --all-extras
```

This will create a virtual environment and install all required dependencies.
This project relies on several key libraries:
- PyTorch & PyTorch Lightning: For deep learning models and training
- Machine Learning Libraries: scikit-learn, XGBoost, LightGBM, CatBoost
- Bioinformatics Tools: Biopython, tape-proteins, multimolecule
The project includes several tools for development:
- Ruff: For linting and code formatting
- MyPy: For static type checking
- Pytest: For unit testing
- Pre-commit: For git hooks to ensure code quality
```shell
poetry run pre-commit install
```
The pipeline requires pretrained model files to function properly. Please ensure the following files are in place:

- RNA-MSM Pretrained Model:
  - Follow the instructions in the RNA-MSM GitHub repository to download the pretrained RNA-MSM model file (`RNA_MSM_pretrained.ckpt`)
  - Place it in `artifacts/rnamsm/RNA_MSM_pretrained.ckpt`:

```
artifacts
└── rnamsm
    └── RNA_MSM_pretrained.ckpt
```

- Pretrained Classifier Models:
  - Download the trained classifier models here
  - Place them in the `artifacts/ncrna_classifiers/` directory:

```
artifacts
├── ncrna_classifiers
│   ├── models
│   │   ├── catboost.joblib
│   │   ├── lightgbm.joblib
│   │   ├── logistic_regression.joblib
│   │   ├── mlp.joblib
│   │   ├── naive_bayes.joblib
│   │   ├── random_forest.joblib
│   │   ├── svm.joblib
│   │   └── xgboost.joblib
│   └── scaler.joblib
```
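Once the artifacts are in place, the classifiers can be loaded in the usual joblib fashion. This is a minimal sketch, not the repository's own code: it assumes each `.joblib` file is a scikit-learn-style estimator exposing `predict_proba`, that `scaler.joblib` is a fitted scaler with `transform`, and that class index 1 corresponds to "non-coding" (none of these details are confirmed by this README).

```python
from pathlib import Path

# Paths and model names taken from the directory tree above.
CLASSIFIER_DIR = Path("artifacts/ncrna_classifiers")
MODEL_NAMES = [
    "catboost", "lightgbm", "logistic_regression", "mlp",
    "naive_bayes", "random_forest", "svm", "xgboost",
]

def classify_embedding(embedding: list[float]) -> dict[str, float]:
    """Score one 768-dim RNA-MSM embedding with every classifier.

    Sketch only: assumes scikit-learn-style estimators and that
    predict_proba's column 1 is P(non-coding).
    """
    import joblib  # imported lazily; used here purely for deserialization

    scaler = joblib.load(CLASSIFIER_DIR / "scaler.joblib")
    x = scaler.transform([embedding])  # shape (1, 768)
    probs = {}
    for name in MODEL_NAMES:
        model = joblib.load(CLASSIFIER_DIR / "models" / f"{name}.joblib")
        probs[name] = float(model.predict_proba(x)[0, 1])
    return probs
```

Note that joblib files execute arbitrary code on load, so only use artifacts obtained from a trusted source.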
To reproduce the training process and experiments described in this repository, please refer to the `train` branch, available here.
The ncRNA detection pipeline predicts whether RNA sequences are non-coding by extracting features with RNA-MSM and applying trained classifiers.
The simplest way to use the pipeline is through the command-line interface:
```shell
poetry run python main.py --input examples/demo_data --output results.csv --verbose
```

| Argument | Description | Default |
|---|---|---|
| `--input` | Path to the input RNA MSA data directory (containing `.a2m_msa2` files) | N/A |
| `--output` | Path to save the results CSV file | None |
| `--msm-model` | Path to the pre-trained RNA-MSM model | `artifacts/rnamsm/RNA_MSM_pretrained.ckpt` |
| `--classifier-path` | Path to the directory containing classifier models | `artifacts/ncrna_classifiers` |
| `--verbose` | Enable detailed logging | False |
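The results CSV can be post-processed with the standard library. A small sketch with inline demo data follows; the column names (`Model`, `Probability`, `Classification`, `Confidence`) are assumptions inferred from the Step 3 table, so check the actual header of your `results.csv` before relying on them.

```python
import csv
import io

# Inline demo data mirroring the Step 3 table; in practice,
# open("results.csv") would replace the StringIO below.
demo_csv = """\
Model,Probability,Classification,Confidence
random_forest,0.873,Non-coding RNA,0.746
xgboost,0.895,Non-coding RNA,0.790
mlp,0.912,Non-coding RNA,0.824
"""

rows = list(csv.DictReader(io.StringIO(demo_csv)))
best = max(rows, key=lambda r: float(r["Probability"]))
print(f"Highest-probability model: {best['Model']} ({best['Probability']})")
```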
Note: For detailed information about usage, please refer to `examples/README.md`.
This project is licensed under the MIT License - see the LICENSE file for details.
This work is heavily based on the RNA-MSM project which is also licensed under the MIT License.
```bibtex
@article{10.1093/nar/gkad1031,
    author = {Zhang, Yikun and Lang, Mei and Jiang, Jiuhong and Gao, Zhiqiang and Xu, Fan and Litfin, Thomas and Chen, Ke and Singh, Jaswinder and Huang, Xiansong and Song, Guoli and Tian, Yonghong and Zhan, Jian and Chen, Jie and Zhou, Yaoqi},
    title = "{Multiple sequence alignment-based RNA language model and its application to structural inference}",
    journal = {Nucleic Acids Research},
    volume = {52},
    number = {1},
    pages = {e3-e3},
    year = {2023},
    month = {11},
    abstract = "{Compared with proteins, DNA and RNA are more difficult languages to interpret because four-letter coded DNA/RNA sequences have less information content than 20-letter coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because, unlike proteins, RNA sequences are less conserved. Here, we have developed an unsupervised multiple sequence alignment-based RNA language model (RNA-MSM) by utilizing homologous sequences from an automatic pipeline, RNAcmap, as it can provide significantly more homologous sequences than manually annotated Rfam. We demonstrate that the resulting unsupervised, two-dimensional attention maps and one-dimensional embeddings from RNA-MSM contain structural information. In fact, they can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks compared with existing state-of-the-art techniques including SPOT-RNA2 and RNAsnap2. By comparison, RNA-FM, a BERT-based RNA language model, performs worse than one-hot encoding with its embedding in base pair and solvent-accessible surface area prediction. We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.}",
    issn = {0305-1048},
    doi = {10.1093/nar/gkad1031},
    url = {https://doi.org/10.1093/nar/gkad1031},
    eprint = {https://academic.oup.com/nar/article-pdf/52/1/e3/55443207/gkad1031.pdf},
}
```