MSM-TLncRNA: Transfer Learning with RNA-MSM for Enhanced Non-coding RNA Detection

This repository contains the implementation of an ncRNA detection method that combines the RNA-MSM pre-trained model with transfer learning to identify non-coding RNAs from sequence and structural information.

⚫︎ Overview

The pipeline follows a three-step process:

  1. Feature Extraction: Using the RNA-MSM pre-trained model to extract high-dimensional embeddings (768-dimensional vectors) from RNA Multiple Sequence Alignments (MSA)
  2. Transfer Learning Classification: Applying machine learning models (Random Forest, XGBoost, MLP) trained on the Rfam database using the RNA-MSM embeddings
  3. Binary Classification: Determining whether an RNA sequence is coding or non-coding with detailed confidence metrics
┌─────────────────────────────────────────────────────────────────────────┐
│                      ncRNA Classification Pipeline                      │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Input: RNA Multiple Sequence Alignment (MSA)                            │
│ ┌───────────────────────────────────────────────────────────────┐       │
│ │ GGAAAUUCGGCA...                                               │       │
│ │ GGAAACUCGG-A...                                               │       │
│ │ GGAGAUUCGGCA...                                               │       │
│ └───────────────────────────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Step 1: Feature Extraction using RNA-MSM                                │
│ ┌───────────────────┐                                                   │
│ │                   │                                                   │
│ │    RNAMSM Model   │ ─► Pretrained model: RNA_MSM_pretrained.ckpt      │
│ │                   │                                                   │
│ └───────────────────┘                                                   │
│                                                                         │
│ Output: RNA embeddings [768-dimensional vector]                         │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Step 2: Transfer Learning-based Classification                          │
│                                                                         │
│ Models trained on Rfam database using RNA-MSM embeddings                │
│                                                                         │
│ ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐         │
│ │  Random Forest  │   │     XGBoost     │   │       MLP       │         │
│ └─────────────────┘   └─────────────────┘   └─────────────────┘         │
│         │                       │                    │                  │
│         ▼                       ▼                    ▼                  │
│ ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐         │
│ │  Probability:   │   │  Probability:   │   │  Probability:   │         │
│ │    0.0 - 1.0    │   │    0.0 - 1.0    │   │    0.0 - 1.0    │         │
│ └─────────────────┘   └─────────────────┘   └─────────────────┘         │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Step 3: Results Aggregation and Binary Classification                   │
│ ┌───────────────────────────────────────────────────────────────┐       │
│ │ Model          Probability  Classification    Confidence      │       │
│ │ -------------- ------------ ----------------- -------------   │       │
│ │ random_forest     0.873     Non-coding RNA       0.746        │       │
│ │ xgboost           0.895     Non-coding RNA       0.790        │       │
│ │ MLP               0.912     Non-coding RNA       0.824        │       │
│ │ ...                                                           │       │
│ └───────────────────────────────────────────────────────────────┘       │
│                                                                         │
│ Final Consensus: Non-coding RNA (Confidence: High, 0.79/1.0)            │
└─────────────────────────────────────────────────────────────────────────┘
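The figures in the Step 3 table are consistent with a per-model confidence of |2p − 1| and a consensus score equal to the mean of those confidences (for example, 2 × 0.873 − 1 = 0.746). The sketch below reproduces the example under that assumption. The 0.5 threshold, the majority vote, the "Coding RNA" label, and the function names are inferred from the example table for illustration only; they are not confirmed by the repository code.

```python
from collections import Counter
from statistics import mean

THRESHOLD = 0.5  # assumed decision boundary; not stated in this README


def classify(probability: float) -> tuple[str, float]:
    """Return (label, confidence) for one model's non-coding probability."""
    label = "Non-coding RNA" if probability >= THRESHOLD else "Coding RNA"
    # Confidence as distance from the threshold, rescaled to 0..1 (assumption).
    confidence = abs(2.0 * probability - 1.0)
    return label, confidence


def consensus(probabilities: dict[str, float]) -> tuple[str, float]:
    """Majority label and mean confidence across models (assumed rule)."""
    results = [classify(p) for p in probabilities.values()]
    majority_label = Counter(label for label, _ in results).most_common(1)[0][0]
    return majority_label, mean(conf for _, conf in results)


# Probabilities copied from the example table in the diagram.
example = {"random_forest": 0.873, "xgboost": 0.895, "MLP": 0.912}
label, score = consensus(example)
print(f"{label} (confidence {score:.2f})")  # Non-coding RNA (confidence 0.79)
```

With an even number of models, `most_common` resolves a tie in favour of the label seen first; a real pipeline would need an explicit tie-breaking rule.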

⚫︎ Installation

This project uses Poetry for dependency management and requires Python 3.12.1. Follow these steps to set up the development environment.

▪️ Prerequisites

  • Python 3.12.1 (pyenv is recommended for managing Python versions)
  • Poetry (dependency manager)

▪️ Setup Instructions

  1. Clone the repository:
git clone https://github.com/hmdlab/MSM-TLncRNA.git
cd MSM-TLncRNA
  2. Ensure you're using Python 3.12.1:
# If using pyenv
pyenv install 3.12.1
pyenv local 3.12.1
poetry env use 3.12.1

# Verify Python version
poetry env info  # Should output Python 3.12.1
  3. Install dependencies using Poetry:
poetry install --all-extras

This will create a virtual environment and install all required dependencies.

Key Dependencies

This project relies on several key libraries:

  • PyTorch & PyTorch Lightning: For deep learning models and training
  • Machine Learning Libraries: scikit-learn, XGBoost, LightGBM, CatBoost
  • Bioinformatics Tools: Biopython, tape-proteins, multimolecule

Development Tools

The project includes several tools for development:

  • Ruff: For linting and code formatting
  • MyPy: For static type checking
  • Pytest: For unit testing
  • Pre-commit: For git hooks to ensure code quality
    poetry run pre-commit install

▪️ Required Model Files

The pipeline requires pretrained model files to function properly. Please ensure the following files are in place:

  1. RNA-MSM Pretrained Model:

    • Follow the instructions in the RNA-MSM GitHub repository to download the pretrained RNA-MSM model file (RNA_MSM_pretrained.ckpt)
    • Place it in: artifacts/rnamsm/RNA_MSM_pretrained.ckpt
    artifacts
     └── rnamsm
         └── RNA_MSM_pretrained.ckpt
  2. Pretrained Classifier Models:

    • Download the trained classifier models here
    • Place them in the artifacts/ncrna_classifiers/ directory
    artifacts
     ├── ncrna_classifiers
     │   ├── models
     │   │   ├── catboost.joblib
     │   │   ├── lightgbm.joblib
     │   │   ├── logistic_regression.joblib
     │   │   ├── mlp.joblib
     │   │   ├── naive_bayes.joblib
     │   │   ├── random_forest.joblib
     │   │   ├── svm.joblib
     │   │   └── xgboost.joblib
     │   └── scaler.joblib

Reproducibility

To reproduce the training process and experiments described in this repository, please refer to the train branch available here.

⚫︎ Usage

The ncRNA detection pipeline predicts whether RNA sequences are non-coding by extracting features with RNA-MSM and applying the trained classifiers.

▪️ Basic Command-line Usage

The simplest way to use the pipeline is through the command-line interface:

poetry run python main.py --input examples/demo_data --output results.csv --verbose

▪️ Command-line Arguments

| Argument | Description | Default |
|---|---|---|
| --input | Path to the input RNA MSA data directory (containing .a2m_msa2 files) | N/A |
| --output | Path to save the results CSV file | None |
| --msm-model | Path to the pre-trained RNA-MSM model | artifacts/rnamsm/RNA_MSM_pretrained.ckpt |
| --classifier-path | Path to the directory containing classifier models | artifacts/ncrna_classifiers |
| --verbose | Enable detailed logging | False |

Note: For detailed usage information, refer to examples/README.md.
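
To make the argument table above concrete, here is an illustrative `argparse` parser that mirrors it. This is a sketch, not the parser in `main.py`: it assumes `--input` is required (the table lists no default) and that `--verbose` is a boolean flag. The `find_msa_files` helper is hypothetical and only shows how the `.a2m_msa2` files in the input directory might be collected.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(description="ncRNA detection pipeline (sketch)")
    parser.add_argument("--input", type=Path, required=True,
                        help="Directory containing .a2m_msa2 MSA files")
    parser.add_argument("--output", type=Path, default=None,
                        help="Path to save the results CSV file")
    parser.add_argument("--msm-model", type=Path,
                        default=Path("artifacts/rnamsm/RNA_MSM_pretrained.ckpt"),
                        help="Path to the pre-trained RNA-MSM model")
    parser.add_argument("--classifier-path", type=Path,
                        default=Path("artifacts/ncrna_classifiers"),
                        help="Directory containing classifier models")
    parser.add_argument("--verbose", action="store_true",
                        help="Enable detailed logging")
    return parser


def find_msa_files(input_dir: Path) -> list[Path]:
    """Hypothetical helper: list the .a2m_msa2 files in a directory, sorted."""
    return sorted(input_dir.glob("*.a2m_msa2"))


# Same arguments as the basic command-line example.
args = build_parser().parse_args(
    ["--input", "examples/demo_data", "--output", "results.csv", "--verbose"]
)
print(args.input, args.output, args.verbose)
```

Options not given on the command line fall back to the defaults in the table; for instance, `args.msm_model` here resolves to `artifacts/rnamsm/RNA_MSM_pretrained.ckpt`.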

⚫︎ Citation

▪️ LICENSE

This project is licensed under the MIT License - see the LICENSE file for details.

This work is heavily based on the RNA-MSM project, which is also licensed under the MIT License.

@article{10.1093/nar/gkad1031,
    author = {Zhang, Yikun and Lang, Mei and Jiang, Jiuhong and Gao, Zhiqiang and Xu, Fan and Litfin, Thomas and Chen, Ke and Singh, Jaswinder and Huang, Xiansong and Song, Guoli and Tian, Yonghong and Zhan, Jian and Chen, Jie and Zhou, Yaoqi},
    title = "{Multiple sequence alignment-based RNA language model and its application to structural inference}",
    journal = {Nucleic Acids Research},
    volume = {52},
    number = {1},
    pages = {e3-e3},
    year = {2023},
    month = {11},
    abstract = "{Compared with proteins, DNA and RNA are more difficult languages to interpret because four-letter coded DNA/RNA sequences have less information content than 20-letter coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because unlike proteins, RNA sequences are less conserved. Here, we have developed an unsupervised multiple sequence alignment-based RNA language model (RNA-MSM) by utilizing homologous sequences from an automatic pipeline, RNAcmap, as it can provide significantly more homologous sequences than manually annotated Rfam. We demonstrate that the resulting unsupervised, two-dimensional attention maps and one-dimensional embeddings from RNA-MSM contain structural information. In fact, they can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks compared with existing state-of-the-art techniques including SPOT-RNA2 and RNAsnap2. By comparison, RNA-FM, a BERT-based RNA language model, performs worse than one-hot encoding with its embedding in base pair and solvent-accessible surface area prediction. We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.}",
    issn = {0305-1048},
    doi = {10.1093/nar/gkad1031},
    url = {https://doi.org/10.1093/nar/gkad1031},
    eprint = {https://academic.oup.com/nar/article-pdf/52/1/e3/55443207/gkad1031.pdf},
}
