DeepDeg is a novel deep learning framework designed for the accurate prediction of mRNA degradation rates. It utilizes a sophisticated multi-branch architecture that integrates Vision Transformer (ViT), Convolutional Neural Network (CNN), and Multi-Layer Perceptron (MLP) components to capture complex biological patterns.
-
Clone the repository:
git clone https://github.com/turgaybulut/DeepDeg.git cd DeepDeg -
Create and activate the conda environment:
conda env create -f environment.yml conda activate deepdeg
-
Install the project in editable mode:
pip install -e .
Note: Please install and set up the PyBioMed package from their repository. It is only required for feature generation script.
The main configuration file is config.yaml. You can modify this file to change the model, training, and data parameters.
Some scripts are independent and can be used with your own datasets, while others are specific to the DeepDeg project.
The following scripts can be used with your own datasets by providing the necessary arguments.
scripts/feature_generator.py: Generate features from the data.- Usage:
python scripts/feature_generator.py <input_file> <output_file> [options]
- Arguments:
<input_file>: Path to input CSV file with sequence data.<output_file>: Path to output CSV file for features.
- Options:
--features <type> [<type> ...]: Feature types to extract (choices:nucleic_acid,peptide,modlamp,enzymatic,all). Default:all.--batch-size <int>: Batch size for processing. Default:100.--num-cores <int>: Number of CPU cores for parallel processing. Default: system's CPU count.
- Usage:
scripts/data_augmenter.py: Augment the data.- Usage:
python scripts/data_augmenter.py <command> <input_file> <output_file> [options]
- Commands:
sequences: Augment sequence data.features: Augment feature data with Gaussian noise.
- Arguments (common):
<input_file>: Input CSV file.<output_file>: Output CSV file for augmented data.
- Options (common):
--augmentation-factor <int>: How many times to augment the dataset. Default:2.--random-seed <int>: Random seed for reproducibility. Default:654.
- Options (for
sequencescommand):--utr-substitution-rate <float>: Substitution rate for UTRs. Default:0.015.--utr-insertion-rate <float>: Insertion rate for UTRs. Default:0.003.--utr-deletion-rate <float>: Deletion rate for UTRs. Default:0.002.--orf-substitution-rate <float>: Synonymous codon substitution rate for ORFs. Default:0.15.
- Options (for
featurescommand):--noise-level <float>: Noise level as a fraction of feature standard deviation. Default:0.05.--id-columns <col> [<col> ...]: Columns to exclude from noise application. Default:ENSID HALFLIFE.
- Usage:
The following scripts are used for training and evaluating the DeepDeg model. They use the configuration from config.yaml.
scripts/train.py: Train the model.scripts/evaluate.py: Evaluate the model.scripts/train_fold.py: Run 10-fold cross-validation.
To run any script, use the following format:
python scripts/<script_name>.pyThe notebooks/ directory contains Jupyter notebooks for generating embeddings using various pre-trained models:
mrnalm_embedding_generation.ipynb: Generate embeddings using the mRNA-LM model.protbert_embedding_generation.ipynb: Generate embeddings using the ProtBERT model.rnabert_embedding_generation.ipynb: Generate embeddings using the RNABERT model.
These notebooks can be run to pre-process data for the main DeepDeg pipeline or for independent research.
Our primary dataset can be downloaded from: https://drive.google.com/drive/folders/1YhHkKasumqMiGQv5D0Arexewy6LSxFAM