neural networks for cytosine methylation calling in nanopore sequencing

Detection of 5mC Modification in Nanopore Sequencing Data Using Deep Learning

A suite of Jupyter notebooks and Python scripts for preprocessing Oxford Nanopore sequencing signals, training deep learning models for 5-methylcytosine (5mC) detection, and analysing their performance. The main purpose was to compare the ConvNeXt architecture (https://github.com/facebookresearch/ConvNeXt, https://doi.org/10.1109/CVPR52688.2022.01167) with the Transformer architecture (https://doi.org/10.48550/arXiv.1706.03762). This codebase accompanies my master's thesis.

Table of Contents

  • Thesis Reference
  • Repository Layout
  • Getting Started
  • Notebook Descriptions
    1. final_E_coli.ipynb
    2. final_HX1.ipynb
    3. calibration_Ecoli.ipynb
    4. calibration_HX1.ipynb
    5. annotation_analyses.ipynb
  • References
  • License

Thesis Reference

Yari Van Laere (2022). Detection of 5mC Modification in Nanopore Sequencing Data Using Deep Learning. Master's thesis, Ghent University. Download PDF

Repository Layout

Master_thesis/
├── data_preprocessing/
│   ├── NSDataset6.py               # Preprocessing ONT data
│   └── WGBSDataset.py              # Preprocessing WGBS data
├── models/
│   ├── Convolutional_NN/
│   │   ├── ConvNextBlock1D.py            # ConvNeXt block for the CNN
│   │   ├── ConvNeXt_model.py             # Old CNN architecture
│   │   ├── ConvNeXt_model_110.py         # Final CNN architecture
│   │   └── TCN_model_110.py              # Old CNN architecture
│   └── Transformer_NN/
│       ├── MultiHeadAttention.py         # Multi-head attention
│       ├── Performer7.py                 # Old transformer architecture
│       ├── TransformerBlock1.py          # Transformer block
│       ├── TransformerPreparation.py     # Mask and positional encoding
│       ├── Transformer_model62_.py       # Old transformer architecture
│       └── Transformer_model_window16.py # Final transformer architecture
├── notebooks/
│   ├── annotation_analyses.ipynb   # Clustering & annotation analysis
│   ├── calibration_Ecoli.ipynb     # Calibration of E. coli model outputs
│   ├── calibration_HX1.ipynb       # Calibration of HX1 model outputs
│   ├── final_E_coli.ipynb          # E. coli model training and inference
│   └── final_HX1.ipynb             # Human (HX1) model training and inference
├── .gitignore                      # Git ignore rules
├── LICENSE                         # MIT license file
├── README.md                       # This file
└── thesis.pptx                     # Presentation of thesis highlights            

Getting Started

  1. Clone repository
    git clone https://github.com/yvlaere/Master_thesis.git
    cd Master_thesis
    
  2. Create and activate Python environment
    conda create -n methylation-cnn python=3.8 -y
    conda activate methylation-cnn
    
  3. Install required packages
    # math, os and random are part of the Python standard library and need no installation
    pip install numpy pandas scikit-learn matplotlib torch torchvision jupyter h5py
    

Notebook Descriptions

The data_preprocessing and models folders contain the necessary data preprocessing and model architecture implementations used in the notebooks. Each notebook drives a stage in the workflow, from raw signal extraction to final performance analysis.

1. final_E_coli.ipynb

  • Build the ground truth from two control datasets: one in which every cytosine is methylated and one in which every cytosine is unmethylated
  • Prepare the Oxford Nanopore Technologies (ONT) sequencing data as model input
  • Train the model (ConvNeXt or Transformer)
  • Evaluate the model (ROC curve, PR curve)
  • Get predictions
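
The evaluation step can be sketched as follows. This is a minimal example, assuming the model outputs one methylation score per site and that scikit-learn is used for the curves; the helper name `evaluate_scores` and the synthetic data are illustrative and do not come from the notebooks.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)


def evaluate_scores(y_true, y_score):
    """Compute ROC and PR curve points plus their summary areas.

    Hypothetical helper for illustration; not part of the repository.
    """
    fpr, tpr, _ = roc_curve(y_true, y_score)
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return {
        "fpr": fpr,
        "tpr": tpr,
        "precision": precision,
        "recall": recall,
        "roc_auc": roc_auc_score(y_true, y_score),
        "average_precision": average_precision_score(y_true, y_score),
    }


# Synthetic stand-in for model output: 1 = methylated, 0 = unmethylated.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
# Methylated sites score around 0.75, unmethylated sites around 0.25.
y_score = np.clip(0.25 + 0.5 * y_true + rng.normal(0.0, 0.2, size=1000), 0.0, 1.0)

metrics = evaluate_scores(y_true, y_score)
print(f"ROC AUC: {metrics['roc_auc']:.3f}")
print(f"Average precision: {metrics['average_precision']:.3f}")
```

The returned arrays can be passed straight to matplotlib, for example `plt.plot(metrics["fpr"], metrics["tpr"])` for the ROC curve and `plt.plot(metrics["recall"], metrics["precision"])` for the PR curve.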

2. final_HX1.ipynb

  • Prepare whole-genome bisulphite sequencing (WGBS) data as the ground truth
  • Prepare the Oxford Nanopore Technologies (ONT) sequencing data as model input
  • Train the model (ConvNeXt or Transformer)
  • Evaluate the model (ROC curve, PR curve)
  • Get predictions
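
One common way to turn WGBS counts into binary ground-truth labels is sketched below. It is an assumption-laden illustration, not the logic of `WGBSDataset.py`: the column names (`chrom`, `pos`, `coverage`, `methylated_reads`), the coverage cut-off, and the fraction thresholds are all hypothetical.

```python
import pandas as pd


def label_wgbs_sites(wgbs, min_coverage=10, low=0.0, high=1.0):
    """Keep confidently (un)methylated sites and attach a binary label.

    Hypothetical helper: the thresholds are placeholders, not the
    values used in the thesis.
    """
    # Discard sites with too few bisulphite reads to trust the fraction.
    covered = wgbs[wgbs["coverage"] >= min_coverage].copy()
    covered["fraction"] = covered["methylated_reads"] / covered["coverage"]
    # Keep only sites at or beyond the two confidence thresholds.
    confident = covered[
        (covered["fraction"] <= low) | (covered["fraction"] >= high)
    ].copy()
    confident["label"] = (confident["fraction"] >= high).astype(int)
    return confident[["chrom", "pos", "label"]].reset_index(drop=True)


# Toy input for illustration only.
wgbs = pd.DataFrame({
    "chrom": ["chr1"] * 5,
    "pos": [100, 200, 300, 400, 500],
    "coverage": [20, 5, 30, 15, 12],
    "methylated_reads": [20, 5, 0, 7, 12],
})
labelled = label_wgbs_sites(wgbs)
print(labelled)
```

In this toy input, position 200 is dropped for low coverage and position 400 for its intermediate methylation fraction, so only unambiguous sites remain as training targets.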

3. calibration_Ecoli.ipynb

  • Calibrate the predictions of the models using:
    • Isotonic regression
    • Platt scaling
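
Both calibration methods can be sketched with scikit-learn. This is a minimal example under stated assumptions: the function names, the synthetic scores, and the split into a fitting half and a test half are illustrative and are not taken from the notebooks. Platt scaling is approximated here by a one-feature logistic regression on the raw score.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression


def fit_isotonic(scores, labels):
    """Fit a monotone non-decreasing map from raw score to probability."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(scores, labels)
    return iso.predict


def fit_platt(scores, labels):
    """Fit a sigmoid sigma(a * score + b) via logistic regression."""
    # A large C makes the default L2 penalty negligible.
    lr = LogisticRegression(C=1e6)
    lr.fit(np.asarray(scores).reshape(-1, 1), labels)
    return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]


def brier(pred, labels):
    """Mean squared error between predicted probability and label."""
    return float(np.mean((pred - labels) ** 2))


# Synthetic data: labels follow true_prob, the raw score is distorted.
rng = np.random.default_rng(0)
true_prob = rng.uniform(0.0, 1.0, size=4000)
labels = (rng.uniform(size=4000) < true_prob).astype(int)
raw = true_prob ** 2  # deliberately miscalibrated score

# Fit the calibrators on one half, evaluate on the other.
fit, test = slice(0, 2000), slice(2000, 4000)
iso = fit_isotonic(raw[fit], labels[fit])
platt = fit_platt(raw[fit], labels[fit])

brier_raw = brier(raw[test], labels[test])
brier_iso = brier(iso(raw[test]), labels[test])
brier_platt = brier(platt(raw[test]), labels[test])
print(f"Brier raw={brier_raw:.3f} isotonic={brier_iso:.3f} Platt={brier_platt:.3f}")
```

Isotonic regression is non-parametric and can correct any monotone distortion given enough data, while Platt scaling fits only two parameters and is therefore less prone to overfitting on small calibration sets.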

4. calibration_HX1.ipynb

  • Calibrate the predictions of the models using:
    • Isotonic regression
    • Platt scaling

5. annotation_analyses.ipynb

  • Compare results of the models for the human data in the context of genome annotation
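
A comparison in the context of genome annotation could, for example, compute one metric per annotation category and model. The sketch below assumes a table with one row per site; the column names, the annotation categories, and the helper `auc_by_annotation` are hypothetical, and the numbers are random placeholders rather than thesis results.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def auc_by_annotation(df, score_cols, label_col="label", annot_col="annotation"):
    """Return one ROC AUC per annotation category and model score column."""
    rows = {}
    for annot, group in df.groupby(annot_col):
        if group[label_col].nunique() < 2:
            continue  # ROC AUC is undefined when only one class is present
        rows[annot] = {
            col: roc_auc_score(group[label_col], group[col]) for col in score_cols
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Random numbers for illustration only; these are not thesis results.
rng = np.random.default_rng(0)
n = 900
sites = pd.DataFrame({
    "annotation": np.repeat(["promoter", "gene_body", "intergenic"], n // 3),
    "label": rng.integers(0, 2, size=n),
})
sites["convnext"] = 0.5 * sites["label"] + rng.normal(0.25, 0.2, size=n)
sites["transformer"] = 0.5 * sites["label"] + rng.normal(0.25, 0.2, size=n)

table = auc_by_annotation(sites, ["convnext", "transformer"])
print(table.round(3))
```

Each row of the resulting table corresponds to one annotation category and each column to one model, which makes region-specific differences between the architectures easy to spot.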

References

  • ConvNeXt: Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 11966-11976. https://doi.org/10.1109/CVPR52688.2022.01167
  • Transformer: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS 2017), 30. https://doi.org/10.48550/arXiv.1706.03762

License

This project is released under the MIT License. See LICENSE for details.
