ExactCN : Predicting Exact Copy Numbers on Whole Exome Sequencing Data

ExactCN is a deep learning-based method for estimating exact integer copy numbers at the exon level from whole-exome sequencing (WES) data using read-depth signals.

Beyond exome-wide prediction, the model is designed to be fine-tuned for specific targets. We provide ExactCN-SMN, a specialized version optimized for the SMN1/2 locus, to demonstrate how the framework can be adapted for other clinically relevant and challenging genes.

The repository with processed samples, ground truth data, and CN estimations for real and simulated datasets to reproduce the analyses in the paper can be found here: ExactCN results reproduction

Deep Learning, Copy Number Variation, Whole Exome Sequencing

Authors

Erfan FarhangKia†, Ahmet Arda Ceylan†, Mert Gençtürk, Mehmet Alper Yılmaz, Furkan Karademir, and A. Ercüment Cicek

† Equal contribution

Questions & comments

[firstauthorname].[firstauthorsurname]@bilkent.edu.tr

[lastauthorsurname]@cs.bilkent.edu.tr

Installation

ExactCN is a python3 script and it is easy to run after the required packages are installed.

Requirements

For easy requirement handling, you can use exactcn_environment.yml files to initialize conda environment with requirements installed:

  

$  conda  env  create  --name  exactcn_env  -f  exactcn_environment.yml

  

$  conda  activate  exactcn_env

Note that the provided environment yml file is for Linux systems. For MacOS users, the corresponding versions of the packages might need to be changed.

Features

ExactCN provides GPU support optionally. See GPU Support section.

Pretrained Models

Pretrained model weights for both ExactCN and ExactCN-SMN, corresponding to the case study and experimental results reported in the paper, are provided in the ./models/ directory.

These models are shared to facilitate reproducibility and allow users to directly evaluate or build upon the reported results without retraining from scratch.

Instructions Manual for ExactCN

Important notice: Please call the call_exactcn.py script from the scripts directory.

Required Arguments

-m, --model

Relative or direct path to the trained model checkpoint.

-bs, --batch_size

Batch size to be used to perform CN estimation on the samples.

-i, --input

Relative or direct path for the processed HDF5 file(s) containing WES data.

-o, --output

Relative or direct output directory path to write ExactCN output file.

-s, --stats_lookup

Relative or direct path to the normalization CSV file containing signal mean and standard deviation statistics per nucleotide channel. These statistics are utilized to normalize the input signals before inference.

Optional Arguments

-j, --jobs

-Number of parallel worker processes to spawn for processing samples. Use this to speed up inference across multiple samples (Default: 2).

-v, --vocab

Relative or direct path to the gene_vocab.txt file. Providing the specific vocabulary file used during training ensures that gene IDs are mapped consistently. If omitted, the script attempts to build it from the input file.

-g, --gpu

Set to PCI BUS ID of the gpu in your system.
You can check, PCI BUS IDs of the gpus in your system with various ways. Using gpustat tool check IDs of the gpus in your system like below:

-v, --version

-Check the version of ExactCN.

-h, --help

-See help page.

Usage Example

ExactCN is very easy to use! Here, We provide an example small-sized BAM file and show how to run ExactCN on this toy dataset.

Step-0: Install conda package management

This project uses conda package management software to create virtual environment and facilitate reproducability.
For Linux users:
Please take a look at the Anaconda repo archive page, and select an appropriate version that you'd like to install.
Replace this Anaconda3-version.num-Linux-x86_64.sh with your choice

  

$  wget  -c  https://repo.continuum.io/archive/Anaconda3-vers.num-Linux-x86_64.sh

  

$  bash  Anaconda3-version.num-Linux-x86_64.sh

Step-1: Set Up your environment.

It is important to set up the conda environment which includes the necessary dependencies.
Please run the following lines to create and activate the environment:

  

$  conda  env  create  --name  exactcn_env  -f  exactcn_environment.yml

  

$  conda  activate  exactcn_env

Step-2: Run the preprocessing script.

It is necessary to perform preprocessing on WES data samples to obtain read depth and other meta data and make them ready for CN estimation.
Please run the following line:

  

$  source  preprocess_samples_inference.sh

Step-3: Run ExactCN on data obtained in Step-2

Here, we demonstrate an example to run ExactCN on gpu device 0, and obtain exone-level CN prediction.
Please run the following script:

  

$  source  call_exactcn.sh

Output file of ExactCN

At the end of the CN estimation procedure, ExactCN will write its output file to the directory given with -o option. In this tutorial it is ./exactcn_results
Output file of ExactCN is a tab-delimited.
Columns in the gene-level output file of ExactCN are the following with order: 1. Chromosome, 2. Start Position, 3. End Position, 4. Gene Name 5. CNV Calling Result 6. CN Estimation
Following figure is an example of ExactCN exone-level output file.

Instructions Manual for Finetuning ExactCN

Important notice: Please call the finetune_exactcn.py script from the scripts directory.

Required Arguments

-bs, --batch_size

Batch size to be used to perform CN estimation on the WES samples.

-i, --input

Relative or direct path for the input HDF5 dataset.

-o, --output

Relative or direct output directory path to write ExactCN output weights.

-n, --norm_file

Relative or direct path to the lookup file containing mean and standard deviation statistics of signal values. These statistics are utilized to normalize the input data during training.

-e, --epochs

The number of epochs the finetuning will be performed.

-lr, --learning_rate

The learning rate to be used in finetuning

--pretrained_dir

The path for the pretrained model weights to be loaded for finetuning

Optional Arguments

-g, --gpu

Set to PCI BUS ID of the gpu in your system.
You can check, PCI BUS IDs of the gpus in your system with various ways. Using gpustat tool check IDs of the gpus in your system like below:

-v, --version

-Check the version of ExactCN.

-h, --help

-See help page.

Finetune Example

You may want to fine-tune ExactCN with your WES dataset. We provide an example of how ExactCN can be fine-tuned using a small-sized BAM file along with its corresponding ground truth calls.

Step-0 and Step-1 are the same as the ExactCN call example.

Step-0: Install conda package management

This project uses conda package management software to create virtual environment and facilitate reproducability.
For Linux users:
Please take a look at the Anaconda repo archive page, and select an appropriate version that you'd like to install.
Replace this Anaconda3-version.num-Linux-x86_64.sh with your choice

  

$  wget  -c  https://repo.continuum.io/archive/Anaconda3-vers.num-Linux-x86_64.sh

  

$  bash  Anaconda3-version.num-Linux-x86_64.sh

Step-1: Set Up your environment.

It is important to set up the conda environment which includes the necessary dependencies.
Please run the following lines to create and activate the environment:

  

$  conda  env  create  --name  exactcn_env  -f  exactcn_environment.yml

  

$  conda  activate  exactcn_env

Step-2: Run the preprocessing script for preparing the samples for finetuning.

It is necessary to perform preprocessing on WES data samples to obtain read depth and other meta data and make them ready for ExactCN finetuning.
ExactCN Finetuning requires .bam and ground truth calls. Please see the below image for a sample ground truths format.

Please run the following line:

  

$  source  preprocess_samples_finetuning.sh

Step-3: Start ExactCN Finetuning on data obtained in Step-2

Here, we demonstrate an example to run ExactCN Finetuning on gpu device 0.
Please run the following script:

  

$  source  exactcn_finetune.sh

You can change the argument parameters within the script to run it on cpu.

Output file of ExactCN

At the end of ExactCN Finetuning, the script will save its model weights file to the directory given with -o option. In this tutorial it is ./exactcn_finetuned_model_weights

Instructions Manual for ExactCN-SMN

ExactCN-SMN is a specialised fine tuned model for aggregated SMN region CNV calling on Exon 7. We provide a specialized fine-tuned version of ExactCN, called ExactCNSMN, designed to call CNVs for the SMN1 and SMN2 genes.

Due to the extreme sequence similarity between SMN1 and SMN2, standard WES-based callers often produce unreliable results. ExactCNSMN specifically focuses on Exon 7, as this region contains the critical base difference (c.840C>T) that distinguishes SMN1 from SMN2 and disrupts a splicing enhancer. Furthermore, Exon 7 is the only region with reliable mappability in the 1000 Genomes Project WES data used for training.

Important notice: Please call the call_exactcn_smn.py script from the scripts directory.

Required Arguments

-m, --model

Relative or direct path to the trained model checkpoint.

-bs, --batch_size

Batch size to be used to perform CN estimation on the samples.

-i, --input

Relative or direct path for the processed HDF5 file(s) containing WES data.

-o, --output

Relative or direct output directory path to write ExactCN-SMN output file.

-s, --stats_lookup

Relative or direct path to the normalization CSV file containing signal mean and standard deviation statistics per nucleotide channel. These statistics are utilized to normalize the input signals before inference.

Optional Arguments

-j, --jobs

-Number of parallel worker processes to spawn for processing samples. Use this to speed up inference across multiple samples (Default: 2).

-v, --vocab

Relative or direct path to the gene_vocab.txt file. Highly Recommended: Providing the specific vocabulary file used during training ensures that gene IDs are mapped consistently. If omitted, the script attempts to build it from the input file.

-g, --gpu

Set to PCI BUS ID of the gpu in your system.
You can check, PCI BUS IDs of the gpus in your system with various ways. Using gpustat tool check IDs of the gpus in your system like below:

-v, --version

-Check the version of ExactCN.

-h, --help

-See help page.

Usage Example

We provide an example small-sized BAM file and show how to run ExactCN-SMN on this toy dataset.

Step-0: Install conda package management

This project uses conda package management software to create virtual environment and facilitate reproducability.
For Linux users:
Please take a look at the Anaconda repo archive page, and select an appropriate version that you'd like to install.
Replace this Anaconda3-version.num-Linux-x86_64.sh with your choice

  

$  wget  -c  https://repo.continuum.io/archive/Anaconda3-vers.num-Linux-x86_64.sh

  

$  bash  Anaconda3-version.num-Linux-x86_64.sh

Step-1: Set Up your environment.

It is important to set up the conda environment which includes the necessary dependencies.
Please run the following lines to create and activate the environment:

  

$  conda  env  create  --name  exactcn_env  -f  exactcn_environment.yml

  

$  conda  activate  exactcn_env

Step-2: Run the preprocessing script.

It is necessary to perform preprocessing on WES data samples to obtain read depth and other meta data and make them ready for CN estimation.
Please run the following line:

  

$  source  preprocess_samples_smn.sh

Step-3: Run ExactCN-SMN on data obtained in Step-2

Here, we demonstrate an example to run ExactCN-SMN on gpu device 0, and obtain exone-level CN prediction on SMN region.
Please run the following script:

  

$  source  call_exactcn_smn.sh

Output file of ExactCN-SMN

At the end of the CN estimation procedure, ExactCN-SMN will write its output file to the directory given with -o option. In this tutorial it is ./exactcn_smn_results
Columns in the gene-level output file of ExactCN-SMN are the following with order: 1. Chromosome, 2. Start Position, 3. End Position, 4. Gene Name 5. CNV Calling Result 6. CN Estimation
Following figure is an example of ExactCN-SMN exone-level output file:

Citations

License

CC BY-NC-SA 2.0
For commercial usage, please contact.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
example_data		example_data
finetune_example_data		finetune_example_data
images		images
models		models
scripts		scripts
LICENSE		LICENSE
README.md		README.md
call_exactcn.sh		call_exactcn.sh
call_exactcn_smn.sh		call_exactcn_smn.sh
channel_stats.csv		channel_stats.csv
exactcn_environment.yml		exactcn_environment.yml
finetune_exactcn.sh		finetune_exactcn.sh
gene_vocab.txt		gene_vocab.txt
genesInfo.txt		genesInfo.txt
hglft_genome_64dc_dcbaa0_unique.bed		hglft_genome_64dc_dcbaa0_unique.bed
preprocess_samples_finetuning.sh		preprocess_samples_finetuning.sh
preprocess_samples_inference.sh		preprocess_samples_inference.sh
preprocess_samples_smn.sh		preprocess_samples_smn.sh

Folders and files

Latest commit

History

Repository files navigation

ExactCN : Predicting Exact Copy Numbers on Whole Exome Sequencing Data

Authors

Questions & comments

Table of Contents

Installation

Requirements

Features

Pretrained Models

Instructions Manual for ExactCN

Required Arguments

-m, --model

-bs, --batch_size

-i, --input

-o, --output

-s, --stats_lookup

Optional Arguments

-j, --jobs

-v, --vocab

-g, --gpu

-v, --version

-h, --help

Usage Example

Step-0: Install conda package management

Step-1: Set Up your environment.

Step-2: Run the preprocessing script.

Step-3: Run ExactCN on data obtained in Step-2

Output file of ExactCN

Instructions Manual for Finetuning ExactCN

Required Arguments

-bs, --batch_size

-i, --input

-o, --output

-n, --norm_file

-e, --epochs

-lr, --learning_rate

--pretrained_dir

Optional Arguments

-g, --gpu

-v, --version

-h, --help

Finetune Example

Step-0: Install conda package management

Step-1: Set Up your environment.

Step-2: Run the preprocessing script for preparing the samples for finetuning.

Step-3: Start ExactCN Finetuning on data obtained in Step-2

Output file of ExactCN

Instructions Manual for ExactCN-SMN

Required Arguments

-m, --model

-bs, --batch_size

-i, --input

-o, --output

-s, --stats_lookup

Optional Arguments

-j, --jobs

-v, --vocab

-g, --gpu

-v, --version

-h, --help

Usage Example

Step-0: Install conda package management

Step-1: Set Up your environment.

Step-2: Run the preprocessing script.

Step-3: Run ExactCN-SMN on data obtained in Step-2

Output file of ExactCN-SMN

Citations

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages