ExactCN is a deep learning-based method for estimating exact integer copy numbers at the exon level from whole-exome sequencing (WES) data using read-depth signals.
Beyond exome-wide prediction, the model is designed to be fine-tuned for specific targets. We provide ExactCN-SMN, a specialized version optimized for the SMN1/2 locus, to demonstrate how the framework can be adapted for other clinically relevant and challenging genes.
The repository with processed samples, ground truth data, and CN estimations for real and simulated datasets to reproduce the analyses in the paper can be found here: ExactCN results reproduction
Deep Learning, Copy Number Variation, Whole Exome Sequencing
Erfan FarhangKia†, Ahmet Arda Ceylan†, Mert Gençtürk, Mehmet Alper Yılmaz, Furkan Karademir, and A. Ercüment Cicek
† Equal contribution
[firstauthorname].[firstauthorsurname]@bilkent.edu.tr
[lastauthorsurname]@cs.bilkent.edu.tr
Warning: Please note that the ExactCN software is completely free for academic usage. However, it is licensed for commercial usage. Please refer to the License section first for more information.
- ExactCN is a Python 3 script and is easy to run once the required packages are installed.
For easy requirement handling, you can use the exactcn_environment.yml file to initialize a conda environment with the requirements installed:
$ conda env create --name exactcn_env -f exactcn_environment.yml
$ conda activate exactcn_env
Note that the provided environment yml file is for Linux systems. For macOS users, the package versions might need to be changed.
- ExactCN optionally provides GPU support. See the GPU Support section.
Pretrained model weights for both ExactCN and ExactCN-SMN, corresponding to the case study and experimental results reported in the paper, are provided in the ./models/ directory.
These models are shared to facilitate reproducibility and allow users to directly evaluate or build upon the reported results without retraining from scratch.
Important notice: Please call the call_exactcn.py script from the scripts directory.
- Relative or direct path to the trained model checkpoint.
- Batch size to be used to perform CN estimation on the samples.
- Relative or direct path for the processed HDF5 file(s) containing WES data.
- Relative or direct output directory path to write ExactCN output file.
- Relative or direct path to the normalization CSV file containing signal mean and standard deviation statistics per nucleotide channel. These statistics are utilized to normalize the input signals before inference.
- Number of parallel worker processes to spawn for processing samples. Use this to speed up inference across multiple samples (Default: 2).
- Relative or direct path to the gene_vocab.txt file. Providing the specific vocabulary file used during training ensures that gene IDs are mapped consistently. If omitted, the script attempts to build it from the input file.
- Set to the PCI bus ID of the GPU in your system.
- You can check the PCI bus IDs of the GPUs in your system in various ways, for example with the gpustat tool.
- Check the version of ExactCN.
- See the help page.
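The normalization CSV described above holds a mean and a standard deviation per signal channel. A minimal sketch of how such statistics are typically applied (a per-channel z-score) is given below. It assumes z-score normalization, and the column names (`channel`, `mean`, `std`), the function names, and the example numbers are all hypothetical; the actual file layout and code in ExactCN may differ.

```python
import csv
import io

def load_channel_stats(csv_text):
    """Read per-channel (mean, std) pairs from CSV text.

    The column names channel/mean/std are an assumption for illustration.
    """
    stats = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        stats[row["channel"]] = (float(row["mean"]), float(row["std"]))
    return stats

def normalize(signal, mean, std, eps=1e-8):
    """Standardize read-depth values as (x - mean) / (std + eps)."""
    return [(x - mean) / (std + eps) for x in signal]

# Illustrative statistics only; not taken from the ExactCN normalization file.
example_csv = "channel,mean,std\nA,10.0,2.0\nC,8.0,4.0\n"
stats = load_channel_stats(example_csv)
mean_a, std_a = stats["A"]
normalized = normalize([10.0, 12.0, 8.0], mean_a, std_a)
print([round(v, 3) for v in normalized])
```

Using the statistics computed on the training data, rather than recomputing them per sample, keeps the inputs on the scale the model saw during training.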
ExactCN is very easy to use! Here, we provide a small example BAM file and show how to run ExactCN on this toy dataset.
- This project uses the conda package manager to create a virtual environment and facilitate reproducibility.
- For Linux users:
- Please take a look at the Anaconda repo archive page and select the version that you would like to install.
- Replace Anaconda3-version.num-Linux-x86_64.sh with your choice:
$ wget -c https://repo.continuum.io/archive/Anaconda3-version.num-Linux-x86_64.sh
$ bash Anaconda3-version.num-Linux-x86_64.sh
- It is important to set up the conda environment, which includes the necessary dependencies.
- Please run the following lines to create and activate the environment:
$ conda env create --name exactcn_env -f exactcn_environment.yml
$ conda activate exactcn_env
- WES samples must be preprocessed to obtain read depth and other metadata before they are ready for CN estimation.
- Please run the following line:
$ source preprocess_samples_inference.sh
- Here, we demonstrate how to run ExactCN on GPU device 0 and obtain exon-level CN predictions.
- Please run the following script:
$ source call_exactcn.sh
- At the end of the CN estimation procedure, ExactCN writes its output file to the directory given with the -o option (./exactcn_results in this tutorial).
- The ExactCN output file is tab-delimited.
- Columns in the gene-level output file of ExactCN are, in order: 1. Chromosome, 2. Start Position, 3. End Position, 4. Gene Name, 5. CNV Calling Result, 6. CN Estimation.
- The following figure shows an example of an ExactCN exon-level output file.
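Given the six-column order listed above, the output can be loaded with a few lines of Python. This is a minimal sketch, assuming the file has no header row and that the CN estimate is written as an integer; the field names, the call labels, and the sample rows are made up for illustration.

```python
import csv
import io
from typing import NamedTuple

class CNRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    gene: str
    call: str
    cn: int

def parse_exactcn_output(handle):
    """Parse tab-delimited rows in the documented six-column order."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 6:
            continue  # skip blank or malformed lines
        chrom, start, end, gene, call, cn = row
        records.append(CNRecord(chrom, int(start), int(end), gene, call, int(cn)))
    return records

# Made-up rows for illustration; real values and call labels may differ.
example = "chr1\t100\t200\tGENE_A\tDEL\t1\nchr1\t300\t400\tGENE_A\tNO-CALL\t2\n"
records = parse_exactcn_output(io.StringIO(example))

# Keep only rows whose estimated copy number differs from the diploid value 2.
non_diploid = [r for r in records if r.cn != 2]
print(non_diploid)
```

For a real result file, pass a file handle opened with `open(path, newline="")` instead of the in-memory example.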
Important notice: Please call the finetune_exactcn.py script from the scripts directory.
- Batch size to be used to perform CN estimation on the WES samples.
- Relative or direct path for the input HDF5 dataset.
- Relative or direct output directory path to write ExactCN output weights.
- Relative or direct path to the lookup file containing mean and standard deviation statistics of signal values. These statistics are utilized to normalize the input data during training.
- The number of epochs for which fine-tuning will be performed.
- The learning rate to be used in fine-tuning.
- The path to the pretrained model weights to be loaded for fine-tuning.
- Set to the PCI bus ID of the GPU in your system.
- You can check the PCI bus IDs of the GPUs in your system in various ways, for example with the gpustat tool.
- Check the version of ExactCN.
- See the help page.
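The GPU option above takes a PCI bus ID. Selecting a device by PCI bus order is commonly done through the CUDA environment variables `CUDA_DEVICE_ORDER` and `CUDA_VISIBLE_DEVICES`. The sketch below shows that general mechanism only; it is an assumption that the ExactCN scripts work this way internally, and `select_gpu` is a hypothetical helper name.

```python
import os

def select_gpu(pci_bus_index):
    """Expose one GPU to CUDA, numbering devices in PCI bus order.

    This must run before the deep learning framework initializes CUDA.
    Pass None to hide all GPUs and fall back to CPU.
    """
    # Order devices by PCI bus ID so that indices match tools such as gpustat.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Restrict visibility to the chosen device; an empty value hides every GPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = "" if pci_bus_index is None else str(pci_bus_index)

select_gpu(0)
print(os.environ["CUDA_DEVICE_ORDER"], os.environ["CUDA_VISIBLE_DEVICES"])
```

Without `CUDA_DEVICE_ORDER=PCI_BUS_ID`, CUDA may number devices differently from monitoring tools, so the index shown by such a tool could refer to another GPU.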
You may want to fine-tune ExactCN with your WES dataset. We provide an example of how ExactCN can be fine-tuned using a small-sized BAM file along with its corresponding ground truth calls.
Step-0 and Step-1 are the same as the ExactCN call example.
- This project uses the conda package manager to create a virtual environment and facilitate reproducibility.
- For Linux users:
- Please take a look at the Anaconda repo archive page and select the version that you would like to install.
- Replace Anaconda3-version.num-Linux-x86_64.sh with your choice:
$ wget -c https://repo.continuum.io/archive/Anaconda3-version.num-Linux-x86_64.sh
$ bash Anaconda3-version.num-Linux-x86_64.sh
- It is important to set up the conda environment, which includes the necessary dependencies.
- Please run the following lines to create and activate the environment:
$ conda env create --name exactcn_env -f exactcn_environment.yml
$ conda activate exactcn_env
- WES samples must be preprocessed to obtain read depth and other metadata before they are ready for ExactCN fine-tuning.
- ExactCN fine-tuning requires .bam files and ground truth calls. Please see the image below for a sample ground truth format.
- Please run the following line:
$ source preprocess_samples_finetuning.sh
- Here, we demonstrate how to run ExactCN fine-tuning on GPU device 0.
- Please run the following script:
$ source exactcn_finetune.sh
You can change the arguments within the script to run it on CPU.
- At the end of ExactCN fine-tuning, the script saves its model weights file to the directory given with the -o option (./exactcn_finetuned_model_weights in this tutorial).
ExactCN-SMN is a specialized, fine-tuned version of ExactCN for aggregated CNV calling on Exon 7 of the SMN region, covering the SMN1 and SMN2 genes.
Due to the extreme sequence similarity between SMN1 and SMN2, standard WES-based callers often produce unreliable results. ExactCN-SMN specifically focuses on Exon 7, as this region contains the critical base difference (c.840C>T) that distinguishes SMN1 from SMN2 and disrupts a splicing enhancer. Furthermore, Exon 7 is the only region with reliable mappability in the 1000 Genomes Project WES data used for training.
Important notice: Please call the call_exactcn_smn.py script from the scripts directory.
- Relative or direct path to the trained model checkpoint.
- Batch size to be used to perform CN estimation on the samples.
- Relative or direct path for the processed HDF5 file(s) containing WES data.
- Relative or direct output directory path to write ExactCN-SMN output file.
- Relative or direct path to the normalization CSV file containing signal mean and standard deviation statistics per nucleotide channel. These statistics are utilized to normalize the input signals before inference.
- Number of parallel worker processes to spawn for processing samples. Use this to speed up inference across multiple samples (Default: 2).
- Relative or direct path to the gene_vocab.txt file. Highly recommended: providing the specific vocabulary file used during training ensures that gene IDs are mapped consistently. If omitted, the script attempts to build it from the input file.
- Set to the PCI bus ID of the GPU in your system.
- You can check the PCI bus IDs of the GPUs in your system in various ways, for example with the gpustat tool.
- Check the version of ExactCN.
- See the help page.
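To see why reusing the training vocabulary is recommended, consider a vocabulary that maps each gene name to an integer ID. The sketch below is illustrative only: it assumes IDs are assigned by order of first appearance, which the source does not confirm, and the gene lists and the `build_vocab` helper are hypothetical.

```python
def build_vocab(gene_names):
    """Assign integer IDs to gene names in order of first appearance."""
    vocab = {}
    for name in gene_names:
        if name not in vocab:
            vocab[name] = len(vocab)
    return vocab

# Vocabulary as it might have been fixed at training time (hypothetical order).
training_vocab = build_vocab(["BRCA1", "SMN1", "SMN2", "TP53"])

# Vocabulary rebuilt from an input file that only contains the SMN genes.
rebuilt_vocab = build_vocab(["SMN2", "SMN1"])

# Compare the ID each gene receives under the two vocabularies.
for gene in ("SMN1", "SMN2"):
    print(gene, training_vocab[gene], rebuilt_vocab[gene])
```

Under these assumptions, SMN2 receives a different ID in the rebuilt vocabulary than it had at training time, so a model that relies on gene IDs would see a mismatched input. Passing the original gene_vocab.txt avoids this.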
We provide an example small-sized BAM file and show how to run ExactCN-SMN on this toy dataset.
- This project uses the conda package manager to create a virtual environment and facilitate reproducibility.
- For Linux users:
- Please take a look at the Anaconda repo archive page and select the version that you would like to install.
- Replace Anaconda3-version.num-Linux-x86_64.sh with your choice:
$ wget -c https://repo.continuum.io/archive/Anaconda3-version.num-Linux-x86_64.sh
$ bash Anaconda3-version.num-Linux-x86_64.sh
- It is important to set up the conda environment, which includes the necessary dependencies.
- Please run the following lines to create and activate the environment:
$ conda env create --name exactcn_env -f exactcn_environment.yml
$ conda activate exactcn_env
- WES samples must be preprocessed to obtain read depth and other metadata before they are ready for CN estimation.
- Please run the following line:
$ source preprocess_samples_smn.sh
- Here, we demonstrate how to run ExactCN-SMN on GPU device 0 and obtain exon-level CN predictions for the SMN region.
- Please run the following script:
$ source call_exactcn_smn.sh
- At the end of the CN estimation procedure, ExactCN-SMN writes its output file to the directory given with the -o option (./exactcn_smn_results in this tutorial).
- Columns in the gene-level output file of ExactCN-SMN are, in order: 1. Chromosome, 2. Start Position, 3. End Position, 4. Gene Name, 5. CNV Calling Result, 6. CN Estimation.
- The following figure shows an example of an ExactCN-SMN exon-level output file:
- Copyright 2024 © ExactCN.
- For commercial usage, please contact.



