CHALLENGER : Detecting Copy Number Variants in Challenging Regions Using Whole Genome Sequencing Data
CHALLENGER is a RoBERTa-based deep learning tool designed for predicting copy number variations (CNVs) in challenging genomic regions using short-read whole-genome sequencing (WGS) data. See the preprint for more information.
The repository with processed samples, ground truth data, and CNV predictions for all samples needed to reproduce the analyses in the paper can be found here: CHALLENGER results reproduction
Deep Learning, Copy Number Variation, Whole Genome Sequencing
Mehmet Alper Yilmaz, Ahmet Arda Ceylan, A. Ercument Cicek
[firstauthorname].[firstauthorsurname]@bilkent.edu.tr
[lastauthorsurname]@cs.bilkent.edu.tr
Warning: Please note that the CHALLENGER software is completely free for academic usage. However, it is licensed for commercial usage. Please first refer to the License section for more information.
- Installation
- Requirements
- Features
- Instructions Manual for CHALLENGER
- Required Arguments
- Optional Arguments
- Usage Example
- Step-0: Install conda package management
- Step-1: Set Up your environment
- Step-2: Run the preprocessing script
- Step-3: Run CHALLENGER
- Output file of CHALLENGER
- Instructions Manual for Fine-Tuning CHALLENGER
- Required Arguments (Fine-Tuning)
- Optional Arguments (Fine-Tuning)
- Fine-Tune Example
- Step-2: Preprocessing for Fine-Tuning
- Step-3: Start Fine-Tuning
- Model Weight Output Directory
- License
- CHALLENGER is a Python 3 script and is easy to run once the required packages are installed.
- Please clone the repository using Git LFS:
git lfs clone https://github.com/ciceklab/CHALLENGER.git
- The latest fine-tuned CHALLENGER models can be downloaded from here.
For easy requirement handling, you can use the CHALLENGER_environment.yml file to initialize a conda environment with the requirements installed:
$ conda env create --name challenger_env -f CHALLENGER_environment.yml
$ conda activate challenger_env
Note that the provided environment YAML file is for Linux systems. For macOS users, the corresponding package versions may need to be changed.
- CHALLENGER provides GPU support optionally. See GPU Support section.
Important notice: Please call the CHALLENGER_call.py script from the scripts directory.
- Mini-batch size used during evaluation or inference.
- Path to the input Parquet file containing read-depth data.
- Path to the mean/std normalization file (.txt).
- Path to the baseline gene-coverage dictionary (.pt) used for gene-specific normalization of read-depth.
- Path to the tokenizer configuration file (.json).
- Path to the fine-tuned model weight directory or checkpoint folder. You can download the fine-tuned CHALLENGER models from here. It contains the following models:
- models/CHALLENGER-LR
- models/CHALLENGER-EXP
- models/CHALLENGER-GENE/<gene_name>
- Directory where CNV calls, logs, and intermediate outputs will be saved.
- Name of the run
- Set to the PCI BUS ID of the GPU in your system.
- There are various ways to check the PCI BUS IDs of the GPUs in your system. For example, you can use the gpustat tool as shown below:
- See help page.
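To give an idea of what the mean/std normalization file is used for, here is a minimal sketch of z-score normalizing read-depth values. This is an illustration only, not CHALLENGER's actual code; the function name and the assumption that the file supplies a single mean/std pair are hypothetical.

```python
# Illustrative sketch (not CHALLENGER's actual code): z-score normalization
# of read-depth values using mean/std statistics such as those stored in the
# normalization file (.txt).

def normalize_depths(depths, mean, std):
    """Z-score normalize a list of read-depth values."""
    if std == 0:
        raise ValueError("std must be non-zero")
    return [(d - mean) / std for d in depths]

# Example: window read depths with an assumed dataset mean of 30.0 and std of 5.0
normalized = normalize_depths([25.0, 30.0, 40.0], mean=30.0, std=5.0)
print(normalized)  # [-1.0, 0.0, 2.0]
```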
CHALLENGER is very easy to use! Here, we provide a small example BAM file and show how to run CHALLENGER on this toy dataset.
- This project uses the conda package manager to create a virtual environment and facilitate reproducibility.
- For Linux users:
- Please take a look at the Anaconda repo archive page and select the version you'd like to install.
- Replace Anaconda3-version.num-Linux-x86_64.sh below with the version of your choice:
$ wget -c https://repo.continuum.io/archive/Anaconda3-version.num-Linux-x86_64.sh
$ bash Anaconda3-version.num-Linux-x86_64.sh
- It is important to set up the conda environment, which includes the necessary dependencies.
- Please run the following lines to create and activate the environment:
$ conda env create --name challenger_env -f CHALLENGER_environment.yml
$ conda activate challenger_env
- Preprocessing is required to convert raw WGS data into standardized read-depth and metadata representations suitable for CNV calling. The pipeline generates per-sample read-depth files (_W50.txt) and then processes and combines these files into a single consolidated Parquet file, which serves as the model's input.
- Please run the following line:
$ source preprocess_samples.sh
- Here, we demonstrate an example of running CHALLENGER on GPU device 0 to obtain CNV calls.
- Please run the following script:
$ source challenger_call.sh
- At the end of the CNV calling procedure, CHALLENGER writes its output file to the directory given with the -o option. In this tutorial, it is ./outputs.
- The output file of CHALLENGER is tab-delimited.
- The columns in the gene-level output file of CHALLENGER are, in order: 1. Sample Name, 2. Chromosome, 3. Gene Name, 4. CHALLENGER Prediction
- The following figure shows an example of the CHALLENGER gene-level output file.
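Since the gene-level output is tab-delimited with the four columns listed above, it can be parsed with standard tools. A minimal sketch follows; the sample rows and prediction labels are hypothetical, not actual CHALLENGER output.

```python
import csv
import io

# Hypothetical example rows; the column order follows the README:
# Sample Name, Chromosome, Gene Name, CHALLENGER Prediction.
example_output = (
    "sample1\tchr1\tGENE_A\tDUP\n"
    "sample1\tchr2\tGENE_B\tNO-CALL\n"
)

columns = ["sample", "chromosome", "gene", "prediction"]
calls = [
    dict(zip(columns, row))
    for row in csv.reader(io.StringIO(example_output), delimiter="\t")
]
print(calls[0]["gene"])  # GENE_A
```

In practice you would replace `example_output` with the contents of the file written under the `-o` directory.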
Important notice: Please call the CHALLENGER_FT.py script from the scripts directory.
- Mini-batch size used during Fine-Tuning.
- Path to the input Parquet file containing read-depth data.
- Path to the mean/std normalization file (.txt).
- Path to the baseline gene-coverage dictionary (.pt) used for gene-specific normalization of read-depth.
- Path to the tokenizer configuration file (.json).
- Path to the initial model weight directory or checkpoint folder. You can download the fine-tuned CHALLENGER models from here. It contains the following models:
- models/CHALLENGER-LR
- models/CHALLENGER-EXP
- models/CHALLENGER-GENE/<gene_name>
- Number of training epochs.
- Path where the trained model weights will be saved.
- Name of the run
- Set to the PCI BUS ID of the GPU in your system.
- There are various ways to check the PCI BUS IDs of the GPUs in your system. For example, you can use the gpustat tool as shown below:
- See help page.
You may want to fine-tune CHALLENGER with your WGS dataset. We provide an example of how CHALLENGER can be fine-tuned using the same small-sized CRAM file along with its corresponding ground truth calls.
Step-0 and Step-1 are the same as the CHALLENGER call example.
- Preprocessing is required to convert raw WGS data into standardized read-depth and metadata formats suitable for CNV calling. The pipeline produces per-sample read-depth files (_W50.txt) and then processes and merges them into a single consolidated Parquet file, which is required for fine-tuning.
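The per-sample merging step described above can be sketched as follows. This is a simplified illustration under assumptions (a two-column region/depth layout in each _W50.txt file), not the actual pipeline logic, which writes a consolidated Parquet file rather than an in-memory list.

```python
import os
import tempfile

# Illustrative sketch (assumptions, not the actual preprocessing script):
# merge per-sample read-depth files (*_W50.txt) into one combined table.
def merge_depth_files(paths):
    rows = []
    for path in paths:
        sample = os.path.basename(path).replace("_W50.txt", "")
        with open(path) as fh:
            for line in fh:
                region, depth = line.split()  # assumed two-column layout
                rows.append((sample, region, float(depth)))
    return rows

# Create two toy per-sample files and merge them.
tmp = tempfile.mkdtemp()
for name, depth in [("s1", "31.5"), ("s2", "28.0")]:
    with open(os.path.join(tmp, f"{name}_W50.txt"), "w") as fh:
        fh.write(f"chr1:0-50 {depth}\n")

merged = merge_depth_files(sorted(
    os.path.join(tmp, f) for f in os.listdir(tmp) if f.endswith("_W50.txt")))
print(merged)
```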
- Please run the following line:
$ source preprocess_samples_FT.sh
- Here, we demonstrate an example of fine-tuning CHALLENGER on GPU device 0.
- Please run the following script:
$ source challenger_FT.sh
- During fine-tuning, CHALLENGER saves the model weights to the directory specified with -o, which in this tutorial is /FT_weights.
- AGPL-3.0
- Copyright 2025 © CHALLENGER.
- For commercial usage, please contact the authors.

