Comprehensive structural variant calling, filtering, annotation and consensus set generation from ONT's long read sequencing data

manascripts/ontvar

nf-core/ontvar

Introduction

nf-core/ontvar is a comprehensive structural variant (SV) calling, filtering, annotation and consensus generation pipeline for Oxford Nanopore Technologies (ONT) long-read sequencing data.

Key Features

  • Multi-caller SV detection: Sniffles, cuteSV, and Severus for comprehensive variant discovery
  • Case-control aware analysis: Support for tumor-normal paired analysis and tumor-only with panel of normals
  • Consensus calling: Sample-level caller merging with configurable support thresholds
  • Population frequency filtering: Integration with gnomAD and custom population databases
  • Comprehensive annotation: AnnotSV provides gene-based and regulatory annotations
  • Cohort-level analysis: Multi-sample variant merging and analysis
  • Interactive visualizations: Detailed QC plots and summary statistics at each stage

Workflow Overview

ontvar Workflow

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

The pipeline consists of the following major steps:

  1. SV Calling: Run Sniffles, cuteSV, and Severus callers on input samples
  2. Sample Consensus: Merge caller results per sample using Jasmine (caller support filter)
  3. Population Annotation: Add allele frequency information from gnomAD and from long-read-based healthy population databases (using SVDB)
  4. Sample Filtering: Remove common variants based on population frequencies
  5. Sample Annotation: Comprehensive AnnotSV annotation of sample variants
  6. Cohort Merging: Create cohort-wide merged callset using Jasmine
  7. Cohort Filtering: Apply population frequency filters at cohort level
  8. Final Annotation: AnnotSV annotation of final cohort callset
  9. QC & Visualization: Generate summary statistics and plots at each stage
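
As a rough illustration of the caller support filter in step 2, the sketch below keeps a merged VCF record only when enough callers support it. It assumes the merged VCF carries a SUPP entry in the INFO column giving the number of supporting callers; check the actual Jasmine output before relying on that tag. The function names are hypothetical and this is not the pipeline's own implementation.

```python
def parse_info(info_field: str) -> dict:
    """Split a VCF INFO column into a dict; flag-style keys map to True."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def passes_caller_support(vcf_line: str, min_support: int = 2) -> bool:
    """Return True if the record's SUPP value is at least min_support.

    Assumption: SUPP holds the number of callers supporting the record.
    Records without SUPP are treated as having zero support.
    """
    columns = vcf_line.rstrip("\n").split("\t")
    info = parse_info(columns[7])  # INFO is the eighth VCF column
    return int(info.get("SUPP", 0)) >= min_support


# Made-up record for illustration only.
record = "chr1\t1000\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SUPP=2;SUPP_VEC=110"
print(passes_caller_support(record, min_support=2))
print(passes_caller_support(record, min_support=3))
```

With a threshold of 2 the example record is kept; with a threshold of 3 it is dropped.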

Quick Start

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

sample,fastq,bam,fasta,control
CASE_01,/path/to/case01.fastq.gz,/path/to/case01.bam,/path/to/case01.fasta,false
CASE_02,/path/to/case02.fastq.gz,/path/to/case02.bam,,false
CONTROL_01,/path/to/control01.fastq.gz,/path/to/control01.bam,/path/to/control01.fasta,true

Samplesheet Format

Each row represents a sample with the following columns:

| Column | Required | Description |
|---|---|---|
| group_id | Yes | Group identifier used to pair samples |
| sample_id | Yes | Unique ID for each sample |
| sample_type | Yes | String indicating whether the sample is a case or a control |
| bam_path | Yes | Path to the aligned BAM file |
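
Before launching a run, it can help to check the samplesheet against the columns listed above. The following is a minimal sketch, not the pipeline's own validation: the function name is hypothetical, and the sample_type values in the demo ("case", "control") are placeholders assumed for illustration.

```python
import csv
import io

# Column names taken from the Samplesheet Format table.
REQUIRED_COLUMNS = ["group_id", "sample_id", "sample_type", "bam_path"]


def validate_samplesheet(handle) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing required column(s): {', '.join(missing)}"]

    problems = []
    seen_ids = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            if not (row[column] or "").strip():
                problems.append(f"line {line_no}: empty value for {column}")
        sample_id = (row["sample_id"] or "").strip()
        if sample_id and sample_id in seen_ids:
            problems.append(f"line {line_no}: duplicate sample_id {sample_id}")
        seen_ids.add(sample_id)
    return problems


# Demo with placeholder values; the second data row lacks a BAM path.
demo = io.StringIO(
    "group_id,sample_id,sample_type,bam_path\n"
    "G1,CASE_01,case,/path/to/case01.bam\n"
    "G1,CONTROL_01,control,\n"
)
for problem in validate_samplesheet(demo):
    print(problem)
```

For a real file, pass an open handle instead: `validate_samplesheet(open("samplesheet.csv", newline=""))`.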

Now, you can run the pipeline using:

nextflow run nf-core/ontvar \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR> \
   --reference reference.fa

Required Parameters

| Parameter | Description | Example |
|---|---|---|
| --input | Path to comma-separated sample sheet file | path/to/samplesheet.csv |
| --outdir | Output directory path | path/to/outdir |
| --reference | Reference genome FASTA file | path/to/hg38.fa |

Note

After the first run, it is recommended to pass the path to the AnnotSV annotation directory via --annotsv_annotations, so that the annotations are not downloaded again in future runs.

Customizing Pipeline Parameters

The pipeline offers extensive customization options for each step of the analysis. All parameters can be adjusted to fit your specific needs:

SV Caller Parameters: Fine-tune settings for Sniffles, cuteSV, and Severus including minimum mapping quality, SV size thresholds, read support requirements, and more.

Consensus & Filtering Parameters: Adjust caller support thresholds (e.g., require 2 or 3 callers), population frequency cutoffs, overlap ratios for merging, and distance thresholds.

Annotation Parameters: Configure AnnotSV annotation databases, genome builds, output formats, and annotation detail levels.

Database Parameters: Specify custom SVDB population databases, panel of normals files, and AnnotSV annotation paths.

These are configurable via command-line flags or in the nextflow.config file.

Chromosome Filtering

By default, the pipeline retains only the main contigs (chr1-22, chrX, chrY, chrM). This behaviour is controlled in the FILTER_CHR module.
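
To make the effect of this filter concrete, here is a small sketch that keeps only records on the main contigs. It is an illustration, not the FILTER_CHR module itself, and it assumes UCSC-style contig names with a chr prefix; the function name is hypothetical.

```python
# Assumption: contigs are named chr1..chr22, chrX, chrY, chrM.
MAIN_CONTIGS = {f"chr{n}" for n in range(1, 23)} | {"chrX", "chrY", "chrM"}


def keep_main_contigs(vcf_lines):
    """Yield header lines unchanged and records located on main contigs."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # header lines are passed through as-is
        elif line.split("\t", 1)[0] in MAIN_CONTIGS:
            yield line


# Made-up records for illustration only.
lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL\n",
    "chrUn_KI270442v1\t200\tsv2\tN\t<INS>\t.\tPASS\tSVTYPE=INS\n",
    "chrX\t300\tsv3\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP\n",
]
kept = list(keep_main_contigs(lines))
print(len(kept))
```

Note that this sketch leaves header contig declarations for the dropped contigs in place; a production filter would typically prune those as well.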

Advanced Usage

Custom Filtering Thresholds

Adjust caller support and population frequency thresholds:

# --min_caller_support 3 requires support from all 3 callers.
# --max_gnomad_af and --max_needlr_af change the population frequency cutoffs.
nextflow run nf-core/ontvar \
   -profile docker \
   --input samplesheet.csv \
   --outdir results \
   --reference reference.fa \
   --min_caller_support 3 \
   --max_gnomad_af 0.001 \
   --max_needlr_af 0.001
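
When several thresholds are changed together, it can be easier to keep them in a parameters file, which Nextflow reads via its -params-file option (JSON or YAML). The sketch below renders such a file from the parameters shown in this README; the helper name is hypothetical and the values are only examples.

```python
import json

# Parameter names as used on the command line in this README, without "--".
params = {
    "input": "samplesheet.csv",
    "outdir": "results",
    "reference": "reference.fa",
    "min_caller_support": 3,
    "max_gnomad_af": 0.001,
    "max_needlr_af": 0.001,
}


def render_params(values: dict) -> str:
    """Serialize pipeline parameters as JSON text for -params-file."""
    return json.dumps(values, indent=2) + "\n"


print(render_params(params), end="")
```

Saving the printed text as params.json allows the run to be started with `nextflow run nf-core/ontvar -profile docker -params-file params.json`.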

Pipeline output

To see the results of an example test run with a full-size dataset, refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.

The pipeline generates outputs organized into case-level and cohort-level directories:

Case-Level Outputs (results/case/)

1. Raw Calls (01_raw_calls/)

  • Individual caller VCFs for each sample
  • Subdirectories: sniffles/, cutesv/, severus/
  • Summary JSON and count plots

Files:

01_raw_calls/
├── sniffles/
│   ├── SAMPLE1_sniffles.vcf.gz
│   └── SAMPLE2_sniffles.vcf.gz
├── cutesv/
│   ├── SAMPLE1_cutesv.vcf
│   └── SAMPLE2_cutesv.vcf
├── severus/
│   ├── tumor_normal/
│   │   └── SAMPLE1_tn_severus.vcf
│   └── tumor_only/
│       └── SAMPLE2_to_severus.vcf
├── raw_calls_summary.json
├── raw_callers_plot_sv_counts_stacked.png
├── raw_callers_plot_sv_counts_callers.png
└── raw_callers_plot_sv_counts.png

2. Caller Merged (02_caller_merged/)

  • Sample-level consensus VCFs (filtered by caller support)
  • AnnotSV annotations (pre-filtering)
  • Summary statistics and plots

Files:

02_caller_merged/
├── SAMPLE1.vcf
├── SAMPLE1.tsv                    # AnnotSV full annotation
├── SAMPLE1.annotated.tsv          # AnnotSV gene-level
├── caller_merged_summary.json
├── consensus_plot_sv_counts_stacked.png
└── consensus_plot_sv_counts.png

3. Caller Merged Filtered (03_caller_merged_filtered/)

  • Population frequency filtered VCFs
  • Final AnnotSV annotations
  • Summary statistics and plots

Files:

03_caller_merged_filtered/
├── SAMPLE1_filtered.vcf
├── SAMPLE1_filtered.tsv
├── SAMPLE1_filtered.annotated.tsv
├── filtered_summary.json
├── filtered_plot_sv_counts_stacked.png
└── filtered_plot_sv_counts.png

Cohort-Level Outputs (results/cohort/)

Files:

cohort/
├── cohort_annotated.vcf                    # AnnotSV (pre-AF filtering)
├── cohort_annotated.tsv                    # AnnotSV (pre-AF filtering)
├── cohort_filtered.vcf                     # AnnotSV (post-AF filtering)
├── cohort_filtered.tsv                     # AnnotSV (post-AF filtering)
├── cohort_annotated_summary.json
├── cohort_annotated_sv_counts.png
├── cohort_filtered_summary.json
└── cohort_filtered_sv_counts.png

MultiQC Report

A comprehensive HTML report combining all QC metrics:

results/multiqc/multiqc_report.html

Summary JSON Format

Each *_summary.json file contains SV counts by:

  • Sample
  • Caller
  • SV type (DEL, INS, DUP, INV, BND, etc.)

Example structure:

{
  "analysis_type": "multi_sample",
  "samples": {
    "SAMPLE1": {
      "callers": {
        "sniffles": {
          "sv_types": {
            "DEL": {"count": 1234},
            "INS": {"count": 567}
          }
        }
      },
      "combined_stats": {
        "sv_types": {
          "DEL": {"count": 1500}
        }
      }
    }
  }
}
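
The nested layout above can be flattened into rows for downstream tabulation. The sketch below assumes only the keys shown in the example structure (samples, callers, sv_types, count); the function name is hypothetical, and real summary files may contain additional fields.

```python
import json


def sv_counts(summary: dict):
    """Yield (sample, caller, sv_type, count) rows from a summary dict."""
    for sample, sample_data in summary.get("samples", {}).items():
        for caller, caller_data in sample_data.get("callers", {}).items():
            for sv_type, stats in caller_data.get("sv_types", {}).items():
                yield sample, caller, sv_type, stats["count"]


# Same content as the example structure above.
example = json.loads("""
{
  "analysis_type": "multi_sample",
  "samples": {
    "SAMPLE1": {
      "callers": {
        "sniffles": {
          "sv_types": {"DEL": {"count": 1234}, "INS": {"count": 567}}
        }
      },
      "combined_stats": {"sv_types": {"DEL": {"count": 1500}}}
    }
  }
}
""")

rows = list(sv_counts(example))
for sample, caller, sv_type, count in rows:
    print(sample, caller, sv_type, count, sep="\t")
```

To process a real file, replace the inline example with `json.load(open("raw_calls_summary.json"))`.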

Credits

nf-core/ontvar is written and maintained by Manas Sehgal.

Tools Used

This pipeline integrates the following tools:

  • Sniffles2 - SV calling from long reads
  • cuteSV - Long-read SV detection
  • Severus - Somatic SV calling
  • Jasmine - SV merging and comparison
  • AnnotSV - Structural variant annotation
  • SVDB - Structural variant population frequency annotation
  • BCFtools - VCF manipulation
  • MultiQC - Quality control reporting

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
