WGS-Analysis-VariantCalling

Introduction

This repository contains Nextflow-based pipeline for whole-genome sequencing (WGS) analysis and genetic variant calling, specifically optimized for Illumina sequencing data from bacterial genomes. It is designed to provide an automated, reproducible, and scalable solution for processing large-scale genomic data in clinical microbiology research.

Pipeline summary:

All modes in the pipeline include the following steps:

Reads Quality Control and trimming: Quality of raw sequencing data is assessed using FastQC. Low-quality bases and adapter sequences are removed with FastP, followed by another round of FastQC. A summary of all QC reports is generated with MultiQC.
Contaminant sequence removal: The taxonomic sequence classifier Kraken2 is used to identify contaminant non-bacterial reads followed by SEQTK to filter out all reads flagged as contaminants.

At this stage, the pipeline offers three modes, which differ based on the expected output —Variant Calling or Genome Assembly— and the type of input data:

For Variant Calling, the analysis can be performed using either
- a de novo assembled reference strain (--mode novo) or,
- an existing reference genome (--mode reference).
A shorter version of the pipeline is available to solely perform genome assembly (--mode assemble).

In both genome assembly (--mode assemble) and de novo Variant Calling (--mode novo), raw reads are initially assembled into a genome following the following steps:

Assembly: Filtered reads are de novo assembled using SPAdes.
Genome QC: Structural quality metrics are evaluated with QUAST, and genome completeness is assessed using BUSCO. A final combined report is generated with MultiQC.
Annotation: Genome annotation is performed using both Prokka and Bakta. The resulting GFF annotation files from both annotation tools are cleaned and combined using AGAT.
Mass screening of contigs for antimicrobial resistance and virulence genes using ABRIcate and identification of antimicrobial resistance genes and point mutations in protein and/or assembled nucleotide sequences using AMRFinder.

After the assembly process, the Variant Calling analysis and the genome assembly mode diverge in their subsequent steps.

mode --reference & mode --novo

Reference and De-novo Variant Calling
Once a reference genome is provided —either de novo assembled or an existing reference— the pipeline follows the same steps for both modes:

Alignment: Reads are aligned to the selected reference genome using BWA-MEM, and the resulting alignments are then processed with Samtools.
Variant calling and filtering: Several steps are performed to identify, filter and annotate genomic variants.
- Variant Identification: Detection of single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) using PicardTools, GATK and/or FreeBayes.
- Variant Filtering: Filters are applied to obtain high-confidence variant calls (see Parameters).
- Genetic variant annotation: The toolbox SnpEff is used to annotate and predict the functional effects of genetic variants on genes and proteins.

mode --assemble

Genome Assembly
For the --mode assemble, a simplified pipeline is executed:

MLST analysis:
- A fast MLST analysis is performed using raw FASTQ reads with ARIBA
- A slow MLST analysis is performed on the assembled genome using MLST.
Staphylococcus aureus: In case --mrsa is added in the command line, the spaTyper and sccmec tools are used to identifying the spa type and the SCCmec cassettes.

Note

The pipeline includes an script to download the reads from DB using an Acc_List.txt
bash ./workflow/bin/download_reads.sh

Installation

Prerequisites to run the pipeline:

Install Nextflow (Ver. ≥ 25.10.0).
Install Docker or Singularity for container support.
Ensure Java 8 or a later version is installed.

Clone the Repository:

# Clone the workflow repository
git clone https://github.com/AMRmicrobiology/WGS-Analysis-VariantCalling.git

# Move inside the main directory
cd WGS-Analysis-VariantCalling

Local (Singularity)

If you are running the pipeline locally, remember to define the path for Singularity temporary files and cache:

SINGULARITY_TMPDIR=/PATH/singularity/tmp
SINGULARITY_CACHEDIR=/PATH/singularity/cache
TMPDIR=/PATH/singularity/tmp
export NFX_SINGULARITY_CACHEDIR =/PATH/singularity/tmp

e.g:

SINGULARITY_TMPDIR=/mnt/dades/singularity/tmp
SINGULARITY_CACHEDIR=/mnt/dades/singularity/tmp
TMPDIR=/mnt/dades/singularity/tmp
export NFX_SINGULARITY_CACHEDIR=/mnt/dades/singularity/tmp

Note

Conda environments are listed and created but have not been tested.

How to use it?

Run the pipeline using the following commands, adjusting the parameters as needed:

REFERENCE GENOME VARIANT CALLING

nextflow run main.nf --mode reference --input "/path/to/data/*_{1,2}.fastq.gz" --personal_ref "/path/to/bacterial_genome.fasta" --genome_name_db "Acinetobacter_baumanii_clinical" -profile <docker/singularity/conda>

DE NOVO VARIANT CALLING

nextflow run main.nf --mode novo --input "/path/to/data/*_{1,2}.fastq.gz" --wildtype_code "Pa01WT" --genome_name_db "Acinetobacter_baumanii_clinical" -profile <docker/singularity/conda>

GENOME ASSEMBLY

nextflow run main.nf --mode assemble --input "/path/to/data/*_{1,2}.fastq.gz" --organism "organism name" -profile <docker/singularity/conda>

Usage and parameters

Usage: nextflow run main.nf [--help] [--mode VAR] [--input VAR] [--genome_name_db VAR] [--wildtype_code VAR] [--outdir VAR] [--personal_ref VAR] [--custom_gff3 VAR] [--organism VAR] [--cut_front VAR] [--cut_tail VAR] [--cut_mean_quality VAR] [--length_required VAR] [--mrsa] -[-qual_snp VAR] [--qual_indel VAR] [--bakta_db_define VAR] [--db_select VAR] [--abricate_db VAR] [-w VAR] [-profile VAR]

Input data arguments
  --mode                TEXT        Selection of the pipeline assemble/reference/novo [required]
  --input               PATH        Input FASTQ paired-end files named *_{1,2} (.fastq.gz format) [required]
  --genome_name_db      TEXT        (--mode novo/reference) Name of the organism/strain to name the SnpEFF database [required]
  --wildtype_code       TEXT        (--mode novo) Define the sample that will be taken as reference [required]
  --personal_ref        PATH        (--mode reference) Path to the bacterial reference genome in .fasta file [required]
  --custom_gff3         PATH        (--mode reference) Path to the annotation .GGF3 file.

Nextflow arguments
  -profile              TEXT        Selection of execution profile (docker, singularity or conda) [required]
  -w                    PATH        Path to the work dir. where temporary files will be written [default: ./work ]

Output arguments
  --outdir              PATH        Directory to write the output [default: ./out]

Optional arguments 
  --help                            Show this message and exit      
  --mrsa                BOOLEAN     (--mode assemble) Specific for <Staphylococcus aureus> genome assemblies. Add this parameter to identify the spa Type and SCCmec cassettes [dafault: false]

Raw reads filtering arguments
  --cut_front           INTEGER     Move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise. [default: 15]
  --cut_tail            INTEGER     Move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise [default: 20]
  --cut_mean_quality    INTEGER     The mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36. [default: 20]
  --length_required     INTEGER     Reads shorter than length_required will be discarded [default: 50]

Filtering parameters
  --snp_filter_expr     TEXT        One or more expressions used with INFO fields to quality filter SNPs [default "QUAL < 50.0 || MQ < 25.0 || DP < 30"].
  --indel_filter_expr   TEXT        One or more expressions used with INFO fields to quality filter INDELs [default: "QUAL < 200.0 || MQ < 25.0 || DP < 30"]

MLST and AMR arguments
  --organism            TEXT        To be used by ARIBA to determine the scheme to classify the [bacterial strain](./organisms_list.txt). Also, it will be used by ABRicate. ABRicate searches the following databases: vfdb_full, resfinder, plasmidfinder, and card. If Escherichia coli or Klebsiella pneumoniae is specified, ecoli_vf and argannot will be searched, respectively, instead of vfdb_full [default: ""] [required for mode --assemble] 

Databases arguments
  --bakta_db         PATH        Define the path to the user downloaded database to be used by Bakta. By default the database is downloaded if no argument is added. Another option is to copy-paste the database directly to the "./bakta_db" directory
  --db_select        TEXT/PATH   Kraken2 database to use for taxonomy classification. The options "db_16GB" or "db_full_60GB" are downloaded automatically if specified. Alternatively, a path to a user-provided database may be supplied. Another option is to copy-paste the database directly into the "./kraken_db" directory [default: "db_16GB"]
  --abricate_db      PATH        Path to the user downloaded databases to be used by Abricate

Note

Regarding GATK quality parameters --> QUAL: A confidence measure of the variant; MQ: Mapping quality; DP: Filtered reads that support each of the reported alleles (depth). More info here.

Output

This is the folder architecture and the content of the output data directories. Outputs can vary depending on the mode selected:

Folder	Subfolder	Description
1-QC	fastQC	MultiQC report of data quality assessment of all samples
2-Assembly		All final consensus genomes assemblies ("sample_ID"_consensus_wrapped.fasta)
	1-Fly_structural	Nanostats results and Flye output results directories containing the graph files
	2-Medaka_results	Medaka output directories
	3-Annotations	All combined files produced by AGAT from Bakta and Prokka annotation tools are located here. Also, Bakta and Prokka output directories
3-Prunning	ABRICATE	ABRICATE search results
3-Prunning	AMRFinder	AMRFinder search results
4-VCF		MLST results
Variant annotation		MOB-suite plasmid tool output directories

Reference:

In Silico Evaluation of Variant Calling Methods for Bacterial Whole-Genome Sequencing Assays

Recommendations for clinical interpretation of variants found in non-coding regions of the genome

An ANI gap within bacterial species that advances the definitions of intra-species units

Evaluation of serverless computing for scalable execution of a joint variant calling workflow

GATK hard filtering: tunable parameters to improve variant calling for next generation sequencing targeted gene panel data

Assembling the perfect bacterial genome using oxford nanopore and illumina sequencing

Name		Name	Last commit message	Last commit date
Latest commit History 253 Commits
bakta_db		bakta_db
bin		bin
data/in		data/in
envs		envs
kraken_db		kraken_db
subworkflow		subworkflow
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
Dockerfile		Dockerfile
Dockerfile.ARIBA		Dockerfile.ARIBA
Dockerfile.abricate		Dockerfile.abricate
Dockerfile.agat		Dockerfile.agat
Dockerfile.bakta		Dockerfile.bakta
Dockerfile.baktaDB		Dockerfile.baktaDB
Dockerfile.busco		Dockerfile.busco
Dockerfile.kraken2.DB		Dockerfile.kraken2.DB
Dockerfile.snpeff		Dockerfile.snpeff
LICENSE		LICENSE
PipelineCP_V2.0.png		PipelineCP_V2.0.png
README.md		README.md
gtdbtk-entrypoint.sh		gtdbtk-entrypoint.sh
kraken2-entrypoint.sh		kraken2-entrypoint.sh
main.nf		main.nf
nextflow.config		nextflow.config
organisms_list.txt		organisms_list.txt
pubmlst_getter.py		pubmlst_getter.py
rest.pubmlst.org.json		rest.pubmlst.org.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WGS-Analysis-VariantCalling

Introduction

Contents

Pipeline summary:

mode --reference & mode --novo

mode --assemble

Installation

Local (Singularity)

How to use it?

Usage and parameters

Output

Reference:

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

AMRmicrobiology/WGS-Analysis-VariantCalling

Folders and files

Latest commit

History

Repository files navigation

WGS-Analysis-VariantCalling

Introduction

Contents

Pipeline summary:

mode --reference & mode --novo

mode --assemble

Installation

Local (Singularity)

How to use it?

Usage and parameters

Output

Reference:

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages