This repository contains a Nextflow-based pipeline for whole-genome sequencing (WGS) analysis and genetic variant calling, optimized for Illumina sequencing data from bacterial genomes. It provides an automated, reproducible, and scalable solution for processing large-scale genomic data in clinical microbiology research.
All modes in the pipeline include the following steps:
- Read quality control and trimming: Quality of raw sequencing data is assessed with FastQC. Low-quality bases and adapter sequences are removed with fastp, followed by a second round of FastQC. A summary of all QC reports is generated with MultiQC.
- Contaminant sequence removal: The taxonomic sequence classifier Kraken2 is used to identify non-bacterial contaminant reads, and seqtk then filters out all reads flagged as contaminants.
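Run by hand, these two steps correspond roughly to the commands below. This is a sketch only: the file names (`sample_1.fastq.gz`, `kept_ids.txt`, etc.) and the exact flags are illustrative assumptions, not pipeline outputs.

```shell
# Raw-read QC
fastqc sample_1.fastq.gz sample_2.fastq.gz

# Trimming and length filtering (values mirror the pipeline defaults below)
fastp -i sample_1.fastq.gz -I sample_2.fastq.gz \
      -o trim_1.fastq.gz  -O trim_2.fastq.gz \
      --cut_front --cut_tail --cut_mean_quality 20 \
      --length_required 50

# Taxonomic classification against a local Kraken2 database
kraken2 --db ./kraken_db --paired \
        --output kraken.out --report kraken.report \
        trim_1.fastq.gz trim_2.fastq.gz

# Keep only reads classified as the target taxon
# (kept_ids.txt: read IDs extracted from kraken.out beforehand)
seqtk subseq trim_1.fastq.gz kept_ids.txt > clean_1.fastq
seqtk subseq trim_2.fastq.gz kept_ids.txt > clean_2.fastq
```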
At this stage, the pipeline offers three modes, which differ in the expected output (variant calling or genome assembly) and the type of input data:
- For Variant Calling, the analysis can be performed using either
- a de novo assembled reference strain (--mode novo) or,
- an existing reference genome (--mode reference).
- A shorter version of the pipeline is available to solely perform genome assembly (--mode assemble).
In both genome assembly (--mode assemble) and de novo variant calling (--mode novo), raw reads are first assembled into a genome through the following steps:
- Assembly: Filtered reads are de novo assembled using SPAdes.
- Genome QC: Structural quality metrics are evaluated with QUAST, and genome completeness is assessed using BUSCO. A final combined report is generated with MultiQC.
- Annotation: Genome annotation is performed using both Prokka and Bakta. The resulting GFF annotation files from both annotation tools are cleaned and combined using AGAT.
- AMR and virulence screening: Contigs are mass-screened for antimicrobial resistance and virulence genes with ABRIcate, and antimicrobial resistance genes and point mutations in protein and/or assembled nucleotide sequences are identified with AMRFinder.
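The assembly stage above can be approximated with the underlying tools as follows. A minimal sketch, assuming filtered reads named `clean_1.fastq`/`clean_2.fastq`; output directory names and database choices are illustrative:

```shell
# De novo assembly of the filtered reads
spades.py -1 clean_1.fastq -2 clean_2.fastq -o spades_out

# Structural quality metrics and genome completeness
quast.py spades_out/contigs.fasta -o quast_out
busco -i spades_out/contigs.fasta -m genome --auto-lineage-prok -o busco_out

# Annotation with both tools (GFFs later cleaned and merged with AGAT)
prokka --outdir prokka_out spades_out/contigs.fasta
bakta --output bakta_out spades_out/contigs.fasta

# AMR / virulence screening of the contigs
abricate --db resfinder spades_out/contigs.fasta > abricate_resfinder.tab
amrfinder -n spades_out/contigs.fasta > amrfinder.tab
```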
After the assembly process, the Variant Calling analysis and the genome assembly mode diverge in their subsequent steps.
Reference and De-novo Variant Calling
Once a reference genome is available (either de novo assembled or an existing reference), the pipeline follows the same steps for both modes:
- Alignment: Reads are aligned to the selected reference genome using BWA-MEM, and the resulting alignments are then processed with Samtools.
- Variant calling and filtering: Several steps are performed to identify, filter and annotate genomic variants.
- Variant Identification: Detection of single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) using PicardTools, GATK and/or FreeBayes.
- Variant Filtering: Filters are applied to obtain high-confidence variant calls (see Parameters).
- Genetic variant annotation: The toolbox SnpEff is used to annotate and predict the functional effects of genetic variants on genes and proteins.
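For orientation, the alignment, calling, filtering, and annotation chain above can be sketched with the underlying tools. File names are assumptions; `my_database` stands for the SnpEff database named via --genome_name_db, and GATK is shown although the pipeline may also use PicardTools and/or FreeBayes:

```shell
# Index the reference for BWA and GATK
bwa index reference.fasta
samtools faidx reference.fasta
gatk CreateSequenceDictionary -R reference.fasta

# Align reads, then sort and index the alignments
bwa mem reference.fasta clean_1.fastq clean_2.fastq | \
    samtools sort -o aligned.sorted.bam -
samtools index aligned.sorted.bam

# Identify SNPs and indels
gatk HaplotypeCaller -R reference.fasta -I aligned.sorted.bam -O raw.vcf.gz

# Hard-filter SNPs with the pipeline's default expression (see Parameters)
gatk VariantFiltration -R reference.fasta -V raw.vcf.gz \
    --filter-expression "QUAL < 50.0 || MQ < 25.0 || DP < 30" \
    --filter-name "lowconf_snp" -O filtered.vcf.gz

# Annotate the functional effects of the variants
snpEff my_database filtered.vcf.gz > annotated.vcf
```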
Genome Assembly
For the --mode assemble, a simplified pipeline is executed:
- MLST analysis: The assembly is sequence-typed using the scheme selected via --organism (see Parameters).
- Staphylococcus aureus: If --mrsa is added to the command line, the spaTyper and sccmec tools are used to identify the spa type and the SCCmec cassettes.
Note
The pipeline includes a script to download the reads from a database using an Acc_List.txt accession list:
bash ./workflow/bin/download_reads.sh
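The script itself is not reproduced here; a hypothetical sketch of its core loop is shown below. The accession numbers and the use of sra-tools' `fasterq-dump` are assumptions, and the commands are only printed (dry run), not executed:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Example accession list; in real use Acc_List.txt is supplied by the user
printf 'SRR000001\nSRR000002\n' > Acc_List.txt

# Emit one download command per accession (drop 'echo' to actually download)
while read -r acc; do
  [ -z "$acc" ] && continue            # skip blank lines
  echo "fasterq-dump --split-files $acc"
done < Acc_List.txt > download_cmds.txt

cat download_cmds.txt
```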
Prerequisites to run the pipeline:
- Install Nextflow (Ver. ≥ 25.10.0).
- Install Docker or Singularity for container support.
- Ensure Java 8 or a later version is installed.
Clone the Repository:
# Clone the workflow repository
git clone https://github.com/AMRmicrobiology/WGS-Analysis-VariantCalling.git
# Move inside the main directory
cd WGS-Analysis-VariantCalling
If you are running the pipeline locally, remember to export the paths for the Singularity temporary files and cache:
export SINGULARITY_TMPDIR=/PATH/singularity/tmp
export SINGULARITY_CACHEDIR=/PATH/singularity/cache
export TMPDIR=/PATH/singularity/tmp
export NXF_SINGULARITY_CACHEDIR=/PATH/singularity/cache
e.g.:
export SINGULARITY_TMPDIR=/mnt/dades/singularity/tmp
export SINGULARITY_CACHEDIR=/mnt/dades/singularity/tmp
export TMPDIR=/mnt/dades/singularity/tmp
export NXF_SINGULARITY_CACHEDIR=/mnt/dades/singularity/tmp
Note
Conda environments are listed and created but have not been tested.
Run the pipeline using the following commands, adjusting the parameters as needed:
REFERENCE GENOME VARIANT CALLING
nextflow run main.nf --mode reference --input "/path/to/data/*_{1,2}.fastq.gz" --personal_ref "/path/to/bacterial_genome.fasta" --genome_name_db "Acinetobacter_baumanii_clinical" -profile <docker/singularity/conda>
DE NOVO VARIANT CALLING
nextflow run main.nf --mode novo --input "/path/to/data/*_{1,2}.fastq.gz" --wildtype_code "Pa01WT" --genome_name_db "Acinetobacter_baumanii_clinical" -profile <docker/singularity/conda>
GENOME ASSEMBLY
nextflow run main.nf --mode assemble --input "/path/to/data/*_{1,2}.fastq.gz" --organism "organism name" -profile <docker/singularity/conda>
Usage: nextflow run main.nf [--help] [--mode VAR] [--input VAR] [--genome_name_db VAR] [--wildtype_code VAR] [--outdir VAR] [--personal_ref VAR] [--custom_gff3 VAR] [--organism VAR] [--cut_front VAR] [--cut_tail VAR] [--cut_mean_quality VAR] [--length_required VAR] [--mrsa] [--snp_filter_expr VAR] [--indel_filter_expr VAR] [--bakta_db VAR] [--db_select VAR] [--abricate_db VAR] [-w VAR] [-profile VAR]
Input data arguments
--mode TEXT Selection of the pipeline mode: assemble/reference/novo [required]
--input PATH Input FASTQ paired-end files named *_{1,2} (.fastq.gz format) [required]
--genome_name_db TEXT (--mode novo/reference) Name of the organism/strain used to name the SnpEff database [required]
--wildtype_code TEXT (--mode novo) Define the sample that will be taken as reference [required]
--personal_ref PATH (--mode reference) Path to the bacterial reference genome in .fasta file [required]
--custom_gff3 PATH (--mode reference) Path to the annotation .gff3 file.
Nextflow arguments
-profile TEXT Selection of execution profile (docker, singularity or conda) [required]
-w PATH Path to the work directory where temporary files will be written [default: ./work]
Output arguments
--outdir PATH Directory to write the output [default: ./out]
Optional arguments
--help Show this message and exit
--mrsa BOOLEAN (--mode assemble) Specific to Staphylococcus aureus genome assemblies. Add this flag to identify the spa type and SCCmec cassettes [default: false]
Raw reads filtering arguments
--cut_front INTEGER Move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise. [default: 15]
--cut_tail INTEGER Move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise [default: 20]
--cut_mean_quality INTEGER The mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36. [default: 20]
--length_required INTEGER Reads shorter than length_required will be discarded [default: 50]
Filtering parameters
--snp_filter_expr TEXT One or more expressions used with INFO fields to quality-filter SNPs [default: "QUAL < 50.0 || MQ < 25.0 || DP < 30"]
--indel_filter_expr TEXT One or more expressions used with INFO fields to quality filter INDELs [default: "QUAL < 200.0 || MQ < 25.0 || DP < 30"]
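Both defaults are filter expressions over VCF INFO fields. To relax the SNP depth cutoff, for example, the default can be overridden on the command line (a sketch; the paths and database name are placeholders as in the run examples above):

```shell
nextflow run main.nf --mode reference \
    --input "/path/to/data/*_{1,2}.fastq.gz" \
    --personal_ref "/path/to/bacterial_genome.fasta" \
    --genome_name_db "my_strain_db" \
    --snp_filter_expr "QUAL < 50.0 || MQ < 25.0 || DP < 20" \
    -profile singularity
```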
MLST and AMR arguments
--organism TEXT Used by ARIBA to determine the MLST scheme for the [bacterial strain](./organisms_list.txt); also used by ABRicate. ABRicate searches the following databases: vfdb_full, resfinder, plasmidfinder, and card. If Escherichia coli or Klebsiella pneumoniae is specified, ecoli_vf or argannot, respectively, is searched instead of vfdb_full [default: ""] [required for --mode assemble]
Databases arguments
--bakta_db PATH Path to a user-downloaded database to be used by Bakta. By default the database is downloaded if no argument is given. Alternatively, copy the database directly into the "./bakta_db" directory
--db_select TEXT/PATH Kraken2 database to use for taxonomy classification. The options "db_16GB" or "db_full_60GB" are downloaded automatically if specified. Alternatively, a path to a user-provided database may be supplied. Another option is to copy-paste the database directly into the "./kraken_db" directory [default: "db_16GB"]
--abricate_db PATH Path to the user-downloaded databases to be used by ABRicate
Note
Regarding the GATK quality parameters: QUAL is a confidence measure of the variant; MQ is the mapping quality; DP is the number of filtered reads supporting each reported allele (depth). See the GATK documentation for details.
This is the folder architecture and the content of the output data directories. Outputs can vary depending on the mode selected:
| Folder | Subfolder | Description |
|---|---|---|
| 1-QC | fastQC | MultiQC report of data quality assessment of all samples |
| 2-Assembly | | All final consensus genome assemblies ("sample_ID"_consensus_wrapped.fasta) |
| | 1-Fly_structural | NanoStats results and Flye output directories containing the graph files |
| | 2-Medaka_results | Medaka output directories |
| | 3-Annotations | All combined files produced by AGAT from the Bakta and Prokka annotation tools, plus the Bakta and Prokka output directories |
| 3-Prunning | ABRICATE | ABRicate search results |
| | AMRFinder | AMRFinder search results |
| 4-VCF | | MLST results |
| | Variant annotation | MOB-suite plasmid tool output directories |
