Caution
The BOLD database search functionality is currently not operational. Please use the BLAST Core Nucleotide Database option instead.
- Example workflow report
- Documentation of the analysis
- Running tests with nf-test
- Python scripts (for developers)
The pipeline orchestrates a series of analytical steps, each encapsulated in a dedicated module or subworkflow. The main stages are:
- **Environment Configuration**: Sets up environment variables and paths required for downstream processes, ensuring reproducibility and portability.
- **Input Validation**: Checks the integrity and compatibility of input files (FASTA sequences, metadata, databases), preventing downstream errors.
- **Sequence Search**:
  - BLAST Core Nucleotide Database (BLASTN): Queries input sequences against the NCBI nucleotide database using BLASTN.
  - BOLD v4 (API): Queries input sequences against the Barcode of Life Data Systems; taxonomic lineage is included in the results.
- **Hit Extraction**: Parses BLAST results to extract relevant hits for each query.
- **Taxonomic ID Extraction**: Retrieves taxids for BLAST hit records.
- **Build Taxonomic Lineage**: Maps taxonomic IDs to full lineages, enabling downstream filtering and reporting.
- **Candidate Evaluation**: Identifies candidate species for each query, applying configurable thresholds for identity and coverage.
- **Supporting Evidence Evaluation**:
  - Multiple Sequence Alignment (MAFFT): Aligns candidate and query sequences to prepare for phylogenetic analysis.
  - Phylogenetic Tree Construction (FastME): Builds a phylogenetic tree to visualise the relatedness of candidate and query sequences.
- **Workflow Report**: Generates detailed HTML and text reports, including sequence alignments, phylogenetic trees, database coverage, and supporting evidence for each assignment.
To run the qcif/taxodactyl pipeline, you will need the following software installed:
- **Nextflow**: Tested versions 24.10.3 and 24.10.6.
- **Java**: Required by Nextflow. Tested version: 17.0.13.
- **Singularity**: Used for containerised execution of all bioinformatics tools, ensuring reproducibility. Tested version: 3.7.0.
Note
- Instructions on how to set up Nextflow and a compatible version of Java can be found on this page.
- To install Singularity, follow the instructions from this website.
- We provide different profiles, as per the default nf-core configuration; however, this pipeline has only been tested with Singularity.
- The pipeline was tested only on a Linux-based operating system - specifically, Ubuntu 24.04.1 LTS.
- If you have never downloaded or run a Nextflow pipeline, we have some additional tips and bash commands in the step-by-step guide.
An API key is used to authenticate with the NCBI Entrez API for an increased rate limit. You can generate one by following the instructions in this article.
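For illustration only (not part of the pipeline): the key is passed to NCBI E-utilities requests as an `api_key` query parameter. The endpoint below is the public E-utilities base URL; `YOUR_KEY` is a placeholder, not a real key.

```shell
# Build an NCBI E-utilities request URL that carries the API key.
# With a key, NCBI raises the per-IP rate limit from 3 to 10 requests/second.
NCBI_API_KEY="YOUR_KEY"   # placeholder - substitute your generated key
EUTILS_BASE="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
URL="${EUTILS_BASE}/efetch.fcgi?db=taxonomy&id=9606&api_key=${NCBI_API_KEY}"
echo "$URL"
```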
Download the NCBI taxonomy data files and extract them to `~/.taxonkit`. Similarly, download the taxonkit tool and move it into the same folder.
Note
- The current version of TAXODACTYL is not compatible with taxonkit/taxdump files downloaded before April 2025. This affects viruses in particular: evaluating database coverage for viruses will likely return an error.
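The setup described above can be sketched as follows. This is an illustrative sketch, not pipeline code: the taxdump URL is NCBI's public FTP mirror, and the taxonkit asset name assumes a Linux x86_64 build; verify both against the NCBI site and the taxonkit release page before use (network access required).

```shell
# Sketch: fetch NCBI taxonomy dump and taxonkit into ~/.taxonkit.
# URLs and asset names are assumptions - check them before relying on this.
TAXONKIT_DIR="$HOME/.taxonkit"
mkdir -p "$TAXONKIT_DIR"
cd "$TAXONKIT_DIR"

# NCBI taxonomy dump (contains names.dmp, nodes.dmp, etc.)
wget -q https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz \
  && tar -xzf taxdump.tar.gz \
  || echo "taxdump download failed (network required)"

# taxonkit binary (example asset name for Linux x86_64; see the releases page)
wget -q https://github.com/shenwei356/taxonkit/releases/latest/download/taxonkit_linux_amd64.tar.gz \
  && tar -xzf taxonkit_linux_amd64.tar.gz \
  || echo "taxonkit download failed (network required)"
```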
To search sequences against the BLAST Core Nucleotide Database, you must first download it. We recommend running the `update_blastdb.pl` program; follow the instructions from this book. A Perl installation is required.
The command should look like this:

```bash
perl ~/ncbi-blast-2.16.0+/bin/update_blastdb.pl --decompress core_nt
```
You can provide query sequences in either of two ways:
- Provide a FASTA file using `--sequences`.
- Add a `sequence` column to `metadata.csv` and omit `--sequences`.
If using a FASTA file, it should contain the query sequences (up to 100), e.g.
>VE24-1075_COI
TGGATCATCTCTTAGAATTTTAATTCGATTAGAATTAAGACAAATTAATTCTATTATTWATAATAATCAATTATATAATGTAATTGTTCACAATTCATGCTTTTATTATAATTTTTTTTATAACTATACCAATTGTAATTGGTGGATTTGGAAATTGATTAATTCCTATAATAATAGGATGTCCTGATATATCATTTCCACSTTTAAATAATATTAGATTTTGATTATTACCTCCATCATTAATAATAATAATTTGTAGATTTTTAATTAATAATGGAACAGGAACAGGATGAACAATTTAYCCHCCTTTATCAAACAATATTGCACATAATAACATTTCAGTTGATTTAACTATTTTTTCTTTACATTTAGCAGGWATCTCATCAATTTTAGGAGCAATTAACTTTATTTGTACAATTCTTAATATAATAYCAAAYAATATAAAACTAAATCAAATTCCTCTTTTTCCTTGATCAATTTTAATTACAGCTATTTTATTAATTTTATMTTTACCAGTTTTAGCTGGTGCCATTACAATATTATTAACTGATCGTAATTTAAATACATCATTTTTGATCCAGCAGGAGGAGGAGATCC
>VE24-1079_COI
AACTTTATATTTCATTTTTGGAATATGGGCAGGTATATTAGGAACTTCACTAAGATGAATTATTCGAATTGAACTTGGACAACCAGGATCATTTATTGGAGATGATCAAATTTATAATGTAGTAGTTACCGCACACGCATTTATTATAATTTTCTTTATAGTTATACCAATTATAATTGGAGGATTTGGAAATTGATTAGTACCTCTAATAATTGGAGCACCAGATATAGCATTCCCACGGATAAATAATATAAGATTTTGATTATTACCACCCTCAATTACACTTCTTATTATAAGATCTATAGTAGAAAGAGGAGCAGGAACTGGATGAACAGTATATCCCCCACTATCATCAAATATTGCACATAGTGGAGCATCAGTAGACCTAGCAATTTTTTCACTACATTTAGCAGGTGTATCTTCAATTTTAGGAGCAATTAATTTCATCTCAACAATTATTAATATACGACCTGAAGGCATATCTCCAGAACGAATTCCATTATTTGTATGATCAGTAGGTATTACAGCATTACTATTATTATTATCATTACCAGTTCTAGCTGGAGCTATTACAATATTATTAACAGATCGAAACTTTAATACCTCATTCTTTGACCCAGTAGGAGGAGGAGATCCTATCTTATATCAACATTTATTTTGATTTTTT
Note
- An example can be downloaded from `test/query.fasta`.
- Supported FASTA file extensions are `.fa`, `.fas`, `.fna`, and `.fasta`.
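As a quick illustrative check (not part of the pipeline), you can count FASTA headers to confirm a file stays within the 100-query limit; the toy file below is fabricated for the example.

```shell
# Create a toy two-record FASTA, then count header lines (one per query).
cat > query_demo.fasta <<'EOF'
>VE24-1075_COI
TGGATCATCTCTTAGAATTTTAATTCGATTAGAA
>VE24-1079_COI
AACTTTATATTTCATTTTTGGAATATGGGCAGGT
EOF

N_QUERIES=$(grep -c '^>' query_demo.fasta)
echo "queries: $N_QUERIES"
if [ "$N_QUERIES" -gt 100 ]; then
  echo "too many query sequences (max 100)" >&2
fi
```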
The metadata file provides essential information about each sequence and must follow the structure below. Each row corresponds to a sample and should include required and, optionally, additional columns.
- sample_id: Unique identifier for the sample. Must match the sequence ID in the `sequences.fasta` file. Cannot contain spaces.
- locus: Name of the genetic locus for the sample, which must be in the list of permitted loci. If deliberately providing no locus, the value `NA` is also accepted.
  - Note: By default, `COX1_SPECIES_PUBLIC` (all published COI records from BOLD and GenBank with a minimum sequence length of 500 bp) is used for the BOLD search, so the locus from the metadata is ignored when `db_type = bold`.
  - You can change the BOLD database via the `bold_database_name` parameter (see docs/params.md); however, we have not tested BOLD databases other than `COX1_SPECIES_PUBLIC`.
  - Locus synonyms are checked as well (see `scripts/config/loci.json`).
  - If you need to modify which loci and synonyms are permitted, see the technical documentation.
- preliminary_id: Preliminary morphology ID of the sample.
- taxa_of_interest: Taxa of interest for the sample. If multiple, separate them with a `|` character.
- country: Country of origin for the sample. If unknown, leave this field empty (do not use `NA`, which is the country code for Namibia).
- classification: High-level taxonomic classification for the sample. Must be one of `animalia`, `plantae`, `fungi`, `chromista`, `bacteria`, `archaea`, `viruses`.
- sequence: Nucleotide sequence for the sample (required if `--sequences` is not provided).
In addition to the above, you can include arbitrary columns (e.g., host, sequencing_platform, sequencing_read_coverage) which will be displayed in the workflow report's "Sample metadata" section.
| sample_id | locus | preliminary_id | taxa_of_interest | country | classification | host | sequencing_platform | sequencing_read_coverage |
|---|---|---|---|---|---|---|---|---|
| VE24-1075_COI | COI | Aphididae | Myzus persicae|Aphididae | Ecuador | animalia | Cut flower Rosa | Nanopore | 30x |
| VE24-1079_COI | COI | Miridae | Lygus pratensis | Netherlands | animalia | Cut flower Paenonia | Nanopore | 30x |
Note
- All required columns must be present for every sample.
- Optional columns can be left blank or omitted entirely if not applicable.
- Arbitrary columns (such as `host`, `sequencing_platform`, `sequencing_read_coverage`) will be displayed in the workflow report's "Sample metadata" section.
- For more details on the metadata schema, see `assets/schema_input.json`.
- An example can be downloaded from `test/metadata.csv`.
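A simple illustrative sanity check (not part of the pipeline) for the metadata file: confirm the required columns are present in the header and that `sample_id` values contain no spaces. The demo CSV below mirrors the example table.

```shell
# Write a demo metadata CSV with the required columns.
cat > metadata_demo.csv <<'EOF'
sample_id,locus,preliminary_id,taxa_of_interest,country,classification
VE24-1075_COI,COI,Aphididae,Myzus persicae|Aphididae,Ecuador,animalia
VE24-1079_COI,COI,Miridae,Lygus pratensis,Netherlands,animalia
EOF

# Check every required column appears in the header.
HEADER=$(head -n 1 metadata_demo.csv)
MISSING=0
for col in sample_id locus preliminary_id taxa_of_interest country classification; do
  echo "$HEADER" | tr ',' '\n' | grep -qx "$col" \
    || { echo "missing column: $col" >&2; MISSING=1; }
done

# sample_id (first field of each data row) must not contain spaces.
BAD_IDS=$(tail -n +2 metadata_demo.csv | cut -d',' -f1 | grep -c ' ' || true)
echo "missing=$MISSING bad_ids=$BAD_IDS"
```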
To run the pipeline against a local BLAST Core Nucleotide Database:

```bash
nextflow run /path/to/pipeline/taxodactyl/main.nf \
    --metadata /path/to/metadata.csv \
    --sequences /path/to/sequences.fasta \
    --blastdb /path/to/blastdbs/core_nt \
    --outdir /path/to/output \
    -profile singularity \
    --taxdb /path/to/.taxonkit/ \
    --ncbi_api_key API_KEY \
    --ncbi_user_email EMAIL \
    --analyst_name "Magdalena Antczak" \
    --facility_name "QCIF" \
    -resume
```

To run the pipeline using sequences stored in `metadata.csv`:
```bash
nextflow run /path/to/pipeline/taxodactyl/main.nf \
    --metadata /path/to/metadata.csv \
    --blastdb /path/to/blastdbs/core_nt \
    --outdir /path/to/output \
    -profile singularity \
    --taxdb /path/to/.taxonkit/ \
    --ncbi_api_key API_KEY \
    --ncbi_user_email EMAIL \
    --analyst_name "Magdalena Antczak" \
    --facility_name "QCIF" \
    -resume
```

To run the pipeline using the BOLD web database:
```bash
nextflow run /path/to/pipeline/taxodactyl/main.nf \
    --metadata /path/to/metadata.csv \
    --sequences /path/to/sequences.fasta \
    --db_type bold \
    --outdir /path/to/output \
    -profile singularity \
    --taxdb /path/to/.taxonkit/ \
    --ncbi_api_key API_KEY \
    --ncbi_user_email EMAIL \
    --analyst_name "Magdalena Antczak" \
    --facility_name "QCIF" \
    -resume
```

Note
- For a detailed explanation of all pipeline parameters, see parameter documentation.
- We recommend avoiding spaces in file and folder names to prevent issues in command-line operations.
- The error strategy for the workflow is set to `ignore`: even if a process encounters an error, Nextflow continues executing subsequent processes rather than terminating the workflow. This avoids interrupting an entire multi-query run when only one query fails. Unfortunately, this behaviour prevents detailed errors from being displayed in the standard output; you will only see which tasks failed and the hashes assigned to them, which you can use to navigate the work folder and find the specific errors. As a workaround, at the end of your run you can execute `bash /path/to/pipeline/taxodactyl/bin/collect_errors.sh` from the directory where the pipeline was run. It prints a list of failed processes together with their work directory paths, the last 10 lines of standard error, and the last 10 lines of standard output.
- You can find detailed instructions and practical examples for customising the pipeline configuration in the docs/customise.md file. This guide covers how to set parameters, adjust resources, change error strategies, and modify the Singularity cache directory for your Nextflow runs.
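The idea behind error collection can be sketched as follows. This is a simplified illustration, not the actual `collect_errors.sh`: it fabricates a demo work directory with the `.command.err`/`.command.out` files Nextflow writes for each task, then tails them.

```shell
# Simplified illustration: each Nextflow task runs in a hashed work
# directory containing .command.err and .command.out.
WORKDIR=$(mktemp -d)
TASK="$WORKDIR/ab/12cd34"   # fabricated task hash path for the demo
mkdir -p "$TASK"
echo "Traceback: something went wrong" > "$TASK/.command.err"
echo "processing query_001" > "$TASK/.command.out"

# Find tasks with non-empty stderr and show the tail of each stream.
find "$WORKDIR" -name '.command.err' -size +0c | while read -r err; do
  dir=$(dirname "$err")
  echo "=== task dir: $dir ==="
  tail -n 10 "$err"
  tail -n 10 "$dir/.command.out"
done
```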
After running the pipeline, the output directory will contain a separate folder for each query sequence and a folder with information about the run. Here, we show the results folder structure when using the two databases. For more information, see the output documentation. See this document for a detailed description of the analysis and interpretation of the workflow report.
BLAST Core Nucleotide Database
```
.
├── blast_result.xml
├── pipeline_info
│   ├── execution_report_2025-06-22_22-53-15.html
│   ├── execution_timeline_2025-06-22_22-53-15.html
│   ├── execution_trace_2025-06-22_22-53-15.txt
│   ├── params_2025-06-22_22-53-29.json
│   └── pipeline_dag_2025-06-22_22-53-15.html
├── query_001_VE24-1075_COI
│   ├── all_hits.fasta
│   ├── candidates.csv
│   ├── candidates.fasta
│   ├── candidates_identity_boxplot.png
│   ├── candidates_phylogeny.fasta
│   ├── candidates_phylogeny.msa
│   ├── candidates_phylogeny.nwk
│   └── report_VE24-1075_COI_20250622_225319.html
└── query_002_VE24-1079_COI
    ├── all_hits.fasta
    ├── candidates.csv
    ├── candidates.fasta
    ├── candidates_phylogeny.fasta
    ├── candidates_phylogeny.msa
    ├── candidates_phylogeny.nwk
    └── report_VE24-1079_COI_20250622_225319.html
```
BOLD
```
.
├── pipeline_info
│   ├── execution_report_2025-06-22_22-53-22.html
│   ├── execution_timeline_2025-06-22_22-53-22.html
│   ├── execution_trace_2025-06-22_22-53-22.txt
│   ├── params_2025-06-22_22-53-34.json
│   └── pipeline_dag_2025-06-22_22-53-22.html
├── query_001_VE24-1075_COI
│   ├── all_hits.fasta
│   ├── candidates.csv
│   ├── candidates.fasta
│   ├── candidates_phylogeny.fasta
│   ├── candidates_phylogeny.msa
│   ├── candidates_phylogeny.nwk
│   └── report_BOLD_VE24-1075_COI_20250622_225326.html
└── query_002_VE24-1079_COI
    ├── all_hits.fasta
    ├── candidates.csv
    ├── candidates.fasta
    ├── candidates_identity_boxplot.png
    ├── candidates_phylogeny.fasta
    ├── candidates_phylogeny.msa
    ├── candidates_phylogeny.nwk
    └── report_BOLD_VE24-1079_COI_20250622_225326.html
```
qcif/taxodactyl was originally written by Magdalena Antczak, Cameron Hyde, and Daisy Li from QCIF Ltd. The project was funded by the Department of Agriculture, Fisheries and Forestry and the Australian BioCommons.
The workflow was designed by:
- Cameron Hyde
- Magdalena Antczak
- Lanxi (Daisy) Li
- Valentine Murigneux
- Sarah Williams
- Michael Thang
- Bradley Pease
- Shaun Bochow
- Grace Sun
If you use qcif/taxodactyl for your analysis, please cite it as follows:
Antczak, M., Hyde, C., Li, Lanxi (Daisy), Murigneux, V., Williams, S., Thang, M., Pease, B., Bochow, S., & Sun, G. (2025). TAXODACTYL - High-confidence, evidence-based taxonomic assignment of DNA sequences. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.1782.3
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
qcif/taxodactyl uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.



