Caution
The BOLD database search functionality is currently not operational. Please use the BLAST Core Nucleotide Database option instead.
- Example workflow report
- Documentation of the analysis
- Running tests with nf-test
- Python scripts (for developers)
The pipeline orchestrates a series of analytical steps, each encapsulated in a dedicated module or subworkflow. The main stages are:
- **Environment Configuration**: Sets up environment variables and paths required for downstream processes, ensuring reproducibility and portability.
- **Input Validation**: Checks the integrity and compatibility of input files (FASTA sequences, metadata, databases), preventing downstream errors.
- **Sequence Search**:
  - BLAST Core Nucleotide Database (BLASTN): Queries input sequences against the NCBI nucleotide database using BLASTN.
  - BOLD v4 (API): Queries input sequences against the Barcode of Life Data Systems; taxonomic lineage is included in the results.
- **Hit Extraction**: Parses BLAST results to extract relevant hits for each query.
- **Taxonomic ID Extraction**: Retrieves taxids for BLAST hit records.
- **Build Taxonomic Lineage**: Maps taxonomic IDs to full lineages, enabling downstream filtering and reporting.
- **Candidate Evaluation**: Identifies candidate species for each query, applying configurable thresholds for identity and coverage.
- **Supporting Evidence Evaluation**:
  - Multiple Sequence Alignment (MAFFT): Aligns candidate and query sequences to prepare for phylogenetic analysis.
  - Phylogenetic Tree Construction (FastME): Builds a phylogenetic tree to visualise the relatedness of candidate and query sequences.
- **Workflow Report**: Generates detailed HTML and text reports, including sequence alignments, phylogenetic trees, database coverage, and supporting evidence for each assignment.
To run the qcif/taxodactyl pipeline, you will need the following software installed:
- **Nextflow**: Tested versions 24.10.3 and 24.10.6.
- **Java**: Required by Nextflow. Tested version: 17.0.13.
- **Singularity**: Used for containerised execution of all bioinformatics tools, ensuring reproducibility. Tested version: 3.7.0.
Note
- Instructions on how to set up Nextflow and a compatible version of Java can be found on this page.
- To install Singularity, follow the instructions from this website.
- We provide different profiles, as per the default nf-core configuration; however, this pipeline has only been tested with Singularity.
- The pipeline was tested only on a Linux-based operating system - specifically, Ubuntu 24.04.1 LTS.
- If you have never downloaded or run a Nextflow pipeline, we have some additional tips and bash commands in the step-by-step guide.
An API key is used to authenticate with the NCBI Entrez API for an increased rate limit. You can generate one by following the instructions in this article.
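For illustration only (not part of the pipeline): the key is passed to NCBI E-utilities requests as an `api_key` query parameter. The endpoint below is the public E-utilities base URL; `YOUR_KEY` is a placeholder, not a real key.

```shell
# Build an NCBI E-utilities request URL that carries the API key.
# With a key, NCBI raises the per-IP rate limit from 3 to 10 requests/second.
NCBI_API_KEY="YOUR_KEY"   # placeholder - substitute your generated key
EUTILS_BASE="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
URL="${EUTILS_BASE}/efetch.fcgi?db=taxonomy&id=9606&api_key=${NCBI_API_KEY}"
echo "$URL"
```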
Download the NCBI taxonomy data files and extract them to `~/.taxonkit`. Similarly, download the taxonkit tool and move it into the same folder.
Note
- The current version of TAXODACTYL is not compatible with taxonkit/taxdump files downloaded before April 2025. This affects viruses in particular: evaluating database coverage for viruses will likely return an error.
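The setup described above can be sketched as follows. This is an illustrative sketch, not pipeline code: the taxdump URL is NCBI's public FTP mirror, and the taxonkit asset name assumes a Linux x86_64 build; verify both against the NCBI site and the taxonkit release page before use (network access required).

```shell
# Sketch: fetch NCBI taxonomy dump and taxonkit into ~/.taxonkit.
# URLs and asset names are assumptions - check them before relying on this.
TAXONKIT_DIR="$HOME/.taxonkit"
mkdir -p "$TAXONKIT_DIR"
cd "$TAXONKIT_DIR"

# NCBI taxonomy dump (contains names.dmp, nodes.dmp, etc.)
wget -q https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz \
  && tar -xzf taxdump.tar.gz \
  || echo "taxdump download failed (network required)"

# taxonkit binary (example asset name for Linux x86_64; see the releases page)
wget -q https://github.com/shenwei356/taxonkit/releases/latest/download/taxonkit_linux_amd64.tar.gz \
  && tar -xzf taxonkit_linux_amd64.tar.gz \
  || echo "taxonkit download failed (network required)"
```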
To search sequences against the BLAST Core Nucleotide Database, you must first download it. We recommend running the `update_blastdb.pl` program; follow the instructions from this book. A Perl installation is required.
The command should look like this:

```bash
perl ~/ncbi-blast-2.16.0+/bin/update_blastdb.pl --decompress core_nt
```
You can provide query sequences in either of two ways:
- Provide a FASTA file using `--sequences`.
- Add a `sequence` column to `metadata.csv` and omit `--sequences`.
If using a FASTA file, it should contain the query sequences (up to 100), e.g.
>VE24-1075_COI
TGGATCATCTCTTAGAATTTTAATTCGATTAGAATTAAGACAAATTAATTCTATTATTWATAATAATCAATTATATAATGTAATTGTTCACAATTCATGCTTTTATTATAATTTTTTTTATAACTATACCAATTGTAATTGGTGGATTTGGAAATTGATTAATTCCTATAATAATAGGATGTCCTGATATATCATTTCCACSTTTAAATAATATTAGATTTTGATTATTACCTCCATCATTAATAATAATAATTTGTAGATTTTTAATTAATAATGGAACAGGAACAGGATGAACAATTTAYCCHCCTTTATCAAACAATATTGCACATAATAACATTTCAGTTGATTTAACTATTTTTTCTTTACATTTAGCAGGWATCTCATCAATTTTAGGAGCAATTAACTTTATTTGTACAATTCTTAATATAATAYCAAAYAATATAAAACTAAATCAAATTCCTCTTTTTCCTTGATCAATTTTAATTACAGCTATTTTATTAATTTTATMTTTACCAGTTTTAGCTGGTGCCATTACAATATTATTAACTGATCGTAATTTAAATACATCATTTTTGATCCAGCAGGAGGAGGAGATCC
>VE24-1079_COI
AACTTTATATTTCATTTTTGGAATATGGGCAGGTATATTAGGAACTTCACTAAGATGAATTATTCGAATTGAACTTGGACAACCAGGATCATTTATTGGAGATGATCAAATTTATAATGTAGTAGTTACCGCACACGCATTTATTATAATTTTCTTTATAGTTATACCAATTATAATTGGAGGATTTGGAAATTGATTAGTACCTCTAATAATTGGAGCACCAGATATAGCATTCCCACGGATAAATAATATAAGATTTTGATTATTACCACCCTCAATTACACTTCTTATTATAAGATCTATAGTAGAAAGAGGAGCAGGAACTGGATGAACAGTATATCCCCCACTATCATCAAATATTGCACATAGTGGAGCATCAGTAGACCTAGCAATTTTTTCACTACATTTAGCAGGTGTATCTTCAATTTTAGGAGCAATTAATTTCATCTCAACAATTATTAATATACGACCTGAAGGCATATCTCCAGAACGAATTCCATTATTTGTATGATCAGTAGGTATTACAGCATTACTATTATTATTATCATTACCAGTTCTAGCTGGAGCTATTACAATATTATTAACAGATCGAAACTTTAATACCTCATTCTTTGACCCAGTAGGAGGAGGAGATCCTATCTTATATCAACATTTATTTTGATTTTTT
Note
- An example can be downloaded from `test/query.fasta`.
- Supported FASTA file extensions are `.fa`, `.fas`, `.fna`, and `.fasta`.
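As a quick illustrative check (not part of the pipeline), you can count FASTA headers to confirm a file stays within the 100-query limit; the toy file below is fabricated for the example.

```shell
# Create a toy two-record FASTA, then count header lines (one per query).
cat > query_demo.fasta <<'EOF'
>VE24-1075_COI
TGGATCATCTCTTAGAATTTTAATTCGATTAGAA
>VE24-1079_COI
AACTTTATATTTCATTTTTGGAATATGGGCAGGT
EOF

N_QUERIES=$(grep -c '^>' query_demo.fasta)
echo "queries: $N_QUERIES"
if [ "$N_QUERIES" -gt 100 ]; then
  echo "too many query sequences (max 100)" >&2
fi
```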
The metadata file provides essential information about each sequence and must follow the structure below. Each row corresponds to a sample and should include required and, optionally, additional columns.
- sample_id: Unique identifier for the sample. Must match the sequence ID in the `sequences.fasta` file. Cannot contain spaces.
- locus: Name of the genetic locus for the sample, which must be in the list of permitted loci. If deliberately providing no locus, the value `NA` is also accepted.
  - Note: By default, `COX1_SPECIES_PUBLIC` (all published COI records from BOLD and GenBank with a minimum sequence length of 500 bp) is used for the BOLD search, so the locus from the metadata is ignored when `db_type = bold`.
  - You can change the BOLD database via the `bold_database_name` parameter (see docs/params.md); however, we have not tested BOLD databases other than `COX1_SPECIES_PUBLIC`.
  - Locus synonyms are checked as well (see `scripts/config/loci.json`).
  - If you need to modify which loci and synonyms are permitted, see the technical documentation.
- preliminary_id: Preliminary morphology ID of the sample.
- taxa_of_interest: Taxa of interest for the sample. If multiple, separate them with a `|` character.
- country: Country of origin for the sample. If unknown, leave this field empty (do not use `NA`, which is the country code for Namibia).
- classification: High-level taxonomic classification for the sample. Must be one of `animalia`, `plantae`, `fungi`, `chromista`, `bacteria`, `archaea`, `viruses`.
- sequence: Nucleotide sequence for the sample (required if `--sequences` is not provided).
In addition to the above, you can include arbitrary columns (e.g., host, sequencing_platform, sequencing_read_coverage) which will be displayed in the workflow report's "Sample metadata" section.
| sample_id | locus | preliminary_id | taxa_of_interest | country | classification | host | sequencing_platform | sequencing_read_coverage |
|---|---|---|---|---|---|---|---|---|
| VE24-1075_COI | COI | Aphididae | Myzus persicae|Aphididae | Ecuador | animalia | Cut flower Rosa | Nanopore | 30x |
| VE24-1079_COI | COI | Miridae | Lygus pratensis | Netherlands | animalia | Cut flower Paenonia | Nanopore | 30x |
Note
- All required columns must be present for every sample.
- Optional columns can be left blank or omitted entirely if not applicable.
- Arbitrary columns (such as `host`, `sequencing_platform`, `sequencing_read_coverage`) will be displayed in the workflow report's "Sample metadata" section.
- For more details on the metadata schema, see `assets/schema_input.json`.
- An example can be downloaded from `test/metadata.csv`.
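A simple illustrative sanity check (not part of the pipeline) for the metadata file: confirm the required columns are present in the header and that `sample_id` values contain no spaces. The demo CSV below mirrors the example table.

```shell
# Write a demo metadata CSV with the required columns.
cat > metadata_demo.csv <<'EOF'
sample_id,locus,preliminary_id,taxa_of_interest,country,classification
VE24-1075_COI,COI,Aphididae,Myzus persicae|Aphididae,Ecuador,animalia
VE24-1079_COI,COI,Miridae,Lygus pratensis,Netherlands,animalia
EOF

# Check every required column appears in the header.
HEADER=$(head -n 1 metadata_demo.csv)
MISSING=0
for col in sample_id locus preliminary_id taxa_of_interest country classification; do
  echo "$HEADER" | tr ',' '\n' | grep -qx "$col" \
    || { echo "missing column: $col" >&2; MISSING=1; }
done

# sample_id (first field of each data row) must not contain spaces.
BAD_IDS=$(tail -n +2 metadata_demo.csv | cut -d',' -f1 | grep -c ' ' || true)
echo "missing=$MISSING bad_ids=$BAD_IDS"
```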
To run the pipeline against a local BLAST Core Nucleotide Database:

```bash
nextflow run /path/to/pipeline/taxodactyl/main.nf \
    --metadata /path/to/metadata.csv \
    --sequences /path/to/sequences.fasta \
    --blastdb /path/to/blastdbs/core_nt \
    --outdir /path/to/output \
    -profile singularity \
    --taxdb /path/to/.taxonkit/ \
    --ncbi_api_key API_KEY \
    --ncbi_user_email EMAIL \
    --analyst_name "Magdalena Antczak" \
    --facility_name "QCIF" \
    -resume
```

To run the pipeline using sequences stored in `metadata.csv`:
```bash
nextflow run /path/to/pipeline/taxodactyl/main.nf \
    --metadata /path/to/metadata.csv \
    --blastdb /path/to/blastdbs/core_nt \
    --outdir /path/to/output \
    -profile singularity \
    --taxdb /path/to/.taxonkit/ \
    --ncbi_api_key API_KEY \
    --ncbi_user_email EMAIL \
    --analyst_name "Magdalena Antczak" \
    --facility_name "QCIF" \
    -resume
```

To run the pipeline using the BOLD web database:
```bash
nextflow run /path/to/pipeline/taxodactyl/main.nf \
    --metadata /path/to/metadata.csv \
    --sequences /path/to/sequences.fasta \
    --db_type bold \
    --outdir /path/to/output \
    -profile singularity \
    --taxdb /path/to/.taxonkit/ \
    --ncbi_api_key API_KEY \
    --ncbi_user_email EMAIL \
    --analyst_name "Magdalena Antczak" \
    --facility_name "QCIF" \
    -resume
```

Note
- For a detailed explanation of all pipeline parameters, see parameter documentation.
- We recommend avoiding spaces in file and folder names to prevent issues in command-line operations.
- The error strategy for the workflow is set to `ignore`: even if a process encounters an error, Nextflow continues executing subsequent processes rather than terminating the workflow. This avoids interrupting an entire multi-query run when only one query fails. Unfortunately, this behaviour prevents detailed errors from being displayed in the standard output; you will only see which tasks failed and the hashes assigned to them, which you can use to navigate the work folder and find the specific errors. As a workaround, at the end of your run you can execute `bash /path/to/pipeline/taxodactyl/bin/collect_errors.sh` from the directory where the pipeline was run. It prints a list of failed processes together with their work directory paths, the last 10 lines of standard error, and the last 10 lines of standard output.
- You can find detailed instructions and practical examples for customising the pipeline configuration in the docs/customise.md file. This guide covers how to set parameters, adjust resources, change error strategies, and modify the Singularity cache directory for your Nextflow runs.
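The idea behind error collection can be sketched as follows. This is a simplified illustration, not the actual `collect_errors.sh`: it fabricates a demo work directory with the `.command.err`/`.command.out` files Nextflow writes for each task, then tails them.

```shell
# Simplified illustration: each Nextflow task runs in a hashed work
# directory containing .command.err and .command.out.
WORKDIR=$(mktemp -d)
TASK="$WORKDIR/ab/12cd34"   # fabricated task hash path for the demo
mkdir -p "$TASK"
echo "Traceback: something went wrong" > "$TASK/.command.err"
echo "processing query_001" > "$TASK/.command.out"

# Find tasks with non-empty stderr and show the tail of each stream.
find "$WORKDIR" -name '.command.err' -size +0c | while read -r err; do
  dir=$(dirname "$err")
  echo "=== task dir: $dir ==="
  tail -n 10 "$err"
  tail -n 10 "$dir/.command.out"
done
```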
After running the pipeline, the output directory will contain a separate folder for each query sequence and a folder with information about the run. Here, we show the results folder structure when using the two databases. For more information, see the output documentation. See this document for a detailed description of the analysis and interpretation of the workflow report.
BLAST Core Nucleotide Database
```
.
├── blast_result.xml
├── pipeline_info
│   ├── execution_report_2025-06-22_22-53-15.html
│   ├── execution_timeline_2025-06-22_22-53-15.html
│   ├── execution_trace_2025-06-22_22-53-15.txt
│   ├── params_2025-06-22_22-53-29.json
│   └── pipeline_dag_2025-06-22_22-53-15.html
├── query_001_VE24-1075_COI
│   ├── all_hits.fasta
│   ├── candidates.csv
│   ├── candidates.fasta
│   ├── candidates_identity_boxplot.png
│   ├── candidates_phylogeny.fasta
│   ├── candidates_phylogeny.msa
│   ├── candidates_phylogeny.nwk
│   └── report_VE24-1075_COI_20250622_225319.html
└── query_002_VE24-1079_COI
    ├── all_hits.fasta
    ├── candidates.csv
    ├── candidates.fasta
    ├── candidates_phylogeny.fasta
    ├── candidates_phylogeny.msa
    ├── candidates_phylogeny.nwk
    └── report_VE24-1079_COI_20250622_225319.html
```
BOLD
```
.
├── pipeline_info
│   ├── execution_report_2025-06-22_22-53-22.html
│   ├── execution_timeline_2025-06-22_22-53-22.html
│   ├── execution_trace_2025-06-22_22-53-22.txt
│   ├── params_2025-06-22_22-53-34.json
│   └── pipeline_dag_2025-06-22_22-53-22.html
├── query_001_VE24-1075_COI
│   ├── all_hits.fasta
│   ├── candidates.csv
│   ├── candidates.fasta
│   ├── candidates_phylogeny.fasta
│   ├── candidates_phylogeny.msa
│   ├── candidates_phylogeny.nwk
│   └── report_BOLD_VE24-1075_COI_20250622_225326.html
└── query_002_VE24-1079_COI
    ├── all_hits.fasta
    ├── candidates.csv
    ├── candidates.fasta
    ├── candidates_identity_boxplot.png
    ├── candidates_phylogeny.fasta
    ├── candidates_phylogeny.msa
    ├── candidates_phylogeny.nwk
    └── report_BOLD_VE24-1079_COI_20250622_225326.html
```
qcif/taxodactyl was originally written by Magdalena Antczak, Cameron Hyde, and Daisy Li from QCIF Ltd. The project was funded by the Department of Agriculture, Fisheries and Forestry and the Australian BioCommons.
The workflow was designed by:
- Cameron Hyde
- Magdalena Antczak
- Lanxi (Daisy) Li
- Valentine Murigneux
- Sarah Williams
- Michael Thang
- Bradley Pease
- Shaun Bochow
- Grace Sun
If you use qcif/taxodactyl for your analysis, please cite it as follows:
Antczak, M., Hyde, C., Li, Lanxi (Daisy), Murigneux, V., Williams, S., Thang, M., Pease, B., Bochow, S., & Sun, G. (2025). TAXODACTYL - High-confidence, evidence-based taxonomic assignment of DNA sequences. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.1782.3
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
qcif/taxodactyl uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.



