TACO

Telomere-Aware Contig Optimization

TACO is a telomere-aware, all-in-one pipeline for comparing and refining genome assemblies produced by multiple assemblers. It supports assembly benchmarking, decision-making, and chromosome-end improvement. Developed for small eukaryotic genomes, with a focus on fungi, TACO runs multiple assemblers, standardizes their outputs, evaluates assembly quality, and detects telomere-supported contigs. It can then either (1) stop after generating a unified assembler comparison table for benchmarking, or (2) continue into telomere-aware backbone refinement to produce an improved chromosome-scale candidate assembly.

TACO was developed at the Grainger Bioinformatics Center, Field Museum of Natural History.

Overview

Genome assemblers often produce different results from the same long-read dataset. One assembler may recover longer contigs, another may preserve more complete chromosome ends, and another may provide a better balance of completeness, contiguity, and redundancy. TACO makes these comparisons systematic, interpretable, and reproducible.

TACO operates in two modes. In assembly-only mode (--assembly-only), the pipeline runs all assemblers, standardizes their outputs, evaluates quality with BUSCO, QUAST, telomere detection, and optional Merqury, then produces a unified comparison table at assemblies/assembly_info.csv. In full refinement mode, TACO continues from the comparison step into telomere-pool construction and backbone refinement, producing an improved chromosome-scale candidate assembly with preserved telomeric ends.

Features

Runs six long-read assemblers (HiCanu, NextDenovo, Peregrine, IPA, Flye, Hifiasm) from a single command
Supports PacBio HiFi, Oxford Nanopore, and PacBio CLR reads via --platform
Standardizes assembly outputs for direct cross-assembler comparison
Hybrid telomere detection with de novo k-mer discovery, built-in motif families, and per-end composite scoring
Three-tier telomere classification: strict T2T, single-end strong, and telomere-supported
Benchmarks assemblies with BUSCO, QUAST, telomere metrics, and optional Merqury
Biologically informed automatic backbone selection (smart scoring)
Assembly-only mode for convenient benchmarking without refinement
Telomere-aware backbone refinement with redundancy reduction and telomeric end rescue
Machine-readable benchmark logs for reproducible reporting

Installation

See INSTALLATION.md for detailed instructions.

Quick Install

git clone https://github.com/yksun/TACO.git
cd TACO
conda env create -f taco-env.yml
conda activate taco
pip install -e .

# Verify
taco --help

After installation, the taco command is available from any directory while the taco conda environment is active.

Requirements

TACO requires a Unix-like system (Linux or macOS) with Python >= 3.8 and Conda. Most dependencies are installed via taco-env.yml. TACO's Python modules use only the standard library with no additional pip packages. Canu, Peregrine, and IPA require manual installation (see INSTALLATION.md). If any assembler is absent or fails, TACO skips that step and continues with the others.

Quick Start

Full refinement run (comparison + telomere-aware refinement):

mkdir -p my_project && cd my_project

taco -g 12m -t 16 \
  --fastq /path/to/reads.fastq \
  --taxon fungal

Assembly-only comparison (benchmarking only):

taco -g 12m -t 16 \
  --fastq /path/to/reads.fastq \
  --taxon fungal \
  --assembly-only

With Merqury QV scoring (optional, auto-detected if installed):

Merqury is automatically enabled when merqury.sh is on PATH and a .meryl database is found in the working directory. You can also specify the database explicitly:

taco -g 12m -t 16 \
  --fastq /path/to/reads.fastq \
  --taxon fungal \
  --merqury-db reads.meryl

With explicit motif override (only if biologically known):

taco -g 12m -t 16 \
  --fastq /path/to/reads.fastq \
  -m TTAGGG

Usage

Parameters

Parameter	Description
`-g`, `--genomesize`	Estimated haploid genome size (e.g., `12m`, `40m`, `2g`)
`-t`, `--threads`	Number of CPU threads
`--fastq`	Input FASTQ file (use absolute path)
`--taxon`	Taxonomy preset for telomere detection: `vertebrate`, `animal`, `plant`, `insect`, `fungal`, or `other` (default). Sets motif-family priors and detection behavior automatically.
`-m`, `--motif`	Telomere motif override (optional). Only use when the exact motif is biologically known for the species. When omitted, taxon-aware hybrid detection is used instead.
`--platform`	Sequencing platform: `pacbio-hifi` (default), `nanopore`, or `pacbio`. Also determines default polishing tool.
`-s`, `--steps`	Run selected steps only (e.g., `1,3-5`)
`--reference`, `-ref`	Reference FASTA for comparison. Included as the "reference" assembler in all comparison tables.
`--busco`	Run BUSCO (optionally specify lineage dataset)
`--choose`	Manually choose the backbone assembler
`--assembly-only`	Stop after assembler comparison
`--auto-mode`	Backbone selection mode: `smart` (default) or `n50`
`--merqury`	Force-enable Merqury (auto-detected if `merqury.sh` + `.meryl` db found)
`--merqury-db`	Enable Merqury with a specific `.meryl` database path
`--no-merqury`	Disable Merqury even if installed and auto-detected
`--no-purge-dups`	Skip purge_dups after refinement
`--no-polish`	Skip automatic polishing after refinement
`--allow-t2t-replace`	Allow rescue donors to replace immutable Tier 1 (T2T) contigs. Disabled by default for safety
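
The value formats accepted by -g (such as 12m or 2g) and -s (such as 1,3-5) can be illustrated with a short sketch. This is not TACO's own argument parser; parse_genome_size and parse_steps are hypothetical helper names, and the exact set of suffixes TACO accepts is an assumption based on the examples in the table above.

```python
from typing import List


def parse_genome_size(text: str) -> int:
    """Convert a size such as '12m' or '2g' into a base-pair count.

    Assumption: k/m/g suffixes mean 1e3/1e6/1e9 bp, as the examples suggest.
    """
    multipliers = {"k": 10**3, "m": 10**6, "g": 10**9}
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


def parse_steps(spec: str) -> List[int]:
    """Expand a step list such as '1,3-5' into [1, 3, 4, 5]."""
    steps = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            steps.update(range(int(start), int(end) + 1))
        elif part:
            steps.add(int(part))
    return sorted(steps)


print(parse_genome_size("12m"))
print(parse_steps("1,3-5"))
```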

Assembly-Only Mode

Use --assembly-only when the goal is assembler benchmarking and comparison without refinement. TACO runs all assemblers, standardizes outputs, runs BUSCO, telomere detection, QUAST, and optional Merqury, then writes the combined comparison table to assemblies/assembly_info.csv and a summary to final_results/assembly_only_result.csv.

Sequencing Platform Support

TACO supports three sequencing platforms. Each assembler receives platform-appropriate flags automatically. The platform also determines the default polishing strategy: HiFi assemblies are polished with NextPolish2 by default (k-mer-based, safe for high-accuracy reads; requires yak for k-mer database construction), Nanopore assemblies use Medaka (neural-network polisher; falls back to Racon), and CLR assemblies use Racon.

Platform	`--platform`	Assemblers Used	Notes
PacBio HiFi	`pacbio-hifi` (default)	canu, nextDenovo, peregrine, IPA, flye, hifiasm	All 6 assemblers
Oxford Nanopore	`nanopore`	canu, nextDenovo, flye	Peregrine, IPA, hifiasm skipped
PacBio CLR	`pacbio`	canu, nextDenovo, peregrine, flye	IPA, hifiasm skipped

Incompatible assemblers are automatically skipped with a warning. IPA and hifiasm only support PacBio HiFi reads. Peregrine does not support Nanopore reads.
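
The platform table above amounts to a simple lookup. The sketch below restates it in Python for readers who want to predict which assemblers will run; the names ALL_ASSEMBLERS, SKIPPED, and assemblers_for are illustrative and are not taken from the TACO source.

```python
# Assembler names as written in the platform table above.
ALL_ASSEMBLERS = ["canu", "nextDenovo", "peregrine", "IPA", "flye", "hifiasm"]

# Assemblers skipped for each --platform value, per the table above.
SKIPPED = {
    "pacbio-hifi": set(),
    "nanopore": {"peregrine", "IPA", "hifiasm"},
    "pacbio": {"IPA", "hifiasm"},
}


def assemblers_for(platform: str) -> list:
    """Return the assemblers expected to run for a given platform."""
    if platform not in SKIPPED:
        raise ValueError(f"unknown platform: {platform}")
    return [name for name in ALL_ASSEMBLERS if name not in SKIPPED[platform]]


print(assemblers_for("nanopore"))
```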

Platform-Specific Assembler Flags

Each assembler receives the appropriate read-type flag automatically:

Assembler	HiFi	Nanopore	CLR
Canu	`-pacbio-hifi`	`-nanopore`	`-pacbio`
Flye	`--pacbio-hifi`	`--nano-hq` (Q20+)	`--pacbio-raw`
Hifiasm	default mode	skipped	skipped
IPA	default mode	skipped	skipped
Peregrine	default mode	skipped	default mode
NextDenovo	via config file	via config file	via config file

For older ONT data that is not Q20+ basecalled, set FLYE_ONT_FLAG=--nano-raw in the environment.
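
As a rough illustration of how the FLYE_ONT_FLAG override could interact with the defaults in the table, the sketch below picks a Flye read-type flag from the platform and an environment mapping. The function name flye_read_flag is hypothetical, and applying the override only to Nanopore runs is an assumption.

```python
import os

# Default Flye read-type flags per platform, per the table above.
FLYE_FLAGS = {
    "pacbio-hifi": "--pacbio-hifi",
    "nanopore": "--nano-hq",
    "pacbio": "--pacbio-raw",
}


def flye_read_flag(platform: str, env=None) -> str:
    """Pick the Flye read-type flag, honouring FLYE_ONT_FLAG for Nanopore."""
    env = os.environ if env is None else env
    if platform == "nanopore":
        # Assumption: the override applies only to Nanopore runs.
        return env.get("FLYE_ONT_FLAG", FLYE_FLAGS["nanopore"])
    return FLYE_FLAGS[platform]


print(flye_read_flag("nanopore", {"FLYE_ONT_FLAG": "--nano-raw"}))
```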

Platform-Specific Polishing Strategy

Platform	Polishing Tool	Notes
HiFi	NextPolish2 (yak k-mer based)	K-mer polishing corrects residual errors safely; requires `nextpolish2` + `yak`
Nanopore	Medaka (Racon fallback)	Neural-network polisher; set `MEDAKA_MODEL` for non-default chemistry
CLR	Racon	Standard error-correction for CLR reads

Combined Platform × Taxon Strategy Overview

Component	Fungal	Plant	Vertebrate/Animal	Insect/Other
Telomere motifs	TTAGGG + TG1-3 + Candida	TTTAGGG	TTAGGG	TTAGG (insect) / all (other)
Score window	300 bp	1000 bp	1000 bp	500 bp
Backbone T2T weight	350 (high)	200 (reduced)	200 (reduced)	300 (default)
BUSCO D penalty	600 (strict)	300 (relaxed)	500 (default)	500 (default)
BUSCO trial C-drop	2% (strict)	4% (relaxed)	3% (moderate)	2.5% (default)
purge_dups mode	single-round	two-round + polyploid warning	two-round	single-round
Polishing (HiFi)	NextPolish2 (yak k-mer based)	NextPolish2 (yak k-mer based)	NextPolish2 (yak k-mer based)	NextPolish2 (yak k-mer based)
Polishing (ONT)	Medaka → Racon	Medaka → Racon	Medaka → Racon	Medaka → Racon
Polishing (CLR)	Racon	Racon	Racon	Racon

Pipeline Steps

Step	Description
1	HiCanu assembly
2	NextDenovo assembly
3	Peregrine assembly
4	IPA assembly
5	Flye assembly
6	Hifiasm assembly
7	Copy and normalize all assemblies
8	BUSCO on all assemblies
9	Telomere contig detection and telomere metrics
10	Build optimized telomere pool across assemblies
11	QUAST for assembler comparison
12	Backbone selection and telomere-aware refinement
13	BUSCO on final assembly
14	Telomere analysis of final assembly
15	QUAST on final assembly
16	Final comparison report
17	Cleanup into structured output folders
18	Assembly-only comparison summary

With --assembly-only, TACO follows the comparison path (Steps 1-11, 18) and stops before backbone refinement.

Telomere Detection

TACO v1.2.0 uses a taxon-aware hybrid telomere detection system that combines built-in motif families with de novo k-mer discovery.

Taxon-Aware Presets

Use --taxon to select the appropriate telomere motif priors for your organism. This is the recommended approach instead of forcing --motif directly.

`--taxon`	Primary Motifs	Notes
`vertebrate`	TTAGGG	Highly conserved; exact motif matching is most reliable here
`animal`	TTAGGG	Strong prior for vertebrates, less certain for distant metazoans
`plant`	TTTAGGG	Common plant repeat; some lineages vary
`insect`	TTAGG	Common insect repeat; not universal across all insect orders
`fungal`	TTAGGG, TG1-3, Candida	Diverse fungal telomeres — all built-in families used
`other` (default)	All families	Unknown taxon — relies on de novo discovery plus all priors

The --motif flag is an optional override. Do not force --motif unless the telomere repeat is biologically confirmed for your species or lineage. For fungi and unknown taxa especially, forcing a motif may miss true telomeres.

Built-in Motif Families

TACO ships with five motif families: the canonical vertebrate/filamentous fungal TTAGGG repeat, the budding yeast TG1-3/C1-3A degenerate repeat, the Candida 23-bp repeat (ACGGATGTCTAACTTCTTGGTGT), the plant TTTAGGG repeat, and the insect TTAGG repeat.
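
For the fixed-sequence families (TTAGGG, TTTAGGG, TTAGG), a telomere array at one contig end appears as tandem copies of the motif, while the opposite end carries the reverse complement. The sketch below shows one simple way to measure the longest such run in a terminal window. It is an illustration only: it does not handle the degenerate TG1-3 family, and revcomp and longest_tandem_run are hypothetical names rather than functions from telomere_detect.py.

```python
import re


def revcomp(seq: str) -> str:
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_tandem_run(window: str, motif: str) -> int:
    """Length in bp of the longest uninterrupted run of motif copies.

    Both the motif and its reverse complement are checked, because the
    two ends of a chromosome show the repeat on opposite strands.
    """
    best = 0
    window = window.upper()
    for unit in (motif, revcomp(motif)):
        for match in re.finditer(f"(?:{re.escape(unit)})+", window):
            best = max(best, match.end() - match.start())
    return best


print(longest_tandem_run("CCCTAACCCTAAGG", "TTAGGG"))
```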

Scoring System

Each contig end is scored using a composite of four metrics: telomere density (weight 0.40), longest consecutive run (weight 0.30), distance of repeats from the contig terminus (weight 0.20), and covered base pairs (weight 0.10). The composite score ranges from 0 to 1.
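
The weighted sum described above can be written out as follows. This is a minimal sketch that assumes each of the four metrics has already been normalised to the range 0 to 1; how TACO performs that normalisation is not described here, and composite_end_score is a hypothetical name.

```python
# Weights as stated in the Scoring System paragraph above.
WEIGHTS = {
    "density": 0.40,      # telomere density
    "longest_run": 0.30,  # longest consecutive run
    "proximity": 0.20,    # closeness of repeats to the contig terminus
    "covered_bp": 0.10,   # covered base pairs
}


def composite_end_score(density, longest_run, proximity, covered_bp):
    """Weighted sum of four metrics, each assumed to lie in [0, 1]."""
    parts = {
        "density": density,
        "longest_run": longest_run,
        "proximity": proximity,
        "covered_bp": covered_bp,
    }
    for name, value in parts.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)


print(round(composite_end_score(0.5, 0.5, 1.0, 0.2), 3))
```

Because the weights sum to 1.0, the result stays within 0 to 1 whenever the inputs do.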

Classification Tiers

Contigs are classified into three tiers based on their end scores: strict T2T contigs have strong telomere signal at both ends (score >= 0.25 at each end), single-end strong contigs have strong signal at one end only, and telomere-supported contigs have at least weak signal (score >= 0.08) at one end.
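
The tier rules above reduce to two threshold comparisons per contig. The sketch below applies them to a pair of end scores; the tier labels and the function name classify_contig are illustrative and may differ from the identifiers TACO writes to its output files.

```python
STRONG = 0.25  # strong-signal threshold from the text above
WEAK = 0.08    # weak-signal threshold from the text above


def classify_contig(left_score: float, right_score: float) -> str:
    """Assign a telomere tier from the two per-end composite scores."""
    strong_ends = sum(score >= STRONG for score in (left_score, right_score))
    if strong_ends == 2:
        return "strict_t2t"
    if strong_ends == 1:
        return "single_end_strong"
    if max(left_score, right_score) >= WEAK:
        return "telomere_supported"
    return "no_telomere"


print(classify_contig(0.31, 0.12))
```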

Assembly Selection Strategy

When --choose is not provided, TACO automatically selects the backbone assembly for refinement. The scoring formula adapts its weights based on --taxon to match the biological characteristics of each organism group.

Smart Scoring (default)

TACO ranks assemblies using a composite score. Merqury QV and completeness are included when available (optional — auto-detected if merqury.sh is installed and a .meryl database is found; otherwise these terms contribute 0):

score = BUSCO_S × w_busco_s + T2T × w_t2t + single_tel × w_single
      + MerquryComp × 200 + MerquryQV × 20      ← optional, 0 if Merqury not available
      - contigs × w_contigs + log10(N50) × w_n50
      - BUSCO_D × w_busco_d

BUSCO single-copy completeness (S%) is used instead of total completeness (C%) to avoid rewarding highly duplicated assemblies. BUSCO duplication (D%) is explicitly penalised. When Merqury is available, its k-mer-based QV and completeness provide an independent quality signal that helps distinguish assemblies with similar BUSCO scores. The weights are tuned per taxon as described below.

Taxon-Specific Scoring Strategies

Fungal (--taxon fungal): Fungal genomes are typically small (10–60 Mb) with well-defined chromosomes. TACO uses strict BUSCO duplicate penalty (w_busco_d = 600) because duplicated assemblies are almost always artefactual in haploid fungi. T2T contigs are weighted heavily (w_t2t = 350) since telomere rescue is highly effective for small genomes where individual T2T chromosomes can be resolved. The contig-count penalty remains moderate (w_contigs = 30) because most fungal genomes have few chromosomes.

Plant (--taxon plant): Plant genomes vary enormously in size and ploidy. TACO relaxes the BUSCO duplicate penalty (w_busco_d = 300) because polyploidy naturally inflates D% even in correct assemblies. The contig-count penalty is increased (w_contigs = 50) to discourage fragmented assemblies in these often large genomes. T2T weight is reduced (w_t2t = 200) because long repetitive arrays near telomeres can produce false-positive signals, and interstitial telomeric repeats (ITRs) are common in plants.

Vertebrate / Animal (--taxon vertebrate or --taxon animal): Vertebrate genomes are large (1–3+ Gb) and repeat-rich. TACO increases the N50 weight (w_n50 = 200) to favour contiguous assemblies and raises the contig-count penalty slightly above the default (w_contigs = 40). T2T weight is reduced (w_t2t = 200) because interstitial telomeric repeats are frequent in vertebrates and can inflate telomere counts. BUSCO duplicate penalty stays at the default (w_busco_d = 500).

Insect / Other (--taxon insect or --taxon other): These taxa use the balanced default weights: w_busco_s = 1000, w_t2t = 300, w_single = 150, w_contigs = 30, w_n50 = 150, w_busco_d = 500. This is appropriate when the biological characteristics of the target organism are not well characterised.

Weight	Fungal	Plant	Vertebrate/Animal	Insect/Other
`w_busco_s`	1000	1000	1000	1000
`w_t2t`	350	200	200	300
`w_single`	150	150	150	150
`w_contigs`	30	50	40	30
`w_n50`	150	150	200	150
`w_busco_d`	600	300	500	500
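
To make the interaction between the formula and the weight table concrete, the sketch below evaluates the smart score for one assembly under different taxon presets. It is an illustration, not the code in backbone.py: smart_score and TAXON_WEIGHTS are hypothetical names, and the assumption that BUSCO values are passed as percentages is not confirmed by the text above.

```python
import math

# Balanced defaults (insect/other), then per-taxon overrides from the table.
_DEFAULT = {"busco_s": 1000, "t2t": 300, "single": 150,
            "contigs": 30, "n50": 150, "busco_d": 500}
TAXON_WEIGHTS = {
    "fungal": {**_DEFAULT, "t2t": 350, "busco_d": 600},
    "plant": {**_DEFAULT, "t2t": 200, "contigs": 50, "busco_d": 300},
    "vertebrate": {**_DEFAULT, "t2t": 200, "contigs": 40, "n50": 200},
    "insect": _DEFAULT,
    "other": _DEFAULT,
}
TAXON_WEIGHTS["animal"] = TAXON_WEIGHTS["vertebrate"]


def smart_score(taxon, busco_s, busco_d, t2t, single_tel, contigs, n50,
                merqury_comp=0.0, merqury_qv=0.0):
    """Composite backbone score; Merqury terms default to 0 when absent."""
    w = TAXON_WEIGHTS.get(taxon, TAXON_WEIGHTS["other"])
    return (busco_s * w["busco_s"] + t2t * w["t2t"] + single_tel * w["single"]
            + merqury_comp * 200 + merqury_qv * 20
            - contigs * w["contigs"] + math.log10(n50) * w["n50"]
            - busco_d * w["busco_d"])


metrics = dict(busco_s=95, busco_d=1, t2t=5, single_tel=2,
               contigs=10, n50=1_000_000)
print(smart_score("fungal", **metrics), smart_score("plant", **metrics))
```

With identical metrics, the fungal preset rewards the five T2T contigs more strongly than the plant preset does, which is the intended effect of the taxon-specific w_t2t values.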

Taxon-Specific Telomere Detection Windows

The default telomere score window also varies by taxon to match typical telomere array lengths: fungi use 300 bp (fungal telomere arrays are short, often 50–300 bp), plants and vertebrates use 1000 bp (longer repeat arrays), and other taxa use 500 bp (balanced default).

Taxon-Specific BUSCO Trial Thresholds (Step 12F)

When validating rescue candidates via BUSCO trial, the maximum acceptable C% drop, M% rise, and D% rise depend on taxon:

Fungi: strict thresholds (2% C-drop, 0.3% M-rise, 2% D-rise) — haploid genomes with stable BUSCO profiles.
Plant: relaxed thresholds (4% C-drop, 1.0% M-rise, 6% D-rise) — polyploidy causes natural BUSCO variability.
Vertebrate: moderate thresholds (3% C-drop, 0.5% M-rise, 4% D-rise).
Other: balanced defaults (2.5% C-drop, 0.5% M-rise, 3% D-rise).

A D-rise (duplicated BUSCO increase) check catches cases where a rescue introduces redundant copies of single-copy orthologs — a sign of retained haplotigs or mis-joined contigs. All thresholds can be overridden via STEP12_MAX_BUSCO_C_DROP, STEP12_MAX_BUSCO_M_RISE, and STEP12_MAX_BUSCO_D_RISE environment variables.
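
The threshold logic above can be sketched as a lookup with environment-variable overrides followed by three comparisons. The function names trial_thresholds and passes_trial are hypothetical, and treating animal like vertebrate follows the strategy overview table rather than an explicit statement in this section.

```python
import os

# (max C-drop, max M-rise, max D-rise) in percentage points, per taxon.
TRIAL_THRESHOLDS = {
    "fungal": (2.0, 0.3, 2.0),
    "plant": (4.0, 1.0, 6.0),
    "vertebrate": (3.0, 0.5, 4.0),
    "animal": (3.0, 0.5, 4.0),  # assumption: same as vertebrate
    "other": (2.5, 0.5, 3.0),
}
ENV_NAMES = ("STEP12_MAX_BUSCO_C_DROP",
             "STEP12_MAX_BUSCO_M_RISE",
             "STEP12_MAX_BUSCO_D_RISE")


def trial_thresholds(taxon, env=None):
    """Return the three limits, letting environment variables override."""
    env = os.environ if env is None else env
    defaults = TRIAL_THRESHOLDS.get(taxon, TRIAL_THRESHOLDS["other"])
    return tuple(float(env.get(name, default))
                 for name, default in zip(ENV_NAMES, defaults))


def passes_trial(before, after, limits):
    """before/after hold BUSCO 'C', 'M', 'D' percentages for each assembly."""
    max_c_drop, max_m_rise, max_d_rise = limits
    return (before["C"] - after["C"] <= max_c_drop
            and after["M"] - before["M"] <= max_m_rise
            and after["D"] - before["D"] <= max_d_rise)


limits = trial_thresholds("fungal", env={})
print(passes_trial({"C": 98.0, "M": 1.0, "D": 1.0},
                   {"C": 95.0, "M": 1.0, "D": 1.0}, limits))
```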

Additional environment variables for fine-tuning Step 12:

PROTECT_COV / PROTECT_ID: strict dedup thresholds (default 0.95/0.95).
DEDUP_MAX_BUSCO_C_DROP: maximum tolerated BUSCO C drop after dedup (default 3.0%).
CHIMERA_MIN_CROSS_COV: minimum cross-assembly coverage for the chimera mapping check (default 0.60).
AGGR_NONTELO_COV / AGGR_NONTELO_ID: taxon-aware non-telomeric dedup thresholds.
SELFDEDUP_COV / SELFDEDUP_ID: self-dedup thresholds.

Taxon-Specific Rescue Limits (Step 12F)

The maximum number of accepted rescue candidates per run is taxon-aware: fungi allow up to 20 rescues (many small chromosomes), vertebrates 10, plants 8 (conservative due to polyploidy risk), and other taxa 15. This prevents runaway replacement in complex genomes.

Taxon-Specific purge_dups Behaviour (Step 12H)

purge_dups uses single-round purging for fungal, insect, and other genomes to avoid over-purging small or simple genomes. For vertebrate, animal, and plant genomes, two-round purging (-2 flag) is used for more thorough cleanup of larger, more complex genomes. A warning is emitted for plant genomes because purge_dups may incorrectly collapse homeologous sequences in polyploid species — use --no-purge-dups if this is a concern.

N50-only Mode

--auto-mode n50 selects the assembly with the highest N50. This reproduces legacy behavior but may favor contiguous assemblies that lack completeness.
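
For reference, N50 is the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly size. The sketch below computes it and picks the assembly with the highest value; n50 and pick_by_n50 are illustrative names, and the input is assumed to be a mapping from assembler name to a list of contig lengths.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def pick_by_n50(assemblies):
    """Return the assembler name whose contigs give the highest N50."""
    return max(assemblies, key=lambda name: n50(assemblies[name]))


example = {
    "flye": [5_000_000, 4_000_000, 1_000_000],
    "canu": [3_000_000, 3_000_000, 3_000_000, 1_000_000],
}
print(pick_by_n50(example))
```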

Step 12 — T2T-First Telomere-Aware Backbone Refinement

Step 12 adopts a T2T-first assembly philosophy with a two-tier confidence model:

Tier 1 (Immutable): T2T contigs — contigs with verified telomere signal at both ends. These are treated as protected chromosomal anchors and are never replaced during rescue, unless --allow-t2t-replace is explicitly set. This protects the highest-confidence contigs from accidental degradation.
Tier 2 (Editable): Backbone contigs — gap-fill contigs that cover chromosomal regions not represented by T2T contigs. These may be replaced by telomere-bearing rescue donors if the replacement passes BUSCO trial validation.

Duplicate non-telomeric backbone contigs are aggressively removed (with taxon-aware thresholds), and rescue donors must carry verified telomere signal.

Step 12 Sub-step Flow

12A — Merqury QV scoring (optional; auto-detected if installed, or enabled with --merqury/--merqury-db).
12B — auto-select backbone assembler (smart scoring with taxon-aware weights).
12C — prepare cleaned backbone + chimera safety using two strategies: (a) size gate — contigs > 1.5× the largest individual assembler contig are flagged; (b) cross-assembly mapping — each protected contig is aligned against all other assembler outputs; contigs not well-covered (≥60%) by any single assembler's contig are flagged as potential chimeras. Configurable via CHIMERA_MIN_CROSS_COV.
12D — T2T-first foundation building:
- 12D1 strict dedup (95%/95%): remove backbone contigs near-identical to T2T pool. Each removal is logged (name, length, coverage, identity).
- 12D2 post-dedup BUSCO safety check: runs BUSCO on the combined assembly (protected + remaining backbone) and compares to the backbone alone. Warns if BUSCO C drops > 3% (configurable via DEDUP_MAX_BUSCO_C_DROP), with remediation suggestions.
- 12D2b fragment removal (50%/90%): remove backbone fragments partially overlapping T2T chromosomes.
- 12D3 backbone telomere classification: run telomere detection on remaining backbone contigs to identify which carry telomere signal.
- 12D4 aggressive non-telomeric dedup (taxon-aware): backbone contigs lacking telomere support that overlap the T2T pool are removed. Thresholds: fungi 70%/85%, plant/vertebrate 85%/92%, other 75%/88%.
- 12D5 non-telomeric self-dedup (taxon-aware): when two non-telomeric backbone contigs overlap, the shorter one is removed. Thresholds: fungi 80%/90%, plant/vertebrate 90%/95%, other 85%/92%. Telomere-bearing contigs are always kept.
12E — telomere rescue with two-tier protection: align donor pool to backbone, compute structural metrics, verify each donor carries telomere signal, and classify each replacement. Tier 1 (T2T) contigs are immutable by default — candidates targeting them are rejected unless --allow-t2t-replace is set. Each accepted candidate is assigned a replacement class: fill_missing_end, replace_non_telo_backbone, replace_single_with_better, or replace_protected_t2t.
12F — BUSCO trial validation: for each telomere-verified candidate, build a trial assembly and run BUSCO. Rejection thresholds are taxon-aware (fungi: 2% C-drop / 2% D-rise, plant: 4% / 6%, vertebrate: 3% / 4%). Maximum accepted rescues are also taxon-aware (fungi: 20, plant: 8, vertebrate: 10, other: 15). An additional safety check rejects replace_single_with_better candidates if the replacement loses telomere evidence at either end.
12G — final combine: T2T foundation + telomere-rescued backbone gap-fill.
12H — purge_dups: taxon-aware haplotig/duplicate purging (skip with --no-purge-dups).
12I — automatic polishing: NextPolish2 for HiFi (k-mer-based via yak; skip with --no-polish), Medaka for ONT (Racon fallback), Racon for CLR.
12J — telomere-aware genome-size pruning: only non-telomeric contigs are removed when assembly exceeds the size budget. Telomere-bearing contigs are never pruned.
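
The taxon-aware thresholds in sub-step 12D4 can be read as a coverage/identity pair, in the same order as the PROTECT_COV / PROTECT_ID variables. Under that reading, the removal decision for a backbone contig that overlaps the T2T pool might look like the sketch below. The function name should_remove and the fallback to the "other" thresholds are assumptions, not TACO's implementation.

```python
# (minimum coverage, minimum identity) for aggressive non-telomeric dedup.
NONTELO_DEDUP = {
    "fungal": (0.70, 0.85),
    "plant": (0.85, 0.92),
    "vertebrate": (0.85, 0.92),
    "other": (0.75, 0.88),
}


def should_remove(taxon, has_telomere, coverage, identity):
    """Decide whether a backbone contig overlapping the T2T pool is dropped.

    coverage and identity are fractions (0 to 1) from the alignment of the
    backbone contig against the T2T pool.
    """
    if has_telomere:
        return False  # telomere-bearing contigs are kept
    min_cov, min_id = NONTELO_DEDUP.get(taxon, NONTELO_DEDUP["other"])
    return coverage >= min_cov and identity >= min_id


print(should_remove("fungal", False, 0.72, 0.90))
print(should_remove("plant", False, 0.72, 0.90))
```

The same alignment is removed under the fungal preset but retained under the plant preset, reflecting the more conservative thresholds for larger, repeat-rich genomes.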

BUSCO Trial Validation

TACO validates each telomere rescue candidate by building a trial assembly where one backbone contig is replaced by one donor contig, then running BUSCO with the same lineage selected by the user. Rejection is triggered by three independent BUSCO metrics: C% drop (completeness loss), M% rise (missing gene increase), and D% rise (duplicated BUSCO increase, catching retained haplotigs). Rejection thresholds are taxon-aware: fungi use strict thresholds (2% C-drop, 2% D-rise), plants use relaxed thresholds (4% C-drop, 6% D-rise, accounting for polyploidy), and vertebrates use moderate thresholds (3% C-drop, 4% D-rise). An additional safety check rejects replace_single_with_better candidates if telomere evidence weakens at either end after replacement (suspicious size drops >30% also trigger rejection). This greedy, sequential approach ensures that each accepted rescue improves or maintains assembly quality. The trial summary TSV includes replacement_class and D (duplicated %) columns for full traceability.
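
The greedy, sequential acceptance described above, combined with the taxon-aware cap on accepted rescues, can be outlined as a simple loop. In this sketch the evaluate callback stands in for building a trial assembly and running BUSCO; greedy_rescue and evaluate are hypothetical names, and the real pipeline applies further checks that are omitted here.

```python
# Maximum accepted rescues per run, from the rescue limits stated earlier.
MAX_RESCUES = {"fungal": 20, "vertebrate": 10, "plant": 8, "other": 15}


def greedy_rescue(candidates, evaluate, taxon):
    """Accept candidates one at a time until the taxon limit is reached.

    evaluate(accepted, candidate) should return True when the trial
    assembly containing the candidate passes validation.
    """
    limit = MAX_RESCUES.get(taxon, MAX_RESCUES["other"])
    accepted = []
    for candidate in candidates:
        if len(accepted) >= limit:
            break
        if evaluate(accepted, candidate):
            accepted.append(candidate)
    return accepted


# Toy validation rule: accept even-numbered candidates only.
result = greedy_rescue(range(100), lambda acc, c: c % 2 == 0, "plant")
print(len(result))
```

Each decision depends only on the rescues already accepted, so the order of candidates affects the outcome; this is a property of any greedy scheme.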

Post-Refinement Stack

After the rescued/combined assembly is produced, TACO runs purge_dups by default to remove leftover haplotigs, overlapping fragments, and residual duplicates. purge_dups behaviour is taxon-aware: vertebrate, animal, and plant genomes use two-round purging (-2 flag) for more thorough cleanup of larger, more complex genomes, while fungal and insect genomes use single-round to avoid over-purging. A warning is emitted for plant genomes due to polyploid risk. This is followed by automatic polishing selected from --platform: HiFi assemblies are polished with NextPolish2 by default (builds yak k-mer databases at k=21 and k=31 from HiFi reads, then applies k-mer-based correction — safe and effective for high-accuracy reads), Nanopore assemblies use Medaka (falls back to Racon if Medaka is not installed), and CLR assemblies use Racon. Both steps can be skipped with --no-purge-dups and --no-polish respectively.

Diploid and Polyploid Note

TACO is designed to produce a single primary-style chromosome-level assembly, not a fully phased diploid or polyploid reconstruction. For strongly diploid or polyploid genomes, telomere-bearing contigs from different haplotypes may appear as rescue donors, and purge_dups may collapse alternative haplotigs. This is acceptable when the goal is a cleaned primary reference assembly.

Provenance GFF3

TACO writes a GFF3 annotation file (final.merged.provenance.gff3) alongside the final assembly. Each contig gets one GFF3 record (type=contig) spanning its full length, with attributes documenting its full provenance chain: source_assembler (which assembler produced it), assembler_contig (the original contig name from that assembler before Step 10 pool renaming), source_type (assembler or quickmerge), role (backbone, upgrade_donor, or novel_t2t), replacement_class (for upgrade donors), replaced_contig (which backbone contig was replaced), and description (a human-readable summary like "Entire replacement: peregrine contig 'contig_5' replaced by canu contig 'tig00000015' (class: upgrade_tier2_to_t2t)").

For quickmerge-derived contigs, the GFF3 includes additional contig-level attributes (qm_assembler1, qm_assembler2) identifying the two source assemblers, plus child records (type=region) with Parent linking to the contig. Each region record spans the genomic coordinates contributed by a specific assembler, with source_assembler and assembler_contig showing the original source. For example, a quickmerge contig produced from canu × flye will have region records like "Region 1-500000 from canu contig 'tig00001'" and "Region 400000-900000 from flye contig 'contig_3'", enabling users to trace every base pair back to its assembler of origin.

A companion file pool_contig_provenance.tsv maps every pool contig back to its source assembler and original contig name, with extended columns for quickmerge contigs: qm_assembler1, qm_assembler2, and qm_regions (semicolon-delimited start-end:assembler:contig entries).
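
The qm_regions column is described as semicolon-delimited start-end:assembler:contig entries, which makes it straightforward to parse downstream. The sketch below is one way to do so; Region and parse_qm_regions are hypothetical names, and the example string is constructed from the format description rather than copied from real TACO output.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start: int
    end: int
    assembler: str
    contig: str


def parse_qm_regions(field: str) -> List[Region]:
    """Parse 'start-end:assembler:contig;...' into Region records."""
    regions = []
    for entry in field.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        span, assembler, contig = entry.split(":", 2)
        start, end = span.split("-", 1)
        regions.append(Region(int(start), int(end), assembler, contig))
    return regions


regions = parse_qm_regions("1-500000:canu:tig00001;400000-900000:flye:contig_3")
for region in regions:
    print(region.assembler, region.contig, region.end - region.start + 1)
```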

Output Structure

project_directory/
├── assemblies/
│   ├── assembly_info.csv                # Unified comparison table
│   ├── canu.fasta                       # Normalized assembly outputs
│   ├── nextdenovo.fasta
│   ├── ...
│   ├── single_tel.replaced.debug.tsv    # All rescue alignment hits
│   ├── single_tel.candidates.tsv        # Plausible rescue candidates
│   ├── rescue_rejection_summary.txt     # Rejection reasons
│   ├── rescue_trial_summary.tsv         # BUSCO trial results (with replacement_class, D%)
│   ├── final_merge.raw.fasta            # Pre-purge combined assembly
│   └── *.busco/                         # BUSCO results per assembly
├── final_results/
│   ├── final_result.csv                 # Final comparison report
│   ├── final_assembly.fasta             # Refined assembly (full mode)
│   ├── final.merged.provenance.gff3     # GFF3 provenance: full assembler tracing per contig
│   ├── pool_contig_provenance.tsv       # Pool contig → assembler + original name mapping
│   └── assembly_only_result.csv         # Comparison summary (assembly-only)
├── telomere_pool/                       # Telomere pool intermediates
├── quast_results/                       # QUAST output
├── logs/                                # Per-step log files
├── benchmark_logs/                      # Machine-readable benchmark data
│   ├── run_metadata.tsv
│   ├── step_benchmark.tsv
│   └── run_summary.txt
└── version.txt                          # Software versions

Project Structure

TACO/
├── setup.py                # pip install entry point
├── run_taco                # Shell wrapper (no install needed)
├── taco/                   # Python package
│   ├── __init__.py         # Package metadata (v1.2.0)
│   ├── __main__.py         # CLI entry point: taco [options]
│   ├── cli.py              # Argument parsing
│   ├── pipeline.py         # Pipeline runner, logging, benchmarking
│   ├── steps.py            # All 18 step implementations
│   ├── utils.py            # Shared utilities and FASTA I/O
│   ├── telomere_detect.py  # Hybrid telomere detection engine
│   ├── telomere_pool.py    # Telomere pool classification
│   ├── clustering.py       # Minimap2-based contig clustering
│   ├── backbone.py         # Backbone selection and scoring
│   └── reporting.py        # Final report generation
├── docs/                   # Documentation and images
├── taco-env.yml            # Conda environment
├── INSTALLATION.md
├── README.md
├── LICENSE
└── .gitignore

Troubleshooting

Canu reports master +XX changes or Step 1 fails with a Java error: The conda environment now includes openjdk>=11 to fix the Java runtime. If you still see this error, the bioconda canu may be a dev build — download a stable binary from https://github.com/marbl/canu/releases and place it on PATH. TACO detects dev builds and warns you; if canu fails, the pipeline continues with the remaining assemblers.

IPA or Peregrine skipped: These assemblers only support certain platforms. IPA requires PacBio HiFi; Peregrine does not support Nanopore. Use --platform to match your data type.

Telomere motif appears incorrect: Do not force --motif unless the telomere repeat is biologically known for your species. Use --taxon to select the appropriate preset instead. TACO's built-in motif families cover canonical TTAGGG, budding yeast TG1-3, Candida, plant TTTAGGG, and insect TTAGG repeats.

purge_dups or polishing not running: These tools must be installed in the conda environment. Use conda install -c bioconda purge_dups nextpolish2 yak racon medaka or skip with --no-purge-dups / --no-polish. For HiFi polishing, NextPolish2 and yak are both required. For Nanopore polishing, Medaka is preferred; if unavailable, Racon is used as fallback.

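taco: command not found: Activate the conda environment (conda activate taco) and confirm that pip install -e . completed, or run the run_taco wrapper from the TACO directory using its full path.

Missing Python modules: TACO uses only the Python standard library. If you see import errors, ensure Python >= 3.8 is installed and that the package was installed with pip install -e . (or that the taco/ directory sits alongside the run_taco wrapper).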

Merqury not working: Merqury is optional. Install with conda install -c bioconda merqury meryl or use --no-merqury to skip.

Citation

If you use TACO in a publication, please cite the software and archive the exact release used for reproducibility (e.g., via Zenodo).

TACO was developed at the Grainger Bioinformatics Center, Field Museum of Natural History, Chicago, Illinois, USA.

Changelog

See docs/CHANGELOG.md for the full version history, including detailed notes on the v0.5.6 → v1.0.0 Bash-to-Python conversion.

License

TACO is released under the MIT License.
