Telomere-Aware Contig Optimization
TACO is a telomere-aware all-in-one multi-assembler comparison and refinement pipeline for genome assembly benchmarking, decision-making, and chromosome-end improvement. Developed for small eukaryotic genomes with a focus on fungal genomes, TACO runs multiple assemblers, standardizes their outputs, evaluates assembly quality, detects telomere-supported contigs, and can either (1) stop after generating a unified assembler comparison table for benchmarking, or (2) continue into telomere-aware backbone refinement for an improved chromosome-scale candidate assembly.
TACO was developed at the Grainger Bioinformatics Center, Field Museum of Natural History.
- Overview
- Features
- Installation
- Quick Start
- Usage
- Sequencing Platform Support
- Pipeline Steps
- Telomere Detection
- Assembly Selection Strategy
- Output Structure
- Project Structure
- Troubleshooting
- Citation
- License
Genome assemblers often produce different results from the same long-read dataset. One assembler may recover longer contigs, another may preserve more complete chromosome ends, and another may provide a better balance of completeness, contiguity, and redundancy. TACO makes these comparisons systematic, interpretable, and reproducible.
TACO operates in two modes. In assembly-only mode (--assembly-only), the pipeline runs all assemblers, standardizes their outputs, evaluates quality with BUSCO, QUAST, telomere detection, and optional Merqury, then produces a unified comparison table at assemblies/assembly_info.csv. In full refinement mode, TACO continues from the comparison step into telomere-pool construction and backbone refinement, producing an improved chromosome-scale candidate assembly with preserved telomeric ends.
- Runs six long-read assemblers (HiCanu, NextDenovo, Peregrine, IPA, Flye, Hifiasm) from a single command
- Supports PacBio HiFi, Oxford Nanopore, and PacBio CLR reads via --platform
- Standardizes assembly outputs for direct cross-assembler comparison
- Hybrid telomere detection with de novo k-mer discovery, built-in motif families, and per-end composite scoring
- Three-tier telomere classification: strict T2T, single-end strong, and telomere-supported
- Benchmarks assemblies with BUSCO, QUAST, telomere metrics, and optional Merqury
- Biologically informed automatic backbone selection (smart scoring)
- Assembly-only mode for convenient benchmarking without refinement
- Telomere-aware backbone refinement with redundancy reduction and telomeric end rescue
- Machine-readable benchmark logs for reproducible reporting
See INSTALLATION.md for detailed instructions.
git clone https://github.com/yksun/TACO.git
cd TACO
conda env create -f taco-env.yml
conda activate taco
pip install -e .
# Verify
taco --help

After installation, the taco command is available whenever the taco conda environment is active.
TACO requires a Unix-like system (Linux or macOS) with Python >= 3.8 and Conda. Most dependencies are installed via taco-env.yml. TACO's Python modules use only the standard library with no additional pip packages. Canu, Peregrine, and IPA require manual installation (see INSTALLATION.md). If any assembler is absent or fails, TACO skips that step and continues with the others.
Full refinement run (comparison + telomere-aware refinement):
mkdir -p my_project && cd my_project
taco -g 12m -t 16 \
--fastq /path/to/reads.fastq \
--taxon fungal

Assembly-only comparison (benchmarking only):
taco -g 12m -t 16 \
--fastq /path/to/reads.fastq \
--taxon fungal \
--assembly-only

With Merqury QV scoring (optional, auto-detected if installed):
Merqury is automatically enabled when merqury.sh is on PATH and a .meryl database is found in the working directory. You can also specify the database explicitly:
taco -g 12m -t 16 \
--fastq /path/to/reads.fastq \
--taxon fungal \
--merqury-db reads.meryl

With explicit motif override (only if biologically known):
taco -g 12m -t 16 \
--fastq /path/to/reads.fastq \
-m TTAGGG

| Parameter | Description |
|---|---|
| -g, --genomesize | Estimated haploid genome size (e.g., 12m, 40m, 2g) |
| -t, --threads | Number of CPU threads |
| --fastq | Input FASTQ file (use absolute path) |
| --taxon | Taxonomy preset for telomere detection: vertebrate, animal, plant, insect, fungal, or other (default). Sets motif-family priors and detection behavior automatically. |
| -m, --motif | Telomere motif override (optional). Only use when the exact motif is biologically known for the species. When omitted, taxon-aware hybrid detection is used instead. |
| --platform | Sequencing platform: pacbio-hifi (default), nanopore, or pacbio. Also determines the default polishing tool. |
| -s, --steps | Run selected steps only (e.g., 1,3-5) |
| --reference, -ref | Reference FASTA for comparison. Included as the "reference" assembler in all comparison tables. |
| --busco | Run BUSCO (optionally specify lineage dataset) |
| --choose | Manually choose the backbone assembler |
| --assembly-only | Stop after assembler comparison |
| --auto-mode | Backbone selection mode: smart (default) or n50 |
| --merqury | Force-enable Merqury (auto-detected if merqury.sh + .meryl db found) |
| --merqury-db | Enable Merqury with a specific .meryl database path |
| --no-merqury | Disable Merqury even if installed and auto-detected |
| --no-purge-dups | Skip purge_dups after refinement |
| --no-polish | Skip automatic polishing after refinement |
| --allow-t2t-replace | Allow rescue donors to replace immutable Tier 1 (T2T) contigs. Disabled by default for safety. |
Use --assembly-only when the goal is assembler benchmarking and comparison without refinement. TACO runs all assemblers, standardizes outputs, runs BUSCO, telomere detection, QUAST, and optional Merqury, then writes the combined comparison table to assemblies/assembly_info.csv and a summary to final_results/assembly_only_result.csv.
TACO supports three sequencing platforms. Each assembler receives platform-appropriate flags automatically. The platform also determines the default polishing strategy: HiFi assemblies are polished with NextPolish2 by default (k-mer-based, safe for high-accuracy reads; requires yak for k-mer database construction), Nanopore assemblies use Medaka (neural-network polisher; falls back to Racon), and CLR assemblies use Racon.
| Platform | --platform | Assemblers Used | Notes |
|---|---|---|---|
| PacBio HiFi | pacbio-hifi (default) | canu, nextDenovo, peregrine, IPA, flye, hifiasm | All 6 assemblers |
| Oxford Nanopore | nanopore | canu, nextDenovo, flye | Peregrine, IPA, hifiasm skipped |
| PacBio CLR | pacbio | canu, nextDenovo, peregrine, flye | IPA, hifiasm skipped |
Incompatible assemblers are automatically skipped with a warning. IPA and hifiasm only support PacBio HiFi reads. Peregrine does not support Nanopore reads.
Each assembler receives the appropriate read-type flag automatically:
| Assembler | HiFi | Nanopore | CLR |
|---|---|---|---|
| Canu | -pacbio-hifi | -nanopore | -pacbio |
| Flye | --pacbio-hifi | --nano-hq (Q20+) | --pacbio-raw |
| Hifiasm | default mode | skipped | skipped |
| IPA | default mode | skipped | skipped |
| Peregrine | default mode | skipped | default mode |
| NextDenovo | via config file | via config file | via config file |
For older ONT data that is not Q20+ basecalled, set FLYE_ONT_FLAG=--nano-raw in the environment.
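The platform tables above can be condensed into a small lookup. The following is a minimal Python sketch, not TACO's internal code: the names ASSEMBLERS, READ_FLAGS, and plan are hypothetical, and only the assembler lists and flags come from the tables. The flye_ont_flag argument stands in for the FLYE_ONT_FLAG environment variable.

```python
# Illustrative lookup built from the platform tables; all names are hypothetical.
ASSEMBLERS = {
    "pacbio-hifi": ["canu", "nextDenovo", "peregrine", "IPA", "flye", "hifiasm"],
    "nanopore": ["canu", "nextDenovo", "flye"],
    "pacbio": ["canu", "nextDenovo", "peregrine", "flye"],
}

# Only Canu and Flye take an explicit read-type flag in the table above.
READ_FLAGS = {
    "canu": {"pacbio-hifi": "-pacbio-hifi", "nanopore": "-nanopore", "pacbio": "-pacbio"},
    "flye": {"pacbio-hifi": "--pacbio-hifi", "nanopore": "--nano-hq", "pacbio": "--pacbio-raw"},
}


def plan(platform, flye_ont_flag="--nano-hq"):
    """Return (assembler, read flag) pairs; flag is None for default mode or config file."""
    if platform not in ASSEMBLERS:
        raise ValueError(f"unknown platform: {platform}")
    pairs = []
    for name in ASSEMBLERS[platform]:
        flag = READ_FLAGS.get(name, {}).get(platform)
        if name == "flye" and platform == "nanopore":
            flag = flye_ont_flag  # e.g. "--nano-raw" for older, non-Q20+ ONT data
        pairs.append((name, flag))
    return pairs


for name, flag in plan("nanopore", flye_ont_flag="--nano-raw"):
    print(name, flag)
```

Keeping the compatibility matrix in one place makes it easy to see why a given assembler is skipped for a given --platform value.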
| Platform | Polishing Tool | Notes |
|---|---|---|
| HiFi | NextPolish2 (yak k-mer based) | K-mer polishing corrects residual errors safely; requires nextpolish2 + yak |
| Nanopore | Medaka (Racon fallback) | Neural-network polisher; set MEDAKA_MODEL for non-default chemistry |
| CLR | Racon | Standard error-correction for CLR reads |
| Component | Fungal | Plant | Vertebrate/Animal | Insect/Other |
|---|---|---|---|---|
| Telomere motifs | TTAGGG + TG1-3 + Candida | TTTAGGG | TTAGGG | TTAGG (insect) / all (other) |
| Score window | 300 bp | 1000 bp | 1000 bp | 500 bp |
| Backbone T2T weight | 350 (high) | 200 (reduced) | 200 (reduced) | 300 (default) |
| BUSCO D penalty | 600 (strict) | 300 (relaxed) | 500 (default) | 500 (default) |
| BUSCO trial C-drop | 2% (strict) | 4% (relaxed) | 3% (moderate) | 2% (default) |
| purge_dups mode | single-round | two-round + polyploid warning | two-round | single-round |
| Polishing (HiFi) | NextPolish2 (yak k-mer based) | NextPolish2 (yak k-mer based) | NextPolish2 (yak k-mer based) | NextPolish2 (yak k-mer based) |
| Polishing (ONT) | Medaka → Racon | Medaka → Racon | Medaka → Racon | Medaka → Racon |
| Polishing (CLR) | Racon | Racon | Racon | Racon |
| Step | Description |
|---|---|
| 1 | HiCanu assembly |
| 2 | NextDenovo assembly |
| 3 | Peregrine assembly |
| 4 | IPA assembly |
| 5 | Flye assembly |
| 6 | Hifiasm assembly |
| 7 | Copy and normalize all assemblies |
| 8 | BUSCO on all assemblies |
| 9 | Telomere contig detection and telomere metrics |
| 10 | Build optimized telomere pool across assemblies |
| 11 | QUAST for assembler comparison |
| 12 | Backbone selection and telomere-aware refinement |
| 13 | BUSCO on final assembly |
| 14 | Telomere analysis of final assembly |
| 15 | QUAST on final assembly |
| 16 | Final comparison report |
| 17 | Cleanup into structured output folders |
| 18 | Assembly-only comparison summary |
With --assembly-only, TACO follows the comparison path (Steps 1-11, 18) and stops before backbone refinement.
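To make the step-selection syntax concrete, here is a minimal Python sketch of how a specification such as 1,3-5 could be expanded into step numbers. It is an illustration only and may differ from TACO's own parser; the function name parse_steps and the last_step parameter are hypothetical.

```python
def parse_steps(spec, last_step=18):
    """Expand a spec such as '1,3-5' into a sorted list of unique step numbers."""
    steps = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
        else:
            low = high = int(part)
        if low > high or low < 1 or high > last_step:
            raise ValueError(f"invalid step range: {part!r}")
        steps.update(range(low, high + 1))
    return sorted(steps)


# The assembly-only path described above, written in the same syntax.
ASSEMBLY_ONLY_STEPS = parse_steps("1-11,18")
print(parse_steps("1,3-5"))
```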
TACO v1.2.0 uses a taxon-aware hybrid telomere detection system that combines built-in motif families with de novo k-mer discovery.
Use --taxon to select the appropriate telomere motif priors for your organism. This is the recommended approach instead of forcing --motif directly.
| --taxon | Primary Motifs | Notes |
|---|---|---|
| vertebrate | TTAGGG | Highly conserved; exact motif matching is most reliable here |
| animal | TTAGGG | Strong prior for vertebrates, less certain for distant metazoans |
| plant | TTTAGGG | Common plant repeat; some lineages vary |
| insect | TTAGG | Common insect repeat; not universal across all insect orders |
| fungal | TTAGGG, TG1-3, Candida | Diverse fungal telomeres; all built-in families used |
| other (default) | All families | Unknown taxon; relies on de novo discovery plus all priors |
The --motif flag is an optional override. Do not force --motif unless the telomere repeat is biologically confirmed for your species or lineage. For fungi and unknown taxa especially, forcing a motif may miss true telomeres.
TACO ships with five motif families: the canonical vertebrate/filamentous fungal TTAGGG repeat, the budding yeast TG1-3/C1-3A degenerate repeat, the Candida 23-bp repeat (ACGGATGTCTAACTTCTTGGTGT), the plant TTTAGGG repeat, and the insect TTAGG repeat.
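As a rough illustration of exact-motif matching at contig ends, the Python sketch below scans a terminal window for tandem copies of a motif or its reverse complement. It is a simplification and not TACO's detection engine: degenerate families such as TG1-3 need a different pattern, de novo k-mer discovery is not shown, and the definition of density used here (fraction of window bases inside tandem runs) is an assumption. All function names are hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def end_metrics(end_seq, motif):
    """Return (density, longest_run_bp) for the better of the two motif orientations."""
    end_seq = end_seq.upper()
    motif = motif.upper()
    best = (0.0, 0)
    for pattern in {motif, revcomp(motif)}:
        # Each match is one uninterrupted tandem run of the motif.
        runs = [len(m.group(0)) for m in re.finditer(f"(?:{re.escape(pattern)})+", end_seq)]
        density = sum(runs) / len(end_seq) if end_seq else 0.0
        best = max(best, (density, max(runs, default=0)))
    return best


def contig_end_metrics(seq, motif, window=300):
    """Metrics for the first and last `window` bases of a contig."""
    return end_metrics(seq[:window], motif), end_metrics(seq[-window:], motif)


contig = "CCCTAA" * 10 + "ACGT" * 100 + "TTAGGG" * 5
print(contig_end_metrics(contig, "TTAGGG", window=100))
```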
Each contig end is scored using a composite of four metrics: telomere density (weight 0.40), longest consecutive run (weight 0.30), distance of repeats from the contig terminus (weight 0.20), and covered base pairs (weight 0.10). The composite score ranges from 0 to 1.
Contigs are classified into three tiers based on their end scores: strict T2T contigs have strong telomere signal at both ends (score >= 0.25 at each end), single-end strong contigs have strong signal at one end only, and telomere-supported contigs have at least weak signal (score >= 0.08) at one end.
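The weights and thresholds above can be expressed as a short Python sketch. It assumes each of the four metrics has already been normalized to the range 0 to 1 with higher values meaning stronger telomere evidence; how TACO performs that normalization is not described here, so the metric keys and function names are hypothetical.

```python
# Weights and thresholds as stated in the text; the key names are illustrative.
END_WEIGHTS = {
    "density": 0.40,
    "longest_run": 0.30,
    "terminus_distance": 0.20,
    "covered_bp": 0.10,
}
STRONG_SCORE = 0.25
WEAK_SCORE = 0.08


def end_score(metrics):
    """Weighted sum of four metrics, each assumed to be pre-normalized to [0, 1]."""
    return sum(weight * metrics[key] for key, weight in END_WEIGHTS.items())


def classify(left_score, right_score):
    """Assign a telomere tier from the two per-end composite scores."""
    strong_ends = (left_score >= STRONG_SCORE) + (right_score >= STRONG_SCORE)
    if strong_ends == 2:
        return "strict_t2t"
    if strong_ends == 1:
        return "single_end_strong"
    if max(left_score, right_score) >= WEAK_SCORE:
        return "telomere_supported"
    return "none"


left = end_score({"density": 0.6, "longest_run": 0.5, "terminus_distance": 1.0, "covered_bp": 0.4})
print(round(left, 2), classify(left, 0.02))
```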
When --choose is not provided, TACO automatically selects the backbone assembly for refinement. The scoring formula adapts its weights based on --taxon to match the biological characteristics of each organism group.
TACO ranks assemblies using a composite score. Merqury QV and completeness are included when available (optional — auto-detected if merqury.sh is installed and a .meryl database is found; otherwise these terms contribute 0):
score = BUSCO_S × w_busco_s + T2T × w_t2t + single_tel × w_single
+ MerquryComp × 200 + MerquryQV × 20 ← optional, 0 if Merqury not available
- contigs × w_contigs + log10(N50) × w_n50
- BUSCO_D × w_busco_d
BUSCO single-copy completeness (S%) is used instead of total completeness (C%) to avoid rewarding highly duplicated assemblies. BUSCO duplication (D%) is explicitly penalised. When Merqury is available, its k-mer-based QV and completeness provide an independent quality signal that helps distinguish assemblies with similar BUSCO scores. The weights are tuned per taxon as described below.
Fungal (--taxon fungal): Fungal genomes are typically small (10–60 Mb) with well-defined chromosomes. TACO uses strict BUSCO duplicate penalty (w_busco_d = 600) because duplicated assemblies are almost always artefactual in haploid fungi. T2T contigs are weighted heavily (w_t2t = 350) since telomere rescue is highly effective for small genomes where individual T2T chromosomes can be resolved. The contig-count penalty remains moderate (w_contigs = 30) because most fungal genomes have few chromosomes.
Plant (--taxon plant): Plant genomes vary enormously in size and ploidy. TACO relaxes the BUSCO duplicate penalty (w_busco_d = 300) because polyploidy naturally inflates D% even in correct assemblies. The contig-count penalty is increased (w_contigs = 50) to discourage fragmented assemblies in these often large genomes. T2T weight is reduced (w_t2t = 200) because long repetitive arrays near telomeres can produce false-positive signals, and interstitial telomeric repeats (ITRs) are common in plants.
Vertebrate / Animal (--taxon vertebrate or --taxon animal): Vertebrate genomes are large (1–3+ Gb) and repeat-rich. TACO increases the N50 weight (w_n50 = 200) to favour contiguous assemblies and moderates the contig-count penalty (w_contigs = 40). T2T weight is reduced (w_t2t = 200) because interstitial telomeric repeats are frequent in vertebrates and can inflate telomere counts. BUSCO duplicate penalty stays at the default (w_busco_d = 500).
Insect / Other (--taxon insect or --taxon other): These taxa use the balanced default weights: w_busco_s = 1000, w_t2t = 300, w_single = 150, w_contigs = 30, w_n50 = 150, w_busco_d = 500. This is appropriate when the biological characteristics of the target organism are not well characterised.
| Weight | Fungal | Plant | Vertebrate/Animal | Insect/Other |
|---|---|---|---|---|
| w_busco_s | 1000 | 1000 | 1000 | 1000 |
| w_t2t | 350 | 200 | 200 | 300 |
| w_single | 150 | 150 | 150 | 150 |
| w_contigs | 30 | 50 | 40 | 30 |
| w_n50 | 150 | 150 | 200 | 150 |
| w_busco_d | 600 | 300 | 500 | 500 |
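For readers who want to reproduce the ranking by hand, the formula and weight table can be sketched in Python as follows. This is an illustration rather than the code in backbone.py: the dictionary keys and function name are hypothetical, and it assumes that BUSCO S and D are given as percentages and that the Merqury terms are simply 0 when Merqury is unavailable.

```python
import math

_KEYS = ("w_busco_s", "w_t2t", "w_single", "w_contigs", "w_n50", "w_busco_d")
TAXON_WEIGHTS = {
    "fungal": dict(zip(_KEYS, (1000, 350, 150, 30, 150, 600))),
    "plant": dict(zip(_KEYS, (1000, 200, 150, 50, 150, 300))),
    "vertebrate": dict(zip(_KEYS, (1000, 200, 150, 40, 200, 500))),
    "other": dict(zip(_KEYS, (1000, 300, 150, 30, 150, 500))),
}
TAXON_WEIGHTS["animal"] = TAXON_WEIGHTS["vertebrate"]
TAXON_WEIGHTS["insect"] = TAXON_WEIGHTS["other"]


def smart_score(m, taxon="other"):
    """Composite backbone score for one assembly's metrics dict `m`."""
    w = TAXON_WEIGHTS[taxon]
    return (
        m["busco_s"] * w["w_busco_s"]
        + m["t2t"] * w["w_t2t"]
        + m["single_tel"] * w["w_single"]
        + m.get("merqury_comp", 0) * 200  # optional term
        + m.get("merqury_qv", 0) * 20     # optional term
        - m["contigs"] * w["w_contigs"]
        + math.log10(m["n50"]) * w["w_n50"]
        - m["busco_d"] * w["w_busco_d"]
    )


example = {"busco_s": 95.0, "t2t": 5, "single_tel": 3, "contigs": 12, "n50": 1_000_000, "busco_d": 1.0}
print(smart_score(example, "fungal"))
```

With these made-up metrics the BUSCO S term dominates (95 × 1000), which matches the intent that single-copy completeness is the primary signal and the other terms act as tie-breakers and penalties.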
The default telomere score window also varies by taxon to match typical telomere array lengths: fungi use 300 bp (fungal telomere arrays are short, often 50–300 bp), plants and vertebrates use 1000 bp (longer repeat arrays), and other taxa use 500 bp (balanced default).
When validating rescue candidates via BUSCO trial, the maximum acceptable C% drop, M% rise, and D% rise depend on taxon:
- Fungi: strict thresholds (2% C-drop, 0.3% M-rise, 2% D-rise) — haploid genomes with stable BUSCO profiles.
- Plant: relaxed thresholds (4% C-drop, 1.0% M-rise, 6% D-rise) — polyploidy causes natural BUSCO variability.
- Vertebrate: moderate thresholds (3% C-drop, 0.5% M-rise, 4% D-rise).
- Other: balanced defaults (2.5% C-drop, 0.5% M-rise, 3% D-rise).
A D-rise (duplicated BUSCO increase) check catches cases where a rescue introduces redundant copies of single-copy orthologs — a sign of retained haplotigs or mis-joined contigs. All thresholds can be overridden via STEP12_MAX_BUSCO_C_DROP, STEP12_MAX_BUSCO_M_RISE, and STEP12_MAX_BUSCO_D_RISE environment variables.
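The threshold check can be sketched in Python as shown below. This is a hedged illustration, not TACO's Step 12 code: the function names are hypothetical, taxa not listed above fall back to the "other" defaults as an assumption, and treating a change exactly equal to the threshold as acceptable is also an assumption. The environment variable names are the ones documented above.

```python
import os

# (max C-drop, max M-rise, max D-rise) in percentage points, from the list above.
TRIAL_THRESHOLDS = {
    "fungal": (2.0, 0.3, 2.0),
    "plant": (4.0, 1.0, 6.0),
    "vertebrate": (3.0, 0.5, 4.0),
    "other": (2.5, 0.5, 3.0),
}
ENV_KEYS = ("STEP12_MAX_BUSCO_C_DROP", "STEP12_MAX_BUSCO_M_RISE", "STEP12_MAX_BUSCO_D_RISE")


def trial_limits(taxon, env=os.environ):
    """Taxon defaults, each overridable by its environment variable."""
    defaults = TRIAL_THRESHOLDS.get(taxon, TRIAL_THRESHOLDS["other"])
    return tuple(float(env.get(key, default)) for key, default in zip(ENV_KEYS, defaults))


def trial_passes(before, after, taxon, env=os.environ):
    """`before` and `after` are dicts of BUSCO C, M, and D percentages."""
    max_c_drop, max_m_rise, max_d_rise = trial_limits(taxon, env)
    return (
        before["C"] - after["C"] <= max_c_drop
        and after["M"] - before["M"] <= max_m_rise
        and after["D"] - before["D"] <= max_d_rise
    )


backbone = {"C": 98.0, "M": 1.0, "D": 1.0}
trial = {"C": 97.0, "M": 1.2, "D": 1.5}
print(trial_passes(backbone, trial, "fungal", env={}))
```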
Additional environment variables for fine-tuning Step 12:
- PROTECT_COV / PROTECT_ID: strict dedup thresholds (default 0.95/0.95)
- DEDUP_MAX_BUSCO_C_DROP: maximum tolerated BUSCO C drop after dedup (default 3.0%)
- CHIMERA_MIN_CROSS_COV: minimum cross-assembly coverage for the chimera mapping check (default 0.60)
- AGGR_NONTELO_COV / AGGR_NONTELO_ID: taxon-aware non-telomeric dedup thresholds
- SELFDEDUP_COV / SELFDEDUP_ID: self-dedup thresholds
The maximum number of accepted rescue candidates per run is taxon-aware: fungi allow up to 20 rescues (many small chromosomes), vertebrates 10, plants 8 (conservative due to polyploidy risk), and other taxa 15. This prevents runaway replacement in complex genomes.
purge_dups uses single-round purging for fungal, insect, and other genomes to avoid over-purging small or simple genomes. For vertebrate, animal, and plant genomes, two-round purging (-2 flag) is used for more thorough cleanup of larger, more complex genomes. A warning is emitted for plant genomes because purge_dups may incorrectly collapse homeologous sequences in polyploid species — use --no-purge-dups if this is a concern.
--auto-mode n50 selects the assembly with the highest N50. This reproduces legacy behavior but may favor contiguous assemblies that lack completeness.
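For reference, N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. A minimal Python sketch of that calculation and of an N50-only selection follows; the function names and the example lengths are illustrative.

```python
def n50(lengths):
    """Contig length at which the sorted cumulative sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def pick_by_n50(assemblies):
    """`assemblies` maps an assembler name to its list of contig lengths."""
    return max(assemblies, key=lambda name: n50(assemblies[name]))


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
print(pick_by_n50({"flye": [10, 10, 1], "canu": [8, 8, 8]}))
```

Because this criterion looks only at contiguity, it ignores BUSCO and telomere evidence, which is why the smart mode is the default.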
Step 12 adopts a T2T-first assembly philosophy with a two-tier confidence model:
- Tier 1 (Immutable): T2T contigs, i.e. contigs with verified telomere signal at both ends. These are treated as protected chromosomal anchors and are never replaced during rescue unless --allow-t2t-replace is explicitly set. This protects the highest-confidence contigs from accidental degradation.
- Tier 2 (Editable): Backbone contigs, i.e. gap-fill contigs that cover chromosomal regions not represented by T2T contigs. These may be replaced by telomere-bearing rescue donors if the replacement passes BUSCO trial validation.
Duplicate non-telomeric backbone contigs are aggressively removed (with taxon-aware thresholds), and rescue donors must carry verified telomere signal.
- 12A: Merqury QV scoring (optional; auto-detected if installed, or enabled with --merqury / --merqury-db).
- 12B: auto-select backbone assembler (smart scoring with taxon-aware weights).
- 12C: prepare cleaned backbone + chimera safety using two strategies: (a) size gate: contigs > 1.5× the largest individual assembler contig are flagged; (b) cross-assembly mapping: each protected contig is aligned against all other assembler outputs, and contigs not well covered (≥60%) by any single assembler's contig are flagged as potential chimeras. Configurable via CHIMERA_MIN_CROSS_COV.
- 12D: T2T-first foundation building:
  - 12D1 strict dedup (95%/95%): remove backbone contigs near-identical to the T2T pool. Each removal is logged (name, length, coverage, identity).
  - 12D2 post-dedup BUSCO safety check: runs BUSCO on the combined assembly (protected + remaining backbone) and compares it to the backbone alone. Warns if BUSCO C drops > 3% (configurable via DEDUP_MAX_BUSCO_C_DROP), with remediation suggestions.
  - 12D2b fragment removal (50%/90%): remove backbone fragments partially overlapping T2T chromosomes.
  - 12D3 backbone telomere classification: run telomere detection on remaining backbone contigs to identify which carry telomere signal.
  - 12D4 aggressive non-telomeric dedup (taxon-aware): backbone contigs lacking telomere support that overlap the T2T pool are removed. Thresholds: fungi 70%/85%, plant/vertebrate 85%/92%, other 75%/88%.
  - 12D5 non-telomeric self-dedup (taxon-aware): when two non-telomeric backbone contigs overlap, the shorter one is removed. Thresholds: fungi 80%/90%, plant/vertebrate 90%/95%, other 85%/92%. Telomere-bearing contigs are always kept.
- 12E: telomere rescue with two-tier protection: align donor pool to backbone, compute structural metrics, verify each donor carries telomere signal, and classify each replacement. Tier 1 (T2T) contigs are immutable by default; candidates targeting them are rejected unless --allow-t2t-replace is set. Each accepted candidate is assigned a replacement class: fill_missing_end, replace_non_telo_backbone, replace_single_with_better, or replace_protected_t2t.
- 12F: BUSCO trial validation: for each telomere-verified candidate, build a trial assembly and run BUSCO. Rejection thresholds are taxon-aware (fungi: 2% C-drop / 2% D-rise, plant: 4% / 6%, vertebrate: 3% / 4%). Maximum accepted rescues are also taxon-aware (fungi: 20, plant: 8, vertebrate: 10, other: 15). An additional safety check rejects replace_single_with_better candidates if the replacement loses telomere evidence at either end.
- 12G: final combine: T2T foundation + telomere-rescued backbone gap-fill.
- 12H: purge_dups: taxon-aware haplotig/duplicate purging (skip with --no-purge-dups).
- 12I: automatic polishing: NextPolish2 for HiFi (k-mer-based via yak), Medaka for ONT (Racon fallback), Racon for CLR (skip with --no-polish).
- 12J: telomere-aware genome-size pruning: only non-telomeric contigs are removed when the assembly exceeds the size budget. Telomere-bearing contigs are never pruned.
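The taxon-aware dedup rules in 12D4 and 12D5 amount to a threshold test on an alignment hit. The Python sketch below reads each X%/Y% pair as minimum coverage / minimum identity, an interpretation suggested by the AGGR_NONTELO_COV / AGGR_NONTELO_ID and SELFDEDUP_COV / SELFDEDUP_ID variable names. The function name and the fallback to the "other" thresholds for unlisted taxa are assumptions.

```python
# (min coverage, min identity) per taxon, from the 12D4 and 12D5 descriptions.
AGGR_NONTELO = {
    "fungal": (0.70, 0.85),
    "plant": (0.85, 0.92),
    "vertebrate": (0.85, 0.92),
    "other": (0.75, 0.88),
}
SELF_DEDUP = {
    "fungal": (0.80, 0.90),
    "plant": (0.90, 0.95),
    "vertebrate": (0.90, 0.95),
    "other": (0.85, 0.92),
}


def should_remove(coverage, identity, has_telomere, taxon, table=AGGR_NONTELO):
    """True if a backbone contig counts as redundant under the given threshold table."""
    if has_telomere:
        return False  # telomere-bearing contigs are always kept
    min_cov, min_id = table.get(taxon, table["other"])
    return coverage >= min_cov and identity >= min_id


print(should_remove(0.72, 0.86, has_telomere=False, taxon="fungal"))
print(should_remove(0.72, 0.86, has_telomere=False, taxon="plant"))
```

The same hit is removed under the fungal thresholds but kept under the plant thresholds, reflecting the more conservative treatment of larger, potentially polyploid genomes.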
TACO validates each telomere rescue candidate by building a trial assembly where one backbone contig is replaced by one donor contig, then running BUSCO with the same lineage selected by the user. Rejection is triggered by three independent BUSCO metrics: C% drop (completeness loss), M% rise (missing gene increase), and D% rise (duplicated BUSCO increase, catching retained haplotigs). Rejection thresholds are taxon-aware: fungi use strict thresholds (2% C-drop, 2% D-rise), plants use relaxed thresholds (4% C-drop, 6% D-rise, accounting for polyploidy), and vertebrates use moderate thresholds (3% C-drop, 4% D-rise). An additional safety check rejects replace_single_with_better candidates if telomere evidence weakens at either end after replacement (suspicious size drops >30% also trigger rejection). This greedy, sequential approach ensures that each accepted rescue improves or maintains assembly quality. The trial summary TSV includes replacement_class and D (duplicated %) columns for full traceability.
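The greedy, sequential acceptance described above can be outlined in a few lines of Python. In this sketch the trial_ok callback is a hypothetical stand-in for building a trial assembly and running BUSCO. The assumption that each trial sees the rescues accepted so far is an interpretation of "sequential", not a documented detail. The per-taxon caps are the documented maximum rescue counts.

```python
MAX_RESCUES = {"fungal": 20, "vertebrate": 10, "plant": 8, "other": 15}


def greedy_rescue(candidates, trial_ok, taxon):
    """Accept candidates one at a time until the taxon-specific cap is reached.

    `trial_ok(accepted, candidate)` is a placeholder for the BUSCO trial and
    returns True when the candidate should be kept.
    """
    cap = MAX_RESCUES.get(taxon, MAX_RESCUES["other"])
    accepted = []
    for candidate in candidates:
        if len(accepted) >= cap:
            break
        if trial_ok(accepted, candidate):
            accepted.append(candidate)
    return accepted


# Toy run: even-numbered candidates "pass" the trial; plants stop after 8 rescues.
print(greedy_rescue(range(30), lambda accepted, c: c % 2 == 0, "plant"))
```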
After the rescued/combined assembly is produced, TACO runs purge_dups by default to remove leftover haplotigs, overlapping fragments, and residual duplicates. purge_dups behaviour is taxon-aware: vertebrate, animal, and plant genomes use two-round purging (-2 flag) for more thorough cleanup of larger, more complex genomes, while fungal, insect, and other genomes use single-round purging to avoid over-purging. A warning is emitted for plant genomes due to polyploid risk.
Polishing follows automatically and is selected from --platform: HiFi assemblies are polished with NextPolish2 by default (it builds yak k-mer databases at k=21 and k=31 from the HiFi reads, then applies k-mer-based correction, which is safe and effective for high-accuracy reads), Nanopore assemblies use Medaka (falling back to Racon if Medaka is not installed), and CLR assemblies use Racon. The two steps can be skipped with --no-purge-dups and --no-polish, respectively.
TACO is designed for producing a best primary-style chromosome-level assembly, not a fully phased diploid or polyploid reconstruction. For strongly diploid or polyploid genomes, telomere-bearing contigs from different haplotypes may appear as rescue donors, and purge_dups may collapse alternative haplotigs. This is acceptable when the goal is a cleaned primary reference assembly.
TACO writes a GFF3 annotation file (final.merged.provenance.gff3) alongside the final assembly. Each contig gets one GFF3 record (type=contig) spanning its full length, with attributes documenting its full provenance chain: source_assembler (which assembler produced it), assembler_contig (the original contig name from that assembler before Step 10 pool renaming), source_type (assembler or quickmerge), role (backbone, upgrade_donor, or novel_t2t), replacement_class (for upgrade donors), replaced_contig (which backbone contig was replaced), and description (a human-readable summary like "Entire replacement: peregrine contig 'contig_5' replaced by canu contig 'tig00000015' (class: upgrade_tier2_to_t2t)").
For quickmerge-derived contigs, the GFF3 includes additional contig-level attributes (qm_assembler1, qm_assembler2) identifying the two source assemblers, plus child records (type=region) with Parent linking to the contig. Each region record spans the genomic coordinates contributed by a specific assembler, with source_assembler and assembler_contig showing the original source. For example, a quickmerge contig produced from canu × flye will have region records like "Region 1-500000 from canu contig 'tig00001'" and "Region 400000-900000 from flye contig 'contig_3'", enabling users to trace every base pair back to its assembler of origin.
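Because the provenance file is standard GFF3, its attributes can be read with a few lines of Python. The sketch below parses one feature line into a dictionary. The example record is invented for illustration (including the TACO value in the source column), and only the attribute names source_assembler and role are taken from the description above.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict with decoded attributes."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("GFF3 feature lines have nine tab-separated columns")
    attributes = {}
    for field in cols[8].split(";"):
        if field:
            key, _, value = field.partition("=")
            attributes[unquote(key)] = unquote(value)  # GFF3 percent-encodes reserved characters
    return {
        "seqid": cols[0],
        "type": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "attributes": attributes,
    }


# Hypothetical record; real files may carry more attributes.
example_line = "contig_1\tTACO\tcontig\t1\t900000\t.\t.\t.\tID=contig_1;source_assembler=canu;role=backbone"
record = parse_gff3_line(example_line)
print(record["attributes"]["source_assembler"], record["end"])
```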
A companion file pool_contig_provenance.tsv maps every pool contig back to its source assembler and original contig name, with extended columns for quickmerge contigs: qm_assembler1, qm_assembler2, and qm_regions (semicolon-delimited start-end:assembler:contig entries).
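The qm_regions column can be unpacked with a small helper. This Python sketch assumes the start-end:assembler:contig layout described above and splits on the first two colons only, so contig names containing a colon survive. The function name and the example value are illustrative.

```python
def parse_qm_regions(value):
    """Split a qm_regions string into (start, end, assembler, contig) tuples."""
    regions = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        span, assembler, contig = entry.split(":", 2)
        start, end = (int(x) for x in span.split("-", 1))
        regions.append((start, end, assembler, contig))
    return regions


for region in parse_qm_regions("1-500000:canu:tig00001;400000-900000:flye:contig_3"):
    print(region)
```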
project_directory/
├── assemblies/
│ ├── assembly_info.csv # Unified comparison table
│ ├── canu.fasta # Normalized assembly outputs
│ ├── nextdenovo.fasta
│ ├── ...
│ ├── single_tel.replaced.debug.tsv # All rescue alignment hits
│ ├── single_tel.candidates.tsv # Plausible rescue candidates
│ ├── rescue_rejection_summary.txt # Rejection reasons
│ ├── rescue_trial_summary.tsv # BUSCO trial results (with replacement_class, D%)
│ ├── final_merge.raw.fasta # Pre-purge combined assembly
│ └── *.busco/ # BUSCO results per assembly
├── final_results/
│ ├── final_result.csv # Final comparison report
│ ├── final_assembly.fasta # Refined assembly (full mode)
│ ├── final.merged.provenance.gff3 # GFF3 provenance: full assembler tracing per contig
│ ├── pool_contig_provenance.tsv # Pool contig → assembler + original name mapping
│ └── assembly_only_result.csv # Comparison summary (assembly-only)
├── telomere_pool/ # Telomere pool intermediates
├── quast_results/ # QUAST output
├── logs/ # Per-step log files
├── benchmark_logs/ # Machine-readable benchmark data
│ ├── run_metadata.tsv
│ ├── step_benchmark.tsv
│ └── run_summary.txt
└── version.txt # Software versions
TACO/
├── setup.py # pip install entry point
├── run_taco # Shell wrapper (no install needed)
├── taco/ # Python package
│ ├── __init__.py # Package metadata (v1.2.0)
│ ├── __main__.py # CLI entry point: taco [options]
│ ├── cli.py # Argument parsing
│ ├── pipeline.py # Pipeline runner, logging, benchmarking
│ ├── steps.py # All 18 step implementations
│ ├── utils.py # Shared utilities and FASTA I/O
│ ├── telomere_detect.py # Hybrid telomere detection engine
│ ├── telomere_pool.py # Telomere pool classification
│ ├── clustering.py # Minimap2-based contig clustering
│ ├── backbone.py # Backbone selection and scoring
│ └── reporting.py # Final report generation
├── docs/ # Documentation and images
├── taco-env.yml # Conda environment
├── INSTALLATION.md
├── README.md
├── LICENSE
└── .gitignore
Canu reports master +XX changes or Step 1 fails with a Java error: The conda environment now includes openjdk>=11 to fix the Java runtime. If you still see this error, the bioconda canu may be a dev build — download a stable binary from https://github.com/marbl/canu/releases and place it on PATH. TACO detects dev builds and warns you; if canu fails, the pipeline continues with the remaining assemblers.
IPA or Peregrine skipped: These assemblers only support certain platforms. IPA requires PacBio HiFi; Peregrine does not support Nanopore. Use --platform to match your data type.
Telomere motif appears incorrect: Do not force --motif unless the telomere repeat is biologically known for your species. Use --taxon to select the appropriate preset instead. TACO's built-in motif families cover canonical TTAGGG, budding yeast TG1-3, Candida, plant TTTAGGG, and insect TTAGG repeats.
purge_dups or polishing not running: These tools must be installed in the conda environment. Use conda install -c bioconda purge_dups nextpolish2 yak racon medaka or skip with --no-purge-dups / --no-polish. For HiFi polishing, NextPolish2 and yak are both required. For Nanopore polishing, Medaka is preferred; if unavailable, Racon is used as fallback.
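taco or run_taco: command not found: Activate the conda environment (conda activate taco) and confirm that pip install -e . completed. If you use the run_taco shell wrapper instead, add the TACO directory to your PATH or run it with the full path.
Missing Python modules: TACO uses only the Python standard library. If you see import errors, ensure Python >= 3.8 is installed and the taco/ directory is alongside the run_taco wrapper.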
Merqury not working: Merqury is optional. Install with conda install -c bioconda merqury meryl or use --no-merqury to skip.
If you use TACO in a publication, please cite the software and archive the exact release used for reproducibility (e.g., via Zenodo).
TACO was developed at the Grainger Bioinformatics Center, Field Museum of Natural History, Chicago, Illinois, USA.
See docs/CHANGELOG.md for the full version history, including detailed notes on the v0.5.6 → v1.0.0 Bash-to-Python conversion.
TACO is released under the MIT License.