4 changes: 4 additions & 0 deletions .gitignore
@@ -192,3 +192,7 @@ delete.slurm


playground.ipynb
old_results/
svm.ipynb
zip.slurm
scripts/benchmarks.sh
87 changes: 76 additions & 11 deletions docs/benchmark.md
@@ -1,34 +1,99 @@
# Benchmark

XspecT is a tool designed for fast and accurate species classification of genome assemblies and simulated reads. To evaluate its classification accuracy, we conducted a benchmark using a set of *Acinetobacter* genomes.

```mermaid
flowchart TD

A([Start]) --> DL[Download genomes]

genomes[(NCBI RefSeq<br/>assemblies<br/>latest, non-atypical)]
meta@{ shape: doc, label: "Assembly metadata" }
tax@{ shape: doc, label: "Taxonomy report" }
xspect@{ shape: doc, label: "XspecT model" }

genomes --> DL
genomes --> meta

subgraph Data_preparation[Data Preparation]
DP1[Keep only assemblies with OK taxonomy check status]
DP2[Map taxonomy IDs:<br/>strain → species]
DP3[Remove species IDs not in XspecT model]
DP4[Remove assemblies used for model training]
end

DL --> DP1
meta --> DP1
tax --> DP2
DP1 --> DP2
DP2 --> DP3
DP3 --> DP4
DP4 --> assemblies_clean@{ shape: docs, label: "Filtered assemblies" }
xspect --> DP3
xspect --> DP4

subgraph Assembly_level_evaluation[Assemblies]
assemblies_clean --> AssClassify[Classify assemblies]
AssClassify --> AssSummary[Summarize assembly classifications]
AssSummary --> AssMatrices[Generate assembly confusion matrices]
end

xspect --> AssClassify

subgraph Read_level_evaluation[Reads]
assemblies_clean --> SelectReads[Select assemblies for read generation]
SelectReads --> SimReads[Generate simulated reads]
SimReads --> ReadsClassify[Classify simulated reads]
ReadsClassify --> ReadsSummary[Summarize read classifications]
ReadsSummary --> ReadMatrices[Generate read confusion matrix]
end

xspect --> ReadsClassify

AssSummary --> Stats[Calculate overall statistics]
ReadsSummary --> Stats
Stats --> Z([End])
```

The benchmark was performed by first downloading all available *Acinetobacter* genomes from RefSeq (latest version only, excluding atypical), keeping only those with a passed ("OK") taxonomy check status that were not part of the training dataset. Genomes assigned to strain IDs were remapped to their respective species IDs, after which genomes with species IDs not contained in the XspecT *Acinetobacter* model were removed. The remaining genomes were then used to classify both assemblies and simulated reads generated from them. Simulated reads were generated by first filtering on genomes that were categorized as "complete" or "chromosome" by NCBI. The reads were then simulated from the longest contig of each genome (assumed to be the chromosome) using InSilicoSeq. 100,000 reads were simulated for each genome based on the NovaSeq profile, with a read length of 150 bp. The reads were then classified using XspecT with predictions based on the maximum-scoring species.
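The filtering and strain-to-species remapping steps above can be sketched as follows. This is an illustrative sketch only: the record fields, taxon IDs, and set names are assumptions for the example, not XspecT's actual data structures or API.

```python
# Sketch of the metadata-filtering steps (illustrative only; taxon IDs and
# helper names are hypothetical, not XspecT's actual API).

# Example NCBI-style records: (accession, taxid, taxonomy_check_status)
assemblies = [
    ("GCF_000001", 470, "OK"),       # species-level taxon ID
    ("GCF_000002", 400667, "OK"),    # strain-level taxon ID
    ("GCF_000003", 106648, "Failed"),
]

# Strain-level taxon IDs are remapped to their parent species ID.
strain_to_species = {400667: 470}

# Species IDs covered by the (hypothetical) XspecT model, and accessions
# that were part of its training data.
model_species = {470, 106648}
training_accessions = {"GCF_000001"}

filtered = []
for accession, taxid, status in assemblies:
    if status != "OK":
        continue  # keep only assemblies that passed the taxonomy check
    species_id = strain_to_species.get(taxid, taxid)  # strain -> species
    if species_id not in model_species:
        continue  # drop species not represented in the model
    if accession in training_accessions:
        continue  # exclude assemblies used for training
    filtered.append((accession, species_id))

print(filtered)  # [('GCF_000002', 470)]
```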

## Benchmark Results

The benchmark results show that XspecT achieves very high classification accuracy of nearly 100% for whole genomes and strong but reduced accuracy of roughly 73% for simulated reads. However, the low macro-average F1 score (0.21) for the read dataset highlights a substantial class imbalance.

| Dataset | Total Samples | Matches | Mismatches | Match Rate | Mismatch Rate | Accuracy | Macro Avg F1 | Weighted Avg F1 |
|-----------|--------------:|----------:|-----------:|-----------:|--------------:|---------:|-------------:|----------------:|
| Assemblies| 13,795 | 13,776 | 19 | 99.86% | 0.14% | ≈1.00 | 0.96 | ≈1.00 |
| Reads | 121,800,000 | 88,368,547| 33,431,453 | 72.55% | 27.45% | 0.73 | 0.21 | 0.81 |
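The derived columns follow directly from the raw counts; as a quick sanity check (not part of the benchmark code), the read-level rates can be recomputed like this:

```python
# Recompute the read-level summary statistics from the raw counts above.
total_reads = 121_800_000
matches = 88_368_547
mismatches = total_reads - matches

match_rate = matches / total_reads
print(f"{match_rate:.2%}")  # 72.55%
print(f"{mismatches:,}")    # 33,431,453
```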

Counting instances in which the highest number of hits is shared by multiple species as abstentions, a selective accuracy of 82.80% is achieved for simulated reads, with a coverage of 87.63%. Rejection recall is 45.09%.
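These selective-classification terms can be made concrete with a toy example. The definitions below (coverage, selective accuracy, rejection recall) are the standard ones and are assumed here; the read tuples are invented for illustration:

```python
# Each read: (top_prediction, truth, abstained). A read is abstained when
# several species tie for the highest hit count.
reads = [
    ("A", "A", False),  # committed, correct
    ("A", "A", False),  # committed, correct
    ("B", "A", False),  # committed, wrong
    ("B", "A", True),   # would be wrong, correctly abstained
    ("A", "A", True),   # would be right, abstained anyway
]

# Coverage: fraction of reads on which the classifier commits to an answer.
committed = [r for r in reads if not r[2]]
coverage = len(committed) / len(reads)  # 0.6

# Selective accuracy: accuracy over the committed reads only.
selective_accuracy = sum(p == t for p, t, _ in committed) / len(committed)

# Rejection recall: share of would-be-wrong predictions that were abstained.
wrong = [r for r in reads if r[0] != r[1]]
rejection_recall = sum(a for _, _, a in wrong) / len(wrong)  # 0.5
```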

## Running the benchmark yourself

To benchmark XspecT performance yourself, you can use the Nextflow workflow provided in the `scripts/benchmark` directory. This workflow allows you to run XspecT on a set of samples and measure species classification accuracy on both genome assemblies and simulated reads.

Before you run the benchmark, you first need to download benchmarking data to the `data` directory, for example from NCBI. To do so, you can use the bash script in `scripts/benchmark-data` to download the data using the [NCBI Datasets CLI](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/command-line-tools/download-and-install/), which needs to be installed first. The script will download all available *Acinetobacter* genomes, as well as taxonomic data.

To run the benchmark, install [Nextflow](https://www.nextflow.io/docs/latest/install.html) and run the following command:

```bash
nextflow run scripts/benchmark
```

The workflow can be parameterized using the following flags/arguments:

- `--publishDir`: Directory to save benchmark results to (default: `results/benchmark`)
- `--xspectModel`: XspecT model to use for classification (default: `Acinetobacter`)
- `--excludedSpeciesIDs`: Comma-separated list of species IDs to exclude from the benchmark (default: none)
- `--maxForks`: Maximum number of parallel processes to use (default: 50)
- `--validate`: Whether to use mapping-based classification validation (default: false)
- `--seqPlatform`: InSilicoSeq profile to use for read simulation (default: `NovaSeq`)

Workflow results will be saved in the `results` directory:

- `results/classifications.tsv` for classifications of the assemblies
- `results/read_classifications.tsv` for classifications of the simulated reads
- `results/confusion_matrix.png` for a confusion matrix of genome assembly classifications
- `results/mismatches_confusion_matrix.png` for a confusion matrix filtered on mismatches of genome assembly classifications
- `results/stats.txt` for the statistics of the benchmark run
- `results/read_confusion_matrix.png` for a confusion matrix of simulated read classifications
- `results/stats.txt` for overall statistics of the benchmark run
5 changes: 5 additions & 0 deletions mkdocs.yml
@@ -10,6 +10,11 @@ plugins:
repo_url: https://github.com/BIONF/XspecT
markdown_extensions:
- attr_list
- pymdownx.superfences:
    custom_fences:
      - name: mermaid
        class: mermaid
        format: !!python/name:pymdownx.superfences.fence_code_format
nav:
- Home: index.md
- Quickstart: quickstart.md
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "XspecT"
version = "0.7.3"
description = "Tool to monitor and characterize pathogens using Bloom filters."
readme = {file = "README.md", content-type = "text/markdown"}
license = {file = "LICENSE"}
16 changes: 8 additions & 8 deletions scripts/benchmark/classify/main.nf
@@ -1,23 +1,23 @@
process classifySample {
    conda "./scripts/nextflow-utils/environment.yml"
    cpus 4
    memory '32 GB'
    errorStrategy 'retry'
    maxRetries 3
    maxForks params.maxForks

    input:
    path sample
    val model
    val excludedSpeciesIDs

    output:
    path "${sample.baseName}.json"

    script:
    def excludeOptions = excludedSpeciesIDs ? "--exclude-species ${excludedSpeciesIDs}" : ''
    def validateFlag = params.validate ? "--validation" : ''
    """
    xspect classify species -g ${model} -i ${sample} -o ${sample.baseName}.json ${excludeOptions} ${validateFlag}
    """
}
8 changes: 0 additions & 8 deletions scripts/benchmark/environment.yml

This file was deleted.
