# v0.7.3 #38
Changes from all commits: 3511498, 53bcb75, ec83b12, e6c6a54, 2de17f4, 2285845, 257dcc2
Added entries (hunk context: `delete.slurm`):

```
@@ -192,3 +192,7 @@ delete.slurm
playground.ipynb
old_results/
svm.ipynb
zip.slurm
scripts/benchmarks.sh
```
# Benchmark
XspecT is a tool designed for fast and accurate species classification of genome assemblies and simulated reads. To evaluate its classification accuracy, we conducted a benchmark using a set of *Acinetobacter* genomes.
```mermaid
flowchart TD
    A([Start]) --> DL[Download genomes]

    genomes[(NCBI RefSeq<br/>assemblies<br/>latest, non-atypical)]
    meta@{ shape: doc, label: "Assembly metadata" }
    tax@{ shape: doc, label: "Taxonomy report" }
    xspect@{ shape: doc, label: "XspecT model" }

    genomes --> DL
    genomes --> meta

    subgraph Data_preparation[Data Preparation]
        DP1[Keep only assemblies with OK taxonomy check status]
        DP2[Map taxonomy IDs:<br/>strain → species]
        DP3[Remove species IDs not in XspecT model]
        DP4[Remove assemblies used for model training]
    end

    DL --> DP1
    meta --> DP1
    tax --> DP2
    DP1 --> DP2
    DP2 --> DP3
    DP3 --> DP4
    DP4 --> assemblies_clean@{ shape: docs, label: "Filtered assemblies" }
    xspect --> DP3
    xspect --> DP4

    subgraph Assembly_level_evaluation[Assemblies]
        assemblies_clean --> AssClassify[Classify assemblies]
        AssClassify --> AssSummary[Summarize assembly classifications]
        AssSummary --> AssMatrices[Generate assembly confusion matrices]
    end

    xspect --> AssClassify

    subgraph Read_level_evaluation[Reads]
        assemblies_clean --> SelectReads[Select assemblies for read generation]
        SelectReads --> SimReads[Generate simulated reads]
        SimReads --> ReadsClassify[Classify simulated reads]
        ReadsClassify --> ReadsSummary[Summarize read classifications]
        ReadsSummary --> ReadMatrices[Generate read confusion matrix]
    end

    xspect --> ReadsClassify

    AssSummary --> Stats[Calculate overall statistics]
    ReadsSummary --> Stats
    Stats --> Z([End])
```
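The data-preparation stage of the diagram can be sketched in a few lines of Python. This is only an illustration of the filtering logic; the field names (`taxonomy_check_status`, `taxid`, `accession`) and the ID sets are hypothetical placeholders, not XspecT's actual metadata schema.

```python
def prepare_assemblies(assemblies, strain_to_species, model_species_ids, training_ids):
    """Apply the four data-preparation filters from the diagram (illustrative only)."""
    kept = []
    for asm in assemblies:
        # 1. Keep only assemblies that passed NCBI's taxonomy check.
        if asm["taxonomy_check_status"] != "OK":
            continue
        # 2. Remap strain-level taxonomy IDs to their species ID.
        species_id = strain_to_species.get(asm["taxid"], asm["taxid"])
        # 3. Drop species not covered by the XspecT model.
        if species_id not in model_species_ids:
            continue
        # 4. Drop assemblies that were used for model training.
        if asm["accession"] in training_ids:
            continue
        kept.append({**asm, "taxid": species_id})
    return kept
```

For example, an assembly with a failed taxonomy check, one used in training, and one whose species is absent from the model would all be removed, while strain-level IDs in the remainder are collapsed to species level.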
The benchmark was performed by first downloading all available *Acinetobacter* genomes from RefSeq (latest version only, excluding atypical), filtered on a passed ("OK") taxonomy check status and on not being part of the training dataset. Genomes assigned to strain IDs were remapped to their respective species IDs, after which genomes with species IDs not contained in the XspecT *Acinetobacter* model were removed. The remaining genomes were then used to classify both assemblies and simulated reads generated from them. Simulated reads were generated by first filtering on genomes that were categorized as "complete" or "chromosome" by NCBI. The reads were then simulated from the longest contig of each genome (assumed to be the chromosome) using InSilicoSeq. 100,000 reads were simulated for each genome based on the NovaSeq profile, with a read length of 150 bp. The reads were then classified using XspecT, with predictions based on the maximum-scoring species.
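The longest-contig selection described above can be sketched in plain Python. The FASTA parsing here is a minimal stand-in (a real pipeline would more likely use a library such as Biopython), and the InSilicoSeq invocation is shown only as a hedged comment, not the benchmark's exact command line.

```python
def longest_contig(fasta_text):
    """Return (header, sequence) of the longest record in a FASTA string."""
    records = []
    header, seq = None, []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:].strip(), []
        else:
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    # The longest record is assumed to be the chromosome.
    return max(records, key=lambda r: len(r[1]))

# Reads would then be simulated from the selected contig with InSilicoSeq,
# along the lines of (assumed invocation):
#   iss generate --genomes chromosome.fasta --model novaseq \
#       --n_reads 100000 --output simulated_reads
```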
## Benchmark Results

The benchmark results show that XspecT achieves very high classification accuracy of nearly 100% for whole genomes and strong but reduced accuracy of about 73% for simulated reads. However, the low macro-average F1 score (0.21) for the read dataset highlights a substantial class imbalance.
| Dataset    | Total Samples | Matches    | Mismatches | Match Rate | Mismatch Rate | Accuracy | Macro Avg F1 | Weighted Avg F1 |
|------------|--------------:|-----------:|-----------:|-----------:|--------------:|---------:|-------------:|----------------:|
| Assemblies | 13,786        | 13,776     | 19         | 99.86%     | 0.14%         | ≈1.00    | 0.96         | ≈1.00           |
| Reads      | 121,800,000   | 88,368,547 | 33,431,453 | 72.55%     | 27.45%        | 0.73     | 0.21         | 0.81            |
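The rate columns follow directly from the count columns; as a quick sanity check on the read row (simple arithmetic, not part of the benchmark code):

```python
def rates(total, matches):
    """Return (match_rate, mismatch_rate) as percentages rounded to 2 decimals."""
    mismatches = total - matches
    return round(100 * matches / total, 2), round(100 * mismatches / total, 2)

# Read row from the table above:
match_rate, mismatch_rate = rates(121_800_000, 88_368_547)
# match_rate == 72.55, mismatch_rate == 27.45
```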
Counting instances in which the highest number of hits are shared by multiple species as abstentions, a selective accuracy of 82.80% is achieved for simulated reads, with a coverage of 87.63%. Rejection recall is 45.09%.
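The abstention rule can be illustrated with a toy example. The per-read hit counts and species names below are invented; only the decision rule (abstain when the top score is tied, otherwise predict the maximum-scoring species) comes from the description above.

```python
def evaluate_selective(reads):
    """reads: list of (true_species, {species: hit_count}) pairs.

    A read counts as an abstention when the top hit count is shared by
    multiple species; otherwise the maximum-scoring species is predicted.
    Returns (selective_accuracy, coverage).
    """
    answered = correct = 0
    for truth, hits in reads:
        top = max(hits.values())
        winners = [sp for sp, n in hits.items() if n == top]
        if len(winners) > 1:
            continue  # tie -> abstain
        answered += 1
        correct += winners[0] == truth
    coverage = answered / len(reads)
    selective_accuracy = correct / answered if answered else 0.0
    return selective_accuracy, coverage

# Toy data (invented):
reads = [
    ("baumannii", {"baumannii": 10, "pittii": 3}),  # answered, correct
    ("pittii",    {"baumannii": 5,  "pittii": 5}),  # tie -> abstain
    ("pittii",    {"baumannii": 7,  "pittii": 2}),  # answered, wrong
]
acc, cov = evaluate_selective(reads)
# acc == 0.5, cov == 2/3
```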
The classification step is implemented as a Nextflow process:
```nextflow
process classifySample {
    conda "./scripts/nextflow-utils/environment.yml"
    cpus 4
    memory '32 GB'
    errorStrategy 'retry'
    maxRetries 3
    maxForks params.maxForks

    input:
    path sample
    val model
    val excludedSpeciesIDs

    output:
    path "${sample.baseName}.json"

    script:
    def excludeOptions = excludedSpeciesIDs ? "--exclude-species ${excludedSpeciesIDs}" : ''
    def validateFlag = params.validate ? "--validation" : ''
    """
    xspect classify species -g ${model} -i ${sample} -o ${sample.baseName}.json ${excludeOptions} ${validateFlag}
    """
}
```
This file was deleted.

Review comment: For consistency with the table below that uses comma separators for large numbers, "100 000" should be written as "100,000" with a comma separator.