Conversation
aromberg
commented
Dec 18, 2025
- Add a CLI command to run the full XspecT pipeline
- Add workflows for pan genome-informed training, score-based validation using SVM
- Update benchmark & benchmark docs
There was a problem hiding this comment.
Pull request overview
This PR bumps XspecT to version 0.7.3 and introduces significant enhancements to the tool's functionality and benchmarking infrastructure. The key additions include a new CLI command for running the complete XspecT pipeline end-to-end, comprehensive Nextflow workflows for pan-genome-informed model training and SVM-based validation, plus substantial updates to the benchmark workflow and documentation.
Key Changes:
- New
xspect allCLI command that orchestrates genus filtering, species classification, and optional MLST typing in a single pipeline - Nextflow workflows for pan-genome analysis-based training (
pangenome-train) and score-based validation using SVM (score-svm) - Enhanced benchmark workflow with support for abstention metrics, read simulation using InSilicoSeq, and improved confusion matrix generation
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
tests/test_cli.py |
Adds comprehensive test coverage for the new "all" pipeline command with multiple test scenarios |
src/xspect/main.py |
Implements the new all_pipeline command that chains genus filtering, species classification, and MLST typing |
scripts/score-svm/main.nf |
New Nextflow workflow for downsampling classification data and training SVM models for validation |
scripts/score-svm/nextflow.config |
Configuration for the SVM validation workflow with SLURM executor settings |
scripts/pangenome-train/main.nf |
New comprehensive workflow for pan-genome-informed model training using PPanGGolin |
scripts/pangenome-train/nextflow.config |
Configuration for the pan-genome training workflow |
scripts/nextflow-utils/main.nf |
Adds reusable confusion matrix generation processes for benchmarking |
scripts/nextflow-utils/environment.yml |
Updates XspecT version dependency from 0.5.4 to 0.7.2 |
scripts/benchmark/main.nf |
Major updates including abstention metrics, InSilicoSeq integration, and enhanced error handling |
scripts/benchmark/environment.yml |
Removes benchmark-specific environment file (now uses shared nextflow-utils environment) |
scripts/benchmark/classify/main.nf |
Updates classification process to use shared environment and adds retry logic |
docs/benchmark.md |
Comprehensive documentation updates with Mermaid flowchart and updated benchmark results |
pyproject.toml |
Version bump from 0.7.2 to 0.7.3 |
mkdocs.yml |
Adds Mermaid diagram support for documentation |
.gitignore |
Adds exclusions for temporary files and old results |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Extract valid species-level tax IDs whose scientific names do NOT contain 'sp.' or 'Candidatus' (case-insensitive) | ||
| jq -r '.reports[] | ||
| | .taxonomy | ||
| | select((.current_scientific_name.name // "") | test("(?i)\\\\bCandidatus\\\\b|\\\\bsp\\\\.", "i") | not) |
There was a problem hiding this comment.
The regex pattern includes redundant flags. The comment says "case-insensitive" but the test function already has "i" flag specified, and there's an additional (?i) in the pattern itself. Remove the inline (?i) from the pattern since the flag is already passed as the second argument to test().
| Stats --> Z([End]) | ||
| ``` | ||
|
|
||
| The benchmark was performed by first downloading all available *Acinetobacter* genomes from RefSeq (latest version only, excluding atypical), filtered on a passed ("OK") taxonomy check status and on them not being part of the training dataset. Genomes assigned to strain IDs were remapped to their respective species IDs, after which genomes with species IDs not contained in the XspecT *Acinetobacter* model were removed. The remaining genomes were then used to classify both assemblies and simulated reads generated from them. Simulated reads were generated by first filtering on genomes that were categorized as "complete" or "chromosome" by NCBI. The reads were then simulated from the longest contig of each genome (assumed to be the chromosome) using InSilicoSeq. 100 000 reads were simulated for each genome based on the NovaSeq profile, with a read length of 150 bp. The reads were then classified using XspecT with predictions based on the maximum-scoring species. |
There was a problem hiding this comment.
For consistency with the table below that uses comma separators for large numbers, "100 000" should be written as "100,000" with a comma separator.
| The benchmark was performed by first downloading all available *Acinetobacter* genomes from RefSeq (latest version only, excluding atypical), filtered on a passed ("OK") taxonomy check status and on them not being part of the training dataset. Genomes assigned to strain IDs were remapped to their respective species IDs, after which genomes with species IDs not contained in the XspecT *Acinetobacter* model were removed. The remaining genomes were then used to classify both assemblies and simulated reads generated from them. Simulated reads were generated by first filtering on genomes that were categorized as "complete" or "chromosome" by NCBI. The reads were then simulated from the longest contig of each genome (assumed to be the chromosome) using InSilicoSeq. 100 000 reads were simulated for each genome based on the NovaSeq profile, with a read length of 150 bp. The reads were then classified using XspecT with predictions based on the maximum-scoring species. | |
| The benchmark was performed by first downloading all available *Acinetobacter* genomes from RefSeq (latest version only, excluding atypical), filtered on a passed ("OK") taxonomy check status and on them not being part of the training dataset. Genomes assigned to strain IDs were remapped to their respective species IDs, after which genomes with species IDs not contained in the XspecT *Acinetobacter* model were removed. The remaining genomes were then used to classify both assemblies and simulated reads generated from them. Simulated reads were generated by first filtering on genomes that were categorized as "complete" or "chromosome" by NCBI. The reads were then simulated from the longest contig of each genome (assumed to be the chromosome) using InSilicoSeq. 100,000 reads were simulated for each genome based on the NovaSeq profile, with a read length of 150 bp. The reads were then classified using XspecT with predictions based on the maximum-scoring species. |
| | Assemblies| 13,786 | 13,776 | 19 | 99.86% | 0.14% | ≈1.00 | 0.96 | ≈1.00 | | ||
| | Reads | 121,800,000 | 88,368,547| 33,431,453 | 72.55% | 27.45% | 0.73 | 0.21 | 0.81 | | ||
|
|
||
| Counting instances in which the highest number of hits are shared by multiple species as abstentions, the a selective accuracy of 82.80% is achieved for simulated reads, with a coverage of 87.63%. Rejection recall is 45.09%. |
There was a problem hiding this comment.
Grammar error: "the a selective accuracy" should be "a selective accuracy" (remove "the").
| Counting instances in which the highest number of hits are shared by multiple species as abstentions, the a selective accuracy of 82.80% is achieved for simulated reads, with a coverage of 87.63%. Rejection recall is 45.09%. | |
| Counting instances in which the highest number of hits are shared by multiple species as abstentions, a selective accuracy of 82.80% is achieved for simulated reads, with a coverage of 87.63%. Rejection recall is 45.09%. |
|
|
||
| df = pd.read_csv("${classifications}", sep="\\t") | ||
|
|
||
| # apply use_to_numeric to reduce memory usage |
There was a problem hiding this comment.
Typo in comment: "use_to_numeric" should be "to_numeric" (remove "use_").
| # apply use_to_numeric to reduce memory usage | |
| # apply to_numeric to reduce memory usage |
| @click.option( | ||
| "--sparse-sampling-step", | ||
| type=int, | ||
| help="Sparse sampling step (e. g. only every 500th kmer for '--sparse-sampling-step 500').", |
There was a problem hiding this comment.
Typo in help text: "e. g." should be "e.g." (remove space after the period).
| help="Sparse sampling step (e. g. only every 500th kmer for '--sparse-sampling-step 500').", | |
| help="Sparse sampling step (e.g. only every 500th kmer for '--sparse-sampling-step 500').", |
| df[col] = pd.to_numeric(df[col], downcast='float') | ||
|
|
||
| # Same rows may already be rejected due to ambiguous classifications - filter them out | ||
| df = df[df["Rejected"] == False].copy() |
There was a problem hiding this comment.
Using direct comparison with boolean False is not recommended in pandas. Use ~df["Rejected"] or df["Rejected"] == False with the equality operator, but the most idiomatic approach is ~df["Rejected"] which is more pythonic and clearer.
| df = df[df["Rejected"] == False].copy() | |
| df = df[~df["Rejected"]].copy() |
| iss generate \ | ||
| --model ${params.seqPlatform} \ | ||
| --genomes "${sample}" \ | ||
| --n_reads 100000 \ | ||
| --seed 42 \ | ||
| --cpus ${task.cpus} \ | ||
| --output "${sample.baseName}_simulated" |
There was a problem hiding this comment.
The generateReads process constructs the iss generate command using --model ${params.seqPlatform} without quoting or escaping the seqPlatform parameter. A malicious value for params.seqPlatform (e.g., NovaSeq; malicious_cmd) would be inserted verbatim into the shell command, allowing arbitrary command execution with the task's privileges. To avoid this command injection risk, ensure params.seqPlatform is safely quoted/escaped or otherwise passed to iss in a way that the shell cannot interpret embedded metacharacters as additional commands.
| def excludeOptions = excludedSpeciesIDs ? "--exclude-species ${excludedSpeciesIDs}" : '' | ||
| def validateFlag = params.validate ? "--validation" : '' | ||
| """ | ||
| xspect classify species -g ${model} -i ${sample} -o ${sample.baseName}.json | ||
| """ | ||
|
|
||
| stub: | ||
| """ | ||
| mkdir -p results | ||
| touch results/${sample.baseName}.json | ||
| xspect classify species -g ${model} -i ${sample} -o ${sample.baseName}.json ${excludeOptions} ${validateFlag} |
There was a problem hiding this comment.
The classifySample process interpolates model, sample, and excludeOptions directly into a shell command (xspect classify ...) without quoting or escaping, and excludeOptions is built from the excludedSpeciesIDs parameter. An attacker who can control params.xspectModel or params.excludedSpeciesIDs (for example via nextflow run ... --xspectModel 'Acinetobacter; malicious_cmd' or --excludedSpeciesIDs '1; malicious_cmd') can inject arbitrary shell commands that run with the Nextflow task's privileges. To prevent command injection, ensure these values are safely quoted/escaped (or passed via safe Nextflow interpolation mechanisms or environment variables) so that they are treated as data, not executable shell syntax.
| excluded_species="${excludedSpeciesIDs}" | ||
| if [ -n "\$excluded_species" ]; then | ||
| awk -F'\t' -v excluded="\$excluded_species" ' | ||
| BEGIN { | ||
| # split on commas into array arr | ||
| n = split(excluded, arr, /,/); | ||
| for (i = 1; i <= n; i++) { | ||
| if (arr[i] != "") { | ||
| excluded_map[arr[i]] = 1; | ||
| } | ||
| } | ||
| } | ||
| NR==1 { print; next } | ||
| !(\$6 in excluded_map) { print } | ||
| ' assemblies.tsv > temp_assemblies.tsv | ||
| mv temp_assemblies.tsv assemblies.tsv |
There was a problem hiding this comment.
In createAssemblyTable, the line excluded_species="${excludedSpeciesIDs}" embeds the excludedSpeciesIDs parameter directly into the bash script, and the subsequent awk invocation uses this shell variable, all without proper quoting or escaping. If params.excludedSpeciesIDs contains shell metacharacters (e.g., "; rm -rf /; #), the generated script line will break out of the string and execute arbitrary commands with the workflow user's privileges before awk runs. Treat excludedSpeciesIDs strictly as data by safely quoting/escaping it (or passing it via a safer mechanism such as environment variables or Nextflow's protected interpolation) so that it cannot alter the shell command structure.