v0.7.3 by aromberg · Pull Request #38 · BIONF/XspecT

aromberg · 2025-12-18T12:50:40Z

Add a CLI command to run the full XspecT pipeline
Add workflows for pan genome-informed training, score-based validation using SVM
Update benchmark & benchmark docs

Copilot

Pull request overview

This PR bumps XspecT to version 0.7.3 and introduces significant enhancements to the tool's functionality and benchmarking infrastructure. The key additions include a new CLI command for running the complete XspecT pipeline end-to-end, comprehensive Nextflow workflows for pan-genome-informed model training and SVM-based validation, plus substantial updates to the benchmark workflow and documentation.

Key Changes:

New xspect all CLI command that orchestrates genus filtering, species classification, and optional MLST typing in a single pipeline
Nextflow workflows for pan-genome analysis-based training (pangenome-train) and score-based validation using SVM (score-svm)
Enhanced benchmark workflow with support for abstention metrics, read simulation using InSilicoSeq, and improved confusion matrix generation

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
`tests/test_cli.py`	Adds comprehensive test coverage for the new "all" pipeline command with multiple test scenarios
`src/xspect/main.py`	Implements the new `all_pipeline` command that chains genus filtering, species classification, and MLST typing
`scripts/score-svm/main.nf`	New Nextflow workflow for downsampling classification data and training SVM models for validation
`scripts/score-svm/nextflow.config`	Configuration for the SVM validation workflow with SLURM executor settings
`scripts/pangenome-train/main.nf`	New comprehensive workflow for pan-genome-informed model training using PPanGGolin
`scripts/pangenome-train/nextflow.config`	Configuration for the pan-genome training workflow
`scripts/nextflow-utils/main.nf`	Adds reusable confusion matrix generation processes for benchmarking
`scripts/nextflow-utils/environment.yml`	Updates XspecT version dependency from 0.5.4 to 0.7.2
`scripts/benchmark/main.nf`	Major updates including abstention metrics, InSilicoSeq integration, and enhanced error handling
`scripts/benchmark/environment.yml`	Removes benchmark-specific environment file (now uses shared nextflow-utils environment)
`scripts/benchmark/classify/main.nf`	Updates classification process to use shared environment and adds retry logic
`docs/benchmark.md`	Comprehensive documentation updates with Mermaid flowchart and updated benchmark results
`pyproject.toml`	Version bump from 0.7.2 to 0.7.3
`mkdocs.yml`	Adds Mermaid diagram support for documentation
`.gitignore`	Adds exclusions for temporary files and old results

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-18T13:03:16Z

scripts/pangenome-train/main.nf

+    # Extract valid species-level tax IDs whose scientific names do NOT contain 'sp.' or 'Candidatus' (case-insensitive)
+    jq -r '.reports[]
+      | .taxonomy
+      | select((.current_scientific_name.name // "") | test("(?i)\\\\bCandidatus\\\\b|\\\\bsp\\\\.", "i") | not)


The regex pattern includes redundant flags. The comment says "case-insensitive" but the test function already has "i" flag specified, and there's an additional (?i) in the pattern itself. Remove the inline (?i) from the pattern since the flag is already passed as the second argument to test().

Copilot · 2025-12-18T13:03:17Z

docs/benchmark.md

+  Stats --> Z([End])
+```
+
+The benchmark was performed by first downloading all available *Acinetobacter* genomes from RefSeq (latest version only, excluding atypical), filtered on a passed ("OK") taxonomy check status and on them not being part of the training dataset. Genomes assigned to strain IDs were remapped to their respective species IDs, after which genomes with species IDs not contained in the XspecT *Acinetobacter* model were removed. The remaining genomes were then used to classify both assemblies and simulated reads generated from them. Simulated reads were generated by first filtering on genomes that were categorized as "complete" or "chromosome" by NCBI. The reads were then simulated from the longest contig of each genome (assumed to be the chromosome) using InSilicoSeq. 100 000 reads were simulated for each genome based on the NovaSeq profile, with a read length of 150 bp. The reads were then classified using XspecT with predictions based on the maximum-scoring species. 


For consistency with the table below that uses comma separators for large numbers, "100 000" should be written as "100,000" with a comma separator.

Suggested change

The benchmark was performed by first downloading all available *Acinetobacter* genomes from RefSeq (latest version only, excluding atypical), filtered on a passed ("OK") taxonomy check status and on them not being part of the training dataset. Genomes assigned to strain IDs were remapped to their respective species IDs, after which genomes with species IDs not contained in the XspecT *Acinetobacter* model were removed. The remaining genomes were then used to classify both assemblies and simulated reads generated from them. Simulated reads were generated by first filtering on genomes that were categorized as "complete" or "chromosome" by NCBI. The reads were then simulated from the longest contig of each genome (assumed to be the chromosome) using InSilicoSeq. 100 000 reads were simulated for each genome based on the NovaSeq profile, with a read length of 150 bp. The reads were then classified using XspecT with predictions based on the maximum-scoring species.

The benchmark was performed by first downloading all available *Acinetobacter* genomes from RefSeq (latest version only, excluding atypical), filtered on a passed ("OK") taxonomy check status and on them not being part of the training dataset. Genomes assigned to strain IDs were remapped to their respective species IDs, after which genomes with species IDs not contained in the XspecT *Acinetobacter* model were removed. The remaining genomes were then used to classify both assemblies and simulated reads generated from them. Simulated reads were generated by first filtering on genomes that were categorized as "complete" or "chromosome" by NCBI. The reads were then simulated from the longest contig of each genome (assumed to be the chromosome) using InSilicoSeq. 100,000 reads were simulated for each genome based on the NovaSeq profile, with a read length of 150 bp. The reads were then classified using XspecT with predictions based on the maximum-scoring species.

Copilot · 2025-12-18T13:03:17Z

docs/benchmark.md

+| Assemblies| 13,786        | 13,776    | 19         | 99.86%     | 0.14%         | ≈1.00    | 0.96         | ≈1.00           |
+| Reads     | 121,800,000   | 88,368,547| 33,431,453 | 72.55%     | 27.45%        | 0.73     | 0.21         | 0.81            |
+
+Counting instances in which the highest number of hits are shared by multiple species as abstentions, the a selective accuracy of 82.80% is achieved for simulated reads, with a coverage of 87.63%. Rejection recall is 45.09%.


Grammar error: "the a selective accuracy" should be "a selective accuracy" (remove "the").

Suggested change

Counting instances in which the highest number of hits are shared by multiple species as abstentions, the a selective accuracy of 82.80% is achieved for simulated reads, with a coverage of 87.63%. Rejection recall is 45.09%.

Counting instances in which the highest number of hits are shared by multiple species as abstentions, a selective accuracy of 82.80% is achieved for simulated reads, with a coverage of 87.63%. Rejection recall is 45.09%.

Copilot · 2025-12-18T13:03:17Z

scripts/score-svm/main.nf

+
+  df = pd.read_csv("${classifications}", sep="\\t")
+
+  # apply use_to_numeric to reduce memory usage


Typo in comment: "use_to_numeric" should be "to_numeric" (remove "use_").

Suggested change

# apply use_to_numeric to reduce memory usage

# apply to_numeric to reduce memory usage

Copilot · 2025-12-18T13:03:18Z

src/xspect/main.py

+@click.option(
+    "--sparse-sampling-step",
+    type=int,
+    help="Sparse sampling step (e. g. only every 500th kmer for '--sparse-sampling-step 500').",


Typo in help text: "e. g." should be "e.g." (remove space after the period).

Suggested change

help="Sparse sampling step (e. g. only every 500th kmer for '--sparse-sampling-step 500').",

help="Sparse sampling step (e.g. only every 500th kmer for '--sparse-sampling-step 500').",

Copilot · 2025-12-18T13:03:18Z

scripts/score-svm/main.nf

+      df[col] = pd.to_numeric(df[col], downcast='float')
+
+  # Same rows may already be rejected due to ambiguous classifications - filter them out
+  df = df[df["Rejected"] == False].copy()


Using direct comparison with boolean False is not recommended in pandas. Use ~df["Rejected"] or df["Rejected"] == False with the equality operator, but the most idiomatic approach is ~df["Rejected"] which is more pythonic and clearer.

Suggested change

df = df[df["Rejected"] == False].copy()

df = df[~df["Rejected"]].copy()

Copilot · 2025-12-18T13:03:18Z

scripts/benchmark/main.nf

+  iss generate \
+    --model ${params.seqPlatform} \
+    --genomes "${sample}" \
+    --n_reads 100000 \
+    --seed 42 \
+    --cpus ${task.cpus} \
+    --output "${sample.baseName}_simulated"


The generateReads process constructs the iss generate command using --model ${params.seqPlatform} without quoting or escaping the seqPlatform parameter. A malicious value for params.seqPlatform (e.g., NovaSeq; malicious_cmd) would be inserted verbatim into the shell command, allowing arbitrary command execution with the task's privileges. To avoid this command injection risk, ensure params.seqPlatform is safely quoted/escaped or otherwise passed to iss in a way that the shell cannot interpret embedded metacharacters as additional commands.

Copilot · 2025-12-18T13:03:19Z

scripts/benchmark/classify/main.nf

+  def excludeOptions = excludedSpeciesIDs ? "--exclude-species ${excludedSpeciesIDs}" : ''
+  def validateFlag = params.validate ? "--validation" : ''
  """
-  xspect classify species -g ${model} -i ${sample} -o ${sample.baseName}.json
-  """
-
-  stub:
-  """
-  mkdir -p results
-  touch results/${sample.baseName}.json
+  xspect classify species -g ${model} -i ${sample} -o ${sample.baseName}.json ${excludeOptions} ${validateFlag}


The classifySample process interpolates model, sample, and excludeOptions directly into a shell command (xspect classify ...) without quoting or escaping, and excludeOptions is built from the excludedSpeciesIDs parameter. An attacker who can control params.xspectModel or params.excludedSpeciesIDs (for example via nextflow run ... --xspectModel 'Acinetobacter; malicious_cmd' or --excludedSpeciesIDs '1; malicious_cmd') can inject arbitrary shell commands that run with the Nextflow task's privileges. To prevent command injection, ensure these values are safely quoted/escaped (or passed via safe Nextflow interpolation mechanisms or environment variables) so that they are treated as data, not executable shell syntax.

Copilot · 2025-12-18T13:03:19Z

scripts/benchmark/main.nf

+  excluded_species="${excludedSpeciesIDs}"
+  if [ -n "\$excluded_species" ]; then
+    awk -F'\t' -v excluded="\$excluded_species" '
+      BEGIN {
+        # split on commas into array arr
+        n = split(excluded, arr, /,/);
+        for (i = 1; i <= n; i++) {
+          if (arr[i] != "") {
+            excluded_map[arr[i]] = 1;
+          }
+        }
+      }
+      NR==1 { print; next }
+      !(\$6 in excluded_map) { print }
+    ' assemblies.tsv > temp_assemblies.tsv
+    mv temp_assemblies.tsv assemblies.tsv


In createAssemblyTable, the line excluded_species="${excludedSpeciesIDs}" embeds the excludedSpeciesIDs parameter directly into the bash script, and the subsequent awk invocation uses this shell variable, all without proper quoting or escaping. If params.excludedSpeciesIDs contains shell metacharacters (e.g., "; rm -rf /; #), the generated script line will break out of the string and execute arbitrary commands with the workflow user's privileges before awk runs. Treat excludedSpeciesIDs strictly as data by safely quoting/escaping it (or passing it via a safer mechanism such as environment variables or Nextflow's protected interpolation) so that it cannot alter the shell command structure.

aromberg added 7 commits December 3, 2025 16:48

add a CLI command to run the full XspecT pipeline

3511498

Train models using gene families from ppanggolin

53bcb75

update benchmark

ec83b12

add svm workflow for score-based validation

e6c6a54

update pangenome-informed training workflow

2de17f4

update gitignore

2285845

bump version

257dcc2

Copilot AI review requested due to automatic review settings December 18, 2025 12:50

aromberg merged commit 4fb5e64 into main Dec 18, 2025
7 checks passed

Copilot started reviewing on behalf of aromberg December 18, 2025 12:51 View session

Copilot AI reviewed Dec 18, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.7.3#38

v0.7.3#38
aromberg merged 7 commits intomainfrom
dev

aromberg commented Dec 18, 2025

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Dec 18, 2025

Uh oh!

Copilot AI Dec 18, 2025

Uh oh!

Copilot AI Dec 18, 2025

Uh oh!

Copilot AI Dec 18, 2025

Uh oh!

Copilot AI Dec 18, 2025

Uh oh!

Copilot AI Dec 18, 2025

Uh oh!

Copilot AI Dec 18, 2025

Uh oh!

Copilot AI Dec 18, 2025

Uh oh!

Copilot AI Dec 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	Counting instances in which the highest number of hits are shared by multiple species as abstentions, the a selective accuracy of 82.80% is achieved for simulated reads, with a coverage of 87.63%. Rejection recall is 45.09%.
	Counting instances in which the highest number of hits are shared by multiple species as abstentions, a selective accuracy of 82.80% is achieved for simulated reads, with a coverage of 87.63%. Rejection recall is 45.09%.


		df = pd.read_csv("${classifications}", sep="\\t")

		# apply use_to_numeric to reduce memory usage

	# apply use_to_numeric to reduce memory usage
	# apply to_numeric to reduce memory usage

	help="Sparse sampling step (e. g. only every 500th kmer for '--sparse-sampling-step 500').",
	help="Sparse sampling step (e.g. only every 500th kmer for '--sparse-sampling-step 500').",

	df = df[df["Rejected"] == False].copy()
	df = df[~df["Rejected"]].copy()

Conversation

aromberg commented Dec 18, 2025

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants