Skip to content

v0.7.3#38

Merged
aromberg merged 7 commits intomainfrom
dev
Dec 18, 2025
Merged

v0.7.3#38
aromberg merged 7 commits intomainfrom
dev

Conversation

@aromberg
Copy link
Collaborator

  • Add a CLI command to run the full XspecT pipeline
  • Add workflows for pan genome-informed training, score-based validation using SVM
  • Update benchmark & benchmark docs

Copilot AI review requested due to automatic review settings December 18, 2025 12:50
@aromberg aromberg merged commit 4fb5e64 into main Dec 18, 2025
7 checks passed
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR bumps XspecT to version 0.7.3 and introduces significant enhancements to the tool's functionality and benchmarking infrastructure. The key additions include a new CLI command for running the complete XspecT pipeline end-to-end, comprehensive Nextflow workflows for pan-genome-informed model training and SVM-based validation, plus substantial updates to the benchmark workflow and documentation.

Key Changes:

  • New xspect all CLI command that orchestrates genus filtering, species classification, and optional MLST typing in a single pipeline
  • Nextflow workflows for pan-genome analysis-based training (pangenome-train) and score-based validation using SVM (score-svm)
  • Enhanced benchmark workflow with support for abstention metrics, read simulation using InSilicoSeq, and improved confusion matrix generation

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/test_cli.py Adds comprehensive test coverage for the new "all" pipeline command with multiple test scenarios
src/xspect/main.py Implements the new all_pipeline command that chains genus filtering, species classification, and MLST typing
scripts/score-svm/main.nf New Nextflow workflow for downsampling classification data and training SVM models for validation
scripts/score-svm/nextflow.config Configuration for the SVM validation workflow with SLURM executor settings
scripts/pangenome-train/main.nf New comprehensive workflow for pan-genome-informed model training using PPanGGolin
scripts/pangenome-train/nextflow.config Configuration for the pan-genome training workflow
scripts/nextflow-utils/main.nf Adds reusable confusion matrix generation processes for benchmarking
scripts/nextflow-utils/environment.yml Updates XspecT version dependency from 0.5.4 to 0.7.2
scripts/benchmark/main.nf Major updates including abstention metrics, InSilicoSeq integration, and enhanced error handling
scripts/benchmark/environment.yml Removes benchmark-specific environment file (now uses shared nextflow-utils environment)
scripts/benchmark/classify/main.nf Updates classification process to use shared environment and adds retry logic
docs/benchmark.md Comprehensive documentation updates with Mermaid flowchart and updated benchmark results
pyproject.toml Version bump from 0.7.2 to 0.7.3
mkdocs.yml Adds Mermaid diagram support for documentation
.gitignore Adds exclusions for temporary files and old results

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# Extract valid species-level tax IDs whose scientific names do NOT contain 'sp.' or 'Candidatus' (case-insensitive)
jq -r '.reports[]
| .taxonomy
| select((.current_scientific_name.name // "") | test("(?i)\\\\bCandidatus\\\\b|\\\\bsp\\\\.", "i") | not)
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex pattern includes redundant flags. The comment says "case-insensitive" but the test function already has "i" flag specified, and there's an additional (?i) in the pattern itself. Remove the inline (?i) from the pattern since the flag is already passed as the second argument to test().

Copilot uses AI. Check for mistakes.
Stats --> Z([End])
```

The benchmark was performed by first downloading all available *Acinetobacter* genomes from RefSeq (latest version only, excluding atypical), filtered on a passed ("OK") taxonomy check status and on them not being part of the training dataset. Genomes assigned to strain IDs were remapped to their respective species IDs, after which genomes with species IDs not contained in the XspecT *Acinetobacter* model were removed. The remaining genomes were then used to classify both assemblies and simulated reads generated from them. Simulated reads were generated by first filtering on genomes that were categorized as "complete" or "chromosome" by NCBI. The reads were then simulated from the longest contig of each genome (assumed to be the chromosome) using InSilicoSeq. 100 000 reads were simulated for each genome based on the NovaSeq profile, with a read length of 150 bp. The reads were then classified using XspecT with predictions based on the maximum-scoring species.
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with the table below that uses comma separators for large numbers, "100 000" should be written as "100,000" with a comma separator.

Suggested change
The benchmark was performed by first downloading all available *Acinetobacter* genomes from RefSeq (latest version only, excluding atypical), filtered on a passed ("OK") taxonomy check status and on them not being part of the training dataset. Genomes assigned to strain IDs were remapped to their respective species IDs, after which genomes with species IDs not contained in the XspecT *Acinetobacter* model were removed. The remaining genomes were then used to classify both assemblies and simulated reads generated from them. Simulated reads were generated by first filtering on genomes that were categorized as "complete" or "chromosome" by NCBI. The reads were then simulated from the longest contig of each genome (assumed to be the chromosome) using InSilicoSeq. 100 000 reads were simulated for each genome based on the NovaSeq profile, with a read length of 150 bp. The reads were then classified using XspecT with predictions based on the maximum-scoring species.
The benchmark was performed by first downloading all available *Acinetobacter* genomes from RefSeq (latest version only, excluding atypical), filtered on a passed ("OK") taxonomy check status and on them not being part of the training dataset. Genomes assigned to strain IDs were remapped to their respective species IDs, after which genomes with species IDs not contained in the XspecT *Acinetobacter* model were removed. The remaining genomes were then used to classify both assemblies and simulated reads generated from them. Simulated reads were generated by first filtering on genomes that were categorized as "complete" or "chromosome" by NCBI. The reads were then simulated from the longest contig of each genome (assumed to be the chromosome) using InSilicoSeq. 100,000 reads were simulated for each genome based on the NovaSeq profile, with a read length of 150 bp. The reads were then classified using XspecT with predictions based on the maximum-scoring species.

Copilot uses AI. Check for mistakes.
| Assemblies| 13,786 | 13,776 | 19 | 99.86% | 0.14% | ≈1.00 | 0.96 | ≈1.00 |
| Reads | 121,800,000 | 88,368,547| 33,431,453 | 72.55% | 27.45% | 0.73 | 0.21 | 0.81 |

Counting instances in which the highest number of hits are shared by multiple species as abstentions, the a selective accuracy of 82.80% is achieved for simulated reads, with a coverage of 87.63%. Rejection recall is 45.09%.
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar error: "the a selective accuracy" should be "a selective accuracy" (remove "the").

Suggested change
Counting instances in which the highest number of hits are shared by multiple species as abstentions, the a selective accuracy of 82.80% is achieved for simulated reads, with a coverage of 87.63%. Rejection recall is 45.09%.
Counting instances in which the highest number of hits are shared by multiple species as abstentions, a selective accuracy of 82.80% is achieved for simulated reads, with a coverage of 87.63%. Rejection recall is 45.09%.

Copilot uses AI. Check for mistakes.

df = pd.read_csv("${classifications}", sep="\\t")

# apply use_to_numeric to reduce memory usage
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in comment: "use_to_numeric" should be "to_numeric" (remove "use_").

Suggested change
# apply use_to_numeric to reduce memory usage
# apply to_numeric to reduce memory usage

Copilot uses AI. Check for mistakes.
@click.option(
"--sparse-sampling-step",
type=int,
help="Sparse sampling step (e. g. only every 500th kmer for '--sparse-sampling-step 500').",
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in help text: "e. g." should be "e.g." (remove space after the period).

Suggested change
help="Sparse sampling step (e. g. only every 500th kmer for '--sparse-sampling-step 500').",
help="Sparse sampling step (e.g. only every 500th kmer for '--sparse-sampling-step 500').",

Copilot uses AI. Check for mistakes.
df[col] = pd.to_numeric(df[col], downcast='float')

# Same rows may already be rejected due to ambiguous classifications - filter them out
df = df[df["Rejected"] == False].copy()
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using direct comparison with boolean False is not recommended in pandas. Use ~df["Rejected"] or df["Rejected"] == False with the equality operator, but the most idiomatic approach is ~df["Rejected"] which is more pythonic and clearer.

Suggested change
df = df[df["Rejected"] == False].copy()
df = df[~df["Rejected"]].copy()

Copilot uses AI. Check for mistakes.
Comment on lines +359 to +365
iss generate \
--model ${params.seqPlatform} \
--genomes "${sample}" \
--n_reads 100000 \
--seed 42 \
--cpus ${task.cpus} \
--output "${sample.baseName}_simulated"
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generateReads process constructs the iss generate command using --model ${params.seqPlatform} without quoting or escaping the seqPlatform parameter. A malicious value for params.seqPlatform (e.g., NovaSeq; malicious_cmd) would be inserted verbatim into the shell command, allowing arbitrary command execution with the task's privileges. To avoid this command injection risk, ensure params.seqPlatform is safely quoted/escaped or otherwise passed to iss in a way that the shell cannot interpret embedded metacharacters as additional commands.

Copilot uses AI. Check for mistakes.
Comment on lines +18 to +21
def excludeOptions = excludedSpeciesIDs ? "--exclude-species ${excludedSpeciesIDs}" : ''
def validateFlag = params.validate ? "--validation" : ''
"""
xspect classify species -g ${model} -i ${sample} -o ${sample.baseName}.json
"""

stub:
"""
mkdir -p results
touch results/${sample.baseName}.json
xspect classify species -g ${model} -i ${sample} -o ${sample.baseName}.json ${excludeOptions} ${validateFlag}
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The classifySample process interpolates model, sample, and excludeOptions directly into a shell command (xspect classify ...) without quoting or escaping, and excludeOptions is built from the excludedSpeciesIDs parameter. An attacker who can control params.xspectModel or params.excludedSpeciesIDs (for example via nextflow run ... --xspectModel 'Acinetobacter; malicious_cmd' or --excludedSpeciesIDs '1; malicious_cmd') can inject arbitrary shell commands that run with the Nextflow task's privileges. To prevent command injection, ensure these values are safely quoted/escaped (or passed via safe Nextflow interpolation mechanisms or environment variables) so that they are treated as data, not executable shell syntax.

Copilot uses AI. Check for mistakes.
Comment on lines +194 to +209
excluded_species="${excludedSpeciesIDs}"
if [ -n "\$excluded_species" ]; then
awk -F'\t' -v excluded="\$excluded_species" '
BEGIN {
# split on commas into array arr
n = split(excluded, arr, /,/);
for (i = 1; i <= n; i++) {
if (arr[i] != "") {
excluded_map[arr[i]] = 1;
}
}
}
NR==1 { print; next }
!(\$6 in excluded_map) { print }
' assemblies.tsv > temp_assemblies.tsv
mv temp_assemblies.tsv assemblies.tsv
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In createAssemblyTable, the line excluded_species="${excludedSpeciesIDs}" embeds the excludedSpeciesIDs parameter directly into the bash script, and the subsequent awk invocation uses this shell variable, all without proper quoting or escaping. If params.excludedSpeciesIDs contains shell metacharacters (e.g., "; rm -rf /; #), the generated script line will break out of the string and execute arbitrary commands with the workflow user's privileges before awk runs. Treat excludedSpeciesIDs strictly as data by safely quoting/escaping it (or passing it via a safer mechanism such as environment variables or Nextflow's protected interpolation) so that it cannot alter the shell command structure.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants