
@victorlin (Member) commented on Jun 27, 2025

Description of proposed changes

The combine_samples rule became noticeably slower with Augur 31.2.1, and I tracked down a few reasons. This PR, along with Augur #1834, speeds the rule up significantly, shaving ~1 hour of run time off ncov/gisaid (details in the Testing section below).

Testing

Full ncov/gisaid trial run results:

  • Tanglegram for ncov/gisaid/global/6m; I think this shows that there are no functional changes.
  • Snakemake run stats compared with older runs (the subsample rule is included because it shows notable differences too):
    • stats-augur-31.0.0.json from May 25 run
      • average subsample run time: 27m
      • average combine_samples run time: 33m
      • workflow run time: 3h15m
    • stats-augur-31.1.0.json from May 28 run
      • average subsample run time: 17m (improved by date function caching)
      • average combine_samples run time: 37m
      • workflow run time: 2h53m
    • stats-augur-31.2.1.json from June 22 run
      • average subsample run time: 11m
      • average combine_samples run time: 2h18m (increased due to a concurrency bug)
      • workflow run time: 4h41m
    • stats-ncov-pr-1178.json from trial run linked above
      • average subsample run time: 10m
      • average combine_samples run time: 15m (improved by fixes in this PR)
      • workflow run time: 2h50m

Checklist

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

@victorlin self-assigned this on Jun 27, 2025
@victorlin force-pushed the victorlin/optimize-combine_samples branch from 3d1baef to 4360a2d on July 7, 2025 at 22:02
@victorlin marked this pull request as ready for review on July 7, 2025 at 22:03
@victorlin force-pushed the victorlin/optimize-combine_samples branch from 4360a2d to 12e81b9 on July 21, 2025 at 21:50
@joverlee521 (Contributor) left a comment

Thanks for tracking down the slowness and documenting all of the test runs!

Commit messages from the force-pushes below:

The message is no longer accurate as of the switch to augur filter in
"Aggregate subsampled sequences with augur filter" (1b7ceb6). Instead of
updating it, I'll just remove it, since having this rule attribute goes
against the team's Snakemake style guide.¹

¹ <https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html#avoid-the-message-rule-attribute>

This improves searchability.
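As an aside, any remaining uses of the message attribute elsewhere in a workflow can be found with a plain grep (a minimal sketch; the file patterns are an assumption about the repo layout, not taken from this PR):

# list every message: attribute in Snakemake files under the current directory
grep -rn --include='*.smk' --include=Snakefile 'message:' .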
@victorlin force-pushed the victorlin/optimize-combine_samples branch 3 times, most recently from de519d9 to b6c233c on July 23, 2025 at 22:33
@victorlin (Member, Author) commented:

I've started a new ncov/gisaid trial run. If all goes well, I plan to merge tomorrow.

@victorlin (Member, Author) commented:

The trial run showed faster run times for combine_samples jobs but a slightly slower workflow overall. Re-running without b6c233c to see if that helps.

@victorlin (Member, Author) commented:

Average combine_samples run time from the trial run above is still 15 minutes, meaning it's not necessary to use 2 threads instead of 1. I'm going to drop 94f1858 + b6c233c and merge.

Side note: run time statistics are no longer available directly through a stats.json file, so I used this command in the benchmarks directory to get the numbers:

awk 'FNR == 2 {
    # For each benchmark file, the stats are on the 2nd line;
    # the first field is the runtime in seconds
    print $1
}' combine_samples_* |

awk '{
    # Accumulate sum and track min/max
    sum += $1
    if (NR == 1 || $1 < min) min = $1
    if (NR == 1 || $1 > max) max = $1
}
END {
    # After reading all values, print statistics
    print "Min: " min " seconds"
    print "Max: " max " seconds"
    print "Mean: " sum/NR " seconds"
}'
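For reference, the same statistics can be computed in a single pass (an equivalent sketch under the same assumption, i.e. that the first field on the second line of each Snakemake benchmark file is the runtime in seconds):

awk 'FNR == 2 {
    # 2nd line of each benchmark file: first field is the runtime in seconds
    sum += $1
    if (++n == 1 || $1 < min) min = $1
    if (n == 1 || $1 > max) max = $1
}
END {
    print "Min: " min " seconds"
    print "Max: " max " seconds"
    print "Mean: " sum/n " seconds"
}' combine_samples_*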

Commit message from the final force-push:

This avoids an extra pass through the sequence file that was introduced
in Augur 31.2.1.

The sequence checks do two things: (1) check for duplicates, and (2) check
for the presence of ids from the metadata, so that any ids missing from the
sequences are dropped.

Here's why it's safe to bypass these checks:

(1) is not useful if the inputs are already deduplicated prior to
running this rule. A new config key has been added to mark inputs as
deduplicated.

(2) is not useful since --exclude-all already excludes any ids that
would have been excluded by the check for presence in sequences.
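One quick way to verify that inputs really are deduplicated before setting that config key (a minimal sketch, not part of the PR; it assumes plain, uncompressed FASTA, treats the header text up to the first whitespace as the id, and uses sequences.fasta as a placeholder file name):

awk '/^>/ {
    # take the id: the first field of the header line, minus the leading ">"
    id = substr($1, 2)
    # print any id seen more than once
    if (seen[id]++) print id
}' sequences.fasta  # sequences.fasta is a placeholder input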
@victorlin force-pushed the victorlin/optimize-combine_samples branch from b6c233c to 39a781b on July 29, 2025 at 20:39
@victorlin merged commit 60d4bca into master on July 29, 2025 (5 of 6 checks passed).
@victorlin deleted the victorlin/optimize-combine_samples branch on July 29, 2025 at 20:40.