
@victorlin (Member) commented on Jun 27, 2025

Description of proposed changes

The combine_samples rule became noticeably slower with Augur 31.2.1, and I tracked down a few reasons. This PR, along with Augur #1834, speeds the rule up significantly, shaving ~1 hour of run time off ncov/gisaid (details in the Testing section below).

Testing

Full ncov/gisaid trial run results:

  • Tanglegram for ncov/gisaid/global/6m; I think this shows that there are no functional changes.
  • Snakemake run stats compared with older runs (the subsample rule is included because it shows notable differences too):
    • stats-augur-31.0.0.json from May 25 run
      • average subsample run time: 27m
      • average combine_samples run time: 33m
      • workflow run time: 3h15m
    • stats-augur-31.1.0.json from May 28 run
      • average subsample run time: 17m (improved by date function caching)
      • average combine_samples run time: 37m
      • workflow run time: 2h53m
    • stats-augur-31.2.1.json from June 22 run
      • average subsample run time: 11m
      • average combine_samples run time: 2h18m (increased due to a concurrency bug)
      • workflow run time: 4h41m
    • stats-ncov-pr-1178.json from trial run linked above
      • average subsample run time: 10m
      • average combine_samples run time: 15m (improved by fixes in this PR)
      • workflow run time: 2h50m

Checklist

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

@victorlin self-assigned this on Jun 27, 2025
@victorlin force-pushed the victorlin/optimize-combine_samples branch from 3d1baef to 4360a2d on July 7, 2025 at 22:02
@victorlin marked this pull request as ready for review on July 7, 2025 at 22:03
@victorlin force-pushed the victorlin/optimize-combine_samples branch from 4360a2d to 12e81b9 on July 21, 2025 at 21:50
@joverlee521 (Contributor) left a comment

Thanks for tracking down the slowness and documenting all of the test runs!

Commit messages from the force-pushes below:

The message is no longer accurate as of the switch to augur filter in
"Aggregate subsampled sequences with augur filter" (1b7ceb6). Instead of
updating it, I'll just remove it, since having this rule attribute goes
against the team's Snakemake style guide.¹

¹ <https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html#avoid-the-message-rule-attribute>

This improves searchability.
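As an aside, any remaining uses of the message attribute elsewhere in a workflow can be found with a plain grep (a minimal sketch; the file patterns are an assumption about the repo layout, not taken from this PR):

# list every message: attribute in Snakemake files under the current directory
grep -rn --include='*.smk' --include=Snakefile 'message:' .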
@victorlin force-pushed the victorlin/optimize-combine_samples branch 3 times, most recently from de519d9 to b6c233c on July 23, 2025 at 22:33
@victorlin (Member, Author) commented:

I've started a new ncov/gisaid trial run. If all goes well, I plan to merge tomorrow.

@victorlin (Member, Author) commented:

The trial run showed faster run times for combine_samples jobs but a slightly slower workflow overall. Re-running without b6c233c to see if that helps.

@victorlin (Member, Author) commented:

Average combine_samples run time from the trial run above is still 15 minutes, meaning it's not necessary to use 2 threads instead of 1. I'm going to drop 94f1858 + b6c233c and merge.

Side note: run time statistics are no longer available directly through a stats.json file, so I used this command in the benchmarks directory to get the numbers:

awk 'FNR == 2 {
    # For each benchmark file, the stats are on the 2nd line;
    # the first field is the runtime in seconds
    print $1
}' combine_samples_* |

awk '{
    # Accumulate sum and track min/max
    sum += $1
    if (NR == 1 || $1 < min) min = $1
    if (NR == 1 || $1 > max) max = $1
}
END {
    # After reading all values, print statistics
    print "Min: " min " seconds"
    print "Max: " max " seconds"
    print "Mean: " sum/NR " seconds"
}'
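For reference, the same statistics can be computed in a single pass (an equivalent sketch under the same assumption, i.e. that the first field on the second line of each Snakemake benchmark file is the runtime in seconds):

awk 'FNR == 2 {
    # 2nd line of each benchmark file: first field is the runtime in seconds
    sum += $1
    if (++n == 1 || $1 < min) min = $1
    if (n == 1 || $1 > max) max = $1
}
END {
    print "Min: " min " seconds"
    print "Max: " max " seconds"
    print "Mean: " sum/n " seconds"
}' combine_samples_*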

Commit message from the final force-push:

This avoids an extra pass through the sequence file that was introduced
in Augur 31.2.1.

The sequence checks do two things: (1) check for duplicates, and (2) check
for the presence of ids from the metadata, so that any ids missing from the
sequences are dropped.

Here's why it's safe to bypass these checks:

(1) is not useful if the inputs are already deduplicated prior to
running this rule. A new config key has been added to mark inputs as
deduplicated.

(2) is not useful since --exclude-all already excludes any ids that
would have been excluded by the check for presence in sequences.
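One quick way to verify that inputs really are deduplicated before setting that config key (a minimal sketch, not part of the PR; it assumes plain, uncompressed FASTA, treats the header text up to the first whitespace as the id, and uses sequences.fasta as a placeholder file name):

awk '/^>/ {
    # take the id: the first field of the header line, minus the leading ">"
    id = substr($1, 2)
    # print any id seen more than once
    if (seen[id]++) print id
}' sequences.fasta  # sequences.fasta is a placeholder input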
@victorlin force-pushed the victorlin/optimize-combine_samples branch from b6c233c to 39a781b on July 29, 2025 at 20:39
@victorlin merged commit 60d4bca into master on July 29, 2025 (5 of 6 checks passed).
@victorlin deleted the victorlin/optimize-combine_samples branch on July 29, 2025 at 20:40.