Optimize rule combine_samples
#1178
joverlee521
left a comment
Thanks for tracking down the slowness and documenting all of the test runs!
The message is no longer accurate as of the switch to augur filter in "Aggregate subsampled sequences with augur filter" (1b7ceb6). Instead of updating it, I'll just remove it since having this rule attribute goes against the team's Snakemake style guide.¹
¹ <https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html#avoid-the-message-rule-attribute>
This improves searchability.
I've started a new ncov/gisaid trial run. If all goes well, I'll plan to merge tomorrow.
The trial run showed faster run times for
Side note: run time statistics are no longer available directly through a stats.json file, so I used this command in the benchmarks directory to get the numbers:
awk 'FNR == 2 {
    # For each file, on its 2nd line, print the first field
    print $1
}' combine_samples_* |
awk '{
    # Accumulate sum and track min/max
    sum += $1
    if (NR == 1 || $1 < min) min = $1
    if (NR == 1 || $1 > max) max = $1
}
END {
    # After reading all values, print statistics
    print "Min: " min " seconds"
    print "Max: " max " seconds"
    print "Mean: " sum/NR " seconds"
}'
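For anyone who would rather script this than pipe two awk calls together, here is a rough Python sketch of the same calculation. It assumes, as the awk command does, that each benchmark file has a header on line 1 and the run time in seconds as the first field of line 2. The function names `first_run_seconds` and `summarize` are made up for illustration, and the demo values are invented, not real benchmark numbers.

```python
import glob
import os
import tempfile


def first_run_seconds(path):
    """Return the first field on line 2 of a benchmark file, as a float."""
    with open(path) as handle:
        handle.readline()  # line 1: column header, skipped
        fields = handle.readline().split()
    return float(fields[0])


def summarize(pattern):
    """Return min, max, and mean of the values from files matching pattern."""
    values = [first_run_seconds(path) for path in sorted(glob.glob(pattern))]
    if not values:
        raise ValueError(f"no files match {pattern!r}")
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


if __name__ == "__main__":
    # Demo with invented numbers in a throwaway directory (not real data).
    with tempfile.TemporaryDirectory() as tmp:
        for name, seconds in [("a", 10.0), ("b", 20.0), ("c", 60.0)]:
            path = os.path.join(tmp, f"combine_samples_{name}.txt")
            with open(path, "w") as out:
                out.write("s\th:m:s\n")
                out.write(f"{seconds}\t0:00:00\n")
        stats = summarize(os.path.join(tmp, "combine_samples_*"))
        for label in ("min", "max", "mean"):
            print(f"{label.capitalize()}: {stats[label]} seconds")
```

In the benchmarks directory, the equivalent of the awk command would be `summarize("combine_samples_*")`.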
This avoids an extra pass through the sequence file that was introduced in Augur 31.2.1. The sequence checks do 2 things:
(1) check for duplicates
(2) check for presence of ids from metadata so that any ids missing from sequences are dropped
Here's why it's safe to bypass these checks:
(1) is not useful if the inputs are already deduplicated prior to running this rule. A new config key has been added to mark inputs as deduplicated.
(2) is not useful since --exclude-all already excludes any ids that would have been excluded by the check for presence in sequences.
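The reasoning for (2) can be pictured with plain sets. This is only a toy model of the argument, not how Augur implements either step: the two functions and the example ids are hypothetical.

```python
def dropped_by_presence_check(metadata_ids, sequence_ids):
    """Toy version of check (2): ids in metadata but missing from sequences."""
    return set(metadata_ids) - set(sequence_ids)


def dropped_by_exclude_all(metadata_ids):
    """Toy version of --exclude-all: every metadata id starts out excluded."""
    return set(metadata_ids)


# Invented ids for illustration only.
metadata_ids = {"A", "B", "C", "D"}
sequence_ids = {"A", "B", "C"}

presence = dropped_by_presence_check(metadata_ids, sequence_ids)
exclude_all = dropped_by_exclude_all(metadata_ids)

# Everything the presence check would drop is already dropped by exclude-all.
assert presence <= exclude_all
```

Under this model the presence check never removes anything that `--exclude-all` has not already removed, which is the claim made above.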
Description of proposed changes
This rule became noticeably slower with Augur 31.2.1. I tracked down a few reasons. This PR along with Augur #1834 speeds up this rule significantly, shaving off ~1hr of run time from ncov/gisaid (details in testing section).
Testing
Full ncov/gisaid trial run results:
rule subsample because it has notable differences too):
subsample run time: 27m; combine_samples run time: 33m
subsample run time: 17m (improved by date function caching); combine_samples run time: 37m
subsample run time: 11m; combine_samples run time: 2h18m (increased due to concurrency bug)
subsample run time: 10m; combine_samples run time: 15m (improved by fixes in this PR)
Checklist
Release checklist
If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:
docs/src/reference/change_log.md in this pull request to document these changes and the new version number.