Metadata
Filtering and subsampling logic in augur filter uses pandas DataFrames to represent input metadata. Because DataFrames must be loaded into memory, support for large datasets (e.g. SARS-CoV-2) is implemented by loading the metadata in chunks, since the full dataset is too large for a typical machine's RAM. Compared to a non-chunked approach, this is less efficient (it requires multiple passes through the metadata) and less intuitive (it requires workarounds and extra variables to coordinate operations across chunks).
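The chunked pattern described above can be sketched with pandas' `chunksize` option. This is illustrative only: the column names and the date-based filter are hypothetical examples, not augur's actual schema or filtering logic.

```python
import io

import pandas as pd

# Stand-in for a large metadata TSV; in practice this would be a file on disk.
metadata = io.StringIO(
    "strain\tdate\n"
    "A\t2020-01-01\n"
    "B\t2021-06-15\n"
    "C\t2021-07-01\n"
)

# Extra state that must live outside the loop to coordinate results
# across chunks -- the "extra variables" mentioned above.
kept_strains = []

# chunksize=2 keeps only two rows in memory at a time; each chunk is a
# regular DataFrame, so filters apply per chunk rather than to the whole table.
for chunk in pd.read_csv(metadata, sep="\t", chunksize=2):
    passing = chunk[chunk["date"] >= "2021-01-01"]
    kept_strains.extend(passing["strain"])

print(kept_strains)  # → ['B', 'C']
```

Any operation that needs a global view (e.g. deduplication or group-wise subsampling) cannot be expressed inside a single chunk and forces this kind of cross-chunk bookkeeping, or a second pass over the file.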
For metadata filtering/subsampling, there are two ways to tackle this. See individual issues for details:
- Speed up filtering/subsampling without replacing Pandas #1573
- Speed up filtering/subsampling by rewriting Pandas logic #1574
Metadata output is currently Python-bound. An alternative:
Sequences
In a single augur filter call, the input --sequences is read up to 3 times:
- Validation, i.e. checking for duplicates and retrieving ids to sync with metadata
- Building the sequence index
- Writing the filtered subset of sequences to --output-sequences
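The final pass can be done as a single stream over the FASTA file, copying only records whose ids survived filtering. This is a minimal sketch, not augur's implementation; the `keep` set and the input records are hypothetical examples.

```python
import io

# Ids that survived metadata filtering (hypothetical example set).
keep = {"seq2"}

# Stand-ins for the input FASTA and the --output-sequences file.
src = io.StringIO(">seq1\nACGT\n>seq2\nAAGG\n")
out = io.StringIO()

# Stream line by line: toggle writing whenever a header line is seen,
# so memory use stays constant regardless of file size.
write = False
for line in src:
    if line.startswith(">"):
        write = line[1:].strip() in keep
    if write:
        out.write(line)

print(out.getvalue())  # → ">seq2\nAAGG\n"
```

Because each pass is a full scan like this one, collapsing validation, indexing, and output into fewer scans is the main lever for speeding up sequence I/O.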
Some work has been done to improve sequence I/O:
The sequence index is an Augur-specific TSV file format containing sequence ids, lengths, and nucleotide counts. It is slow to generate, whether built automatically at runtime or created manually with augur index and passed to augur filter via --sequence-index. The latter saves time across repeated or concurrent runs, but generation itself remains slow. Here is a list of work related to the sequence index:
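The kind of information the index records can be sketched as a single pass over a FASTA file, emitting one row per sequence with its id, length, and nucleotide counts. This is a rough illustration of the concept, not augur's exact parser or column layout.

```python
import io
from collections import Counter


def index_fasta(handle):
    """Build index rows (id, length, nucleotide counts) in one pass."""
    records = []
    name, parts = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(parts)))
            name, parts = line[1:], []
        else:
            parts.append(line)
    if name is not None:
        records.append((name, "".join(parts)))

    rows = []
    for name, seq in records:
        counts = Counter(seq.upper())
        rows.append({
            "strain": name,
            "length": len(seq),
            "A": counts["A"], "C": counts["C"],
            "G": counts["G"], "T": counts["T"], "N": counts["N"],
        })
    return rows


# Hypothetical two-record input; real inputs are multi-gigabyte files,
# which is why generating the index is slow.
for row in index_fasta(io.StringIO(">seq1\nACGTN\n>seq2\nAAGG\n")):
    print(row)
```

Even in this simplified form, every base of every sequence must be read and counted, so the cost scales with total sequence length rather than with the number of sequences.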
- filter: Skip building sequence index if there are no sequence filters #1827
- Replace sequence indexing #1846