Speed up augur filter #1575

@victorlin

Description

Metadata

Filtering and subsampling logic in augur filter uses pandas DataFrames to represent input metadata. Because DataFrames must be loaded into memory, augur filter was adapted to large datasets (e.g. SARS-CoV-2) by reading the metadata in chunks, since the full dataset can exceed a typical machine's RAM. Compared to a non-chunked approach, this is less efficient (it requires multiple passes through the metadata) and less intuitive (it requires workarounds and extra variables to coordinate operations across chunks).
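As a rough illustration of the chunked pattern, a first filtering pass might look like the sketch below. The file layout, column names, and chunk size are assumptions for the sake of the example, not augur's actual code:

```python
import pandas as pd

def ids_passing_filter(metadata_path, min_date, chunk_size=100_000):
    """First pass: collect the IDs that pass a simple date filter,
    reading the metadata TSV one chunk at a time to bound memory use."""
    passed = set()
    for chunk in pd.read_csv(metadata_path, sep="\t", chunksize=chunk_size):
        # Each chunk is filtered independently; operations that span
        # chunks (e.g. grouped subsampling) need extra state carried
        # between iterations, which is where the complexity comes from.
        passed.update(chunk.loc[chunk["date"] >= min_date, "strain"])
    return passed
```

Any filter that only looks at one row at a time fits this shape cleanly; it is the cross-chunk operations that force the extra passes and bookkeeping described above.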

For metadata filtering/subsampling, there are two ways to tackle this. See individual issues for details:

Metadata output is currently Python-bound. An alternative:

Sequences

In a single augur filter call, the input --sequences file is read up to 3 times:

  1. Validation, i.e. checking for duplicates and retrieving IDs to sync with metadata
  2. Building the sequence index
  3. Writing the filtered subset of sequences to --output-sequences
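In principle, steps 1 and 3 above could share a single pass. A minimal sketch of checking for duplicates and writing the filtered subset in one read, using plain-text FASTA parsing rather than augur's actual I/O layer:

```python
def filter_fasta_single_pass(in_path, out_path, keep_ids):
    """One pass over a FASTA file: raise on duplicate IDs and write
    records whose ID is in keep_ids. Returns the set of all IDs seen."""
    seen = set()
    writing = False
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                seq_id = line[1:].split()[0]
                if seq_id in seen:
                    raise ValueError(f"duplicate sequence id: {seq_id}")
                seen.add(seq_id)
                # Decide once per record whether its lines get copied.
                writing = seq_id in keep_ids
            if writing:
                fout.write(line)
    return seen
```

In practice this only works if the set of IDs to keep is known before the sequences are read, which is why the metadata and sequence passes interact.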

Some work has been done to improve sequence I/O:

The sequence index is an Augur-specific TSV file format containing sequence IDs, lengths, and nucleotide counts. It is slow to generate, whether created automatically at runtime or manually with augur index and passed to augur filter via --sequence-index. Pre-building the index helps when running multiple augur filter commands concurrently, but generation itself remains slow. Here is a list of work related to the sequence index:
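To make the cost concrete, generating such an index amounts to a full scan of every base of every sequence. A simplified version is sketched below; the exact column layout is illustrative, not augur's real index format:

```python
from collections import Counter

def build_sequence_index(fasta_path, index_path):
    """Scan a FASTA file once and write a TSV of per-sequence ID,
    length, and A/C/G/T/N counts. Every base must be touched, which
    is why index generation is inherently slow on large inputs."""
    def records(fh):
        # Minimal FASTA record iterator: yields (id, sequence) pairs.
        seq_id, parts = None, []
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(parts)
                seq_id, parts = line[1:].split()[0], []
            else:
                parts.append(line)
        if seq_id is not None:
            yield seq_id, "".join(parts)

    with open(fasta_path) as fin, open(index_path, "w") as fout:
        fout.write("strain\tlength\tA\tC\tG\tT\tN\n")
        for seq_id, seq in records(fin):
            counts = Counter(seq.upper())
            row = [seq_id, len(seq)] + [counts[b] for b in "ACGTN"]
            fout.write("\t".join(map(str, row)) + "\n")
```

Since the counts only depend on the sequences themselves, the output can be cached and reused across runs, which is the point of the standalone augur index command.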

Related issues

Metadata


Labels: enhancement (New feature or request), priority: high (To be resolved before other issues)
