Metadata
Filtering and subsampling logic in augur filter uses pandas DataFrames to represent input metadata. Because DataFrames must be loaded into memory, support for large datasets (e.g. SARS-CoV-2) is implemented by loading the metadata in chunks, since the full dataset is too large for a typical machine's RAM. Compared to a non-chunked approach, this is less efficient (it requires multiple passes through the metadata) and less intuitive (it requires workarounds and extra variables to coordinate operations across chunks).
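The chunked pattern described above can be sketched with pandas' `chunksize` option. This is illustrative only: the column names and the date-based filter are hypothetical examples, not augur's actual schema or filtering logic.

```python
import io

import pandas as pd

# Stand-in for a large metadata TSV; in practice this would be a file on disk.
metadata = io.StringIO(
    "strain\tdate\n"
    "A\t2020-01-01\n"
    "B\t2021-06-15\n"
    "C\t2021-07-01\n"
)

# Extra state that must live outside the loop to coordinate results
# across chunks -- the "extra variables" mentioned above.
kept_strains = []

# chunksize=2 keeps only two rows in memory at a time; each chunk is a
# regular DataFrame, so filters apply per chunk rather than to the whole table.
for chunk in pd.read_csv(metadata, sep="\t", chunksize=2):
    passing = chunk[chunk["date"] >= "2021-01-01"]
    kept_strains.extend(passing["strain"])

print(kept_strains)  # → ['B', 'C']
```

Any operation that needs a global view (e.g. deduplication or group-wise subsampling) cannot be expressed inside a single chunk and forces this kind of cross-chunk bookkeeping, or a second pass over the file.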
For metadata filtering/subsampling, there are two ways to tackle this. See individual issues for details:
- Speed up filtering/subsampling without replacing Pandas #1573
- Speed up filtering/subsampling by rewriting Pandas logic #1574
Metadata output is currently Python-bound. An alternative:
Sequences
In a single augur filter call, the input --sequences is read up to 3 times:
- Validation, i.e. checking for duplicates and retrieving ids to sync with metadata
- Building the sequence index
- Writing the filtered subset of sequences to --output-sequences
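The final pass can be done as a single stream over the FASTA file, copying only records whose ids survived filtering. This is a minimal sketch, not augur's implementation; the `keep` set and the input records are hypothetical examples.

```python
import io

# Ids that survived metadata filtering (hypothetical example set).
keep = {"seq2"}

# Stand-ins for the input FASTA and the --output-sequences file.
src = io.StringIO(">seq1\nACGT\n>seq2\nAAGG\n")
out = io.StringIO()

# Stream line by line: toggle writing whenever a header line is seen,
# so memory use stays constant regardless of file size.
write = False
for line in src:
    if line.startswith(">"):
        write = line[1:].strip() in keep
    if write:
        out.write(line)

print(out.getvalue())  # → ">seq2\nAAGG\n"
```

Because each pass is a full scan like this one, collapsing validation, indexing, and output into fewer scans is the main lever for speeding up sequence I/O.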
Some work has been done to improve sequence I/O:
The sequence index is an Augur-specific TSV file format containing sequence ids, lengths, and nucleotide counts. It is slow to generate, whether built automatically at runtime or created manually with augur index and passed to augur filter via --sequence-index. The latter saves time across repeated or concurrent runs, but generation itself remains slow. Here is a list of work related to the sequence index:
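The kind of information the index records can be sketched as a single pass over a FASTA file, emitting one row per sequence with its id, length, and nucleotide counts. This is a rough illustration of the concept, not augur's exact parser or column layout.

```python
import io
from collections import Counter


def index_fasta(handle):
    """Build index rows (id, length, nucleotide counts) in one pass."""
    records = []
    name, parts = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(parts)))
            name, parts = line[1:], []
        else:
            parts.append(line)
    if name is not None:
        records.append((name, "".join(parts)))

    rows = []
    for name, seq in records:
        counts = Counter(seq.upper())
        rows.append({
            "strain": name,
            "length": len(seq),
            "A": counts["A"], "C": counts["C"],
            "G": counts["G"], "T": counts["T"], "N": counts["N"],
        })
    return rows


# Hypothetical two-record input; real inputs are multi-gigabyte files,
# which is why generating the index is slow.
for row in index_fasta(io.StringIO(">seq1\nACGTN\n>seq2\nAAGG\n")):
    print(row)
```

Even in this simplified form, every base of every sequence must be read and counted, so the cost scales with total sequence length rather than with the number of sequences.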
- filter: Skip building sequence index if there are no sequence filters #1827
- Replace sequence indexing #1846