
Refactor Alignment and Data Processing #34

@brucejwittmann

Description


The way we perform alignments could be much more efficient. We toss all reads with an insertion or deletion, so we are assuming a priori that the returned read aligns to the reference. As a result, there should be no need to perform a global alignment with Biopython -- we can just compare the reads to the reference, aligning the tail ends of the reads to the appropriate ends of the reference. Reads exceeding a given number of mismatches can then be discarded.
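A minimal sketch of the idea, assuming indel-free reads so a read can be anchored at one end of the reference and compared elementwise (function and variable names here are hypothetical, not from the codebase):

```python
import numpy as np

# Ordinal encoding for DNA characters; the "N" entry is an assumption.
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

def encode(seq):
    """Ordinally encode a DNA string as a uint8 array."""
    return np.array([BASE_TO_INT[b] for b in seq], dtype=np.uint8)

def count_mismatches(read, reference, from_end=False):
    """Compare a read against one end of the reference without alignment.

    Because reads with indels are already discarded, a simple elementwise
    comparison suffices. `from_end=True` anchors the read at the 3' end of
    the reference instead of the 5' end.
    """
    ref_slice = reference[-len(read):] if from_end else reference[:len(read)]
    return int(np.count_nonzero(read != ref_slice))

ref = encode("ACGTACGTAC")
fwd = encode("ACGTA")   # matches the 5' end exactly -> 0 mismatches
rev = encode("CGAAG")   # vs. the 3' end "CGTAC" -> 2 mismatches
```

This replaces the O(n^2) dynamic-programming alignment with an O(n) comparison per read.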

Doing this would allow us to (1) avoid the O(n^2) memory requirement for aligning to a reference of length n, (2) ordinally encode characters from the beginning, thus saving on memory, and (3) take advantage of vectorization with numpy to perform alignment QC and counting.
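Point (3) could look something like the following sketch: once equal-length, ordinally encoded reads are stacked into a 2D array, both the mismatch QC and the per-position counting collapse into vectorized numpy operations (the threshold and names are illustrative assumptions):

```python
import numpy as np

MAX_MISMATCHES = 1  # illustrative threshold, not specified in the issue

def filter_and_count(reads, ref_window):
    """Vectorized mismatch QC over a (n_reads, read_len) uint8 array.

    `reads` holds ordinally encoded reads; `ref_window` is the matching
    slice of the encoded reference. Returns the reads that pass QC and
    per-position counts of each base among the passing reads.
    """
    # One broadcasted comparison gives per-read mismatch totals.
    mismatches = np.count_nonzero(reads != ref_window, axis=1)
    passing = reads[mismatches <= MAX_MISMATCHES]
    # Count occurrences of each base (0-3) at each position in one shot.
    counts = np.stack([(passing == b).sum(axis=0) for b in range(4)])
    return passing, counts
```

Operating on uint8 arrays also keeps the memory footprint at one byte per base, in line with point (2).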

We may also want to play around with when exactly new processes are spawned for data analysis. Ideally, we want to send as little data as possible to the spawned processes, then return only what we need to comprehensively analyze all wells. Reorganizing code to maximize this transfer/memory efficiency should also reduce memory bloat.
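One possible shape for that reorganization, sketched with `multiprocessing` (the well/summary structure here is an assumption, not the current code): each worker receives only its own well's encoded reads plus the shared reference window, and returns a small summary rather than the raw reads.

```python
import numpy as np
from multiprocessing import Pool

def process_well(args):
    """Worker: run QC in the child process, return only a compact summary."""
    well_id, reads, ref_window = args
    mismatches = np.count_nonzero(reads != ref_window, axis=1)
    n_pass = int(np.count_nonzero(mismatches <= 1))  # illustrative threshold
    return well_id, n_pass  # small payload back to the parent, not the reads

def analyze_wells(wells, ref_window, n_procs=4):
    """Fan out per-well work; collect only the summaries in the parent."""
    jobs = [(wid, reads, ref_window) for wid, reads in wells.items()]
    with Pool(n_procs) as pool:
        return dict(pool.map(process_well, jobs))
```

Keeping the serialized arguments and return values this small is what bounds the per-process memory cost.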

Labels: enhancement (New feature or request)
