
hts_SeqScreener enhancements for bigger references #227

@samhunter


hts_SeqScreener is meant to filter or identify reads that originate from specific source sequences (PhiX by default, but also ribosomal sequences, adapters, etc.).

Is your enhancement request related to a problem? Please describe.
Currently hts_SeqScreener is not optimized for large references. It has had little or no testing on human-sized genomes (~3 Gbp), and it is not expected to work well at that scale; it would likely be very slow.

Describe the solution you'd like
A number of alternative algorithms and data structures have been designed to speed up similar processes.
Read mapping solves essentially the same problem:
Minimap2: https://github.com/lh3/minimap2#algo
Minimizer schemes: https://www.biorxiv.org/content/10.1101/652925v1.full.pdf
https://homolog.us/blogs/bioinfo/2017/10/25/intro-minimizer/
https://pdfs.semanticscholar.org/18a3/3e90b5e6872d33e32c4b9bd6f2fe577be8d6.pdf

There is also Kraken2, which classifies reads by k-mer lookup rather than by alignment:
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0

Implementing something similar to what one of these tools uses could make screening against a human-sized genome feasible.
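To make the minimizer idea concrete, here is a rough sketch in Python (not HTStream code, and not how hts_SeqScreener currently works). It indexes only the smallest canonical k-mer in each window of `w` consecutive k-mers, then scores a read by the fraction of its minimizers found in that index. The function names, the `k` and `w` values, and the use of lexicographic ordering are all assumptions for illustration; real tools such as Minimap2 order k-mers by a hash and use a linear-time sliding window.

```python
from typing import Iterable, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def minimizers(seq: str, k: int, w: int) -> Set[str]:
    """Collect the (w, k)-minimizers of seq.

    For every window of w consecutive k-mers, keep the lexicographically
    smallest canonical k-mer. Assumes seq contains only A, C, G, T.
    This is the simple O(n * w) form, chosen for readability.
    """
    seq = seq.upper()
    kmers = [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) < w:
        # Sequence too short for a full window: fall back to one minimizer.
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def build_index(references: Iterable[str], k: int, w: int) -> Set[str]:
    """Hypothetical index: the union of minimizers over all references."""
    index: Set[str] = set()
    for ref in references:
        index |= minimizers(ref, k, w)
    return index


def hit_fraction(read: str, index: Set[str], k: int, w: int) -> float:
    """Fraction of the read's minimizers that occur in the index."""
    read_minimizers = minimizers(read, k, w)
    if not read_minimizers:
        return 0.0
    return len(read_minimizers & index) / len(read_minimizers)
```

A read would then be flagged as originating from the reference when `hit_fraction` exceeds some chosen threshold. Because only about one k-mer per window is stored, the index is much smaller than a table of every k-mer, which is the property that would matter for a ~3 Gbp reference.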
