hts_SeqScreener enhancements for bigger references

hts_SeqScreener filters or identifies reads that originate from specific source sequences (PhiX by default, but also ribosomal sequences, adapters, etc.).

**Is your enhancement request related to a problem? Please describe.**
Currently, hts_SeqScreener is not optimized for large references. It has had little or no testing on human-sized genomes (~3 Gbp), where it is not expected to work well and would be very slow.

**Describe the solution you'd like**
A number of alternative algorithms and data structures have been designed to speed up similar processes.
Read mapping is essentially the same problem:
Minimap2: https://github.com/lh3/minimap2#algo
minimizer schemes: https://www.biorxiv.org/content/10.1101/652925v1.full.pdf
https://homolog.us/blogs/bioinfo/2017/10/25/intro-minimizer/
https://pdfs.semanticscholar.org/18a3/3e90b5e6872d33e32c4b9bd6f2fe577be8d6.pdf

Kraken2 is another option:
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0

Implementing something similar to the approach used in one of these tools could make screening against a human-sized genome feasible.
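
To make the idea concrete, here is a minimal Python sketch of minimizer-based screening, in the spirit of the minimizer schemes linked above. It is not HTStream code and not how Minimap2 or Kraken2 are implemented: every name (`minimizers`, `build_index`, `screen_read`), the `k`/`w` values, and the `min_fraction` threshold are hypothetical. It also assumes a plain lexicographic ordering and a Python `set`, where a real implementation would more likely use a hashed ordering and a compact hash table.

```python
from typing import Iterable, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (non-ACGT characters are left as-is)."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc


def minimizers(seq: str, k: int, w: int) -> Set[str]:
    """Smallest canonical k-mer in every window of w consecutive k-mers.

    Assumption: lexicographic order picks the minimizer. This is a simple
    O(n * w) scan; a sliding-window deque would bring it down to O(n).
    """
    seq = seq.upper()
    kmers = [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    w = min(w, len(kmers))  # short sequences get a single, smaller window
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def build_index(references: Iterable[str], k: int, w: int) -> Set[str]:
    """Collect the minimizers of all reference sequences into one set."""
    index: Set[str] = set()
    for ref in references:
        index |= minimizers(ref, k, w)
    return index


def screen_read(read: str, index: Set[str], k: int, w: int,
                min_fraction: float = 0.5) -> bool:
    """Flag a read as a hit if enough of its minimizers occur in the index.

    min_fraction is a hypothetical threshold, not an existing option.
    """
    read_mins = minimizers(read, k, w)
    if not read_mins:
        return False
    hits = sum(1 for m in read_mins if m in index)
    return hits / len(read_mins) >= min_fraction
```

Because only one k-mer per window is stored, the index holds a fraction of the reference's k-mers, which is what could make a ~3 Gbp reference tractable. Using canonical k-mers means a read and its reverse complement yield the same minimizer set, so strand does not need separate handling.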

**Additional context**


