Skip to content

Order of input files to hts_SeqScreener changes hits reported when R1/R2 lengths differ #253

@AstrobioMike

Description

@AstrobioMike

Describe the bug
Heya, thanks for maintaining HTStream :)

For the attached example files, the number of hits identified by hts_SeqScreener vary depending on whether the forward reads or reverse reads were provided first.

Overall summary:

how-run num_hits
R1 SE: -U test-R1.fq.gz 7
R2 SE: -U test-R2.fq.gz 2021
R1,R2 PE: -1 test-R1.fq.gz -2 test-R2.fq.gz 3348
R2,R1 PE: -1 test-R2.fq.gz -2 test-R1.fq.gz 2021
  • for these data, it is expected that R1 would yield near zero, as they are not biological
  • R2 are the biological sequences, and it seems providing things as -1 R2 -2 R1 to hts_SeqScreener returns the expected results based on what it looks like when providing just R2

For the data I was working with when running into this, attached, R1 length is 29 bases and R2 is 121. So the disparity may have to deal with the length of read 1 and if that's used for calculating the percentage-hit cutoff without consideration for read 2 length/total fragment length?

To Reproduce
Example sequence reads that will help reproduce the bug:

ref.fa.gz
test-R1.fq.gz
test-R2.fq.gz

Commands to reproduce the behavior:

mamba create -y -n hts -c conda-forge -c bioconda -c defaults htstream=1.3.3

conda activate hts

### with files attached to issue

## running single
# R1
hts_SeqScreener -U test-R1.fq.gz -s ref.fa.gz -L hts-R1-SE.log > /dev/null
grep -A 5 "Single_end" hts-R1-SE.log | tail -n 1 | sed 's/^/# /'
#         "hits": 7

# R2
hts_SeqScreener -U test-R2.fq.gz -s ref.fa.gz -L hts-R2-SE.log > /dev/null
grep -A 5 "Single_end" hts-R2-SE.log | tail -n 1 | sed 's/^/# /'
#         "hits": 2021

## running together
# R1 first
hts_SeqScreener -1 test-R1.fq.gz -2 test-R2.fq.gz -s ref.fa.gz -C -L hts-R1-first.log > /dev/null
grep -A 3 "Paired_end" hts-R1-first.log | tail -n 1
    # 3348 hits

# R2 first
hts_SeqScreener -1 test-R2.fq.gz -2 test-R1.fq.gz -s ref.fa.gz -C -L hts-R2-first.log > /dev/null
grep -A 3 "Paired_end" hts-R2-first.log | tail -n 1
    # 2021 hits

Expected behavior
I would have expected the results to be the same whether R1 or R2 were provided first when given as paired with the -C option set. But the order provided makes a large difference.

Desktop (please complete the following information):

  • observed on MacOS and Ubuntu

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions