Skip to content

Qualimap assigns more reads to exonic regions than total reads in the sample  #1273

@ctuni

Description

@ctuni

Description of the bug

A while ago we realized that QUALIMAP assigns more reads to exonic+intronic+intergenic than total reads in the sample. We subsampled the paired-end SRA FASTQ to exactly 1M fragments by head -n 4000000 file_{1,2}.fastq > subsampled_{1,2}.fastq.
Then, the pipeline runs the samples through STAR, then Picard markduplicates, and then qualimap (broadly speaking).

The issue is that although the samples have exactly 1M reads, and the initial FastQC correctly reports that there are indeed exactly 1M reads, Qualimap consistently assigns more than 1M reads to genomic features, such as ~1.5M reads mapping to exons, and then a bit more to introns and intergenic regions. The STAR logs also report exactly 1M reads to map. The reads assigned to genomic features by qualimap, when added up, do not add up to 2M, so Qualimap is not quantifying reads instead of fragments, which could explain the discrepancy.

The way qualimap runs when the variables are translated to arguments is correct (our samples are paired-end and the strandedness is reverse):

qualimap \
        rnaseq \
        -bam input.markdup.sorted.bam \
        -gtf annotation.gtf \
        -p strand-specific-reverse \
        -pe \
        -outdir out/

We also made sure that the counting algorithm is uniquely-mapped-reads, as is shown in this screenshot from the Qualimap report. It is my understanding that using this counting algorithm means that chimeric and secondary alignments are not counted, which could explain the higher number if I did not use this counting algorithm:
image

We know that this is not a nf-core/rnaseq issue but we also reported the same issue a while ago on Qualimap's BitBucket repo. The link to the issue is here but it can't be accessed because the issue has not been approved yet. It was submitted on the 22nd of January 2024, so 8 weeks have passed and there are no responses or updates.

Maybe there is an explanation that we are missing but we have tried everything and still can't explain how Qualimap assigns more reads to genomic features than total reads in the sample. Personally, we have decided to drop the Qualimap results and compute those statistics with an in-house script, and wanted to open the discussion here.

Command used and terminal output

No response

Relevant files

No response

System information

No response

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions