Skip to content

Add SQLite3 backend for performance Improvement #397

@DhanoHacks

Description

@DhanoHacks

Hi team,

We (me along with @saketkc) are currently experimenting with using a SQLite3 backend instead of Pandas since it tends to be faster for very large datasets (~GBs of data). It is still at a testing stage but we have implemented the following features:

  1. Convert from GTF format to SQLite3 database
  2. Convert from GFF3 format to SQLite3 database
  3. Convert from SQLite3 database to Pyranges object
  4. Perform aggregate queries on genomic interval data: Count exons for each gene, calculate total exon length for each gene, determine the gene with the most transcripts
  5. Perform interval queries on genomic interval data: Merge exons, Find overlaps between two datasets, subtract one dataset from another

Here are the results from our tests on a personal laptop and a computing cluster, on the datasets Homo_sapiens.GRCh38.112.chr.gtf, gencode.vM36.annotation.gtf, and Arabidopsis_thaliana.TAIR10.60.gff3:

PC results

Image
Image
Image

Cluster results

Image
Image
Image

Overall, we are seeing lower computational times across all aggregate and interval queries, at the cost of slightly more processing time for conversion from GTF/GFF3 to SQLite3 database (though that is also not a major issue since SQLite3 database is persistent on disk, and hence, file processing needs to be done only once and not every time someone wants to re-run their analysis).

Would pyranges be interested in incorporating this? If so, I am happy to make a PR.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions