Add SQLite3 backend for performance improvement #397

Hi team,

@saketkc and I are experimenting with a SQLite3 backend as an alternative to pandas, since it tends to be faster on very large datasets (on the order of gigabytes). It is still at the testing stage, but we have implemented the following features:

1. Convert from GTF format to SQLite3 database
2. Convert from GFF3 format to SQLite3 database
3. Convert from SQLite3 database to PyRanges object
4. Perform aggregate queries on genomic interval data: count exons per gene, calculate total exon length per gene, and find the gene with the most transcripts
5. Perform interval queries on genomic interval data: merge exons, find overlaps between two datasets, and subtract one dataset from another
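
To make the idea concrete, here is a minimal sketch of features 1 and 4 using only the standard-library `sqlite3` module. It is not our actual implementation: the table name `features`, the column names, the helper functions, and the three-line GTF excerpt are all made up for illustration.

```python
import sqlite3

# Hypothetical GTF excerpt (tab-separated, 9 columns); not real annotation data.
GTF_TEXT = (
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    'chr1\tsrc\texon\t300\t450\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    'chr1\tsrc\texon\t1000\t1100\t.\t-\t.\tgene_id "G2"; transcript_id "T2";\n'
)


def parse_attributes(field):
    """Turn 'gene_id "G1"; transcript_id "T1";' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            attrs[key] = value.strip('"')
    return attrs


def gtf_to_sqlite(lines, conn):
    """Load GTF lines into an illustrative 'features' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS features ("
        "chrom TEXT, feature TEXT, start_pos INTEGER, end_pos INTEGER, "
        "strand TEXT, gene_id TEXT, transcript_id TEXT)"
    )
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        rows.append(
            (cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6],
             attrs.get("gene_id"), attrs.get("transcript_id"))
        )
    conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_gene ON features (gene_id)")
    conn.commit()


conn = sqlite3.connect(":memory:")
gtf_to_sqlite(GTF_TEXT.splitlines(), conn)

# Exon count and summed exon length per gene. GTF coordinates are 1-based and
# inclusive, hence the +1. Overlapping exons are not merged before summing.
exon_stats = conn.execute(
    """
    SELECT gene_id, COUNT(*) AS n_exons,
           SUM(end_pos - start_pos + 1) AS total_exon_length
    FROM features
    WHERE feature = 'exon'
    GROUP BY gene_id
    ORDER BY gene_id
    """
).fetchall()
print(exon_stats)
```

The aggregation runs inside SQLite, so only the per-gene summary rows are returned to Python.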

Here are the results from our tests on a personal laptop and a computing cluster, on the datasets Homo_sapiens.GRCh38.112.chr.gtf, gencode.vM36.annotation.gtf, and Arabidopsis_thaliana.TAIR10.60.gff3:

### PC results
![Image](https://github.com/user-attachments/assets/903d877e-70d9-45b5-a8b8-0cf239f75e7e)
![Image](https://github.com/user-attachments/assets/e56bac80-6831-471f-b81f-6d53dc63aa4c)
![Image](https://github.com/user-attachments/assets/7c341113-2d3c-4490-b0ff-8387a3e5fe89)

### Cluster results
![Image](https://github.com/user-attachments/assets/47f5a1b9-8daf-4f5a-aff3-59806ba590c6)
![Image](https://github.com/user-attachments/assets/683d1354-9cad-4e16-b677-e3317a519e96)
![Image](https://github.com/user-attachments/assets/9f83bb25-643d-4ef3-b742-34782f5f27fd)

Overall, we see lower run times across all aggregate and interval queries, at the cost of slightly more processing time to convert GTF/GFF3 files to a SQLite3 database. That cost is not a major issue: the database persists on disk, so the conversion needs to be done only once rather than every time someone re-runs an analysis.
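
As a rough illustration of the convert-once, query-many pattern, the sketch below builds a small on-disk database on the first call and reuses it on the second, then runs an overlap query between two interval tables. The file name, the table names `a` and `b`, the helper `open_or_build`, and the toy intervals are assumptions for this example, not our actual schema.

```python
import os
import sqlite3
import tempfile


def open_or_build(db_path, tables):
    """Open db_path, populating it only if the file does not exist yet.

    Returns (connection, built) where built says whether conversion ran.
    Table names come from a hard-coded dict here, so formatting them into
    the SQL is acceptable for a sketch; do not do this with untrusted input.
    """
    needs_build = not os.path.exists(db_path)
    conn = sqlite3.connect(db_path)
    if needs_build:
        for name, rows in tables.items():
            conn.execute(
                f"CREATE TABLE {name} "
                "(chrom TEXT, start_pos INTEGER, end_pos INTEGER)"
            )
            conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)
            conn.execute(
                f"CREATE INDEX idx_{name} ON {name} (chrom, start_pos, end_pos)"
            )
        conn.commit()
    return conn, needs_build


# Two closed intervals overlap when each one starts before the other ends.
OVERLAP_SQL = """
SELECT a.chrom, a.start_pos, a.end_pos, b.start_pos, b.end_pos
FROM a JOIN b
  ON a.chrom = b.chrom
 AND a.start_pos <= b.end_pos
 AND b.start_pos <= a.end_pos
ORDER BY a.chrom, a.start_pos, b.start_pos
"""

# Toy intervals, made up for illustration.
tables = {
    "a": [("chr1", 100, 200), ("chr1", 500, 600)],
    "b": [("chr1", 150, 250), ("chr2", 100, 200)],
}

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "annotation.sqlite")

    # First run: the file does not exist, so the tables are built.
    conn, built_first = open_or_build(db_path, tables)
    conn.close()

    # Later runs: the file is reused and no conversion happens.
    conn, built_second = open_or_build(db_path, tables)
    overlaps = conn.execute(OVERLAP_SQL).fetchall()
    conn.close()

print(built_first, built_second, overlaps)
```

A plain range join like this is the simplest formulation; SQLite's optional R*Tree module may be a better fit for large interval sets.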

Would pyranges be interested in incorporating this? If so, I am happy to make a PR.

