
Performance profiling and scalability #29

@cmdcolin

Description


Current architecture

As I currently understand it, the architecture of bilby-encoder is that you run it like this:

    python src/encoder.py -b regions.bed -f genome.fa -a file1.bam file2.bam -o output_dir

This produces a directory of .npy files (on-disk NumPy matrices) corresponding to the computed signals ("transition matrices").
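For downstream use, each output is a plain NumPy array. A minimal sketch of reading one back — the filename here is hypothetical, since the actual names depend on how encoder.py writes them:

```python
import numpy as np

# Round-trip a small stand-in "transition matrix" through the .npy format
# that encoder.py writes to its output directory. "region_demo.npy" is a
# made-up filename for illustration.
demo = np.arange(12, dtype=np.float32).reshape(3, 4)
np.save("region_demo.npy", demo)

# mmap_mode="r" avoids loading the whole array into memory, which may
# matter if the per-region matrices get large.
loaded = np.load("region_demo.npy", mmap_mode="r")
```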

I tested this out on some ENCODE files, using a sample of regions from GENCODE genes:

    #!/bin/bash

    wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.chr_patch_hapl_scaff.annotation.gff3.gz
    zcat gencode.v47.chr_patch_hapl_scaff.annotation.gff3.gz | awk '$3 ~ /gene/' | cut -f1,4,5 > gene_regions.bed

gene_regions.bed contains all the genomic regions covered by feature type 'gene' in the GENCODE gene predictions; there are about 86,000 of them.
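The same extraction could be done in pure Python if more control is needed — a stdlib-only sketch. One subtlety worth noting: awk's `$3 ~ /gene/` is a substring match, so the pipeline above also catches feature types like `pseudogene` and `ncRNA_gene`; the sketch mirrors that behavior.

```python
import gzip

def gff3_gene_regions(path):
    """Yield (chrom, start, end) strings for gene-like records in a gzipped GFF3.

    Mirrors: zcat file.gff3.gz | awk '$3 ~ /gene/' | cut -f1,4,5
    Use fields[2] == "gene" instead for an exact feature-type match.
    """
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip GFF3 header/comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 5 and "gene" in fields[2]:
                yield fields[0], fields[3], fields[4]
```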

Profiling

I took the first 1000 "gene regions" (`head -n 1000 gene_regions.bed > first1000.bed`) and ran the tool on 1, 2, and more BAM files.

I created a flamegraph to show where the time is spent. Most time is spent in the transition matrix code, which makes it the likely hot spot to optimize.

Here is the flamegraph (see https://www.brendangregg.com/flamegraphs.html for flamegraph background):

[flamegraph image: profile.svg]

Generated by py-spy (https://github.com/benfred/py-spy):

    py-spy record -o profile.svg -- python src/encoder.py -b subset.bed -f hg19.fa -a ENCFF712DCV.bam  -o outdir
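py-spy samples the process externally with no code changes. For exact per-function call counts and cumulative times, the stdlib cProfile module could also be pointed at the hot path — a sketch, where `hot_function` is just a stand-in for the transition-matrix code:

```python
import cProfile
import io
import pstats

def hot_function():
    # Stand-in for the transition-matrix computation.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
hot_function()
profiler.disable()

# Render the top entries sorted by cumulative time into a string.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```

Unlike py-spy's sampling, cProfile instruments every call, so it adds overhead but gives deterministic counts.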

Time taken for different numbers of BAM files processed

[plot: wall-clock time vs. number of BAM files processed]

To simulate running with multiple BAM files, I just ran the same BAM file repeatedly through the tool.

For example, for the 16-BAM run I ran this:

    time python src/encoder.py -b subset.bed -f hg19.mod.fa -a ENCFF712DCV.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam -o outdir6

For 16 BAM files, it took a bit over an hour to run.

Memory usage profiling

I wanted to check memory usage to see if there was any effect, though it might not be too bad. For a single BAM file (with the 1000-region subset above), the memory usage looks like this:

The big spike might be the initial load of the human reference genome FASTA with SeqIO. Using pyfaidx could reduce that, since it provides indexed FASTA access; it would also help if the tool is parallelized, by making startup faster and avoiding loading the entire reference genome in each invocation/process.

[plot: memory usage over time, single BAM file]
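To illustrate what indexed access buys here, a stdlib-only sketch of the idea behind pyfaidx / samtools faidx: record each sequence's byte offset once, then seek() for subsequences instead of reading the whole genome into memory. This is a simplified illustration (it assumes unwrapped, single-line sequences); a real .fai index also stores line lengths to handle wrapped FASTA.

```python
def build_index(path):
    """Map each FASTA record name to the byte offset where its sequence starts."""
    index = {}
    with open(path, "rb") as fh:
        line = fh.readline()
        while line:
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                index[name] = fh.tell()  # offset just past the header line
            line = fh.readline()
    return index

def fetch(path, index, name, start, end):
    """Return the subsequence [start:end), 0-based half-open, via seek."""
    with open(path, "rb") as fh:
        fh.seek(index[name] + start)
        return fh.read(end - start).decode()
```

With pyfaidx the equivalent lookup keeps memory flat regardless of genome size, which is why it could flatten the startup spike seen in the plot.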

I didn't memory-profile the 16-BAM run, but it did not run out of memory. That said, it could be worth profiling the 16-BAM case to see if it is getting near limits.
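For a quick check of the 16-BAM case without external tooling, the stdlib tracemalloc module can report peak Python-level allocations around the main loop — a sketch with a stand-in workload:

```python
import tracemalloc

tracemalloc.start()

# Stand-in workload: hold one ~1 MB buffer per "BAM file", roughly the
# shape of keeping per-file state alive for 16 inputs at once.
buffers = [bytearray(1_000_000) for _ in range(16)]

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
```

Note tracemalloc only sees allocations made through Python's allocator; memory held inside C extensions (e.g. pysam/htslib buffers) would need an external profiler to capture.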
