Performance profiling and scalability #29
Description
Current architecture
As I understand it, the current architecture of bilby-encoder is that you run it like this:
python src/encoder.py -b regions.bed -f genome.fa -a file1.bam file2.bam -o output_dir
This produces a set of .npy files (on-disk NumPy matrices) in the output directory, corresponding to the computed signals ("transition matrices").
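Each output can be loaded back with NumPy for inspection. A minimal sketch — the filename `region_0001.npy` and the matrix shape here are made up, the real naming scheme and dimensions are whatever encoder.py emits:

```python
import os
import tempfile

import numpy as np

# Stand-in for one of the tool's outputs: a small dummy "transition matrix".
matrix = np.arange(12, dtype=np.float32).reshape(3, 4)

outdir = tempfile.mkdtemp()
path = os.path.join(outdir, "region_0001.npy")  # hypothetical filename
np.save(path, matrix)

# Loading an output back for inspection:
loaded = np.load(path)
print(loaded.shape, loaded.dtype)
```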
I tested this out on some ENCODE files, using a sample of regions from GENCODE genes:
#!/bin/bash
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.chr_patch_hapl_scaff.annotation.gff3.gz
zcat gencode.v47.chr_patch_hapl_scaff.annotation.gff3.gz | awk '$3 ~ /gene/' | cut -f1,4,5 > gene_regions.bed
gene_regions.bed contains all the genomic regions covered by feature type 'gene' in the GENCODE gene predictions; there are about 86,000 of them.
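One caveat on the extraction above: GFF3 coordinates are 1-based and end-inclusive, while BED is 0-based and end-exclusive, so the `cut` output is not strictly BED — whether that matters depends on how encoder.py interprets the file. A sketch of the conversion (the function name is mine):

```python
def gff3_to_bed(chrom, start, end):
    """Convert a 1-based, end-inclusive GFF3 interval to a
    0-based, half-open BED interval (the start shifts down by one)."""
    return chrom, int(start) - 1, int(end)

# A GFF3 feature spanning bases 100..200 (inclusive) becomes BED 99..200.
print(gff3_to_bed("chr1", 100, 200))  # -> ('chr1', 99, 200)
```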
Profiling
I took the first 1000 gene regions (head -n 1000 gene_regions.bed > first1000.bed) and ran the tool on 1, 2, and more BAM files.
I created a flamegraph to show where the time goes. Most time is spent in the transition matrix code, which could be a hot spot worth optimizing.
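I haven't dug into the transition matrix code itself, so purely as an illustration of the kind of speedup that may be available in a hot spot like this: counting base-to-base transitions along a sequence with a Python loop vs. a vectorized NumPy equivalent. The integer encoding and 4x4 layout here are my assumptions, not the tool's:

```python
import numpy as np

# Hypothetical: a sequence encoded as integers 0..3 (A, C, G, T).
rng = np.random.default_rng(0)
seq = rng.integers(0, 4, size=100_000)

def transitions_loop(seq):
    # Straightforward Python loop over adjacent base pairs.
    counts = np.zeros((4, 4), dtype=np.int64)
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    return counts

def transitions_vectorized(seq):
    # Flatten each (a, b) pair to a single index 4*a + b, then bincount.
    idx = 4 * seq[:-1] + seq[1:]
    return np.bincount(idx, minlength=16).reshape(4, 4)

assert np.array_equal(transitions_loop(seq), transitions_vectorized(seq))
```

The vectorized version pushes the inner loop into C, which is typically orders of magnitude faster for arrays this size.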
Here is the flamegraph (see https://www.brendangregg.com/flamegraphs.html for flamegraph background),
generated by py-spy (https://github.com/benfred/py-spy):
py-spy record -o profile.svg -- python src/encoder.py -b subset.bed -f hg19.fa -a ENCFF712DCV.bam -o outdir
Time taken for different numbers of BAM files processed
To simulate running with multiple BAM files, I just passed the same BAM file to the tool repeatedly.
e.g. for the 16-file run I ran this:
time python src/encoder.py -b subset.bed -f hg19.mod.fa -a ENCFF712DCV.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam -o outdir6
The 16-file run took a bit over an hour.
Memory usage profiling
I wanted to check memory usage to see if there was any effect, though it might not be too bad. For a single BAM file (with the 1000-region subset above), the memory usage looks like this:
The big spike might be the initial load of the human reference genome FASTA with SeqIO. Using pyfaidx could potentially reduce that, since it provides indexed FASTA access. It would also help if the tool is parallelized, making startup faster by not loading the entire reference genome in each invocation/process.
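pyfaidx works by building a .fai-style index so a region can be fetched by seeking rather than parsing the whole file. As a self-contained illustration of that idea (pyfaidx itself handles this properly, including line wrapping; this toy indexer assumes single-line sequence records):

```python
import tempfile

# Toy FASTA with single-line records. Real FASTA wraps sequence lines;
# pyfaidx and samtools faidx handle that — this sketch does not.
fasta_text = ">chr1\nACGTACGTAC\n>chr2\nTTTTGGGGCC\n"
path = tempfile.mkstemp(suffix=".fa")[1]
with open(path, "w") as fh:
    fh.write(fasta_text)

# Index pass: record the byte offset where each sequence line starts.
index = {}
with open(path, "rb") as fh:
    while True:
        header = fh.readline()
        if not header:
            break
        name = header[1:].split()[0].decode()
        index[name] = fh.tell()  # sequence starts right after the header
        fh.readline()            # skip the sequence line itself

def fetch(name, start, end):
    """Return bases [start, end) of a record by seeking, without
    loading the whole file into memory."""
    with open(path, "rb") as fh:
        fh.seek(index[name] + start)
        return fh.read(end - start).decode()

print(fetch("chr2", 4, 8))  # -> "GGGG"
```

The point is that only the requested bytes are read per lookup, so many processes can share the same FASTA without each paying the full-genome load cost.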
I didn't memory-profile the 16-BAM run, but it did not run out of memory. That said, it could be worth profiling the 16-BAM case to see whether it is getting near limits.
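For that, Python's stdlib tracemalloc gives peak allocation without extra tooling. A sketch — `process_regions` here is a stand-in workload; in practice you would wrap whatever entry point encoder.py exposes (or its `main()` called with the 16-BAM argument list):

```python
import tracemalloc

def process_regions():
    # Stand-in workload: five 1 MB buffers held live at once.
    return [bytearray(1_000_000) for _ in range(5)]

tracemalloc.start()
result = process_regions()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak allocation: {peak / 1e6:.1f} MB")
```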


