
Performance profiling and scalability #29

@cmdcolin

Description


Current architecture

As I currently understand it, the architecture of bilby-encoder is that you run it like this:

    python src/encoder.py -b regions.bed -f genome.fa -a file1.bam file2.bam -o output_dir

This produces a directory of .npy files (on-disk NumPy matrices) corresponding to the computed signals ("transition matrices").
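For downstream use, each output is a plain NumPy array. A minimal sketch of reading one back — the filename here is hypothetical, since the actual names depend on how encoder.py writes them:

```python
import numpy as np

# Round-trip a small stand-in "transition matrix" through the .npy format
# that encoder.py writes to its output directory. "region_demo.npy" is a
# made-up filename for illustration.
demo = np.arange(12, dtype=np.float32).reshape(3, 4)
np.save("region_demo.npy", demo)

# mmap_mode="r" avoids loading the whole array into memory, which may
# matter if the per-region matrices get large.
loaded = np.load("region_demo.npy", mmap_mode="r")
```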

I tested this out on some ENCODE files, using a sample of regions from GENCODE genes:

    #!/bin/bash

    wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.chr_patch_hapl_scaff.annotation.gff3.gz
    zcat gencode.v47.chr_patch_hapl_scaff.annotation.gff3.gz | awk '$3 ~ /gene/' | cut -f1,4,5 > gene_regions.bed

gene_regions.bed contains all the genomic regions covered by feature type 'gene' in the GENCODE gene predictions; there are about 86,000 of them.
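The same extraction could be done in pure Python if more control is needed — a stdlib-only sketch. One subtlety worth noting: awk's `$3 ~ /gene/` is a substring match, so the pipeline above also catches feature types like `pseudogene` and `ncRNA_gene`; the sketch mirrors that behavior.

```python
import gzip

def gff3_gene_regions(path):
    """Yield (chrom, start, end) strings for gene-like records in a gzipped GFF3.

    Mirrors: zcat file.gff3.gz | awk '$3 ~ /gene/' | cut -f1,4,5
    Use fields[2] == "gene" instead for an exact feature-type match.
    """
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip GFF3 header/comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 5 and "gene" in fields[2]:
                yield fields[0], fields[3], fields[4]
```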

Profiling

I took the first 1000 "gene regions" (`head -n 1000 gene_regions.bed > first1000.bed`) and ran the tool on 1, 2, and more BAM files.

I created a flamegraph to show where the time is spent. Most time is spent in the transition matrix code, which makes it the likely hot spot to optimize.

Here is the flamegraph (see https://www.brendangregg.com/flamegraphs.html for flamegraph background):

[flamegraph image: profile.svg]

Generated by py-spy (https://github.com/benfred/py-spy):

    py-spy record -o profile.svg -- python src/encoder.py -b subset.bed -f hg19.fa -a ENCFF712DCV.bam  -o outdir
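py-spy samples the process externally with no code changes. For exact per-function call counts and cumulative times, the stdlib cProfile module could also be pointed at the hot path — a sketch, where `hot_function` is just a stand-in for the transition-matrix code:

```python
import cProfile
import io
import pstats

def hot_function():
    # Stand-in for the transition-matrix computation.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
hot_function()
profiler.disable()

# Render the top entries sorted by cumulative time into a string.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```

Unlike py-spy's sampling, cProfile instruments every call, so it adds overhead but gives deterministic counts.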

Time taken for different numbers of BAM files processed

[plot: wall-clock time vs. number of BAM files processed]

To simulate running with multiple BAM files, I just ran the same BAM file repeatedly through the tool.

For example, for the 16-BAM run I ran this:

    time python src/encoder.py -b subset.bed -f hg19.mod.fa -a ENCFF712DCV.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam ENCFF759JTJ.bam -o outdir6

For 16 BAM files, it took a bit over an hour to run.

Memory usage profiling

I wanted to check memory usage to see if there was any effect, though it might not be too bad. For a single BAM file (with the 1000-region subset above), the memory usage looks like this:

The big spike might be the initial load of the human reference genome FASTA with SeqIO. Using pyfaidx could reduce that, since it provides indexed FASTA access; it would also help if the tool is parallelized, by making startup faster and avoiding loading the entire reference genome in each invocation/process.

[plot: memory usage over time, single BAM file]
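To illustrate what indexed access buys here, a stdlib-only sketch of the idea behind pyfaidx / samtools faidx: record each sequence's byte offset once, then seek() for subsequences instead of reading the whole genome into memory. This is a simplified illustration (it assumes unwrapped, single-line sequences); a real .fai index also stores line lengths to handle wrapped FASTA.

```python
def build_index(path):
    """Map each FASTA record name to the byte offset where its sequence starts."""
    index = {}
    with open(path, "rb") as fh:
        line = fh.readline()
        while line:
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                index[name] = fh.tell()  # offset just past the header line
            line = fh.readline()
    return index

def fetch(path, index, name, start, end):
    """Return the subsequence [start:end), 0-based half-open, via seek."""
    with open(path, "rb") as fh:
        fh.seek(index[name] + start)
        return fh.read(end - start).decode()
```

With pyfaidx the equivalent lookup keeps memory flat regardless of genome size, which is why it could flatten the startup spike seen in the plot.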

I didn't memory-profile the 16-BAM run, but it did not run out of memory. That said, it could be worth profiling the 16-BAM case to see if it is getting near limits.
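For a quick check of the 16-BAM case without external tooling, the stdlib tracemalloc module can report peak Python-level allocations around the main loop — a sketch with a stand-in workload:

```python
import tracemalloc

tracemalloc.start()

# Stand-in workload: hold one ~1 MB buffer per "BAM file", roughly the
# shape of keeping per-file state alive for 16 inputs at once.
buffers = [bytearray(1_000_000) for _ in range(16)]

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
```

Note tracemalloc only sees allocations made through Python's allocator; memory held inside C extensions (e.g. pysam/htslib buffers) would need an external profiler to capture.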
