
avoid grepping for each line...#2

Open
bpow wants to merge 1 commit into cbuhay:master from bpow:faster-vega

bpow commented Apr 3, 2015

Spawning a system grep process for each line of the reference databases (or two grep processes per line in the case of VEGADB) is inefficient.

On my desktop, setup.sh takes almost an hour. Most of this time is spent in the check_HGNC_individual_VEGADB.pl script. This patch does the equivalent work in a matter of seconds.
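For illustration, the general idea (read the reference data once, build an in-memory index, then answer each query with a lookup instead of a new grep process) can be sketched as below. This is a minimal sketch in Python rather than the project's Perl, and it is not the code from the patch: the function name `build_word_index` and the sample records are hypothetical.

```python
import re

def build_word_index(lines):
    """Map each whitespace- or comma-delimited token to the lines containing it.

    Hypothetical helper for illustration; not taken from the patch.
    """
    index = {}
    for line in lines:
        # set() so a token repeated within one line is recorded only once
        for token in set(re.split(r"[\s,]+", line.strip())):
            if token:
                index.setdefault(token, []).append(line)
    return index

# Made-up reference records (an ID column, then comma-separated names)
reference = [
    "ID0001\tGENE1,ALIAS1",
    "ID0002\tGENE2",
]

index = build_word_index(reference)   # built once, in a single pass
hits = index.get("ALIAS1", [])        # each query is then a dictionary lookup
misses = index.get("GENE3", [])       # unknown names return an empty list
```

The index costs one pass over the reference file, after which every lookup avoids both a process launch and a rescan of the file.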

The overall setup.sh still takes >20 minutes because something else becomes the rate-limiting step. Something similar could be done for the other check_HGNC_individual*pl scripts, but you would have to be careful, because building an index by splitting on [\s,] is not the same as grep -w. For example, hsa-mir-511 matches hsa-mir-511-1 when using grep -w (the hyphen counts as a word boundary), but would not match with a simple index of "words" as done here.
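The difference can be reproduced with a small sketch. It is written in Python rather than Perl, and `grep_w` is only an approximation of grep -w built from lookarounds (word characters assumed to be ASCII letters, digits, and underscore); both function names and the sample line are hypothetical.

```python
import re

def grep_w(pattern, line):
    """Approximate `grep -w` for a fixed string: the match must not be
    adjacent to a word character (letter, digit, or underscore)."""
    rx = re.compile(r"(?<!\w)" + re.escape(pattern) + r"(?!\w)", re.ASCII)
    return rx.search(line) is not None

def token_match(pattern, line):
    """Exact match against tokens obtained by splitting on whitespace or commas."""
    return pattern in re.split(r"[\s,]+", line.strip())

line = "hsa-mir-511-1, MIR511-1"  # made-up reference line

grep_hit = grep_w("hsa-mir-511", line)        # hyphen after "511" is a word boundary
token_hit = token_match("hsa-mir-511", line)  # the whole token is "hsa-mir-511-1"
print(grep_hit, token_hit)
```

An index-based replacement for the other scripts would therefore need either a different tokenization or an extra check to reproduce the grep -w results.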

So this could just be considered an example of how to address an inefficiency in the setup process.

Rashesh7 (Collaborator) commented Apr 9, 2015

Hi,

This is really a great suggestion. Yes, the patch will need to be adapted for each database accordingly, since building a hash creates a strict (exact-match) key for each gene.

Thank you.
