Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
18 commits
Select commit Hold shift + click to select a range
7e9cae0
parallel blast beginnings. Will be a drop in replacment for par_block…
necrolyte2 Aug 14, 2015
94ccd51
merging test_built.py
necrolyte2 Aug 14, 2015
1341483
Added GNU Parallel install to pavement
necrolyte2 Aug 17, 2015
66bcbf8
ready to test blasting in parallel. Still need to figure out how to h…
necrolyte2 Aug 17, 2015
2490afa
Hopefully fixed implementation of parallel_blast. Will run tomorrow t…
necrolyte2 Aug 17, 2015
b19d34c
distinquishes between single node and multi node jobs via PE_HOSTFILE…
necrolyte2 Aug 18, 2015
815fc66
This seems to be the correct variation of options and arguments. Took…
necrolyte2 Aug 19, 2015
61aa85a
Blast is working. Now to parallelize diamond
necrolyte2 Aug 19, 2015
f5cfc1f
Diamond seems to be implemented now although it keeps gobbling up all…
necrolyte2 Aug 19, 2015
c0bb5dd
Now detects if blastn or blastp/blastx are being run and excludes the…
necrolyte2 Aug 20, 2015
22cef86
Addressing #281. Pipeline now downloads diamond from github. Shows wh…
necrolyte2 Aug 21, 2015
3487ea8
Preparing for bio_pieces parallel_blast
necrolyte2 Aug 21, 2015
c510483
iterative blast tests now run step1 if needed. Fixed bug with iter bl…
necrolyte2 Aug 24, 2015
8baab70
Touched up diamond db buliding docs. Closes #281
necrolyte2 Aug 24, 2015
205b9a4
Fixes a bug where orf_filter output fasta file was not being used to …
necrolyte2 Sep 2, 2015
5ef8d78
parallel_blast injection for pipeline instead of par_block_blast
necrolyte2 Sep 2, 2015
8a971a2
deltat now will print command that it is for
necrolyte2 Sep 3, 2015
fb28016
Fixes a bug where blastx was ignored. Fixes #283
necrolyte2 Sep 3, 2015
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 16 additions & 6 deletions docs/source/databases.rst
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ Create databases directory structure

.. code-block:: bash

mkdir -p ~/databases/{humandna,humanrna,ncbi}
mkdir -p ~/databases/{humandna,humanrna,ncbi,diamond}
mkdir -p ~/databases/ncbi/blast/{nt,nr}
mkdir -p ~/databases/ncbi/taxonomy

Expand Down Expand Up @@ -91,26 +91,34 @@ Diamond
Diamond corresponds to :doc:`stages/iterative_blast_phylo`'s
``blast_db_list``

Download and index protein database for diamond blastx
Download and index protein database for diamond blastx.

You should make sure to view the `makedb`_ command, specifically the part about
how the ``-b`` option controls how much memory diamond will take when it runs
using the database you build. In the example below we use ``-b 2`` which should
consume about 12GB of RAM so if you don't have that much RAM on your computer where
you will run diamond, or if you have way more RAM you may want to change that
number as the higher it is, the faster diamond will run but consume more memory.

You can use the ``free -hg`` command which will show you how many GB you have(listed
under the Total column)

.. code-block:: bash

mkdir -p ~/databases/diamond
pushd ~/databases/diamond
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz
gunzip nr.gz
diamond makedb -p 12 -d diamondnr -v --log --in nr -b 0.5
diamond makedb -d diamondnr -v --log --in nr -b 2
popd

Alternatively you can generate the diamond database from an already downloaded
blast nr database

.. code-block:: bash

mkdir -p ~/databases/diamond
pushd ~/databases/diamond
blastdbcmd -db ~/databases/ncbi/blast/nr/nr -entry all > blastnr.fasta
diamond makedb -d diamondnr --log --in blastnr.fasta -b 0.5
diamond makedb -d diamondnr --log --in blastnr.fasta -b 2
rm blastnr.fasta

Host Genome Setup
Expand Down Expand Up @@ -249,3 +257,5 @@ Note: This command is only available after you install. Unfortunately at this po
You will probably want to ensure that the pipeline can find all of your databases. There is now a handy script that you can use to do this prior to installing.

:doc:`scripts/verifydatabases`

.. _makedb: https://github.com/bbuchfink/diamond#makedb-options
8 changes: 5 additions & 3 deletions pathdiscov/Local_Module/Verbose_Sys.pm
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
use strict;
use File::Basename;

# run a system commmand "verbosely", echo-ing the command as well as giving the time
sub verbose_system
Expand All @@ -7,12 +8,13 @@ sub verbose_system
# @_ exists magically - anything passed in

my $cmd = shift;
my @parts = split(' ', $cmd);
my $exe = basename($parts[0]);

print "[cmd] ",$cmd,"\n";
my $start_time = time();
system($cmd);
print_system($cmd);
my $end_time = time();
print "[deltat] ",$end_time-$start_time,"\n";
print "[$exe deltat] ",$end_time-$start_time,"\n";
}

# run a system commmand but print the command first
Expand Down
Binary file removed pathdiscov/download/diamond64/diamond
Binary file not shown.
2 changes: 1 addition & 1 deletion pathdiscov/files/sample.param.base
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ command iterative_blast_phylo
blast_db_list BLAST_NT,BLAST_NT #,DIAMOND_NR blast db prefix
blast_task_list megablast,dc-megablast # options are: megablast dc-megablast blastn, diamond
# blast_options_list -evalue 1e-4 -word_size 28,-evalue 1e-4 -word_size 12,-evalue 1e-4 -word_size 7 # blast options (except for: -task -query -db -out -outfmt -num_descriptions; these are
blast_options_list -evalue 1e-4 -word_size 28,-evalue 1e-4 -word_size 12 ##,--compress 0 -p 0 -v -k 10 -w 28 --id 0.7 -c 4 -t TEMPDIR blast options (except for: -task -query -db -out -outfmt -num_descriptions; these are hardwired)
blast_options_list -evalue 1e-4 -word_size 28,-evalue 1e-4 -word_size 12 ##,-v -k 10 -w 28 --id 0.7 -c 4 -t TEMPDIR blast options (except for: -task -query -db -out -outfmt -num_descriptions; these are hardwired)
ninst_list NUMINST,NUMINST # the input file will be broken into chunks and blasted in parallel - this parameter is the number of instances of BLAST you want to run in parallel
taxonomy_names TAX_NAMES # NCBI taxonomy names dump file
taxonomy_nodes TAX_NODES # NCBI taxonomy nodes dump file
Expand Down
12 changes: 7 additions & 5 deletions pathdiscov/iterative_blast_phylo/blast_wrapper.pl
Original file line number Diff line number Diff line change
Expand Up @@ -48,23 +48,25 @@
if ($type eq "diamond" || !(defined($task)))
{
$task_option="blastx";

}
elsif($type eq "blastx")
{
$task_option = "";
}
else
{
$task_option="-task $task";
$task_option="-task $task";
}


if ($type eq "blastn")
if ($type ne "diamond")
{
print "[start]\n";
my $cmd = "$type -query $query -db $db $task_option -out $out -outfmt $outfmt -max_target_seqs 10 $options";
verbose_system($cmd);
print "[end]\n";
}

if ($type eq "diamond")
else
{
print "[start]\n";
my $cmd = "$type $task_option $options -q $query -d $db -o $out";
Expand Down
32 changes: 27 additions & 5 deletions pathdiscov/iterative_blast_phylo/iterative_blast_phylo.pl
Original file line number Diff line number Diff line change
Expand Up @@ -209,28 +209,50 @@
print_system($cmd);

my $inputfasta = "$outputdir/$j.$mate.fasta";

my $blastexe = "blastn";
my $blasttask = "--task $blast_task_list[$i]";
my $outfmt="6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore";
# Start building the --blast_options string
my $blastoptions = $blast_options_list[$i];
# If blast type then we will force the format
if( $blast_task_list[$i] ne "diamond" ) {
# Must specify same format as diamond to keep things consistent
$blastoptions = $blastoptions . " -outfmt \\\"$outfmt\\\"";
}
# Ensure string is wrapped in quotes
$blastoptions = "$blastoptions";

# Filter using get_orf
if( $blast_task_list[$i] eq "diamond" || $blast_task_list[$i] eq "blastx" )
{
$blastexe = $blast_task_list[$i];
$blasttask = "";
if( $blast_task_list[$i] eq "diamond" ) {
$blasttask = "--task blastx";
}
print "[echo] orf filtering $mate prior to $blast_task_list[$i]\n";
my $odir = "tmp_${mate}_${j}/orf_filter";
my $logs = "$odir/logs";
my $matename = $mate;
print_system("mkdir -p $logs");
my $cmd = "orf_filter.pl --outputdir $odir --logs $logs --paramfile $abs_pfile --R1 $outputdir/$j.$mate.fasta --sample $sample --timestamp $timestamp";
if($contig) {
$matename = 'R1';
}
my $cmd = "orf_filter.pl --outputdir $odir --logs $logs --paramfile $abs_pfile --$matename $outputdir/$j.$mate.fasta --sample $sample --timestamp $timestamp";
if($contig) {
$cmd .= " --contig 1";
}
verbose_system($cmd);
# New input for diamond/blastx will be orf filtered fasta
$inputfasta = "$outputdir/$odir/orf_filter.$mate";
print "[debug] orf_filtered input $inputfasta\n";
print "[debug] orf_filtered input for $blast_task_list[$i] will now be $inputfasta\n";
my $cmd = "linecount $inputfasta orf_filter $mate.count 2 1";
print_system($cmd);
}

my $blast_db_nr;
my $cmd = "$path_scripts/par_block_blast.pl --outputdir tmp_".$mate."_$j --inputfasta $inputfasta --db $blast_db_list[$i] --blast_type $blast_task_list[$i] --task $blast_task_list[$i] --ninst $ninst_list[$i] --outfile $outputdir/$j.$mate.blast --outheader $outputdir/blast.header --blast_options \"$blast_options_list[$i]\"";
my $cmd = "parallel_blast $inputfasta $outputdir/$j.$mate.blast --ninst $ninst_list[$i] --db $blast_db_list[$i] --blast_exe $blastexe $blasttask --blast_options \"$blastoptions\"";

verbose_system($cmd);

print "[echo] get phylogeny counts\n";
Expand Down Expand Up @@ -287,7 +309,7 @@

# get reads that didnt blast
# args: blast output, fasta input
my $cmd = "$path_scripts/get_unblast_reads.pl $j.$mate.blast $j.$mate.fasta > $j.$mate.noblast.fasta";
my $cmd = "$path_scripts/get_unblast_reads.pl $j.$mate.blast $inputfasta > $j.$mate.noblast.fasta";
verbose_system($cmd);

# args: input file, filtering_program_name, output file, 2->fasta, concat
Expand Down
14 changes: 8 additions & 6 deletions pathdiscov/orf_filter/orf_filter.pl
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,8 @@
'contig=i' => \$contig, # contig bool
'timestamp=s' => \$timestamp); # time stamp

die "[error] required input parameters not found" if (!( defined($outputdir) && defined($logs) && defined($pfile) && defined($r1) ));
die "[error] required input parameters not found" if (!( defined($outputdir) && defined($logs) && defined($pfile) ));
die "[error] required input parameters not found" if ( !defined($r1) && !defined($r2) );

@mates=("contig") if ($contig); # Reset mates to the name contig

Expand Down Expand Up @@ -119,17 +120,18 @@
#my %h_fasta = map {/>(\w+)_(\w+)\s(.*)/; $1 => 1} split(/\n/, `cat $mate.orfout.fa`);
# Should be >(anything)_\d\s(.*
my %h_fasta = map {/>(.*?)_(\d+)\s(.*)/; $1 => 1} split(/\n/, `cat $mate.orfout.fa`);
#print Dumper \ %h_fasta;
# print Dumper \ %h_fasta;

print "[echo] filter $hoh{$command}{$mate} by orfs into $command.$mate\n";
get_subset_by_fastaid($hoh{$command}{$mate}, "$command.$mate", \%h_fasta);
verbose_system("linecount $command.$mate orf_filter $mate.count fastq 1");
}
else {
print "Doing nothing because there are no getorf_options set\n";
print "[info] Value of \$hoh{".$command."}{\"getorf_options\"}:'". $hoh{$command}{"getorf_options"} ."'\n";
print "[error] there are no getorf_options set\n";
}
} # defined
elsif ($mate eq "R1")
elsif (($mate eq "R1" || $mate eq "contig") && !defined($r2))
{
if(! -s $hoh{$command}{$mate} ) {
print("$hoh{$command}{$mate} is empty\n");
Expand Down Expand Up @@ -211,9 +213,9 @@ sub get_subset_by_fastaid
$_ = <$fh>;
if($_ =~ /^@/) {
$format = "fastq";
print("Detected $infile as fastq\n");
print("[info] Detected $infile as fastq\n");
} else {
print("Detected $infile as fasta\n");
print("[info] Detected $infile as fasta\n");
}
close($fh);

Expand Down
37 changes: 35 additions & 2 deletions pavement.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,10 @@


options(setup=setup_dict,
diamond=Bunch(
url='https://github.com/bbuchfink/diamond/releases/download/v0.7.9/diamond-linux64.tar.gz',
sdir=path('pathdiscov/download'),
),
bwa=Bunch(
sdir=path('pathdiscov/download'),
bindir=path('pathdiscov/bin')
Expand Down Expand Up @@ -59,7 +63,10 @@
downloads=path('pathdiscov/download'),
installdir=join(sys.prefix,'lib')
),

parallel=Bunch(
downloads=path('pathdiscov/download'),
url='http://ftp.gnu.org/gnu/parallel/parallel-20150722.tar.bz2'
),
wkhtmltopdf=Bunch(
sfile = path('pathdiscov/download/wkhtmltopdf'),
olink =path('pathdiscov/bin')
Expand Down Expand Up @@ -293,6 +300,19 @@ def download_install_fastqc(options):
else:
info("fastqc symlink already exists")

@task
def download_unpack_diamond(options):
diamondbin=join(sys.prefix,'bin','diamond')
if not exists(diamondbin):
info('Downloading diamond')
cmd = 'cd {0} && wget {1} -O- | tar xzvf -'
sh(
cmd.format(
options.diamond.sdir,
options.diamond.url
)
)

@task
def download_compile_bwa(options):
"""installs the current package"""
Expand All @@ -303,6 +323,19 @@ def download_compile_bwa(options):
sdir = path(currwd) / options.bwa.sdir
sh('(cd %s; wget https://github.com/lh3/bwa/archive/0.7.10.tar.gz -O- | tar xzf -; mv bwa-* bwa; cd bwa; make; cd %s)' % (sdir, sdir))

@task
def download_compile_gnuparallel(options):
''' Installs GNU Parallel '''
prefix = sys.prefix
p_bin = join(sys.prefix,'bin','parallel')
options.url
if not exists(p_bin):
info("Installing GNU Parallel")
sh('cd {0}; wget {1} -O- | tar xjf -; cd parallel-*; ' \
'./configure --prefix={2}; make && make install;'.format(
options.downloads, options.url, prefix)
)

@task
def download_compile_samtools(options):
"""installs the current package"""
Expand Down Expand Up @@ -421,7 +454,7 @@ def install_dependencies():
pass

@task
@needs('download_compile_bwa','download_compile_samtools','refRay','getorf','download_install_fastqc','installSnap','perl_modules')
@needs('download_compile_bwa','download_compile_samtools','refRay','getorf','download_install_fastqc','installSnap','perl_modules','download_compile_gnuparallel','download_unpack_diamond')
def install_other_dependencies():
pass

Expand Down
7 changes: 4 additions & 3 deletions setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -282,6 +282,7 @@ def run_tests(self):
'sff2fastq = pathdiscov.sff2fastq:main',
'step1 = pathdiscov.stages.step1:main',
'get_blast_reads = pathdiscov.get_blast_reads:main',
'parallel_blast = pathdiscov.parallel_blast:main',
],
},
# These all get copied to our installation's bin folder for us
Expand All @@ -294,7 +295,7 @@ def run_tests(self):
'pathdiscov/download/ray/Ray',
'pathdiscov/download/Ray2',
'pathdiscov/download/bowtie2/bowtie2',
'pathdiscov/download/diamond64/diamond',
'pathdiscov/download/diamond',
'pathdiscov/download/snap/snap',
'pathdiscov/quality_filter/quality_filter.pl',
'pathdiscov/host_map/host_map.pl',
Expand All @@ -303,8 +304,8 @@ def run_tests(self):
'pathdiscov/ray2_assembly/joinlines.sh',
'pathdiscov/ray2_assembly/ray2_assembly.pl',
'pathdiscov/iterative_blast_phylo/annotate_blast.sh',
'pathdiscov/iterative_blast_phylo/blast_wrapper.pl',
'pathdiscov/iterative_blast_phylo/block_blast.sh',
#'pathdiscov/iterative_blast_phylo/blast_wrapper.pl',
#'pathdiscov/iterative_blast_phylo/block_blast.sh',
'pathdiscov/iterative_blast_phylo/fastq2fasta.awk',
'pathdiscov/iterative_blast_phylo/format_iterative_blast_phylo.pl',
'pathdiscov/iterative_blast_phylo/get_unblast_reads.pl',
Expand Down
14 changes: 13 additions & 1 deletion tests/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,8 @@
import sys
import glob

from nose.plugins.attrib import attr

import mock

import unittest2 as unittest
Expand Down Expand Up @@ -303,17 +305,27 @@ def _write_param(self, txt, parampath=None):
fh.write(txt)
return abspath(parampath)

def assertNotEmpty(self, path):
self.assertTrue(exists(path), '{0} does not exist'.format(path))
self.assertNotEqual(0, os.stat(path).st_size, "{0} is empty".format(path))

def _verify_countfile(self, expect, countpath):
'''
expect is list((name,count),(name2,count)...)
where each tuple should match a line in countpath file
'''
fh = open(countpath)
for line, ex in zip(fh, expect):
lines = [line.strip() for line in fh]
print "Expected: {0}".format(expect)
print "Found: {0}".format(lines)
self.assertEqual(len(expect), len(lines), "More count entries than expected")
for line, ex in zip(lines, expect):
name, count = line.strip().split()
count = float(count)
exc = float(ex[1])
self.assertEqual(ex[0], name)
print line
print ex
self.assertAlmostEqual(exc, count, delta=2)

def aexists(path, msg=None):
Expand Down
4 changes: 4 additions & 0 deletions tests/rikkcdna/test_built.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
from os.path import *

def test_tst_databases_are_built():
assert exists('rikkcdna.nin'), "You did not build databases. See rikkcdna/Readme.rst"
Loading