Release Notes 2.0.1
There are lots of exciting enhancements in RUM 2.0. This document highlights the most visible changes between RUM 1.x and RUM 2.0.
RUM now follows the standard installation convention for Perl modules: we use Makefile.PL to drive the installation. Please see the README.md file in your distribution or the [Installation wiki page](Installing RUM) for installation instructions.
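If you have installed Perl modules before, the steps should look like the usual Makefile.PL dance. This is a sketch of the convention; README.md remains the authoritative source:

```
# Standard Perl module installation (see README.md for specifics)
perl Makefile.PL
make
make install
```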
We have revised the SAM-formatted output so that it validates against the Picard SAM parser.
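If you want to verify this yourself, Picard's ValidateSamFile tool is one way to do it. The jar path and SAM file name below are illustrative:

```
# Validate RUM's SAM output with Picard (paths are illustrative)
java -jar picard.jar ValidateSamFile I=RUM.sam MODE=SUMMARY
```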
The command-line interface for RUM has changed in the following ways:
- `RUM_runner.pl` has been renamed to `rum_runner`.
- Command-line options now follow standard Unix conventions. Long options start with two dashes, e.g. `--index` or `--strand-specific`. Short options have only one dash, e.g. `-o`. Please run `rum_runner help` for help information.
- `rum_runner` now has multiple actions that it can perform: running a pipeline, checking the status of a job, killing a job, and performing other common actions. See the Actions section below for a full list and a description of each.
`rum_runner` now has several different actions you can run. An action is
specified by the first command-line argument (an example session follows the list):

- `rum_runner align ...`: Run the pipeline on a specified output directory, starting at the first step that hasn't already been run.
- `rum_runner clean -o *dir*`: Remove intermediate output files from a specified output directory. Useful if you ran `rum_runner align --no-clean ...`.
- `rum_runner help [action]`: Get help. Use `rum_runner help *action*` to get help for a particular action.
- `rum_runner kill -o *dir*`: Stop a job running in a specified directory.
- `rum_runner status -o *dir*`: Check on the status of a job in a specified directory.
- `rum_runner version`: Print out the version of RUM.
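For example, a typical session might look like this (the directory name sample01 is illustrative):

```
rum_runner version              # print the version of RUM
rum_runner help align           # detailed help for the align action
rum_runner status -o sample01   # progress report for the job in sample01
```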
You can now get a simple report showing the progress of the pipeline
by running `rum_runner status -o *output-dir*`.
It will show you whether you're in the "Preprocessing", "Processing", or "Postprocessing" phase. In the processing phase, it will show you each step, along with an X for each chunk that has completed that step. It should look something like this:

```
Processing in 10 chunks
-----------------------
XXXXXXXXXX Run bowtie on genome
XXXXXXXXXX Run bowtie on transcriptome
XXXXXXXXXX Separate unique and non-unique mappers from genome bowtie output
XXXXXXXXXX Separate unique and non-unique mappers from transcriptome bowtie
           output
XXXXXXXXXX Merge unique mappers together
XXXXXXXXXX Merge non-unique mappers together
XXXXXXXXXX Make a file containing the unmapped reads, to be passed into
           blat
XXX X X X  Run blat on unmapped reads
XXX X X X  Run mdust on unmapped reads
X          Parse blat output
X          Merge bowtie and blat results
           Clean up RUM files
           Produce RUM_Unique
           Sort RUM_Unique by location
           Sort cleaned non-unique mappers by ID
           Remove duplicates from NU
           Create SAM file
           Create non-unique stats
           Sort RUM_NU
           Generate quants
```
This shows that all 10 chunks are past the "Make a file containing the unmapped reads..." step, 6 of them are past the "Run mdust..." step, and one is past the "Merge bowtie and blat results" step.
In the postprocessing phase, the output should look something like this:
```
Postprocessing
--------------
X Merge RUM_NU files
X Make non-unique coverage
X Merge RUM_Unique files
X Compute mapping statistics
X Make unique coverage
X Finish mapping stats
X Merge SAM headers
X Concatenate SAM files
X Merge novel exons
X Merge quants
  make_junctions
  Sort junctions (all, bed) by location
  Sort junctions (all, rum) by location
  Sort junctions (high-quality, bed) by location
  Get inferred internal exons
```
There is only one column of X's, since the postprocessing phase is not run in parallel.
If you start running a job and it stops for some reason, simply
running `rum_runner align -o *dir*` should make it pick up from the
last step it successfully ran. For example, suppose you run a job like this:

```
rum_runner align \
    -o sample01 \
    --name sample01 \
    --index ~/rum-indexes/mm9 \
    --chunks 30 \
    ~/samples/sample01/forward.fq ~/samples/sample01/reverse.fq
```

rum_runner will save the settings for the job, including the name,
number of chunks, input files, and any other parameters, to the file
sample01/.rum/job_settings. As it runs, it will keep track of the
state of the pipeline, based on which intermediate files are
present. Then if you stop the job or it fails for some reason, you
should be able to restart it again simply by running
```
rum_runner align -o sample01
```

It will load the settings from sample01/.rum/job_settings, examine
the intermediate files to figure out what state it was in when it
stopped, and then restart from there. If you're running it in a
terminal with a single chunk, it will tell you which steps it is skipping:
```
(skipping) Run bowtie on genome
(skipping) Run bowtie on transcriptome
(skipping) Separate unique and non-unique mappers from genome bowtie output
(skipping) Separate unique and non-unique mappers from transcriptome bowtie
           output
(skipping) Merge unique mappers together
(running)  Merge non-unique mappers together
(running)  Make a file containing the unmapped reads, to be passed into
           blat
(running)  Run blat on unmapped reads
(running)  Run mdust on unmapped reads
(running)  Parse blat output
(running)  Merge bowtie and blat results
```

Restarting RUM alignments on a compute cluster (e.g. with --qsub) is the same as restarting local runs:
```
rum_runner align -o sample01
```

The difference is that you will not see the status messages; they will be written to a log instead. You can always see the current status of the run by tailing the log files or with `rum_runner status -o sample01`.
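For instance, assuming the log layout described in the next section, you might watch a cluster job with:

```
# Watch the main log as the job runs
tail -f sample01/log/rum.log

# Or ask rum_runner for a progress report
rum_runner status -o sample01
```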
RUM will now use the popular Log::Log4perl module for logging if you
have it installed, or will use a simpler home-grown logging module if
you don't. In either case, all of the log files will be placed in a
`log` subdirectory of the job directory. The log files are:
- `rum.log` - Main log file; contains output from preprocessing, postprocessing, and monitoring chunks.
- `rum_errors.log` - Main error log file. This should be empty if all goes well.
- `rum_NNN.log`, where NNN is the number of a chunk. Contains output from processing a single chunk of reads. If you don't run with multiple chunks, this will be folded into `rum.log`.
- `rum_errors_NNN.log` - Errors for chunk NNN. This should be empty if all goes well.
RUM 2.x should handle errors and allow you to kill a running job a little more smoothly than RUM 1.x. For example, if you are running RUM in a terminal, simply hitting CTRL-C should kill the parent process as well as all subprocesses. If you find that it doesn't, please open an issue here.
Use `rum_runner help kill` to see the full description of the `rum_runner kill ...` action.
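For example, to stop the job running in sample01 from another terminal:

```
rum_runner kill -o sample01
```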
You will now need the autodie Perl module. If you are using perl >=
5.10, this should already be installed. If not, you may need to
install it. You should be able to install it very quickly by running:

```
cpan -i autodie
```

Log::Log4perl is recommended, but not required. You should be able to install it by running:
```
cpan -i Log::Log4perl
```

If you have Log::Log4perl, you will be able to control logging output
by modifying the conf/rum_logging.conf file in the RUM
distribution. See http://mschilli.github.com/log4perl/ for more
information.
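As a rough sketch of the kind of thing you can do there (the logger category and layout below are illustrative assumptions, not the actual contents of conf/rum_logging.conf):

```
# Illustrative Log4perl settings; see the shipped conf/rum_logging.conf
# for RUM's real configuration.
log4perl.logger.RUM = INFO, Screen
log4perl.appender.Screen = Log::Log4perl::Appender::Screen
log4perl.appender.Screen.layout = Log::Log4perl::Layout::PatternLayout
log4perl.appender.Screen.layout.ConversionPattern = %d %p %m%n
```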
- We have added lots of unit and integration tests.
- Intermediate output files produced for the chunks now go in *output-dir*/chunks.
- Log files all go in *output-dir*/log.
- The mechanism for running jobs on an SGE cluster has been made extensible, so that we may be able to support other types of clusters in the future.
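Putting the pieces together, a job directory should look something like this (a sketch; the exact files under chunks and log depend on your job):

```
sample01/
  .rum/
    job_settings     # saved job configuration
  chunks/            # intermediate per-chunk output files
  log/
    rum.log
    rum_errors.log
    rum_001.log      # per-chunk log names are illustrative
    ...
```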