Release Notes 2.0.1
There are lots of exciting enhancements in RUM 2.0. This document highlights the most visible changes between RUM 1.x and RUM 2.0.
RUM now follows the standard installation convention for Perl modules: we use Makefile.PL to drive the installation. Please see the README.md file in your distribution or the [Installation wiki page](Installing RUM) for installation instructions.
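If you have installed Perl modules before, the steps should look like the usual Makefile.PL dance. This is a sketch of the convention; README.md remains the authoritative source:

```
# Standard Perl module installation (see README.md for specifics)
perl Makefile.PL
make
make install
```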
We have revised the SAM-formatted output so that it validates against the Picard SAM parser.
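If you want to verify this yourself, Picard's ValidateSamFile tool is one way to do it. The jar path and SAM file name below are illustrative:

```
# Validate RUM's SAM output with Picard (paths are illustrative)
java -jar picard.jar ValidateSamFile I=RUM.sam MODE=SUMMARY
```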
The command-line interface for RUM has changed in the following ways:
- `RUM_runner.pl` has been renamed to `rum_runner`.
- Command-line options now follow standard Unix conventions. Long options start with two dashes, e.g. `--index` or `--strand-specific`. Short options have only one dash, e.g. `-o`. Please run `rum_runner help` for help information.
- `rum_runner` now has multiple actions that it can perform: running a pipeline, checking the status of a job, killing a job, and performing other common actions. See the Actions section below for a full list and a description of each.
`rum_runner` now has several different actions you can run. An action is
specified by the first command-line argument (an example session follows the list):

- `rum_runner align ...`: Run the pipeline on a specified output directory, starting at the first step that hasn't already been run.
- `rum_runner clean -o *dir*`: Remove intermediate output files from a specified output directory. Useful if you ran `rum_runner align --no-clean ...`.
- `rum_runner help [action]`: Get help. Use `rum_runner help *action*` to get help for a particular action.
- `rum_runner kill -o *dir*`: Stop a job running in a specified directory.
- `rum_runner status -o *dir*`: Check on the status of a job in a specified directory.
- `rum_runner version`: Print out the version of RUM.
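For example, a typical session might look like this (the directory name sample01 is illustrative):

```
rum_runner version              # print the version of RUM
rum_runner help align           # detailed help for the align action
rum_runner status -o sample01   # progress report for the job in sample01
```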
You can now get a simple report showing the progress of the pipeline
by running `rum_runner status -o *output-dir*`.
It will show you whether you're in the "Preprocessing", "Processing", or "Postprocessing" phase. In the processing phase, it will show you each step, along with an X for each chunk that has completed that step. It should look something like this:

```
Processing in 10 chunks
-----------------------
XXXXXXXXXX Run bowtie on genome
XXXXXXXXXX Run bowtie on transcriptome
XXXXXXXXXX Separate unique and non-unique mappers from genome bowtie output
XXXXXXXXXX Separate unique and non-unique mappers from transcriptome bowtie
           output
XXXXXXXXXX Merge unique mappers together
XXXXXXXXXX Merge non-unique mappers together
XXXXXXXXXX Make a file containing the unmapped reads, to be passed into
           blat
XXX X X X  Run blat on unmapped reads
XXX X X X  Run mdust on unmapped reads
X          Parse blat output
X          Merge bowtie and blat results
           Clean up RUM files
           Produce RUM_Unique
           Sort RUM_Unique by location
           Sort cleaned non-unique mappers by ID
           Remove duplicates from NU
           Create SAM file
           Create non-unique stats
           Sort RUM_NU
           Generate quants
```
This shows that all 10 chunks are past the "Make a file containing the unmapped reads..." step, 6 of them are past the "Run mdust..." step, and one is past the "Merge bowtie and blat results" step.
In the postprocessing phase, the output should look something like this:
```
Postprocessing
--------------
X Merge RUM_NU files
X Make non-unique coverage
X Merge RUM_Unique files
X Compute mapping statistics
X Make unique coverage
X Finish mapping stats
X Merge SAM headers
X Concatenate SAM files
X Merge novel exons
X Merge quants
  make_junctions
  Sort junctions (all, bed) by location
  Sort junctions (all, rum) by location
  Sort junctions (high-quality, bed) by location
  Get inferred internal exons
```
There is only one column of X's, since the postprocessing phase is not run in parallel.
If you start running a job and it stops for some reason, simply
running `rum_runner align -o *dir*` should make it pick up from the
last step it successfully ran. For example, suppose you run a job like this:

```
rum_runner align \
    -o sample01 \
    --name sample01 \
    --index ~/rum-indexes/mm9 \
    --chunks 30 \
    ~/samples/sample01/forward.fq ~/samples/sample01/reverse.fq
```

rum_runner will save the settings for the job, including the name,
number of chunks, input files, and any other parameters, to the file
sample01/.rum/job_settings. As it runs, it will keep track of the
state of the pipeline, based on which intermediate files are
present. Then if you stop the job or it fails for some reason, you
should be able to restart it again simply by running
```
rum_runner align -o sample01
```

It will load the settings from sample01/.rum/job_settings, examine
the intermediate files to figure out what state it was in when it
stopped, and then restart from there. If you're running it in a
terminal with a single chunk, it will tell you which steps it is skipping:
```
(skipping) Run bowtie on genome
(skipping) Run bowtie on transcriptome
(skipping) Separate unique and non-unique mappers from genome bowtie output
(skipping) Separate unique and non-unique mappers from transcriptome bowtie
           output
(skipping) Merge unique mappers together
(running)  Merge non-unique mappers together
(running)  Make a file containing the unmapped reads, to be passed into
           blat
(running)  Run blat on unmapped reads
(running)  Run mdust on unmapped reads
(running)  Parse blat output
(running)  Merge bowtie and blat results
```

Restarting RUM alignments on a compute cluster (e.g. with --qsub) is the same as restarting local runs:
```
rum_runner align -o sample01
```

The difference is that you will not see the status messages; they will be written to a log instead. You can always see the current status of the run by tailing the log files or with `rum_runner status -o sample01`.
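For instance, assuming the log layout described in the next section, you might watch a cluster job with:

```
# Watch the main log as the job runs
tail -f sample01/log/rum.log

# Or ask rum_runner for a progress report
rum_runner status -o sample01
```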
RUM will now use the popular Log::Log4perl module for logging if you
have it installed, or will use a simpler home-grown logging module if
you don't. In either case, all of the log files will be placed in a
`log` subdirectory of the job directory. The log files are:
- `rum.log` - Main log file; contains output from preprocessing, postprocessing, and monitoring chunks.
- `rum_errors.log` - Main error log file. This should be empty if all goes well.
- `rum_NNN.log`, where NNN is the number of a chunk. Contains output from processing a single chunk of reads. If you don't run with multiple chunks, this will be folded into `rum.log`.
- `rum_errors_NNN.log` - Errors for chunk NNN. This should be empty if all goes well.
RUM 2.x should handle errors and allow you to kill a running job a little more smoothly than RUM 1.x. For example, if you are running RUM in a terminal, simply hitting CTRL-C should kill the parent process as well as all subprocesses. If you find that it doesn't, please open an issue here.
Use `rum_runner help kill` to see the full description of the `rum_runner kill ...` action.
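For example, to stop the job running in sample01 from another terminal:

```
rum_runner kill -o sample01
```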
You will now need the autodie Perl module. If you are using perl >=
5.10, this should already be installed. If not, you may need to
install it. You should be able to install it very quickly by running:

```
cpan -i autodie
```

Log::Log4perl is recommended, but not required. You should be able to install it by running:
```
cpan -i Log::Log4perl
```

If you have Log::Log4perl, you will be able to control logging output
by modifying the conf/rum_logging.conf file in the RUM
distribution. See http://mschilli.github.com/log4perl/ for more
information.
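As a rough sketch of the kind of thing you can do there (the logger category and layout below are illustrative assumptions, not the actual contents of conf/rum_logging.conf):

```
# Illustrative Log4perl settings; see the shipped conf/rum_logging.conf
# for RUM's real configuration.
log4perl.logger.RUM = INFO, Screen
log4perl.appender.Screen = Log::Log4perl::Appender::Screen
log4perl.appender.Screen.layout = Log::Log4perl::Layout::PatternLayout
log4perl.appender.Screen.layout.ConversionPattern = %d %p %m%n
```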
- We have added lots of unit and integration tests.
- Intermediate output files produced for the chunks now go in *output-dir*/chunks.
- Log files all go in *output-dir*/log.
- The mechanism for running jobs on an SGE cluster has been made extensible, so that we may be able to support other types of clusters in the future.
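Putting the pieces together, a job directory should look something like this (a sketch; the exact files under chunks and log depend on your job):

```
sample01/
  .rum/
    job_settings     # saved job configuration
  chunks/            # intermediate per-chunk output files
  log/
    rum.log
    rum_errors.log
    rum_001.log      # per-chunk log names are illustrative
    ...
```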