Star_loop

About

The purpose of this script is to loop STAR over a large number of samples. RNA-seq studies, and eQTL studies in particular, often require many samples to be analyzed. STAR has no easy built-in way to loop over many files: simply feeding STAR a long list of samples, or using a wildcard, causes it to interpret the files as one sample. Each sample therefore needs its own independent STAR run, hence the loop script. The script includes several options that allow for slight differences between runs. Please note that the script does not check whether alignment files already exist for a sample; it will rerun STAR and overwrite any that do.
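
As a rough illustration of the one-run-per-sample idea, the sketch below prints one STAR command per sample name read from standard input. It is not the script's actual code: the function names, the `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` naming scheme, and the choice of sorted BAM output are all assumptions made for the example. Paths are assumed to contain no spaces.

```shell
#!/usr/bin/env bash
# Sketch only: builds one independent STAR command per sample.
# Assumed (hypothetical) layout: <input_dir>/<sample>_R1.fastq.gz and _R2.fastq.gz.

# Print the STAR command for a single paired-end sample.
star_command() {
    local sample="$1" input_dir="$2" index_dir="$3" out_dir="$4" threads="$5"
    printf 'STAR --runThreadN %s --genomeDir %s --readFilesIn %s %s --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outFileNamePrefix %s\n' \
        "$threads" "$index_dir" \
        "$input_dir/${sample}_R1.fastq.gz" "$input_dir/${sample}_R2.fastq.gz" \
        "$out_dir/${sample}_"
}

# Read sample names (first column) from standard input and print one command each.
# Remaining arguments: input_dir index_dir out_dir threads.
star_loop_commands() {
    local sample
    while read -r sample _ || [ -n "$sample" ]; do
        [ -n "$sample" ] || continue   # skip blank lines
        star_command "$sample" "$@"
    done
}
```

Printing the commands first makes it possible to inspect them before anything is overwritten; for example, `star_loop_commands fastq starIndex StarAlignments 6 < sample_list.txt` lists the runs, and piping that output to `bash` would execute them.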

Usage

STAR alignments have many downstream uses, but they are particularly useful for leafcutter, which calculates intron excision ratios of splicing events. A STAR alignment must be used for this purpose rather than a pseudoalignment tool such as Salmon or Kallisto, because leafcutter requires an accurate CIGAR string describing how each read aligned to the genome. STAR produces such a string, while Salmon and Kallisto do not. Keep this in mind when choosing software for eQTL or sQTL analysis.

Important Options

--indexdir, --runindex, and --genomefasta
These three options pertain only to STAR's indexing process. To align samples, STAR requires an index built from a given genome file. Indexing only needs to happen once, and its output goes into the directory specified by --indexdir. Because indexing takes a rather long time, the script assumes by default that the index has already been created and that its location is given with the --indexdir option. To run indexing before alignment, supply the --runindex option with no arguments and --genomefasta with the location of the genome file as its argument.
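
The decision described above could look roughly like the following sketch. The function name and its argument order are hypothetical; the STAR flags shown (`--runMode genomeGenerate`, `--genomeDir`, `--genomeFastaFiles`, `--runThreadN`) are the standard ones for building an index, though the real script may pass additional parameters such as the annotation file.

```shell
#!/usr/bin/env bash
# Sketch only: print the STAR indexing command when indexing is requested.
# Arguments (hypothetical order): run_index index_dir genome_fasta threads

index_command() {
    local run_index="$1" index_dir="$2" genome_fasta="$3" threads="$4"

    # Default case: the index is assumed to exist already in $index_dir.
    if [ "$run_index" != "True" ]; then
        return 0
    fi

    # --runindex without --genomefasta cannot work, so stop early.
    if [ -z "$genome_fasta" ]; then
        echo "error: --runindex requires --genomefasta" >&2
        return 1
    fi

    printf 'STAR --runMode genomeGenerate --runThreadN %s --genomeDir %s --genomeFastaFiles %s\n' \
        "$threads" "$index_dir" "$genome_fasta"
}
```

With `index_command True starIndex genome.fa 6` the sketch prints a single indexing command; with `False` as the first argument it prints nothing.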
--samplelist and the list format
To loop over multiple sample files, this script requires a sample list, supplied with --samplelist. The format of the list is described under List Samples.
--VCF and --WASP
Together these options activate the WASP output mode implemented in STAR. This is a reimplementation of the original WASP algorithm that tags individual reads in the alignment output in order to identify allele-specific biases. However, because of errors in the testing VCF file, this feature cannot currently be verified, so using these options is not recommended.

Complete Options

      -a | --annotation) 
            The annotation file corresponding to your genome build. Needed to run Alignment to the transcriptome
      -g | --genomefasta) 
            file path to the genome file for indexing. Must be supplied when --runindex is supplied
      -i | --inputdirectory) 
            directory containing the fastq files to be processed
      --indexdir) 
            when supplied alone, indexing has been run, and points to the index directory. When supplied with --runindex, serves as output directory for indexing.
      -o | --outputdirectory) 
            directory where you'd like to send all your alignments
      --runindex) 
            Only supplied if indexing has not been run. May be supplied with --indexdir to name the output directory. Must be supplied with -g and the path to the genome file (right now only 1 genome file is accepted).
      -s | --samplelist) 
            list of samples you wish to process. Assumes samples are paired fastq files. Default is 10.
      -t | --threads) 
            number of threads you wish to use
      --VCF) 
            file path to vcf file for wasp tag
      --WASP | -W) 
            Run WASP with the default VCF

Defaults

AlignmentOutDefault=~/StarAlignments/ 
IndexDirectoryDefault=~/starIndex/ 
RunIndexDefault=False 
RunWASPDefault=False
ThreadsDefault=6
VCFfileDefault=~/Data/SNPdata/chr1-22.vcf.gz
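
For orientation, the options and defaults listed above could be wired together along the lines of the sketch below. This is an assumed reconstruction, not the script's source: the variable names and the `parse_args` function are hypothetical, and only the option names and default values are taken from this page.

```shell
#!/usr/bin/env bash
# Sketch only: defaults from the list above, overridden by command-line options.
AlignmentOut="$HOME/StarAlignments/"
IndexDirectory="$HOME/starIndex/"
RunIndex=False
RunWASP=False
Threads=6
VCFfile="$HOME/Data/SNPdata/chr1-22.vcf.gz"

parse_args() {
    while [ "$#" -gt 0 ]; do
        # Flags that take no argument.
        case "$1" in
            --runindex)  RunIndex=True; shift; continue ;;
            --WASP | -W) RunWASP=True;  shift; continue ;;
        esac
        # Every other option expects exactly one value after its name.
        if [ "$#" -lt 2 ]; then
            echo "error: missing value for $1" >&2
            return 1
        fi
        case "$1" in
            -a | --annotation)      Annotation="$2" ;;
            -g | --genomefasta)     GenomeFasta="$2" ;;
            -i | --inputdirectory)  InputDirectory="$2" ;;
            --indexdir)             IndexDirectory="$2" ;;
            -o | --outputdirectory) AlignmentOut="$2" ;;
            -s | --samplelist)      SampleList="$2" ;;
            -t | --threads)         Threads="$2" ;;
            --VCF)                  VCFfile="$2" ;;
            *) echo "error: unknown option $1" >&2; return 1 ;;
        esac
        shift 2
    done
}
```

Calling `parse_args "$@"` at the top of a script would leave any option that was not supplied at its default value.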

List Samples

sample list format

The sample list follows a fairly simple format.

There should be no headers
The first column should be a list of samples
One sample name per line
Sample names should not contain any file types (for example .fastq and/or .gz endings can be removed from the name e.g. sample1.fastq becomes sample1)
Sample names should not contain any path information (for example /home/data/sample1.fastq can be listed as sample1)
Sample names should not contain any read-end information (for example, the paired-end files sample1_R1.fastq and sample1_R2.fastq are both reduced to simply sample1)
The sample list should not contain any duplicates (for example, the paired-end files sample1_R1.fastq and sample1_R2.fastq together give a single sample1 entry)
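
The rules above can be checked mechanically before a long run. The following is a heuristic sketch rather than part of the script: the function name is hypothetical, and the patterns it uses (for instance treating a trailing `_R1`, `_R2`, `_1`, or `_2` as read-end information) are assumptions that may flag legitimate names in some naming schemes.

```shell
#!/usr/bin/env bash
# Sketch only: report sample-list entries that appear to break the format rules.
# Prints nothing when the list looks clean.

check_sample_list() {
    awk '
        NF == 0                 { next }   # ignore blank lines
        $1 ~ /\//               { print "line " NR ": contains a path: " $1 }
        $1 ~ /\.(fastq|fq|gz)$/ { print "line " NR ": contains a file extension: " $1 }
        $1 ~ /_R?[12]$/         { print "line " NR ": looks like read-end information: " $1 }
        seen[$1]++              { print "line " NR ": duplicate sample: " $1 }
    ' "$1"
}
```

Running `check_sample_list sample_list.txt` and fixing whatever it reports is cheaper than discovering a malformed entry partway through a batch of alignments.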

Automatic name translation

The sample names that fastq files are tagged with often differ from how they need to be presented downstream. For example, genotype files may contain unique identifiers that differ from the fastq file names. With this in mind, it is possible to introduce different names for your samples early in the pipeline. To do this, users can optionally add a second column of sample names to the sample list. The first column still identifies the input sample, and outputs are simply renamed to match the corresponding name in the second column. This name translation takes place only during the alignment step, so users who need to translate names should do so right from the beginning. The list only needs to be made once, and it will not interfere with downstream analysis.
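
A minimal sketch of how the optional second column might be read is shown below. The function name is hypothetical and the script's real implementation may differ; the sketch only illustrates falling back to the first column when no translated name is given.

```shell
#!/usr/bin/env bash
# Sketch only: print "<input name><TAB><output name>" for each row of the list.
# When a row has no second column, the input name is reused for the output.

resolve_names() {
    local sample new_name
    while read -r sample new_name _ || [ -n "$sample" ]; do
        [ -n "$sample" ] || continue   # skip blank lines
        printf '%s\t%s\n' "$sample" "${new_name:-$sample}"
    done < "$1"
}
```

Given a list containing the rows `fq_001 donorA` and `fq_002`, the first sample would be aligned from the `fq_001` files and written out under the name `donorA`, while the second keeps the name `fq_002`.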

Generating the sample list

An easy and quick way to generate the sample list is the Unix command ls | cut -f 1 -d <delimiter> > sample_list.txt. This cuts each file name into chunks at the delimiter and keeps the first chunk. Note that paired-end files yield the same name twice, so duplicates must be removed afterwards. Please see the cut manual page for more details.
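
As a sketch of that approach, the function below wraps the same `ls | cut` pipeline and adds `sort -u` so that each sample appears once. The function name is hypothetical, and it assumes the directory holds only fastq files, since every file in the directory contributes a name.

```shell
#!/usr/bin/env bash
# Sketch only: list unique sample names from the files in a directory.
# Arguments: directory, single-character delimiter that ends the sample name.

make_sample_list() {
    local dir="$1" delim="$2"
    # Keep the text before the first delimiter, then drop repeated names.
    ls "$dir" | cut -f 1 -d "$delim" | sort -u
}
```

For a directory holding sample1_R1.fastq.gz and sample1_R2.fastq.gz, `make_sample_list fastq _ > sample_list.txt` would write a single sample1 line.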
