
Star_loop

Ryan Schubert edited this page Sep 14, 2018 · 4 revisions

About

The purpose of this script is to run STAR in a loop to align a large number of samples. RNA-seq studies, and eQTL studies in particular, often require a large number of samples to be analyzed. STAR does not have an easy built-in method for looping over a large number of files: simply feeding STAR a long list of samples, or using a wildcard, causes STAR to interpret the files as one sample. An independent run of STAR is therefore required for each sample, hence the loop script. The script includes several options that allow for slight differences between runs. Please note that the script does not check whether alignment files already exist for a sample; it will rerun STAR and overwrite any that do.
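The per-sample loop described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names are hypothetical, and it assumes a sample list with one sample name per line and paired reads named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`. The STAR flags themselves are standard ones from the STAR manual. The functions only print the commands so they can be inspected before anything is run.

```shell
# Sketch only: one independent STAR run per sample.
# Assumptions (not taken from the script): the sample list holds one sample
# name per line, and reads are named <sample>_1.fastq.gz / <sample>_2.fastq.gz.

# Print the STAR command for a single paired-end sample.
star_cmd_for_sample() {
    local sample="$1" input_dir="$2" index_dir="$3" out_dir="$4" threads="$5"
    printf 'STAR --runThreadN %s --genomeDir %s --readFilesIn %s %s --readFilesCommand zcat --outFileNamePrefix %s\n' \
        "$threads" "$index_dir" \
        "$input_dir/${sample}_1.fastq.gz" "$input_dir/${sample}_2.fastq.gz" \
        "$out_dir/${sample}_"
}

# Emit one command per non-empty line of the sample list.
# Usage: star_loop <sample_list> <input_dir> <index_dir> <out_dir> <threads>
star_loop() {
    local sample_list="$1"
    shift
    local sample
    while IFS= read -r sample || [ -n "$sample" ]; do
        [ -z "$sample" ] && continue   # skip blank lines
        star_cmd_for_sample "$sample" "$@"
    done < "$sample_list"
}
```

Because each sample gets its own `--outFileNamePrefix`, the runs stay separate instead of being merged into one sample. Note that, like the script, this sketch does nothing to protect existing output files.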

Usage

STAR alignments have many downstream uses, but they are particularly useful for analysis with leafcutter, which calculates intron excision ratios of splicing events. A STAR alignment must be used for this purpose rather than pseudoalignment software such as Salmon or Kallisto, because leafcutter requires an accurate CIGAR string describing how each read aligned to the genome. STAR produces such a string; Salmon and Kallisto do not. Keep this in mind when choosing software for eQTL or sQTL analysis.
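As a small illustration of why the CIGAR string matters: in the SAM format, the `N` operation marks a skipped region of the reference, which for a spliced RNA-seq read corresponds to an intron. The sketch below pulls the `N` lengths out of a CIGAR string. The function name and the example CIGAR strings are made up for illustration, and this is not how leafcutter itself processes alignments.

```shell
# Sketch: list intron (N) lengths found in a CIGAR string, one per line.
# Prints nothing for an unspliced read.
cigar_introns() {
    printf '%s\n' "$1" | grep -oE '[0-9]+N' | tr -d 'N'
}

# Example with a made-up spliced read: 40 matched bases, a 1200 bp intron,
# then 35 matched bases.
cigar_introns "40M1200N35M"
```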

Important Options

  • --indexdir, --runindex, and --genomefasta
    These three options pertain only to STAR's indexing process. To align samples, STAR requires that an index be created for a given genome file. This process needs to happen only once, and its output goes into the directory specified by --indexdir. Because indexing takes a rather long time, the script assumes by default that the index has already been created; in that case, specify its location with the --indexdir option. To run the indexing process before performing alignment, supply the --runindex option with no arguments and --genomefasta with the location of the genome file as its argument.

  • --samplelist and the list format To loop over multiple sample files, this script requires

  • --VCF and --WASP
    Together these options are meant to activate the WASP output mode implemented in STAR. This is a reimplementation of the original WASP algorithm that tags individual reads in the alignment output in order to identify allele-specific biases. However, because of errors in the testing VCF file, this feature cannot currently be verified, so using these options is not recommended.
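For orientation, the indexing and WASP options above plausibly correspond to the STAR arguments sketched below. The function names are hypothetical and the mapping to this script's internals is an assumption; only the STAR flags (`--runMode genomeGenerate`, `--genomeDir`, `--genomeFastaFiles`, `--waspOutputMode SAMtagOutput`, `--varVCFfile`) come from the STAR manual. Both functions only print text, so nothing is run.

```shell
# Sketch only; function names are hypothetical.

# Print the one-time index-building command.
star_index_cmd() {
    local genome_fasta="$1" index_dir="$2" threads="$3"
    printf 'STAR --runMode genomeGenerate --runThreadN %s --genomeDir %s --genomeFastaFiles %s\n' \
        "$threads" "$index_dir" "$genome_fasta"
}

# Print the extra alignment arguments for WASP tagging,
# or nothing when WASP is disabled.
wasp_args() {
    local run_wasp="$1" vcf_file="$2"
    if [ "$run_wasp" = "True" ]; then
        printf '%s %s %s %s\n' --waspOutputMode SAMtagOutput --varVCFfile "$vcf_file"
    fi
}
```

For example, `star_index_cmd genome.fa ~/starIndex/ 6` prints the command that would build the index once, and `wasp_args True snps.vcf` prints the two flags that would be appended to each alignment run.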

Complete Options

      -a | --annotation) 
            The annotation file corresponding to your genome build. Needed to run alignment to the transcriptome.
      -g | --genomefasta) 
            File path to the genome file for indexing. Must be supplied when --runindex is supplied.
      -i | --inputdirectory) 
            directory containing the fastq files to be processed
      --indexdir) 
            When supplied alone, indexing is assumed to have been run already, and this points to the existing index directory. When supplied with --runindex, it serves as the output directory for indexing.
      -o | --outputdirectory) 
            directory where you'd like to send all your alignments
      --runindex) 
            Only supplied if indexing has not been run. May be supplied with --indexdir to name the output directory. Must be supplied with -g and the genome file path (right now only accepts 1 genome file).
      -s | --samplelist) 
            list of samples you wish to process. Assumes samples are paired fastq files. Default is 10.
      -t | --threads) 
            number of threads you wish to use
      --VCF) 
            file path to vcf file for wasp tag
      --WASP | -W) 
            Run WASP with the default VCF

Defaults

AlignmentOutDefault=~/StarAlignments/ 
IndexDirectoryDefault=~/starIndex/ 
RunIndexDefault=False 
RunWASPDefault=False
ThreadsDefault=6
VCFfileDefault=~/Data/SNPdata/chr1-22.vcf.gz
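
The way these defaults interact with the command-line options can be sketched with a standard shell `case` parser. This is an assumption about the general pattern, not the script's actual code: the function and variable names are hypothetical, and no validation of missing option values is done.

```shell
# Sketch only: start from the documented defaults, then let options override them.
parse_args() {
    out_dir="$HOME/StarAlignments/"
    index_dir="$HOME/starIndex/"
    run_index=False
    run_wasp=False
    threads=6
    vcf_file="$HOME/Data/SNPdata/chr1-22.vcf.gz"
    annotation="" genome_fasta="" input_dir="" sample_list=""

    while [ "$#" -gt 0 ]; do
        case "$1" in
            -a | --annotation)      annotation="$2";   shift 2 ;;
            -g | --genomefasta)     genome_fasta="$2"; shift 2 ;;
            -i | --inputdirectory)  input_dir="$2";    shift 2 ;;
            --indexdir)             index_dir="$2";    shift 2 ;;
            -o | --outputdirectory) out_dir="$2";      shift 2 ;;
            --runindex)             run_index=True;    shift ;;
            -s | --samplelist)      sample_list="$2";  shift 2 ;;
            -t | --threads)         threads="$2";      shift 2 ;;
            --VCF)                  vcf_file="$2";     shift 2 ;;
            -W | --WASP)            run_wasp=True;     shift ;;
            *) echo "unknown option: $1" >&2; return 1 ;;
        esac
    done
}
```

With this pattern, `parse_args -t 12 --runindex -g genome.fa` would leave `threads=12` and `run_index=True` while the index directory keeps its default value.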
