01 Sequence Alignment
Sequence alignment is a necessary step for users beginning with fastq files. Users who already have alignment files (bam/sam) can move on to the next step. The pipeline has two wrappers capable of performing alignment: star_loop and salmon_loop. Salmon is considerably faster than traditional aligners because it does not perform a true alignment. While this pseudo-alignment is very fast, the trade-off is a loss of metadata that other software may need (e.g. leafcutter). Accuracy may also be a concern compared to traditional aligners such as STAR, which is widely regarded as one of the most accurate aligners available. For the software used in this pipeline, salmon is adequate for most analyses.
Alignment with salmon is relatively fast and accurate; however, it aligns only to the transcriptome, not the genome. For the purposes of this pipeline, salmon is preferred over other methods. The wrapper script serves as a lightweight interpreter of user inputs. To trigger alignment-based analysis, one only needs to provide the directory containing fastq files with the -f option. Currently this wrapper is only capable of interpreting paired-end fastq files; single-end files must be fed to STAR. Additionally, one needs to provide salmon with a list of samples (assumed to be paired end), a transcriptome file, an annotation file, and an index. If an index for the transcriptome file has not been generated, the --runindex option will create one prior to alignment. Please see the salmon_loop page for more thorough documentation.
Example
```shell
# If paired end
./salmon_loop -t $PATH/$TO/transcriptome.fa --runindex -f $PATH/$TO/fastqdirectory/ -a $PATH/$TO/annotations.gencode.gtf -s $PATH/$TO/sample_list.txt
# If single end
./salmon_loop -t $PATH/$TO/transcriptome.fa --runindex -f $PATH/$TO/fastqdirectory/ -a $PATH/$TO/annotations.gencode.gtf -s $PATH/$TO/sample_list.txt --single-end
```
Alignments with STAR have very high fidelity to their reference file. The requirements for aligning with STAR are very similar to those for salmon. star_loop requires a sample list, the directory containing fastq files, a genome file, an annotation file, and an index. Please see the star_loop page for more thorough documentation.
Example
```shell
./star_loop -g $PATH/$TO/genome.fa -s $PATH/$TO/sample_list.txt --inputdirectory $PATH/$TO/fastqdirectory/ -a $PATH/$TO/annotations.gencode.gtf --runindex
```
The sample list follows a fairly simple format.
- There should be no headers
- The first column should be a list of samples
- One sample name per line
- Sample names should not contain file extensions (for example, .fastq and/or .gz endings should be removed, so sample1.fastq becomes sample1)
- Sample names should not contain any path information (for example, /home/data/sample1.fastq should be listed as sample1)
- Sample names should not contain read-end information (for example, the paired-end files sample1_R1.fastq and sample1_R2.fastq both reduce to sample1)
- The sample list should not contain any duplicates (for example, sample1_R1.fastq and sample1_R2.fastq both reduce to sample1, which should be listed only once)
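The format rules above can be checked mechanically before starting a run. The following is a minimal sketch and not part of the pipeline: `check_sample_list` is a hypothetical helper, and it assumes whitespace-separated columns, fastq-style endings (.fastq, .fq, .gz), and `_R1`/`_R2` read suffixes as in the examples above. It cannot detect a header line, so that rule still needs a manual look.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): report sample-list lines
# that break the format rules. Returns 0 if the list looks clean, 1 otherwise.
check_sample_list() {
    local list="$1"
    awk '
        NF == 0 { next }                      # ignore blank lines
        { name = $1 }                         # first column holds the sample name
        name ~ /\//               { printf "line %d: path information in %s\n", NR, name; bad = 1 }
        name ~ /\.(fastq|fq|gz)$/ { printf "line %d: file extension in %s\n", NR, name; bad = 1 }
        name ~ /_R[12]$/          { printf "line %d: read-end suffix in %s\n", NR, name; bad = 1 }
        seen[name]++              { printf "line %d: duplicate sample %s\n", NR, name; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$list"
}
```

Running `check_sample_list sample_list.txt` prints one line per problem found and returns a non-zero status if there were any.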
Often the sample names that fastq files are tagged with differ from how they need to be presented downstream. For example, genotype files may contain unique identifiers that differ from the fastq file names. With this in mind, it is possible to introduce different names for your samples early on in the pipeline. To do this, users can optionally add a second column of sample names to the sample list. The first column still identifies the input sample, and outputs are simply renamed to match the corresponding name in the second column. This name translation takes place only during the alignment steps, so users who need to translate names should do so right from the beginning. Users only need to make this list once, and it will not interfere with analysis downstream.
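To make the two-column convention concrete, here is a small sketch of the lookup it implies. This is an illustration only, not the pipeline's own code: `output_name` is a hypothetical helper, the sample and identifier names are made up, and the columns are assumed to be separated by whitespace (the delimiter the wrappers actually expect is not stated here).

```shell
#!/usr/bin/env bash
# Example two-column sample_list.txt (hypothetical names):
#   sample1   ID_0001
#   sample2   ID_0002
#
# Hypothetical helper: print the name that outputs for a given input
# sample would carry. Assumes whitespace-separated columns.
output_name() {
    local list="$1" sample="$2"
    awk -v s="$sample" '
        $1 == s { print (NF >= 2 ? $2 : $1); exit }   # second column wins when present
    ' "$list"
}
```

With the list above, `output_name sample_list.txt sample1` would print `ID_0001`. A sample listed with only one column keeps its original name.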
An easy and quick way to generate the sample list is the unix command `ls | cut -f 1 -d <delimiter> > sample_list.txt`. This splits each file name into fields at the chosen delimiter and keeps the first field (`-f 1`). Please see the cut manual page for more details.
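For paired-end data, cutting file names alone leaves each sample listed twice (once per read file), so the result still needs de-duplicating. The sketch below is one way to do the whole job in a single step. It is an assumption-laden example rather than a pipeline tool: `make_sample_list` is a hypothetical helper, and it assumes files are named `<sample>_R1.fastq[.gz]` and `<sample>_R2.fastq[.gz]`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): build a sample list from a
# directory of fastq files named <sample>_R1.fastq[.gz] / <sample>_R2.fastq[.gz].
make_sample_list() {
    local fastq_dir="$1" f name
    for f in "$fastq_dir"/*.fastq "$fastq_dir"/*.fastq.gz; do
        [ -e "$f" ] || continue      # skip patterns that matched nothing
        name="${f##*/}"              # drop path information
        name="${name%.gz}"           # drop .gz ending, if any
        name="${name%.fastq}"        # drop .fastq ending
        name="${name%_R[12]}"        # drop read-end suffix
        printf '%s\n' "$name"
    done | sort -u                   # one line per sample, no duplicates
}
```

A typical call would be `make_sample_list $PATH/$TO/fastqdirectory/ > sample_list.txt`. If your files follow a different naming scheme, the suffix patterns need adjusting accordingly.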