-
Notifications
You must be signed in to change notification settings - Fork 1
Salmon_loop
Salmon is a fast and relatively lightweight software to analyze genetic expression. Salmon gives information on the expression levels of individual genes (quant.genes.sf) as well as what transcripts come from it (quant.sf). This wrapper script has two purposes: 1) When fed multiple files at once, Salmon treats these as one sample so a lightweight wrapper is needed to iterate over multiple samples; 2) Salmon has multiple strategies of quantifying expression depending on what input files are used. Additionally the syntax salmon uses is somewhat peculiar. Given these, the wrapper also serves as a lightweight interpreter of user input in order to choose an execution strategy; 3)Finally the wrapper makes a call to the R script Salmon_parser.R to parse the output into matrixEQTL format.
Salmon is capable of performing a lightweight alignment prior to quantification on fastq read files. Salmon is significantly faster than other aligners at the cost of a slightly lower accuracy as well as the loss of some metadata (eg CIGAR strings). Depending on what downstream analysis is planned, alignment with salmon may or may not be appropriate and it is up to the user whether or not to align with salmon. To trigger alignment based analysis one only needs to provide the directory containing fastq files with the -f option. Additionally, one needs to provide salmon with a list of samples (assumed to be paired end), a transcriptome file, an annotation file, and an index. If an index for the transcriptome file has not been generated, the --runindex option will create one prior to alignment.
./Salmon_loop -t $PATH/$TO/transcriptome.fa --runindex -f $PATH/$TO/fastqdirectory/ -a $PATH/$TO/annotations.gencode.gtf -s $PATH/$TO/sample_list.txt
- -a or --annotation An annotation file is required for two purposes. 1) It is necessary to produce the quant.genes.sf file that contains gene expression data. 2) The .sf files do not contain gene location information needed by matrixEQTL. Part of the parsing done by Salmon_Parser.R is to separate genes by chromosome and create gene location files.
- --single-end Must be used if reads are single-end. Otherwise reads are assumed to be paired-end which will throw an error if this is false.
Salmon_Parser.R does as it sounds and parses the output of quantification done by Salmon to a format usable by MatrixEQTL. Salmon performs the following operations:
- Combines all samples in Salmon's output directory
- Removes any genes that do not pass threshold (see --meanthreshold and --variancethreshold)
- Determines gene chr and location
- Separates genes by chromosome
- Creates gene location files corresponding to each chromosome
The final output will be 22 files corresponding to gene expression per chromosome and 22 files corresponding to gene location per chromosome.
-a | --annotation)
annotation file corresponding to the genome
-b | --bamdirectory)
directory containing the bam files to be processed
-f | --fastqdirectory)
directory containing the fastq files to be processed
--indexdir)
when supplied alone, indexing has been run, and points to the index directory. When supplied with -ri, serves as output directory for indexing.
-l | --librarytype)
what type of reads are being aligned. Uses A by default to automatically attempt to infer what type of reads are being used. Please look at the Salmon documentation for further explanation of this option.
-m | --meanthreshold)
Filtering threshold used by the salmon parser script. Defines the minimum mean TPM for a given transcript for it to remain in the final combined quantification file.
-o | --outputdirectory)
directory where you'd like to send all your quantification files
--runindex)
Only supplied if indexing has not been run. May be supplied with -id to name output directory. Must be supplied with -g genome files path (Right now only accepts 1 genome file).
-s | --samplelist)
full path to the list of samples you would like to process. See above for formatting of sample list.
--single-end)
Necessary flag if reads are single end. Default is otherwise paired-end.
-t | --transcript)
full path to the reference transcript file.
-v | --variancethreshold)
Filtering threshold used by the salmon parser script. Defines the minimum variance of TPM for a given transcript for it to remain in the final combined quantification file.
IndexDirectoryDefault=~/SalmonIndex/
LibraryTypeDefault=A
MeanThresholdDefault=0.1
OutputDirectoryDefault=~/SalQuant
RunIndexDefault=False
VarianceThresholdDefault=0.1
The sample list follows a fairly simple format.
- There should be no headers
- The first column should be a list of samples
- One sample name per line
- Sample names should not contain any file types (for example .fastq and/or .gz endings can be removed from the name e.g. sample1.fastq becomes sample1)
- Sample names should not contain any path information (for example /home/data/sample1.fastq can be listed as sample1)
- Sample names should not contain any of its end information (For example with paired end files you may have sample1_R1.fastq and sample1_R2.fastq. These may be reduced to simply sample1.)
- The sample list should not contain any duplicates (For example with paired end files you may have sample1_R1.fastq and sample1_R2.fastq. These may be reduced to simply sample1)
Often times the sample names that fastq files are tagged with differ from how they need to be presented downstream. For example, genotype files may contain unique identifiers that differ from the fastq file names. With this in mind it is possible to introduce different names for your samples early on in the pipeline. To do this users can optionally introduce a second column of sample names to the sample list. The first column will still identify the correct input sample and outputs will simply be renamed to match the corresponding name in the second column. This name translation step will only take place during alignment steps. As such if users need to translate names they should do this right from the beginning Users only need to make this list once and it will not interfere with analysis downstream.
an easy and quick way to generate the sample list is using the unix command ls | cut -f 1 -d <delimeter> > sample_list.txt. This will cut file names into chunks that can be easily selected by the user. Please see the cut manual page for more details