02 Quantification
If you ran salmon in the previous step to pseudoalign your fastq files, then congratulations! You have already completed the steps necessary to quantify your RNA-Seq data. Check that the output files are not empty, and that the total_populations folder is not empty either. This folder is where the parsed and filtered salmon output goes. You may notice that some SNPs are missing. This is because parsing automatically removes SNPs that do not meet the mean and variance thresholds across samples.
If you wish to create population files for a smaller subset of your samples, simply run Salmon_parser.R again. This script automatically searches a directory and its subdirectories for all relevant quant files, so to run on a subset, first remove every sample you do not wish to include. See the salmon_loop page for more details and options.
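The two emptiness checks can be scripted. The following is a minimal sketch, not part of the pipeline itself: the function names are hypothetical, and it assumes salmon's per-sample output file is named quant.sf (salmon's default) and that you know where your total_populations folder was written.

```shell
# Print every zero-byte quant.sf file found below the given directory.
empty_quant_files() {
    find "$1" -type f -name 'quant.sf' -size 0
}

# Succeed if the given directory exists and holds at least one non-empty file.
has_nonempty_files() {
    [ -d "$1" ] && [ -n "$(find "$1" -type f -size +0 | head -n 1)" ]
}

# Example usage (both paths are placeholders):
# empty_quant_files path/to/quantification
# has_nonempty_files path/to/total_populations || echo "total_populations is empty"
```

If empty_quant_files prints nothing and has_nonempty_files succeeds, both checks have passed.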
Salmon_parser.R example
mkdir outputdir
Rscript Salmon_parser.R -q $PATH/$TO/quantification/directory/ -a $PATH/$TO/annotations.gencode.gtf -o outputdir
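For a subset run, an alternative to deleting samples is to copy the ones you want into a fresh directory and point Salmon_parser.R at that copy. The sketch below assumes each sample's salmon output lives in its own subdirectory named after the sample. This layout is an assumption, and copy_sample_subset is a hypothetical helper, not part of the pipeline.

```shell
# Copy the listed sample directories from a full quantification directory
# into a separate subset directory.
# Arguments: source directory, destination directory, then sample names.
copy_sample_subset() {
    src=$1
    dest=$2
    shift 2
    mkdir -p "$dest"
    for sample in "$@"; do
        if [ -d "$src/$sample" ]; then
            cp -R "$src/$sample" "$dest/"
        else
            echo "warning: no directory for $sample in $src" >&2
        fi
    done
}

# Example usage (all paths and sample names are placeholders):
# copy_sample_subset quantification subset_quant sample1 sample2
# mkdir subset_out
# Rscript Salmon_parser.R -q subset_quant/ -a annotations.gencode.gtf -o subset_out
```

Working on a copy leaves the full set of quant files in place for later runs.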
If you already have alignment files (BAM/SAM), salmon can still quantify them relatively quickly. The arguments are similar to those used when running salmon for pseudoalignment. The key difference is that, instead of the directory containing fastq files, you supply the directory containing the BAM files with the -b option. You also do not need to supply an index or run indexing for this step. Please see the salmon_loop page for the full list of options.
Example
./salmon_loop -t $PATH/$TO/transcriptome.fa -b $PATH/$TO/BAMdirectory/ -a $PATH/$TO/annotations.gencode.gtf -s $PATH/$TO/sample_list.txt
The sample list follows a fairly simple format.
- There should be no headers
- The first column should be a list of samples
- One sample name per line
- Sample names should not contain any file extensions (for example, remove .fastq and/or .gz endings, so sample1.fastq becomes sample1)
- Sample names should not contain any path information (for example, /home/data/sample1.fastq should be listed as sample1)
- Sample names should not contain any read-end information (for example, the paired-end files sample1_R1.fastq and sample1_R2.fastq should both be reduced to sample1)
- The sample list should not contain any duplicates (for example, the paired-end files sample1_R1.fastq and sample1_R2.fastq should produce a single sample1 entry, not two)
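Most of the rules above can be checked mechanically before starting a run. The following is a rough sketch of such a check using awk. check_sample_list is a hypothetical helper, and the extensions and _R1/_R2 tags it looks for are assumptions based on the examples above, so adjust them to your own naming scheme. It cannot detect a header line.

```shell
# Print a warning for each line of a sample list that breaks one of the rules.
# Only the first column is checked. Argument: path to the sample list.
check_sample_list() {
    awk '
        NF == 0 { next }                      # skip blank lines
        { name = $1 }
        index(name, "/") > 0                { print "line " NR ": contains a path: " name }
        name ~ /\.(fastq|fq|gz|bam|sam)$/   { print "line " NR ": contains a file extension: " name }
        name ~ /_R[12]($|\.)/               { print "line " NR ": contains read-end information: " name }
        seen[name]++                        { print "line " NR ": duplicate sample: " name }
    ' "$1"
}

# Example usage (the file name is a placeholder):
# check_sample_list sample_list.txt
```

An empty result means none of these particular problems were found.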
Often the sample names attached to fastq files differ from how they need to be presented downstream. For example, genotype files may contain unique identifiers that differ from the fastq file names. With this in mind, it is possible to introduce different names for your samples early in the pipeline. To do this, you can optionally add a second column of sample names to the sample list. The first column still identifies the correct input sample, and outputs are simply renamed to match the corresponding name in the second column. This name translation only takes place during alignment steps, so if you need to translate names you should do so right from the beginning. You only need to make this list once, and it will not interfere with downstream analysis.
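To see what a two-column list implies before running anything, a short preview can help. The sketch below is illustrative only: preview_renames is a hypothetical helper, it assumes the columns are separated by whitespace (tabs or spaces), and it assumes that a line with a single column keeps its original name.

```shell
# Show how each input sample would be named in the outputs.
# Argument: path to the sample list.
preview_renames() {
    awk 'NF >= 1 { out = (NF >= 2) ? $2 : $1; print $1 " -> " out }' "$1"
}

# Example usage with made-up names:
# printf 'sample1\tGENO_001\nsample2\tGENO_002\n' > sample_list.txt
# preview_renames sample_list.txt
```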
An easy and quick way to generate the sample list is the Unix command ls | cut -f 1 -d <delimiter> > sample_list.txt. This splits each file name into fields at the delimiter and keeps the first one; change the -f value to select a different field. Please see the cut manual page for more details.
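With paired-end files, the plain ls | cut pipeline lists each sample twice, which breaks the no-duplicates rule. A possible variant is sketched below. make_sample_list is a hypothetical helper, and it assumes file names such as sample1_R1.fastq.gz, where the first underscore separates the sample name from the read-end tag. Sample names that themselves contain an underscore would be truncated, so pick a different delimiter or field in that case.

```shell
# Build a de-duplicated sample list from the FASTQ files in a directory.
# Argument: directory containing the FASTQ files.
make_sample_list() {
    ls "$1" |
        grep -E '\.(fastq|fq)(\.gz)?$' |   # keep only FASTQ files
        cut -f 1 -d '_' |                  # keep the text before the first "_"
        sort -u                            # one line per sample
}

# Example usage (the directory name is a placeholder):
# make_sample_list path/to/fastq_directory > sample_list.txt
```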