Questions
- What is the structure of a FASTQ file? How is the quality of the data stored in the FASTQ files and how are paired reads identified?
The FASTQ file is structured into four separate fields (lines) for each sequence. The first field starts with the '@' symbol followed by the name/identifier of the sequence. The second field is the raw sequence, e.g. 'ATAGC...'. The third field starts with the '+' symbol, which acts as a separator. The fourth and last field contains the quality values, one character for each base in the sequence. The quality scores are encoded as ASCII characters. Here they are listed from lowest to highest (lowest: !, highest: ~):
!"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^
_`abcdefghijklmnopqrstuvwxyz{|}~
For paired runs, you will get two separate files, one for R1 and one for R2. Most programs take both of these files and treat the reads in them as pairs. Alternatively, the mates can be identified from the headers, which are identical except for the read number. For example:
@SEQ_ID 1:N:0:INDEX and @SEQ_ID 2:N:0:INDEX
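The four-field layout and the ASCII quality encoding described above can be sketched in a few lines of Python. This is only a minimal illustration: it assumes unwrapped four-line records, a Phred+33 offset (consistent with '!' being the lowest character), and Illumina-style headers like the ones shown. The helper names `read_fastq`, `phred_scores` and `mate_number` are made up for this example.

```python
from io import StringIO

PHRED_OFFSET = 33  # assumption: Phred+33, so '!' decodes to Q0


def read_fastq(handle):
    """Yield (header, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=PHRED_OFFSET):
    """Decode a quality string into one integer score per base."""
    return [ord(char) - offset for char in quality]


def mate_number(header):
    """Return 1 or 2 from a header such as 'SEQ_ID 1:N:0:INDEX'."""
    _, _, comment = header.partition(" ")
    return int(comment.split(":")[0])


example = StringIO("@SEQ_ID 1:N:0:INDEX\nATAGC\n+\nII5!~\n")
for header, sequence, quality in read_fastq(example):
    print(header, sequence, phred_scores(quality), mate_number(header))
```

With this offset, 'I' decodes to Q40, '5' to Q20, '!' to Q0 and '~' to Q93, which matches the lowest-to-highest ordering listed above.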
- What is read preprocessing?
Read preprocessing is when we prepare our reads for assembly. This includes trimming of the adapters used during sequencing, for example trimming the Illumina adapters for the Illumina reads.
- What parameters are of interest when looking into read quality?
We run our quality checks with FastQC. The first thing we usually look at is the quality scores for the reads. They are shown as a box plot for each position in the read. For an OK read, we want our scores above Q20.
We also look at the per base sequence content. If the plot is way too jumpy, we know something is probably wrong.
Lastly, we want to look at the adapter content of the reads, especially after trimming to see if we got rid of the adapters.
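To make the Q20 threshold concrete: a Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means roughly 1 wrong base call in 100. The sketch below computes that conversion and a per-position mean quality over a handful of quality strings. It is a rough stand-in for what the per base quality plot summarises, not FastQC's own calculation, and it assumes Phred+33 encoding. The function names are hypothetical.

```python
def error_probability(q):
    """Convert a Phred quality score into the probability of a wrong base call."""
    return 10 ** (-q / 10)


def mean_quality_per_position(quality_strings, offset=33):
    """Average the decoded scores at each read position (reads may differ in length)."""
    totals, counts = [], []
    for quality in quality_strings:
        for position, char in enumerate(quality):
            if position == len(totals):
                totals.append(0)
                counts.append(0)
            totals[position] += ord(char) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


# Toy input: 'I' is Q40 and '5' is Q20 under Phred+33.
qualities = ["II5", "I5"]
means = mean_quality_per_position(qualities)
print(means)
print([position for position, mean in enumerate(means) if mean < 20])
print(error_probability(20))
```

Positions whose mean falls below 20 would be the ones to look at more closely in the FastQC report.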
- How is the quality of your data?
The quality of my data is not the best. The base quality is below Q20 for some intervals in the sequences. This is the case for both the Illumina and Nanopore reads.
- How important is read preprocessing for downstream analyses and why?
Preprocessing is very important to be able to come to correct conclusions in later analyses. For example, if we leave the adapters on, we would get faulty results when comparing the sequences with annotated ones, as these would differ, but not in a meaningful way. It is also possible that faulty reads would be used in further analyses, even though they have a low quality score. This would lead to messy data that would be hard to draw conclusions from.
- What can generate the “fails” in FastQC that you observe in your data? Can these cause any problems during subsequent analyses?
As explained above, some fails could be due to adapters being sequenced as part of the reads. We could also have contamination in our raw data, which would lead to the same kind of problems in later analyses.
- How many reads have been discarded after trimming?
For the short-read DNA sequences, around 250 000 - 335 000 reads were discarded, depending on the strain.
For the RNA sequences, around 850 000 - 1 125 000 reads were discarded.
For the long-read DNA sequences, around 200 - 420 reads were added according to the FastQC report. This could be due to Porechop cutting reads at the adapters and, instead of discarding the pieces, counting them as separate reads.
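The read counts above come from the FastQC reports. They can be cross-checked by counting records in the FASTQ files before and after trimming. The sketch below is one way to do that, assuming unwrapped four-line records. `open_fastq` and `count_reads` are hypothetical helpers, and the in-memory data is a toy example, not the real files.

```python
import gzip
from io import StringIO


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(handle):
    """Count records, assuming exactly four non-empty lines per read."""
    line_count = sum(1 for line in handle if line.strip())
    if line_count % 4:
        raise ValueError("line count is not a multiple of four")
    return line_count // 4


# Toy data standing in for a raw file and its trimmed counterpart.
raw = StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII##\n@r3\nACGT\n+\n####\n")
trimmed = StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\nII\n")

before = count_reads(raw)
after = count_reads(trimmed)
print(f"before={before} after={after} change={after - before}")
```

A negative change means reads were discarded. A positive change, as seen for the long reads, means the trimmed file contains more records than the raw one. For real files, `count_reads(open_fastq(path))` would be used with whatever paths the pipeline produced.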
- How can this affect your future analyses and results?
When we have fewer reads to use, we may get worse overall quality due to insufficient data for some parts of the genome. This could in turn lead to more contigs than we would get with more reads. But the reads that were discarded were discarded for a reason, so we should get a better quality assembly and thus better quality analyses further down the road.
- How is the quality of your data after trimming?
The quality of the data was significantly better. We got better quality scores for the reads. All of the quality scores were above Q20.
- What quality threshold did you choose for the leading/trailing/slidingwindow parameters, and why?
For the DNA trimming, I chose LEADING:3 TRAILING:20 SLIDINGWINDOW:4:20.
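To illustrate what these three thresholds do to a single read, here is a simplified Python sketch. It is not the trimming tool's actual implementation: it assumes Phred+33 quality strings, applies the steps in the order leading, trailing, sliding window, and cuts the read at the start of the first window whose mean quality drops below the threshold. All function names are hypothetical.

```python
def leading_start(scores, threshold):
    """Index of the first base whose quality is not below the threshold."""
    start = 0
    while start < len(scores) and scores[start] < threshold:
        start += 1
    return start


def trailing_end(scores, threshold):
    """Index just past the last base whose quality is not below the threshold."""
    end = len(scores)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return end


def sliding_window_cut(scores, window, threshold):
    """Start index of the first window whose mean quality is below the threshold."""
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < threshold:
            return start
    return len(scores)  # no window failed, keep everything


def trim_read(sequence, quality, leading=3, trailing=20,
              window=4, window_quality=20, offset=33):
    """Apply the three steps in order and return the trimmed (sequence, quality)."""
    scores = [ord(char) - offset for char in quality]
    start = leading_start(scores, leading)
    end = trailing_end(scores, trailing)
    if start >= end:
        return "", ""
    end = start + sliding_window_cut(scores[start:end], window, window_quality)
    return sequence[start:end], quality[start:end]


# '!' is Q0, '#' is Q2, '&' is Q5, '5' is Q20 and 'I' is Q40 under Phred+33.
print(trim_read("ACGTACGTAC", "!IIIII5###"))
print(trim_read("ACGTACGTAC", "IIII&&&&II"))
```

In the first read only the low-quality ends are removed. In the second, the window of four bases starting at the fourth base has a mean below 20, so the read is cut there even though the last two bases are good.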