Skip to content
William Wallén edited this page May 23, 2025 · 15 revisions

Genome Analysis Wiki

Quality control and trimming

The first step of the analysis was to do a quality check of our reads. We did this both for the short Illumina reads as well for the Nanopore long reads. We do this to see the quality of our reads, so we have something to compare to after trimming. We did this by using FastQC.

fastqc -o $OUTPUT --threads 2 "${INPUT_PATH}SRR244130${i}_1.fastq.gz" "${INPUT_PATH}SRR244130${i}_2.fastq.gz"

Explanation of code:

  • -o provides the output
  • --threads provides the amount of cores FastQC can use

Results

bild bild

The pictures above show the per-base quality from the analyses of the hyper-producer (HP126) forward and backwards strands. We can see that the forward strand (left) has better quality than the backward strand (right). This is a recurring event for the other two strands as well (wild-type R7 & exclusion of antibiotic-producing cluster DV3). We can also see that for quality significantly decreases towards the end off the reads. This is typical for Illumina and will not be a problem after trimming.

For the RNA strands, we got very good values of the per-base quality, but we got much worse results in for example the per-base sequence quality and per sequence GC content.

bild

This is not uncommon for RNA as library preparation and sequencing methods often introduce biases such as residual adapter sequences or inherent characteristics of poly(A) tails, which may lead to a slight spike in specific nucleotides even after trimming.

Trimming

Short-reads

Before the assembly of the strands, we will have to process them. We have both short-read DNA strands and RNA strands that were sequenced using Illumina. One of the things that we need to do when preprocessing the reads is to trim the adapters as we don't want to have these sequences added to our final assembly.

For this purpose, we use Trimmomatic to help with the preprocessing of our short-read DNA sequences and our RNA-reads.

For the DNA short-read preprocess:

trimmomatic PE -threads 12 -phred33 input/files/ output/files/ ILLUMINACLIP:/path/to/adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:150

Explanation of code:

  • PE informs Trimmomatic that we have pair-end reads
  • -threads dictates the amount of cores Trimmomatic can use
  • -phred33 convert quality scores to Phred-33
  • ILLUMINACLIP: provides the path to the adapters
  • LEADING cuts bases off the start of a read, if below a threshold quality
  • TRAILING cuts bases off the end of a read, if below a threshold quality
  • SLIDINGWINDOW performs sliding window trimming, cutting once the average quality within the window falls below a threshold.
  • MINLEN drops an entire read if it is below a specified length

We did the same for the RNA seq with the difference of the MINLEN parameter as the length of the RNA sequences is much shorter (~36-75 bases). We then set the MINLEN to 36 instead. We also changed the TRAILING to 3 as the first quality check didn't show poor quality on either end.

bild

As we can see in the figure above, we a significantly better quality on our reads. This is true for the whole read, even if the end has worse quality then the start.

Longread

For the long-read DNA strands, we need to use a different tool as it uses Nanopore for its sequencing instead of Illumina. We had to instead use Porechop for our trimming. Porechop is a tool developed specifically for the preproccesing of Nanopore reads. It looks both in both ends and middle of the reads to try and find adapters. When it finds an adapter, it either cuts it off if it is in either end, or if it finds it in the middle, it will split the read into two.

porechop -i "SRR24413072.fastq" -o "${OUTPUT}/01-04-DNA_longread_72trimmed.fastq.gz" --format fastq.gz --threads 4

Explanation of code:

  • -i input file
  • -o output file
  • --format output format
  • --threads amount of threads used
bild bild

In the pictures above, we can see the result from FastQC for the Nanopore reads before (left) and after (right) trimming. We got an overall better per-base quality after trimming our reads. Another observation is that we also got a more stable per-base sequence content graph for the start and end of the sequence. This would be due to the removal of the adapters in the reads.

bild bild

Genome Assembly

Assembly and Evaluation

We started the genome assembly by assembling the Nanopore reads. For this, we used the program Flye. Flye is a de novo assembler for longread data. We do this for all three genomes.

flye --nano-raw "01-04-DNA_longread_66trimmed.fastq.gz" --out-dir "${OUTPUT}/66" --threads 5
  • --nano-raw: Use the default parameters for Nanopore reads.

From Flye, we get a number of files, but most importantly, we get an output assembly.fasta file. We input this file into our evaluation program, Quast to check how well they where assembled. We did this by using the following command:

python '/sw/bioinfo/quast/5.0.2/rackham/bin/quast.py' -o "${OUTPUT}/R7" -r "${REFRENCE_PATH}/R7_genome.fasta" -t 2 "${INPUT_PATH}/R7/assembly.fasta"
  • python '/sw/bioinfo/quast/5.0.2/rackham/bin/quast.py' is the python execution and the path to the quast.py file.
  • -o Output directory
  • -r Refrence genome that we use to evaluate the assembly
  • -t How many threads it should use

Results from Quast

We got good results from all three assemblies. The main points we look at are Contigs, Largest contig and N50/L50. We would also want to look at genome completeness and base-level correctness (indels).

Genome Contigs Largest Contig N50 L50 Genome Fraction (%) Indels/100kbp
DV3 4 7 Mb 7 Mb 1 80.838 56.01
HP126 4 6.99 Mb 6.99 Mb 1 90.953 60.46
R7 2 9.36 Mb 9.36 Mb 1 99.930 61.87

We can see that we get very good results from this assembly with few contigs that we have one very large one, being our bacterial chromosome while the other smaller ones could be something like plasmids. Both N50 and L50 show good values indicating that we have had a good assembly. We do have a pretty high amount of indels per 100kbs. To be able to hopefully lower these values, we will polish our assembly with our shortreads.

Polishing

For the polishing of the assembly, we use the Illumina reads that we trimmed earlier. For the whole process, we use three programs, BWA, Samtools and Pilon.

BWA is a software package that is used to map DNA sequences to a reference genome. In our case, we will map our Illumina reads to our assembled nanopore sequence to try and decrease the amount of indels. First, we index our nanopore assemblies:

bwa index sample_assembly_flye.fasta

We do this for faster mapping later. BWA will be more efficient when searching for alignments. We then take our high‑accuracy short reads and try to align them to our draft contigs, producing a SAM file using the command:

bwa mem -t 5 sample_assembly_flye.fasta shortread_dna_sample_1.trimmed.fastq.gz shortread_dna_sample_2.trimmed.fastq.gz > sample.paired.sam

We do this so Pilon know where each Illumina base “falls” on our nanopore assembly to be able to spot errors. Our output from this is a SAM file. SAM is a human readable file format which is slower than its 'sibling', BAM file. A BAM file is a SAM file written to binary, leading to a more compressed file which will make the next processes more efficient, as well as save space. We do this with the command:

samtools view -bSh sample.paired.sam > *sample*.paired.unsorted.bam
  • -b writes to binary
  • -S auto detects the input format
  • -hpreserves the headers

After this, we sort the BAM file. We do this so our alignments appears in ascending genomic order. We do this as Pilon expects our reads to be sorted by position so it doesn't have to scan the whole file when jumping between any locus. We do this by using the following command:

samtools sort sample.unpaired.sorted.bam -o *sample*.paired.sorted.bam

We then index our BAM file for more efficient use of Pilon. This will create an .bai file that is companion file to our BAM files.

samtools index sample.paired.sorted.bam

Lastly, we use Pilon to try and improve our assemblies. We did this with the following command:

java -jar "${PILON_HOME}/pilon.jar" \
 --threads 5 \
 --genome sample_assembly_flye.fasta \
 --frags sample.paired.sorted.bam \
 --outdir /output/path \
 --output sample_pilon \
 --changes
  • --genome is our nanopore assembly that we want to polish
  • --frags is our BAM file we fixed before
  • --outdir output directory
  • --changes creates a human-readable log of all of the changes that was done

Results from polishing

After we polished our nanopore assembly, we used Quast again to see if we got any improvements. We saw an big decrease in the amount of indels per 100kbs which makes these very good assemblies.

Sample Pre Polishing Indels/kbs Post Polishing Indels/kbs
DV3 56.01 8.60
HP126 60.46 3.39
R7 61.87 0.65

Annotation

Now when we have done our genome assembly for the different samples, it is important to do annotation on our assemblies. We do this to make it easier to compare them against eachother in the further steps. The software we used was Prokka.

prokka --cpus 5 --outdir "${OUTPUT}/sample" --prefix "Sample" --kingdom Bacteria "${INPUT_PATH}/sample_pilon.fasta" 
  • --cpus: The amount of threads to be used
  • --outdir: Sets the output directory
  • `--prefix: Sets the naming of the output files
  • --kingdom: Choose either bacteria or archea

Results

The outputs showed similar results as the paper showed for the CDS content.

Sample Prokka CDS output Paper CDS (coverage adjusted) Paper CDS
R7 8411 8163 8263
HP126 7638 7510 8163
DV3 8163 7415 9077

During the analysis of the Prokka outputs, we could not just read the amount of CDS it found and compare it to the paper, but we needed to firstly account for the difference in coverage. We took the coding sequences and divided it with the coverage percentage. We could then multiply this value with our coverage to get a better representation of the expected CDS content.

Comparative Genomics

For the comparison between R7, HP126 mutant and DV3, a BLAST search was done. We firstly created a manual database from our wild-type genome:

makeblastdb -in R7_pilon.fasta -dbtype nucl -title asm_db -out asm_db

We then compared HP126 and DV3 to this database:

blastn -query HP126_pilon.fasta -db asm_db -num_threads 8 -outfmt 6 -out HP126_R7_blastn.crunch
  • -query: The input file that is to be aligned.
  • -db: The database to align to.
  • -outfmt: Define the outformat. 6 represent tabular format.

The output from this was a table of all the blast results. We got a total of 2167 hits for HP126 and 2164 hits for DV3. These hits are not perfect, so we need to filter our results. We filter setting %id >= 99%, sequence length should be longer than 100bp, E-value <= 1e-10 and a bitscore >= 100. After filtering, we got only 48 hits for HP126 and 49 for DV3. Doing this filtering leads to us only looking at large, nearly identical regions between the query and reference. These likely correspond to conserved or syntenic regions.

For the visualisation of the synteny between the two blasted genomes and wilt-type, we used the Artemis Comparison Tool (ACT). We can see the output from this analysis in the pictures below (HP126 on the left and DV3 on the right). We can see that large parts of the genomes are conserved relative to the wild-type R7. The most notable thing here is the inverted translocation that is present in both HP126 and DV3.

bild bild

This inversion and translocation could be a reason for the increase in antibiotic production. The translocation could have resulted in a stronger promoter and thus be upregulated.

We should also note that we have multiple contigs in the wild-type that is not matched in any mutant. This is probably due to poor assembly of HP126 and DV3 as these had a much lower completeness relative to R7.

Differential Expression Analysis

We want to do a differential expression analysis to see what genes are expressed in higher or lower amount between the different samples. We do this to be able to determine what could be the cause of the hyperproducing strain.

The first thing that needs to be done is to map the trimmed RNA reads to our genomes. We do this using bwa and samtools:

bwa mem -t 5 assembly/R7_pilon.fasta 02-04-trimomatic_rna/shortread_rna_56_1.trimmed.fastq.gz \
    02-04-trimomatic_rna/shortread_rna_56_2.trimmed.fastq.gz \
    | samtools sort -@5 -o DV3/DV3_rep1_rna_map.sorted.bam

    samtools index DV3/DV3_rep1_rna_map.sorted.bam

We input the wild-type (R7) assembly we did and one of the replicates of the RNA reads. We do this for each of the three replicates and strains. This means that we will get three separate .bam files with the RNA mapped to the assembly. We then run this through featureCounts. We map our RNA reads to the wild-type assembly because if we would map them to their respective assembly, we would not be able to compare them. Some gene boundaries or contig orders might shift making a comparison impossible.

featureCounts -T 15 -p -s 2 \
-a assembly/DV3.gff \
-t CDS -g ID \
-o DV3/featureCount_DV3.txt DV3/DV3*.sorted.bam 
  • -p: Makes sure that featureCounts counts the fragments for paired-end data and not individual reads
  • `-s: It specifies the strand specificity. 2 stands for reverse
  • -a: Defines our assembly
  • -t CDS: Tells featureCounts to select only features that equals CDS
  • -g: Says what to group the reads by
  • -o: Output

When running this, we will count the number of times a gene is expressed in the different replicates. We can use this data to see differences in the expression levels between the strains.

To be able to analyze the differences in the expression between R7 and HP126, as well as R7 and DV3, a heatmap was created showing the top 20 differentially expressed genes. The genes were filters and sorted according to the difference in expression, as well as the p-value. In the figure, we can clearly see what genes were upregulated and which ones were downregulated for the respective strains.

bild bild

To be able to see what these genes are, we need to refer to our R7 prokka output, as this will have our gene id's and what they corresponds to. Some of the upregulated genes in HP126 are Alkaline phosphatase D, ATP-dependent RNA helicase DeaD, N-acetylglucosamine repressor and Renalase which could be the cause for the ramped-up production of the antibiotic. Some of the downregulated genes were D-inositol-3-phosphate glycosyltransferase and ATP-dependent DNA helicase RecG. The downregulation of these genes might slow down other metabolic pathways that takes energy from antibiotic production. The same could be said about the expressions in R7 v DV3. We can't find hard evidence of what exactly is the cause of the increased production.

We didn't find any genes that were mentioned in the article. When searching for genes in the prokka output containing the keyword Oxytetracycline, we only got three results, where none had any statistical significance or up/down regulation.

Lastly, we created a PCA plot on the DE data. We can see for both plots of R7 v HP126 and R7 v DV3, we have a clear separation between the two. This means that there is a clear difference in the gene expression between R7 and the mutants. This strengthens it more that we have 97% and 92% respectively for PC1.

PCA_DV3_R7 PCA_HP126_R7

Clone this wiki locally