This project is a fork of the original repository debruijn-tp, aimed at assembling the genome of Enterovirus A71. This genome is notably short, comprising 7408 nucleotides and is linear and non-segmented. The FastQ file provided was generated using the ART program via the command:
art_illumina -i eva71.fna -ef -l 100 -f 20 -o eva71 -ir 0 -dr 0 -ir2 0 -dr2 0 -na -qL 41 -rs 1539952693
The reads have a maximum quality of 41 and contain no insertions, only the 5' -> 3' strand reads are supplied.
The goal of this practical is to develop an assembler for the Enterovirus A71 genome using the Debruijn graph algorithm. Our assembler operates in three main stages:
- K-mer Generation: The reads are then split into k-mers of a fixed length k.
- Graph Construction: The k-mers are then used to construct a Debruijn graph.
- Solving Bubbles and Removing Tips: The graph is then simplified by removing bubbles and out/entrytips.
- Contig Generation: The contigs are then generated from the simplified graph.
To install the required dependencies, you can use the following command:
pip3 install --user networkx pytest pylint pytest-cov loguru
The assembler can be run using the following command:
python debruijn/debruijn.py -i [fastq_path] -k [k-mer_size] -o [contigs_path] -f [graph_path]
Options:
-i: Path to the FastQ file containing the reads.-k: Size of the k-mers to be generated. Default is 21.-o: Path to the output fasta file containing the contigs.-f: Path to the output png file containing the graph. If not provided, the graph will not be saved.
Here are some examples of running the assembler with different sets of reads:
eva71_plus_perfect.fq: Contains all reads generated for the complete sequencing of Enterovirus A71.
python debruijn/debruijn.py -i data/eva71_plus_perfect.fq -k 21 -o results/contigs_full_reads.fasta -f results/graph_full_reads.png
eva71_two_reads.fq: Contains the first two reads from eva71_plus_perfect.fq
python debruijn/debruijn.py -i data/eva71_two_reads.fq -k 21 -o results/contigs_2_reads.fasta -f results/graph_2_reads.png
eva71_hundred_reads.fq: Contains the first 100 reads from eva71_plus_perfect.fq
python debruijn/debruijn.py -i data/eva71_hundred_reads.fq -k 21 -o results/contigs_100_reads.fasta -f results/graph_100_reads.png
To assess the quality of our assembler, we used BLAST to compare the contigs generated by our assembler (full reads) with the original genome of Enterovirus A71. The following steps were followed:
Note: Ensure that the BLAST+ software is installed on your machine.
- Create a BLAST database from the original genome:
makeblastdb -in data/eva71.fna -dbtype nucl
- Run BLAST on the contigs:
blastn -query results/contigs_full_reads.fasta -db data/eva71.fna -out results/blast_results.txt
Here are a part of the results obtained from the BLAST analysis:
BLASTN 2.12.0+
Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.
Database: data/eva71.fna
1 sequences; 7,408 total letters
Query= contig_0 len=7391
Length=7391
Score E
Sequences producing significant alignments: (Bits) Value
EVA71_BrCr_U22521 EVA71 13649 0.0
>EVA71_BrCr_U22521 EVA71
Length=7408
Score = 13649 bits (7391), Expect = 0.0
Identities = 7391/7391 (100%), Gaps = 0/7391 (0%)
Strand=Plus/Plus
Query 1 GTGGGTTGTCACCCACCCACAGGGTCCACTGGGCGCTAGTACACTGGTATCTCGGTACCT 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 12 GTGGGTTGTCACCCACCCACAGGGTCCACTGGGCGCTAGTACACTGGTATCTCGGTACCT 71
Query 61 TTGTACGCCTGTTTTATACCCCCTCCCTGATTTGCAACTTAGAAGCAACGCAAACCAGAT 120
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 72 TTGTACGCCTGTTTTATACCCCCTCCCTGATTTGCAACTTAGAAGCAACGCAAACCAGAT 131
You can find the full BLAST results in the results/blast_results.txt file.
We can see that the contigs generated by our assembler match the original genome of Enterovirus A71 with 100% identity. This indicates that our assembler is able to accurately reconstruct the genome of Enterovirus A71 from the reads provided ! 🎉