Debruijn Graph-based Assembler 🧬

Overview

This project is a fork of the original repository debruijn-tp, aimed at assembling the genome of Enterovirus A71. This genome is notably short, comprising 7408 nucleotides and is linear and non-segmented. The FastQ file provided was generated using the ART program via the command:

art_illumina -i eva71.fna -ef -l 100 -f 20 -o eva71 -ir 0 -dr 0 -ir2 0 -dr2 0 -na -qL 41 -rs 1539952693

The reads have a maximum quality of 41 and contain no insertions, only the 5' -> 3' strand reads are supplied.

Objectives

The goal of this practical is to develop an assembler for the Enterovirus A71 genome using the Debruijn graph algorithm. Our assembler operates in three main stages:

K-mer Generation: The reads are then split into k-mers of a fixed length k.
Graph Construction: The k-mers are then used to construct a Debruijn graph.
Solving Bubbles and Removing Tips: The graph is then simplified by removing bubbles and out/entrytips.
Contig Generation: The contigs are then generated from the simplified graph.

Usage 💻

Installation

To install the required dependencies, you can use the following command:

pip3 install --user networkx pytest pylint pytest-cov loguru

Running the Assembler

The assembler can be run using the following command:

python debruijn/debruijn.py -i [fastq_path] -k [k-mer_size] -o [contigs_path] -f [graph_path]

Options:

-i : Path to the FastQ file containing the reads.
-k : Size of the k-mers to be generated. Default is 21.
-o : Path to the output fasta file containing the contigs.
-f : Path to the output png file containing the graph. If not provided, the graph will not be saved.

Examples 📊

Here are some examples of running the assembler with different sets of reads:

eva71_plus_perfect.fq: Contains all reads generated for the complete sequencing of Enterovirus A71.

python debruijn/debruijn.py -i data/eva71_plus_perfect.fq -k 21 -o results/contigs_full_reads.fasta -f results/graph_full_reads.png

eva71_two_reads.fq: Contains the first two reads from eva71_plus_perfect.fq

python debruijn/debruijn.py -i data/eva71_two_reads.fq -k 21 -o results/contigs_2_reads.fasta -f results/graph_2_reads.png

eva71_hundred_reads.fq: Contains the first 100 reads from eva71_plus_perfect.fq

python debruijn/debruijn.py -i data/eva71_hundred_reads.fq -k 21 -o results/contigs_100_reads.fasta -f results/graph_100_reads.png

Testing Results with BLAST 🧪

To assess the quality of our assembler, we used BLAST to compare the contigs generated by our assembler (full reads) with the original genome of Enterovirus A71. The following steps were followed:

Note: Ensure that the BLAST+ software is installed on your machine.

Create a BLAST database from the original genome:

makeblastdb -in data/eva71.fna -dbtype nucl

Run BLAST on the contigs:

blastn -query results/contigs_full_reads.fasta -db data/eva71.fna -out results/blast_results.txt

Here are a part of the results obtained from the BLAST analysis:

BLASTN 2.12.0+


Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.



Database: data/eva71.fna
           1 sequences; 7,408 total letters



Query= contig_0 len=7391

Length=7391
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

EVA71_BrCr_U22521 EVA71                                               13649   0.0  


>EVA71_BrCr_U22521 EVA71
Length=7408

 Score = 13649 bits (7391),  Expect = 0.0
 Identities = 7391/7391 (100%), Gaps = 0/7391 (0%)
 Strand=Plus/Plus

Query  1     GTGGGTTGTCACCCACCCACAGGGTCCACTGGGCGCTAGTACACTGGTATCTCGGTACCT  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  12    GTGGGTTGTCACCCACCCACAGGGTCCACTGGGCGCTAGTACACTGGTATCTCGGTACCT  71

Query  61    TTGTACGCCTGTTTTATACCCCCTCCCTGATTTGCAACTTAGAAGCAACGCAAACCAGAT  120
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  72    TTGTACGCCTGTTTTATACCCCCTCCCTGATTTGCAACTTAGAAGCAACGCAAACCAGAT  131

You can find the full BLAST results in the results/blast_results.txt file. We can see that the contigs generated by our assembler match the original genome of Enterovirus A71 with 100% identity. This indicates that our assembler is able to accurately reconstruct the genome of Enterovirus A71 from the reads provided ! 🎉

Author 🧑‍💻

Essmay Touami

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
data		data
debruijn		debruijn
results		results
tests		tests
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Debruijn Graph-based Assembler 🧬

Overview

Objectives

Usage 💻

Installation

Running the Assembler

Examples 📊

Testing Results with BLAST 🧪

Author 🧑‍💻

About

Uh oh!

Releases

Packages

Languages

Essmaw/Debruijn-Graph-based-Assembler

Folders and files

Latest commit

History

Repository files navigation

Debruijn Graph-based Assembler 🧬

Overview

Objectives

Usage 💻

Installation

Running the Assembler

Examples 📊

Testing Results with BLAST 🧪

Author 🧑‍💻

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages