Skip to content

Essmaw/Debruijn-Graph-based-Assembler

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Debruijn Graph-based Assembler 🧬

Overview

This project is a fork of the original repository debruijn-tp, aimed at assembling the genome of Enterovirus A71. This genome is notably short, comprising 7408 nucleotides and is linear and non-segmented. The FastQ file provided was generated using the ART program via the command:

art_illumina -i eva71.fna -ef -l 100 -f 20 -o eva71 -ir 0 -dr 0 -ir2 0 -dr2 0 -na -qL 41 -rs 1539952693

The reads have a maximum quality of 41 and contain no insertions, only the 5' -> 3' strand reads are supplied.

Objectives

The goal of this practical is to develop an assembler for the Enterovirus A71 genome using the Debruijn graph algorithm. Our assembler operates in three main stages:

  1. K-mer Generation: The reads are then split into k-mers of a fixed length k.
  2. Graph Construction: The k-mers are then used to construct a Debruijn graph.
  3. Solving Bubbles and Removing Tips: The graph is then simplified by removing bubbles and out/entrytips.
  4. Contig Generation: The contigs are then generated from the simplified graph.

Usage 💻

Installation

To install the required dependencies, you can use the following command:

pip3 install --user networkx pytest pylint pytest-cov loguru

Running the Assembler

The assembler can be run using the following command:

python debruijn/debruijn.py -i [fastq_path] -k [k-mer_size] -o [contigs_path] -f [graph_path]

Options:

  • -i : Path to the FastQ file containing the reads.
  • -k : Size of the k-mers to be generated. Default is 21.
  • -o : Path to the output fasta file containing the contigs.
  • -f : Path to the output png file containing the graph. If not provided, the graph will not be saved.

Examples 📊

Here are some examples of running the assembler with different sets of reads:

  • eva71_plus_perfect.fq: Contains all reads generated for the complete sequencing of Enterovirus A71.
python debruijn/debruijn.py -i data/eva71_plus_perfect.fq -k 21 -o results/contigs_full_reads.fasta -f results/graph_full_reads.png
  • eva71_two_reads.fq: Contains the first two reads from eva71_plus_perfect.fq
python debruijn/debruijn.py -i data/eva71_two_reads.fq -k 21 -o results/contigs_2_reads.fasta -f results/graph_2_reads.png

  • eva71_hundred_reads.fq: Contains the first 100 reads from eva71_plus_perfect.fq
python debruijn/debruijn.py -i data/eva71_hundred_reads.fq -k 21 -o results/contigs_100_reads.fasta -f results/graph_100_reads.png

Testing Results with BLAST 🧪

To assess the quality of our assembler, we used BLAST to compare the contigs generated by our assembler (full reads) with the original genome of Enterovirus A71. The following steps were followed:

Note: Ensure that the BLAST+ software is installed on your machine.

  1. Create a BLAST database from the original genome:
makeblastdb -in data/eva71.fna -dbtype nucl
  1. Run BLAST on the contigs:
blastn -query results/contigs_full_reads.fasta -db data/eva71.fna -out results/blast_results.txt

Here are a part of the results obtained from the BLAST analysis:

BLASTN 2.12.0+


Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.



Database: data/eva71.fna
           1 sequences; 7,408 total letters



Query= contig_0 len=7391

Length=7391
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

EVA71_BrCr_U22521 EVA71                                               13649   0.0  


>EVA71_BrCr_U22521 EVA71
Length=7408

 Score = 13649 bits (7391),  Expect = 0.0
 Identities = 7391/7391 (100%), Gaps = 0/7391 (0%)
 Strand=Plus/Plus

Query  1     GTGGGTTGTCACCCACCCACAGGGTCCACTGGGCGCTAGTACACTGGTATCTCGGTACCT  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  12    GTGGGTTGTCACCCACCCACAGGGTCCACTGGGCGCTAGTACACTGGTATCTCGGTACCT  71

Query  61    TTGTACGCCTGTTTTATACCCCCTCCCTGATTTGCAACTTAGAAGCAACGCAAACCAGAT  120
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  72    TTGTACGCCTGTTTTATACCCCCTCCCTGATTTGCAACTTAGAAGCAACGCAAACCAGAT  131

You can find the full BLAST results in the results/blast_results.txt file. We can see that the contigs generated by our assembler match the original genome of Enterovirus A71 with 100% identity. This indicates that our assembler is able to accurately reconstruct the genome of Enterovirus A71 from the reads provided ! 🎉

Author 🧑‍💻

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%