This is the working repository for Genorobotics' participation in the Lemanic Life Sciences Hackathon of April 2024.
Genorobotics aims to develop field tools for plant biodiversity identification based on DNA barcoding. This eliminates the need to transport biological samples to labs for testing and aims to support biodiversity conservation efforts.
Four genes are used to identify the species of a sample: matK, psbA-trnH, ITS and rbcL. These genes are amplified by Polymerase Chain Reaction (PCR) and sequenced with an Oxford Nanopore portable sequencer.
The goal of the bioinformatics team is to interpret the data generated by the sequencer. First, the raw DNA reads from a fastq file are aligned to generate a consensus sequence for each gene. Then, this sequence is compared with BLASTn against GenBank, NCBI's database of genetic sequences. The results of the BLASTn queries for the four genes are combined to predict the species.
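As a rough illustration of the input to the first step, the sketch below reads the records of a fastq stream with plain Python. The function name `read_fastq` and the in-memory example are assumptions made for illustration and are not taken from the pipeline's code. It also assumes that every record spans exactly four lines (no wrapped sequences).

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record fastq stream."""
    while True:
        header = handle.readline().strip()
        if not header:  # end of file (or a trailing blank line)
            return
        sequence = handle.readline().strip()
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().strip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed fastq record: {header!r}")
        # Keep only the identifier, i.e. the first word after '@'.
        yield header[1:].split()[0], sequence, quality


# Tiny in-memory example standing in for a real file of nanopore reads.
example = io.StringIO("@read1 sample=demo\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!!!\n")
records = list(read_fastq(example))
```

For a real file, the same function can be called on `open(path)` instead of the `io.StringIO` object.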
The project revolves around DNA sequence clustering. Clustering can serve two purposes; the project will focus on the first:
- Demultiplexing without additional barcodes: In the summer expedition, multiple genes are sequenced for every plant, all within one sample, using the same barcode. This reduces the number of pipetting steps on site and the number of tubes used. It also simplifies the protocol, reducing handling errors. However, this means the reads from several different genes end up in the same file. This is problematic for consensus sequence generation, as no single sequence will emerge. The reads need to be separated according to their gene of origin.
- Increasing coverage depth by directionalizing reads: Even when only one gene is sequenced, there are four kinds of reads in one file. This is because PCR amplifies both the coding and non-coding strands, and the direction in which the sequence enters the sequencer is random. The four orientations are therefore: coding 5', coding 3', non-coding 5', non-coding 3'.
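To make the idea of orientations concrete, here is a minimal sketch using exact string operations. It assumes that the four orientations listed above correspond to the read as is, reversed, complemented and reverse-complemented; this mapping, and the names `orientations` and `canonical`, are assumptions for illustration only. Real reads contain sequencing errors, so an actual directionalizing step would have to compare reads by similarity rather than by exact equality.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def orientations(sequence: str) -> dict:
    """Return the four orientations of a sequence under the assumption above."""
    complement = sequence.translate(COMPLEMENT)
    return {
        "forward": sequence,
        "reverse": sequence[::-1],
        "complement": complement,
        "reverse_complement": complement[::-1],
    }


def canonical(sequence: str) -> str:
    """Pick one representative so that all four orientations map to the same string."""
    # The four orientations form a closed set, so the minimum is the same
    # whichever orientation we start from.
    return min(orientations(sequence).values())


example = orientations("ATGC")
```

With error-free reads, `canonical` would bring every read of the same gene to a single direction, which is the effect directionalizing aims for.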
The datasets used are the sequencing results of multiple Genorobotics expeditions. More information on the extraction method used for each expedition can be found in the associated `general_info.csv` file:
- demultiplexed data: Every sample contains reads from only one species and one gene. The only demultiplexed expedition is the one to Lausanne's botanical garden consisting of 10 samples of 5 different species and the 4 barcoding genes.
- multiplexed data: Every sample contains reads from only one species but up to 4 different genes. All of the remaining expeditions are multiplexed (e.g. the summer expedition, consisting of 12 samples of 12 different plants).
- Each subfolder represents a sample and spells out the sample's species, the genes sequenced and the barcode used.
- Each subfolder contains a `fastq` file containing the raw reads and a `fasta` file containing the species' reference sequences, taken from the GenBank database, for the genes amplified.
- Three csv files, `general_info.csv`, `primer_info.csv` and `sample_info.csv`, contain information about the expedition, the primers used and the species/genes for each sample, respectively.
- The `tomato` folder is the only one that does not follow this organization; it is simply a compilation of sequencing runs from commercial tomato samples.
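The folder layout described above can be walked with a few lines of Python. This is only a sketch: the function name `list_samples`, the `.fastq`/`.fasta` file extensions and the folder and file names in the example are assumptions, not names taken from the repository.

```python
import tempfile
from pathlib import Path


def list_samples(expedition_dir: Path) -> list:
    """Return one dict per sample subfolder with its fastq and fasta file names."""
    samples = []
    for folder in sorted(p for p in expedition_dir.iterdir() if p.is_dir()):
        samples.append({
            "sample": folder.name,
            "fastq": sorted(f.name for f in folder.glob("*.fastq")),
            "fasta": sorted(f.name for f in folder.glob("*.fasta")),
        })
    return samples


# Build a throwaway expedition folder with made-up names to show the result.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sample_dir = root / "example_sample"
    sample_dir.mkdir()
    (sample_dir / "reads.fastq").write_text("@r1\nACGT\n+\nIIII\n")
    (sample_dir / "reference.fasta").write_text(">ref\nACGT\n")
    (root / "general_info.csv").write_text("placeholder\n")
    samples = list_samples(root)
```

The csv files at the expedition level are ignored here because only directories are treated as samples.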
- To explore the effect of the number of reads and the rate of sequencing errors on the clustering methods we will develop, fake fastq databases can be generated from a reference sequence.
- Additionally, real demultiplexed fastq files can be joined together while keeping track of their file of origin, to simulate the effect of multiplexing on real data.
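Both ideas can be sketched in a few lines. The code below is an illustration under simplifying assumptions, not the repository's generator: errors are modelled as random substitutions only (nanopore reads also contain insertions and deletions), quality strings are a constant placeholder, and the names `simulate_reads`, `merge_with_origin` and the example file names are hypothetical.

```python
import random


def simulate_reads(reference: str, n_reads: int, error_rate: float, seed: int = 0) -> list:
    """Return fastq records (as strings) copied from `reference` with random substitutions."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        bases = []
        for base in reference:
            if rng.random() < error_rate:
                # Substitute with one of the other bases.
                base = rng.choice([b for b in "ACGT" if b != base])
            bases.append(base)
        read = "".join(bases)
        quality = "I" * len(read)  # constant placeholder quality string
        records.append(f"@sim_read_{i}\n{read}\n+\n{quality}\n")
    return records


def merge_with_origin(files: dict) -> list:
    """Concatenate fastq records, appending the file of origin to each header line."""
    merged = []
    for origin, records in files.items():
        for record in records:
            header, rest = record.split("\n", 1)
            merged.append(f"{header} origin={origin}\n{rest}")
    return merged


clean = simulate_reads("ACGTACGTAC", n_reads=3, error_rate=0.0)
noisy = simulate_reads("ACGTACGTAC", n_reads=3, error_rate=0.2, seed=1)
merged = merge_with_origin({"matK.fastq": clean, "rbcL.fastq": noisy})
```

Keeping the origin in the header means a clustering result can later be scored against the known file of each read.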
- Conda 4.8.3
- BLAST 2.X.0
- WSL 2 (Windows only)
It is highly recommended to install BLASTn on WSL rather than directly on Windows, and to run the pipeline using WSL. The pipeline was developed and tested on a Windows 11 machine with WSL 2 (Windows Subsystem for Linux). For more information about WSL 2, visit Microsoft's official documentation.
If you choose to install BLASTn on Windows, you can still run the pipeline in Windows PowerShell. Make sure to set the `windows` parameter to `True` when calling the .py scripts or in the .ipynb notebooks. Note that commands will still launch in a WSL terminal, so installing WSL remains a requirement.
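One way such a `windows` switch could work is sketched below. This is an assumption about the mechanism, not the repository's implementation: the helper name `build_command` is hypothetical, and the sketch simply forwards the command to bash inside WSL through the `wsl` launcher when the flag is set.

```python
def build_command(command: str, windows: bool = False) -> list:
    """Return the argument list to pass to subprocess.run for a bash command."""
    if windows:
        # Assumption: on Windows the command is forwarded to bash inside WSL.
        return ["wsl", "bash", "-c", command]
    return ["bash", "-c", command]


wsl_args = build_command("blastn -version", windows=True)
native_args = build_command("blastn -version")
```

The resulting list can be passed to `subprocess.run(args, capture_output=True, text=True)`.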
- A very useful tutorial to install BLASTn on macOS can be found here.
- In your `.bash_profile`, place the new lines regarding BLAST at the beginning of the file.
- The `run_command` and `run_bash_command` functions should work correctly, but they might encounter issues. If you face any bugs when running minimap2, racon, or BLASTn commands, check these functions first.
The pipeline is designed to run within an Anaconda environment. The required environment configuration is specified in the file genorobotics_pipeline.yml.
- Create the Anaconda Environment:
  - Open your terminal and navigate to the project's folder.
  - Run the command `conda env create -f genorobotics_pipeline.yml`. This will create an environment named `genorobotics_pipeline`. (Ensure Anaconda is installed beforehand.)
- Activate the Environment:
  - In your terminal, type `conda activate genorobotics_pipeline` to activate the newly created environment.
- Download BLAST: Visit the NCBI BLAST FTP site and choose the appropriate BLAST+ package for your system (look for a file ending in `-x64-linux.tar.gz` for Linux or `-x64-macosx.tar.gz` for macOS).
- Download via Terminal: Open your terminal and use `wget` or `curl` to download the BLAST+ package. Replace the URL with the correct version:
  `wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.X.0+-x64-linux.tar.gz`
  or
  `curl -O ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.X.0+-x64-linux.tar.gz`
- Extract the Package:
  - Use the following command to extract the tarball (adjust the file name to the version you downloaded):
    `tar -zxvf ncbi-blast-2.12.0+-x64-linux.tar.gz`
  - This will create a directory with the BLAST+ executables.
- Add BLAST to Your PATH:
  - To run BLASTn from any location, you need to add it to your PATH. Replace `/path/to/ncbi-blast-2.12.0+/bin` with the actual path to the `bin` directory inside the extracted folder.
    `export PATH=$PATH:/path/to/ncbi-blast-2.12.0+/bin`
  - You also need to add the `db` folder to your BLASTDB environment variable. Replace `/path/to/ncbi-blast-2.12.0+/db` with the actual path to the `db` directory inside the extracted folder.
    `export BLASTDB=$BLASTDB:/path/to/ncbi-blast-2.12.0+/db`
  - Add these lines to your `.bashrc` or `.bash_profile` (or the equivalent for your shell) to make this change permanent.
- Test the Installation:
  - To ensure BLASTn is installed correctly, run:
    `blastn -version`
  - This should return the installed version of BLASTn.
- Update your System (Optional):
  - Ensure your system's package list is updated and all dependencies are satisfied:
    `sudo apt-get update` (for Debian/Ubuntu)
    `sudo yum update` (for CentOS/RedHat)
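The installation can also be checked from Python before running the pipeline. The sketch below is an illustration only: the function name `check_blast_setup` is hypothetical, and it merely looks for a `blastn` executable on the PATH and lists the directories named in the BLASTDB variable.

```python
import os
import shutil


def check_blast_setup(environ=None) -> dict:
    """Report where blastn is found and which BLASTDB directories exist."""
    env = os.environ if environ is None else environ
    # BLASTDB may hold several directories; skip empty entries such as a leading separator.
    db_dirs = [d for d in env.get("BLASTDB", "").split(os.pathsep) if d]
    return {
        "blastn_path": shutil.which("blastn", path=env.get("PATH")),
        "blastdb_dirs": db_dirs,
        "missing_db_dirs": [d for d in db_dirs if not os.path.isdir(d)],
    }


report = check_blast_setup()
```

A `blastn_path` of `None` or a non-empty `missing_db_dirs` list points to a PATH or BLASTDB entry that needs fixing.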
To install BLASTn on Windows, follow these steps carefully:
- Download BLAST+ Executables:
  - Visit the NCBI BLAST download page.
  - Look for the Windows version of the BLAST+ executables (e.g., `ncbi-blast-2.12.0+-win64.exe`). Click to download the installer.
- Install BLAST:
  - After downloading, run the `.exe` installer.
  - Follow the installation wizard, which will guide you through the setup process, including the installation location for BLAST.
- Unzip the Downloaded File:
  - If the downloaded BLAST+ package is a zip file, unzip it to your desired location.
- Set Up Environment Variables:
  - Add the path to the `bin` folder of the BLAST installation to the PATH environment variable.
  - Create a new environment variable named `BLASTDB` and set its value to the path of the `db` folder in your BLAST installation.
  - To set these environment variables:
    - Right-click on 'This PC' or 'Computer' on the desktop or in File Explorer, then select 'Properties'.
    - Click 'Advanced system settings' and then the 'Environment Variables' button.
    - Under 'System variables', find and select the PATH variable, then click 'Edit' to add the BLAST `bin` directory.
    - Click 'New' to create the `BLASTDB` variable and set its value to the BLAST `db` directory.
    - Click 'OK' to save changes and close all dialogs.
- Test the Installation:
  - Open Command Prompt and type `blastn -version`.
  - This should display the installed version of BLASTn, confirming the successful installation.
Note: The installation process may vary slightly based on the version of the BLAST+ executables. Always follow the instructions provided with the downloaded package. If you encounter any issues, ensure that the paths in your environment variables are correct.
BLASTn is run locally as an alignment tool to find the best match for our sample sequence in a database of known genetic sequences. This database must be downloaded manually from GenBank for the four genes of interest.
Follow these steps:
- Go to the website: https://www.ncbi.nlm.nih.gov/nuccore
- For each of the four genes: MatK, rbcL, psbA-trnH, Internal Transcribed Spacer
- Click on the `Advanced` search option (under the search bar). Type the name of the gene in the search bar and click on `search`.
- In the first bar, type the name of the gene. In the second bar, replace `All Fields` by `Sequence Length` and select the range of sequence lengths to download (e.g. 750:1500). The ranges proposed here provide a good basis for species identification while keeping the database size manageable:
  - MatK: 750 to 1500 -> ~110k sequences, ~90 MB
  - rbcL: 600 to 1000 -> ~90k sequences, ~80 MB
  - psbA-trnH: 400 to 800 -> ~60k sequences, ~40 MB
  - ITS: 1000 to 35000
Note: Be sure to search for "Internal Transcribed Spacer" instead of "ITS" to get results.
- Click Search, then Send to (top right corner) > Complete Record > File > Format = Fasta > Create File
- Move the downloaded fasta file to the /db folder of your BLASTn installation directory.
- Open a terminal and navigate to that /db directory using `cd`.
- Use the `makeblastdb` command-line tool included with BLAST, replacing `db_name.fasta` with the name of the fasta file you downloaded and `output_name` with the name of the gene (use exactly these spellings and capitalization: matK, rbcL, psbA-trnH, ITS):
  `makeblastdb -in <db_name.fasta> -dbtype nucl -parse_seqids -out <output_name>`
You can check that the database was created correctly by requesting information about it:
  `blastdbcmd -db <output_name> -info`
- J. Selz, N. R. Adam, C. E. M. Magrini, F. M. Montandon, S. Buerki, and S. J. Maerkl, ‘A field-capable rapid plant DNA extraction protocol using microneedle patches for botanical surveying and monitoring’, Appl Plant Sci, vol. 11, no. 3, p. e11529, 2023, doi: 10.1002/aps3.11529.
- A. Hakimzadeh et al., ‘A pile of pipelines: An overview of the bioinformatics software for metabarcoding data analyses’, Molecular Ecology Resources, vol. n/a, no. n/a, doi: 10.1111/1755-0998.13847.
- R. Vaser, I. Sović, N. Nagarajan, and M. Šikić, ‘Fast and accurate de novo genome assembly from long uncorrected reads’, Genome Res, vol. 27, no. 5, pp. 737–746, May 2017, doi: 10.1101/gr.214270.116.
- M. Menegon et al., ‘On site DNA barcoding by nanopore sequencing’, PLOS ONE, vol. 12, no. 10, p. e0184741, Oct. 2017, doi: 10.1371/journal.pone.0184741.
- A. Peña-Albert, E. Ordonneau, ‘Optimizing Computational Processes and Offline Operations in a Bio-Informatics Nanopore Sequencing Pipeline’, https://github.com/Awe-n/genorobotics-semester-project/blob/master/GenoRobotics_Project_Summary.pdf


