Genorobotic's participation to the Lemanic Life Sciences Hackathon of April 2024

GenoRobotics-EPFL/Lemanic-Hackathon

Lemanic-Hackathon

This is the working repository for Genorobotics' participation in the Lemanic Life Sciences Hackathon of April 2024.

Genorobotics in brief

Genorobotics aims to develop field tools for plant biodiversity identification based on DNA barcoding. This eliminates the need to transport biological samples to labs for testing and aims to support biodiversity conservation efforts.

Four genes are used to identify the species of a sample: matK, trnH-psbA, ITS and rbcL. These genes are amplified through Polymerase Chain Reaction (PCR) and sequenced with an Oxford Nanopore portable sequencer.

The goal of the bioinformatics team is to interpret the data generated by the sequencer. First, the raw DNA reads from a fastq file need to be aligned to generate a consensus sequence for each gene. Then, each consensus sequence is compared to GenBank, NCBI's database of genetic sequences, using BLASTn. Finally, the results of the BLASTn queries for the four genes are combined to predict the species.
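As a rough illustration of the last step, the sketch below combines per-gene BLASTn hits into a single species ranking. The scoring rule (summing the best bit score per species across genes), the function name `predict_species`, and the example values are all assumptions made for illustration; they are not taken from the pipeline.

```python
from collections import defaultdict


def predict_species(hits_per_gene):
    """Combine per-gene BLASTn hits into one ranking of candidate species.

    hits_per_gene maps a gene name to a list of (species, bit_score) tuples.
    For each gene only the best score per species is kept, then scores are
    summed across genes (a simple, hypothetical scoring rule).
    """
    totals = defaultdict(float)
    for gene, hits in hits_per_gene.items():
        best = {}
        for species, score in hits:
            best[species] = max(score, best.get(species, 0.0))
        for species, score in best.items():
            totals[species] += score
    # Highest combined score first; ties broken alphabetically for stability.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Made-up hits, for illustration only (not real BLASTn output).
example_hits = {
    "matK": [("Solanum lycopersicum", 1400.0), ("Solanum tuberosum", 1380.0)],
    "rbcL": [("Solanum tuberosum", 1100.0), ("Solanum lycopersicum", 1100.0)],
    "psbA-trnH": [("Solanum lycopersicum", 600.0)],
    "ITS": [("Solanum lycopersicum", 900.0), ("Solanum tuberosum", 650.0)],
}
ranking = predict_species(example_hits)
print(ranking[0][0])
```

Summing scores favours species that are supported by several genes at once, which is one plausible way to break ties when a single gene cannot separate close relatives.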

The project

The project revolves around DNA sequence clustering. Clustering can serve two purposes; the project focuses on the first:

  • Demultiplexing without additional barcodes: In the summer expedition, multiple genes are sequenced for every plant, all within one sample, using the same barcode. This reduces the number of on-site pipetting steps and the number of tubes used. It also simplifies the protocol, reducing manipulation errors. However, this means that reads from several different genes end up in the same file. This is problematic for consensus sequence generation, as no single sequence will emerge. The reads need to be separated according to their gene of origin.

  • Increasing coverage depth by directionalizing reads: Even when only one gene is sequenced, there are four kinds of reads in one file. This is because PCR amplifies both the coding and non-coding strands, and the direction in which a strand enters the sequencer is random. The four orientations are therefore: coding 5', coding 3', non-coding 5', non-coding 3'. Bringing all reads into a common orientation lets every read contribute to the same consensus.
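Both ideas can be sketched with a simple k-mer comparison, assuming reference sequences are available. The code below tries four candidate orientations of a read against each reference and keeps the gene and orientation that share the most k-mers. Mapping the four orientations to forward, reverse, complement and reverse complement is an assumption, as are the function names and the toy sequences; this is not the clustering method developed in the project.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A<->T, C<->G)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def kmers(seq, k=5):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orientations(seq):
    """Four candidate orientations of a read (an assumed interpretation)."""
    rc = reverse_complement(seq)
    return {
        "forward": seq,
        "reverse": seq[::-1],
        "complement": rc[::-1],  # reversing the reverse complement
        "reverse_complement": rc,
    }


def assign_read(read, references, k=5):
    """Return (gene, orientation, shared_kmers) for the best-matching reference."""
    best = (None, None, 0)
    for gene, ref in references.items():
        ref_kmers = kmers(ref, k)
        for name, oriented in orientations(read).items():
            shared = len(kmers(oriented, k) & ref_kmers)
            if shared > best[2]:
                best = (gene, name, shared)
    return best


# Toy references, made up for illustration only.
references = {
    "geneA": "ATGGCGTACGTTAGCCTAGGATCCGATTACA",
    "geneB": "TTGACCCGGGAAATTTCCCGGGTTTAAACCC",
}
read = reverse_complement(references["geneA"][3:25])
print(assign_read(read, references))
```

Exact k-mer matching is fragile at nanopore error rates, so a real implementation would likely need short k-mers or an alignment-based score.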

The datasets

The datasets used are the sequencing results of multiple Genorobotics expeditions. More information on the extraction method used for each expedition can be found in the associated general_info.csv file:

  • demultiplexed data: Every sample contains reads from only one species and one gene. The only demultiplexed expedition is the one to Lausanne's botanical garden, consisting of 10 samples covering 5 different species and the 4 barcoding genes.
  • multiplexed data: Every sample contains reads from only one species but up to 4 different genes. All of the remaining expeditions are multiplexed (e.g. the summer expedition, consisting of 12 samples of 12 different plants).

data organization

  • Each sub-folder represents a sample and spells out the sample's species, genes sequenced and barcode used.
  • Each sub-folder contains a fastq file with the raw reads and a fasta file with the species' reference sequences, taken from the GenBank database, for the genes amplified.
  • Three csv files, general_info.csv, primer_info.csv and sample_info.csv, contain information about the expedition, the primers used and the species/genes for each sample, respectively.
  • The tomato folder is the only one that does not follow this organization; it is simply a compilation of sequencing runs from commercial tomato samples.
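The layout described above can be walked with a few lines of Python. This is a minimal sketch: it assumes the csv files sit next to the sample sub-folders and that the files use the `.fastq` and `.fasta` extensions, neither of which has been checked against the repository. The function name `list_samples` is hypothetical.

```python
from pathlib import Path


def list_samples(expedition_dir):
    """Map each sample sub-folder name to its FASTQ and FASTA files."""
    samples = {}
    for folder in sorted(Path(expedition_dir).iterdir()):
        if not folder.is_dir():
            continue  # skip plain files such as the csv metadata
        samples[folder.name] = {
            "fastq": sorted(folder.glob("*.fastq")),  # assumed extension
            "fasta": sorted(folder.glob("*.fasta")),  # assumed extension
        }
    return samples
```

Calling `list_samples` on an expedition folder would return one entry per sample, which makes it easy to spot a sub-folder with a missing reads or reference file.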

data generation

  • To explore the effect of the number of reads and the rate of sequencing errors on the clustering methods we will develop, fake fastq databases can be generated from a reference sequence.

  • Additionally, real demultiplexed fastq files can be joined together while keeping track of their file of origin, to simulate the effect of multiplexing on real data.
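Both data generation ideas can be sketched in a few lines. The code below is an illustration only, not the repository's generator: it simulates substitution errors only (real nanopore reads also contain insertions and deletions), it uses a constant placeholder quality string, and the names `simulate_reads` and `merge_with_origin` are hypothetical.

```python
import random


def simulate_reads(reference, n_reads, error_rate, seed=0):
    """Generate FASTQ records by copying a reference with substitution errors."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        bases = []
        for base in reference:
            if rng.random() < error_rate:
                # Replace the base with one of the three other nucleotides.
                base = rng.choice([b for b in "ACGT" if b != base])
            bases.append(base)
        seq = "".join(bases)
        # Constant placeholder quality ('I' is Phred 40 in Sanger encoding).
        records.append(f"@sim_read_{i}\n{seq}\n+\n{'I' * len(seq)}\n")
    return records


def merge_with_origin(records_by_source, seed=0):
    """Shuffle reads from several sources together, keeping their origin.

    Returns a list of (source_name, fastq_record) tuples.
    """
    merged = [(source, rec)
              for source, records in records_by_source.items()
              for rec in records]
    random.Random(seed).shuffle(merged)
    return merged
```

Keeping the source label next to each record gives a ground truth against which a clustering result can be scored. Fixing the seed makes a simulated dataset reproducible across runs.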

Setup

Requirements

  1. Conda 4.8.3
  2. BLAST 2.X.0
  3. WSL 2 (Windows only)

Notes for Windows Users

It is highly recommended to install BLASTn on WSL rather than directly on Windows, and to run the pipeline using WSL. The pipeline was developed and tested on a Windows 11 machine with WSL 2 (Windows Subsystem for Linux). For more information about WSL 2, visit Microsoft's official documentation.

If you choose to install BLASTn on Windows, you can still run the pipeline in Windows Powershell. Make sure to set the 'windows' parameter to True when calling the .py scripts or in the .ipynb notebooks. Note that commands will still launch in a WSL terminal, so WSL installation remains a requirement.

Notes for macOS Users

  • A very useful tutorial to install BLASTn on macOS can be found here.
  • In your .bash_profile, place the new lines regarding BLAST at the beginning of the file.
  • The run_command and run_bash_command functions should work on macOS, but they may encounter issues. If you hit bugs when running minimap2, racon, or BLASTn commands, check these two functions first.

How to Install the Required Python Environment

The pipeline is designed to run within an Anaconda environment. The required environment configuration is specified in the file genorobotics_pipeline.yml.

Setting Up the Environment

  1. Create the Anaconda Environment:

    • Open your terminal and navigate to the project's folder.
    • Run the command conda env create -f genorobotics_pipeline.yml. This will create an environment named genorobotics_pipeline. (Ensure Anaconda is installed beforehand.)
  2. Activate the Environment:

    • In your terminal, type conda activate genorobotics_pipeline to activate the newly created environment.

Installing BLASTn on Unix

  1. Download BLAST: Visit the NCBI BLAST FTP site and choose the appropriate BLAST+ package for your system (look for a file ending in -x64-linux.tar.gz for Linux or -x64-macosx.tar.gz for macOS).

  2. Download via Terminal: Open your terminal and use wget or curl to download the BLAST+ package. Replace the URL with the correct version:

    wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.X.0+-x64-linux.tar.gz
    # or
    curl -O ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.X.0+-x64-linux.tar.gz
  3. Extract the Package:

    • Use the following command to extract the tarball, using the name of the file you downloaded:
      tar -zxvf ncbi-blast-2.X.0+-x64-linux.tar.gz
    • This will create a directory with the BLAST+ executables.
  4. Add BLAST to Your PATH:

    • To run BLASTn from any location, you need to add it to your PATH. Replace /path/to/ncbi-blast-2.X.0+/bin with the actual path to the bin directory inside the extracted folder.
      export PATH=$PATH:/path/to/ncbi-blast-2.X.0+/bin
    • You also need to add the db folder to your BLASTDB environment variable. Replace /path/to/ncbi-blast-2.X.0+/db with the actual path to the db directory inside the extracted folder.
      export BLASTDB=$BLASTDB:/path/to/ncbi-blast-2.X.0+/db
    • Add these lines to your .bashrc or .bash_profile (or the equivalent for your shell) to make this change permanent.
  5. Test the Installation:

    • To ensure BLASTn is installed correctly, run:
      blastn -version
    • This should return the installed version of BLASTn.
  6. Update your System (Optional):

    • Ensure your system's package list is updated and all dependencies are satisfied:
      sudo apt-get update      # For Debian/Ubuntu
      sudo yum update          # For CentOS/RedHat
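Before launching the pipeline, the PATH and BLASTDB settings can also be checked from Python. The helper below is a hypothetical sketch (the name `check_blast_setup` is not part of the repository); it relies only on the standard library and does not run BLAST itself.

```python
import os
import shutil


def check_blast_setup(env=None):
    """Report whether blastn is on PATH and which BLASTDB directories exist."""
    env = os.environ if env is None else env
    blastn_path = shutil.which("blastn", path=env.get("PATH"))
    db_dirs = [d for d in env.get("BLASTDB", "").split(os.pathsep) if d]
    return {
        "blastn": blastn_path,  # None if blastn is not found on PATH
        "blastdb_dirs": db_dirs,
        "missing_dirs": [d for d in db_dirs if not os.path.isdir(d)],
    }


if __name__ == "__main__":
    print(check_blast_setup())
```

A `None` value for `blastn` or a non-empty `missing_dirs` list points to a path that was mistyped or not exported in the current shell.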

How to Install BLASTn on Windows

To install BLASTn on Windows, follow these steps carefully:

  1. Download BLAST+ Executables:

    • Visit the NCBI BLAST download page.
    • Look for the Windows version of the BLAST+ executables (e.g., ncbi-blast-2.12.0+-win64.exe). Click to download the installer.
  2. Install BLAST:

    • After downloading, run the .exe installer.
    • Follow the installation wizard, which will guide you through the setup process, including the installation location for BLAST.
  3. Unzip the Downloaded File:

    • If the downloaded BLAST+ package is a zip file, unzip it to your desired location.
  4. Set Up Environment Variables:

    • Add the path to the bin folder of the BLAST installation to the PATH environment variable.
    • Create a new environment variable named BLASTDB and set its value to the path of the db folder in your BLAST installation.
    • To set these environment variables:
      • Right-click on 'This PC' or 'Computer' on the desktop or in File Explorer, then select 'Properties'.
      • Click 'Advanced system settings' and then the 'Environment Variables' button.
      • Under 'System variables', find and select the PATH variable, then click 'Edit' to add the BLAST bin directory.
      • Click 'New' to create the BLASTDB variable and set its value to the BLAST db directory.
      • Click 'OK' to save changes and close all dialogs.
  5. Test the Installation:

    • Open Command Prompt and type blastn -version.
    • This should display the installed version of BLASTn, confirming the successful installation.

Note: The installation process may vary slightly based on the version of the BLAST+ executables. Always follow the instructions provided with the downloaded package. If you encounter any issues, ensure that the paths in your environment variables are correct.

How to create the reference databases for species identification

Manual Download

BLASTn is run locally as an alignment tool to find the best match for our sample sequence in a database of known genetic sequences. This database must be downloaded manually from GenBank for each of the four genes of interest.

Follow these steps:

  • Go to the website: https://www.ncbi.nlm.nih.gov/nuccore
  • For each of the four genes (matK, rbcL, psbA-trnH, Internal Transcribed Spacer):
    • Click on the Advanced search option (under the search bar).
    • In the first bar, type the name of the gene. In the second bar, replace All Fields with Sequence Length and enter the range of sequence lengths to download (e.g. 750:1500). The ranges proposed here provide a good basis for species identification while keeping the database size manageable:
      • matK: 750 to 1500 -> ~110k sequences, ~90 MB
      • rbcL: 600 to 1000 -> ~90k sequences, ~80 MB
      • psbA-trnH: 400 to 800 -> ~60k sequences, ~40 MB
      • ITS: 1000 to 35000

Note: Be sure to search for "Internal Transcribed Spacer" instead of "ITS" to get results.

  • Click Search, then Send to (top right corner) > Complete Record > File > Format = FASTA > Create File.
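The four searches can also be written down as query strings, which makes the chosen length ranges easy to review and reuse. The sketch below only builds the strings. The exact syntax produced by the Advanced search builder may differ, and the use of the `[SLEN]` field tag with quoted terms is an assumption about how the same search is expressed as text.

```python
# Search terms and length ranges as listed in the steps above.
GENE_QUERIES = {
    "matK": ("matK", 750, 1500),
    "rbcL": ("rbcL", 600, 1000),
    "psbA-trnH": ("psbA-trnH", 400, 800),
    "ITS": ("Internal Transcribed Spacer", 1000, 35000),
}


def build_query(term, min_len, max_len):
    """Build a Nucleotide search string restricted to a sequence length range."""
    return f'"{term}"[All Fields] AND {min_len}:{max_len}[SLEN]'


for gene, (term, min_len, max_len) in GENE_QUERIES.items():
    print(gene, "->", build_query(term, min_len, max_len))
```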

Creating the BLASTn databases

  • Move the downloaded fasta file to the /db folder of your BLASTn installation directory
  • Open a terminal and navigate to that /db directory using cd.
  • Use the makeblastdb command-line tool included with BLAST, replacing db_name.fasta with the name of the fasta file you downloaded and output_name with the name of the corresponding gene (use exactly these spellings and capitalization: matK, rbcL, psbA-trnH, ITS):
makeblastdb -in <db_name.fasta> -dbtype nucl -parse_seqids -out <output_name>

You can check that the database was created correctly by requesting information about it, using the same name you passed to -out:

blastdbcmd -db <output_name> -info
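Since the same command is repeated for four genes, it can be scripted. The sketch below builds the makeblastdb argument lists with the flags shown above and only runs them when `dry_run=False`. The function names and the fasta file names passed to them are hypothetical, and running the commands assumes makeblastdb is on your PATH and that the script is started from the db folder.

```python
import subprocess

# Database names expected by the pipeline (exact spelling and capitalization).
GENES = ["matK", "rbcL", "psbA-trnH", "ITS"]


def makeblastdb_command(fasta_path, gene):
    """Return the makeblastdb argument list for one gene database."""
    if gene not in GENES:
        raise ValueError(f"unexpected database name: {gene}")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl",
            "-parse_seqids", "-out", gene]


def build_all(fasta_by_gene, dry_run=True):
    """Build (or, with dry_run=True, just list) the commands for every gene."""
    commands = [makeblastdb_command(path, gene)
                for gene, path in fasta_by_gene.items()]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises if makeblastdb fails
    return commands
```

Rejecting unexpected names early catches a misspelling such as matk before a database is created under the wrong name.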

References

  • J. Selz, N. R. Adam, C. E. M. Magrini, F. M. Montandon, S. Buerki, and S. J. Maerkl, ‘A field-capable rapid plant DNA extraction protocol using microneedle patches for botanical surveying and monitoring’, Appl Plant Sci, vol. 11, no. 3, p. e11529, 2023, doi: 10.1002/aps3.11529.
  • A. Hakimzadeh et al., ‘A pile of pipelines: An overview of the bioinformatics software for metabarcoding data analyses’, Molecular Ecology Resources, vol. n/a, no. n/a, doi: 10.1111/1755-0998.13847.
  • R. Vaser, I. Sović, N. Nagarajan, and M. Šikić, ‘Fast and accurate de novo genome assembly from long uncorrected reads’, Genome Res, vol. 27, no. 5, pp. 737–746, May 2017, doi: 10.1101/gr.214270.116.
  • M. Menegon et al., ‘On site DNA barcoding by nanopore sequencing’, PLOS ONE, vol. 12, no. 10, p. e0184741, Oct. 2017, doi: 10.1371/journal.pone.0184741.
  • A. Peña-Albert, E. Ordonneau, ‘Optimizing Computational Processes and Offline Operations in a Bio-Informatics Nanopore Sequencing Pipeline’, https://github.com/Awe-n/genorobotics-semester-project/blob/master/GenoRobotics_Project_Summary.pdf
