Lemanic-Hackathon

This is the working repository for Genorobotics' participation in the Lemanic Life Sciences Hackathon of April 2024.

Genorobotics in brief

Genorobotics aims to develop field tools for plant biodiversity identification based on DNA barcoding. This eliminates the need to transport biological samples to labs for testing and aims to support biodiversity conservation efforts.

Four genes are used to identify the species of a sample: matK, trnH-psbA, ITS and rbcL. These genes are amplified through Polymerase Chain Reaction (PCR) and sequenced with an Oxford Nanopore portable sequencer.

The goal of the bioinformatics team is to interpret the data generated by the sequencer. First, the raw DNA reads from a fastq file are aligned to generate a consensus sequence for each gene. Then, each consensus sequence is compared with BLASTn to GenBank, NCBI's database of genetic sequences. Finally, the results of the BLASTn queries for the four genes are combined to predict the species.

The project

The project revolves around DNA sequence clustering. Clustering can serve two purposes; the project will focus on the first:

Demultiplexing without additional barcodes: In the summer expedition, multiple genes are sequenced for every plant, all within one sample, using the same barcode. This reduces the number of pipetting steps on site and the number of tubes used. It also simplifies the protocol, reducing manipulation errors. However, this means that reads from several different genes end up in the same file. This is problematic for consensus sequence generation, as no single sequence will emerge. The reads need to be separated according to their gene of origin.
Increasing coverage depth by directionalizing reads: Even when only one gene is sequenced, there are four kinds of reads in one file. This is because PCR amplifies both the coding and non-coding strands, and the direction in which the sequence is inserted in the sequencer is random. The four orientations are then: coding 5', coding 3', non-coding 5', non-coding 3'.
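
To make the four orientations concrete, here is a minimal Python sketch. It assumes that the four kinds of reads correspond to a sequence, its reverse, its complement and its reverse complement; this mapping is our interpretation and the function names are hypothetical, not part of the pipeline.

```python
# Complement table for the four canonical DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def orientations(seq):
    """Return the four string forms in which one amplicon could appear (assumed mapping)."""
    comp = seq.translate(COMPLEMENT)
    return {
        "forward": seq,
        "reverse": seq[::-1],
        "complement": comp,
        "reverse_complement": comp[::-1],
    }


def canonical(seq):
    """Pick one representative so that all four orientations map to the same string."""
    return min(orientations(seq).values())


if __name__ == "__main__":
    for name, s in orientations("AACG").items():
        print(name, s, "->", canonical(s))
```

Mapping every read to a canonical form like this only works for error-free sequences; on noisy nanopore reads, directionalizing would more likely rely on aligning each read in its different orientations and keeping the best one.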

The datasets

The datasets used are the sequencing results of multiple Genorobotics expeditions. More information about the extraction method used for each expedition can be found in the associated general_info.csv file:

Demultiplexed data: Every sample contains reads from only one species and one gene. The only demultiplexed expedition is the one to Lausanne's botanical garden, consisting of 10 samples of 5 different species and the 4 barcoding genes.
Multiplexed data: Every sample contains reads from only one species but up to 4 different genes. All of the remaining expeditions are multiplexed (e.g. the summer expedition, consisting of 12 samples of 12 different plants).

Data organization

Each subfolder represents a sample, and its name spells out the sample's species, the genes sequenced and the barcode used.
Each subfolder contains a fastq file with the raw reads and a fasta file with the species' reference sequences from the GenBank database for the amplified genes.
Three csv files, general_info.csv, primer_info.csv and sample_info.csv, contain information about the expedition, the primers used and the species/genes for each sample, respectively.
The tomato folder is the only one that does not follow this organization; it is simply a compilation of sequencing runs from commercial tomato samples.
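
The layout described above can be explored with a short Python sketch. It assumes that the three csv files sit next to the sample subfolders and that read and reference files use the usual .fastq/.fq and .fasta/.fa extensions; the function name list_samples is hypothetical.

```python
from pathlib import Path

# Assumed file extensions; adjust if the data uses other ones.
FASTQ_SUFFIXES = (".fastq", ".fq")
FASTA_SUFFIXES = (".fasta", ".fa")


def list_samples(expedition_dir):
    """Map each sample subfolder name to its fastq and fasta files."""
    samples = {}
    for sub in sorted(Path(expedition_dir).iterdir()):
        if not sub.is_dir():
            continue  # skip files such as general_info.csv
        files = [f for f in sub.iterdir() if f.is_file()]
        samples[sub.name] = {
            "fastq": sorted(f for f in files if f.suffix in FASTQ_SUFFIXES),
            "fasta": sorted(f for f in files if f.suffix in FASTA_SUFFIXES),
        }
    return samples
```

Calling list_samples on an expedition folder returns one entry per sample; a sample with an empty "fastq" or "fasta" list does not follow the expected organization.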

Data generation

To explore how the number of reads and the sequencing error rate affect the clustering methods we will develop, fake fastq datasets can be generated from a reference sequence.
Additionally, real demultiplexed fastq files can be joined together while keeping track of their file of origin, to simulate the effect of multiplexing on real data.
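
Both ideas can be sketched in a few lines of Python. This is an illustrative sketch and not the code in create_fake_dataset.py: the error model (uniform substitutions, insertions and deletions), the constant quality string and all function names are assumptions, and fastq records are assumed to span exactly four lines.

```python
import random

BASES = "ACGT"


def mutate(seq, error_rate, rng):
    """Apply substitutions, insertions and deletions at roughly error_rate per base."""
    out = []
    for base in seq:
        if rng.random() >= error_rate:
            out.append(base)
            continue
        kind = rng.choice(("substitution", "insertion", "deletion"))
        if kind == "substitution":
            out.append(rng.choice([b for b in BASES if b != base]))
        elif kind == "insertion":
            out.append(base)
            out.append(rng.choice(BASES))
        # deletion: the base is simply dropped
    return "".join(out)


def fake_fastq(reference, n_reads, error_rate, seed=0):
    """Return fastq-formatted text with n_reads noisy copies of reference."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        read = mutate(reference, error_rate, rng)
        quality = "I" * len(read)  # placeholder: constant quality score
        records.append(f"@fake_read_{i}\n{read}\n+\n{quality}\n")
    return "".join(records)


def merge_with_origin(fastq_texts):
    """Concatenate fastq texts, tagging each read header with its file of origin."""
    merged = []
    for origin, text in fastq_texts.items():
        lines = text.splitlines()
        for i in range(0, len(lines) - 3, 4):
            header, seq, plus, qual = lines[i:i + 4]
            merged.append(f"{header} origin={origin}\n{seq}\n{plus}\n{qual}\n")
    return "".join(merged)
```

For example, merge_with_origin({"matK.fastq": fake_fastq(ref_a, 100, 0.05), "rbcL.fastq": fake_fastq(ref_b, 100, 0.05)}) would produce a simulated multiplexed file in which the true gene of every read is still known, so that clustering results can be scored (ref_a and ref_b being reference sequences of your choice).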

Setup

Requirements

Conda 4.8.3
BLAST 2.X.0
WSL 2 (Windows only)

Notes for Windows Users

It is highly recommended to install BLASTn on WSL rather than directly on Windows, and to run the pipeline using WSL. The pipeline was developed and tested on a Windows 11 machine with WSL 2 (Windows Subsystem for Linux). For more information about WSL 2, visit Microsoft's official documentation.

If you choose to install BLASTn on Windows, you can still run the pipeline in Windows Powershell. Make sure to set the 'windows' parameter to True when calling the .py scripts or in the .ipynb notebooks. Note that commands will still launch in a WSL terminal, so WSL installation remains a requirement.

Notes for macOS Users

A very useful tutorial to install BLASTn on macOS can be found here.
In your .bash_profile, place the new lines regarding BLAST at the beginning of the file.
The run_command and run_bash_command functions should work correctly on macOS, but they have been less thoroughly tested there. If you encounter bugs when running minimap2, racon, or BLASTn commands, check these two functions first.

How to Install the Required Python Environment

The pipeline is designed to run within an Anaconda environment. The required environment configuration is specified in the file genorobotics_pipeline.yml.

Setting Up the Environment

Create the Anaconda Environment:
- Open your terminal and navigate to the project's folder.
- Run the command conda env create -f genorobotics_pipeline.yml. This will create an environment named genorobotics_pipeline. (Ensure Anaconda is installed beforehand.)
Activate the Environment:
- In your terminal, type conda activate genorobotics_pipeline to activate the newly created environment.

Installing BLASTn on Unix

Download BLAST: Visit the NCBI BLAST FTP site and choose the appropriate BLAST+ package for your system (look for a file ending in -x64-linux.tar.gz for Linux or -x64-macosx.tar.gz for macOS).

Download via Terminal: Open your terminal and use wget or curl to download the BLAST+ package. Replace the URL with the correct version:

wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.X.0+-x64-linux.tar.gz
# or
curl -O ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.X.0+-x64-linux.tar.gz

Extract the Package:
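- Use the following command to extract the tarball, adjusting the file name to the version you downloaded (2.12.0 is used as an example in the rest of this section):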
```
tar -zxvf ncbi-blast-2.12.0+-x64-linux.tar.gz
```
- This will create a directory with the BLAST+ executables.
Add BLAST to Your PATH:
- To run BLASTn from any location, you need to add it to your PATH. Replace /path/to/ncbi-blast-2.12.0+/bin with the actual path to the bin directory inside the extracted folder.
```
export PATH=$PATH:/path/to/ncbi-blast-2.12.0+/bin
```
- You also need to add the db folder to your BLASTDB environment variable. Replace /path/to/ncbi-blast-2.12.0+/db with the actual path to the db directory inside the extracted folder.
```
export BLASTDB=$BLASTDB:/path/to/ncbi-blast-2.12.0+/db
```
- Add these two export lines to your .bashrc or .bash_profile (or the equivalent for your shell) to make the changes permanent.
Test the Installation:
- To ensure BLASTn is installed correctly, run:
```
blastn -version
```
- This should return the installed version of BLASTn.
Update your System (Optional):
- Ensure your system's package list is updated and all dependencies are satisfied:
```
sudo apt-get update      # For Debian/Ubuntu
sudo yum update          # For CentOS/RedHat
```
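
The installation test and the two environment variables can also be checked from Python, which is convenient before launching the pipeline. This is a minimal sketch; check_blast_setup is a hypothetical helper and not part of the repository.

```python
import os
import shutil
import subprocess


def check_blast_setup():
    """Report whether blastn is on PATH and which BLASTDB directories exist."""
    blastn_path = shutil.which("blastn")
    version = None
    if blastn_path is not None:
        # Same check as running `blastn -version` in a terminal.
        result = subprocess.run(
            [blastn_path, "-version"], capture_output=True, text=True, check=False
        )
        version = result.stdout.strip() or None
    # BLASTDB may list several directories separated by os.pathsep (":" on Unix).
    db_dirs = [d for d in os.environ.get("BLASTDB", "").split(os.pathsep) if d]
    return {
        "blastn_path": blastn_path,
        "version": version,
        "blastdb_dirs": db_dirs,
        "missing_db_dirs": [d for d in db_dirs if not os.path.isdir(d)],
    }


if __name__ == "__main__":
    for key, value in check_blast_setup().items():
        print(f"{key}: {value}")
```

A blastn_path of None means the bin directory is not on your PATH; a non-empty missing_db_dirs list means BLASTDB points to a folder that does not exist.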

How to Install BLASTn on Windows

To install BLASTn on Windows, follow these steps carefully:

Download BLAST+ Executables:
- Visit the NCBI BLAST download page.
- Look for the Windows version of the BLAST+ executables (e.g., ncbi-blast-2.12.0+-win64.exe). Click to download the installer.
Install BLAST:
- After downloading, run the .exe installer.
- Follow the installation wizard, which will guide you through the setup process, including the installation location for BLAST.
Unzip the Downloaded File:
- If the downloaded BLAST+ package is a zip file, unzip it to your desired location.
Set Up Environment Variables:
- Add the path to the bin folder of the BLAST installation to the PATH environment variable.
- Create a new environment variable named BLASTDB and set its value to the path of the db folder in your BLAST installation.
- To set these environment variables:
  - Right-click on 'This PC' or 'Computer' on the desktop or in File Explorer, then select 'Properties'.
  - Click 'Advanced system settings' and then the 'Environment Variables' button.
  - Under 'System variables', find and select the PATH variable, then click 'Edit' to add the BLAST bin directory.
  - Click 'New' to create the BLASTDB variable and set its value to the BLAST db directory.
  - Click 'OK' to save changes and close all dialogs.
Test the Installation:
- Open Command Prompt and type blastn -version.
- This should display the installed version of BLASTn, confirming the successful installation.

Note: The installation process may vary slightly based on the version of the BLAST+ executables. Always follow the instructions provided with the downloaded package. If you encounter any issues, ensure that the paths in your environment variables are correct.

How to create the reference databases for species identification

Manual Download

BLASTn is run locally as an alignment tool to find the best match for our sample sequence in a database of known genetic sequences. This database must be downloaded manually from GenBank for each of the four genes of interest.

Follow these steps:

Go to the website: https://www.ncbi.nlm.nih.gov/nuccore
For each of the four genes (matK, rbcL, psbA-trnH, Internal Transcribed Spacer):
- Click on the Advanced search option (under the search bar).
- In the first bar, type the name of the gene. In the second bar, replace All Fields with Sequence Length and enter the range of sequence lengths to download (e.g. 750:1500). The ranges proposed here provide a good basis for species identification while keeping the database size manageable:
  - matK: 750 to 1500 -> ~110k sequences, ~90 MB
  - rbcL: 600 to 1000 -> ~90k sequences, ~80 MB
  - psbA-trnH: 400 to 800 -> ~60k sequences, ~40 MB
  - ITS: 1000 to 35000

Note: Be sure to search for "Internal Transcribed Spacer" instead of "ITS" to get results.

Click on Search, then Send to (top right corner) > Complete Record > File > Format = Fasta > Create File.
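
To keep the four searches consistent, the search terms and length ranges listed above can be kept in one place. The sketch below builds a query string in the style shown by the Advanced search builder; the exact text the website generates may differ slightly, and the nuccore_query helper is hypothetical.

```python
# Search term -> (minimum length, maximum length), as listed above.
GENE_LENGTH_RANGES = {
    "matK": (750, 1500),
    "rbcL": (600, 1000),
    "psbA-trnH": (400, 800),
    "Internal Transcribed Spacer": (1000, 35000),
}


def nuccore_query(search_term, min_len, max_len):
    """Build a search string combining a gene name and a sequence length range."""
    return (
        f'{search_term}[All Fields] AND '
        f'("{min_len}"[Sequence Length] : "{max_len}"[Sequence Length])'
    )


if __name__ == "__main__":
    for term, (low, high) in GENE_LENGTH_RANGES.items():
        print(nuccore_query(term, low, high))
```

Each printed line can be pasted into the search bar in place of filling in the two builder fields by hand.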

Creating the BLASTn databases

Move the downloaded fasta files to the /db folder of your BLASTn installation directory.
Open a terminal and navigate to that /db directory using cd.
For each fasta file, run the makeblastdb command-line tool included with BLAST, replacing db_name.fasta with the name of the fasta file you downloaded and output_name with the name of the corresponding gene (use exactly these spellings and capitalization: matK, rbcL, psbA-trnH, ITS):

makeblastdb -in <db_name.fasta> -dbtype nucl -parse_seqids -out <output_name>

You can check that a database was created correctly by requesting information about it, using the output_name chosen above:

blastdbcmd -db <output_name> -info
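
Since the same makeblastdb command has to be run once per gene, it can be scripted. The following Python sketch builds the commands with the flags shown above and only runs them when dry_run is False; the helper names and the fasta file names in the usage example are hypothetical.

```python
import subprocess

# Database names expected by the pipeline (exact spelling and capitalization).
GENES = ("matK", "rbcL", "psbA-trnH", "ITS")


def makeblastdb_command(fasta_name, gene):
    """Return the makeblastdb argument list for one gene."""
    if gene not in GENES:
        raise ValueError(f"unexpected database name {gene!r}, expected one of {GENES}")
    return [
        "makeblastdb", "-in", fasta_name,
        "-dbtype", "nucl", "-parse_seqids", "-out", gene,
    ]


def build_databases(fasta_by_gene, db_dir, dry_run=True):
    """Build (and optionally run) one makeblastdb command per gene inside db_dir."""
    commands = [makeblastdb_command(fasta, gene) for gene, fasta in fasta_by_gene.items()]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, cwd=db_dir, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical file names; replace them with the files you downloaded.
    for cmd in build_databases({"matK": "matK_sequences.fasta"}, "."):
        print(" ".join(cmd))
```

With dry_run=True the commands are only printed, which makes it easy to review them before creating the databases.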
