Important Note

Each of our Jupyter notebooks takes about an hour to run because:
- the function that chooses the number of clusters (pairwiseCrossValidation) takes about 20 minutes to run
- the function that runs the alignment algorithm (runClustalInRange) takes about 30 minutes to run
Getting the embeddings from DNABERT6 and Finetuned_DNABERT6 takes about 4-6 hours.

For these reasons, we provide all the files needed to reproduce our results in this Drive folder: https://drive.google.com/drive/folders/1KLiCzlLoEf0avWA5S5f40Lqq-E_cfNEX?usp=sharing

Repository Organization

run.ipynb : the model with the best results. The training loop is commented out; you can just load the .pth file.
- We recommend not running the "runClustalInRange" function; the required files are in the Drive folder mentioned above.
helper.py : all our utility functions are in this file; every notebook imports it
plot_cluster.py : this plotting code was provided to us by Antoine Tappy, who is working on a similar project
data_preprocessing.ipynb : all our data preprocessing is in this file; the data we used is in Data.zip
Other_Notebooks : we tried many different models and keep all of our other notebooks here. Some files, such as helper.py, are duplicated there to simplify importing.

Requirements

clustal
biopython
numpy
sklearn
scipy
pytorch
matplotlib
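
Before opening the notebooks, it can help to confirm that the packages above are importable. The sketch below is one way to do that. The import names (`Bio` for biopython, `torch` for pytorch) are the usual ones for these packages; the Clustal executable names (`clustalo`, `clustalw`, `clustalw2`) are assumptions, since this README only says "clustal".

```python
import importlib.util
import shutil

# Import names for the packages listed above (biopython imports as "Bio",
# pytorch as "torch").
REQUIRED_MODULES = ["Bio", "numpy", "sklearn", "scipy", "torch", "matplotlib"]

# Assumed executable names; the README does not say which Clustal variant is used.
CLUSTAL_CANDIDATES = ["clustalo", "clustalw", "clustalw2"]


def missing_modules(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def find_clustal(candidates):
    """Return the path of the first candidate executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing Python packages:", missing or "none")
    print("Clustal executable:", find_clustal(CLUSTAL_CANDIDATES) or "not found")
```

If Clustal is reported as not found but is installed under another name, add that name to `CLUSTAL_CANDIDATES`.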

Data & Pretrained Models:

Data

We got our data from NCBI's website.

The procedure is as follows:

Enter rbcL as the search term
Filter for plants only
Select the sequence length
- rbcL : 600 to 1000 -> ~90k sequences, ~80 MB
Download the records to get both the information and the DNA sequences: click "Send to" (top-right corner) > Complete Record > File > Format = FASTA > Sort by Taxonomy ID
Put the FASTA file into /Data
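
Once the FASTA file is in /Data, the length filter from the procedure above can be re-applied locally as a sanity check. The sketch below uses only the standard library. The file name `Data/rbcL.fasta` is hypothetical, and the actual preprocessing in data_preprocessing.ipynb may work differently (for example, with Biopython).

```python
from pathlib import Path


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=600, max_len=1000):
    """Keep records whose sequence length lies in [min_len, max_len]."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


if __name__ == "__main__":
    fasta_path = Path("Data") / "rbcL.fasta"  # hypothetical file name
    if fasta_path.exists():
        with open(fasta_path, encoding="utf-8") as handle:
            kept = filter_by_length(parse_fasta(handle))
        print(f"{len(kept)} sequences between 600 and 1000 bases")
```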

Pretrained models

We used 2 pretrained models, both from DNABERT:
- Pretrained DNABERT6
- FineTuned DNABERT6
We also provide our trained models as .pth files in our repository.

You can find these models in our drive folder as well

Running run.ipynb

To run run.ipynb, extract Data.zip into a folder with the same name (Data) in the same directory as run.ipynb.
As mentioned in "Important Note", we highly recommend downloading the /clustal, /clusters and /plots folders from our Drive folder to avoid running the notebook for an hour.
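
The setup described above can be checked from Python before opening the notebook. This is a minimal sketch under a few assumptions: it is run from the directory containing run.ipynb, the archive does not already contain a top-level Data folder (if it does, extract into the current directory instead), and the Drive folders are placed next to run.ipynb as clustal, clusters and plots.

```python
import zipfile
from pathlib import Path

# Folder names taken from this README; their location next to run.ipynb is assumed.
EXPECTED_FOLDERS = ["clustal", "clusters", "plots"]


def extract_data(zip_path, target_dir):
    """Extract zip_path into target_dir; do nothing if target_dir already exists."""
    target_dir = Path(target_dir)
    if target_dir.exists():
        return False
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return True


def missing_folders(root, names=EXPECTED_FOLDERS):
    """Return the expected folder names that are not present under root."""
    root = Path(root)
    return [name for name in names if not (root / name).is_dir()]


if __name__ == "__main__":
    root = Path(".")  # assumed to be the directory containing run.ipynb
    if (root / "Data.zip").exists():
        extract_data(root / "Data.zip", root / "Data")
    print("Missing folders:", missing_folders(root) or "none")
```

Any folder reported as missing would have to be downloaded from the Drive folder or regenerated by running the notebook.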

GenoRobotics-EPFL/Primer-Design
