TODO: some words of intro on the rationale behind using OrthoDB
As a prerequisite, install HMMER and the HH-suite.
You will also need to download the OrthoDB database of orthologous genes (8.5 GB),
a FASTA file of all species and proteins in the OrthoDB orthologous groups.
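Before running the scripts, it can help to confirm that the prerequisites are on your PATH. The sketch below assumes that HMMER is represented by the hmmsearch executable and the HH-suite by hhfilter; the executables the scripts actually call may differ, so adjust the names as needed. missing_tools is a hypothetical helper and is not part of this repository.

```shell
# Print the name of every requested tool that is not found on PATH.
# Hypothetical helper, not part of this repository.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Assumed tool names: hmmsearch (HMMER) and hhfilter (HH-suite).
missing="$(missing_tools hmmsearch hhfilter)"
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing prerequisites:" $missing >&2
else
    echo "All prerequisites found."
fi
```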
For an input FASTA file of proteins, all from the same species, containing both the UniProt IDs and the AA sequences,
and for a chosen taxonomic level of orthology (set by default to bacteria), the following command:
- returns the mapping between the UniProt IDs and the OrthoDB IDs (saved as species_ids.txt);
- downloads the FASTA files of orthologous proteins (saved as uniprot_orthologs.fasta).
bash download_orthologs.sh -f input_file.fasta -o orthologs -d odb10v1_all_og_fasta.tab -s target_species
download_orthologs.sh has flags that specify:
- -f: the input FASTA file. AA sequences must all be from the same species.
- -o: the output directory where the FASTA files of orthologs are saved.
- -d: the database searched.
- -s: the target species, e.g. 'ecoli' (Escherichia coli). Other current options are 'pprotegens' (Pseudomonas protegens) and 'pputida' (Pseudomonas putida).
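Since the input file must carry a UniProt ID for every sequence, a quick check of the headers before running download_orthologs.sh may save a failed run. The sketch below assumes the standard UniProt header layout (>db|ACCESSION|ENTRY_NAME ...); the exact header format the script expects is not documented here, so treat this as an illustration. list_accessions and count_sequences are hypothetical helpers, and the demo records are toy examples with truncated sequences.

```shell
# Print the accession from each header of the form >db|ACCESSION|ENTRY_NAME ...
# Headers without at least two '|' separators are skipped.
list_accessions() {
    awk -F'|' '/^>/ && NF >= 3 { print $2 }' "$1"
}

# Count the sequence records (header lines) in a FASTA file.
count_sequences() {
    grep -c '^>' "$1"
}

# Toy input for illustration only.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
>sp|P0A7G6|RECA_ECOLI Protein RecA
MAIDENKQKA
>sp|P0A6F5|CH60_ECOLI Chaperonin GroEL
MAAKDVKFGN
EOF

list_accessions "$demo"
echo "records: $(count_sequences "$demo")"
rm -f "$demo"
```

If the number of accessions printed is smaller than the number of records, some headers do not follow the assumed layout.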
Build the MSAs with the following command:
bash generate_msa.sh -f species_ids.txt -o msa -r orthologs -c 90 -i 90
generate_msa.sh has flags that determine:
- -f: the input file, which is the mapping between the UniProt IDs and the OrthoDB IDs.
- -o: the output directory where the MSAs are saved.
- -r: the directory where the FASTA files of orthologs are saved.
- -c: the percentage of sequence coverage [optional, default=90].
- -i: the percentage of sequence identity [optional, default=90].
- -d: keep at least this many sequences in each block of length 50 [optional, default=512].
- -s: skip the final filtering step (hhfilter).
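After generate_msa.sh finishes, one way to see how strongly the coverage, identity, and diversity settings filtered each alignment is to count the sequences left per MSA. The sketch below assumes the output files are FASTA-like (FASTA or A3M), where every sequence starts with a '>' header line; the actual output format is not stated here. count_msa_sequences is a hypothetical helper and is not part of this repository.

```shell
# Print "<file name><TAB><number of sequences>" for each file in a directory.
# Assumes FASTA-like alignments in which each sequence starts with '>'.
count_msa_sequences() {
    for msa in "$1"/*; do
        [ -f "$msa" ] || continue
        printf '%s\t%s\n' "$(basename "$msa")" "$(grep -c '^>' "$msa")"
    done
}
```

With the output directory used in the command above, this would be called as count_msa_sequences msa.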
Ortho is released as part of the GENTANGLE pipeline (LLNL-CODE-845475) and is distributed under the terms of the MIT License (see LICENSE).
SPDX-License-Identifier: MIT
This work is supported by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, Lawrence Livermore National Laboratory Secure Biosystems Design SFA “From Sequence to Cell to Population: Secure and Robust Biosystems Design for Environmental Microorganisms”. Work at LLNL is performed under the auspices of the U.S. Department of Energy at Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
If you use ORTHO in your research, please cite the following papers. Thanks!
Allen JE, et al. GENTANGLE: integrated computational design of gene entanglements. In preparation. 2022.
Blazejewski T, Ho HI, Wang HH. Synthetic sequence entanglement augments stability and containment of genetic information in cells. Science. 2019 Aug 9;365(6453):595-8. https://doi.org/10.1126/science.aav5477