BiosecSFA/ortho

ORTHO

Introduction

TODO: some words of intro on the rationale behind using OrthoDB

Generating the MSA

As a prerequisite, install HMMER and the HH-suite. You will also need to download the database of orthologous genes (8.5 GB), a FASTA file of all species and proteins in the OrthoDB orthologous groups.
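Before running the scripts, it can help to confirm that the required executables are on your PATH. The following is a minimal sketch, not part of this repository: the helper name missing_tools is hypothetical, and which HMMER binary the scripts actually call is an assumption (hmmsearch is used here only as a representative; hhfilter is the HH-suite tool named later in this README).

```shell
#!/usr/bin/env bash
# Print the name of every executable in the argument list that is not on PATH.
# Returns non-zero if at least one is missing.
# (missing_tools is a hypothetical helper, not part of this repository.)
missing_tools() {
    local status=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example call (the tool names are an assumption about what the scripts need):
# missing_tools hmmsearch hhfilter || echo "install the tools listed above" >&2
```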

For an input FASTA file of proteins that all come from the same species, containing both the UniProt IDs and the AA sequences, and for a chosen taxonomic level of orthology (set by default to bacteria), the following command:

  • returns the mapping between the UniProt IDs and OrthoDB IDs (saved as species_ids.txt);
  • downloads the FASTA files of orthologous proteins (saved as uniprot_orthologs.fasta).

bash download_orthologs.sh -f input_file.fasta -o orthologs -d odb10v1_all_og_fasta.tab -s target_species

download_orthologs.sh has flags that specify:

  • -f the input FASTA file. AA sequences must all be from the same species.
  • -o the output directory where the FASTA files of orthologs are saved.
  • -d the database searched.
  • -s the target species, e.g. 'ecoli' (Escherichia coli). The other options currently available are 'pprotegens' (Pseudomonas protegens) and 'pputida' (Pseudomonas putida).
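Because the input FASTA must carry UniProt IDs, a quick header check before the download step can catch malformed entries early. This README does not state the exact header layout that download_orthologs.sh expects, so the sketch below assumes standard UniProt-style headers such as ">sp|P0A7G6|RECA_ECOLI ...". The function names list_accessions and count_bad_headers are hypothetical.

```shell
#!/usr/bin/env bash
# Print the accession (second '|'-separated field) of every header line.
# Assumes UniProt-style headers, e.g. ">sp|P0A7G6|RECA_ECOLI Protein RecA".
list_accessions() {
    awk -F'|' '/^>/ { if (NF >= 3) print $2 }' "$1"
}

# Count the header lines that do not follow the assumed layout.
count_bad_headers() {
    awk -F'|' '/^>/ && NF < 3 { n++ } END { print n + 0 }' "$1"
}

# Example calls:
# list_accessions input_file.fasta | sort | uniq -d   # duplicated accessions, if any
# count_bad_headers input_file.fasta                  # 0 means every header matched
```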

Build the MSAs with the following command:

bash generate_msa.sh -f species_ids.txt -o msa -r orthologs -c 90 -i 90

generate_msa.sh has flags that determine:

  • -f the input file, which is the mapping between the Uniprot ids and OrthoDB ids.
  • -o the output directory where the MSAs are saved.
  • -r the directory where the FASTA files of orthologs are saved.
  • -c the percentage of sequence coverage [optional, default=90].
  • -i the percentage of sequence identity [optional, default=90].
  • -d keep at least this many sequences in each block of length 50 [optional, default=512].
  • -s skip the final filtering step (hhfilter).
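The two steps above can be chained in a small driver script. The sketch below is an illustration only: the run and ortho_pipeline helpers and the DRY_RUN variable are hypothetical conventions of this example, and it uses only the flags and file names documented in this README. It assumes both scripts are run from the repository directory and that species_ids.txt is written to the current directory, as the example commands suggest.

```shell
#!/usr/bin/env bash
# Run a command, or only print it when DRY_RUN=1.
# (DRY_RUN is a convention of this sketch, not a feature of the ORTHO scripts.)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Arguments: input FASTA, OrthoDB database file, target species.
# The second step runs only if the first one succeeds.
ortho_pipeline() {
    local fasta=$1 db=$2 species=$3
    run bash download_orthologs.sh -f "$fasta" -o orthologs -d "$db" -s "$species" &&
    run bash generate_msa.sh -f species_ids.txt -o msa -r orthologs -c 90 -i 90
}

# Example: print the commands without executing them.
# DRY_RUN=1 ortho_pipeline input_file.fasta odb10v1_all_og_fasta.tab ecoli
```

Printing the commands first makes it easy to review paths and thresholds before committing to a long run against the 8.5 GB database.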

License

Ortho is released as part of the GENTANGLE pipeline (LLNL-CODE-845475) and is distributed under the terms of the MIT License (see LICENSE).

SPDX-License-Identifier: MIT

Funding

This work is supported by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, Lawrence Livermore National Laboratory Secure Biosystems Design SFA “From Sequence to Cell to Population: Secure and Robust Biosystems Design for Environmental Microorganisms”. Work at LLNL is performed under the auspices of the U.S. Department of Energy at Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.


If you use ORTHO in your research, please cite the following papers. Thanks!

Allen JE, et al. GENTANGLE: integrated computational design of gene entanglements. In preparation. 2022.

Blazejewski T, Ho HI, Wang HH. Synthetic sequence entanglement augments stability and containment of genetic information in cells. Science. 2019 Aug 9;365(6453):595-8. https://doi.org/10.1126/science.aav5477


About

OrthoDB-based multiple MSA generation
