Jose Manuel Martí edited this page Apr 24, 2024 · 13 revisions

Required data for running GENTANGLE

The prerequisites for running the GENTANGLE pipeline from scratch for a gene pair are:

  1. DATANGLE (recommended!)
  2. A protein database for homologous search (unless the user provides their own alignments).
  3. Custom input gene sequences (both amino acid and nucleotide sequences in fasta format).
  4. Parameters for the entanglement (can be easily adapted from the example on DATANGLE) and, potentially, a CUT (Codon Usage Table) for the targeted host (DATANGLE-provided CUTs can be used directly or as templates). These are direct requirements of CAMEOX, GENTANGLE's core.

DATANGLE

While this is not technically a requirement, we strongly recommend cloning the DATANGLE repo, as it provides:

  1. A data directory structure that is fully compatible with GENTANGLE.
  2. Example files with GENTANGLE inputs, which are very useful as templates for different gene pairs or parameters.
  3. Example files with GENTANGLE output, which are useful for learning in advance what typical results look like.
  4. All the intermediate files needed to independently test any step of the GENTANGLE pipeline with the Singularity container, using the example entanglement of infA and aroB (these two genes are entangled in the original CAMEOS paper).

Please see details and documentation on the DATANGLE wiki page.

Protein database

Do I need to download the database?

A protein database is required for finding sequences homologous to your genes of interest in order to build the protein multiple sequence alignments required by CAMEOX. However, if you already have MSAs for the genes to entangle or are testing the pipeline using our provided examples, you can skip this download.

The GENTANGLE container provides an app (the first step of the pipeline) to create MSAs from scratch using an up-to-date local copy of a protein database (a UniProt UniRef database is recommended), which enables local generation of alignments. While we support and recommend this approach, the user is free to generate MSAs using any other procedure and then skip that step of the pipeline. If you are entangling your own genes and don't have pre-built alignments for them, you'll need to download a protein database, so please proceed to the next subsection.

Getting the protein database

You will need a protein database in fasta format. We recommend that you download UniProt's UniRef100 or UniRef90 database (or both) and place it in the /path/datangle/dbs directory. These databases are large, so you will need enough free disk space for them, as indicated in the requirements: uniref90.fasta, for example, is 86 GB as of April 2024. The databases can be downloaded from here.
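Given the size of these databases, it may be worth checking the available disk space before starting a download. The following is a minimal sketch, not part of GENTANGLE: has_free_space is a hypothetical helper name, and the 86 GB figure is simply the uniref90.fasta size quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not provided by GENTANGLE or DATANGLE).
# Succeeds if the filesystem holding directory $1 has at least $2 GB available.
has_free_space() {
    local dir="$1" required_gb="$2"
    local avail_kb
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # the 4th column of the data line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( required_gb * 1024 * 1024 )) ]
}

# Example usage (the path is the placeholder used on this page):
#   has_free_space /path/datangle/dbs 86 || echo "Not enough free space" >&2
```

The threshold should be adjusted to the database actually being downloaded, since sizes change between releases.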

We suggest using UniRef90 for genes that are frequently sequenced and have many orthologs. If the resulting alignments have a limited number of orthologs, and especially if they have fewer orthologs than the length of the reference sequence in amino acids, we recommend using UniRef100 instead. If in doubt, we suggest starting with UniRef100, understanding that there may be some redundancy in the retrieved sequences. This redundancy can be managed with the ppmsa app included in the GENTANGLE container, which curates the multiple sequence alignments; its --seqid_filter option removes redundant sequences (see more details in step 2 of the pipeline).
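The rule of thumb above (compare the number of retrieved sequences with the length of the reference protein) can be checked with standard shell tools. This is only a sketch under stated assumptions: the function names are hypothetical, the alignment is assumed to be a fasta-format file, and every record in it is counted, so subtract one if your alignment also contains the reference itself.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of GENTANGLE).

# Number of fasta records in a file (header lines starting with '>').
count_seqs() {
    grep -c '^>' "$1" || true   # grep exits non-zero when the count is 0
}

# Length of the first sequence in a fasta file, ignoring line breaks.
first_seq_length() {
    awk '/^>/ { n++; next }
         n == 1 { gsub(/[[:space:]]/, ""); len += length($0) }
         END { print len + 0 }' "$1"
}

# Prints "UniRef100" if the alignment has fewer sequences than the
# reference protein has residues, and "UniRef90" otherwise.
#   $1 = alignment in fasta format, $2 = reference protein fasta
suggest_database() {
    local msa="$1" refseq="$2"
    local n_seqs ref_len
    n_seqs=$(count_seqs "$msa")
    ref_len=$(first_seq_length "$refseq")
    if [ "$n_seqs" -lt "$ref_len" ]; then
        echo "UniRef100"
    else
        echo "UniRef90"
    fi
}

# Example usage (file names are placeholders):
#   suggest_database my_alignment.fa geneLabel1.refseq.fa
```

The reference length is read from the unaligned protein fasta rather than from the alignment, so gap characters do not inflate it.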

You may use other protein databases in fasta format. Despite its size, we recommend UniProt's UniRef for its good coverage of a wide variety of proteins. Some users may prefer local metagenomic databases, which are even larger than the recommended ones but may provide a larger training set.

Custom input gene sequences

  1. Gene name: Once you assign a label (geneLabel1) to your reference protein, you must consistently keep that naming throughout the pipeline, so it should be carefully selected in this "step 0". As an example, we use the structure gene_organism_database in our examples, where geneLabel1 is named infA_pf5_uref100.

  2. Fasta sequences: To specify the two genes you'd like to entangle using CAMEOX, you need to create fasta files containing their sequences, both as amino acids for the proteins and as nucleotides for the coding sequences (CDS). geneLabel1.refseq.fa is a standard fasta file containing the protein sequence of geneLabel1, with header >geneLabel1, and geneLabel1.cds.fa is a standard fasta file containing the nucleotide CDS of geneLabel1:

    1. Copy the protein sequence to a new subdirectory within the msas data directory under /path/datangle/ (needed for all the upstream steps of the GENTANGLE pipeline):
      mkdir -p /path/datangle/msas/geneLabel1/
      cp geneLabel1.refseq.fa /path/datangle/msas/geneLabel1/
      
    2. Add the protein sequence to the proteins.fasta file in the root of the data directory (needed for the CAMEOX step in GENTANGLE):
      cat geneLabel1.refseq.fa >> /path/datangle/proteins.fasta
      
    3. Add the CDS to the cds.fasta file in the root of the data directory (needed for the CAMEOX step in GENTANGLE):
      cat geneLabel1.cds.fa >> /path/datangle/cds.fasta
      
  3. Repeat for the other gene of the pair: Specify the second gene (geneLabel2) following exactly the same procedure as in items 1 and 2 above. Again, geneLabel2 can be any unique name for your gene, but the label needs to be consistent in the path, filename, and fasta headers.

  4. Proceed with the GENTANGLE pipeline: Once the custom input genes are set up, you can follow the steps outlined in the end-to-end example to build the multiple sequence alignments (steps 1 and 2 of the GENTANGLE pipeline), train the protein fitness models (steps 3 and 4), and summarize all the required files (step 5). Once steps 1-5 are complete, the entanglement search is run as step 6 of GENTANGLE (CAMEOX), and its output is evaluated and analyzed in steps 7 and 8 of the pipeline. However, before running step 6 (CAMEOX) you will need to set the required parameters for the CAMEOX execution (see next section).
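The per-gene setup described in this section can be wrapped in a small shell function so that both genes of the pair are handled identically. This is a sketch only: add_gene is a hypothetical helper, not a GENTANGLE app, and it assumes that the two input files follow the geneLabel.refseq.fa and geneLabel.cds.fa naming used above. It checks only the header of the protein file, since that is the one this page specifies explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GENTANGLE) wrapping the commands above.
#   $1 = gene label
#   $2 = data directory (e.g. the DATANGLE root)
#   $3 = directory holding <label>.refseq.fa and <label>.cds.fa (default: .)
add_gene() {
    local label="$1" data_dir="$2" src_dir="${3:-.}"
    local refseq="$src_dir/$label.refseq.fa"
    local cds="$src_dir/$label.cds.fa"

    # The protein fasta header must be exactly ">label".
    if [ "$(head -n 1 "$refseq")" != ">$label" ]; then
        echo "error: header of $refseq is not >$label" >&2
        return 1
    fi
    if [ ! -f "$cds" ]; then
        echo "error: $cds not found" >&2
        return 1
    fi

    # Protein sequence for the upstream steps of the pipeline.
    mkdir -p "$data_dir/msas/$label"
    cp "$refseq" "$data_dir/msas/$label/"

    # Protein and CDS records for the CAMEOX step.
    cat "$refseq" >> "$data_dir/proteins.fasta"
    cat "$cds" >> "$data_dir/cds.fasta"
}

# Example usage (labels and path are the placeholders used on this page):
#   add_gene geneLabel1 /path/datangle
#   add_gene geneLabel2 /path/datangle
```

Because the last two commands append, calling the function twice for the same label would duplicate its records in proteins.fasta and cds.fasta.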

CAMEOX parameters and CUTs

CAMEOX parameters and CUTs for different hosts are described in detail in the CAMEOX readme. As that document recommends, for any customization of either the parameters or the CUTs, we suggest starting from:
