output
This section provides a comprehensive guide to key concepts and the various output files and sampling strategies generated by the GENTANGLE pipeline, enabling users to effectively utilize the results in gene entanglement design. Understanding these outputs is crucial for making informed decisions and optimizing your research workflow around GENTANGLE. Here you will find important concepts about the output data, details about the directory structure, different file formats (including interactive plots, tabular data, and serialized dataframes), and the various sampling strategies available for downselecting candidate solutions.
ERP (Entanglement Relative Position) refers to the relative position of the shorter gene within the longer gene, expressed as a decimal value with four digits for precision. For example:
- A shorter gene embedded in a 10000 nucleotide length gene at position 1 would be at relative position 0.0001 and thus represented as ERP0001 (Entanglement Relative Position 0.1‰ or 0.01%).
- A shorter gene embedded in a 10000 nucleotide length gene at position 10 would be at relative position 0.001 and thus represented as ERP0010 (Entanglement Relative Position 1‰ or 0.1%).
- A shorter gene embedded in a 1000 nucleotide length gene at position 1 would also be at relative position 0.001 and thus represented as ERP0010 (Entanglement Relative Position 1‰ or 0.1%).
- A shorter gene embedded in a 1000 nucleotide length gene at position 900 would be at relative position 0.9 and thus represented as ERP9000 (Entanglement Relative Position 900‰ or 90%).
The choice of four decimal places for ERP values balances precision and usability. It allows for unique positioning within protein sequences up to 10,000 amino acids, which covers a significant majority of known proteins. For proteins exceeding 10,000 amino acids, neighboring positions may share the same ERP value, but such proteins are rare, so the risk of ambiguity is very low. Five or more decimal places would add complexity for users without a practical benefit in the context of gene entanglement.
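The examples above can be reproduced with a short sketch. The function name `erp_label` and the rounding rule (round to the nearest 1/10,000) are assumptions made for illustration; the text does not state how GENTANGLE rounds or truncates the relative position.

```python
def erp_label(position: int, longer_gene_length: int) -> str:
    """Build an ERP label from a position within the longer gene.

    Assumption: the relative position is rounded to four decimal places
    and the four digits are zero-padded, as in the examples above.
    """
    if longer_gene_length <= 0:
        raise ValueError("longer_gene_length must be positive")
    if not 0 <= position < longer_gene_length:
        raise ValueError("position must lie within the longer gene")
    # Multiply before dividing so that exact cases stay exact.
    ten_thousandths = round(position * 10_000 / longer_gene_length)
    return f"ERP{ten_thousandths:04d}"


print(erp_label(1, 10_000))   # ERP0001
print(erp_label(10, 10_000))  # ERP0010
print(erp_label(1, 1_000))    # ERP0010
print(erp_label(900, 1_000))  # ERP9000
```

Under this rounding assumption, a longer gene of more than 10,000 units maps several neighboring positions to the same label, which is the redundancy mentioned above.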
The term ~ratio (similarity ratio) refers to the fraction of matching amino acids between the designed protein and the reference (wild type, WT) protein. The ~ratio can serve as a baseline for basic validation, given the general tendency for protein sequences closer to the WT to perform better.
The averaged ~ratio is the average of the similarity ratios of the two proteins in the entanglement pair, each weighted equally.
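As a minimal sketch of these two definitions, the functions below compare sequences position by position. This assumes the variant and the reference have the same length (no insertions or deletions); the function names are hypothetical and GENTANGLE's own computation may differ, for example by aligning the sequences first.

```python
def similarity_ratio(variant: str, reference: str) -> float:
    """Fraction of positions where the variant matches the reference.

    Assumption: both sequences have the same, non-zero length.
    """
    if not reference or len(variant) != len(reference):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(variant, reference))
    return matches / len(reference)


def averaged_ratio(ratio_gene1: float, ratio_gene2: float) -> float:
    """Equal-weight average of the similarity ratios of both proteins."""
    return (ratio_gene1 + ratio_gene2) / 2


r1 = similarity_ratio("MKVL", "MKIL")  # 3 of 4 positions match
r2 = similarity_ratio("MAAA", "MCCC")  # 1 of 4 positions match
print(r1, r2, averaged_ratio(r1, r2))
```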
Different seeds generated in the HMM-based step of CAMEOX may converge to the same solution after the MRF-based optimization, which may happen within the same CAMEOX run or between different runs. This convergence of different initial seeds to identical final variants creates redundancy in the set of variants proposed by CAMEOX. The analysis step of GENTANGLE also analyzes this potential multiplicity in the generated variants and provides results such as a specific interactive plot (see "Interactive plot of multiplicity analysis" below). This redundancy is also exploited in some of the sampling criteria (see "Sampling Strategies and Output Files" below).
These are the concepts related to this phenomenon:
- Per CAMEOX run:
  - Multiplicity, for a final variant, is the number of different initial seeds that converged to the sequence of that variant in a CAMEOX run.
  - Redundancy rate is the fraction of redundant variants in a CAMEOX run. Many variables affect this ratio, ranging from the HMM and MRF models used for the entanglement pair to parameters of the CAMEOX run such as the number of seeds and the type of weighting strategy.
- Per series of CAMEOX runs considered in an analysis:
  - Max_multi is the maximum of the multiplicity of a variant over all the CAMEOX runs considered in the analysis.
  - Abundance is the sum of the multiplicity of a variant over all the CAMEOX runs (datasets) considered where it appears two or more times.
  - Spread is the fraction of CAMEOX runs where the variant appears two or more times.
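The definitions above can be illustrated with a small sketch in which each CAMEOX run is represented as a plain list of final variant sequences, one entry per seed. This representation and the function names are assumptions for illustration only. In particular, the redundancy rate is interpreted here as one minus the fraction of distinct variants, which is one plausible reading of "fraction of redundant variants".

```python
from collections import Counter


def redundancy_rate(run: list[str]) -> float:
    """Assumed reading: share of seeds that did not yield a new variant."""
    return 1 - len(set(run)) / len(run)


def cross_run_stats(runs: list[list[str]], variant: str) -> dict:
    """Max_multi, abundance and spread of one variant over several runs."""
    # Multiplicity of the variant in each run (0 if it is absent).
    multiplicities = [Counter(run)[variant] for run in runs]
    # Only runs where the variant appears two or more times count
    # towards abundance and spread.
    redundant = [m for m in multiplicities if m >= 2]
    return {
        "max_multi": max(multiplicities),
        "abundance": sum(redundant),
        "spread": len(redundant) / len(runs),
    }


runs = [["A", "A", "B"], ["A", "C"], ["A", "A", "A", "B"]]
print(Counter(runs[0]))            # multiplicity per variant in the first run
print(redundancy_rate(runs[0]))
print(cross_run_stats(runs, "A"))
```

In this toy data, variant "A" has multiplicities 2, 1 and 3, so only the first and third runs contribute to its abundance and spread.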
GENTANGLE organizes its output within a dedicated outputs subdirectory. Within this directory, a new subfolder is created for each pair of entangled genes, named in the format {geneLabel1}_{geneLabel2}. To be informative, our gene labels typically include information about the reference sequence strain and the database used to build the multiple sequence alignments (MSAs) for each gene, for instance gene_organism_database in the case of the genes in the infA_pf4_uref100⥂aroB_pf4_uref100 entanglement.
The final solutions are automatically placed in different files in the output directory (for instance /path/datangle/output).
- File format: summary_{RUN_ID}.csv
- Description: This file provides a comprehensive overview of all generated solutions for the entanglement of {geneLabel1} and {geneLabel2}. The RUN_ID is a unique identifier automatically assigned during the cameox step of the pipeline, ensuring traceability.
GENTANGLE often generates far more candidate solutions than can be tested experimentally, because running as exhaustive a search as possible is recommended to find the best performing solutions. To facilitate downselection, the pipeline offers four distinct sampling strategies, each producing a separate output file:
- Output file format: pareto_ERP_CAMEOX_{geneLabel1}_{geneLabel2}_pairs.csv
- Description: This strategy selects the top N solutions along the Pareto frontier, alternating solutions with optimal trade-offs between the best fitness scores of {geneLabel1} and {geneLabel2}. This sampling strategy samples strictly from the best scoring solutions and should output reliable entanglement solutions when accurate fitness models were used, so it is ideal when high confidence exists in the accuracy of the fitness models.
- Output file format: multy_ERP_CAMEOX_{geneLabel1}_{geneLabel2}_pairs.csv
- Description: This downsampling strategy selects the top N solutions by the number of times that different HMM seeds converged to the same solution after the MRF optimization for the entanglement of {geneLabel1} and {geneLabel2}. This redundancy can be observed within a run and between different runs of CAMEOX, and this criterion accounts for both. This method offers an alternative sampling strategy to the Pareto frontier by identifying solutions that the pipeline consistently converges upon.
- Output file format: overden_ERP_CAMEOX_{geneLabel1}_{geneLabel2}_pairs.csv
- Description: This strategy bins solutions based on their score distribution and samples N solutions evenly from the top M most populated but most separate regions for the entanglement of {geneLabel1} and {geneLabel2}. It is useful in early discovery phases where model accuracy may be uncertain, as it yields a set of samples predicted to cover a diverse range of fitness values.
- Output file format: random_ERP_CAMEOX_{geneLabel1}_{geneLabel2}_pairs.csv
- Description: This strategy randomly samples variants from the entire solution space for the entanglement of {geneLabel1} and {geneLabel2}. Similar to overdensity sampling, it is beneficial in early stages or when exploring a wide range of possibilities without relying heavily on model scores.
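To make the idea behind the Pareto strategy concrete, the sketch below extracts the non-dominated solutions from a list of score pairs. It is a generic Pareto filter, not GENTANGLE's implementation: it assumes each solution is a `(score_gene1, score_gene2)` tuple where lower is better for both scores, and it does not reproduce how the pipeline picks or alternates the top N solutions along the frontier.

```python
def pareto_front(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the non-dominated points, assuming lower scores are better.

    A point is dominated if another point is at least as good in both
    scores and strictly better in at least one.
    """
    front = []
    best_second = float("inf")
    # After sorting by the first score, a point is on the frontier only
    # if its second score improves on everything seen so far.
    for point in sorted(set(points)):
        if point[1] < best_second:
            front.append(point)
            best_second = point[1]
    return front


scores = [(1, 5), (2, 3), (3, 4), (4, 1), (2, 6)]
print(pareto_front(scores))  # [(1, 5), (2, 3), (4, 1)]
```

Here `(3, 4)` and `(2, 6)` are dropped because `(2, 3)` is at least as good in both scores.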
GENTANGLE generates different interactive HTML files with filename iplot*.html that allow users to visually explore the distribution of solutions and the impact of different sampling strategies for the entanglement of {geneLabel1} and {geneLabel2}. The next subsections provide details on the three main kinds of interactive plots generated by the GENTANGLE pipeline (analysis app). You can find links to examples on the GENTANGLE webpage and in the subsections below.
- Filename format: iplot_{geneLabel1}_{geneLabel2}_rand.html
- Content: Pseudolikelihood, similarity ratio, and density analysis for all the generated variants.
- Details: Each plot series covers all the variants generated in a single CAMEOX run for the entanglement pair.
- Plots:
  - APLL is the anti/negative pseudo-log-likelihood (PLL) by the MRF/Potts models,
  - ~ratio is the similarity ratio of the variants regarding the WT/reference sequence as described above,
  - Density is a density plot of variants in the APLL space.
- Color scales:
  - None: provides a different color for each plot series (CAMEOX run).
  - By ~ratio avg: the color depends on the average of the similarity ratios for each protein of the pair.
  - By ERP: the color scale refers to the ERP as described above. Blueshift refers to an ERP towards the beginning of the larger protein, while redshift indicates an ERP towards the end of the larger protein. This may be a very informative plot.
- Example: Example of all variants iplot.
- Filename format: iplot_{geneLabel1}_{geneLabel2}_redundancy_all.html
- Content: Interactive plot of the redundant variants in the APLL space of the two proteins.
- Details: The plot shows the variants with multiplicity over 1 in at least one of the CAMEOX runs considered in the analysis. The diameter of the circles is proportional to the spread as defined above. The basic color scale refers to the abundance as defined above; additional color scales are provided for the averaged ~ratio and the ERP.
- Example: Example of multiplicity analysis iplot. Try selecting the ERP color scale to see how the different ERPs have an increased density of redundant solutions in different parts of the APLL space.
- Filename format: iplot_{geneLabel1}_{geneLabel2}_sampled_vars.html
- Content: Sampled variants with selection methods.
- Details: Double-clicking on each sampling strategy within the interactive plot reveals the selected candidate solutions; for each ERP, the options are: pareto_ERP, multy_ERP, overden_ERP, and random_ERP.
- Example: Example of sampled variants iplot. Try selecting the Pareto option for the two different ERP values to see how the model clearly favors one entanglement position over the other.
Beyond the CSV and HTML files, GENTANGLE produces additional data files containing scores, parameters, and statistics. These files offer deeper insights into the solution space and can be used for further analysis, visualization, or downstream applications such as machine learning or custom filtering. They can be analyzed using the provided example Jupyter notebook:
- Serialized Pandas DataFrames:
  - File format: CAMEOX_{geneLabel1}_{geneLabel2}_pairs_*.pkl.bz2
  - Description: These compressed files contain Pandas DataFrames storing detailed information about the generated solutions for the entanglement of {geneLabel1} and {geneLabel2}, including scores, sequences, and other relevant data.
- Metadata:
  - File format: CAMEOX_{geneLabel1}_{geneLabel2}_metadata.csv
  - Description: This file provides metadata associated with the different CAMEOX runs for the entanglement of {geneLabel1} and {geneLabel2}, such as parameters used and runtime statistics.
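Outside the example notebook, the serialized files can be opened with a few lines of Python. The sketch below assumes the files are standard bz2-compressed pickles, as the `.pkl.bz2` extension suggests; the function name `load_pair_dataframes` and its arguments are hypothetical. Unpickling actual DataFrames requires pandas to be installed, and pickle files should only be loaded from sources you trust.

```python
import bz2
import pickle
from pathlib import Path


def load_pair_dataframes(pair_dir, gene_label1, gene_label2):
    """Load every CAMEOX_{g1}_{g2}_pairs_*.pkl.bz2 file found in pair_dir.

    Returns a dict mapping each file name to the unpickled object.
    Assumption: files are plain pickles compressed with bz2.
    """
    pattern = f"CAMEOX_{gene_label1}_{gene_label2}_pairs_*.pkl.bz2"
    loaded = {}
    for path in sorted(Path(pair_dir).glob(pattern)):
        # bz2.open decompresses transparently while pickle reads the stream.
        with bz2.open(path, "rb") as handle:
            loaded[path.name] = pickle.load(handle)
    return loaded
```

With pandas available, `pandas.read_pickle(path)` reads a single such file directly, since it infers bz2 compression from the file extension.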
The choice of sampling strategy depends on the confidence in the fitness models and the stage of the research project. When model accuracy is high, prioritizing the Pareto frontier is recommended as it focuses on solutions with the best trade-offs between gene fitness scores. In early stages or when working with noisy or less reliable models, incorporating multiplicity, overdensity, or random sampling can be beneficial. These strategies provide a broader exploration of the solution space and can be used to collect additional data for further model refinement in subsequent cycles. Ultimately, GENTANGLE offers all four sampling options to allow users to experiment and find the most suitable approach for their specific research goals and model confidence.
If you use GENTANGLE/CAMEOX in your research, please consider citing the paper. Thanks!
Martí JM, et al. (2024) GENTANGLE: integrated computational design of gene entanglements. Bioinformatics; btae380. https://doi.org/10.1093/bioinformatics/btae380