DIVERGE v4 is a Python package designed for large-scale analysis of functional divergence across multi-gene families. It is a major upgrade of the widely used DIVERGE software, incorporating a novel Super-Cluster algorithm, a modular Python structure, and a user-friendly web server. This package allows for the identification of amino acid sites undergoing significant evolutionary shifts, helping to uncover functional divergence after gene duplication.
DIVERGE analyzes two types of functional divergence:
- Type-I: Significant differences in evolutionary rates at specific sites between gene clusters, indicating different functional constraints
- Type-II: Subfamily-specific amino acid property conservation, where sites are conserved within each subfamily but carry amino acids with different properties in different subfamilies
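
To make the two patterns concrete, the toy sketch below labels a single alignment column from the residues observed in two clusters. It is only an illustration of the definitions above, not the statistical test that DIVERGE performs: `column_pattern` is a hypothetical helper, it treats "conserved" as "exactly one residue type", and it ignores amino acid properties and phylogeny.

```python
def column_pattern(col_a, col_b):
    """Label one alignment column by its residue pattern in two clusters.

    col_a and col_b are strings holding the residues of the same column
    in cluster A and cluster B. Gaps ("-") are ignored.
    """
    res_a = set(col_a) - {"-"}
    res_b = set(col_b) - {"-"}
    conserved_a = len(res_a) == 1
    conserved_b = len(res_b) == 1

    # Conserved in one cluster, variable in the other: a rate shift.
    if conserved_a != conserved_b:
        return "type-I-like"
    # Conserved in both clusters, but on different residues.
    if conserved_a and conserved_b and res_a != res_b:
        return "type-II-like"
    return "no obvious shift"


print(column_pattern("KKKK", "RSTA"))
print(column_pattern("DDDD", "KKKK"))
```

The real analysis estimates these shifts with a probabilistic model over the tree, so a column labeled here may still receive a low posterior probability.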

- Novel Super-Cluster Algorithm: A statistically robust method for analyzing large gene families that:
  - Replaces numerous one-to-one comparisons with a single computation
  - Divides clusters into Super-Cluster pairs based on conservation patterns
  - Provides more accurate functional divergence detection for multi-gene families
  - Reduces computational complexity from exponential to linear
- Modular Python Architecture:
  - Base Layer (C++ API): Core data structures and computationally intensive functions
  - Middle Layer (Python Wrapper): PyBind11-based bridge between C++ and Python
  - Top Layer (High-Level Python API): User-friendly interface for model building and analysis
- Comprehensive Database:
  - Analysis results of 4,540 human protein families
  - Covers 10,133 human genes and 215,480 protein sequences
  - Built using phylogenetic data from the PANTHER database
  - Multiple sequence alignments from 19 selected vertebrate species
  - Expert-reviewed phylogenetic trees
- Novel Super-Cluster Algorithm: A statistically robust method designed for large-scale analysis of functional divergence in multi-gene families, replacing numerous one-to-one comparisons with a single computation
- Modular Python Package: Built for scalability and seamless integration into bioinformatics workflows, with 10 customizable modules for functional divergence analysis
- Web Interface: A user-friendly web server developed using the Streamlit framework, making the package accessible even without programming knowledge
- Comprehensive Database: Analysis results of 4,540 human protein families (comprising 10,133 human genes and 215,480 protein sequences), searchable by UniProtKB, Ensembl, HGNC IDs, or gene names
To install the DIVERGE v4 Python package, use pip:

    pip install diverge

If you prefer to compile the package from source using setup.py, you will need to install the pybind11 library, which provides the C++ bindings for Python used in this package. You can install it via pip:

    pip install pybind11

Once pybind11 is installed, you can compile DIVERGE v4 by running the following commands:

    git clone https://github.com/zjupgx/diverge4.git
    cd diverge4
    python setup.py install

pybind11 is necessary because DIVERGE v4 uses C++ for its core data structures and computationally intensive tasks, which are exposed to Python via pybind11.

- Multiple Sequence Alignment (MSA) File:
  - Supported formats: FASTA or CLUSTAL
  - Only amino acid alignments are allowed
  - Gaps (-) are allowed in the alignment
- Phylogenetic Tree File:
  - Must be in Newick format
  - Branch lengths are optional but recommended
  - Internal node names should be removed to prevent program crashes
  - Tree depth must be at least 3 for proper analysis
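
Internal node names can be stripped before running an analysis. The sketch below is one minimal way to do it with a regular expression and is not part of the DIVERGE API: `strip_internal_labels` is a hypothetical helper that assumes unquoted labels and no bracketed Newick comments. It removes whatever follows a closing parenthesis up to the next `:`, `,`, `)` or `;`, which also drops bootstrap support values stored as internal labels.

```python
import re


def strip_internal_labels(newick):
    """Remove labels attached to internal nodes of a Newick string.

    In Newick, an internal node label is the text that directly follows
    a closing parenthesis. Leaf names and branch lengths are untouched.
    Assumes unquoted labels and no [comment] blocks.
    """
    return re.sub(r"\)[^:;,()]+", ")", newick)


tree = "((A:0.1,B:0.2)n1:0.3,(C:0.1,D:0.2)n2:0.4)root;"
print(strip_internal_labels(tree))
```

For trees with quoted names or embedded comments, a dedicated tree library is a safer choice than a regular expression.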
    from diverge import Gu99

    # Perform Type-I functional divergence analysis
    gu99 = Gu99("alignment.aln", "cluster1.tree", "cluster2.tree")
    print("Theta coefficient:", gu99.summary.iloc[0, 0])
    print("Sites with high divergence (Qk > 0.9):", sum(gu99.results.iloc[:, 0] > 0.9))

    from diverge import SuperCluster

    # Analyze multiple clusters with parallel processing
    super_cluster = SuperCluster("alignment.aln", "tree1.tree", "tree2.tree",
                                 "tree3.tree", "tree4.tree", parallel=True)
    print("Summary:", super_cluster.summary)
    print("Results shape:", super_cluster.results.shape)

    from diverge import Gu99Batch

    # Process multiple datasets in parallel
    batch = Gu99Batch(max_threads=8)
    batch.add_task("dataset1.aln", "d1_tree1.tree", "d1_tree2.tree", task_name="Dataset_1")
    batch.add_task("dataset2.aln", "d2_tree1.tree", "d2_tree2.tree", task_name="Dataset_2")
    batch.calculate_batch()

    # Get results
    results = batch.get_successful_results()
    batch.print_summary()

    from diverge import SuperCluster

    # One Newick file per cluster (file names reused from the example above)
    tree_files = ["tree1.tree", "tree2.tree", "tree3.tree", "tree4.tree"]

    # Apply conservation weighting for improved accuracy
    conswins = {'cons_win_len': 3, 'lambda_param': 0.7}
    super_cluster = SuperCluster("alignment.aln", *tree_files,
                                 conswins=conswins, parallel=True)

📖 Complete User Guide - Detailed documentation covering:
- All analysis methods with examples
- SuperCluster algorithm details
- Batch processing workflows
- Performance optimization
- Troubleshooting guide
- Advanced features
DIVERGE v4 provides independent analysis modules that can be combined into custom pipelines for functional divergence analysis. The main functions are:
| Function | Description |
|---|---|
| Type-I Divergence (Gu99 method) | Detect type-I functional divergence using the Gu (1999) method |
| Type-I Divergence (Gu2001 method) | Detect type-I functional divergence using the Gu (2001) method. Requires phylogenetic tree file with branch length data |
| Type-II Divergence | Detect type-II functional divergence of gene families |
| Super-Cluster Analysis | Perform large-scale functional divergence analysis using the Super-Cluster method, designed for multi-gene families |
| Rate Variation Among Sites (RVS) | Estimate rate variations among sites for a given cluster. Only one cluster is allowed per run |
| Functional Distance Analysis | Estimate type-I functional distance between pairs of clusters and compute type-I functional branch lengths. Requires at least three clusters |
| FDR for Predictions | Calculate the false discovery rate of functionally diverging sites |
| Asymmetric Test for Type-I Functional Divergence | Test whether the degree of type-I functional divergence differs between duplicate genes. Requires three clusters |
| Effective Number of Sites | Estimate the effective number of sites related to type-I or type-II functional divergence. Requires two clusters |
| Gene-Specific Type-I Analysis | Site-specific posterior profile for predicting gene-specific type-I functional divergence-related sites. Requires three clusters |
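
As an illustration of what a false discovery rate means for posterior-based predictions, the sketch below uses a common estimate: among the sites whose posterior probability exceeds a cutoff, the expected FDR is the mean of `1 - Qk`. Whether the DIVERGE FDR module uses exactly this formula is an assumption, and `expected_fdr` is a hypothetical helper, not a function of the package.

```python
def expected_fdr(posteriors, cutoff):
    """Return (number of selected sites, expected FDR) at a cutoff.

    posteriors: iterable of site-specific posterior probabilities (Qk).
    A site is selected when its posterior is strictly above the cutoff.
    Each selected site is a false positive with probability 1 - Qk,
    so the expected FDR is the mean of 1 - Qk over the selected sites.
    """
    selected = [q for q in posteriors if q > cutoff]
    if not selected:
        return 0, 0.0
    fdr = sum(1.0 - q for q in selected) / len(selected)
    return len(selected), fdr


# Made-up posterior values, for illustration only
qk = [0.95, 0.99, 0.50, 0.92]
n_sites, fdr = expected_fdr(qk, cutoff=0.9)
print(n_sites, round(fdr, 4))
```

Lowering the cutoff selects more sites but raises the expected FDR, so the two are usually inspected together when choosing a threshold.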
The Super-Cluster algorithm is designed to efficiently analyze functional divergence in large gene families by:
- Partitioning m clusters into two groups (Super-Cluster pairs)
- Computing changes at amino acid sites for each Super-Cluster
- Performing DIVERGE Type-I analysis on Super-Cluster pairs
- Recording site-specific posterior probabilities for divergence profiling
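
The partitioning step can be sketched as an enumeration of two-way splits of the cluster list. This is an assumption made for illustration: the package may restrict or order the splits it considers (for example by conservation patterns), and `bipartitions` is a hypothetical helper, not part of the DIVERGE API.

```python
from itertools import combinations


def bipartitions(clusters):
    """Yield every unordered split of `clusters` into two non-empty groups.

    The first cluster is pinned to group A so that each split is
    produced once, giving 2**(m - 1) - 1 splits for m clusters.
    Cluster names are assumed to be unique.
    """
    first, rest = clusters[0], clusters[1:]
    # k = how many of the remaining clusters join group A.
    # k stops before len(rest) so that group B is never empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            group_a = (first, *extra)
            group_b = tuple(c for c in rest if c not in extra)
            yield group_a, group_b


for group_a, group_b in bipartitions(["c1", "c2", "c3"]):
    print(group_a, "vs", group_b)
```

Each yielded pair corresponds to one candidate Super-Cluster pair on which a two-cluster Type-I analysis could then be run.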
This approach provides several advantages:
- Reduces computational complexity
- Improves statistical robustness
- Enables analysis of larger gene families
- Provides more intuitive functional divergence profiles
The web server (https://pgx.zju.edu.cn/diverge) provides:
- Interactive Analysis: Upload MSA and phylogeny files for functional divergence analysis
- Comprehensive Database: Access pre-computed analyses of 4,540 human protein families
- Search Functionality: Query the database using UniProtKB ID, Ensembl ID, HGNC ID, or gene name
- Functional Annotations: Access Gene Ontology terms, pathways, and protein class assignments for human proteins
- Visualization Tools: Interactive visualization of results and amino acid sites
Common issues and solutions:
- Sequence Name Mismatch: Ensure sequence names in the MSA file exactly match those in the tree file
- Tree Depth Error: Verify the tree file has at least 3 levels of depth
- Internal Node Names: Remove names from internal nodes in the tree file
- File Format: Confirm the MSA is in proper FASTA or CLUSTAL format
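
A name mismatch can be caught before running an analysis. The sketch below is a minimal pre-flight check that does not use the DIVERGE API: the helper names are hypothetical, the alignment is assumed to be FASTA, sequence names are taken as the first whitespace-delimited token of each header, and the tree is assumed to have unquoted leaf names. How DIVERGE itself matches names may differ.

```python
import re


def fasta_names(fasta_text):
    """Collect sequence names (first token of each '>' header line)."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


def newick_leaf_names(newick_text):
    """Collect leaf names: labels that directly follow '(' or ','."""
    return set(re.findall(r"[(,]\s*([^:;,()\s]+)", newick_text))


def leaves_missing_from_msa(fasta_text, newick_text):
    """Return tree leaf names that have no sequence in the alignment."""
    return sorted(newick_leaf_names(newick_text) - fasta_names(fasta_text))


msa = ">A some description\nMKV\n>B\nMKI\n>C\nMRV\n"
tree = "((A:0.1,B:0.2):0.3,c:0.4);"
print(leaves_missing_from_msa(msa, tree))
```

Because each cluster tree normally covers only part of the alignment, the check runs in one direction only: every leaf must appear in the MSA, but not every MSA sequence must appear in a given tree.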
If you use DIVERGE in your research, please cite:
Chen Y, Xu X, Pan Y, Wang S, Zhao W, Zhou B, Zhou J, Zheng Y, Zhou Z, Gu X. DIVERGE v4: A Platform for Large-Scale Analysis of Functional Divergence Across Multi-Gene Families. Mol Biol Evol. 2025, 42(11):msaf277. doi: 10.1093/molbev/msaf277.
DIVERGE v4 is licensed under the MIT License. See the LICENSE file for more details.
