A complete implementation combining the DiCE (Differential Centrality-Ensemble) methodology with the state-of-the-art Duan et al. 2025 shortest path algorithm for identifying disease-associated genes through network analysis.
This project implements a computational pipeline that:
- Integrates multi-source protein-protein interaction (PPI) networks from KEGG and BioGRID
- Weights networks using gene expression correlations (DiCE methodology)
- Computes network centralities efficiently using the Duan 2025 algorithm
- Identifies disease-associated genes through differential centrality analysis
- Performs virtual knockout experiments to assess gene essentiality
- Fast SSSP computation: O(m log^(2/3) n) complexity vs O(m + n log n) for Dijkstra
- Directed network support: Proper handling of signaling pathways
- Expression-weighted edges: Correlation-based weights capture functional relationships
- Differential analysis: Compares normal vs disease conditions
- Ensemble ranking: Combines multiple centrality measures
- Virtual knockout: Simulates gene perturbations
dice_duan_project/
├── data/
│ ├── raw/
│ │ ├── kegg/ # KEGG pathway XML files
│ │ ├── biogrid/ # BioGRID TAB2 file
│ │ └── expression/ # Gene expression matrices
│ ├── processed/
│ │ ├── networks/ # Parsed and merged networks
│ │ └── weighted/ # Expression-weighted networks
│ └── results/ # Analysis outputs
├── src/
│ ├── cpp/ # C++ Duan 2025 implementation
│ │ ├── duan_engine.hpp # Core algorithm
│ │ └── main.cpp # Analysis pipeline
│ └── python/ # Python preprocessing
│ ├── parse_kegg.py
│ ├── parse_biogrid.py
│ ├── merge_networks.py
│ ├── weight_network.py
│ └── differential_centrality.py
├── CMakeLists.txt # C++ build configuration
├── requirements.txt # Python dependencies
├── run_pipeline.sh # Main execution script
└── README.md # This file
- Python: 3.8+
- C++ Compiler: g++ 9+ or clang++ 10+ (C++17 support)
- CMake: 3.10+
- Git: For version control
git clone <your-repo-url>
cd dice_duan_projectpip install -r requirements.txtmkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc)
cd ..Download from: https://www.genome.jp/kegg/pathway.html
Instructions:
- Go to KEGG Pathway Database
- Select organism: "Homo sapiens (human)"
- Download KGML (XML) files for pathways of interest
- Place XML files in:
data/raw/kegg/
Alternative: Use KEGG REST API
# Example: Download all human pathways
mkdir -p data/raw/kegg
for pathway in hsa04010 hsa04020 hsa04110 hsa04151; do
wget "https://rest.kegg.jp/get/$pathway/kgml" -O "data/raw/kegg/${pathway}.xml"
doneRecommended pathways for cancer research:
- hsa04010: MAPK signaling
- hsa04110: Cell cycle
- hsa04151: PI3K-Akt signaling
- hsa04510: Focal adhesion
- hsa04520: Adherens junction
Download from: https://downloads.thebiogrid.org/BioGRID/Release-Archive/
Current version: BIOGRID-ALL-5.0.253.tab2.txt
Instructions:
mkdir -p data/raw/biogrid
cd data/raw/biogrid
# Download latest BioGRID
wget "https://downloads.thebiogrid.org/Download/BioGRID/Latest-Release/BIOGRID-ALL-LATEST.tab2.zip"
unzip BIOGRID-ALL-LATEST.tab2.zip
cd ../../..File location: data/raw/biogrid/BIOGRID-ALL-5.0.253.tab2.txt
You need expression matrices for two conditions (e.g., Normal vs Tumor).
Download from: https://gtexportal.org/home/downloads/adult-gtex
Files needed:
GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz
Instructions:
mkdir -p data/raw/expression
cd data/raw/expression
# Download GTEx (requires registration)
# Manual download from: https://gtexportal.org/home/downloads/adult-gtex
# Or use provided link (if you have access)
wget "https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz"
gunzip GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz
cd ../../..Extract specific tissue (e.g., Prostate):
import pandas as pd
# Load GTEx
gtex = pd.read_csv('data/raw/expression/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct',
sep='\t', skiprows=2, index_col=1)
gtex = gtex.iloc[:, 1:] # Remove Description column
# Filter for prostate samples
prostate_cols = [col for col in gtex.columns if 'Prostate' in col]
prostate_normal = gtex[prostate_cols]
# Save
prostate_normal.to_csv('data/raw/expression/expression_normal.txt', sep='\t')Download from: https://portal.gdc.cancer.gov/ or https://xenabrowser.net/
Using UCSC Xena (Recommended):
# Install UCSC Xena tools
pip install xenaPython
# Download TCGA data (Python)
python3 << 'EOF'
import xenaPython as xena
# List available datasets
hub = "https://tcga.xenahubs.net"
# Download TCGA-PRAD (Prostate Cancer) expression
dataset = "TCGA.PRAD.sampleMap/HiSeqV2"
samples = xena.dataset_samples(hub, dataset)
genes = xena.dataset_field(hub, dataset)
# Fetch expression matrix
import pandas as pd
expr_data = xena.dataset_gene_expression(hub, dataset, genes, samples)
df = pd.DataFrame(expr_data, index=genes, columns=samples)
df.to_csv('data/raw/expression/expression_tumor.txt', sep='\t')
print(f"Downloaded {len(genes)} genes x {len(samples)} samples")
EOFAlternative - Manual download:
- Go to: https://xenabrowser.net/datapages/
- Select: "TCGA Pan-Cancer (PANCAN)"
- Download: "gene expression RNAseq - IlluminaHiSeq"
- Filter for your cancer type (e.g., TCGA-PRAD for prostate)
Search: https://www.ncbi.nlm.nih.gov/geo/
Example: GSE21032 (Prostate cancer metastasis study)
# Install GEOparse
pip install GEOparse
# Download dataset
python3 << 'EOF'
import GEOparse
gse = GEOparse.get_GEO("GSE21032", destdir="./data/raw/expression/")
# Process and save as needed
EOFRequired format: Tab-separated text file
Sample1 Sample2 Sample3 ...
GENE1 12.5 14.2 11.8 ...
GENE2 8.3 9.1 7.9 ...
GENE3 15.6 16.2 14.9 ...
...
- Rows: Genes (gene symbols or Entrez IDs)
- Columns: Samples
- Values: Expression levels (TPM, FPKM, or normalized counts)
# Generate dummy data and run full pipeline
chmod +x run_pipeline.sh
./run_pipeline.sh --testThis will:
- Generate synthetic network and expression data
- Run the complete DiCE-Duan pipeline
- Produce results in ~2-3 minutes
# Ensure data is in place:
# - data/raw/kegg/*.xml
# - data/raw/biogrid/BIOGRID-ALL-5.0.253.tab2.txt
# - data/raw/expression/expression_normal.txt
# - data/raw/expression/expression_tumor.txt
# Run pipeline
./run_pipeline.shIf you prefer to run each phase separately:
# Phase 1: Parse KEGG
python3 src/python/parse_kegg.py \
--kegg-dir data/raw/kegg \
--output data/processed/networks/kegg_interactions.txt
# Phase 2: Parse BioGRID
python3 src/python/parse_biogrid.py \
--biogrid-file data/raw/biogrid/BIOGRID-ALL-5.0.253.tab2.txt \
--output data/processed/networks/biogrid_interactions.txt
# Phase 3: Merge networks
python3 src/python/merge_networks.py \
--kegg data/processed/networks/kegg_interactions.txt \
--biogrid data/processed/networks/biogrid_interactions.txt \
--output data/processed/networks/merged_network.txt
# Phase 4: Weight with expression (Normal)
python3 src/python/weight_network.py \
--network data/processed/networks/merged_network.txt \
--expression data/raw/expression/expression_normal.txt \
--output data/processed/weighted/network_normal.txt
# Phase 5: Weight with expression (Tumor)
python3 src/python/weight_network.py \
--network data/processed/networks/merged_network.txt \
--expression data/raw/expression/expression_tumor.txt \
--output data/processed/weighted/network_tumor.txt
# Phase 6: Compute centralities (C++)
./build/dice_analyzer \
data/processed/weighted/network_normal_int.txt \
data/processed/weighted/network_tumor_int.txt \
data/results
# Phase 7: Differential analysis
python3 src/python/differential_centrality.py \
--normal data/results/centrality_normal.txt \
--tumor data/results/centrality_tumor.txt \
--output data/results/dice_genes.txt-
dice_genes.txt: Complete ranked list of DiCE genes
- Columns: gene, betweenness (normal/tumor), eigenvector (normal/tumor), delta values, ranks, ensemble score
-
dice_genes_top500.txt: Top 500 DiCE genes for downstream analysis
-
centrality_normal.txt: Network centralities in normal condition
-
centrality_tumor.txt: Network centralities in tumor condition
-
knockout_analysis.txt: Virtual knockout vitality scores
High ensemble score = Gene with large centrality changes between conditions
- Potential disease driver or biomarker
- Candidate for experimental validation
High vitality score = Gene whose removal disrupts network connectivity
- Essential "bottleneck" protein
- Potential therapeutic target
- Open Cytoscape
- Import Network:
data/processed/weighted/network_tumor.txt - Import Node Attributes:
data/results/dice_genes_top500.txt - Create visual mappings:
- Node size → ensemble_score
- Node color → delta_betweenness
- Edge width → weight
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load DiCE genes
dice = pd.read_csv('data/results/dice_genes.txt', sep='\t')
# Plot top 20
top20 = dice.head(20)
plt.figure(figsize=(10, 8))
plt.barh(range(len(top20)), top20['ensemble_score'])
plt.yticks(range(len(top20)), top20['gene'])
plt.xlabel('Ensemble Score')
plt.title('Top 20 DiCE Genes')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.savefig('top_dice_genes.png', dpi=300)| Algorithm | SSSP Time | Betweenness Time |
|---|---|---|
| Dijkstra | 45 ms | 45 seconds |
| Duan 2025 | 8 ms | 8 seconds |
Speedup: ~5-6x for typical PPI networks
- Peak memory: ~500 MB for 5000-gene network
- Scales linearly with network size
- Solution: Check file paths in
run_pipeline.sh - Ensure data files exist in correct locations
- Solution: Check compiler version
g++ --version # Should be 9.0 or higher
cmake --version # Should be 3.10 or higher- Solution: Install missing dependencies
pip install -r requirements.txt- Solution: Reduce network size or increase RAM
- Consider filtering to high-confidence interactions
If your expression data needs normalization:
import pandas as pd
import numpy as np
# Load raw counts
expr = pd.read_csv('raw_counts.txt', sep='\t', index_col=0)
# Log2 transformation
expr_log = np.log2(expr + 1)
# Z-score normalization
expr_norm = (expr_log - expr_log.mean()) / expr_log.std()
# Save
expr_norm.to_csv('expression_normalized.txt', sep='\t')# Filter BioGRID by experimental evidence
biogrid = pd.read_csv('biogrid_interactions.txt', sep='\t')
# Keep only high-confidence methods
high_conf = ['Two-hybrid', 'Affinity Capture-MS', 'Co-crystal Structure']
biogrid_filt = biogrid[biogrid['interaction_type'].isin(high_conf)]
biogrid_filt.to_csv('biogrid_filtered.txt', sep='\t', index=False)If you use this pipeline, please cite:
-
DiCE Paper: Pashaei et al. (2025). DiCE: differential centrality-ensemble analysis based on gene expression profiles and protein–protein interaction network. Nucleic Acids Research, 53, gkaf609.
-
Duan Algorithm: Duan et al. (2025). Faster Shortest Paths in Graphs. STOC 2025.
MIT License - See LICENSE file for details
For questions or issues:
- Open an issue on GitHub
- Email: [your-email]
- KEGG Database
- BioGRID Database
- GTEx Consortium
- TCGA Research Network