DiCE-Duan Project: Differential Centrality Analysis with Duan 2025 SSSP Algorithm

A complete implementation combining the DiCE (Differential Centrality-Ensemble) methodology with the state-of-the-art Duan et al. 2025 shortest path algorithm for identifying disease-associated genes through network analysis.

Overview

This project implements a computational pipeline that:

Integrates multi-source protein-protein interaction (PPI) networks from KEGG and BioGRID
Weights networks using gene expression correlations (DiCE methodology)
Computes network centralities efficiently using the Duan 2025 algorithm
Identifies disease-associated genes through differential centrality analysis
Performs virtual knockout experiments to assess gene essentiality

Key Features

Fast SSSP computation: O(m log^(2/3) n) complexity vs O(m + n log n) for Dijkstra
Directed network support: Proper handling of signaling pathways
Expression-weighted edges: Correlation-based weights capture functional relationships
Differential analysis: Compares normal vs disease conditions
Ensemble ranking: Combines multiple centrality measures
Virtual knockout: Simulates gene perturbations

Project Structure

dice_duan_project/
├── data/
│   ├── raw/
│   │   ├── kegg/              # KEGG pathway XML files
│   │   ├── biogrid/           # BioGRID TAB2 file
│   │   └── expression/        # Gene expression matrices
│   ├── processed/
│   │   ├── networks/          # Parsed and merged networks
│   │   └── weighted/          # Expression-weighted networks
│   └── results/               # Analysis outputs
├── src/
│   ├── cpp/                   # C++ Duan 2025 implementation
│   │   ├── duan_engine.hpp    # Core algorithm
│   │   └── main.cpp           # Analysis pipeline
│   └── python/                # Python preprocessing
│       ├── parse_kegg.py
│       ├── parse_biogrid.py
│       ├── merge_networks.py
│       ├── weight_network.py
│       └── differential_centrality.py
├── CMakeLists.txt             # C++ build configuration
├── requirements.txt           # Python dependencies
├── run_pipeline.sh            # Main execution script
└── README.md                  # This file

Installation

Requirements

Python: 3.8+
C++ Compiler: g++ 9+ or clang++ 10+ (C++17 support)
CMake: 3.10+
Git: For version control

Step 1: Clone the repository

git clone <your-repo-url>
cd dice_duan_project

Step 2: Install Python dependencies

pip install -r requirements.txt

Step 3: Build C++ engine

mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc)
cd ..

Data Acquisition

1. KEGG Pathways (Required)

Download from: https://www.genome.jp/kegg/pathway.html

Instructions:

Go to KEGG Pathway Database
Select organism: "Homo sapiens (human)"
Download KGML (XML) files for pathways of interest
Place XML files in: data/raw/kegg/

Alternative: Use KEGG REST API

# Example: Download all human pathways
mkdir -p data/raw/kegg
for pathway in hsa04010 hsa04020 hsa04110 hsa04151; do
    wget "https://rest.kegg.jp/get/$pathway/kgml" -O "data/raw/kegg/${pathway}.xml"
done

Recommended pathways for cancer research:

hsa04010: MAPK signaling
hsa04110: Cell cycle
hsa04151: PI3K-Akt signaling
hsa04510: Focal adhesion
hsa04520: Adherens junction

2. BioGRID (Required)

Download from: https://downloads.thebiogrid.org/BioGRID/Release-Archive/

Current version: BIOGRID-ALL-5.0.253.tab2.txt

Instructions:

mkdir -p data/raw/biogrid
cd data/raw/biogrid

# Download latest BioGRID
wget "https://downloads.thebiogrid.org/Download/BioGRID/Latest-Release/BIOGRID-ALL-LATEST.tab2.zip"
unzip BIOGRID-ALL-LATEST.tab2.zip

cd ../../..

File location: data/raw/biogrid/BIOGRID-ALL-5.0.253.tab2.txt

3. Gene Expression Data (Required)

You need expression matrices for two conditions (e.g., Normal vs Tumor).

Option A: GTEx (Normal tissue)

Download from: https://gtexportal.org/home/downloads/adult-gtex

Files needed:

GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz

Instructions:

mkdir -p data/raw/expression
cd data/raw/expression

# Download GTEx (requires registration)
# Manual download from: https://gtexportal.org/home/downloads/adult-gtex

# Or use provided link (if you have access)
wget "https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz"

gunzip GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz

cd ../../..

Extract specific tissue (e.g., Prostate):

import pandas as pd

# Load GTEx
gtex = pd.read_csv('data/raw/expression/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct', 
                   sep='\t', skiprows=2, index_col=1)
gtex = gtex.iloc[:, 1:]  # Remove Description column

# Filter for prostate samples
prostate_cols = [col for col in gtex.columns if 'Prostate' in col]
prostate_normal = gtex[prostate_cols]

# Save
prostate_normal.to_csv('data/raw/expression/expression_normal.txt', sep='\t')

Option B: TCGA (Tumor tissue)

Download from: https://portal.gdc.cancer.gov/ or https://xenabrowser.net/

Using UCSC Xena (Recommended):

# Install UCSC Xena tools
pip install xenaPython

# Download TCGA data (Python)
python3 << 'EOF'
import xenaPython as xena

# List available datasets
hub = "https://tcga.xenahubs.net"

# Download TCGA-PRAD (Prostate Cancer) expression
dataset = "TCGA.PRAD.sampleMap/HiSeqV2"
samples = xena.dataset_samples(hub, dataset)
genes = xena.dataset_field(hub, dataset)

# Fetch expression matrix
import pandas as pd
expr_data = xena.dataset_gene_expression(hub, dataset, genes, samples)

df = pd.DataFrame(expr_data, index=genes, columns=samples)
df.to_csv('data/raw/expression/expression_tumor.txt', sep='\t')

print(f"Downloaded {len(genes)} genes x {len(samples)} samples")
EOF

Alternative - Manual download:

Go to: https://xenabrowser.net/datapages/
Select: "TCGA Pan-Cancer (PANCAN)"
Download: "gene expression RNAseq - IlluminaHiSeq"
Filter for your cancer type (e.g., TCGA-PRAD for prostate)

Option C: GEO (Custom datasets)

Search: https://www.ncbi.nlm.nih.gov/geo/

Example: GSE21032 (Prostate cancer metastasis study)

# Install GEOparse
pip install GEOparse

# Download dataset
python3 << 'EOF'
import GEOparse

gse = GEOparse.get_GEO("GSE21032", destdir="./data/raw/expression/")
# Process and save as needed
EOF

Expression File Format

Required format: Tab-separated text file

        Sample1    Sample2    Sample3    ...
GENE1   12.5       14.2       11.8      ...
GENE2   8.3        9.1        7.9       ...
GENE3   15.6       16.2       14.9      ...
...

Rows: Genes (gene symbols or Entrez IDs)
Columns: Samples
Values: Expression levels (TPM, FPKM, or normalized counts)

Usage

Quick Start - Test with Dummy Data

# Generate dummy data and run full pipeline
chmod +x run_pipeline.sh
./run_pipeline.sh --test

This will:

Generate synthetic network and expression data
Run the complete DiCE-Duan pipeline
Produce results in ~2-3 minutes

Full Pipeline - Real Data

# Ensure data is in place:
# - data/raw/kegg/*.xml
# - data/raw/biogrid/BIOGRID-ALL-5.0.253.tab2.txt
# - data/raw/expression/expression_normal.txt
# - data/raw/expression/expression_tumor.txt

# Run pipeline
./run_pipeline.sh

Step-by-Step Execution

If you prefer to run each phase separately:

# Phase 1: Parse KEGG
python3 src/python/parse_kegg.py \
    --kegg-dir data/raw/kegg \
    --output data/processed/networks/kegg_interactions.txt

# Phase 2: Parse BioGRID
python3 src/python/parse_biogrid.py \
    --biogrid-file data/raw/biogrid/BIOGRID-ALL-5.0.253.tab2.txt \
    --output data/processed/networks/biogrid_interactions.txt

# Phase 3: Merge networks
python3 src/python/merge_networks.py \
    --kegg data/processed/networks/kegg_interactions.txt \
    --biogrid data/processed/networks/biogrid_interactions.txt \
    --output data/processed/networks/merged_network.txt

# Phase 4: Weight with expression (Normal)
python3 src/python/weight_network.py \
    --network data/processed/networks/merged_network.txt \
    --expression data/raw/expression/expression_normal.txt \
    --output data/processed/weighted/network_normal.txt

# Phase 5: Weight with expression (Tumor)
python3 src/python/weight_network.py \
    --network data/processed/networks/merged_network.txt \
    --expression data/raw/expression/expression_tumor.txt \
    --output data/processed/weighted/network_tumor.txt

# Phase 6: Compute centralities (C++)
./build/dice_analyzer \
    data/processed/weighted/network_normal_int.txt \
    data/processed/weighted/network_tumor_int.txt \
    data/results

# Phase 7: Differential analysis
python3 src/python/differential_centrality.py \
    --normal data/results/centrality_normal.txt \
    --tumor data/results/centrality_tumor.txt \
    --output data/results/dice_genes.txt

Output Files

Main Results

dice_genes.txt: Complete ranked list of DiCE genes
- Columns: gene, betweenness (normal/tumor), eigenvector (normal/tumor), delta values, ranks, ensemble score
dice_genes_top500.txt: Top 500 DiCE genes for downstream analysis
centrality_normal.txt: Network centralities in normal condition
centrality_tumor.txt: Network centralities in tumor condition
knockout_analysis.txt: Virtual knockout vitality scores

Interpretation

High ensemble score = Gene with large centrality changes between conditions

Potential disease driver or biomarker
Candidate for experimental validation

High vitality score = Gene whose removal disrupts network connectivity

Essential "bottleneck" protein
Potential therapeutic target

Visualization

Cytoscape Import

Open Cytoscape
Import Network: data/processed/weighted/network_tumor.txt
Import Node Attributes: data/results/dice_genes_top500.txt
Create visual mappings:
- Node size → ensemble_score
- Node color → delta_betweenness
- Edge width → weight

Python Visualization

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load DiCE genes
dice = pd.read_csv('data/results/dice_genes.txt', sep='\t')

# Plot top 20
top20 = dice.head(20)

plt.figure(figsize=(10, 8))
plt.barh(range(len(top20)), top20['ensemble_score'])
plt.yticks(range(len(top20)), top20['gene'])
plt.xlabel('Ensemble Score')
plt.title('Top 20 DiCE Genes')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.savefig('top_dice_genes.png', dpi=300)

Performance

Benchmarks (1000 genes, 10,000 edges)

Algorithm	SSSP Time	Betweenness Time
Dijkstra	45 ms	45 seconds
Duan 2025	8 ms	8 seconds

Speedup: ~5-6x for typical PPI networks

Memory Usage

Peak memory: ~500 MB for 5000-gene network
Scales linearly with network size

Troubleshooting

Issue: "Cannot open file"

Solution: Check file paths in run_pipeline.sh
Ensure data files exist in correct locations

Issue: "C++ compilation failed"

Solution: Check compiler version

g++ --version  # Should be 9.0 or higher
cmake --version  # Should be 3.10 or higher

Issue: "Module not found" (Python)

Solution: Install missing dependencies

pip install -r requirements.txt

Issue: "Out of memory"

Solution: Reduce network size or increase RAM
Consider filtering to high-confidence interactions

Advanced Usage

Custom Expression Preprocessing

If your expression data needs normalization:

import pandas as pd
import numpy as np

# Load raw counts
expr = pd.read_csv('raw_counts.txt', sep='\t', index_col=0)

# Log2 transformation
expr_log = np.log2(expr + 1)

# Z-score normalization
expr_norm = (expr_log - expr_log.mean()) / expr_log.std()

# Save
expr_norm.to_csv('expression_normalized.txt', sep='\t')

Filtering Low-Quality Interactions

# Filter BioGRID by experimental evidence
biogrid = pd.read_csv('biogrid_interactions.txt', sep='\t')

# Keep only high-confidence methods
high_conf = ['Two-hybrid', 'Affinity Capture-MS', 'Co-crystal Structure']
biogrid_filt = biogrid[biogrid['interaction_type'].isin(high_conf)]

biogrid_filt.to_csv('biogrid_filtered.txt', sep='\t', index=False)

Citation

If you use this pipeline, please cite:

DiCE Paper: Pashaei et al. (2025). DiCE: differential centrality-ensemble analysis based on gene expression profiles and protein–protein interaction network. Nucleic Acids Research, 53, gkaf609.
Duan Algorithm: Duan et al. (2025). Faster Shortest Paths in Graphs. STOC 2025.

License

MIT License - See LICENSE file for details

Contact

For questions or issues:

Open an issue on GitHub
Email: [your-email]

Acknowledgments

KEGG Database
BioGRID Database
GTEx Consortium
TCGA Research Network

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
data		data
figures		figures
lib		lib
src		src
web		web
.gitattributes		.gitattributes
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
PYTHON_README.md		PYTHON_README.md
README.md		README.md
TECHNICAL_DOCUMENTATION.md		TECHNICAL_DOCUMENTATION.md
base.pdf		base.pdf
config.yaml		config.yaml
dual_dataset_analysis.py		dual_dataset_analysis.py
fitness_analysis.py		fitness_analysis.py
kaplen.png		kaplen.png
post_analysis.py		post_analysis.py
requirements.txt		requirements.txt
run_pipeline.sh		run_pipeline.sh
survival_validation.py		survival_validation.py
validation_pipeline.py		validation_pipeline.py

Folders and files

Latest commit

History

Repository files navigation

DiCE-Duan Project: Differential Centrality Analysis with Duan 2025 SSSP Algorithm

Overview

Key Features

Project Structure

Installation

Requirements

Step 1: Clone the repository

Step 2: Install Python dependencies

Step 3: Build C++ engine

Data Acquisition

1. KEGG Pathways (Required)

2. BioGRID (Required)

3. Gene Expression Data (Required)

Option A: GTEx (Normal tissue)

Option B: TCGA (Tumor tissue)

Option C: GEO (Custom datasets)

Expression File Format

Usage

Quick Start - Test with Dummy Data

Full Pipeline - Real Data

Step-by-Step Execution

Output Files

Main Results

Interpretation

Visualization

Cytoscape Import

Python Visualization

Performance

Benchmarks (1000 genes, 10,000 edges)

Memory Usage

Troubleshooting

Issue: "Cannot open file"

Issue: "C++ compilation failed"

Issue: "Module not found" (Python)

Issue: "Out of memory"

Advanced Usage

Custom Expression Preprocessing

Filtering Low-Quality Interactions

Citation

License

Contact

Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages