Skip to content

murugnn/DICE-Flow

Repository files navigation

DiCE-Duan Project: Differential Centrality Analysis with Duan 2025 SSSP Algorithm

A complete implementation combining the DiCE (Differential Centrality-Ensemble) methodology with the state-of-the-art Duan et al. 2025 shortest path algorithm for identifying disease-associated genes through network analysis.

Overview

This project implements a computational pipeline that:

  1. Integrates multi-source protein-protein interaction (PPI) networks from KEGG and BioGRID
  2. Weights networks using gene expression correlations (DiCE methodology)
  3. Computes network centralities efficiently using the Duan 2025 algorithm
  4. Identifies disease-associated genes through differential centrality analysis
  5. Performs virtual knockout experiments to assess gene essentiality

Key Features

  • Fast SSSP computation: O(m log^(2/3) n) complexity vs O(m + n log n) for Dijkstra
  • Directed network support: Proper handling of signaling pathways
  • Expression-weighted edges: Correlation-based weights capture functional relationships
  • Differential analysis: Compares normal vs disease conditions
  • Ensemble ranking: Combines multiple centrality measures
  • Virtual knockout: Simulates gene perturbations

Project Structure

dice_duan_project/
├── data/
│   ├── raw/
│   │   ├── kegg/              # KEGG pathway XML files
│   │   ├── biogrid/           # BioGRID TAB2 file
│   │   └── expression/        # Gene expression matrices
│   ├── processed/
│   │   ├── networks/          # Parsed and merged networks
│   │   └── weighted/          # Expression-weighted networks
│   └── results/               # Analysis outputs
├── src/
│   ├── cpp/                   # C++ Duan 2025 implementation
│   │   ├── duan_engine.hpp    # Core algorithm
│   │   └── main.cpp           # Analysis pipeline
│   └── python/                # Python preprocessing
│       ├── parse_kegg.py
│       ├── parse_biogrid.py
│       ├── merge_networks.py
│       ├── weight_network.py
│       └── differential_centrality.py
├── CMakeLists.txt             # C++ build configuration
├── requirements.txt           # Python dependencies
├── run_pipeline.sh            # Main execution script
└── README.md                  # This file

Installation

Requirements

  • Python: 3.8+
  • C++ Compiler: g++ 9+ or clang++ 10+ (C++17 support)
  • CMake: 3.10+
  • Git: For version control

Step 1: Clone the repository

git clone <your-repo-url>
cd dice_duan_project

Step 2: Install Python dependencies

pip install -r requirements.txt

Step 3: Build C++ engine

mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j$(nproc)
cd ..

Data Acquisition

1. KEGG Pathways (Required)

Download from: https://www.genome.jp/kegg/pathway.html

Instructions:

  1. Go to KEGG Pathway Database
  2. Select organism: "Homo sapiens (human)"
  3. Download KGML (XML) files for pathways of interest
  4. Place XML files in: data/raw/kegg/

Alternative: Use KEGG REST API

# Example: Download all human pathways
mkdir -p data/raw/kegg
for pathway in hsa04010 hsa04020 hsa04110 hsa04151; do
    wget "https://rest.kegg.jp/get/$pathway/kgml" -O "data/raw/kegg/${pathway}.xml"
done

Recommended pathways for cancer research:

  • hsa04010: MAPK signaling
  • hsa04110: Cell cycle
  • hsa04151: PI3K-Akt signaling
  • hsa04510: Focal adhesion
  • hsa04520: Adherens junction

2. BioGRID (Required)

Download from: https://downloads.thebiogrid.org/BioGRID/Release-Archive/

Current version: BIOGRID-ALL-5.0.253.tab2.txt

Instructions:

mkdir -p data/raw/biogrid
cd data/raw/biogrid

# Download latest BioGRID
wget "https://downloads.thebiogrid.org/Download/BioGRID/Latest-Release/BIOGRID-ALL-LATEST.tab2.zip"
unzip BIOGRID-ALL-LATEST.tab2.zip

cd ../../..

File location: data/raw/biogrid/BIOGRID-ALL-5.0.253.tab2.txt

3. Gene Expression Data (Required)

You need expression matrices for two conditions (e.g., Normal vs Tumor).

Option A: GTEx (Normal tissue)

Download from: https://gtexportal.org/home/downloads/adult-gtex

Files needed:

  • GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz

Instructions:

mkdir -p data/raw/expression
cd data/raw/expression

# Download GTEx (requires registration)
# Manual download from: https://gtexportal.org/home/downloads/adult-gtex

# Or use provided link (if you have access)
wget "https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz"

gunzip GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz

cd ../../..

Extract specific tissue (e.g., Prostate):

import pandas as pd

# Load GTEx
gtex = pd.read_csv('data/raw/expression/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct', 
                   sep='\t', skiprows=2, index_col=1)
gtex = gtex.iloc[:, 1:]  # Remove Description column

# Filter for prostate samples
prostate_cols = [col for col in gtex.columns if 'Prostate' in col]
prostate_normal = gtex[prostate_cols]

# Save
prostate_normal.to_csv('data/raw/expression/expression_normal.txt', sep='\t')

Option B: TCGA (Tumor tissue)

Download from: https://portal.gdc.cancer.gov/ or https://xenabrowser.net/

Using UCSC Xena (Recommended):

# Install UCSC Xena tools
pip install xenaPython

# Download TCGA data (Python)
python3 << 'EOF'
import xenaPython as xena

# List available datasets
hub = "https://tcga.xenahubs.net"

# Download TCGA-PRAD (Prostate Cancer) expression
dataset = "TCGA.PRAD.sampleMap/HiSeqV2"
samples = xena.dataset_samples(hub, dataset)
genes = xena.dataset_field(hub, dataset)

# Fetch expression matrix
import pandas as pd
expr_data = xena.dataset_gene_expression(hub, dataset, genes, samples)

df = pd.DataFrame(expr_data, index=genes, columns=samples)
df.to_csv('data/raw/expression/expression_tumor.txt', sep='\t')

print(f"Downloaded {len(genes)} genes x {len(samples)} samples")
EOF

Alternative - Manual download:

  1. Go to: https://xenabrowser.net/datapages/
  2. Select: "TCGA Pan-Cancer (PANCAN)"
  3. Download: "gene expression RNAseq - IlluminaHiSeq"
  4. Filter for your cancer type (e.g., TCGA-PRAD for prostate)

Option C: GEO (Custom datasets)

Search: https://www.ncbi.nlm.nih.gov/geo/

Example: GSE21032 (Prostate cancer metastasis study)

# Install GEOparse
pip install GEOparse

# Download dataset
python3 << 'EOF'
import GEOparse

gse = GEOparse.get_GEO("GSE21032", destdir="./data/raw/expression/")
# Process and save as needed
EOF

Expression File Format

Required format: Tab-separated text file

        Sample1    Sample2    Sample3    ...
GENE1   12.5       14.2       11.8      ...
GENE2   8.3        9.1        7.9       ...
GENE3   15.6       16.2       14.9      ...
...
  • Rows: Genes (gene symbols or Entrez IDs)
  • Columns: Samples
  • Values: Expression levels (TPM, FPKM, or normalized counts)

Usage

Quick Start - Test with Dummy Data

# Generate dummy data and run full pipeline
chmod +x run_pipeline.sh
./run_pipeline.sh --test

This will:

  1. Generate synthetic network and expression data
  2. Run the complete DiCE-Duan pipeline
  3. Produce results in ~2-3 minutes

Full Pipeline - Real Data

# Ensure data is in place:
# - data/raw/kegg/*.xml
# - data/raw/biogrid/BIOGRID-ALL-5.0.253.tab2.txt
# - data/raw/expression/expression_normal.txt
# - data/raw/expression/expression_tumor.txt

# Run pipeline
./run_pipeline.sh

Step-by-Step Execution

If you prefer to run each phase separately:

# Phase 1: Parse KEGG
python3 src/python/parse_kegg.py \
    --kegg-dir data/raw/kegg \
    --output data/processed/networks/kegg_interactions.txt

# Phase 2: Parse BioGRID
python3 src/python/parse_biogrid.py \
    --biogrid-file data/raw/biogrid/BIOGRID-ALL-5.0.253.tab2.txt \
    --output data/processed/networks/biogrid_interactions.txt

# Phase 3: Merge networks
python3 src/python/merge_networks.py \
    --kegg data/processed/networks/kegg_interactions.txt \
    --biogrid data/processed/networks/biogrid_interactions.txt \
    --output data/processed/networks/merged_network.txt

# Phase 4: Weight with expression (Normal)
python3 src/python/weight_network.py \
    --network data/processed/networks/merged_network.txt \
    --expression data/raw/expression/expression_normal.txt \
    --output data/processed/weighted/network_normal.txt

# Phase 5: Weight with expression (Tumor)
python3 src/python/weight_network.py \
    --network data/processed/networks/merged_network.txt \
    --expression data/raw/expression/expression_tumor.txt \
    --output data/processed/weighted/network_tumor.txt

# Phase 6: Compute centralities (C++)
./build/dice_analyzer \
    data/processed/weighted/network_normal_int.txt \
    data/processed/weighted/network_tumor_int.txt \
    data/results

# Phase 7: Differential analysis
python3 src/python/differential_centrality.py \
    --normal data/results/centrality_normal.txt \
    --tumor data/results/centrality_tumor.txt \
    --output data/results/dice_genes.txt

Output Files

Main Results

  1. dice_genes.txt: Complete ranked list of DiCE genes

    • Columns: gene, betweenness (normal/tumor), eigenvector (normal/tumor), delta values, ranks, ensemble score
  2. dice_genes_top500.txt: Top 500 DiCE genes for downstream analysis

  3. centrality_normal.txt: Network centralities in normal condition

  4. centrality_tumor.txt: Network centralities in tumor condition

  5. knockout_analysis.txt: Virtual knockout vitality scores

Interpretation

High ensemble score = Gene with large centrality changes between conditions

  • Potential disease driver or biomarker
  • Candidate for experimental validation

High vitality score = Gene whose removal disrupts network connectivity

  • Essential "bottleneck" protein
  • Potential therapeutic target

Visualization

Cytoscape Import

  1. Open Cytoscape
  2. Import Network: data/processed/weighted/network_tumor.txt
  3. Import Node Attributes: data/results/dice_genes_top500.txt
  4. Create visual mappings:
    • Node size → ensemble_score
    • Node color → delta_betweenness
    • Edge width → weight

Python Visualization

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load DiCE genes
dice = pd.read_csv('data/results/dice_genes.txt', sep='\t')

# Plot top 20
top20 = dice.head(20)

plt.figure(figsize=(10, 8))
plt.barh(range(len(top20)), top20['ensemble_score'])
plt.yticks(range(len(top20)), top20['gene'])
plt.xlabel('Ensemble Score')
plt.title('Top 20 DiCE Genes')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.savefig('top_dice_genes.png', dpi=300)

Performance

Benchmarks (1000 genes, 10,000 edges)

Algorithm SSSP Time Betweenness Time
Dijkstra 45 ms 45 seconds
Duan 2025 8 ms 8 seconds

Speedup: ~5-6x for typical PPI networks

Memory Usage

  • Peak memory: ~500 MB for 5000-gene network
  • Scales linearly with network size

Troubleshooting

Issue: "Cannot open file"

  • Solution: Check file paths in run_pipeline.sh
  • Ensure data files exist in correct locations

Issue: "C++ compilation failed"

  • Solution: Check compiler version
g++ --version  # Should be 9.0 or higher
cmake --version  # Should be 3.10 or higher

Issue: "Module not found" (Python)

  • Solution: Install missing dependencies
pip install -r requirements.txt

Issue: "Out of memory"

  • Solution: Reduce network size or increase RAM
  • Consider filtering to high-confidence interactions

Advanced Usage

Custom Expression Preprocessing

If your expression data needs normalization:

import pandas as pd
import numpy as np

# Load raw counts
expr = pd.read_csv('raw_counts.txt', sep='\t', index_col=0)

# Log2 transformation
expr_log = np.log2(expr + 1)

# Z-score normalization
expr_norm = (expr_log - expr_log.mean()) / expr_log.std()

# Save
expr_norm.to_csv('expression_normalized.txt', sep='\t')

Filtering Low-Quality Interactions

# Filter BioGRID by experimental evidence
biogrid = pd.read_csv('biogrid_interactions.txt', sep='\t')

# Keep only high-confidence methods
high_conf = ['Two-hybrid', 'Affinity Capture-MS', 'Co-crystal Structure']
biogrid_filt = biogrid[biogrid['interaction_type'].isin(high_conf)]

biogrid_filt.to_csv('biogrid_filtered.txt', sep='\t', index=False)

Citation

If you use this pipeline, please cite:

  1. DiCE Paper: Pashaei et al. (2025). DiCE: differential centrality-ensemble analysis based on gene expression profiles and protein–protein interaction network. Nucleic Acids Research, 53, gkaf609.

  2. Duan Algorithm: Duan et al. (2025). Faster Shortest Paths in Graphs. STOC 2025.

License

MIT License - See LICENSE file for details

Contact

For questions or issues:

  • Open an issue on GitHub
  • Email: [your-email]

Acknowledgments

  • KEGG Database
  • BioGRID Database
  • GTEx Consortium
  • TCGA Research Network

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors