ComorbidNet — Correlated Multi-Disease Detection

Detecting co-occurring diseases that share biomarkers — a harder problem than standard disease prediction

Analysis Graphs

Visualisation scripts are in graphs/. Run from the repo root:

python graphs/disease_cooccurrence.py   # Co-occurrence heatmap + prevalence by disease
python graphs/shap_summary.py           # SHAP feature importance per disease label
python graphs/model_performance.py      # Per-disease precision / recall / F1

Script	What it shows
`disease_cooccurrence.py`	Phi-correlation heatmap between T2D/HTN/MetS/CKD + prevalence bar chart
`shap_summary.py`	Grouped bar chart of mean absolute SHAP value per feature, for each disease label
`model_performance.py`	Side-by-side precision/recall/F1 bars + score profile lines per disease

The Problem

Standard disease prediction treats each disease independently:

Biomarkers → Model_A → "Has Diabetes? Yes/No"
Biomarkers → Model_B → "Has Hypertension? Yes/No"

This fails in practice. Comorbid diseases share biomarkers (confounders):

Biomarker	Diabetes	Hypertension	Metabolic Syndrome	CKD
Glucose / HbA1c	Primary	—	Contributing	Consequence
Blood Pressure	—	Primary	Contributing	Cause
BMI / Waist	Risk	Risk	Primary	Risk
Creatinine / eGFR	—	Effect	—	Primary
Triglycerides / HDL	Risk	—	Primary	—

A model trained only on Diabetes data will misattribute shared signals from Hypertension or Metabolic Syndrome — producing spurious feature importances and miscalibrated risk scores.

Why This Is Harder Than Normal Disease Prediction

Challenge	Standard Prediction	ComorbidNet
Feature space	Clean, independent	High VIF (multicollinear biomarkers)
Label space	Binary (0/1)	Multi-label with correlated outputs
Model output	One probability	Joint probability over 4 diseases
Failure mode	Low AUC	Correct AUC, wrong attribution
Clinical risk	Missed diagnosis	Wrong disease blamed for the signal

Architecture

Patient Biomarkers (13 features)
        │
        ▼
┌─────────────────────────┐
│  Feature Correlation    │
│  Analysis (VIF + SHAP)  │
│  → Identify confounders │
└──────────┬──────────────┘
           │
           ▼
┌──────────────────────────────────────────────┐
│         Classifier Chain XGBoost             │
│                                              │
│  [T2D] → prediction fed as feature →        │
│  [HTN] → prediction fed as feature →        │
│  [MetS] → prediction fed as feature →       │
│  [CKD]  → final prediction                  │
│                                              │
│  Order: metabolic cascade (T2D → CKD)       │
└──────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────┐
│  SHAP Interaction       │
│  Explainability         │
│  → Per-disease drivers  │
└─────────────────────────┘

The Key Insight: Label Correlation

Disease Correlation Matrix (why independent models fail):

        T2D    HTN    MetS   CKD
T2D   1.000  0.312  0.478  0.201
HTN   0.312  1.000  0.267  0.389
MetS  0.478  0.267  1.000  0.143
CKD   0.201  0.389  0.143  1.000

Ignoring these correlations means your Diabetes model "learns" some Hypertension signal. Classifier Chains explicitly model this by passing each disease prediction as a feature to the next model in the chain.
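A matrix like the one above can be computed directly from the binary label columns. The sketch below is a minimal illustration and is not taken from the repository: the function name `phi_matrix` and the toy label matrix are assumptions. It relies on the fact that, for 0/1 variables, the phi coefficient equals the Pearson correlation.

```python
import numpy as np

LABELS = ["T2D", "HTN", "MetS", "CKD"]

def phi_matrix(Y):
    """Pairwise phi coefficients between binary label columns.

    For 0/1 variables the phi coefficient is identical to the Pearson
    correlation, so np.corrcoef on the label matrix is sufficient.
    """
    Y = np.asarray(Y, dtype=float)
    return np.corrcoef(Y, rowvar=False)

# Toy label matrix (made-up values, NOT the project's cohort).
# Rows are patients, columns follow LABELS.
Y_toy = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
])

phi = phi_matrix(Y_toy)
for name, row in zip(LABELS, phi):
    print(f"{name:5s}" + "  ".join(f"{v:6.3f}" for v in row))
```

Off-diagonal values well above zero indicate labels that tend to co-occur, which is the structure independent per-disease models cannot use.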

Setup

git clone https://github.com/Hridambiswas/hello.git
cd hello
pip install -r requirements.txt
python main.py

Results

Disease	Baseline AUC	ComorbidNet AUC	Gain
Type 2 Diabetes (T2D)	~0.91	~0.93	+0.02
Hypertension (HTN)	~0.87	~0.90	+0.03
Metabolic Syndrome (MetS)	~0.89	~0.91	+0.02
Chronic Kidney Disease (CKD)	~0.84	~0.87	+0.03

Hamming Loss: 0.09 → 0.07 (lower is better)
Subset Accuracy: 0.61 → 0.67 (exact match across all 4 diseases)
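For readers unfamiliar with the two multi-label metrics above, here is a minimal sketch of how they are defined. The helper names and the toy arrays are illustrative assumptions, not the project's evaluation code or data.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Fraction of individual (patient, disease) slots predicted incorrectly."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

def subset_accuracy(y_true, y_pred):
    """Fraction of patients whose labels are ALL predicted correctly."""
    exact = np.all(np.asarray(y_true) == np.asarray(y_pred), axis=1)
    return float(np.mean(exact))

# Toy example: 4 patients x 4 diseases (T2D, HTN, MetS, CKD), made-up values.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 1, 1, 0],
                   [0, 0, 0, 0]])
y_pred = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],   # misses CKD
                   [1, 0, 1, 0],   # misses HTN
                   [0, 0, 0, 0]])

print(hamming_loss(y_true, y_pred))     # 2 wrong slots out of 16 -> 0.125
print(subset_accuracy(y_true, y_pred))  # 2 exact rows out of 4  -> 0.5
```

Subset accuracy is the stricter of the two: a single wrong label makes the whole patient row count as a miss.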

Project Structure

comorbidnet/
│
├── main.py              # Full pipeline — run this
├── generate_data.py     # Synthetic patient cohort with realistic biomarker correlations
├── requirements.txt
│
├── graphs/
│   ├── disease_cooccurrence.py
│   ├── shap_summary.py
│   └── model_performance.py
│
└── outputs/
    └── shap_t2d.png     # SHAP feature importance for T2D (generated on run)

Methodology

1. Synthetic Cohort Generation

Real patient data is protected under HIPAA/DISHA. We generate a 2,000-patient cohort using clinically realistic latent variable models:

Latent factors:
  insulin_resistance → T2D, MetS
  vascular_stress    → HTN, CKD
  obesity_factor     → all four diseases

Observable biomarkers emerge from these latent factors + clinical noise.
Disease labels are probabilistic (sigmoid of clinical thresholds).
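The latent-factor idea can be sketched as follows. This is a simplified illustration, not the contents of generate_data.py: the function `make_cohort`, the three biomarkers, and every loading, offset, and noise level are assumptions chosen only to show the mechanism. Labels are drawn from a sigmoid of latent-factor scores rather than from clinical thresholds.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_cohort(n=2000, seed=0):
    """Toy cohort: 3 latent factors -> 3 biomarkers and 4 disease labels."""
    rng = np.random.default_rng(seed)

    # Latent factors named in the text above.
    insulin_resistance = rng.normal(size=n)
    vascular_stress = rng.normal(size=n)
    obesity_factor = rng.normal(size=n)

    # Observable biomarkers = latent factors + noise (all coefficients assumed).
    glucose = 100 + 15 * insulin_resistance + 5 * obesity_factor + rng.normal(0, 5, n)
    systolic_bp = 125 + 12 * vascular_stress + 4 * obesity_factor + rng.normal(0, 6, n)
    bmi = 27 + 4 * obesity_factor + rng.normal(0, 1.5, n)
    X = np.column_stack([glucose, systolic_bp, bmi])

    # Disease logits follow the factor-to-disease mapping in the text.
    logits = np.column_stack([
        1.5 * insulin_resistance + 0.5 * obesity_factor - 1.5,  # T2D
        1.5 * vascular_stress + 0.5 * obesity_factor - 1.0,     # HTN
        1.0 * insulin_resistance + 1.0 * obesity_factor - 1.2,  # MetS
        1.2 * vascular_stress + 0.4 * obesity_factor - 2.0,     # CKD
    ])

    # Probabilistic labels: Bernoulli draw with probability sigmoid(logit).
    Y = (rng.random((n, 4)) < sigmoid(logits)).astype(int)
    return X, Y

X, Y = make_cohort()
print(X.shape, Y.shape, Y.mean(axis=0))  # cohort size and per-disease prevalence
```

Because T2D and MetS share `insulin_resistance`, and all four labels share `obesity_factor`, the generated labels are correlated even though each is sampled separately.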

2. Multicollinearity Quantification (VIF)

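The Variance Inflation Factor (VIF) measures how well each feature can be linearly predicted from all the others: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. Features with VIF > 5 are commonly treated as problematic for naive classifiers. Our biomarkers show VIF > 8 for glucose/HbA1c, which confirms the need for correlation-aware models.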
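A minimal NumPy sketch of a VIF calculation is shown below, assuming the usual definition (one minus the R-squared of regressing a feature on all the others, inverted). The function `vif` and the simulated glucose/HbA1c/BMI columns are illustrative assumptions, not the project's code or data.

```python
import numpy as np

def vif(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Simulated features (assumed values): HbA1c is built to track glucose closely,
# BMI is independent of both.
rng = np.random.default_rng(0)
glucose = rng.normal(100, 15, 500)
hba1c = 0.03 * glucose + rng.normal(2.5, 0.15, 500)
bmi = rng.normal(27, 4, 500)

vifs = vif(np.column_stack([glucose, hba1c, bmi]))
for name, v in zip(["glucose", "hba1c", "bmi"], vifs):
    print(f"{name:8s} VIF = {v:.2f}")
```

In this toy setup the two glycaemic markers inflate each other's VIF while BMI stays close to 1, which is the pattern the VIF > 5 rule of thumb is meant to flag.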

3. Classifier Chains

Unlike MultiOutputClassifier (independent models), ClassifierChain propagates predictions:

XGB(T2D | biomarkers)
→ XGB(HTN | biomarkers + T2D_pred)
→ XGB(MetS | biomarkers + T2D_pred + HTN_pred)
→ XGB(CKD  | biomarkers + T2D_pred + HTN_pred + MetS_pred)

Chain order follows the metabolic cascade — insulin resistance (T2D) causes vascular damage (HTN), leading to metabolic disruption (MetS) and eventually kidney damage (CKD).
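The chain above can be sketched with scikit-learn's ClassifierChain. This is an illustration under stated assumptions, not the code in main.py: the data is random stand-in data with 13 features, and GradientBoostingClassifier is used as a stand-in base learner so the example depends only on scikit-learn. An XGBoost classifier with the scikit-learn interface could presumably be dropped in the same way.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import ClassifierChain

LABELS = ["T2D", "HTN", "MetS", "CKD"]

# Random stand-in data (assumed): 400 patients, 13 features, 4 correlated labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 13))
signal = X[:, :4] + 0.5 * X[:, [4]]          # feature 4 is shared by all labels
Y = (signal + rng.normal(scale=0.5, size=(400, 4)) > 0).astype(int)

# Stand-in base learner; the project describes XGBoost here.
base = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)

# order=[0, 1, 2, 3] fixes the chain to T2D -> HTN -> MetS -> CKD.
chain = ClassifierChain(base, order=[0, 1, 2, 3], random_state=0)
chain.fit(X, Y)

proba = chain.predict_proba(X)               # shape (n_patients, 4)

# Each link sees the biomarkers plus the labels of every earlier disease.
widths = [est.n_features_in_ for est in chain.estimators_]
for name, w in zip(LABELS, widths):
    print(f"{name:5s} model input width: {w}")
```

With default settings, scikit-learn trains each downstream link on the true labels of the earlier diseases and only substitutes predictions at inference time. Passing `cv=5` makes training use cross-validated predictions instead, which is closer to the "prediction fed as feature" description.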

4. SHAP Interaction Values

Standard SHAP treats each disease independently. We use SHAP TreeExplainer on each chain estimator to compute interaction-aware importances — showing how glucose influences T2D prediction after accounting for the T2D signal that feeds into downstream disease predictions.

Clinical Relevance

Polypharmacy risk — knowing which diseases are truly co-occurring vs. measurement artifacts
Biomarker prioritization — ordering which tests to run first in a constrained clinical setting
Risk stratification — patients with T2D + HTN + MetS have compounded CKD risk not captured by independent models

Author

Hridam Biswas — IEEE Researcher, KIIT University
GitHub · Portfolio
