Skip to content

Visual Analytics for Cross-Site Biomedical Data and Federated Model Evaluation

License

Notifications You must be signed in to change notification settings

collaborativebioinformatics/FedViz

Repository files navigation

FedViz: Federated Data Discovery & Visualization for Biobank Collaboration

FedViz Dashboard

CMU Γ— NVIDIA Federated Learning Hackathon (Jan 7–9, 2026)

Team: Visualization Tool for Multiple Datasets


🎯 The Challenge

Different biobank sites have heterogeneous datasets, making federated learning nearly impossible without first understanding what data exists across the network.

Before launching any federated learning project, researchers face critical questions:

  • πŸ” What data does each site have?
  • πŸ“Š Are the features comparable across sites?
  • πŸ₯ What's the sample size and quality at each location?
  • 🀝 Which sites should collaborate on specific research questions?

Without answers to these questions, FL projects fail before they start.


πŸ’‘ Our Solution: Federated Data Discovery Pipeline

We demonstrate an end-to-end workflow that enables privacy-preserving data discovery and visualization to prepare biobank networks for federated learning:

1️⃣ Federated Cohort Discovery

Each site runs a local data extractor that:

  • Reads local patient data (remains on-site, never shared)
  • Computes aggregated cohort metadata (cohort size, available data types, disease coverage, biospecimen types)
  • Sends only the metadata summary to the central catalog

Privacy preserved: Raw patient data never leaves the site.

2️⃣ Federated Statistics Computation

Using NVIDIA FLARE, sites compute cross-site statistics:

  • Mean age, standard deviation
  • Disease prevalence (hypertension, diabetes)
  • Genomic data availability
  • Data quality metrics

Privacy preserved: Only aggregated statistics shared, not individual records.

3️⃣ Interactive Dashboard Visualization

Centralized dashboard displays:

  • Cohort Atlas: Browse and search available cohorts with rich metadata
  • Harmonization Matrix: Visualize which features are available across sites
  • Federated Statistics: Compare aggregated metrics across the network
  • Readiness Assessment: Identify which sites can collaborate on specific research questions

4️⃣ Enable FL Collaboration

Armed with discovery insights, researchers can:

  • βœ… Select compatible sites for FL projects
  • βœ… Identify harmonization gaps before training
  • βœ… Plan data collection strategies
  • βœ… Design robust cross-site studies

πŸ—οΈ Architecture

architecture

Tech Stack:

  • Federated Orchestration: NVIDIA FLARE (NVFlare)
  • Backend: Node.js + Python, Elasticsearch 7.17.x
  • Frontend: React + TypeScript, Apache ECharts
  • Data Privacy: Local computation, aggregated-only sharing

πŸš€ Demo Workflow

This demo simulates 8 real-world biobank sites (Nordic Biobank, CHoP, Penn, Japan Biobank, AWS Open Data, CanPath, Sage NF1, QIAGEN) with heterogeneous patient data.

Step 1: Setup Mock Sites

cd demo
python setup_sites.py

Generates mock CSV datasets for 8 sites at /tmp/nvflare/cross_bio_bank/site-{1-8}/patients.csv

Step 2: Run Federated Workflows

bash run_both.sh

This executes two NVFlare workflows:

Workflow A: Cohort Discovery (demo/discovery/)

  • Each site's CohortMetadataExtractor reads local patients.csv
  • Computes cohort profile (name, size, countries, data types, diseases, biospecimens)
  • Server's CatalogWriter aggregates all site metadata into:
    • cohort_catalog.json (rich metadata for dashboard)
    • cohort_catalog.csv (flat format for Elasticsearch import)

Output:

/tmp/nvflare/simulation/cohort_discovery/server/simulate_job/cohort_catalog.json
/tmp/nvflare/simulation/cohort_discovery/server/simulate_job/cohort_catalog.csv

Workflow B: Federated Statistics (demo/fedstats/)

  • Each site's PatientStatistics computes local feature statistics (mean, stddev, histogram)
  • NVFlare server aggregates across all sites
  • Produces global statistics: mean age, disease prevalence, genomic data rates

Output:

/tmp/nvflare/simulation/patient_stats/server/simulate_job/statistics/patient_stats.json

Step 3: Import to Dashboard

# Start Elasticsearch
cd ihcc-api
docker run -d \
  --name elasticsearch \
  -p 9200:9200 \
  -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:7.17.10

# Install dependencies and import data
pip install -r scripts/requirements.txt
python scripts/import_csv.py /tmp/nvflare/simulation/cohort_discovery/server/simulate_job/cohort_catalog.csv

# Start API server
npm run build
npm run prod

Step 4: Launch Dashboard

cd ihcc-ui
cp .schema.env .env
npm run buildAndServe

Step 5: Explore at http://localhost:3000/

  • Atlas Tab: Browse discovered cohorts, filter by country, disease, data types

πŸ“Š Key Insights from Real Data Audit

Harmonization Gap

Our audit of 14 international biobank nodes reveals the critical bottleneck in global federated research. By indexing 11,511 unique variables, we quantified the "Federated Surface Area" currently available for multi-site modeling.

Metric Value
Total Unique Variables - "11,511"
Common Variables (β‰₯8 Cohorts) - 120
Global Readiness Index - 1.04%

The Problem: Structural Heterogeneity Massive Sparsity: 98.96% of detected variables are cohort-specific and cannot be used for cross-site training without extensive manual mapping or semantic alignment.

Unique Silos: 98.7% (11,366) of variables exist in only a single cohort, categorized as "Unique" silos.

Terminology Gaps: Clinical concepts are fragmented across inconsistent institutional registries; for example, a reference variable like "sleep history" may have 6747 exact matches but only 42 domain-expanded semantic match across the network. Screenshot 2026-01-09 at 3 22 37β€―PM

Inconsistent Metadata: Variable coverage varies drastically by site, with some biobanks like PLCO contributing nearly 2,500 harmonized variables, while others like SAPRIN contribute fewer than 250.

This "Harmonization Gap" is why federated learning projects struggle:

  • 98.96% of data cannot be used for cross-site training without additional mapping
  • Different sites use different terminologies for the same clinical concepts
  • No systematic way to discover what's compatible before launching FL

FedViz addresses this by: βœ… Making the gap visible through the harmonization matrix
βœ… Enabling targeted harmonization efforts where they matter most
βœ… Allowing researchers to select the 1.04% high-quality feature space for immediate FL pilots
βœ… Providing a roadmap for expanding harmonization to 5%+ through semantic mapping (future work)

Screenshot 2026-01-09 at 3 30 41β€―PM Screenshot 2026-01-09 at 3 33 06β€―PM

A Python-based layer ingests IHCC-style metadata CSVs (tracking 50,000+ participants per site) and transforms them into a standardized Federated Matrix.

import rdflib
import os
import json
from collections import defaultdict

def find_repo_path():
    for root, dirs, files in os.walk('/content'):
        if 'data_dictionaries' in dirs:
            return os.path.join(root, 'data_dictionaries')
    return None

def get_federated_matrix(data_dir):
    variable_map = defaultdict(list)
    cohort_stats = {}

    print(f"πŸ“‚ Processing files in: {data_dir}")
    
    files = [f for f in os.listdir(data_dir) if f.endswith(".owl")]
    if not files:
        print("⚠️ No .owl files found in the directory.")
        return None

    for file in files:
        cohort_id = file.replace(".owl", "").upper()
        path = os.path.join(data_dir, file)
        
        g = rdflib.Graph()
        try:
            g.parse(path, format="xml")
            labels = set()
            for s, p, o in g.triples((None, rdflib.RDFS.label, None)):
                var_name = str(o).strip()
                labels.add(var_name)
                variable_map[var_name].append(cohort_id)
            
            cohort_stats[cohort_id] = len(labels)
            print(f"βœ”οΈ Synced {cohort_id} ({len(labels)} variables)")
        except Exception as parse_error:
            print(f"❌ Failed to parse {file}: {parse_error}")

    threshold = len(cohort_stats) / 2
    common_vars = {k: v for k, v in variable_map.items() if len(v) >= threshold}

    output = {
        "summary": {
            "total_cohorts": len(cohort_stats),
            "total_unique_variables": len(variable_map),
            "common_variables_count": len(common_vars),
            "readiness_percentage": round((len(common_vars)/len(variable_map)*100), 2) if variable_map else 0
        },
        "matrix": variable_map,
        "cohort_sizes": cohort_stats
    }
    return output

repo_path = find_repo_path()

if repo_path:
    results = get_federated_matrix(repo_path)
    if results:
        with open("federated_matrix.json", "w") as f:
            json.dump(results, f, indent=4)
        print("\nπŸš€ SUCCESS! 'federated_matrix.json' is ready for the JS team.")
        print(f"πŸ“Š Summary: Found {results['summary']['common_variables_count']} shared variables across {results['summary']['total_cohorts']} sites.")
else:
    print("❌ ERROR: Could not find 'data_dictionaries' folder. Did you run 'git clone' in this session?")

Cross-Site Biobanks Dashboard & Visualization

A visual analytics dashboard for cross-site biomedical cohorts with mock/demo data to support:

  • Atlas view
  • Data exploration
  • Classification model evaluation (confusion matrix, cohort accuracies, ROC curves)
  • Clustering visualization/performance (cohort vs global centroids, cluster ranges, cohort/cluster distributions)

Key Features

A visual analytics dashboard for cross-site Biobanks with mock/demo data to support:

Tabs

  1. Atlas Dashboard Screenshot

  2. Classification Model Visualization / Performance

    • Confusion-matrix quadrant pies (TP/FP/TN/FN)
    • Bar chart of cohort accuracy (derived from TP/FP/TN/FN)
    • ROC curves per cohort (synthetic/mock ROC points supported) Dashboard Screenshot Dashboard1 Screenshot Dashboard2 Screenshot
  3. Clustering Visualization / Performance

    • Cohort-specific centroids per cluster
    • Global centroids as mean across cohorts
    • Cluster ranges (ellipses / dispersion visual)
    • Pie charts: cohort contribution by sample size (n) and centroid distributions Dashboard Screenshot Dashboard1 Screenshot Dashboard2 Screenshot

🚧 Future Directions

Automated Semantic Harmonization

  • LLM-powered term mapping: Automatically suggest synonyms for clinical variables
  • Expected impact: Increase readiness index from 1.04% β†’ 5%+
  • Example: Map "HBP_AGE" (Site A) ↔ "hypertension_diagnosis_age" (Site B)

Real-Time FL Model Performance Tracking

  • Display per-site confusion matrices during federated training
  • Track convergence metrics across heterogeneous data distributions
  • Visualize fairness metrics (performance equity across demographics)

Multi-Modal Data Support

  • Extend beyond tabular data to imaging, genomics, EHR time-series
  • Visualize data modality availability matrix
  • Enable multi-modal FL project planning

πŸ“œ License & Citation

MIT License – Deploy freely at your biobank to audit FL readiness.

References

IHCC Consortium:

@article{IHCC2020,
  title={The International HundredK+ Cohorts Consortium: Integrating Large-scale Cohorts for Global Health},
  author={Manolio, Teri A. and Goodhand, Peter and Lowrance, William and others},
  journal={Gene},
  volume={738},
  pages={144491},
  year={2020},
  publisher={Elsevier},
  doi={10.1016/j.gene.2020.144491}
}

NVIDIA FLARE:

@article{Roth2022,
  title={{NVIDIA FLARE}: Federated Learning from Simulation to Real-World},
  author={Roth, Holger R. and Cheng, Yan and Wen, Yuhong and Yang, Isaac and Xu, Ziyue and Hsieh, Yuan-Ting and others},
  journal={arXiv preprint arXiv:2210.13291},
  year={2022},
  url={https://arxiv.org/abs/2210.13291}
}

πŸ‘₯ Contributors

  • Hieu (Henry) Ngo
  • Yuan-Ting Hsieh
  • Aditya Kumar Karna

FedViz β€” Bridging the gap between raw biobank heterogeneity and robust federated model training.

About

Visual Analytics for Cross-Site Biomedical Data and Federated Model Evaluation

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •