FedPathHarmony: Federated Harmonization for Multi-Center Digital Pathology

collaborativebioinformatics/FedPathHarmony

FedPathHarmony

2025 CMUxNVIDIA Hackathon Presentation

Federated Harmonization for Multi-Center Digital Pathology. Compatible with PyTorch and NVIDIA FLARE (NVFlare).

Team Members

Name Email
Ravi Madduri (Team Lead) madduri@anl.gov
Derek Mu derekmu@andrew.cmu.edu
Mrunali Thokadiwala emrunali@gmail.com
Alina Devkota ad00139@mix.wvu.edu
Alis Bao ailisib@andrew.cmu.edu
Peiran Jiang peiran@cmu.edu
Jacob Thrasher jdt0025@mix.wvu.edu
Prashnna Gyawali prashnna.gyawali@mail.wvu.edu
Suratha Sriram surathas@andrew.cmu.edu
Rodela mrozbu@alumni.cmu.edu
Pete Lawson plawson@jhu.edu
Jiahao He leah12577@gmail.com
Abhijit Chunduru schunduru@umass.edu

Problem Statement:

In precision medicine, integrating data from different biobanks is hindered by domain shifts. A Federated Learning model trained on "raw" data from disjoint hospitals often fails to generalize because it learns to recognize site-specific artifacts rather than biological pathology.

In the CAMELYON17 dataset, we observe significant heterogeneity across 5 medical centers due to differences in staining protocols and scanners:

Fig. 1: Stain heterogeneity across the 5 centers (clients), shown as CAMELYON17-WILDS patches by center and label. Staining protocols differ visibly between centers: Center 3 shows high-intensity, saturated H&E staining, whereas Center 5 shows low-intensity, lightly stained H&E with reduced color saturation.

The Challenge: A standard AI model might incorrectly learn that "Pink = Tumor" or "Purple = Normal" simply based on which hospital the data came from.

Our Solution:

We propose a Data-Side Harmonization Adapter integrated with NVIDIA FLARE.

Instead of training on raw, discordant data, each client standardizes its histology slides locally using a stain normalization technique before participating in the federation. This ensures that the global model learns morphological features rather than overfitting to color artifacts.

Architecture:

flowchart TD
    subgraph Global ["Central Server"]
        Agg["Global Model Aggregation"]
    end

    subgraph SiteA ["Biobank A (Center 1)"]
        direction TB
        RawA["Raw Slide Images (Purple)"]
        NormA["Harmonization Adapter"]
        ClientA["NVFlare Client"]
        RawA --> NormA --> ClientA
    end

    subgraph SiteB ["Biobank B (Center 2)"]
        direction TB
        RawB["Raw Slide Images (Pink)"]
        NormB["Harmonization Adapter"]
        ClientB["NVFlare Client"]
        RawB --> NormB --> ClientB
    end

    ClientA -.->|Weights| Agg
    ClientB -.->|Weights| Agg
    Agg -.->|Global Weights| ClientA
    Agg -.->|Global Weights| ClientB

Methods

Data Sources

Patch-Based Histopathology: The CAMELYON17 dataset comprises 1,300 hematoxylin and eosin (H&E)–stained sentinel lymph node whole-slide images (WSIs) from breast cancer patients. Using a patch-based variant of CAMELYON17 [4], approximately 450,000 patches of size 96 × 96 pixels were extracted from the WSIs. Each WSI was manually annotated by pathologists to delineate tumor regions, and the resulting segmentation masks were used to assign binary labels (tumor or non-tumor) to each patch. (Source: https://wilds.stanford.edu/datasets)

Biobank Proxy: The CAMELYON17 dataset includes whole slide images from five pathology centers: RadboudUMC, UMCU, Erasmus MC, UMCG and the Institute Jules Bordet. By treating each pathology center as a proxy for a separate biobank, we can explore the impact of a diverse range of staining protocols, slide preparation methods, and scanning equipment on inter-center variability, and the need for data harmonization across sites.

Data Harmonization

Due to the inter-center variability in staining protocols, slide preparation methods, and scanning equipment in the CAMELYON17 dataset, harmonizing the visual representation of histopathology images is critical for ensuring robust model performance across sites. To address this challenge, we developed a harmonization approach based on the Beer-Lambert law to compute image-level frequency information for each whole-slide hematoxylin and eosin (H&E) image. This approach captures the spectral characteristics of each WSI, representing key staining features (e.g., variations in hematoxylin and eosin absorption) in a quantitative and standardized manner.
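As a rough illustration of the Beer–Lambert step, the sketch below converts RGB intensities to optical density (OD) and back. It is not the repository's implementation: the function names, the background intensity of 255, and the clamping value `eps` are assumptions chosen for the example.

```python
import math

def rgb_to_optical_density(rgb, background=255.0, eps=1.0):
    """Convert one RGB pixel (0-255) to optical density via Beer-Lambert.

    OD_c = -log10(I_c / I_0), where I_0 is the unstained background intensity.
    `background` and `eps` are illustrative assumptions, not repository values.
    Intensities are clamped to at least `eps` to avoid log(0).
    """
    return tuple(-math.log10(max(float(c), eps) / background) for c in rgb)

def optical_density_to_rgb(od, background=255.0):
    """Invert the transform: I_c = I_0 * 10 ** (-OD_c)."""
    return tuple(background * 10.0 ** (-d) for d in od)

# Unstained background: no absorption, so OD is zero in every channel.
white_od = rgb_to_optical_density((255, 255, 255))

# A purple-ish pixel: the darkest channel (green) has the highest OD.
purple_od = rgb_to_optical_density((120, 60, 160))
purple_rgb = optical_density_to_rgb(purple_od)
```

Working in OD space is convenient because stain absorption adds linearly there, which is why stain statistics are usually computed on OD values rather than on raw RGB.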

In a federated learning framework, this image-level frequency information is computed locally at each site and sent to the central server. The server aggregates these frequency profiles to compute a global average representation of staining parameters across sites. This global average is then shared back with each site, so that each site can adjust its staining features locally to align with the harmonized global baseline. Because every site normalizes toward the same global H&E reference, this approach minimizes inter-center drift while preserving the fidelity of clinically relevant features within the histopathology images.
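The exchange described above can be sketched as follows. The text does not specify which statistics are shared, so this example assumes each site reports per-channel OD mean and standard deviation along with its patch count. The dictionary layout and the function names are hypothetical, and averaging standard deviations is a simplification rather than a pooled estimate.

```python
def aggregate_stain_stats(site_stats):
    """Server side: sample-weighted average of per-site channel statistics.

    site_stats: list of dicts {"n": int, "mean": [..], "std": [..]}
    (a hypothetical layout chosen for this sketch).
    """
    total = sum(s["n"] for s in site_stats)
    channels = range(len(site_stats[0]["mean"]))
    return {
        "mean": [sum(s["n"] * s["mean"][c] for s in site_stats) / total for c in channels],
        "std": [sum(s["n"] * s["std"][c] for s in site_stats) / total for c in channels],
    }

def align_to_global(od_pixel, local_stats, global_stats):
    """Client side: shift and scale a local OD pixel toward the global baseline."""
    return [
        (x - lm) / ls * gs + gm
        for x, lm, ls, gm, gs in zip(
            od_pixel,
            local_stats["mean"], local_stats["std"],
            global_stats["mean"], global_stats["std"],
        )
    ]

site_a = {"n": 100, "mean": [0.2, 0.4, 0.1], "std": [0.1, 0.1, 0.1]}
site_b = {"n": 300, "mean": [0.6, 0.8, 0.5], "std": [0.2, 0.2, 0.2]}
global_stats = aggregate_stain_stats([site_a, site_b])

# A pixel sitting exactly at site A's mean lands on the global mean.
aligned = align_to_global([0.2, 0.4, 0.1], site_a, global_stats)
```

Only summary statistics cross the site boundary in this sketch; the pixel data stays local.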

Experimental Setup

  • Federated Averaging (naive harmonization): Each center trains a local model on its patches; model weights are periodically averaged across centers using FedAvg via NVIDIA FLARE, without explicit stain harmonization.

  • Beer–Lambert Stain Normalization (smart harmonization): Patches are first stain-normalized to reduce inter-center variability, then local models are trained and aggregated using FedAvg in NVIDIA FLARE.

  • Pooled Centers (centralized evaluation): Patches from all five centers are combined into a single dataset, and a centralized model is trained to evaluate the performance difference between conventional centralized training and federated approaches.
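For readers unfamiliar with FedAvg, the aggregation step used in the first two setups amounts to a sample-weighted average of client weights. The sketch below shows that arithmetic on plain Python lists. It does not use or imitate the NVIDIA FLARE API, and the `fedavg` function and its input layout are hypothetical.

```python
def fedavg(client_updates):
    """Sample-weighted average of client model weights.

    client_updates: list of (num_samples, weights) pairs, where `weights`
    maps a parameter name to a flat list of floats (illustrative layout).
    """
    total = sum(n for n, _ in client_updates)
    reference = client_updates[0][1]
    return {
        name: [
            sum(n * w[name][i] for n, w in client_updates) / total
            for i in range(len(values))
        ]
        for name, values in reference.items()
    }

# Two centers with different amounts of local data.
center_1 = (100, {"w": [1.0, 2.0], "b": [0.0]})
center_2 = (300, {"w": [3.0, 6.0], "b": [4.0]})
global_weights = fedavg([center_1, center_2])
```

The center with more patches pulls the global weights further toward its own, which is one reason unharmonized staining at a large site can dominate the global model.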

How to use this repository:

  1. Download the CAMELYON17 dataset:
wget "https://worksheets.codalab.org/rest/bundles/0xe45e15f39fb54e9d9e919556af67aabe/contents/blob/?download=1" -O camelyon17.tar.gz
tar -xzf camelyon17.tar.gz
  2. Create and activate a Conda environment:
conda create --name fpharmo python=3.10 -y
conda activate fpharmo
  3. Install the required libraries:

    a. Install the PyTorch build that matches your local CUDA version, following the official PyTorch installation instructions.

    # Example for CUDA 12.1
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

    b. Install the remaining requirements:

    pip install -r requirements.txt
  4. Start the training:

python -m harmo_flare.job --fl_type harmo

Command Line Arguments:

Argument Default Description
--n_clients 5 Number of federated sites/clients
--num_rounds 200 Number of FL rounds
--epochs 2 Local training epochs per round
--batch_size 128 Batch size for local training
--fl_type fedavg Federated learning type (fedavg or harmo)
  5. Evaluation: To compute the evaluation metrics (accuracy, F1, and AUROC), run:
python -m harmo_flare.test_job --fl_type harmo
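To make the three reported metrics concrete, here is a minimal, dependency-free sketch of how accuracy, F1, and AUROC are defined for binary tumor/non-tumor labels. It is not the code behind `harmo_flare.test_job`; the function names are illustrative, and the pairwise AUROC is quadratic in the number of samples, so it only suits small examples.

```python
def accuracy(y_true, y_pred):
    """Fraction of patches whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (tumor) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Pairwise comparison, O(n_pos * n_neg).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]                   # predicted tumor probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = auroc(y_true, scores)
```

Accuracy and F1 depend on the chosen threshold, while AUROC summarizes ranking quality across all thresholds, which makes it the more stable number when comparing models across centers.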

References:

  1. G. Litjens et al., “1399 H&E-stained sentinel lymph node sections of breast cancer patients: the CAMELYON dataset,” GigaScience, 2018.

  2. M. Jiang, Z. Wang, and Q. Dou, “HarmoFL: Harmonizing Local and Global Drifts in Federated Learning on Heterogeneous Medical Images,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 1, pp. 1087–1095, June 2022, doi: 10.1609/aaai.v36i1.19993.

  3. T. Xu, Y. Wu, A. K. Tripathi, M. M. Ippolito, and B. D. Haeffele, “Adaptive Stain Normalization for Cross-Domain Medical Histology,” vol. 15966, 2026, pp. 24–33. doi: 10.1007/978-3-032-04981-0_3.

  4. P. Bandi et al., “From Detection of Individual Metastases to Classification of Lymph Node Status at the Patient Level: The CAMELYON17 Challenge,” IEEE Trans Med Imaging, vol. 38, no. 2, pp. 550–560, Feb. 2019, doi: 10.1109/TMI.2018.2867350.
