Releases: KatherLab/MediSwarm

v1.4.2

10 Apr 13:44
71685d3

MediSwarm v1.4.2

Bug Fixes

Challenge Model Selection (#284, #286)

--job challenge_4abmil and --job challenge_5pimed previously produced a ValueError: Unsupported model name because the auto-derive logic stripped the challenge_ prefix and passed the raw suffix (4abmil, 5pimed) to the model registry instead of the registered keys (4LME_ABMIL, 5Pimed). Only challenge_1DivideAndConquer worked correctly since its suffix matched the registry key exactly.

  • Added a case statement in docker_config/master_template.yml that maps 4abmil→4LME_ABMIL and 5pimed→5Pimed during auto-derive
  • Embedded the correct MODEL_NAME= directly in each challenge job's SubprocessLauncher script (config_fed_client.conf) so admin-submitted jobs also select the right model regardless of how the client container was started
  • Affects all 5 challenge jobs: 1DivideAndConquer, 2BCN_AIM, 3agaldran, 4LME_ABMIL, 5Pimed
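
The auto-derive mapping described above can be sketched as follows. This is only an illustration in Python: the actual fix is a shell case statement in docker_config/master_template.yml, and the function and constant names here are hypothetical.

```python
# Hypothetical Python rendering of the auto-derive logic; the real fix lives
# in a shell case statement inside docker_config/master_template.yml.
CHALLENGE_PREFIX = "challenge_"

# Job suffixes whose registry key differs from the suffix (taken from the notes above).
SUFFIX_TO_MODEL = {
    "4abmil": "4LME_ABMIL",
    "5pimed": "5Pimed",
}


def derive_model_name(job: str) -> str:
    """Return the model registry key for a challenge job name."""
    if not job.startswith(CHALLENGE_PREFIX):
        raise ValueError(f"Not a challenge job: {job}")
    suffix = job[len(CHALLENGE_PREFIX):]
    # Fall back to the raw suffix when it already matches the registry key.
    return SUFFIX_TO_MODEL.get(suffix, suffix)
```

With this sketch, challenge_4abmil resolves to 4LME_ABMIL, while challenge_1DivideAndConquer passes through unchanged.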

Live-sync UX Improvements (#285, #286)

  • Pre-configured startup kits: sync.conf.example now contains the real monitoring server address (172.24.4.65) with ConnectTimeout=5. The kit injection script falls back to copying .example as sync.conf when no local override exists, so participant startup kits always arrive with a working configuration — no manual file copying required
  • Interactive SSH connectivity check: instead of silently exiting on SSH failure, live_sync.sh now tests connectivity, prints the test command and SSH key-sharing instructions, then asks "Continue training without live sync? [y/N]". Non-interactive environments (no TTY) default to continuing without sync
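
The prompt behaviour after a failed connectivity check might look roughly like the sketch below. It is written in Python for illustration only (live_sync.sh is a shell script); the function name is hypothetical, and accepting "yes" as well as "y" is an assumption.

```python
import sys


def continue_without_sync(answer=None, interactive=None):
    """Decide whether training proceeds after a failed SSH check."""
    if interactive is None:
        interactive = sys.stdin.isatty()
    if not interactive:
        # No TTY: nobody can answer the prompt, so continue without sync.
        return True
    if answer is None:
        answer = input("Continue training without live sync? [y/N] ")
    # [y/N] means "no" is the default: only an explicit yes continues.
    return answer.strip().lower() in ("y", "yes")
```

Passing answer and interactive explicitly keeps the decision testable without a real terminal.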

--job Flag Ignoring Challenge Model (#277)

When --job specified a challenge model while starting a client, the flag was not being passed to the training subprocess correctly. Fixed so the --job selection is honoured end-to-end.

Live-sync SSH Hanging (#277)

SSH connections without BatchMode=yes could hang indefinitely waiting for password input in non-interactive environments. Fixed by enforcing BatchMode=yes in all SSH options.
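
A minimal sketch of a non-hanging connectivity probe, assuming the check is driven from Python rather than from live_sync.sh itself. BatchMode and ConnectTimeout are standard OpenSSH options; the function names, the user name, and the remote command "true" are hypothetical.

```python
import subprocess


def ssh_check_command(host, user, timeout_s=5):
    """Build an ssh argv that fails fast instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never wait for password input
        "-o", f"ConnectTimeout={timeout_s}",  # give up on unreachable hosts
        f"{user}@{host}",
        "true",                               # no-op remote command
    ]


def ssh_reachable(host, user, timeout_s=5):
    """Return True if the probe command exits with status 0."""
    try:
        result = subprocess.run(
            ssh_check_command(host, user, timeout_s),
            stdin=subprocess.DEVNULL,
            capture_output=True,
            timeout=timeout_s + 5,  # hard upper bound on the whole call
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```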

Checkpoint Collection After stop_all (#268)

stop_all deletes the server working directory, which previously caused checkpoint collection to fail with a path error. Collection now handles the missing server directory gracefully.
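
One way to tolerate a missing directory during collection is sketched below. The function name, the *.ckpt file pattern, and the directory layout are assumptions for illustration, not the project's actual collection code.

```python
import shutil
from pathlib import Path


def collect_checkpoints(source_dirs, dest):
    """Copy *.ckpt files from each existing source dir; skip missing dirs."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied, skipped = [], []
    for src in map(Path, source_dirs):
        if not src.is_dir():
            # e.g. the server working dir after stop_all: record it and move on
            skipped.append(src)
            continue
        for ckpt in sorted(src.rglob("*.ckpt")):
            target = dest / f"{src.name}_{ckpt.name}"
            shutil.copy2(ckpt, target)
            copied.append(target)
    return copied, skipped
```

Returning the skipped paths lets the caller log which directories were absent instead of failing on the first one.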

Increased Timeouts and Webviewer Error Detection (#269 / base)

Raised NVFlare peer/heartbeat timeouts to accommodate large model transfers (2 GB+) over VPN. Enhanced webviewer startup to detect and report errors instead of silently hanging.


New Features

STAMP Deploy Test (#269, #279, #280)

Added an automated 2-node deploy test for STAMP participants:

  • Separate master_template_STAMP.yml and Dockerfile_STAMP for isolated STAMP provisioning
  • scripts/deploy/run_stamp_deploy_test.sh orchestrates the full test: build → provision → inject → launch → collect
  • Multi-model support: test can cycle through all STAMP models in a single run
  • Retry loop with configurable attempts for flaky network conditions
  • Participant README (STAMP_PARTICIPANT_README.md) with step-by-step setup guide
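
The retry loop with configurable attempts could be sketched like this. The real loop lives in scripts/deploy/run_stamp_deploy_test.sh; this Python version, including its names and default values, is a hypothetical illustration of the pattern.

```python
import time


def run_with_retries(step, attempts=3, delay_s=0.0):
    """Call step() until it succeeds or the attempts are exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as error:  # flaky network steps may raise anything
            last_error = error
            if attempt < attempts:
                time.sleep(delay_s)
    raise RuntimeError(f"step failed after {attempts} attempts") from last_error
```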

Apt Pin Refresh (#283)

Refreshed pinned apt package versions in Dockerfile_ODELIA to match the current Ubuntu package index, preventing apt-get install failures on fresh builds. Added helper scripts to automate future pin updates.


Full Changelog: v1.4.1...v1.4.2

v1.4.1

08 Apr 05:02

MediSwarm v1.4.1

Changes

  • Update 2-site deploy test configuration to use DL0 (RUMC_1) + DL2 (MHA_1)
  • Bump version to 1.4.1

Deploy Test Validation

Successfully validated challenge_1DivideAndConquer over Tailscale VPN with 2 clients:

  • 30+ error-free training rounds across DL0 (RUMC_1) and DL2 (MHA_1)
  • P2P model exchange (689MB model): ~2-3 seconds
  • Adaptive epoch calculation with EPOCHS_MAX_CAP=10 working correctly
  • Both swarm_config and swarm_start phases completed cleanly

Full Changelog: v1.4.0...v1.4.1

v1.4.0

07 Apr 18:47

What's New

Webviewer (Live Monitor)

  • Fix age column flicker — Replaced <meta http-equiv="refresh"> with JS-based auto-refresh and client-side age ticking. Age now counts up smoothly without resetting to 0s on reload.
  • Hostname column — Dashboard now shows which machine each run is coming from (parsed from heartbeat.json).
  • Error status detection — Runs that hit FATAL_SYSTEM_ERROR, EXECUTION_EXCEPTION, RuntimeError, OutOfMemoryError, or CUDA out of memory are now flagged with a red "error" badge instead of appearing as "stale" or "finished".
  • Default metrics visibility — Only train/val ACC and AUC-ROC are shown by default in charts. All other series are hidden but toggleable via the Chart.js legend.
  • Label distribution chart — Detail page now shows a grouped bar chart of class counts per train/val/test split, parsed from console output.
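
The error status detection described above amounts to scanning console output for known failure markers. A minimal sketch, assuming plain substring matching; the marker list comes from the note above, while the function name and the fallback argument are hypothetical.

```python
# Failure markers listed in the release note above.
ERROR_MARKERS = (
    "FATAL_SYSTEM_ERROR",
    "EXECUTION_EXCEPTION",
    "RuntimeError",
    "OutOfMemoryError",
    "CUDA out of memory",
)


def classify_run(console_text, fallback_status):
    """Return "error" if any marker appears, else the status computed otherwise."""
    for marker in ERROR_MARKERS:
        if marker in console_text:
            return "error"
    return fallback_status
```

The fallback keeps the existing "stale" or "finished" logic in charge whenever no marker is found.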

Training

  • Reduce EPOCHS_MAX_CAP default from 20 → 10, preventing excessive epochs on small sites (e.g. RUMC_1 with 22 samples was doing 20 epochs per round, now capped at 10). Override with EPOCHS_MAX_CAP env var.

Heartbeat / Live Sync

  • Hostname field added to heartbeat.json output
  • ANSI escape code stripping from RUN_NAME (fixes garbled names from colored terminal output)
  • Quote cleanup on kit_version field

CI/CD

  • Deploy test workflow now triggers on release publish instead of weekly schedule (manual dispatch retained)

Housekeeping

MediSwarm v1.3.0

05 Apr 14:11

Released: 2026-04-05

Major release adding the STAMP histopathology classification pipeline, FedProx aggregation strategy, comprehensive CI/CD infrastructure, Duke benchmark pipeline, and expanded documentation with architecture diagrams.


🔬 STAMP Classification Pipeline

Full support for KatherLab STAMP 2.4.0 histopathology classification in federated learning:

  • Separate Dockerfile_STAMP — Python 3.11, PyTorch 2.7.1, CUDA 12.6 (independent from ODELIA's Python 3.10/PyTorch 2.2.2 image)
  • Build flag — buildDockerImageAndStartupKits.sh now accepts -d / --dockerfile to select between Dockerfile_ODELIA and Dockerfile_STAMP
  • Synthetic dataset generator — Creates 2 sites × 15 patients with H5 feature files for integration testing
  • Integration tests — Preflight check, local training, and NVFlare simulation mode (3 rounds, 2 clients)
  • Per-round metrics CSV — STAMPMetricsCallback writes ground-truth/prediction probabilities and summary metrics per epoch

Two Docker Images

After v1.3.0, MediSwarm maintains two Docker images:

Image                  Python   PyTorch   Use Case
jefftud/odelia:<ver>   3.10     2.2.2     3D breast MRI classification
jefftud/stamp:<ver>    3.11     2.7.1     STAMP histopathology classification

🔄 FedProx Aggregation Strategy

Alternative to FedAvg for improved convergence with non-IID medical data:

  • FedProxCallback — Lightning callback adds proximal term (μ/2) × ‖w_local − w_global‖² to gradient updates
  • Cross-pipeline — Compatible with both ODELIA (pytorch_lightning) and STAMP (lightning)
  • Configurable — Set FEDPROX_MU environment variable (default: 0 = disabled, recommended: 0.001–0.01)
  • Documentation — docs/AGGREGATION_STRATEGIES.md compares FedAvg, FedProx, Scaffold, and FedOpt with decision matrix
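
To make the proximal term concrete, here is a dependency-free sketch on flat lists of floats. It is not the FedProxCallback implementation: the function names are hypothetical, and it assumes the term acts as a penalty whose gradient, μ × (w_local − w_global), is what reaches the parameter update.

```python
import os


def fedprox_mu(env=None):
    """Read FEDPROX_MU; 0 (the default) disables the proximal term."""
    env = os.environ if env is None else env
    return float(env.get("FEDPROX_MU", "0"))


def proximal_penalty(w_local, w_global, mu):
    """(mu / 2) * ||w_local - w_global||^2 over flat parameter lists."""
    return 0.5 * mu * sum((wl - wg) ** 2 for wl, wg in zip(w_local, w_global))


def proximal_gradient(w_local, w_global, mu):
    """Gradient of the penalty with respect to w_local: mu * (w_local - w_global)."""
    return [mu * (wl - wg) for wl, wg in zip(w_local, w_global)]
```

With μ = 0 both functions return zeros, which matches the standard FedAvg behavior described for the default setting.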

🧪 CI/CD for STAMP

Expanded test infrastructure covering both pipelines:

  • Unit tests — test_stamp_training.py (465 lines), test_stamp_model_wrapper.py (257 lines), test_fedprox_callback.py (286 lines)
  • Integration tests — STAMP Docker build + preflight + local training + simulation in pr-test.yaml
  • Unified packages — unit-tests.yaml switched from pytorch-lightning to unified lightning package
  • Timeout — PR test timeout increased from 45 to 60 minutes
  • Cleanup — CI cleanup step now kills stamp and nvflare containers alongside odelia

📊 Duke Benchmark Pipeline

Automated end-to-end benchmarking on the Duke Breast MRI dataset:

  • run_duke_benchmark.sh — Orchestrates build → deploy → swarm training → result collection → local model comparison
  • Configurable deploy — deploy_and_test.sh reads SITES and SERVER_NAME from deploy_sites.conf (backward-compatible defaults)
  • deploy_sites.conf.example — Template with dl0/dl2/dl3 configuration for TUD compute cluster
  • Results template — docs/DUKE_BENCHMARK_RESULTS.md for recording benchmark outcomes

📐 Architecture Documentation

Expanded README from 46 lines to 214 lines:

  • System Architecture — Mermaid diagram showing site-to-server topology with NVFlare aggregation
  • Training Pipeline — Mermaid sequence diagram showing federated learning round lifecycle
  • Supported Pipelines — Comparison table (ODELIA 3D CNN vs STAMP Classification)
  • Key Features — Privacy, Docker reproducibility, multi-pipeline support
  • Project Structure — Annotated directory tree

🔐 Differential Privacy Assessment

Gap analysis and roadmap (documentation only — implementation deferred to v1.4.0):

  • docs/DIFFERENTIAL_PRIVACY.md — Current PercentilePrivacy is gradient clipping, NOT formal (ε,δ)-DP. Detailed analysis of Opacus/DP-SGD integration path, compatibility issues, and privacy budget accounting
  • docs/DIFFERENTIAL_PRIVACY_DECISION.md — Architecture decision record

Changed

  • deploy_and_test.sh container matching broadened to include stamp and nvflare alongside odelia
  • CI pr-test.yaml timeout increased from 45 to 60 minutes
  • CI cleanup step now kills stamp and nvflare containers

Stats

  • 31 files changed, 3,465 insertions, 162 deletions
  • 16 new files created
  • 9 pull requests (#252–#260)

Upgrade Notes

  • No breaking changes from v1.2.0
  • ODELIA pipeline users: no action required — Dockerfile_ODELIA is unchanged
  • STAMP pipeline users: build with ./buildDockerImageAndStartupKits.sh -d docker_config/Dockerfile_STAMP -p <project>
  • FedProx: opt-in via FEDPROX_MU env var — set to 0 or leave unset for standard FedAvg behavior

Full Changelog: v1.2.0...v1.3.0

MediSwarm v1.2.0

04 Apr 21:36

Highlights

This release introduces STAMP classification support for swarm learning, a prediction workflow for external test data, significant code deduplication, improved training stability, and comprehensive documentation for making standalone training code MediSwarm-compatible.

New Features

STAMP Classification Job (#249)

  • New STAMP_classification job for swarm learning with STAMP's data pipeline (H5 features + clinical tables)
  • Supports VIT, MLP, TransMIL, and other STAMP model architectures
  • Configurable via STAMP_* environment variables
  • Stratified train/val split with STAMP's data loading pipeline

Prediction Workflow (#247)

  • New prediction workflow for evaluating trained swarm models on external test data
  • Supports both ODELIA 3D CNN and STAMP classification models
  • Configurable via environment variables for model path, data directory, and output format

Weighted Epochs Per Site (#251)

  • Replaces hardcoded per-site epoch dictionaries with a formula-based approach
  • Formula: epochs = base_epochs × (reference_size / num_train_samples), clamped to [1, max_cap]
  • Sites with fewer training samples get more local epochs per round, equalizing gradient updates across sites
  • Configurable via EPOCHS_PER_ROUND, EPOCHS_REFERENCE_DATASET_SIZE, EPOCHS_MAX_CAP env vars
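
The formula above can be sketched as a small function. The parameters would come from the EPOCHS_PER_ROUND, EPOCHS_REFERENCE_DATASET_SIZE, and EPOCHS_MAX_CAP variables; the function name is hypothetical, and rounding to the nearest integer before clamping is an assumption, since the notes do not say how fractional results are handled.

```python
def epochs_for_site(num_train_samples, base_epochs, reference_size, max_cap):
    """epochs = base_epochs * (reference_size / num_train_samples), clamped to [1, max_cap]."""
    if num_train_samples <= 0:
        raise ValueError("num_train_samples must be positive")
    raw = base_epochs * (reference_size / num_train_samples)
    # Assumption: round to the nearest integer, then clamp.
    return max(1, min(max_cap, round(raw)))
```

For example, with the made-up values base_epochs=2, reference_size=500, and max_cap=10, a site with 250 training samples runs 4 epochs per round, while a site with 22 samples hits the cap of 10.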

Best + Last Model Checkpoints (#251)

  • finalize_training() now saves both best (by monitor metric) and latest checkpoints
  • Deployers can choose between peak-validation and final-aggregated models

Server Dashboard Enhancement (#240)

  • Enhanced server-side monitoring dashboard for real-time swarm training visibility

Client Stability Improvements (#245)

  • Systemd service for VPN with auto-reconnect and keepalive
  • GPU health check script for pre-training and Docker health checks
  • Docker container restart policies (--restart=on-failure:5)
  • VPN health monitor with automatic service restart after consecutive failures

Infrastructure & DevOps

Code Deduplication (#241)

  • Consolidated 5 duplicate challenge job directories into shared _shared/custom/ with symlinks
  • Moved build scripts to scripts/build/ and CI scripts to scripts/ci/
  • Single source of truth for training code across all ODELIA/challenge jobs

NVFlare Workflow Enhancements (#242)

  • Cross-site evaluation (CSE) workflow added to server and client configs
  • Tuned timeouts from 100-hour placeholders to practical values
  • Explicit metric comparator configuration
  • PercentilePrivacy filter for gradient quality control

Automated Tests (#243)

  • New unit test suite in tests/unit_tests/ (models_config, env_config, data_module)
  • GitHub Actions workflow for unit tests on PRs
  • Fixed hardcoded paths in test_challenge_models.py

Docker Build Optimization (#250)

  • Reordered Dockerfile layers: pip installs (expensive, stable) before apt installs (cheap, frequent CVE bumps)
  • Added --no-cache-dir flags to reduce image size
  • Consolidated RUN layers for better caching

NVFlare 2.7.2 Upgrade (#235, #236)

  • Upgraded from NVFlare 2.5.x to 2.7.2

Bug Fixes

  • Fix integration test printed icons (#224)
  • Fix site name argument ordering (#237, fixes #227)
  • Update CI Node.js version (#238, fixes #222)
  • Fix CI apt-get update permissions (#239)
  • Fix CLI flags for env vars lost when using sudo (#230)

Documentation

  • MediSwarm Compatibility Guide (#244, addresses #216) — step-by-step guide for making standalone training code MediSwarm-compatible
  • Updated README with correct repository links

Training Improvements (#246)

  • Class-weighted loss for imbalanced datasets
  • Gradient accumulation (effective batch size of 8)
  • Gradient clipping (val=1.0) to prevent explosion
  • 16-mixed precision for stability
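
As a rough sketch of how these settings fit together: the keyword names below are real Lightning Trainer arguments, but the helper functions, the inverse-frequency weighting scheme, and the example batch size are assumptions rather than the project's actual configuration.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count), so rare classes weigh more."""
    if any(count <= 0 for count in class_counts):
        raise ValueError("every class needs at least one sample")
    total = sum(class_counts)
    n_classes = len(class_counts)
    return [total / (n_classes * count) for count in class_counts]


def trainer_kwargs(batch_size, effective_batch_size=8):
    """Keyword arguments one could pass to a Lightning Trainer."""
    return {
        # accumulate gradients until the effective batch size is reached
        "accumulate_grad_batches": max(1, effective_batch_size // batch_size),
        "gradient_clip_val": 1.0,
        "precision": "16-mixed",
    }
```

With a per-step batch size of 2, accumulating over 4 batches yields the effective batch size of 8 mentioned above.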

Full Changelog: v1.1.0...v1.2.0

v1.1.0 — Challenge Models

02 Apr 21:11

MediSwarm v1.1.0 — Challenge Models Release

This release integrates five ODELIA challenge models into MediSwarm for federated swarm training, along with infrastructure improvements for deployment, testing, and CI.

New Challenge Models

Job                           Model Architecture
challenge_1DivideAndConquer   ResidualEncoder
challenge_2BCN_AIM            SwinUNETR
challenge_3agaldran           MViT v2
challenge_4abmil              CrossModalAttentionABMIL + Swin
challenge_5pimed              ResNet18

Each challenge job is a self-contained NVFlare application with its own model code, data pipeline, configs, and synthetic dataset generator.

Highlights

  • --job flag for docker.sh — Participants can now run preflight checks and local training for any challenge model:
    ./docker.sh --preflight_check --job challenge_5pimed --data_dir $DATADIR --scratch_dir $SCRATCHDIR --GPU device=0
  • Pretrained weight caching — Large model weights (checkpoint_final.pth, mvit_v2_s-ae3be167.pth) are stored outside job directories to prevent NVFlare from bundling them during job submission
  • MODEL_NAME env var fix — All challenge jobs hardcode their MODEL_NAME to prevent the docker.sh default (MST) from silently overriding the intended model
  • Deployment automation — New deploy_and_test.sh script for multi-site Docker image push, startup kit deployment, and swarm lifecycle management
  • Live sync — New kit_live_sync/ for startup kit synchronization with heartbeat monitoring
  • CI reliability — Fixed script permissions, auto-install of gdown, NVFlare submodule sync

Breaking Changes

None. The default behavior of docker.sh (without --job) remains unchanged and runs ODELIA_ternary_classification.

odelia-challenge-v1.0

10 Jul 11:40

What's Changed

New Contributors

Full Changelog: https://github.com/KatherLab/MediSwarm/commits/Odelia_Challenge