Releases · KatherLab/MediSwarm

10 Apr 13:44

Ultimate-Storm

v1.4.2

71685d3

v1.4.2 Latest


MediSwarm v1.4.2

Bug Fixes

Challenge Model Selection (#284, #286)

--job challenge_4abmil and --job challenge_5pimed previously produced a ValueError: Unsupported model name because the auto-derive logic stripped the challenge_ prefix and passed the raw suffix (4abmil, 5pimed) to the model registry instead of the registered keys (4LME_ABMIL, 5Pimed). Only challenge_1DivideAndConquer worked correctly since its suffix matched the registry key exactly.

Added a case statement in docker_config/master_template.yml that maps 4abmil→4LME_ABMIL and 5pimed→5Pimed during auto-derive
Embedded the correct MODEL_NAME= directly in each challenge job's SubprocessLauncher script (config_fed_client.conf) so admin-submitted jobs also select the right model regardless of how the client container was started
Affects all 5 challenge jobs: 1DivideAndConquer, 2BCN_AIM, 3agaldran, 4LME_ABMIL, 5Pimed
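
The suffix-to-registry-key mapping can be illustrated with a minimal Python sketch. The actual fix is a shell case statement in docker_config/master_template.yml; the function name `derive_model_name` and the dictionary below are hypothetical and only mirror the two remappings named in these notes, with every other suffix passed through unchanged.

```python
# Hypothetical illustration only; the real fix lives in a shell case statement.
CHALLENGE_PREFIX = "challenge_"

# Suffixes whose registry key differs from the job-name suffix (taken from the notes).
SUFFIX_TO_REGISTRY_KEY = {
    "4abmil": "4LME_ABMIL",
    "5pimed": "5Pimed",
}


def derive_model_name(job: str) -> str:
    """Map a --job value such as 'challenge_4abmil' to a model registry key."""
    if not job.startswith(CHALLENGE_PREFIX):
        raise ValueError(f"Not a challenge job: {job}")
    suffix = job[len(CHALLENGE_PREFIX):]
    # Fall back to the raw suffix when no explicit remapping exists.
    return SUFFIX_TO_REGISTRY_KEY.get(suffix, suffix)
```

Keeping the exceptions in one lookup table means a new challenge job whose suffix differs from its registry key needs only one added entry.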

Live-sync UX Improvements (#285, #286)

Pre-configured startup kits: sync.conf.example now contains the real monitoring server address (172.24.4.65) with ConnectTimeout=5. The kit injection script falls back to copying .example as sync.conf when no local override exists, so participant startup kits always arrive with a working configuration — no manual file copying required
Interactive SSH connectivity check: instead of silently exiting on SSH failure, live_sync.sh now tests connectivity, prints the test command and SSH key-sharing instructions, then asks "Continue training without live sync? [y/N]". Non-interactive environments (no TTY) default to continuing without sync
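
The prompt behaviour described above can be sketched as a small decision function. This is an illustration in Python rather than the shell logic in live_sync.sh; the function name and the acceptance of "yes" alongside "y" are assumptions.

```python
def should_continue_without_sync(is_tty: bool, answer: str = "") -> bool:
    """Decide whether training proceeds after a failed SSH connectivity check.

    Without a TTY there is nobody to answer the prompt, so training continues
    without sync. With a TTY, the "[y/N]" prompt defaults to "no".
    """
    if not is_tty:
        return True
    return answer.strip().lower() in ("y", "yes")
```

A caller would pass `sys.stdin.isatty()` as `is_tty` and the user's reply to the prompt as `answer`.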

--job Flag Ignoring Challenge Model (#277)

When --job specified a challenge model while starting a client, the flag was not being passed to the training subprocess correctly. Fixed so the --job selection is honoured end-to-end.

Live-sync SSH Hanging (#277)

SSH connections without BatchMode=yes could hang indefinitely waiting for password input in non-interactive environments. Fixed by enforcing BatchMode=yes in all SSH options.
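
For illustration, a non-interactive SSH probe with these options might look like the following Python sketch. `BatchMode` and `ConnectTimeout` are standard OpenSSH client options; the helper names, the `true` probe command, and the 15-second overall timeout are assumptions and not taken from live_sync.sh.

```python
import subprocess


def build_ssh_command(host: str, remote_command: str, connect_timeout: int = 5) -> list[str]:
    """Build an ssh invocation that fails fast instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                      # never wait for interactive input
        "-o", f"ConnectTimeout={connect_timeout}",  # give up on unreachable hosts
        host,
        remote_command,
    ]


def ssh_reachable(host: str) -> bool:
    """Return True if a trivial remote command succeeds (hypothetical helper)."""
    try:
        result = subprocess.run(
            build_ssh_command(host, "true"), capture_output=True, timeout=15
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```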

Checkpoint Collection After stop_all (#268)

stop_all deletes the server working directory, which previously caused checkpoint collection to fail with a path error. Checkpoint collection now handles the missing server directory gracefully.
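
One way to tolerate a missing directory is sketched below in Python. The function name and the `*.ckpt` file pattern are assumptions made for illustration; the project's collection script may work differently.

```python
from pathlib import Path


def collect_checkpoints(server_dir: Path) -> list[Path]:
    """Return checkpoint files under server_dir, or an empty list if it is gone."""
    if not server_dir.is_dir():
        # The directory may already have been deleted by stop_all.
        return []
    return sorted(server_dir.rglob("*.ckpt"))
```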

Increased Timeouts and Webviewer Error Detection (#269 / base)

Raised NVFlare peer/heartbeat timeouts to accommodate large model transfers (2 GB+) over VPN. Enhanced webviewer startup to detect and report errors instead of silently hanging.

New Features

STAMP Deploy Test (#269, #279, #280)

Added an automated 2-node deploy test for STAMP participants:

Separate master_template_STAMP.yml and Dockerfile_STAMP for isolated STAMP provisioning
scripts/deploy/run_stamp_deploy_test.sh orchestrates the full test: build → provision → inject → launch → collect
Multi-model support: test can cycle through all STAMP models in a single run
Retry loop with configurable attempts for flaky network conditions
Participant README (STAMP_PARTICIPANT_README.md) with step-by-step setup guide
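
A retry loop with a configurable number of attempts, as mentioned in the list above, can be sketched as follows. This is a generic Python illustration, not the logic of run_stamp_deploy_test.sh; `run_with_retries` and its parameters are hypothetical.

```python
import time
from typing import Callable


def run_with_retries(step: Callable[[], bool], max_attempts: int = 3,
                     delay_seconds: float = 0.0) -> int:
    """Call step() until it returns True; return the 1-based attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        if step():
            return attempt
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # wait before the next attempt
    raise RuntimeError(f"step failed after {max_attempts} attempts")
```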

Apt Pin Refresh (#283)

Refreshed pinned apt package versions in Dockerfile_ODELIA to match the current Ubuntu package index, preventing apt-get install failures on fresh builds. Added helper scripts to automate future pin updates.

Full Changelog: v1.4.1...v1.4.2


08 Apr 05:02

Ultimate-Storm

v1.4.1

41c7471

v1.4.1

MediSwarm v1.4.1

Changes

Update 2-site deploy test configuration to use DL0 (RUMC_1) + DL2 (MHA_1)
Bump version to 1.4.1

Deploy Test Validation

Successfully validated challenge_1DivideAndConquer over Tailscale VPN with 2 clients:

30+ error-free training rounds across DL0 (RUMC_1) and DL2 (MHA_1)
P2P model exchange (689 MB model): approximately 2 to 3 seconds
Adaptive epoch calculation with EPOCHS_MAX_CAP=10 working correctly
Both swarm_config and swarm_start phases completed cleanly

Full Changelog: v1.4.0...v1.4.1


07 Apr 18:47

Ultimate-Storm

v1.4.0

a45410e

v1.4.0

What's New

Webviewer (Live Monitor)

Fix age column flicker — Replaced <meta http-equiv="refresh"> with JS-based auto-refresh and client-side age ticking. Age now counts up smoothly without resetting to 0s on reload.
Hostname column — Dashboard now shows which machine each run is coming from (parsed from heartbeat.json).
Error status detection — Runs that hit FATAL_SYSTEM_ERROR, EXECUTION_EXCEPTION, RuntimeError, OutOfMemoryError, or CUDA out of memory are now flagged with a red "error" badge instead of appearing as "stale" or "finished".
Default metrics visibility — Only train/val ACC and AUC-ROC are shown by default in charts. All other series are hidden but toggleable via the Chart.js legend.
Label distribution chart — Detail page now shows a grouped bar chart of class counts per train/val/test split, parsed from console output.
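
The error status detection listed above amounts to scanning run output for known failure markers. A minimal Python sketch, assuming plain substring matching (the webviewer's actual matching rules are not specified in these notes, and `has_error_marker` is a hypothetical name):

```python
# Failure markers named in the release notes.
ERROR_MARKERS = (
    "FATAL_SYSTEM_ERROR",
    "EXECUTION_EXCEPTION",
    "RuntimeError",
    "OutOfMemoryError",
    "CUDA out of memory",
)


def has_error_marker(log_text: str) -> bool:
    """Return True if any known failure marker occurs in the log text."""
    return any(marker in log_text for marker in ERROR_MARKERS)
```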

Training

Reduce the EPOCHS_MAX_CAP default from 20 to 10 to prevent excessive epochs on small sites. For example, RUMC_1 (22 samples) previously ran 20 epochs per round and is now capped at 10. Override with the EPOCHS_MAX_CAP env var.

Heartbeat / Live Sync

Hostname field added to heartbeat.json output
ANSI escape code stripping from RUN_NAME (fixes garbled names from colored terminal output)
Quote cleanup on kit_version field
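
Stripping ANSI escape codes is commonly done with a regular expression over CSI sequences. The Python sketch below shows the idea; the live-sync scripts may implement it differently (for example with sed), and `strip_ansi` is a hypothetical name.

```python
import re

# Matches CSI sequences such as "\x1b[31m" (set color) and "\x1b[0m" (reset).
ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def strip_ansi(text: str) -> str:
    """Remove ANSI CSI escape sequences from text."""
    return ANSI_CSI.sub("", text)
```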

CI/CD

Deploy test workflow now triggers on release publish instead of weekly schedule (manual dispatch retained)

Housekeeping

Closed PR #263 (superseded by #266)


05 Apr 14:11

Ultimate-Storm

v1.3.0

d1ccf85

MediSwarm v1.3.0

Released: 2026-04-05

Major release adding the STAMP histopathology classification pipeline, FedProx aggregation strategy, comprehensive CI/CD infrastructure, Duke benchmark pipeline, and expanded documentation with architecture diagrams.

🔬 STAMP Classification Pipeline

Full support for KatherLab STAMP 2.4.0 histopathology classification in federated learning:

Separate Dockerfile_STAMP — Python 3.11, PyTorch 2.7.1, CUDA 12.6 (independent from ODELIA's Python 3.10/PyTorch 2.2.2 image)
Build flag — buildDockerImageAndStartupKits.sh now accepts -d / --dockerfile to select between Dockerfile_ODELIA and Dockerfile_STAMP
Synthetic dataset generator — Creates 2 sites × 15 patients with H5 feature files for integration testing
Integration tests — Preflight check, local training, and NVFlare simulation mode (3 rounds, 2 clients)
Per-round metrics CSV — STAMPMetricsCallback writes ground-truth/prediction probabilities and summary metrics per epoch

Two Docker Images

After v1.3.0, MediSwarm maintains two Docker images:

| Image | Python | PyTorch | Use Case |
| --- | --- | --- | --- |
| `jefftud/odelia:<ver>` | 3.10 | 2.2.2 | 3D breast MRI classification |
| `jefftud/stamp:<ver>` | 3.11 | 2.7.1 | STAMP histopathology classification |

🔄 FedProx Aggregation Strategy

Alternative to FedAvg for improved convergence with non-IID medical data:

FedProxCallback — Lightning callback adds proximal term (μ/2) × ‖w_local − w_global‖² to gradient updates
Cross-pipeline — Compatible with both ODELIA (pytorch_lightning) and STAMP (lightning)
Configurable — Set FEDPROX_MU environment variable (default: 0 = disabled, recommended: 0.001–0.01)
Documentation — docs/AGGREGATION_STRATEGIES.md compares FedAvg, FedProx, Scaffold, and FedOpt with decision matrix
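
To make the proximal term concrete, the sketch below evaluates (μ/2) × ‖w_local − w_global‖² and its gradient with respect to the local weights, μ × (w_local − w_global), on plain Python lists. It is a framework-free illustration and not the FedProxCallback implementation; both function names are hypothetical.

```python
from typing import Sequence


def proximal_penalty(w_local: Sequence[float], w_global: Sequence[float],
                     mu: float) -> float:
    """Compute (mu / 2) * ||w_local - w_global||^2."""
    squared_distance = sum((l - g) ** 2 for l, g in zip(w_local, w_global))
    return 0.5 * mu * squared_distance


def proximal_gradient(w_local: Sequence[float], w_global: Sequence[float],
                      mu: float) -> list[float]:
    """Gradient of the penalty with respect to w_local: mu * (w_local - w_global)."""
    return [mu * (l - g) for l, g in zip(w_local, w_global)]
```

With μ = 0 both functions return zeros, which corresponds to the disabled default and plain FedAvg behavior.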

🧪 CI/CD for STAMP

Expanded test infrastructure covering both pipelines:

Unit tests — test_stamp_training.py (465 lines), test_stamp_model_wrapper.py (257 lines), test_fedprox_callback.py (286 lines)
Integration tests — STAMP Docker build + preflight + local training + simulation in pr-test.yaml
Unified packages — unit-tests.yaml switched from pytorch-lightning to unified lightning package
Timeout — PR test timeout increased from 45 to 60 minutes
Cleanup — CI cleanup step now kills stamp and nvflare containers alongside odelia

📊 Duke Benchmark Pipeline

Automated end-to-end benchmarking on the Duke Breast MRI dataset:

run_duke_benchmark.sh — Orchestrates build → deploy → swarm training → result collection → local model comparison
Configurable deploy — deploy_and_test.sh reads SITES and SERVER_NAME from deploy_sites.conf (backward-compatible defaults)
deploy_sites.conf.example — Template with dl0/dl2/dl3 configuration for TUD compute cluster
Results template — docs/DUKE_BENCHMARK_RESULTS.md for recording benchmark outcomes

📐 Architecture Documentation

Expanded README from 46 lines to 214 lines:

System Architecture — Mermaid diagram showing site-to-server topology with NVFlare aggregation
Training Pipeline — Mermaid sequence diagram showing federated learning round lifecycle
Supported Pipelines — Comparison table (ODELIA 3D CNN vs STAMP Classification)
Key Features — Privacy, Docker reproducibility, multi-pipeline support
Project Structure — Annotated directory tree

🔐 Differential Privacy Assessment

Gap analysis and roadmap (documentation only — implementation deferred to v1.4.0):

docs/DIFFERENTIAL_PRIVACY.md — Current PercentilePrivacy is gradient clipping, NOT formal (ε,δ)-DP. Detailed analysis of Opacus/DP-SGD integration path, compatibility issues, and privacy budget accounting
docs/DIFFERENTIAL_PRIVACY_DECISION.md — Architecture decision record

Changed

deploy_and_test.sh container matching broadened to include stamp and nvflare alongside odelia
CI pr-test.yaml timeout increased from 45 to 60 minutes
CI cleanup step now kills stamp and nvflare containers

Stats

31 files changed, 3,465 insertions, 162 deletions
16 new files created
9 pull requests (#252–#260)

Upgrade Notes

No breaking changes from v1.2.0
ODELIA pipeline users: no action required — Dockerfile_ODELIA is unchanged
STAMP pipeline users: build with ./buildDockerImageAndStartupKits.sh -d docker_config/Dockerfile_STAMP -p <project>
FedProx: opt-in via FEDPROX_MU env var — set to 0 or leave unset for standard FedAvg behavior

Full Changelog: v1.2.0...v1.3.0


04 Apr 21:36

Ultimate-Storm

v1.2.0

f93a96e

MediSwarm v1.2.0

Highlights

This release introduces STAMP classification support for swarm learning, a prediction workflow for external test data, significant code deduplication, improved training stability, and comprehensive documentation for making standalone training code MediSwarm-compatible.

New Features

STAMP Classification Job (#249)

New STAMP_classification job for swarm learning with STAMP's data pipeline (H5 features + clinical tables)
Supports VIT, MLP, TransMIL, and other STAMP model architectures
Configurable via STAMP_* environment variables
Stratified train/val split with STAMP's data loading pipeline

Prediction Workflow (#247)

New prediction workflow for evaluating trained swarm models on external test data
Supports both ODELIA 3D CNN and STAMP classification models
Configurable via environment variables for model path, data directory, and output format

Weighted Epochs Per Site (#251)

Replaces hardcoded per-site epoch dictionaries with a formula-based approach
Formula: epochs = base_epochs × (reference_size / num_train_samples), clamped to [1, max_cap]
Sites with fewer training samples get more local epochs per round, equalizing gradient updates across sites
Configurable via EPOCHS_PER_ROUND, EPOCHS_REFERENCE_DATASET_SIZE, EPOCHS_MAX_CAP env vars
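
The formula above can be written as a short Python function. The rounding step is an assumption, since the notes do not say how fractional epochs are handled, and the function name and the argument values in the usage note are hypothetical rather than project defaults.

```python
def epochs_for_site(num_train_samples: int, base_epochs: int,
                    reference_size: int, max_cap: int) -> int:
    """epochs = base_epochs * (reference_size / num_train_samples), clamped to [1, max_cap]."""
    if num_train_samples <= 0:
        raise ValueError("num_train_samples must be positive")
    raw = base_epochs * (reference_size / num_train_samples)
    # Rounding to the nearest integer is an assumption of this sketch.
    return max(1, min(max_cap, round(raw)))
```

For example, with `base_epochs=5`, `reference_size=100`, and `max_cap=10`, a site with 22 samples hits the cap of 10, a site with 100 samples gets 5 epochs, and a site with 1000 samples is clamped up to 1.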

Best + Last Model Checkpoints (#251)

finalize_training() now saves both best (by monitor metric) and latest checkpoints
Deployers can choose between peak-validation and final-aggregated models

Server Dashboard Enhancement (#240)

Enhanced server-side monitoring dashboard for real-time swarm training visibility

Client Stability Improvements (#245)

Systemd service for VPN with auto-reconnect and keepalive
GPU health check script for pre-training and Docker health checks
Docker container restart policies (--restart=on-failure:5)
VPN health monitor with automatic service restart after consecutive failures

Infrastructure & DevOps

Code Deduplication (#241)

Consolidated 5 duplicate challenge job directories into shared _shared/custom/ with symlinks
Moved build scripts to scripts/build/ and CI scripts to scripts/ci/
Single source of truth for training code across all ODELIA/challenge jobs

NVFlare Workflow Enhancements (#242)

Cross-site evaluation (CSE) workflow added to server and client configs
Tuned timeouts from 100-hour placeholders to practical values
Explicit metric comparator configuration
PercentilePrivacy filter for gradient quality control

Automated Tests (#243)

New unit test suite in tests/unit_tests/ (models_config, env_config, data_module)
GitHub Actions workflow for unit tests on PRs
Fixed hardcoded paths in test_challenge_models.py

Docker Build Optimization (#250)

Reordered Dockerfile layers: pip installs (expensive, stable) before apt installs (cheap, frequent CVE bumps)
Added --no-cache-dir flags to reduce image size
Consolidated RUN layers for better caching

NVFlare 2.7.2 Upgrade (#235, #236)

Upgraded from NVFlare 2.5.x to 2.7.2

Bug Fixes

Fix integration test printed icons (#224)
Fix site name argument ordering (#237, fixes #227)
Update CI Node.js version (#238, fixes #222)
Fix CI apt-get update permissions (#239)
Fix CLI flags for env vars lost when using sudo (#230)

Documentation

MediSwarm Compatibility Guide (#244, addresses #216) — step-by-step guide for making standalone training code MediSwarm-compatible
Updated README with correct repository links

Training Improvements (#246)

Class-weighted loss for imbalanced datasets
Gradient accumulation (effective batch size of 8)
Gradient clipping (val=1.0) to prevent exploding gradients
16-mixed precision for stability

Full Changelog: v1.1.0...v1.2.0


02 Apr 21:11

Ultimate-Storm

v1.1.0

947584a

v1.1.0 — Challenge Models

MediSwarm v1.1.0 — Challenge Models Release

This release integrates five ODELIA challenge models into MediSwarm for federated swarm training, along with infrastructure improvements for deployment, testing, and CI.

New Challenge Models

| Job | Model Architecture |
| --- | --- |
| `challenge_1DivideAndConquer` | ResidualEncoder |
| `challenge_2BCN_AIM` | SwinUNETR |
| `challenge_3agaldran` | MViT v2 |
| `challenge_4abmil` | CrossModalAttentionABMIL + Swin |
| `challenge_5pimed` | ResNet18 |

Each challenge job is a self-contained NVFlare application with its own model code, data pipeline, configs, and synthetic dataset generator.

Highlights

--job flag for docker.sh — Participants can now run preflight checks and local training for any challenge model:

./docker.sh --preflight_check --job challenge_5pimed --data_dir $DATADIR --scratch_dir $SCRATCHDIR --GPU device=0

Pretrained weight caching — Large model weights (checkpoint_final.pth, mvit_v2_s-ae3be167.pth) are stored outside job directories to prevent NVFlare from bundling them during job submission
MODEL_NAME env var fix — All challenge jobs hardcode their MODEL_NAME to prevent the docker.sh default (MST) from silently overriding the intended model
Deployment automation — New deploy_and_test.sh script for multi-site Docker image push, startup kit deployment, and swarm lifecycle management
Live sync — New kit_live_sync/ for startup kit synchronization with heartbeat monitoring
CI reliability — Fixed script permissions, auto-install of gdown, NVFlare submodule sync

Breaking Changes

None. The default behavior of docker.sh (without --job) remains unchanged and runs ODELIA_ternary_classification.


10 Jul 11:40

Ultimate-Storm

Odelia_Challenge

bcef7de

odelia-challenge-v1.0

What's Changed

Dev dashboard enhancements by @Ultimate-Storm in #5
Merge include_nvflare to dev-7-test-controller by @oleschwen in #13
Dev 7 test controller by @oleschwen in #10
App example cifar10 by @Ultimate-Storm in #12
Dev 8 single logging by @oleschwen in #17
Dev 11 minimal application code by @oleschwen in #15
Get training via VPN to work by @oleschwen in #20
Dev controller first review by @Ultimate-Storm in #29
Dev 36 fix tests after merge by @oleschwen in #37
dev-22 use same application code for local and swarm training by @oleschwen in #24
dev-27 use consistent nvflare version by @oleschwen in #28
Dev 26 versioning docker images by @oleschwen in #33
Documentation and docker.sh scripts for multiple GPUs by @oleschwen in #46
Dev 32 setup swarm training on odelia data by @oleschwen in #47
fix missing version numbers for last-in-line apt packages by @oleschwen in #48
further corrections of apt package versions by @oleschwen in #49
Apt package version update by @oleschwen in #57
53 do not git directory in image by @oleschwen in #54
update apt package versions by @oleschwen in #59
55 automate updating apt packages by @oleschwen in #56
update apt package version by @oleschwen in #61
Update apt package versions 20250612 by @oleschwen in #65
Dev apt version ci by @Ultimate-Storm in #63
Update apt package versions 2025-06-23 by @oleschwen in #66
Fix auto apt update github action by @Ultimate-Storm in #67
Dev 34 latest update from gustav code by @oleschwen in #58
Automatically updating APT package by @Ultimate-Storm in #71
chore: Update APT versions in Dockerfile by @github-actions in #78
Dev demo odelia by @Ultimate-Storm in #79

New Contributors

@Ultimate-Storm made their first contribution in #5
@oleschwen made their first contribution in #13
@github-actions made their first contribution in #78

Full Changelog: https://github.com/KatherLab/MediSwarm/commits/Odelia_Challenge

Contributors

Ultimate-Storm and oleschwen
