MediSwarm v1.4.2
Bug Fixes
Challenge Model Selection (#284, #286)
--job challenge_4abmil and --job challenge_5pimed previously produced a ValueError: Unsupported model name because the auto-derive logic stripped the challenge_ prefix and passed the raw suffix (4abmil, 5pimed) to the model registry instead of the registered keys (4LME_ABMIL, 5Pimed). Only challenge_1DivideAndConquer worked correctly since its suffix matched the registry key exactly.
- Added a case statement in docker_config/master_template.yml that maps 4abmil → 4LME_ABMIL and 5pimed → 5Pimed during auto-derive
- Embedded the correct MODEL_NAME= directly in each challenge job's SubprocessLauncher script (config_fed_client.conf) so admin-submitted jobs also select the right model regardless of how the client container was started
- Affects all 5 challenge jobs: 1DivideAndConquer, 2BCN_AIM, 3agaldran, 4LME_ABMIL, 5Pimed
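The suffix-to-registry-key mapping described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of master_template.yml: the function name derive_model_name is hypothetical, and passing unmapped suffixes through unchanged is an assumption.

```shell
#!/usr/bin/env bash

# Hypothetical helper: turn a --job value into a model registry key.
derive_model_name() {
    local job="$1"
    local suffix="${job#challenge_}"   # strip the challenge_ prefix

    case "$suffix" in
        4abmil) echo "4LME_ABMIL" ;;   # suffix differs from the registry key
        5pimed) echo "5Pimed" ;;       # suffix differs only in case
        *)      echo "$suffix" ;;      # assumed: other suffixes already match their key
    esac
}
```

For example, `derive_model_name challenge_4abmil` prints 4LME_ABMIL, while `derive_model_name challenge_1DivideAndConquer` prints the suffix unchanged.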
Live-sync UX Improvements (#285, #286)
- Pre-configured startup kits: sync.conf.example now contains the real monitoring server address (172.24.4.65) with ConnectTimeout=5. The kit injection script falls back to copying the .example file as sync.conf when no local override exists, so participant startup kits always arrive with a working configuration and no manual file copying is required
- Interactive SSH connectivity check: instead of silently exiting on SSH failure, live_sync.sh now tests connectivity, prints the test command and SSH key-sharing instructions, then asks "Continue training without live sync? [y/N]". Non-interactive environments (no TTY) default to continuing without sync
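The two behaviours above could look roughly like the sketch below. Both function names are hypothetical, and the real kit injection script and live_sync.sh may be structured differently. The sketch only illustrates the fallback copy and the TTY-aware prompt.

```shell
#!/usr/bin/env bash

# Hypothetical helper: install the example config when no local override exists.
ensure_sync_conf() {
    local kit_dir="$1"
    if [ ! -f "$kit_dir/sync.conf" ] && [ -f "$kit_dir/sync.conf.example" ]; then
        cp "$kit_dir/sync.conf.example" "$kit_dir/sync.conf"
    fi
}

# Hypothetical helper: called after an SSH connectivity test has failed.
# Returns 0 to continue training without live sync, 1 to abort.
confirm_continue_without_sync() {
    local answer
    if [ ! -t 0 ]; then
        # stdin is not a terminal, so nobody can answer: continue without sync
        echo "No TTY detected; continuing without live sync." >&2
        return 0
    fi
    printf 'Continue training without live sync? [y/N] ' >&2
    read -r answer || answer=""
    case "$answer" in
        y|Y|yes|YES) return 0 ;;
        *)           return 1 ;;   # default answer is "no"
    esac
}
```

An existing sync.conf is never overwritten, which keeps local overrides intact.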
--job Flag Ignoring Challenge Model (#277)
When --job specified a challenge model while starting a client, the flag was not being passed to the training subprocess correctly. Fixed so the --job selection is honoured end-to-end.
Live-sync SSH Hanging (#277)
SSH connections without BatchMode=yes could hang indefinitely waiting for password input in non-interactive environments. Fixed by enforcing BatchMode=yes in all SSH options.
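As an illustration, a connectivity probe with BatchMode=yes might be written as below. The variable SSH_OPTS and the function ssh_reachable are hypothetical names; the ConnectTimeout value is taken from the sync.conf.example setting mentioned above, and whether the project shares one option string this way is an assumption.

```shell
#!/usr/bin/env bash

# Assumed option set. BatchMode=yes makes ssh fail immediately instead of
# prompting for a password or passphrase when key authentication is unavailable.
SSH_OPTS="-o BatchMode=yes -o ConnectTimeout=5"

# Hypothetical probe: succeeds only if a non-interactive login to "$1" works.
ssh_reachable() {
    # $SSH_OPTS is intentionally unquoted so it splits into separate arguments.
    ssh $SSH_OPTS "$1" true 2>/dev/null
}
```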
Checkpoint Collection After stop_all (#268)
stop_all deletes the server working directory, causing checkpoint collection to fail with a path error. Fixed to handle the missing server dir gracefully.
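One way to tolerate a missing directory during collection is sketched below. The function name collect_checkpoints, its arguments, and the .ckpt file extension are all assumptions made for illustration; the actual collection script may differ.

```shell
#!/usr/bin/env bash

# Hypothetical helper: copy checkpoint files from src to dest,
# skipping sources that no longer exist (e.g. removed by stop_all).
collect_checkpoints() {
    local src="$1" dest="$2"

    if [ ! -d "$src" ]; then
        echo "Skipping $src: directory not found." >&2
        return 0   # a missing directory is not treated as an error
    fi

    mkdir -p "$dest"
    # The *.ckpt pattern is an assumption about how checkpoints are named.
    find "$src" -type f -name '*.ckpt' -exec cp {} "$dest"/ \;
}
```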
Increased Timeouts and Webviewer Error Detection (#269 / base)
Raised NVFlare peer/heartbeat timeouts to accommodate large model transfers (2 GB+) over VPN. Enhanced webviewer startup to detect and report errors instead of silently hanging.
New Features
STAMP Deploy Test (#269, #279, #280)
Added an automated 2-node deploy test for STAMP participants:
- Separate master_template_STAMP.yml and Dockerfile_STAMP for isolated STAMP provisioning
- scripts/deploy/run_stamp_deploy_test.sh orchestrates the full test: build → provision → inject → launch → collect
- Multi-model support: test can cycle through all STAMP models in a single run
- Retry loop with configurable attempts for flaky network conditions
- Participant README (STAMP_PARTICIPANT_README.md) with step-by-step setup guide
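A retry loop with a configurable attempt count, as mentioned in the list above, can be sketched generically as follows. The retry function and its argument order are hypothetical and not taken from run_stamp_deploy_test.sh.

```shell
#!/usr/bin/env bash

# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs the command until it succeeds or the attempt budget is used up.
retry() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1

    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Giving up after $attempt attempt(s): $*" >&2
            return 1
        fi
        echo "Attempt $attempt/$max_attempts failed; retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

A call such as `retry 3 10 some_stage` would run the hypothetical some_stage command up to three times, waiting ten seconds between attempts.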
Apt Pin Refresh (#283)
Refreshed pinned apt package versions in Dockerfile_ODELIA to match the current Ubuntu package index, preventing apt-get install failures on fresh builds. Added helper scripts to automate future pin updates.