v1.4.2

Ultimate-Storm released this 10 Apr 13:44
71685d3

MediSwarm v1.4.2

Bug Fixes

Challenge Model Selection (#284, #286)

--job challenge_4abmil and --job challenge_5pimed previously failed with ValueError: Unsupported model name. The auto-derive logic stripped the challenge_ prefix and passed the raw suffix (4abmil, 5pimed) to the model registry instead of the registered keys (4LME_ABMIL, 5Pimed). Only challenge_1DivideAndConquer worked, because its suffix matched the registry key exactly.

  • Added a case statement in docker_config/master_template.yml that maps 4abmil→4LME_ABMIL and 5pimed→5Pimed during auto-derive
  • Embedded the correct MODEL_NAME= directly in each challenge job's SubprocessLauncher script (config_fed_client.conf) so admin-submitted jobs also select the right model regardless of how the client container was started
  • Affects all 5 challenge jobs: 1DivideAndConquer, 2BCN_AIM, 3agaldran, 4LME_ABMIL, 5Pimed
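The suffix-to-key mapping described above can be sketched as a small shell function. This is a minimal illustration, not the actual contents of docker_config/master_template.yml: the function name derive_model_name is hypothetical, and the pass-through default branch is an assumption based on the note that 1DivideAndConquer already matched its registry key.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the model registry key from a --job value.
derive_model_name() {
    local job="$1"
    local suffix="${job#challenge_}"   # strip the challenge_ prefix
    case "$suffix" in
        4abmil) echo "4LME_ABMIL" ;;   # mapping named in the release notes
        5pimed) echo "5Pimed" ;;       # mapping named in the release notes
        *)      echo "$suffix" ;;      # assumption: suffix already equals the registry key
    esac
}

MODEL_NAME="$(derive_model_name "challenge_4abmil")"
echo "MODEL_NAME=${MODEL_NAME}"
```

Keeping the mapping in one case statement means a new challenge job with a mismatched suffix needs only one extra branch.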

Live-sync UX Improvements (#285, #286)

  • Pre-configured startup kits: sync.conf.example now contains the real monitoring server address (172.24.4.65) with ConnectTimeout=5. The kit injection script falls back to copying .example as sync.conf when no local override exists, so participant startup kits always arrive with a working configuration — no manual file copying required
  • Interactive SSH connectivity check: instead of silently exiting on SSH failure, live_sync.sh now tests connectivity, prints the test command and SSH key-sharing instructions, then asks "Continue training without live sync? [y/N]". Non-interactive environments (no TTY) default to continuing without sync
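The two behaviours above (falling back to the .example file, and prompting only when a TTY is present) could look roughly like the following. This is a sketch under assumptions: the function names install_sync_conf and confirm_continue_without_sync and the kit directory argument are hypothetical, and the real kit injection script and live_sync.sh may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: use sync.conf.example when no local sync.conf exists.
install_sync_conf() {
    local kit_dir="$1"
    if [ ! -f "${kit_dir}/sync.conf" ] && [ -f "${kit_dir}/sync.conf.example" ]; then
        cp "${kit_dir}/sync.conf.example" "${kit_dir}/sync.conf"
    fi
}

# Hypothetical helper: returns 0 to continue training without live sync,
# 1 to abort. Intended to be called after an SSH connectivity test fails.
confirm_continue_without_sync() {
    if [ ! -t 0 ]; then
        # No TTY on stdin: default to continuing without sync.
        return 0
    fi
    local answer
    read -r -p "Continue training without live sync? [y/N] " answer
    case "$answer" in
        [yY]|[yY][eE][sS]) return 0 ;;
        *)                 return 1 ;;   # empty input means "N"
    esac
}
```

An existing sync.conf is never overwritten, so a participant's local override survives re-injection of the kit.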

--job Flag Ignoring Challenge Model (#277)

When --job specified a challenge model at client startup, the flag was not passed correctly to the training subprocess. The --job selection is now honoured end-to-end.

Live-sync SSH Hanging (#277)

SSH connections without BatchMode=yes could hang indefinitely waiting for password input in non-interactive environments. Fixed by enforcing BatchMode=yes in all SSH options.
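One way to enforce this is to route every SSH call through a shared options array. The sketch below is illustrative only: SSH_OPTS and check_ssh are hypothetical names, and ConnectTimeout=5 is borrowed from the sync.conf.example value mentioned in these notes. With BatchMode=yes, ssh fails immediately instead of prompting for a password or passphrase.

```shell
#!/usr/bin/env bash
# Options shared by every SSH call; BatchMode=yes disables interactive prompts.
SSH_OPTS=(-o BatchMode=yes -o ConnectTimeout=5)

# Hypothetical helper: returns 0 if a non-interactive login to "$1" succeeds.
check_ssh() {
    local target="$1"
    # Run the no-op command "true" remotely; only the exit status matters.
    ssh "${SSH_OPTS[@]}" "$target" true
}
```

A call such as check_ssh user@host then returns non-zero promptly when key-based login is not set up, rather than blocking on input.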

Checkpoint Collection After stop_all (#268)

stop_all deletes the server working directory, which caused checkpoint collection to fail with a path error. Checkpoint collection now handles the missing server directory gracefully.
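A guard of the following shape illustrates what handling the missing directory gracefully can mean. It is a sketch, not the project's code: collect_checkpoints, its arguments, and the *.pt file pattern are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy checkpoint files from server_dir into dest_dir.
# A missing server_dir (e.g. after stop_all) is reported but is not an error.
collect_checkpoints() {
    local server_dir="$1" dest_dir="$2"
    if [ ! -d "$server_dir" ]; then
        echo "warning: server directory '$server_dir' not found; skipping" >&2
        return 0
    fi
    mkdir -p "$dest_dir"
    # Assumption: checkpoints are files ending in .pt somewhere below server_dir.
    find "$server_dir" -type f -name '*.pt' -exec cp {} "$dest_dir/" \;
}
```

Returning 0 with a warning lets the surrounding collection step proceed to other sources, such as client directories, instead of aborting.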

Increased Timeouts and Webviewer Error Detection (#269 / base)

Raised NVFlare peer/heartbeat timeouts to accommodate large model transfers (2 GB+) over VPN. Enhanced webviewer startup to detect and report errors instead of silently hanging.


New Features

STAMP Deploy Test (#269, #279, #280)

Added an automated 2-node deploy test for STAMP participants:

  • Separate master_template_STAMP.yml and Dockerfile_STAMP for isolated STAMP provisioning
  • scripts/deploy/run_stamp_deploy_test.sh orchestrates the full test: build → provision → inject → launch → collect
  • Multi-model support: test can cycle through all STAMP models in a single run
  • Retry loop with configurable attempts for flaky network conditions
  • Participant README (STAMP_PARTICIPANT_README.md) with step-by-step setup guide
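The retry loop with configurable attempts can be pictured as a generic wrapper like the one below. This is a sketch of the general technique, not an excerpt from run_stamp_deploy_test.sh: the retry function and its argument order (maximum attempts, delay in seconds, then the command) are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS command [args...]
# Runs the command until it succeeds or the attempt limit is reached.
retry() {
    local max_attempts="$1"; shift
    local delay="$1"; shift
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "retry: giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        echo "retry: attempt ${attempt}/${max_attempts} failed; sleeping ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

For example, retry 3 10 some_command would run some_command (a placeholder) up to three times with a 10-second pause between attempts.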

Apt Pin Refresh (#283)

Refreshed pinned apt package versions in Dockerfile_ODELIA to match the current Ubuntu package index, preventing apt-get install failures on fresh builds. Added helper scripts to automate future pin updates.


Full Changelog

v1.4.1...v1.4.2