Fix flaky CI swarm tests and self-hosted runner reliability #261

Merged
Ultimate-Storm merged 5 commits into main from fix/ci-swarm-test-flakiness
Apr 5, 2026

@Ultimate-Storm
Contributor

Summary

  • Replace the fixed sleep 120 / sleep 3600 calls in the swarm integration tests with polling loops that check the server log for "Server runner finished." every 10s/30s, preventing premature assertion failures when the self-hosted runner is under load
  • Add a concurrency group to pr-test.yaml so only one CI run executes on the self-hosted GPU runner at a time, preventing resource contention
  • Add a docker system prune step at the start of each workflow run to reclaim disk space from stale containers/images and prevent "no space left on device" failures
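
The polling idea in the first bullet can be illustrated with a minimal Python sketch. The PR's own change lives in the shell test scripts, so this is an illustration of the logic rather than the merged code. The function name `wait_for_marker`, the `timeout` parameter, and the injectable `sleep`/`clock` arguments are assumptions made for the example; only the marker string and the 10 s interval come from the summary above.

```python
import time
from pathlib import Path

# Marker string quoted in the PR summary.
DONE_MARKER = "Server runner finished."


def wait_for_marker(log_path, marker=DONE_MARKER, interval=10.0, timeout=120.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll log_path until marker appears.

    Returns True as soon as the marker is found and False once `timeout`
    seconds have elapsed without seeing it. `timeout`, `sleep`, and `clock`
    are hypothetical parameters added so the sketch is bounded and testable.
    """
    path = Path(log_path)
    deadline = clock() + timeout
    while True:
        # The log may not exist yet if the server is still starting up.
        if path.is_file() and marker in path.read_text(errors="replace"):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Compared with a fixed sleep, the loop returns as soon as the server reports completion and only fails after the full timeout, so a slow runner delays the test instead of breaking it.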

Test plan

  • CI pipeline runs successfully on this PR (self-validating — the fixes apply to the workflow that tests this PR)
  • Verify concurrency control works by pushing two commits in quick succession — second run should cancel the first
  • Verify disk cleanup step runs before checkout (check "Reclaim disk space" step output in workflow logs)

🤖 Generated with Claude Code

Ultimate-Storm and others added 5 commits April 5, 2026 16:42
…r image

Docker COPY preserves symlinks, but NVFlare's os.walk()-based job signing
and zip utilities do not follow symlinks. This caused job submission to
fail with "job signature verification failed" because the custom/ symlink
directories were empty in the signed zip. Resolve all symlinks to actual
file/directory copies after COPY so os.walk() can traverse them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
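
The symlink behaviour described in this commit message can be reproduced with the standard library alone. The sketch below is not the Dockerfile change itself. It shows that `os.walk()` with its default `followlinks=False` lists a symlinked directory but does not descend into it, and that replacing the link with a real copy makes the files reachable. The helper names `files_seen_by_walk`, `resolve_symlinks`, and `demo`, and the file name `net.py`, are hypothetical.

```python
import os
import shutil
import tempfile


def files_seen_by_walk(root):
    """Return relative paths of files os.walk() reaches without following symlinks."""
    seen = []
    for dirpath, _dirnames, filenames in os.walk(root):  # followlinks defaults to False
        for name in filenames:
            seen.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(seen)


def resolve_symlinks(root):
    """Replace each symlink under root with a copy of its target."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in list(dirnames) + list(filenames):
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.path.realpath(path)
            if not os.path.exists(target):
                continue  # leave dangling links untouched in this sketch
            os.unlink(path)
            if os.path.isdir(target):
                shutil.copytree(target, path)
            else:
                shutil.copy2(target, path)


def demo():
    """Build a job folder whose custom/ entry is a symlink, then resolve it."""
    with tempfile.TemporaryDirectory() as tmp:
        shared = os.path.join(tmp, "shared")
        job = os.path.join(tmp, "job")
        os.makedirs(shared)
        os.makedirs(job)
        with open(os.path.join(shared, "net.py"), "w") as f:
            f.write("# model code\n")
        os.symlink(shared, os.path.join(job, "custom"), target_is_directory=True)

        before = files_seen_by_walk(job)   # symlinked directory is not traversed
        resolve_symlinks(job)
        after = files_seen_by_walk(job)    # copied files are now visible
        return before, after
```

Calling `demo()` returns the file lists seen before and after resolution, which mirrors why the signed zip contained empty custom/ directories until the links were replaced by real copies.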
…smatch

The server-side PTFileModelPersistor creates models without class weights,
so _class_weight is a plain attribute (absent from state_dict). Client-side
training computes class weights from data, registering _class_weight as a
buffer (present in state_dict). When the aggregated model is sent back,
load_state_dict fails with "Missing key(s): _class_weight".

Fix: pass loss_kwargs through models_config.py to all challenge factory
functions, so both server and client models have consistent buffers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
models_config.py: Add fallback to /MediSwarm/pretrained_weights/ for
challenge model checkpoints. The build script stores weights there to
avoid bloating NVFlare job transfers, but the path resolution only
looked inside the job folder where the weights don't exist.

deploy_and_test.sh: Pass --model_name flag to docker.sh when starting
clients so the correct challenge model is used instead of defaulting
to MST. Add job_to_model_name() mapping function.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
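
The checkpoint fallback described for models_config.py might look roughly like the sketch below. The function name `resolve_checkpoint` and its parameters are hypothetical, and the real lookup in the repository may differ. Only the fallback directory /MediSwarm/pretrained_weights/ is taken from the commit message.

```python
from pathlib import Path

# Fallback location named in the commit message.
FALLBACK_DIR = Path("/MediSwarm/pretrained_weights")


def resolve_checkpoint(filename, job_dir, fallback_dir=FALLBACK_DIR):
    """Return the first existing checkpoint path.

    Looks in the job folder first, then in the shared pretrained-weights
    directory. Raises FileNotFoundError if neither location has the file.
    """
    candidates = [Path(job_dir) / filename, Path(fallback_dir) / filename]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    searched = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"checkpoint {filename!r} not found in: {searched}")
```

Checking the job folder first keeps job-local weights authoritative when they exist, while the fallback covers the case where the build script has stored them outside the job to keep transfers small.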
Replace fixed sleep durations with polling loops that check for
"Server runner finished." in the server log, preventing premature
assertion failures when the runner is under load. Add concurrency
control so only one CI run uses the self-hosted GPU runner at a
time, and prune stale Docker resources before each run to avoid
"no space left on device" errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The _generateStartupKitArchives.sh script requires zip to create
startup kit archives. The ODELIA Dockerfile includes zip/unzip but
the STAMP Dockerfile was missing them, causing CI to fail with
"zip: command not found" during the STAMP Docker build step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Ultimate-Storm Ultimate-Storm merged commit cd65d6b into main Apr 5, 2026
6 checks passed
@Ultimate-Storm Ultimate-Storm deleted the fix/ci-swarm-test-flakiness branch April 5, 2026 18:25