Fix flaky CI swarm tests and self-hosted runner reliability #261
Merged
Ultimate-Storm merged 5 commits into main on Apr 5, 2026
…r image

Docker COPY preserves symlinks, but NVFlare's os.walk()-based job signing and zip utilities do not follow symlinks. This caused job submission to fail with "job signature verification failed" because the custom/ symlink directories were empty in the signed zip.

Resolve all symlinks to actual file/directory copies after COPY so os.walk() can traverse them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
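The symlink behaviour described in this commit can be reproduced with the standard library alone. The sketch below is illustrative and is not NVFlare or MediSwarm code: `files_seen_by_walk` and `resolve_symlinks` are hypothetical helper names, and the actual fix was applied in the Docker image build rather than in Python. It assumes every symlink points at an existing target.

```python
import os
import shutil


def files_seen_by_walk(root):
    """Return relative paths of all files that os.walk() reaches.

    os.walk() defaults to followlinks=False, so it lists symlinked
    directories in dirnames but never descends into them.
    """
    seen = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            seen.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(seen)


def resolve_symlinks(root):
    """Replace every symlink under root with a real copy of its target.

    Assumption: no symlink is broken; a dangling link would raise
    FileNotFoundError in shutil.copy2().
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in list(dirnames) + list(filenames):
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.path.realpath(path)
            os.unlink(path)
            if os.path.isdir(target):
                # copytree() copies the contents of nested links by default.
                shutil.copytree(target, path)
            else:
                shutil.copy2(target, path)
```

Before `resolve_symlinks` runs, a job folder whose custom/ entry is a symlink looks empty to `files_seen_by_walk`; afterwards the copied files are visible to any os.walk()-based tool.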
…smatch

The server-side PTFileModelPersistor creates models without class weights, so _class_weight is a plain attribute (absent from state_dict). Client-side training computes class weights from data, registering _class_weight as a buffer (present in state_dict). When the aggregated model is sent back, load_state_dict fails with "Missing key(s): _class_weight".

Fix: pass loss_kwargs through models_config.py to all challenge factory functions, so both server and client models have consistent buffers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
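The mismatch between a plain attribute and a registered buffer can be shown with a small PyTorch module. This is a minimal sketch under stated assumptions: `TinyClassifier` is a hypothetical stand-in for a challenge model, and the class weight values are placeholders, not values from the project.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a challenge model; not MediSwarm code."""

    def __init__(self, num_classes=2, class_weight=None):
        super().__init__()
        self.fc = nn.Linear(4, num_classes)
        if class_weight is None:
            # Plain attribute: never appears in state_dict().
            self._class_weight = None
        else:
            # Buffer: saved in state_dict() and required when loading strictly.
            self.register_buffer(
                "_class_weight",
                torch.as_tensor(class_weight, dtype=torch.float32),
            )


server_model = TinyClassifier()                          # built without class weights
client_model = TinyClassifier(class_weight=[1.0, 3.0])   # weights computed from data

# The client state dict has one key the server state dict lacks.
extra = set(client_model.state_dict()) - set(server_model.state_dict())

# Loading the weight-less state dict into the client model fails (strict mode).
try:
    client_model.load_state_dict(server_model.state_dict())
    error = None
except RuntimeError as exc:
    error = str(exc)

# When both sides are built with class weights, the keys line up.
server_fixed = TinyClassifier(class_weight=[1.0, 1.0])   # placeholder weights
client_model.load_state_dict(server_fixed.state_dict())
```

Here `extra` contains only `_class_weight`, and `error` carries the "Missing key(s)" message quoted in the commit. Constructing both models with the same keyword arguments is the idea behind passing loss_kwargs through the factory functions.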
models_config.py: Add fallback to /MediSwarm/pretrained_weights/ for challenge model checkpoints. The build script stores weights there to avoid bloating NVFlare job transfers, but the path resolution only looked inside the job folder where the weights don't exist.

deploy_and_test.sh: Pass --model_name flag to docker.sh when starting clients so the correct challenge model is used instead of defaulting to MST. Add job_to_model_name() mapping function.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
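The checkpoint fallback described for models_config.py might look roughly like the following. This is a sketch, not the project's code: `resolve_checkpoint` and its parameters are hypothetical names, and only the fallback directory itself comes from the commit message.

```python
from pathlib import Path

# Directory named in the commit message; everything else here is assumed.
FALLBACK_DIR = Path("/MediSwarm/pretrained_weights")


def resolve_checkpoint(filename, job_dir, fallback_dir=FALLBACK_DIR):
    """Return the first existing path for a checkpoint file.

    The job folder is tried first, then the shared fallback directory.
    """
    candidates = [Path(job_dir) / filename, Path(fallback_dir) / filename]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"checkpoint {filename!r} not found; tried: {tried}")
```

Checking the job folder first keeps any job-local checkpoint authoritative, while the error message lists every path tried so a missing weight file is easy to diagnose from CI logs.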
Replace fixed sleep durations with polling loops that check for "Server runner finished." in the server log, preventing premature assertion failures when the runner is under load. Add concurrency control so only one CI run uses the self-hosted GPU runner at a time, and prune stale Docker resources before each run to avoid "no space left on device" errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
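The polling approach can be sketched as a bounded wait on a log file. The PR implements this in the test scripts; the Python version below only illustrates the technique, and `wait_for_log_line` with its parameters is a hypothetical helper. The marker string is the one quoted in the commit message.

```python
import time
from pathlib import Path


def wait_for_log_line(log_path, marker="Server runner finished.",
                      interval=10.0, timeout=3600.0):
    """Poll log_path until marker appears.

    Returns True as soon as the marker is found and False once the
    timeout elapses, instead of sleeping for a fixed duration.
    """
    path = Path(log_path)
    deadline = time.monotonic() + timeout
    while True:
        # The log may not exist yet when polling starts.
        if path.is_file() and marker in path.read_text(errors="replace"):
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))
```

A caller would assert on the return value, so a slow runner only lengthens the wait while a genuinely hung server still fails within the timeout.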
The _generateStartupKitArchives.sh script requires zip to create startup kit archives. The ODELIA Dockerfile includes zip/unzip but the STAMP Dockerfile was missing them, causing CI to fail with "zip: command not found" during the STAMP Docker build step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Replace fixed sleep 120/sleep 3600 in swarm integration tests with polling loops that check for "Server runner finished." in the server log every 10s/30s, preventing premature assertion failures when the self-hosted runner is under load
- Add a concurrency group to pr-test.yaml so only one CI run executes on the self-hosted GPU runner at a time, preventing resource contention
- Add a docker system prune step at the start of each workflow run to reclaim disk space from stale containers/images and prevent "no space left on device" failures

Test plan
🤖 Generated with Claude Code