fix: CI cleanup fails with Permission denied on root-owned Docker files #266
Merged
Ultimate-Storm merged 6 commits into main on Apr 7, 2026
Embed live_sync integration directly in the master_template.yml docker_cln_sh template so each client startup kit produces a single docker.sh with all flags.
_injectLiveSyncIntoStartupKits.sh now only copies the helper files (sync.conf, build_heartbeat.sh, live_sync.sh) instead of creating a wrapper that delegates to docker_original.sh.
Live sync auto-starts for --local_training (foreground, killed on exit) and --start_client (nohup daemon). All other modes are unchanged. If live_sync.sh is not present, the hooks are a graceful no-op.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ads, version tracking
- server_tools/app.py: Major overhaul of the MediSwarm Live Monitor webviewer
- Add filter bar (site, mode, status, job_id) with default sort by newest
- Add status inference (stale >5min, finished >1hr, heartbeat_final.json wins)
- Add file download endpoint for all run artifacts
- Add job grouping for swarm runs
- Add kit version column from heartbeat data
- Add training summary extraction (best val metrics, epoch count, FL rounds)
- Add TensorBoard metric parsing and inline charts via tbparse
- Add enriched detail page with full file inventory, checkpoints, models cards
- Add stats bar with running/finished/stale/site counts
- Add server-side file paths with download buttons
- kit_live_sync/build_heartbeat.sh: Add kit_version field extracted from docker.sh
MEDISWARM_VERSION baked in at build time
- kit_live_sync/live_sync.sh: Fix duplicate entries and empty heartbeat fields
- Export SCRATCHDIR before calling build_heartbeat.sh so run_dir is populated
- Track current run and finalize old runs with heartbeat_final.json when a new
local training run starts (prevents stale "running" entries)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
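The status inference rules listed in this commit message (final heartbeat wins, stale after 5 minutes, presumed finished after 1 hour) can be sketched as below. This is a minimal illustration and not the code in server_tools/app.py: the file name heartbeat.json, the timestamp field (assumed to be epoch seconds), and the function name infer_status are assumptions made for the example.

```python
import json
import time
from pathlib import Path
from typing import Optional

STALE_AFTER_S = 5 * 60       # "stale >5min" from the commit message
FINISHED_AFTER_S = 60 * 60   # "finished >1hr" from the commit message


def infer_status(run_dir: Path, now: Optional[float] = None) -> str:
    """Return a display status for one run directory (illustrative only)."""
    now = time.time() if now is None else now

    # heartbeat_final.json wins over everything else when it is present.
    final_path = run_dir / "heartbeat_final.json"
    if final_path.exists():
        try:
            return json.loads(final_path.read_text()).get("status", "finished")
        except (OSError, ValueError):
            return "finished"

    # Hypothetical file name and schema for the regular heartbeat.
    try:
        hb = json.loads((run_dir / "heartbeat.json").read_text())
    except (OSError, ValueError):
        return "unknown"

    status = hb.get("status", "unknown")
    age = now - float(hb.get("timestamp", 0))  # assumes epoch seconds

    if status == "running":
        if age > FINISHED_AFTER_S:
            return "finished"  # presumed finished
        if age > STALE_AFTER_S:
            return "stale"
    # Otherwise use the heartbeat status as-is.
    return status
```

Checking the one-hour threshold before the five-minute one matters, since any heartbeat older than an hour is also older than five minutes.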
…hart is present
The Chart.js CDN script was only included inside the console metrics chart block, so TensorBoard charts would try to use Chart() without the library loaded when console metrics were absent.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Increase learn_task_ack_timeout from 120 s to 600 s to handle 552 MB model weight streaming over Tailscale VPN. Also increase final_result_ack_timeout, start_task_timeout, configure_task_timeout, and progress_timeout. Add tensor_min_download_timeout=600 to fix the download timeout inconsistency warning.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PTClientAPILauncherExecutor in NVFlare 2.7.2 does not support tensor_min_download_timeout as a constructor parameter. This caused all clients to fail job deployment with:
TypeError: PTClientAPILauncherExecutor.__init__() got an unexpected keyword argument 'tensor_min_download_timeout'
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The NVFlare server container runs as root and creates root-owned files in the bind-mounted workspace directory. The cleanup_temporary_data function fails with "Permission denied" when the non-root runner tries to rm -rf these files, causing the integration test step to fail.
Add a _rm_rf helper that falls back to a disposable Alpine container (running as root) to delete the directory when normal rm fails, the same approach already used in the pr-test.yaml workflow steps.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
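The fallback logic described in this commit can be sketched in Python as follows. The actual _rm_rf helper lives in a shell script, so this is a rendering of the same idea in another language, not the PR's code. The docker command line, the /work mount point, and the injectable run parameter are assumptions made so the example can be exercised without Docker.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, List


def _run_checked(cmd: List[str]) -> None:
    """Default runner: execute the command and raise if it fails."""
    subprocess.run(cmd, check=True)


def rm_rf(target: Path, run: Callable[[List[str]], None] = _run_checked) -> None:
    """Delete a directory, falling back to a root container on PermissionError."""
    target = Path(target).resolve()
    if not target.exists():
        return
    try:
        # First attempt: plain removal as the current (non-root) user.
        shutil.rmtree(target)
        return
    except PermissionError:
        pass
    # Fallback: mount the parent directory into a disposable Alpine container,
    # which runs as root by default, and delete the directory from inside it.
    # The exact flags used by the real helper are not shown in the source.
    run([
        "docker", "run", "--rm",
        "-v", f"{target.parent}:/work",
        "alpine", "rm", "-rf", f"/work/{target.name}",
    ])
```

Mounting the parent rather than the target itself lets the container remove the directory entry too, not just its contents.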
CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
if not events:
    return {"scalars": []}

# Parse the directory containing events

    media_type="application/octet-stream",
)

return FileResponse(
    path=str(target),
    filename=target.name,

if not target.exists() or not target.is_file():
    raise HTTPException(status_code=404, detail="File not found")

if has_final:
    try:
        final = json.loads((run_dir / "heartbeat_final.json").read_text())

    - If status is "running" but heartbeat is >1 hour old -> "finished" (presumed)
    - Otherwise use heartbeat status as-is
    """
    has_final = (run_dir / "heartbeat_final.json").exists()

    cls = "badge-finished"
elif status in ("error", "failed"):
    cls = "badge-error"
elif status == "stale":
Comment on lines +1117 to +1121
# ---------------------------------------------------------------------------
# File download endpoint
# ---------------------------------------------------------------------------
Summary
- The NVFlare server container runs as root and creates root-owned files in the bind-mounted workspace/ directory
- When cleanup_temporary_data() in runIntegrationTests.sh runs rm -rf as the non-root CI runner user, it fails with Permission denied on those files, causing the run_dummy_training_in_swarm step to fail
- Add a _rm_rf helper that first tries normal rm -rf, then falls back to a disposable Alpine Docker container (running as root) to delete the directory, the same approach already used in pr-test.yaml's pre-checkout and post-cleanup steps
Test plan
- Run the MediSwarm PR Validation workflow on this PR and verify the Run dummy training in swarm step passes without permission errors
🤖 Generated with Claude Code