fix: CI cleanup fails with Permission denied on root-owned Docker files by Ultimate-Storm · Pull Request #266 · KatherLab/MediSwarm

Ultimate-Storm · 2026-04-07T15:30:38Z

Summary

- The NVFlare server container runs as root and creates root-owned files inside the bind-mounted workspace/ directory.
- When cleanup_temporary_data() in runIntegrationTests.sh runs rm -rf as the non-root CI runner user, it fails with Permission denied on those files, which causes the run_dummy_training_in_swarm step to fail.
- Added a _rm_rf helper that first tries a normal rm -rf, then falls back to a disposable Alpine Docker container (running as root) to delete the directory. This is the same approach already used in pr-test.yaml's pre-checkout and post-cleanup steps.
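The real helper is a shell function in runIntegrationTests.sh and is not reproduced on this page. The sketch below restates the same try-then-fall-back idea in Python, the language of the other code shown here. The names `rm_rf` and `docker_rm_command`, the `/work` mount point, and the `alpine` image tag are assumptions made for illustration.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def docker_rm_command(target: Path, image: str = "alpine") -> list[str]:
    """Build a `docker run` command that deletes `target` as root.

    The parent directory is bind-mounted at /work (hypothetical mount point)
    so the container can remove the entry by name.
    """
    target = target.resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{target.parent}:/work",
        image,
        "rm", "-rf", f"/work/{target.name}",
    ]


def rm_rf(
    target: Path,
    run: Callable[[Sequence[str]], object] = lambda cmd: subprocess.run(cmd, check=True),
) -> str:
    """Delete directory `target`; return 'absent', 'direct' or 'docker'."""
    if not target.exists():
        return "absent"
    try:
        # First attempt: plain removal as the current (non-root) user.
        shutil.rmtree(target)
        return "direct"
    except PermissionError:
        # Root-owned files left behind by a container: retry as root
        # inside a throwaway container.
        run(docker_rm_command(target))
        return "docker"
```

The `run` parameter exists only so the fallback can be exercised without a Docker daemon; in normal use the default calls `subprocess.run` with `check=True`, so a failed container run raises instead of passing silently.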

Test plan

- Trigger the MediSwarm PR Validation workflow on this PR and verify the Run dummy training in swarm step passes without permission errors.
- Confirm the cleanup step successfully removes root-owned workspace files.

🤖 Generated with Claude Code

Embed live_sync integration directly in the master_template.yml docker_cln_sh template so each client startup kit produces a single docker.sh with all flags. _injectLiveSyncIntoStartupKits.sh now only copies the helper files (sync.conf, build_heartbeat.sh, live_sync.sh) instead of creating a wrapper that delegates to docker_original.sh. Live sync auto-starts for --local_training (foreground, killed on exit) and --start_client (nohup daemon). All other modes are unchanged. If live_sync.sh is not present the hooks are a graceful no-op. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ads, version tracking

- server_tools/app.py: Major overhaul of the MediSwarm Live Monitor webviewer
  - Add filter bar (site, mode, status, job_id) with default sort by newest
  - Add status inference (stale >5min, finished >1hr, heartbeat_final.json wins)
  - Add file download endpoint for all run artifacts
  - Add job grouping for swarm runs
  - Add kit version column from heartbeat data
  - Add training summary extraction (best val metrics, epoch count, FL rounds)
  - Add TensorBoard metric parsing and inline charts via tbparse
  - Add enriched detail page with full file inventory, checkpoints, models cards
  - Add stats bar with running/finished/stale/site counts
  - Add server-side file paths with download buttons
- kit_live_sync/build_heartbeat.sh: Add kit_version field extracted from docker.sh MEDISWARM_VERSION baked in at build time
- kit_live_sync/live_sync.sh: Fix duplicate entries and empty heartbeat fields
  - Export SCRATCHDIR before calling build_heartbeat.sh so run_dir is populated
  - Track current run and finalize old runs with heartbeat_final.json when a new local training run starts (prevents stale "running" entries)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…hart is present

The Chart.js CDN script was only included inside the console metrics chart block, so TensorBoard charts would try to use Chart() without the library loaded when console metrics were absent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

learn_task_ack_timeout 120→600s to handle 552MB model weight streaming over Tailscale VPN. Also increase final_result_ack_timeout, start_task_timeout, configure_task_timeout, and progress_timeout. Add tensor_min_download_timeout=600 to fix the download timeout inconsistency warning. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
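A quick back-of-the-envelope check shows why the larger timeout matters. Assuming the whole 552 MB payload has to arrive within the timeout window (a simplification; the exact semantics of each NVFlare timeout are not described here), the link must sustain about 4.6 MB/s at 120 s but only about 0.92 MB/s at 600 s. The helper name below is hypothetical.

```python
def min_throughput_mb_per_s(payload_mb: float, timeout_s: float) -> float:
    """Lowest sustained rate (MB/s) that moves the payload before the timeout."""
    return payload_mb / timeout_s


# 552 MB is the model weight size quoted in the commit message.
for timeout_s in (120, 600):
    rate = min_throughput_mb_per_s(552, timeout_s)
    print(f"{timeout_s:>4} s -> {rate:.2f} MB/s")
```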

PTClientAPILauncherExecutor in NVFlare 2.7.2 does not support tensor_min_download_timeout as a constructor parameter. This caused all clients to fail job deployment with:

TypeError: PTClientAPILauncherExecutor.__init__() got an unexpected keyword argument 'tensor_min_download_timeout'

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
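This class of failure can be caught before deployment by comparing the configured arguments against the constructor signature. The sketch below is not part of this PR and does not import NVFlare; `LegacyExecutor` and `launch_timeout` are hypothetical stand-ins for a class whose `__init__` lacks the new option.

```python
import inspect


class LegacyExecutor:
    """Stand-in for an executor class whose constructor lacks the new option."""

    def __init__(self, launch_timeout: float = 60.0):
        self.launch_timeout = launch_timeout


def unsupported_kwargs(cls: type, kwargs: dict) -> list[str]:
    """Return the keys in `kwargs` that `cls.__init__` would reject by name."""
    params = inspect.signature(cls.__init__).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []  # constructor accepts **kwargs, so no name is rejected
    return [key for key in kwargs if key not in params]


config = {"launch_timeout": 120.0, "tensor_min_download_timeout": 600}
print(unsupported_kwargs(LegacyExecutor, config))
# prints ['tensor_min_download_timeout']
```

Passing `config` straight to `LegacyExecutor(**config)` raises the same kind of `TypeError` quoted in the commit message.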

The NVFlare server container runs as root and creates root-owned files in the bind-mounted workspace directory. The cleanup_temporary_data function fails with "Permission denied" when the non-root runner tries to rm -rf these files, causing the integration test step to fail. Add a _rm_rf helper that falls back to a disposable Alpine container (running as root) to delete the directory when normal rm fails — the same approach already used in the pr-test.yaml workflow steps. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-advanced-security

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

server_tools/app.py

    if not events:
        return {"scalars": []}

-    # Parse the directory containing events


server_tools/app.py

+        media_type="application/octet-stream",
+    )




server_tools/app.py

+
+    return FileResponse(
+        path=str(target),
+        filename=target.name,



server_tools/app.py

+
+    if not target.exists() or not target.is_file():
+        raise HTTPException(status_code=404, detail="File not found")
+
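The hunk above does not show how `target` is derived from the request. For a download endpoint, a commonly used guard is to resolve the requested path and confirm it stays inside the served directory before the existence check. The following is a minimal sketch of that guard, not code from app.py; `resolve_inside`, `base_dir`, and `relative` are hypothetical names.

```python
from pathlib import Path
from typing import Optional


def resolve_inside(base_dir: Path, relative: str) -> Optional[Path]:
    """Return the resolved file path if it lies inside `base_dir`, else None."""
    base = base_dir.resolve()
    candidate = (base / relative).resolve()
    # Path.is_relative_to requires Python 3.9+.
    if not candidate.is_relative_to(base):
        return None  # e.g. "../secret.txt" escaped the served directory
    if not candidate.is_file():
        return None  # missing, or a directory
    return candidate
```

A caller would map a `None` result to the same 404 response used in the hunk, so that a rejected path is indistinguishable from a missing file.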


server_tools/app.py

+
+    if has_final:
+        try:
+            final = json.loads((run_dir / "heartbeat_final.json").read_text())


server_tools/app.py

+    - If status is "running" but heartbeat is >1 hour old -> "finished" (presumed)
+    - Otherwise use heartbeat status as-is
+    """
+    has_final = (run_dir / "heartbeat_final.json").exists()
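The rules in this docstring, together with the thresholds named in the commit message (stale >5min, finished >1hr, heartbeat_final.json wins), can be sketched as a pure function. This is an illustration under stated assumptions, not the code in app.py: the function signature, the constant names, and the order in which the two age thresholds are checked are guesses.

```python
from typing import Optional

STALE_AFTER_S = 5 * 60      # "stale >5min" from the commit message
FINISHED_AFTER_S = 60 * 60  # "finished >1hr" from the commit message


def infer_status(
    heartbeat_status: str,
    heartbeat_age_s: float,
    final_status: Optional[str] = None,
) -> str:
    """Infer a display status for one run.

    `final_status` is the status read from heartbeat_final.json, or None
    when that file is missing or unreadable (an assumption of this sketch).
    """
    if final_status is not None:
        return final_status  # heartbeat_final.json wins
    if heartbeat_status == "running":
        if heartbeat_age_s > FINISHED_AFTER_S:
            return "finished"  # presumed finished
        if heartbeat_age_s > STALE_AFTER_S:
            return "stale"
    return heartbeat_status  # otherwise use the heartbeat status as-is
```

Keeping the function free of file and clock access means the thresholds can be checked with plain values, for example `infer_status("running", 400)` for a heartbeat that is 400 seconds old.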


server_tools/app.py

        cls = "badge-finished"
    elif status in ("error", "failed"):
        cls = "badge-error"
+    elif status == "stale":


server_tools/app.py

+
+
+# ---------------------------------------------------------------------------
+# File download endpoint
+# ---------------------------------------------------------------------------



Ultimate-Storm and others added 6 commits April 7, 2026 14:12

github-advanced-security AI found potential problems Apr 7, 2026


Ultimate-Storm merged commit 6054a39 into main Apr 7, 2026
5 of 6 checks passed

Ultimate-Storm deleted the fix/ci-cleanup-permission-denied branch April 7, 2026 16:14

Ultimate-Storm mentioned this pull request Apr 7, 2026

Merge docker.sh scripts + enhance live monitor and sync #263

Closed

7 tasks

Ultimate-Storm merged 6 commits into main from fix/ci-cleanup-permission-denied

2 participants

