fix: CI cleanup fails with Permission denied on root-owned Docker files #266

Merged: Ultimate-Storm merged 6 commits into main from fix/ci-cleanup-permission-denied on Apr 7, 2026

@Ultimate-Storm
Contributor

Summary

  • The NVFlare server container runs as root and creates root-owned files inside the bind-mounted workspace/ directory
  • When cleanup_temporary_data() in runIntegrationTests.sh runs rm -rf as the non-root CI runner user, it fails with Permission denied on those files, causing the run_dummy_training_in_swarm step to fail
  • Added a _rm_rf helper that first tries normal rm -rf, then falls back to a disposable Alpine Docker container (running as root) to delete the directory — the same approach already used in pr-test.yaml's pre-checkout and post-cleanup steps
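A minimal sketch of what such a helper could look like, assuming Docker is available on the runner and the alpine image can be pulled. The function body, the /cleanup mount point, and the variable names are illustrative assumptions, not the code merged in this PR:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a _rm_rf helper; not the merged implementation.

_rm_rf() {
    local target="$1"

    # Nothing to do if the path is absent (and is not a dangling symlink).
    if [ ! -e "$target" ] && [ ! -L "$target" ]; then
        return 0
    fi

    # First attempt: plain rm as the current (non-root) user.
    if rm -rf -- "$target" 2>/dev/null; then
        return 0
    fi

    # Fallback: delete from inside a throwaway container, where the process
    # runs as root and can remove root-owned files in the bind mount.
    # The parent directory is mounted so the target itself can be removed.
    local parent name
    parent="$(cd "$(dirname "$target")" && pwd)" || return 1
    name="$(basename "$target")"
    docker run --rm -v "$parent:/cleanup" alpine rm -rf "/cleanup/$name"
}
```

With a helper along these lines, cleanup_temporary_data() could call _rm_rf on the workspace directory in place of a bare rm -rf. The Docker fallback only runs when the plain rm fails, so runs without root-owned files need no container.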

Test plan

  • Trigger the MediSwarm PR Validation workflow on this PR and verify the Run dummy training in swarm step passes without permission errors
  • Confirm the cleanup step successfully removes root-owned workspace files

🤖 Generated with Claude Code

Ultimate-Storm and others added 6 commits April 7, 2026 14:12
Embed live_sync integration directly in the master_template.yml
docker_cln_sh template so each client startup kit produces a single
docker.sh with all flags.  _injectLiveSyncIntoStartupKits.sh now only
copies the helper files (sync.conf, build_heartbeat.sh, live_sync.sh)
instead of creating a wrapper that delegates to docker_original.sh.

Live sync auto-starts for --local_training (foreground, killed on exit)
and --start_client (nohup daemon).  All other modes are unchanged.
If live_sync.sh is not present the hooks are a graceful no-op.
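The hook behaviour described above might be sketched as follows. The mode names come from this commit message. The function name start_live_sync, the LIVE_SYNC_PID variable, and the use of an EXIT trap are assumptions made for illustration and may differ from the actual template:

```shell
#!/usr/bin/env bash
# Hypothetical hook; not the code embedded in master_template.yml.

start_live_sync() {
    local mode="$1" kit_dir="$2"
    local script="$kit_dir/live_sync.sh"

    # Graceful no-op when the helper was not shipped with the kit.
    [ -f "$script" ] || return 0

    case "$mode" in
        --local_training)
            # Child of this shell; stopped when the shell exits.
            bash "$script" &
            LIVE_SYNC_PID=$!
            trap 'kill "$LIVE_SYNC_PID" 2>/dev/null || true' EXIT
            ;;
        --start_client)
            # Detached daemon that outlives the launching shell.
            nohup bash "$script" >/dev/null 2>&1 &
            ;;
        *)
            # All other modes leave live sync untouched.
            ;;
    esac
}
```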

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ads, version tracking

- server_tools/app.py: Major overhaul of the MediSwarm Live Monitor webviewer
  - Add filter bar (site, mode, status, job_id) with default sort by newest
  - Add status inference (stale >5min, finished >1hr, heartbeat_final.json wins)
  - Add file download endpoint for all run artifacts
  - Add job grouping for swarm runs
  - Add kit version column from heartbeat data
  - Add training summary extraction (best val metrics, epoch count, FL rounds)
  - Add TensorBoard metric parsing and inline charts via tbparse
  - Add enriched detail page with full file inventory, checkpoints, models cards
  - Add stats bar with running/finished/stale/site counts
  - Add server-side file paths with download buttons

- kit_live_sync/build_heartbeat.sh: Add kit_version field extracted from docker.sh
  MEDISWARM_VERSION baked in at build time

- kit_live_sync/live_sync.sh: Fix duplicate entries and empty heartbeat fields
  - Export SCRATCHDIR before calling build_heartbeat.sh so run_dir is populated
  - Track current run and finalize old runs with heartbeat_final.json when a new
    local training run starts (prevents stale "running" entries)
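The SCRATCHDIR fix rests on ordinary shell semantics: a variable that is set but not exported is invisible to child processes. The sketch below illustrates this with a stand-in for build_heartbeat.sh; the read_run_dir function and the /scratch/run_001 path are hypothetical, and the real script's logic is not shown in this PR:

```shell
#!/usr/bin/env bash
# Stand-in for build_heartbeat.sh: a child process that reads SCRATCHDIR
# from its environment.
read_run_dir() {
    sh -c 'printf "run_dir=%s\n" "${SCRATCHDIR:-}"'
}

unset SCRATCHDIR
SCRATCHDIR="/scratch/run_001"     # plain shell variable, not exported
not_exported="$(read_run_dir)"    # the child does not see it

export SCRATCHDIR                 # now part of the child's environment
exported="$(read_run_dir)"

echo "$not_exported"              # prints: run_dir=
echo "$exported"                  # prints: run_dir=/scratch/run_001
```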

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hart is present

Chart.js CDN script was only included inside the console metrics chart
block, so TensorBoard charts would try to use Chart() without the
library loaded when console metrics were absent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Increase learn_task_ack_timeout from 120 s to 600 s to handle streaming
552 MB of model weights over the Tailscale VPN. Also increase
final_result_ack_timeout, start_task_timeout, configure_task_timeout, and
progress_timeout. Add tensor_min_download_timeout=600 to fix the download
timeout inconsistency warning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PTClientAPILauncherExecutor in NVFlare 2.7.2 does not support
tensor_min_download_timeout as a constructor parameter. This caused
all clients to fail job deployment with:
  TypeError: PTClientAPILauncherExecutor.__init__() got an unexpected
  keyword argument 'tensor_min_download_timeout'

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The NVFlare server container runs as root and creates root-owned files
in the bind-mounted workspace directory. The cleanup_temporary_data
function fails with "Permission denied" when the non-root runner tries
to rm -rf these files, causing the integration test step to fail.

Add a _rm_rf helper that falls back to a disposable Alpine container
(running as root) to delete the directory when normal rm fails — the
same approach already used in the pr-test.yaml workflow steps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@github-advanced-security (AI) left a comment

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

if not events:
    return {"scalars": []}

# Parse the directory containing events

    media_type="application/octet-stream",
)



return FileResponse(
    path=str(target),
    filename=target.name,

if not target.exists() or not target.is_file():
    raise HTTPException(status_code=404, detail="File not found")


if has_final:
    try:
        final = json.loads((run_dir / "heartbeat_final.json").read_text())

- If status is "running" but heartbeat is >1 hour old -> "finished" (presumed)
- Otherwise use heartbeat status as-is
"""
has_final = (run_dir / "heartbeat_final.json").exists()

    cls = "badge-finished"
elif status in ("error", "failed"):
    cls = "badge-error"
elif status == "stale":
Comment on lines +1117 to +1121


# ---------------------------------------------------------------------------
# File download endpoint
# ---------------------------------------------------------------------------
@Ultimate-Storm merged commit 6054a39 into main on Apr 7, 2026
5 of 6 checks passed
@Ultimate-Storm deleted the fix/ci-cleanup-permission-denied branch on April 7, 2026 16:14

2 participants