Merge docker.sh scripts + enhance live monitor and sync #263

Closed
Ultimate-Storm wants to merge 3 commits into main from feature/live-monitor-and-sync-enhancements

@Ultimate-Storm
Contributor

Summary

  • Merged docker.sh: Client startup kits now produce a single docker.sh instead of a wrapper docker.sh + docker_original.sh. All flags (--dummy_training, --preflight_check, --local_training, --start_client, --job, --model_name, etc.) work directly. Live sync for the MediSwarm Live Monitor is automatically started for --local_training and --start_client modes.
  • Enhanced live monitor (server_tools/app.py): Major overhaul — filter bar (site/mode/status/job), status inference (stale >5min, finished >1hr, heartbeat_final.json always wins), file download endpoint, job grouping for swarm runs, kit version column, training summary, TensorBoard metric charts via tbparse, enriched detail page with full file inventory and download buttons.
  • Fixed live_sync duplicate entries: When a new local training run starts, the old run's heartbeat is now finalized with heartbeat_final.json so it doesn't linger as "running" forever. Also fixed SCRATCHDIR not being exported before build_heartbeat.sh, causing empty heartbeat fields.
  • Added kit version to heartbeat: build_heartbeat.sh now extracts MEDISWARM_VERSION from the startup kit's docker.sh and includes it in heartbeat JSON.
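
The status inference rules from the live monitor bullet can be sketched as below. This is a minimal illustration, not the code in server_tools/app.py: the function name `infer_status`, its parameters, and the idea of passing the heartbeat age in seconds are assumptions. Only the two thresholds (5 minutes, 1 hour) and the precedence of heartbeat_final.json come from the summary.

```python
from typing import Optional

STALE_AFTER_S = 5 * 60       # "stale >5min" from the summary
FINISHED_AFTER_S = 60 * 60   # "finished >1hr" from the summary


def infer_status(reported: str, age_s: float,
                 final_status: Optional[str] = None) -> str:
    """Hypothetical helper: derive the status shown in the monitor.

    reported     -- status field of the latest heartbeat
    age_s        -- seconds since that heartbeat was written
    final_status -- status from heartbeat_final.json, if that file exists
    """
    # heartbeat_final.json always wins over any inferred state.
    if final_status is not None:
        return final_status
    # Only a run that still claims to be running can be downgraded.
    if reported == "running":
        if age_s > FINISHED_AFTER_S:
            return "finished"  # presumed finished
        if age_s > STALE_AFTER_S:
            return "stale"
    # Otherwise use the heartbeat status as-is.
    return reported
```

Checking the longer threshold first matters: a heartbeat older than one hour is also older than five minutes, so the order of the two tests decides which label it receives.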

Files Changed

| File | Change |
|---|---|
| docker_config/master_template.yml | Add live_sync functions + hooks into docker_cln_sh template |
| scripts/build/_injectLiveSyncIntoStartupKits.sh | Simplify: only copy helper files, no more wrapper/rename |
| server_tools/app.py | Major overhaul with filters, status inference, downloads, charts |
| kit_live_sync/build_heartbeat.sh | Add kit_version field |
| kit_live_sync/live_sync.sh | Fix SCRATCHDIR export, add duplicate run finalization |

Test plan

  • Built startup kits — single docker.sh generated (no docker_original.sh)
  • Dummy training on DL0 (RUMC_1) — completed successfully
  • Preflight check on DL0 — completed successfully
  • Preflight check with --log_dataset_details on DL0 — completed successfully
  • Local training on DL0 — 100 epochs, results synced to server via live_sync
  • Webviewer verified: filters, status inference, file downloads, training summary all working
  • Full 4-site swarm training test (in progress)

🤖 Generated with Claude Code

Ultimate-Storm and others added 2 commits April 7, 2026 14:12
Embed live_sync integration directly in the master_template.yml
docker_cln_sh template so each client startup kit produces a single
docker.sh with all flags.  _injectLiveSyncIntoStartupKits.sh now only
copies the helper files (sync.conf, build_heartbeat.sh, live_sync.sh)
instead of creating a wrapper that delegates to docker_original.sh.

Live sync auto-starts for --local_training (foreground, killed on exit)
and --start_client (nohup daemon).  All other modes are unchanged.
If live_sync.sh is not present the hooks are a graceful no-op.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ads, version tracking

- server_tools/app.py: Major overhaul of the MediSwarm Live Monitor webviewer
  - Add filter bar (site, mode, status, job_id) with default sort by newest
  - Add status inference (stale >5min, finished >1hr, heartbeat_final.json wins)
  - Add file download endpoint for all run artifacts
  - Add job grouping for swarm runs
  - Add kit version column from heartbeat data
  - Add training summary extraction (best val metrics, epoch count, FL rounds)
  - Add TensorBoard metric parsing and inline charts via tbparse
  - Add enriched detail page with full file inventory, checkpoints, models cards
  - Add stats bar with running/finished/stale/site counts
  - Add server-side file paths with download buttons
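
As a rough illustration of the filter bar behaviour listed above (filter by site, mode, status, and job_id, newest first), the following sketch filters a list of run records. The record layout is an assumption: the keys `site`, `mode`, `status`, `job_id`, and an ISO-8601 `last_update` string are hypothetical and may not match the heartbeat fields used in app.py.

```python
from typing import Dict, Iterable, List, Optional


def filter_runs(runs: Iterable[Dict[str, str]], *,
                site: Optional[str] = None,
                mode: Optional[str] = None,
                status: Optional[str] = None,
                job_id: Optional[str] = None) -> List[Dict[str, str]]:
    """Hypothetical helper: keep runs matching every filter that is set."""
    wanted = {"site": site, "mode": mode, "status": status, "job_id": job_id}
    # Empty or missing filters are ignored, as with an unset dropdown.
    active = {key: value for key, value in wanted.items() if value}
    kept = [run for run in runs
            if all(run.get(key) == value for key, value in active.items())]
    # Default sort: newest first (ISO-8601 strings sort chronologically).
    return sorted(kept, key=lambda run: run.get("last_update", ""),
                  reverse=True)
```

Calling `filter_runs(runs, site="DL0", status="running")` would return only the running DL0 runs, most recently updated first.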

- kit_live_sync/build_heartbeat.sh: Add kit_version field extracted from docker.sh
  MEDISWARM_VERSION baked in at build time
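
The actual extraction happens in the shell script build_heartbeat.sh. The idea can be sketched in Python as below, assuming docker.sh contains a plain shell assignment such as `MEDISWARM_VERSION="..."`. That line format, the function name `extract_kit_version`, and the regular expression are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed form: an optional "export", then MEDISWARM_VERSION=<value>,
# where the value may be wrapped in single or double quotes.
_VERSION_RE = re.compile(
    r'^\s*(?:export\s+)?MEDISWARM_VERSION=["\']?([^"\'\s]+)["\']?',
    re.MULTILINE,
)


def extract_kit_version(docker_sh_text: str) -> Optional[str]:
    """Hypothetical helper: return the version baked into docker.sh, if any."""
    match = _VERSION_RE.search(docker_sh_text)
    return match.group(1) if match else None
```

Returning `None` when no assignment is found lets the caller decide how to represent an unknown version in the heartbeat JSON.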

- kit_live_sync/live_sync.sh: Fix duplicate entries and empty heartbeat fields
  - Export SCRATCHDIR before calling build_heartbeat.sh so run_dir is populated
  - Track current run and finalize old runs with heartbeat_final.json when a new
    local training run starts (prevents stale "running" entries)
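
The finalization step described above lives in live_sync.sh. A Python sketch of the same idea follows: when the tracked run changes, write heartbeat_final.json into the old run's directory so the monitor stops treating it as running. The function name, the `heartbeat.json` file name, and the `status` and `finalized_at` fields are assumptions; only heartbeat_final.json is named in the commit message.

```python
import json
import time
from pathlib import Path
from typing import Optional


def finalize_previous_run(previous_run: Optional[Path], new_run: Path,
                          status: str = "finished") -> bool:
    """Hypothetical helper: mark the old run as final when a new one starts.

    Returns True if heartbeat_final.json was written, False otherwise.
    """
    # Nothing to do on the first run or while the same run continues.
    if previous_run is None or previous_run == new_run:
        return False
    final_path = previous_run / "heartbeat_final.json"
    # Never overwrite an existing final heartbeat.
    if final_path.exists():
        return False
    payload = {}
    heartbeat_path = previous_run / "heartbeat.json"  # assumed file name
    if heartbeat_path.exists():
        payload = json.loads(heartbeat_path.read_text())
    payload["status"] = status
    payload["finalized_at"] = int(time.time())
    final_path.write_text(json.dumps(payload))
    return True
```

Copying the last heartbeat into the final file keeps fields such as the site name available for the finalized entry.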

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@github-advanced-security github-advanced-security AI left a comment

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

…hart is present

Chart.js CDN script was only included inside the console metrics chart
block, so TensorBoard charts would try to use Chart() without the
library loaded when console metrics were absent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Ultimate-Storm Ultimate-Storm self-assigned this Apr 7, 2026
if not events:
    return {"scalars": []}

# Parse the directory containing events
media_type="application/octet-stream",
)



return FileResponse(
    path=str(target),
    filename=target.name,

if not target.exists() or not target.is_file():
    raise HTTPException(status_code=404, detail="File not found")


if has_final:
    try:
        final = json.loads((run_dir / "heartbeat_final.json").read_text())
- If status is "running" but heartbeat is >1 hour old -> "finished" (presumed)
- Otherwise use heartbeat status as-is
"""
has_final = (run_dir / "heartbeat_final.json").exists()
    cls = "badge-finished"
elif status in ("error", "failed"):
    cls = "badge-error"
elif status == "stale":
Comment on lines +1117 to +1121


# ---------------------------------------------------------------------------
# File download endpoint
# ---------------------------------------------------------------------------
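
The review snippets above show the download endpoint checking that the target exists and is a file before returning a FileResponse. One check that is often paired with this in a download endpoint is confirming that the resolved target stays inside the directory being served. The sketch below is not taken from the PR and does not claim to reflect what CodeQL reported; the function name `resolve_download` and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_download(base_dir: Path, relative: str) -> Optional[Path]:
    """Hypothetical helper: map a requested path to a file under base_dir.

    Returns the resolved file path, or None if the request points outside
    base_dir or does not name an existing regular file.
    """
    base = base_dir.resolve()
    # resolve() collapses ".." segments and follows symlinks.
    target = (base / relative).resolve()
    try:
        # Raises ValueError when target is not located under base.
        target.relative_to(base)
    except ValueError:
        return None
    if not target.exists() or not target.is_file():
        return None
    return target
```

An endpoint could translate a `None` result into the same 404 response shown in the snippet above, so callers cannot tell a missing file apart from a rejected path.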
@Ultimate-Storm
Contributor Author

Closing — these changes were already merged via PR #266 and subsequent commits to main.

2 participants