[Epic] Hot-Reload Config #166
Overview
Hot-Reload expands the set of copilot-bridge configuration changes that take effect without a full service restart. Today, a restart tears down all active sessions across all channels - a disruptive, multi-tenant operation that forces unrelated users to lose in-flight work. copilot-bridge already has a sophisticated ConfigWatcher that auto-reloads many settings via fs.watch; this feature builds on that foundation in three phases to eliminate the remaining restart-required scenarios.
Spec Reference
https://github.com/raykao/dark-factory/tree/speckit/hot-reload/specs/hot-reload
Phases and Tasks
Phase 1 - Setup
- [T001] Create empty test file stubs to unblock parallel test authoring: `tests/unit/config.test.ts`, `tests/unit/session-manager.test.ts`, `tests/unit/adapters/mattermost.test.ts`, `tests/unit/adapters/slack.test.ts`, `tests/integration/sighup.test.ts`, `tests/integration/reload-bot.test.ts`, `tests/integration/hooks-watcher.test.ts`
Phase 2 - Foundational (Blocking Prerequisites)
- [T002] Extend the `ConfigDiff` interface with `botReloadNeeded: Map<string, string[]>` and update `diffConfigs()` to reclassify adapter-bound fields from `restartNeeded` to `botReloadNeeded`; update the admin-channel notification accordingly (`src/config.ts`)
- [T003] Extract bot adapter construction from `main()` into a new async factory `createAdapterForBot()` in `src/index.ts` - pure refactor, zero behavior change (`src/index.ts`)
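As a rough illustration of the T002 reclassification, the shape could look like the sketch below. `ConfigDiff`, `restartNeeded`, and `botReloadNeeded` come from this spec; `ADAPTER_BOUND_FIELDS`, `classifyBotField`, and the field names are hypothetical stand-ins for the real schema in `src/config.ts`:

```typescript
// Hypothetical sketch of the T002 reclassification. ADAPTER_BOUND_FIELDS,
// classifyBotField, and the concrete field names are illustrative only.
interface ConfigDiff {
  restartNeeded: string[];
  // New in T002: bot name -> changed fields that a per-bot reload can apply.
  botReloadNeeded: Map<string, string[]>;
}

// Adapter-bound fields that move from restartNeeded to botReloadNeeded.
const ADAPTER_BOUND_FIELDS = new Set(["token", "appToken", "platformUrl"]);

function classifyBotField(botName: string, field: string, diff: ConfigDiff): void {
  if (ADAPTER_BOUND_FIELDS.has(field)) {
    const fields = diff.botReloadNeeded.get(botName) ?? [];
    fields.push(field);
    diff.botReloadNeeded.set(botName, fields);
  } else {
    // Everything else still requires a full restart.
    diff.restartNeeded.push(`bots.${botName}.${field}`);
  }
}

const diff: ConfigDiff = { restartNeeded: [], botReloadNeeded: new Map() };
classifyBotField("support-bot", "token", diff);
classifyBotField("support-bot", "someOtherField", diff);
```

The key property (per NFR-004) is that this change is additive: existing consumers of `restartNeeded` keep working, and adapter-bound fields simply stop appearing there.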
Phase 3 - User Story 1: Per-Bot Adapter Restart (Priority P1)
- [T004] Write failing unit tests for `diffConfigs()` returning `botReloadNeeded` for all five reclassified cases (`tests/unit/config.test.ts`)
- [T005] Write failing integration tests for `/reload bot` covering success, failure+restore, restore-failure, concurrent reload, and non-admin user paths (`tests/integration/reload-bot.test.ts`)
- [T006] Add a `reloadingBots: Set<string>` concurrency guard and a circular message ring buffer (capacity 100) to `StreamingHandler` (`src/index.ts`, `src/core/streaming-handler.ts`)
- [T007] Implement the first half of the `/reload bot <name>` command handler: auth check, bot lookup, duplicate disambiguation, concurrency guard, pre-restart channel notification (`src/index.ts`)
- [T008] Implement the second half of `/reload bot <name>`: drain, disconnect, `createAdapterForBot()`, success/failure/restore paths, structured logging, `finally` cleanup (`src/index.ts`)
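A minimal sketch of the guard-and-cleanup shape from T006-T008. `reloadingBots` and the "Reload already in progress." reply come from this spec; `reloadBot` and its injected `doReload` callback are illustrative stand-ins for the real handler in `src/index.ts`:

```typescript
// Sketch only: the real handler also does auth, lookup, drain, and restore.
const reloadingBots = new Set<string>();

async function reloadBot(name: string, doReload: () => Promise<void>): Promise<string> {
  if (reloadingBots.has(name)) {
    // Second concurrent caller for the same bot is rejected immediately.
    return "Reload already in progress.";
  }
  reloadingBots.add(name);
  try {
    await doReload();
    return `Bot ${name} reloaded.`;
  } finally {
    // finally guarantees the guard clears even when the reload fails (T008).
    reloadingBots.delete(name);
  }
}
```

Because the guard is checked and set synchronously before the first `await`, two concurrent `/reload bot` calls for the same name cannot interleave past it.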
Phase 4 - User Story 2: Reload Hooks Without Session Loss (Priority P2)
- [T009] Write failing unit tests for `reloadHooks(channelId)` covering cache invalidation, success result, malformed JSON error with retention, no-workspace error, and the SDK fallback path (`tests/unit/session-manager.test.ts`)
- [T010] Implement `async reloadHooks(channelId)` on `SessionManager`: resolve workspace, invalidate cache, reload from disk, SDK RPC or session-reload fallback, error handling with line/col detail (`src/core/session-manager.ts`)
- [T011] Route `/reload hooks` to `sessionManager.reloadHooks()` in the command handler; reply with a count confirmation or an error; structured log at `info` level (`src/index.ts`)
Phase 5 - User Story 3: SIGHUP Config Reload (Priority P2)
- [T012] Write failing integration tests for SIGHUP: handler called once within 100 ms, `botReloadNeeded` logged without auto-restart, Windows silent skip (`tests/integration/sighup.test.ts`)
- [T013] Register a SIGHUP handler in `main()` guarded by `process.platform !== 'win32'`; it triggers `reloadConfig()` and `setLogLevel()`, logs a bot-reload advisory when `botReloadNeeded` is non-empty, and does NOT auto-restart adapters (`src/index.ts`)
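T013's registration could be sketched as follows; `reloadConfig` and the logger are injected here as stand-ins for the real functions in `src/index.ts` and `src/config.ts`:

```typescript
// POSIX-only SIGHUP registration sketch. Returns false on Windows (silent
// skip), true once the handler is installed.
function registerSighupHandler(
  reloadConfig: () => { botReloadNeeded: Map<string, string[]> },
  log: (msg: string) => void,
): boolean {
  if (process.platform === "win32") {
    return false; // SIGHUP is not delivered on Windows; skip silently (T012).
  }
  process.on("SIGHUP", () => {
    const diff = reloadConfig();
    if (diff.botReloadNeeded.size > 0) {
      // Advisory only: never auto-restart adapters; the operator decides
      // when to run /reload bot.
      log(`botReloadNeeded for: ${[...diff.botReloadNeeded.keys()].join(", ")}`);
    }
  });
  return true;
}
```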
Phase 6 - Seamless Token Rotation via reconnect()
- [T014] Write failing unit tests for `MattermostAdapter.reconnect()`: atomic swap sequence, timeout path preserving the old connection, connection-error path (`tests/unit/adapters/mattermost.test.ts`)
- [T015] Write failing unit tests for `SlackAdapter.reconnect()`: handler re-registration, timeout path stopping the new app only, old app never stopped on failure (`tests/unit/adapters/slack.test.ts`)
- [T016] Add `reconnect(newToken, options?)` to the `ChannelAdapter` interface (`src/types.ts`)
- [T017] Implement `MattermostAdapter.reconnect()` using a 7-step atomic swap: new client, new WebSocket, subscribe, await confirmation (10 s timeout), atomic swap, close old WebSocket, update refs (`src/adapters/mattermost.ts`)
- [T018] Implement `SlackAdapter.reconnect()` using a 7-step atomic swap: new App, re-register handlers, start, await connected (10 s timeout), atomic swap, stop old app, update refs (`src/adapters/slack.ts`)
- [T019] Upgrade the `/reload bot` handler to attempt `reconnect()` before full teardown for credential-only changes; fall through to the Phase 3 teardown path on `ReconnectTimeoutError` (`src/index.ts`)
- [T020] Add `telemetryReloadNeeded: boolean` to `ConfigDiff`; move `telemetry.*` out of `restartNeeded`; wire the handler to flush and reinitialize the OTel pipeline on change (`src/config.ts`, `src/index.ts`)
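The atomic-swap-with-timeout pattern shared by T017/T018 can be sketched platform-agnostically. `ReconnectTimeoutError` and the 10 s default come from the spec; `Connection` and `openNew` are illustrative abstractions over the Mattermost WebSocket / Slack App objects:

```typescript
// Sketch: open the new connection first, confirm it within a timeout, and
// only then close the old one. On timeout the old connection is untouched.
interface Connection {
  connect(): Promise<void>; // resolves once the connection is confirmed active
  close(): Promise<void>;
}

class ReconnectTimeoutError extends Error {}

async function reconnect(
  current: Connection,
  openNew: () => Connection,
  timeoutMs = 10_000,
): Promise<Connection> {
  const next = openNew(); // old and new connections briefly coexist
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new ReconnectTimeoutError("confirmation timed out")),
      timeoutMs,
    );
  });
  try {
    await Promise.race([next.connect(), timeout]);
  } catch (err) {
    // Timeout or connection error: discard the NEW connection and preserve
    // the old one, so the caller can fall back to the teardown path (T019).
    await next.close().catch(() => {});
    throw err;
  } finally {
    clearTimeout(timer);
  }
  await current.close(); // the swap commits only after the new side is live
  return next;
}
```

This is why SC-006's zero-message-loss claim is plausible: there is no instant at which neither connection is active.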
Phase 7 - User Story 4: Auto-Reload Hooks on File Change (Priority P3)
- [T021] Write failing integration tests for the hooks file watcher: reload within 2000 ms, 500 ms debounce coalescing, malformed JSON retention, no watcher leak after workspace deregistration (`tests/integration/hooks-watcher.test.ts`)
- [T022] Implement `watchHooksFile()` and `unwatchHooksFile()` using `fs.watch` on the parent directory, a `hookWatchers: Map` for dedup, a 500 ms debounce, and per-channel `reloadHooks()` calls with error containment (`src/index.ts`)
- [T023] Call `watchHooksFile()` on session creation for channels with a `workingDirectory`, and `unwatchHooksFile()` on workspace deregistration to prevent watcher leaks (`src/index.ts` or `src/core/session-manager.ts`)
Phase 8 - Polish and Cross-Cutting Concerns
- [T024] Update inline JSDoc on `ConfigDiff` documenting `botReloadNeeded` semantics and `telemetryReloadNeeded`; add `reloadDrainTimeoutMs` to the config schema docs (`src/config.ts`)
- [T025] Add operator guide entries: SIGHUP is POSIX-only, drain timeout is configurable via `reloadDrainTimeoutMs` (default 10000 ms), `fs.watch` is known-unreliable on Docker bind mounts (`README.md` or `docs/`)
- [T026] Run the full Jest test suite (`npm test`) and confirm all new tests pass with no regressions in existing startup, config-reload, or adapter tests
Acceptance Criteria
Phase 1 (SC-001 to SC-005)
- SC-001: An operator can rotate a bot token or update a platform URL and restore service for only that bot's channels within 60 seconds - without restarting the bridge or disrupting any other bot's channels.
- SC-002: Users can reload `hooks.json` changes with a single `/reload hooks` command without losing their active Copilot session or any session context.
- SC-003: A `SIGHUP` signal triggers a config reload equivalent to a file-watch event within 1 second of signal receipt; all sessions remain active.
- SC-004: After Phase 1, the bridge log MUST NOT show `"restartNeeded"` for bot token, bot app token, platform URL, or bot add/remove changes. These MUST appear as `"botReloadNeeded"` instead.
- SC-005: Non-admin users issuing `/reload bot` receive a clear permission-denied message 100% of the time, with no unintended side effects.
Phase 2 (SC-006 to SC-008)
- SC-006: Token rotation via `reconnect()` completes with zero message loss, verified by sending 100 messages at 1/second during rotation and confirming all 100 are received.
- SC-007: Reconnect latency from `reconnect()` invocation to confirmed new connection MUST be under 30 seconds on a normally functioning platform instance.
- SC-008: `telemetry.*` config changes take effect without a process restart; telemetry continues to export spans within 5 seconds of the reload.
Phase 3 (SC-009)
- SC-009: Changes to `hooks.json` are detected and applied within 2 seconds of the file being saved, without any user command.
Notes
Key Design Decisions
- `/reload bot` vs full restart: Per-bot adapter restart (Phase 1) is the primary mitigation for the most common restart causes. The `reconnect()` path (Phase 2) upgrades credential-only changes to zero-disruption, with an automatic fallback to the Phase 1 teardown path on timeout.
- Atomic connection swap: The Phase 2 reconnect sequence briefly holds both old and new connections open in parallel. The old connection is closed ONLY after the new connection is confirmed active and the atomic swap commits. If confirmation times out (10 s), the old connection is preserved and `ReconnectTimeoutError` is thrown.
- `diffConfigs()` evolution: The return type gains `botReloadNeeded: Map<string, string[]>` in Phase 1 and `telemetryReloadNeeded: boolean` in Phase 2. Fields in `botReloadNeeded` are removed from `restartNeeded`. This is additive and backward-compatible (NFR-004).
- `/reload hooks` fallback: If the Copilot SDK does not expose a hooks RPC, `reloadHooks()` falls back to a session-level reload that preserves the SDK connection. A full `/new` is never required.
- SIGHUP: Registered only on POSIX platforms (`process.platform !== 'win32'`). Does not auto-restart adapters even when `botReloadNeeded` is non-empty - it logs an advisory and leaves a manual `/reload bot` to the operator.
- Concurrency guard: `reloadingBots: Set<string>` prevents concurrent `/reload bot` calls for the same bot. The second caller gets "Reload already in progress."
- Drain timeout: Configurable via `reloadDrainTimeoutMs` in `config.json` (default 10,000 ms). In-flight streams are aborted after timeout with a truncated-response notice.
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| `@mattermost/client` singleton prevents parallel connections | Medium | Phase 2 blocked | Investigate source; fall back to a stop/start cycle if needed |
| `@slack/bolt` stop/start loses event handler registrations | Medium | Phase 2 Slack reconnect complex | Test empirically; re-register handlers in the reconnect sequence |
| Copilot SDK does not expose a hooks RPC | High | `/reload hooks` requires full session reload | Implement the session reload path as primary; document the disruption level |
| `fs.watch` unreliable on container volumes (Phase 3) | Medium | Auto-reload fails silently in Docker | Add a polling fallback or use chokidar for the Phase 3 watcher |
| Concurrent `/reload bot` commands causing a race condition | Low | `botAdapters` Map corrupted | Gate with a per-bot lock (`reloadingBots` Set) |
Open Questions
- Drain timeout: Should `reloadDrainTimeoutMs` be configurable per-deployment (recommended) or hardcoded at 10,000 ms?
- Multi-bot atomicity: If a platform URL change affects all bots on a platform, should `/reload bot` be run per-bot, or should a `/reload platform <name>` command be added? (Recommend: per-bot for now.)
- Phase 3 notifications: Should file-watcher-triggered hook reloads notify users in-channel, or reload silently with a log entry? (Recommend: silent, with an optional verbose config flag.)
- Telemetry flush guarantee: `TracerProvider.shutdown()` is async and may drop in-flight spans. Is an acceptable span-loss window during telemetry reload tolerable? (Recommend: yes; document in the operator guide.)