[Epic] Hot-Reload Config #166

@raykao

Description

Overview

Hot-Reload expands the set of copilot-bridge configuration changes that take effect without a full service restart. Today, a restart tears down all active sessions across all channels - a disruptive operation in multi-tenant deployments that forces unrelated users to lose in-flight work. copilot-bridge already has a capable ConfigWatcher that auto-reloads many settings via fs.watch; this feature builds on that foundation in phases to eliminate the remaining restart-required scenarios.

Spec Reference

https://github.com/raykao/dark-factory/tree/speckit/hot-reload/specs/hot-reload

Phases and Tasks

Phase 1 - Setup

  • [T001] Create empty test file stubs to unblock parallel test authoring: tests/unit/config.test.ts, tests/unit/session-manager.test.ts, tests/unit/adapters/mattermost.test.ts, tests/unit/adapters/slack.test.ts, tests/integration/sighup.test.ts, tests/integration/reload-bot.test.ts, tests/integration/hooks-watcher.test.ts

Phase 2 - Foundational (Blocking Prerequisites)

  • [T002] Extend ConfigDiff interface with botReloadNeeded: Map<string, string[]> and update diffConfigs() to reclassify adapter-bound fields from restartNeeded to botReloadNeeded; update admin-channel notification accordingly (src/config.ts)
  • [T003] Extract bot adapter construction from main() into a new async factory createAdapterForBot() in src/index.ts - pure refactor, zero behavior change (src/index.ts)
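The T002 reclassification can be sketched as follows. This is a minimal illustration, not the real src/config.ts: the `adapterBoundFields` set and the shape of the bot config objects are assumptions; only the `botReloadNeeded: Map<string, string[]>` field comes from the task list.

```typescript
// Hypothetical sketch of the ConfigDiff extension from T002.
interface ConfigDiff {
  restartNeeded: string[];
  botReloadNeeded: Map<string, string[]>; // bot name -> changed adapter-bound fields
}

// Illustrative set of fields that no longer force a full restart (assumption).
const adapterBoundFields = new Set(["token", "appToken", "platformUrl"]);

function diffBotConfigs(
  oldBots: Record<string, Record<string, string>>,
  newBots: Record<string, Record<string, string>>
): ConfigDiff {
  const diff: ConfigDiff = { restartNeeded: [], botReloadNeeded: new Map() };
  for (const [bot, newCfg] of Object.entries(newBots)) {
    const oldCfg = oldBots[bot] ?? {};
    for (const [field, value] of Object.entries(newCfg)) {
      if (oldCfg[field] === value) continue;
      if (adapterBoundFields.has(field)) {
        // Reclassified: route to per-bot reload instead of full restart.
        const fields = diff.botReloadNeeded.get(bot) ?? [];
        fields.push(field);
        diff.botReloadNeeded.set(bot, fields);
      } else {
        diff.restartNeeded.push(`${bot}.${field}`);
      }
    }
  }
  return diff;
}
```

The key property is that a field appears in exactly one bucket, so the admin-channel notification can report per-bot reloads and restart-required changes separately.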

Phase 3 - User Story 1: Per-Bot Adapter Restart (Priority P1)

  • [T004] Write failing unit tests for diffConfigs() returning botReloadNeeded for all five reclassified cases (tests/unit/config.test.ts)
  • [T005] Write failing integration tests for /reload bot covering success, failure+restore, restore-failure, concurrent reload, and non-admin user paths (tests/integration/reload-bot.test.ts)
  • [T006] Add reloadingBots: Set<string> concurrency guard and circular message ring buffer (capacity 100) to StreamingHandler (src/index.ts, src/core/streaming-handler.ts)
  • [T007] Implement first half of /reload bot <name> command handler: auth check, bot lookup, duplicate disambiguation, concurrency guard, pre-restart channel notification (src/index.ts)
  • [T008] Implement second half of /reload bot <name>: drain, disconnect, createAdapterForBot(), success/failure/restore paths, structured logging, finally cleanup (src/index.ts)
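The concurrency-guard skeleton behind T006-T008 might look like the sketch below. Only `reloadingBots: Set<string>` and the "Reload already in progress." reply come from this issue; the `reloadAdapter` callback stands in for the whole drain/disconnect/createAdapterForBot sequence.

```typescript
// Per-bot lock from T006: one /reload bot in flight per bot name.
const reloadingBots = new Set<string>();

async function reloadBot(
  name: string,
  reloadAdapter: (name: string) => Promise<void> // stand-in for drain + teardown + rebuild
): Promise<string> {
  if (reloadingBots.has(name)) {
    return "Reload already in progress.";
  }
  reloadingBots.add(name);
  try {
    await reloadAdapter(name);
    return `Bot ${name} reloaded.`;
  } finally {
    reloadingBots.delete(name); // finally-cleanup runs on success and failure alike
  }
}
```

Because the guard is checked and set synchronously before the first await, a second caller arriving mid-reload is rejected deterministically.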

Phase 4 - User Story 2: Reload Hooks Without Session Loss (Priority P2)

  • [T009] Write failing unit tests for reloadHooks(channelId) covering cache invalidation, success result, malformed JSON error with retention, no-workspace error, and SDK fallback path (tests/unit/session-manager.test.ts)
  • [T010] Implement async reloadHooks(channelId) on SessionManager: resolve workspace, invalidate cache, reload from disk, SDK RPC or session-reload fallback, error handling with line/col detail (src/core/session-manager.ts)
  • [T011] Route /reload hooks to sessionManager.reloadHooks() in command handler; reply with count confirmation or error; structured log at info level (src/index.ts)
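A condensed sketch of the T010 reload-with-retention behavior is below. It is not the real SessionManager API: workspace resolution is collapsed into an explicit `workspaceDir` parameter, the `HookResult` type and `hooksCache` map are assumptions, and the SDK RPC / session-reload fallback is omitted.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Illustrative result type (assumption).
type HookResult =
  | { ok: true; count: number }
  | { ok: false; error: string };

const hooksCache = new Map<string, unknown[]>(); // channelId -> parsed hooks

function reloadHooks(channelId: string, workspaceDir: string): HookResult {
  const file = path.join(workspaceDir, "hooks.json");
  let raw: string;
  try {
    raw = fs.readFileSync(file, "utf8");
  } catch {
    return { ok: false, error: "no hooks.json found in workspace" };
  }
  try {
    const hooks = JSON.parse(raw) as unknown[];
    hooksCache.set(channelId, hooks); // replace cache only after a successful parse
    return { ok: true, count: hooks.length };
  } catch (e) {
    // Malformed JSON: previously cached hooks are retained.
    return { ok: false, error: `hooks.json parse error: ${(e as Error).message}` };
  }
}
```

The ordering matters: the cache is written only after parsing succeeds, which is what gives the "malformed JSON error with retention" behavior T009 tests for.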

Phase 5 - User Story 3: SIGHUP Config Reload (Priority P2)

  • [T012] Write failing integration tests for SIGHUP: handler called once within 100 ms, botReloadNeeded logged without auto-restart, Windows silent skip (tests/integration/sighup.test.ts)
  • [T013] Register SIGHUP handler in main() guarded by process.platform !== 'win32'; triggers reloadConfig() and setLogLevel(); logs bot-reload advisory when botReloadNeeded is non-empty; does NOT auto-restart adapters (src/index.ts)
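The platform-guarded registration in T013 can be sketched like this; `reloadConfig` and the log callback are stand-ins for the real functions, and only the `win32` guard and the advisory-without-auto-restart behavior come from the task.

```typescript
// Sketch of T013: SIGHUP-triggered reload, POSIX-only, advisory-only.
function registerSighupHandler(
  reloadConfig: () => { botReloadNeeded: Map<string, string[]> },
  log: (msg: string) => void
): boolean {
  if (process.platform === "win32") {
    return false; // SIGHUP is POSIX-only; skip silently on Windows
  }
  process.on("SIGHUP", () => {
    const diff = reloadConfig();
    if (diff.botReloadNeeded.size > 0) {
      // Advisory only: adapters are never auto-restarted on SIGHUP.
      log(`bot reload needed for: ${[...diff.botReloadNeeded.keys()].join(", ")}`);
    }
  });
  return true;
}
```

An operator can then trigger the same reload path as a file-watch event with `kill -HUP <pid>`, while /reload bot remains a manual decision.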

Phase 6 - Seamless Token Rotation via reconnect()

  • [T014] Write failing unit tests for MattermostAdapter.reconnect(): atomic swap sequence, timeout path preserving old connection, connection-error path (tests/unit/adapters/mattermost.test.ts)
  • [T015] Write failing unit tests for SlackAdapter.reconnect(): handler re-registration, timeout path stopping new app only, old app never stopped on failure (tests/unit/adapters/slack.test.ts)
  • [T016] Add reconnect(newToken, options?) to the ChannelAdapter interface (src/types.ts)
  • [T017] Implement MattermostAdapter.reconnect() using 7-step atomic swap: new client, new WebSocket, subscribe, await confirmation (10 s timeout), atomic swap, close old WebSocket, update refs (src/adapters/mattermost.ts)
  • [T018] Implement SlackAdapter.reconnect() using 7-step atomic swap: new App, re-register handlers, start, await connected (10 s timeout), atomic swap, stop old app, update refs (src/adapters/slack.ts)
  • [T019] Upgrade /reload bot handler to attempt reconnect() before full teardown for credential-only changes; fall through to Phase 3 teardown path on ReconnectTimeoutError (src/index.ts)
  • [T020] Add telemetryReloadNeeded: boolean to ConfigDiff; move telemetry.* from restartNeeded; wire handler to flush and reinitialize OTel pipeline on change (src/config.ts, src/index.ts)

Phase 7 - User Story 4: Auto-Reload Hooks on File Change (Priority P3)

  • [T021] Write failing integration tests for hooks file watcher: reload within 2000 ms, 500 ms debounce coalescing, malformed JSON retention, no watcher leak after workspace deregistration (tests/integration/hooks-watcher.test.ts)
  • [T022] Implement watchHooksFile() and unwatchHooksFile() using fs.watch on parent directory, hookWatchers: Map for dedup, 500 ms debounce, per-channel reloadHooks() calls with error containment (src/index.ts)
  • [T023] Call watchHooksFile() on session creation for channels with workingDirectory and unwatchHooksFile() on workspace deregistration to prevent watcher leaks (src/index.ts or src/core/session-manager.ts)

Phase 8 - Polish and Cross-Cutting Concerns

  • [T024] Update inline JSDoc on ConfigDiff documenting botReloadNeeded semantics and telemetryReloadNeeded; add reloadDrainTimeoutMs to config schema docs (src/config.ts)
  • [T025] Add operator guide entries: SIGHUP is POSIX-only, drain timeout is configurable via reloadDrainTimeoutMs (default 10000 ms), fs.watch known-unreliable on Docker bind mounts (README.md or docs/)
  • [T026] Run full Jest test suite (npm test) and confirm all new tests pass and no regressions in existing startup, config-reload, or adapter tests

Acceptance Criteria

Phase 1 (SC-001 to SC-005)

  • SC-001: An operator can rotate a bot token or update a platform URL and restore service for only that bot's channels within 60 seconds - without restarting the bridge or disrupting any other bot's channels.
  • SC-002: Users can reload hooks.json changes with a single /reload hooks command without losing their active Copilot session or any session context.
  • SC-003: A SIGHUP signal triggers a config reload equivalent to a file-watch event within 1 second of signal receipt; all sessions remain active.
  • SC-004: After Phase 1, the bridge log MUST NOT show "restartNeeded" for bot token, bot app token, platform URL, or bot add/remove changes. These MUST appear as "botReloadNeeded" instead.
  • SC-005: Non-admin users issuing /reload bot receive a clear permission-denied message 100% of the time, with no unintended side effects.

Phase 2 (SC-006 to SC-008)

  • SC-006: Token rotation via reconnect() completes with zero message loss, verified by sending 100 messages at 1/second during rotation and confirming all 100 are received.
  • SC-007: Reconnect latency from reconnect() invocation to confirmed new connection MUST be under 30 seconds on a normally functioning platform instance.
  • SC-008: telemetry.* config changes take effect without a process restart; telemetry continues to export spans within 5 seconds of the reload.

Phase 3 (SC-009)

  • SC-009: Changes to hooks.json are detected and applied within 2 seconds of the file being saved, without any user command.

Notes

Key Design Decisions

  • /reload bot vs full restart: Per-bot adapter restart (Phase 1) is the primary mitigation for the most common restart causes. The reconnect() path (Phase 2) upgrades credential-only changes to zero-disruption with an automatic fallback to the Phase 1 teardown path on timeout.
  • Atomic connection swap: The Phase 2 reconnect sequence briefly holds both old and new connections open in parallel. The old connection is closed ONLY after the new connection is confirmed active and the atomic swap commits. If confirmation times out (10 s), the old connection is preserved and ReconnectTimeoutError is thrown.
  • diffConfigs() evolution: The return type gains botReloadNeeded: Map<string, string[]> in Phase 1 and telemetryReloadNeeded: boolean in Phase 2. Fields in botReloadNeeded are removed from restartNeeded. This is additive and backward-compatible (NFR-004).
  • /reload hooks fallback: If the Copilot SDK does not expose a hooks RPC, reloadHooks() falls back to a session-level reload that preserves the SDK connection. Full /new is never required.
  • SIGHUP: Registered only on POSIX platforms (process.platform !== 'win32'). Does not auto-restart adapters even when botReloadNeeded is non-empty - it logs an advisory and leaves manual /reload bot to the operator.
  • Concurrency guard: reloadingBots: Set<string> prevents concurrent /reload bot calls for the same bot. The second caller gets "Reload already in progress."
  • Drain timeout: Configurable via reloadDrainTimeoutMs in config.json (default 10,000 ms). In-flight streams are aborted after timeout with a truncated-response notice.
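For the message buffering mentioned in T006, a minimal ring buffer is enough; capacity 100 comes from the task, while the `drain()` API and drop-oldest policy are assumptions about how StreamingHandler would consume it.

```typescript
// Minimal bounded buffer sketch for messages arriving during a reload.
class RingBuffer<T> {
  private buf: T[] = [];
  constructor(private capacity = 100) {}

  push(item: T): void {
    this.buf.push(item);
    if (this.buf.length > this.capacity) this.buf.shift(); // drop the oldest entry
  }

  drain(): T[] {
    const out = this.buf; // hand back everything buffered so far
    this.buf = [];
    return out;
  }
}
```

On a successful reload the handler would drain the buffer into the new adapter; the fixed capacity bounds memory if a reload stalls.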

Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| @mattermost/client singleton prevents parallel connections | Medium | Phase 2 blocked | Investigate source; fall back to a stop/start cycle if needed |
| @slack/bolt stop/start loses event handler registrations | Medium | Phase 2 Slack reconnect becomes complex | Test empirically; re-register handlers in the reconnect sequence |
| Copilot SDK does not expose a hooks RPC | High | /reload hooks requires a full session reload | Implement the session-reload path as primary; document the disruption level |
| fs.watch unreliable on container volumes (Phase 3) | Medium | Auto-reload fails silently in Docker | Add a polling fallback or use chokidar for the Phase 3 watcher |
| Concurrent /reload bot commands causing a race condition | Low | botAdapters Map corrupted | Gate with a per-bot lock (the reloadingBots Set) |

Open Questions

  1. Drain timeout: Should reloadDrainTimeoutMs be configurable per-deployment (recommended) or hardcoded at 10,000 ms?
  2. Multi-bot atomicity: If a platform URL change affects all bots on a platform, should /reload bot be run per-bot, or should a /reload platform <name> command be added? (Recommend: per-bot for now.)
  3. Phase 3 notifications: Should file-watcher-triggered hook reloads notify users in-channel, or reload silently with a log entry? (Recommend: silent with optional verbose config flag.)
  4. Telemetry flush guarantee: TracerProvider.shutdown() is async and may drop in-flight spans. Is an acceptable span loss window during telemetry reload tolerable? (Recommend: yes, document in operator guide.)
