[Epic] Hot-Reload Config #166

@raykao

Description

Overview

Hot-Reload expands the set of copilot-bridge configuration changes that take effect without a full service restart. Today, a restart tears down all active sessions across all channels - a disruptive operation in multi-tenant deployments that forces unrelated users to lose in-flight work. copilot-bridge already has a capable ConfigWatcher that auto-reloads many settings via fs.watch; this feature builds on that foundation in phases to eliminate the remaining restart-required scenarios.

Spec Reference

https://github.com/raykao/dark-factory/tree/speckit/hot-reload/specs/hot-reload

Phases and Tasks

Phase 1 - Setup

  • [T001] Create empty test file stubs to unblock parallel test authoring: tests/unit/config.test.ts, tests/unit/session-manager.test.ts, tests/unit/adapters/mattermost.test.ts, tests/unit/adapters/slack.test.ts, tests/integration/sighup.test.ts, tests/integration/reload-bot.test.ts, tests/integration/hooks-watcher.test.ts

Phase 2 - Foundational (Blocking Prerequisites)

  • [T002] Extend ConfigDiff interface with botReloadNeeded: Map<string, string[]> and update diffConfigs() to reclassify adapter-bound fields from restartNeeded to botReloadNeeded; update admin-channel notification accordingly (src/config.ts)
  • [T003] Extract bot adapter construction from main() into a new async factory createAdapterForBot() in src/index.ts - pure refactor, zero behavior change (src/index.ts)
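The T002 reclassification can be sketched as follows. This is a minimal illustration, not the real src/config.ts: the `adapterBoundFields` set and the shape of the bot config objects are assumptions; only the `botReloadNeeded: Map<string, string[]>` field comes from the task list.

```typescript
// Hypothetical sketch of the ConfigDiff extension from T002.
interface ConfigDiff {
  restartNeeded: string[];
  botReloadNeeded: Map<string, string[]>; // bot name -> changed adapter-bound fields
}

// Illustrative set of fields that no longer force a full restart (assumption).
const adapterBoundFields = new Set(["token", "appToken", "platformUrl"]);

function diffBotConfigs(
  oldBots: Record<string, Record<string, string>>,
  newBots: Record<string, Record<string, string>>
): ConfigDiff {
  const diff: ConfigDiff = { restartNeeded: [], botReloadNeeded: new Map() };
  for (const [bot, newCfg] of Object.entries(newBots)) {
    const oldCfg = oldBots[bot] ?? {};
    for (const [field, value] of Object.entries(newCfg)) {
      if (oldCfg[field] === value) continue;
      if (adapterBoundFields.has(field)) {
        // Reclassified: route to per-bot reload instead of full restart.
        const fields = diff.botReloadNeeded.get(bot) ?? [];
        fields.push(field);
        diff.botReloadNeeded.set(bot, fields);
      } else {
        diff.restartNeeded.push(`${bot}.${field}`);
      }
    }
  }
  return diff;
}
```

The key property is that a field appears in exactly one bucket, so the admin-channel notification can report per-bot reloads and restart-required changes separately.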

Phase 3 - User Story 1: Per-Bot Adapter Restart (Priority P1)

  • [T004] Write failing unit tests for diffConfigs() returning botReloadNeeded for all five reclassified cases (tests/unit/config.test.ts)
  • [T005] Write failing integration tests for /reload bot covering success, failure+restore, restore-failure, concurrent reload, and non-admin user paths (tests/integration/reload-bot.test.ts)
  • [T006] Add reloadingBots: Set<string> concurrency guard and circular message ring buffer (capacity 100) to StreamingHandler (src/index.ts, src/core/streaming-handler.ts)
  • [T007] Implement first half of /reload bot <name> command handler: auth check, bot lookup, duplicate disambiguation, concurrency guard, pre-restart channel notification (src/index.ts)
  • [T008] Implement second half of /reload bot <name>: drain, disconnect, createAdapterForBot(), success/failure/restore paths, structured logging, finally cleanup (src/index.ts)
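The concurrency-guard skeleton behind T006-T008 might look like the sketch below. Only `reloadingBots: Set<string>` and the "Reload already in progress." reply come from this issue; the `reloadAdapter` callback stands in for the whole drain/disconnect/createAdapterForBot sequence.

```typescript
// Per-bot lock from T006: one /reload bot in flight per bot name.
const reloadingBots = new Set<string>();

async function reloadBot(
  name: string,
  reloadAdapter: (name: string) => Promise<void> // stand-in for drain + teardown + rebuild
): Promise<string> {
  if (reloadingBots.has(name)) {
    return "Reload already in progress.";
  }
  reloadingBots.add(name);
  try {
    await reloadAdapter(name);
    return `Bot ${name} reloaded.`;
  } finally {
    reloadingBots.delete(name); // finally-cleanup runs on success and failure alike
  }
}
```

Because the guard is checked and set synchronously before the first await, a second caller arriving mid-reload is rejected deterministically.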

Phase 4 - User Story 2: Reload Hooks Without Session Loss (Priority P2)

  • [T009] Write failing unit tests for reloadHooks(channelId) covering cache invalidation, success result, malformed JSON error with retention, no-workspace error, and SDK fallback path (tests/unit/session-manager.test.ts)
  • [T010] Implement async reloadHooks(channelId) on SessionManager: resolve workspace, invalidate cache, reload from disk, SDK RPC or session-reload fallback, error handling with line/col detail (src/core/session-manager.ts)
  • [T011] Route /reload hooks to sessionManager.reloadHooks() in command handler; reply with count confirmation or error; structured log at info level (src/index.ts)
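A condensed sketch of the T010 reload-with-retention behavior is below. It is not the real SessionManager API: workspace resolution is collapsed into an explicit `workspaceDir` parameter, the `HookResult` type and `hooksCache` map are assumptions, and the SDK RPC / session-reload fallback is omitted.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Illustrative result type (assumption).
type HookResult =
  | { ok: true; count: number }
  | { ok: false; error: string };

const hooksCache = new Map<string, unknown[]>(); // channelId -> parsed hooks

function reloadHooks(channelId: string, workspaceDir: string): HookResult {
  const file = path.join(workspaceDir, "hooks.json");
  let raw: string;
  try {
    raw = fs.readFileSync(file, "utf8");
  } catch {
    return { ok: false, error: "no hooks.json found in workspace" };
  }
  try {
    const hooks = JSON.parse(raw) as unknown[];
    hooksCache.set(channelId, hooks); // replace cache only after a successful parse
    return { ok: true, count: hooks.length };
  } catch (e) {
    // Malformed JSON: previously cached hooks are retained.
    return { ok: false, error: `hooks.json parse error: ${(e as Error).message}` };
  }
}
```

The ordering matters: the cache is written only after parsing succeeds, which is what gives the "malformed JSON error with retention" behavior T009 tests for.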

Phase 5 - User Story 3: SIGHUP Config Reload (Priority P2)

  • [T012] Write failing integration tests for SIGHUP: handler called once within 100 ms, botReloadNeeded logged without auto-restart, Windows silent skip (tests/integration/sighup.test.ts)
  • [T013] Register SIGHUP handler in main() guarded by process.platform !== 'win32'; triggers reloadConfig() and setLogLevel(); logs bot-reload advisory when botReloadNeeded is non-empty; does NOT auto-restart adapters (src/index.ts)
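The platform-guarded registration in T013 can be sketched like this; `reloadConfig` and the log callback are stand-ins for the real functions, and only the `win32` guard and the advisory-without-auto-restart behavior come from the task.

```typescript
// Sketch of T013: SIGHUP-triggered reload, POSIX-only, advisory-only.
function registerSighupHandler(
  reloadConfig: () => { botReloadNeeded: Map<string, string[]> },
  log: (msg: string) => void
): boolean {
  if (process.platform === "win32") {
    return false; // SIGHUP is POSIX-only; skip silently on Windows
  }
  process.on("SIGHUP", () => {
    const diff = reloadConfig();
    if (diff.botReloadNeeded.size > 0) {
      // Advisory only: adapters are never auto-restarted on SIGHUP.
      log(`bot reload needed for: ${[...diff.botReloadNeeded.keys()].join(", ")}`);
    }
  });
  return true;
}
```

An operator can then trigger the same reload path as a file-watch event with `kill -HUP <pid>`, while /reload bot remains a manual decision.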

Phase 6 - Seamless Token Rotation via reconnect()

  • [T014] Write failing unit tests for MattermostAdapter.reconnect(): atomic swap sequence, timeout path preserving old connection, connection-error path (tests/unit/adapters/mattermost.test.ts)
  • [T015] Write failing unit tests for SlackAdapter.reconnect(): handler re-registration, timeout path stopping new app only, old app never stopped on failure (tests/unit/adapters/slack.test.ts)
  • [T016] Add reconnect(newToken, options?) to the ChannelAdapter interface (src/types.ts)
  • [T017] Implement MattermostAdapter.reconnect() using 7-step atomic swap: new client, new WebSocket, subscribe, await confirmation (10 s timeout), atomic swap, close old WebSocket, update refs (src/adapters/mattermost.ts)
  • [T018] Implement SlackAdapter.reconnect() using 7-step atomic swap: new App, re-register handlers, start, await connected (10 s timeout), atomic swap, stop old app, update refs (src/adapters/slack.ts)
  • [T019] Upgrade /reload bot handler to attempt reconnect() before full teardown for credential-only changes; fall through to Phase 3 teardown path on ReconnectTimeoutError (src/index.ts)
  • [T020] Add telemetryReloadNeeded: boolean to ConfigDiff; move telemetry.* from restartNeeded; wire handler to flush and reinitialize OTel pipeline on change (src/config.ts, src/index.ts)

Phase 7 - User Story 4: Auto-Reload Hooks on File Change (Priority P3)

  • [T021] Write failing integration tests for hooks file watcher: reload within 2000 ms, 500 ms debounce coalescing, malformed JSON retention, no watcher leak after workspace deregistration (tests/integration/hooks-watcher.test.ts)
  • [T022] Implement watchHooksFile() and unwatchHooksFile() using fs.watch on parent directory, hookWatchers: Map for dedup, 500 ms debounce, per-channel reloadHooks() calls with error containment (src/index.ts)
  • [T023] Call watchHooksFile() on session creation for channels with workingDirectory and unwatchHooksFile() on workspace deregistration to prevent watcher leaks (src/index.ts or src/core/session-manager.ts)

Phase 8 - Polish and Cross-Cutting Concerns

  • [T024] Update inline JSDoc on ConfigDiff documenting botReloadNeeded semantics and telemetryReloadNeeded; add reloadDrainTimeoutMs to config schema docs (src/config.ts)
  • [T025] Add operator guide entries: SIGHUP is POSIX-only, drain timeout is configurable via reloadDrainTimeoutMs (default 10000 ms), fs.watch known-unreliable on Docker bind mounts (README.md or docs/)
  • [T026] Run full Jest test suite (npm test) and confirm all new tests pass and no regressions in existing startup, config-reload, or adapter tests

Acceptance Criteria

Phase 1 (SC-001 to SC-005)

  • SC-001: An operator can rotate a bot token or update a platform URL and restore service for only that bot's channels within 60 seconds - without restarting the bridge or disrupting any other bot's channels.
  • SC-002: Users can reload hooks.json changes with a single /reload hooks command without losing their active Copilot session or any session context.
  • SC-003: A SIGHUP signal triggers a config reload equivalent to a file-watch event within 1 second of signal receipt; all sessions remain active.
  • SC-004: After Phase 1, the bridge log MUST NOT show "restartNeeded" for bot token, bot app token, platform URL, or bot add/remove changes. These MUST appear as "botReloadNeeded" instead.
  • SC-005: Non-admin users issuing /reload bot receive a clear permission-denied message 100% of the time, with no unintended side effects.

Phase 2 (SC-006 to SC-008)

  • SC-006: Token rotation via reconnect() completes with zero message loss, verified by sending 100 messages at 1/second during rotation and confirming all 100 are received.
  • SC-007: Reconnect latency from reconnect() invocation to confirmed new connection MUST be under 30 seconds on a normally functioning platform instance.
  • SC-008: telemetry.* config changes take effect without a process restart; telemetry continues to export spans within 5 seconds of the reload.

Phase 3 (SC-009)

  • SC-009: Changes to hooks.json are detected and applied within 2 seconds of the file being saved, without any user command.

Notes

Key Design Decisions

  • /reload bot vs full restart: Per-bot adapter restart (Phase 1) is the primary mitigation for the most common restart causes. The reconnect() path (Phase 2) upgrades credential-only changes to zero-disruption with an automatic fallback to the Phase 1 teardown path on timeout.
  • Atomic connection swap: The Phase 2 reconnect sequence briefly holds both old and new connections open in parallel. The old connection is closed ONLY after the new connection is confirmed active and the atomic swap commits. If confirmation times out (10 s), the old connection is preserved and ReconnectTimeoutError is thrown.
  • diffConfigs() evolution: The return type gains botReloadNeeded: Map<string, string[]> in Phase 1 and telemetryReloadNeeded: boolean in Phase 2. Fields in botReloadNeeded are removed from restartNeeded. This is additive and backward-compatible (NFR-004).
  • /reload hooks fallback: If the Copilot SDK does not expose a hooks RPC, reloadHooks() falls back to a session-level reload that preserves the SDK connection. Full /new is never required.
  • SIGHUP: Registered only on POSIX platforms (process.platform !== 'win32'). Does not auto-restart adapters even when botReloadNeeded is non-empty - it logs an advisory and leaves manual /reload bot to the operator.
  • Concurrency guard: reloadingBots: Set<string> prevents concurrent /reload bot calls for the same bot. The second caller gets "Reload already in progress."
  • Drain timeout: Configurable via reloadDrainTimeoutMs in config.json (default 10,000 ms). In-flight streams are aborted after timeout with a truncated-response notice.
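For the message buffering mentioned in T006, a minimal ring buffer is enough; capacity 100 comes from the task, while the `drain()` API and drop-oldest policy are assumptions about how StreamingHandler would consume it.

```typescript
// Minimal bounded buffer sketch for messages arriving during a reload.
class RingBuffer<T> {
  private buf: T[] = [];
  constructor(private capacity = 100) {}

  push(item: T): void {
    this.buf.push(item);
    if (this.buf.length > this.capacity) this.buf.shift(); // drop the oldest entry
  }

  drain(): T[] {
    const out = this.buf; // hand back everything buffered so far
    this.buf = [];
    return out;
  }
}
```

On a successful reload the handler would drain the buffer into the new adapter; the fixed capacity bounds memory if a reload stalls.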

Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| @mattermost/client singleton prevents parallel connections | Medium | Phase 2 blocked | Investigate source; fall back to a stop/start cycle if needed |
| @slack/bolt stop/start loses event handler registrations | Medium | Phase 2 Slack reconnect becomes complex | Test empirically; re-register handlers in the reconnect sequence |
| Copilot SDK does not expose a hooks RPC | High | /reload hooks requires a full session reload | Implement the session-reload path as primary; document the disruption level |
| fs.watch unreliable on container volumes (Phase 3) | Medium | Auto-reload fails silently in Docker | Add a polling fallback or use chokidar for the Phase 3 watcher |
| Concurrent /reload bot commands causing a race condition | Low | botAdapters Map corrupted | Gate with a per-bot lock (the reloadingBots Set) |

Open Questions

  1. Drain timeout: Should reloadDrainTimeoutMs be configurable per-deployment (recommended) or hardcoded at 10,000 ms?
  2. Multi-bot atomicity: If a platform URL change affects all bots on a platform, should /reload bot be run per-bot, or should a /reload platform <name> command be added? (Recommend: per-bot for now.)
  3. Phase 3 notifications: Should file-watcher-triggered hook reloads notify users in-channel, or reload silently with a log entry? (Recommend: silent with optional verbose config flag.)
  4. Telemetry flush guarantee: TracerProvider.shutdown() is async and may drop in-flight spans. Is an acceptable span loss window during telemetry reload tolerable? (Recommend: yes, document in operator guide.)
