
feat: broadcast shutdown notification to active threads (RFC #78 §1d Phase 1)#182

Open
ruan330 wants to merge 4 commits into openabdev:main from ruan330:feat/shutdown-broadcast

Conversation

@ruan330

@ruan330 ruan330 commented Apr 10, 2026

Summary

On SIGINT or SIGTERM the broker now posts a short notification to every active thread across all configured adapters (Discord, Slack, …) before closing shards. Users get a clear signal that the broker is going away instead of replies cutting off silently mid-stream.

Discord Discussion URL: https://discord.com/channels/1488041051187974246/1495050997461024879

Relates to: #78, #75

Problem

Today, when the broker is restarted (systemctl restart, docker stop, Ctrl+C), conversations just die silently mid-reply. Users have no signal that anything is happening until the broker comes back, and in-flight streams leave partial messages with no explanation. RFC #78 §1d calls this out as Phase 1.

The original #182 (rebased before v0.7.6) was Discord-specific and done at the main.rs level. The multi-platform refactor in #259 makes that approach stale: Slack threads also need the notification, and the AdapterRouter / ChatAdapter abstraction is the right home for platform-neutral broadcast. This revision rewrites the feature against that architecture.

Design

Pool owns addressing, not the router. SessionPool now stores each active session's ChannelRef + Arc<dyn ChatAdapter> alongside the connection (addresses: HashMap<thread_key, ...>), kept in lockstep with state.active so broadcast has a single source of truth. No parallel cache in the adapter layer, no pool.active_thread_ids() filter, no prune task.

Atomic admission and snapshot.

  • SessionPool::begin_shutdown() takes the state write lock, flips the shutdown flag, and snapshots addresses — all in one critical section. Any admission committed before us is in the snapshot.
  • get_or_create checks is_shutting_down() at four points so every session is either in the snapshot or rejected at admission:
    1. Fast-fail on entry.
    2. Inside the initial state.read() block.
    3. Before return Ok(()) on the existing-alive-session path, with a state.read() barrier synchronizing against begin_shutdown's state.write() flag flip (closes the race where shutdown starts while we wait on the per-connection mutex).
    4. Inside the final state.write() block before insert.
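As a rough illustration, the flag-flip-plus-snapshot handshake can be sketched with std types. This is a simplified stand-in, not the actual OpenAB code: the real pool stores `ChannelRef` + `Arc<dyn ChatAdapter>` and uses tokio's async locks, and all names and the `String` address type here are illustrative.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::RwLock;

// Hypothetical simplified pool: thread_key -> address (a stand-in for
// the real ChannelRef + adapter handle).
struct Pool {
    shutting_down: AtomicBool,
    state: RwLock<HashMap<String, String>>,
}

impl Pool {
    // Flip the flag and snapshot addresses in one critical section:
    // any admission committed before this write lock is in the
    // snapshot; any admission after it sees the flag.
    fn begin_shutdown(&self) -> Vec<(String, String)> {
        let state = self.state.write().unwrap();
        self.shutting_down.store(true, Ordering::Release);
        state.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
    }

    fn get_or_create(&self, key: &str, addr: &str) -> Result<(), &'static str> {
        // Check 1: fast-fail on entry.
        if self.shutting_down.load(Ordering::Acquire) {
            return Err("shutting down");
        }
        // (checks 2-3 and the per-connection locking are elided here)
        // Check 4: inside the write lock, atomic with begin_shutdown.
        let mut state = self.state.write().unwrap();
        if self.shutting_down.load(Ordering::Acquire) {
            return Err("shutting down");
        }
        state.insert(key.to_string(), addr.to_string());
        Ok(())
    }
}
```

The invariant falls out of the shared write lock: a session is either inserted before `begin_shutdown` takes the lock (and so appears in the snapshot) or it observes the flag and is rejected.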

Platform-neutral broadcast. AdapterRouter::broadcast_shutdown(message, timeout) calls pool.begin_shutdown() and posts the notification to every snapshot entry in parallel via tokio::task::JoinSet. A tokio::select! with a configurable deadline (10s here) caps total broadcast time so shutdown itself never blocks on a slow platform.
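The bounded, best-effort fan-out can be illustrated with a std-thread analogue of the JoinSet + `select!` pattern described above: send to every entry in parallel, then stop waiting once the deadline expires. `fake_send` stands in for `adapter.send_message`; all names here are illustrative, not the actual API.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Returns how many notifications were confirmed before the deadline.
fn broadcast_with_deadline(entries: Vec<String>, deadline: Duration) -> usize {
    let (tx, rx) = mpsc::channel();
    for entry in entries {
        let tx = tx.clone();
        // One send per snapshot entry, all in parallel.
        thread::spawn(move || {
            let ok = fake_send(&entry);
            let _ = tx.send(ok);
        });
    }
    drop(tx); // so the channel disconnects once all senders finish

    let start = Instant::now();
    let mut delivered = 0;
    // Keep collecting results only while time remains on the deadline.
    while let Some(remaining) = deadline.checked_sub(start.elapsed()) {
        match rx.recv_timeout(remaining) {
            Ok(true) => delivered += 1,
            Ok(false) => {}  // send error: best-effort, keep going
            Err(_) => break, // deadline hit or all senders done
        }
    }
    delivered
}

// Stand-in for a platform send; always succeeds instantly here.
fn fake_send(entry: &str) -> bool {
    !entry.is_empty()
}
```

A slow platform just means some results are never collected; the caller proceeds to teardown regardless, which mirrors the best-effort semantics described above.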

Unified signal handling. main.rs uses a shared wait_for_shutdown_signal() helper for SIGINT + SIGTERM (#[cfg(unix)]-gated with a ctrl_c-only fallback for non-Unix). Both the Discord-enabled and Slack-only branches invoke broadcast_shutdown before tearing down adapters.

Behaviour notes

  • Delivery is best-effort: send errors and the 10-second deadline both fall through to normal pool teardown.
  • Wording is intentionally neutral ("Context will reset on return") — we don't promise automatic session resume. RFC #78 (Session Management — Design Proposal) Phase 2 persistence is a separate follow-up.
  • Per-channel parallel send is safe under Discord/Slack rate limits (limits are per-channel, not per-adapter) and is ~10× faster than sequential for deployments with several active threads.

Tests & verification

  • cargo test — 43 passed, 0 failed.
  • cargo clippy --all-targets -- -D warnings — clean.
  • Live-tested on bare-metal and Docker brokers: SIGTERM triggers the broadcasting shutdown notification count=N log line and the message arrives in every active thread.

Test plan

  • CI: cargo test, cargo clippy, Docker smoke tests.
  • Manual: start broker with Discord and/or Slack configured, open an active thread, send SIGTERM (docker stop / kill -TERM), verify the thread receives the "Bot is shutting down" message before the pool tears down.
  • Manual: broadcast timeout path — block network to Discord and verify shutdown still completes after 10s.

@ruan330 ruan330 requested a review from thepagent as a code owner April 10, 2026 11:25
@thepagent thepagent added the p2 Medium — planned work label Apr 15, 2026
@masami-agent
Contributor

Review — PR #182

Good feature — users currently get no warning when the broker shuts down. This fixes that.

✅ What looks good

  1. SIGTERM support — main only has ctrl_c() today. Adding SIGTERM via tokio::select! is important for k8s graceful shutdown.
  2. active_thread_ids() — clean, minimal addition to pool.
  3. Broadcast before shutdown — correct ordering. HTTP client is still alive at broadcast time (before shard_manager.shutdown_all()).
  4. Sequential broadcast — safer than parallel given Discord rate limits (50 req/s). No need to add join_all complexity.
  5. Neutral wording — doesn't promise auto-resume.
  6. Failure handling — per-thread warn on send failure, no crash.

🔴 Must fix before merge

1. Rebase onto latest main
The pool structure has changed:

  • PR uses self.connections; main uses self.state (field: RwLock<PoolState>, inner: state.active)
  • main.rs shutdown handler has been restructured (Slack adapter support, shutdown_tx channel)
  • Handler struct is completely different

active_thread_ids() will need to read self.state.read().await.active.keys() after rebase.

2. CI not triggered
Fork PR — needs maintainer approval to run CI.

🟡 Non-blocking

3. k8s terminationGracePeriodSeconds
Default is 30s. Sequential broadcast with 20+ active threads could take 2-10s (each channel.say() is ~100-500ms). If Discord API is slow, broadcast may get truncated by SIGKILL. Not a blocker — worst case is some threads miss the notification. But worth noting for operators with large session pools.

4. "Restarting" message on permanent shutdown
The message says "Broker restarting" but helm uninstall is a permanent shutdown. No good way to distinguish at runtime — acceptable as-is.

Summary

Valuable UX improvement. Rebase needed — pool and main.rs have both changed significantly.

@ruan330 ruan330 force-pushed the feat/shutdown-broadcast branch from 04d1e0e to 82c8a7e Compare April 16, 2026 08:57
@github-actions github-actions bot added the closing-soon PR missing Discord Discussion URL — will auto-close in 3 days label Apr 16, 2026
@thepagent thepagent removed the closing-soon PR missing Discord Discussion URL — will auto-close in 3 days label Apr 16, 2026
Contributor

@masami-agent masami-agent left a comment


PR Review: #182

Summary

  • Problem: Broker restarts kill conversations silently — users get no notification that the broker is going down.
  • Approach: Add active_thread_ids() accessor to SessionPool, handle SIGTERM alongside SIGINT, and broadcast a neutral notification to each active Discord thread before shutting down shards.
  • Risk level: Low

Core Assessment

  1. Problem clearly stated: ✅ — well-motivated, linked to RFC #78 §1d
  2. Approach appropriate: ✅ — minimal, clean separation between pool (lifecycle) and main (Discord)
  3. Alternatives considered: ✅ — PR description covers three alternatives with clear reasoning
  4. Best approach for now: ✅ — Phase 1 scope is correct; no over-engineering

Findings

pool.rs — active_thread_ids()

Clean and minimal. Takes a read lock, returns an owned Vec<String>, releases immediately. No contention risk with the subsequent shutdown() write lock. ✅

main.rs — SIGTERM handling

The tokio::select! over ctrl_c() and sigterm.recv() is the standard pattern. The .expect("install SIGTERM handler") is acceptable here — if we cannot install a signal handler at startup, panicking is the right call since the broker would be unable to shut down gracefully anyway. ✅

main.rs — broadcast loop

The strip_prefix("discord:") + parse::<u64>() approach correctly handles the {platform}:{id} key format used by AdapterRouter. Non-Discord keys (e.g. slack:...) are silently skipped via the parse failure path, which is the right behavior for now. ✅
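A minimal sketch of that parsing behavior, assuming the "{platform}:{id}" key format described above (the function name is illustrative):

```rust
// Extract the numeric channel ID from a "discord:{id}" thread key.
fn discord_channel_id(thread_key: &str) -> Option<u64> {
    let raw_id = thread_key.strip_prefix("discord:").unwrap_or(thread_key);
    // "slack:12345" falls through unchanged and fails the numeric
    // parse, so non-Discord keys are skipped silently.
    raw_id.parse::<u64>().ok()
}
```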

main.rs — error handling

channel.say() failures are logged as warnings and do not block shutdown. Correct — a dead Discord connection should never prevent pool cleanup. ✅

Review Summary

🔴 Blockers

  • Cargo.lock version bump is out of scope. The diff changes Cargo.lock from 0.7.6 → 0.7.7. On main, Cargo.toml is already 0.7.7 but Cargo.lock is stale at 0.7.6. While the sync is technically correct, it is unrelated to the shutdown broadcast feature and should not be bundled into this PR. Please revert the Cargo.lock change and let it be fixed in a separate housekeeping commit, or confirm this was an intentional rebase artifact that the maintainers are okay with.

💬 Questions

  1. Non-Discord shutdown path also lacks SIGTERM. The else branch (line ~194 on main: "running without discord, press ctrl+c to stop") still only awaits ctrl_c(). In a Slack-only or headless deployment, docker stop / kill would still bypass the graceful path. Was this intentionally left out of scope, or should it be addressed here for consistency?

  2. Slack adapter threads. When Slack support is active, thread_key will be slack:{id}. The broadcast loop silently skips non-Discord keys (the parse::<u64>() fails on slack:... after strip_prefix("discord:") returns the full key). This is fine for Phase 1, but worth noting — should there be a tracing::debug! for skipped non-Discord keys so operators know the broadcast was Discord-only?

🔧 Suggested Changes

  1. Consider a timeout on the broadcast loop. If there are many active sessions and Discord is slow or rate-limiting, the broadcast could delay shutdown significantly. A tokio::time::timeout wrapping the entire loop (e.g. 5–10 seconds) would provide a safety net. Not blocking, but worth considering for production robustness.

  2. The pool.shutdown() is called twice on the Discord path. After the spawned task runs shard_manager.shutdown_all(), execution continues past client.start().await? and hits the shutdown_pool.shutdown().await at line ~208. The broadcast task does NOT call pool.shutdown() — it only broadcasts and shuts down shards. So the pool is shut down once (at the bottom), which is correct. However, the broadcast task clones pool as broadcast_pool but never calls shutdown on it. This is fine, but the naming broadcast_pool could be clearer — maybe notification_pool or just pool_ref to signal it is read-only usage.

ℹ️ Info

  • The Cargo.lock discrepancy on main (Cargo.toml = 0.7.7, Cargo.lock = 0.7.6) is a pre-existing issue. It should be fixed, but in a separate commit.
  • RFC #78 §1d Phase 2 (session persistence) is correctly deferred. The neutral wording ("you can continue" vs "will resume") is the right call given no state is persisted yet.

⚪ Nits

  • Line length on the channel.say() string literal is ~90 chars — fine, but if you want to break it for readability, a const at the top of the block would work nicely:
    const SHUTDOWN_MSG: &str = "🔄 Broker restarting. You can continue the conversation when the broker is back.";

Verdict

COMMENT — One blocker (Cargo.lock scope) and two questions to resolve. The core implementation is clean and well-designed. Once the Cargo.lock issue is addressed and the questions are answered, this should be ready for approval.

Contributor

@masami-agent masami-agent left a comment


Hi @ruan330 — leaving some inline comments on specific areas. The overall review is in my earlier comment above. Thanks for the solid work on this!

Comment thread Cargo.lock
[[package]]
name = "openab"
-version = "0.7.6"
+version = "0.7.7"
Contributor


🔴 BLOCKER — This version bump (0.7.6 → 0.7.7) is a pre-existing Cargo.lock / Cargo.toml sync issue on main, not related to the shutdown broadcast feature. Please revert this file from the PR to keep the scope clean. The lockfile sync can be fixed in a separate housekeeping PR or commit.

git checkout origin/main -- Cargo.lock

Comment thread src/main.rs Outdated
// systemctl stop / docker stop / kill deliver SIGTERM; without handling
// it, the broker would exit without running the broadcast below.
let shard_manager = client.shard_manager.clone();
let broadcast_pool = pool.clone();
Contributor


💬 Question — The else branch below (non-Discord / Slack-only mode, around line 194 on main) still only awaits ctrl_c(). In a Slack-only or headless deployment, docker stop / kill would still bypass the graceful path. Was this intentionally left out of scope for this PR? If so, might be worth a // TODO: comment noting that SIGTERM handling should be added here too.

Comment thread src/main.rs Outdated
info!(count = thread_ids.len(), "broadcasting shutdown notification");
for thread_key in thread_ids {
// thread_key format: "platform:id" (e.g. "discord:1234567890").
let raw_id = thread_key
Contributor


🟡 Suggestion — Consider wrapping the broadcast loop in a timeout to guard against Discord rate limits or network issues delaying shutdown:

let _ = tokio::time::timeout(
    std::time::Duration::from_secs(10),
    async {
        for thread_key in thread_ids {
            // ... existing broadcast logic ...
        }
    },
).await;

Not blocking — just a safety net for production. If Discord is slow, you probably still want the process to exit within a bounded time.

Comment thread src/main.rs Outdated
.strip_prefix("discord:")
.unwrap_or(&thread_key);
if let Ok(id) = raw_id.parse::<u64>() {
let channel = serenity::model::id::ChannelId::new(id);
Contributor


ℹ️ Info — When Slack adapter threads exist, their keys (slack:{id}) will pass through strip_prefix("discord:") unchanged (returns None → unwrap_or gives the full key), then parse::<u64>() fails on slack:12345 and they're silently skipped. This is correct behavior for Phase 1, but a tracing::debug! here for skipped non-Discord keys would help operators understand why some sessions didn't get notifications:

let raw_id = thread_key
    .strip_prefix("discord:")
    .unwrap_or(&thread_key);
if let Ok(id) = raw_id.parse::<u64>() {
    // ... send notification ...
} else {
    tracing::debug!(thread_key = %thread_key, "skipping non-Discord thread for shutdown broadcast");
}

Comment thread src/main.rs Outdated
)
.await
{
tracing::warn!(thread_key = %thread_key, error = %e, "failed to post shutdown notification");
Contributor


Nit — The message string is a bit long inline. Consider extracting it as a constant for readability:

const SHUTDOWN_MSG: &str =
    "🔄 Broker restarting. You can continue the conversation when the broker is back.";

Minor — skip if you prefer it inline.

@shaun-agent
Contributor

OpenAB PR Screening

Screening report

Intent

This PR tries to make broker shutdowns visible to users who are actively chatting in Discord threads. Today, a broker restart or stop can terminate an in-flight conversation without explanation, leaving a partial reply or a dead thread with no operator-visible or user-visible signal. The proposed change addresses that gap by sending a shutdown notice to all currently active Discord threads before the broker tears down the session pool.

Feat

This is a feature with some operational-hardening characteristics. In plain terms, when the OpenAB broker receives SIGINT or SIGTERM, it gathers the active thread IDs and posts a fixed Discord message to each thread before shutting down the shard manager and session pool. It does not preserve session state, recover streams, or resume work after restart; it only makes the interruption explicit.

Who It Serves

The primary beneficiary is Discord end users in active threads, with secondary value for maintainers and operators who want shutdown behavior to be less opaque and easier to reason about during deploys, restarts, or crashes followed by restart.

Rewritten Prompt

Implement Phase 1 of graceful shutdown notification for the Discord broker.

When the process receives either SIGINT or SIGTERM, enumerate all currently active Discord thread IDs from the session pool, post a short restart notice to each valid thread, log notification failures as warnings, and then continue normal shutdown. Keep Discord-specific delivery logic in the broker entrypoint rather than pushing it into the pool internals. Do not add persistence, resume semantics, or a drain window in this change. Add or update tests where practical to verify signal handling and that active-thread notification happens before pool teardown.

Merge Pitch

This is worth advancing because it closes a user-facing failure mode with low implementation surface area: silent thread death during broker restarts. The change is operationally useful, narrow in scope, and aligned with RFC #78 Phase 1.

The risk profile is moderate-low. The likely reviewer concern is not whether the feature is useful, but whether shutdown-path messaging is too coupled to Discord delivery, too optimistic under rate limits, or under-tested for edge cases like invalid thread IDs and partial shutdown failures.

Best-Practice Comparison

OpenClaw principles that are relevant here:

  • Explicit delivery routing is relevant. This PR makes shutdown communication an explicit broker action aimed at the affected threads rather than leaving behavior implicit.
  • Retry/backoff and run logs are somewhat relevant. The PR logs failures, but it does not retry notification delivery or produce durable shutdown event records.
  • Gateway-owned scheduling is only weakly relevant. This is shutdown orchestration, not scheduled work.
  • Durable job persistence is not addressed. The PR explicitly stops short of preserving or replaying active session state.
  • Isolated executions are mostly not relevant to this specific change.

Hermes Agent principles that are relevant here:

  • Fresh session per scheduled run is conceptually adjacent but not directly applicable. This PR does not create a new run model; it just acknowledges interruption.
  • Self-contained prompts for scheduled tasks are not especially relevant.
  • Atomic writes for persisted state and file locking to prevent overlap are not relevant because this phase intentionally avoids persistence.
  • Gateway daemon tick model is also not directly relevant, since this is signal-driven shutdown behavior rather than periodic coordination.

Net assessment:

  • The PR aligns best with the principle of explicit operational lifecycle handling.
  • It does not yet align with the durability and recovery patterns seen in stronger production-grade systems.
  • That is acceptable if the item is judged strictly as Phase 1 notification, not as full graceful shutdown.

Implementation Options

Option 1: Minimal signal-path broadcast

Keep the current approach: handle SIGINT and SIGTERM in main, fetch active thread IDs from the pool via a narrow accessor, post a fixed message sequentially, warn on failures, then continue shutdown.

This is the most conservative path. It preserves existing boundaries reasonably well and keeps the PR tightly scoped to user notification.

Option 2: Lifecycle notifier abstraction

Introduce a small shutdown notifier component or trait owned by the broker layer. The pool exposes active session identifiers, while the broker invokes a notifier responsible for formatting and delivering shutdown notices. The initial implementation still targets Discord threads, but the shutdown path becomes easier to test and less ad hoc.

This is the balanced option. It adds a small abstraction without dragging Discord concerns into SessionPool or overengineering the feature.

Option 3: Broader graceful-shutdown framework

Implement a more complete shutdown coordinator: optional drain window for in-flight streams, bounded concurrency for thread notifications, structured shutdown result logging, and groundwork for Phase 2 persistence or resumable session metadata.

This is the ambitious option. It better matches mature gateway/runtime patterns, but it expands scope and risks turning a clean Phase 1 merge into a broader design exercise.

Comparison Table

| Option | Speed to ship | Complexity | Reliability | Maintainability | User impact | Fit for OpenAB right now |
|---|---|---|---|---|---|---|
| Minimal signal-path broadcast | Fast | Low | Medium | Medium | Immediate and clear | Strong |
| Lifecycle notifier abstraction | Medium | Medium | Medium-High | High | Immediate and clear | Strong if reviewers want cleaner boundaries |
| Broader graceful-shutdown framework | Slow | High | High | Medium-High | Highest long-term value | Weak for this PR; better as follow-up work |

Recommendation

Recommend the balanced version of Option 1 leaning slightly toward Option 2: merge the shutdown notification behavior now, but keep the design disciplined enough that notification delivery is clearly broker-owned and easy to evolve.

That is the right step for follow-up discussion because it solves the real user problem immediately without pretending to solve persistence or resumability. If reviewers are comfortable with the current narrow accessor plus main-layer delivery, this can merge as-is with a request for tighter tests. If they want cleaner seams, the only acceptable expansion should be a very small notifier abstraction, not a full graceful-shutdown framework. Phase 2 persistence, drain timing, and rate-limit mitigation should stay split into later items.

@ruan330 ruan330 force-pushed the feat/shutdown-broadcast branch from 82c8a7e to fa106cd Compare April 18, 2026 13:31
@github-actions github-actions bot added closing-soon PR missing Discord Discussion URL — will auto-close in 3 days and removed closing-soon PR missing Discord Discussion URL — will auto-close in 3 days labels Apr 18, 2026
@ruan330
Author

ruan330 commented Apr 18, 2026

Force-pushed fa106cd — apologies for the churn. The previous commit (82c8a7e) was a straight rebase of the original Discord-only design, which no longer matches the multi-adapter shape the codebase moved to after #259. This revision rewrites against it.

Mapping to the screening concerns:

  • Too coupled to Discord — broadcast now routes through ChatAdapter::send_message via a new AdapterRouter::broadcast_shutdown(), so Slack threads get the notification alongside Discord. Discord specifics stay in the adapter.
  • Rate limits — JoinSet parallel send bounded by a 10s tokio::select! deadline. Best-effort by design.
  • Edge cases — addressing moved into SessionPool next to state.active, and begin_shutdown() flips the flag + snapshots under one state write lock. Admission invariant: every session is either in the snapshot or rejected inline at get_or_create.

Design alternatives considered (before picking pool-owned addressing):

  • Router-only, track in-flight handle_message calls — rejected; narrows broadcast scope too much (recently-chatted threads that aren't mid-reply wouldn't be notified).
  • Router cache + pool filter — rejected; coordinating admission/snapshot atomicity across two data sources produced multiple race windows in review.
  • Pool owns addressing (chosen) — single source of truth, admission atomicity reuses pool's existing state lock, adapter layer stays stateless.

Tests: 43 passed, clippy clean. CI green.

…penabdev#78 §1d Phase 1)

On SIGINT or SIGTERM the broker now posts a short notification to every
active thread across all configured adapters (Discord, Slack, …) before
closing shards. Users get a clear signal that the broker is going away
instead of replies cutting off silently mid-stream.

Discord Discussion URL: https://discord.com/channels/1488041051187974246/1495050997461024879

## Design

- `SessionPool` stores each active session's addressing info alongside
  its connection (`addresses: HashMap<thread_key, (ChannelRef, Adapter)>`),
  kept in lockstep with `state.active` so broadcast has a single source of
  truth. No parallel cache in the adapter layer.
- `SessionPool::begin_shutdown()` flips the shutdown flag and snapshots
  `addresses` under the state write lock, making admission and snapshot
  atomic with respect to each other.
- `AdapterRouter::broadcast_shutdown(message, timeout)` calls
  `begin_shutdown`, then posts the notification to every snapshot entry in
  parallel via `tokio::task::JoinSet`. A `tokio::select!` with a 10-second
  deadline caps total broadcast time so shutdown itself never blocks on a
  slow platform.
- `main.rs` shutdown handler listens for SIGTERM in addition to SIGINT
  (`#[cfg(unix)]`-gated with a ctrl_c-only fallback for non-Unix), and
  runs `broadcast_shutdown` before `shard_manager.shutdown_all()`. Both
  the Discord branch and the Slack-only branch use the same flow via a
  shared `wait_for_shutdown_signal()` helper.

## Shutdown admission invariant

Every session in the pool at broadcast time is either:
1. In the broadcast snapshot (receives the notification), or
2. Rejected at admission (receives an inline "Bot is shutting down"
   notice through `handle_message`'s pool-error path).

`get_or_create` checks `is_shutting_down()` at four atomic points:
1. Fast-fail on entry (cheap).
2. Inside the initial `state.read()` block (catches shutdowns that ran
   before we reached this task).
3. Before the `return Ok(())` on the existing-alive-session path, with
   a `state.read()` barrier to synchronize with `begin_shutdown`'s
   `state.write()` flag flip (catches shutdowns that started while we
   were waiting on the per-connection mutex).
4. Inside the final `state.write()` block before insert (atomic with
   `begin_shutdown`'s flag flip + snapshot).

## Behaviour notes

- Notification is best-effort: delivery errors and the 10-second deadline
  both fall through to normal pool teardown.
- Wording is intentionally neutral ("Context will reset on return") —
  we don't promise automatic session resume; RFC openabdev#78 Phase 2 persistence
  is a separate follow-up.
- Per-channel parallel send is safe under Discord/Slack rate limits
  (limits are per-channel, not per-adapter) and is ~10× faster than
  sequential for deployments with several active threads.

## Testing

- `cargo test` passes: 43 passed, 0 failed.
- `cargo clippy --all-targets -- -D warnings` clean.
- Verified on bare-metal and Docker brokers: SIGTERM triggers
  `broadcasting shutdown notification count=N` log and the message
  arrives in every active thread.

Relates to: openabdev#78, openabdev#75
@ruan330 ruan330 force-pushed the feat/shutdown-broadcast branch from fa106cd to 12aa620 Compare April 18, 2026 14:06
@ruan330
Author

ruan330 commented Apr 18, 2026

Force-pushed 12aa620 to address the remaining feedback:

  • Extracted SHUTDOWN_MSG + SHUTDOWN_BROADCAST_TIMEOUT to module-level consts (@masami-agent's ⚪ nit). The message is defined once and reused across the Discord and Slack-only shutdown paths.

  • Closed a Slack feedback-loop in the admission-reject path. The earlier revision sent an inline "Bot is shutting down" message back to any sender during the shutdown window. When the Slack Socket Mode loop is still running, our own broadcast post is delivered back as a message event — under allow_bot_messages = "all" it routes into handle_message, hits the reject branch, and generates another bot-authored reply, looping until the bot-turn cap. Fix: skip the inline rejection when sender.is_bot; human senders still get the notice. Discord is unaffected because the handler filters msg.author.id == bot_id up front.
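The guard described in that bullet can be sketched as follows. `Sender` and `is_bot` here are stand-ins for the actual SenderContext deserialized from sender_json; this is an illustration of the decision, not the real handler code.

```rust
// Stand-in for the deserialized sender context.
struct Sender {
    is_bot: bool,
}

// During the shutdown window, only human senders get the inline
// "Bot is shutting down" notice. Bot-authored events (e.g. our own
// broadcast echoed back through the Slack Socket Mode loop) are
// dropped silently, breaking the reply feedback loop.
fn should_send_shutdown_notice(sender: &Sender) -> bool {
    !sender.is_bot
}
```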

Tests: 43 passed, clippy clean. CI green.

@shaun-agent
Contributor

small maintainer note: the earlier CI failure was not a design objection to the shutdown feature itself. the real issue was branch drift against current main — the adapter interface had moved to sender_json, so the PR branch no longer compiled cleanly on the merge result.

we took over that compatibility cleanup on the PR branch, pushed the follow-up fixes (9e7deaa, 13563f9), and reran the workflows. current head 13563f9 is now green on CI and Docker smoke tests.

so at this point the remaining discussion should be about the feature and review judgment, not merge-drift noise.

@shaun-agent
Contributor

handoff note: the earlier CI failures were merge-drift noise, not a feature-design rejection. we rebased/fixed the branch against current main, pushed the compatibility cleanup, and current head 13563f9 is now green on Rust CI and Docker smoke tests.

at this point the remaining question is review judgment on the shutdown-broadcast design and scope, not build breakage.

Contributor

@masami-agent masami-agent left a comment


PR Review: #182

Summary

  • Problem: When the broker shuts down (SIGINT/SIGTERM), active conversations die silently mid-reply with no user-facing signal. RFC #78 §1d Phase 1.
  • Approach: Pool-owned addressing + atomic shutdown flag + parallel broadcast via JoinSet with timeout, platform-neutral through the ChatAdapter abstraction.
  • Risk level: Medium — touches core pool admission logic and shutdown path.

Core Assessment

  1. Problem clearly stated: ✅ — Excellent PR description with design rationale, behavior notes, and test plan.
  2. Approach appropriate: ✅ — Pool-owns-addressing is the right call. No parallel cache in the adapter layer, single source of truth.
  3. Alternatives considered: ✅ — PR description explicitly contrasts with the original Discord-specific approach and explains why the multi-platform refactor makes this the right home.
  4. Best approach for now: ✅ — Minimal, well-scoped. Deliberately avoids session persistence (Phase 2 follow-up).

Findings

Shutdown admission invariant (pool.rs) — The 4-point check in get_or_create is thorough and correct:

  1. Fast-fail on entry (cheap, avoids spawning ACP process).
  2. Inside state.read() block (catches shutdowns before this task ran).
  3. After conn.lock() with state.read() barrier (synchronizes with begin_shutdown's write-lock flag flip — this is the subtle one, and the comment explains it well).
  4. Inside final state.write() before insert (atomic with flag-flip + snapshot).

The begin_shutdown() method correctly holds the state write lock while flipping the AtomicBool and snapshotting addresses, so any admission that committed before the lock is in the snapshot, and any that comes after sees the flag. Sound.

addresses invariantaddresses.keys() == active.keys() is maintained across all mutation paths: insert, stale rebuild, eviction, cleanup_idle, and shutdown. I traced each one — all correct.

Broadcast timeout (adapter.rs) — The biased select with set.shutdown().await is correct. set.shutdown() cancels remaining tasks and waits for them to drop, which is the right behavior for HTTP calls that should be cancellable via tokio's cooperative cancellation.

Bot-loop prevention (adapter.rs) — Smart catch. During the shutdown window, the bot's own broadcast message could arrive as a new event on Slack, trigger handle_message, get rejected by the pool, and the rejection reply would loop. Deserializing sender_json to check is_bot breaks the cycle cleanly. The Deserialize derive on SenderContext is the minimal change needed.

Signal handling (main.rs)wait_for_shutdown_signal() with #[cfg(unix)] SIGTERM + ctrl_c, and a ctrl_c-only fallback for non-Unix, is clean and correct. Both the Discord and Slack-only branches use the same flow.

discord.rs / slack.rs — Purely rustfmt line wrapping. No logic changes. ✅

Review Summary

🔴 Blockers

(none)

💬 Questions

  1. get_or_create signature change — The new channel: &ChannelRef, adapter: &Arc<dyn ChatAdapter> parameters are passed through from handle_message. Is there any other call site for get_or_create outside of handle_message? If so, those callers would need updating. (I only see the one call site in adapter.rs, so this should be fine — just confirming.)

🔧 Suggested Changes

  1. Commit history cleanup — The branch has merge commits from "Codex Temp" (48034166, 9e7deaa9, 13563f9c). Since the project uses squash merge, this will be cleaned up at merge time, but if you're planning any further pushes to this branch, consider rebasing to a single commit for cleaner git log during review.

ℹ️ Info

  1. The _sync = self.state.read().await pattern in check #3 is unusual (acquiring a read lock purely for synchronization), but the comment explains the intent clearly. The RwLock acquire provides the happens-before relationship needed to make the AtomicBool store from begin_shutdown visible. This is a valid and well-documented pattern here.
  2. broadcast_shutdown sends via adapter.send_message() (Discord/Slack HTTP API), not through ACP connections, so pool.shutdown() clearing active connections afterward does not interfere with in-flight broadcast sends. The sequencing in main.rs (broadcast → shutdown shards → pool.shutdown()) is correct.
  3. The AtomicBool with Acquire/Release ordering is technically redundant with the RwLock-protected checks in points #2 and #4 (which already have happens-before from the lock), but it's needed for the fast-fail path (#1) and the read-barrier path (#3). Belt-and-suspenders for shutdown correctness is the right trade-off.

⚪ Nits

(none — code is clean, comments are thorough)

Verdict

APPROVE — This is a well-designed, well-documented feature. The shutdown admission invariant is sound, the race condition coverage is thorough, and the platform-neutral broadcast approach is the right architecture. No blockers.

Collaborator

@obrutjack obrutjack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge checklist verified:

  • ✅ CI all green (cargo check + 7 Docker smoke tests)
  • ✅ masami-agent reviewed — no blockers, thorough analysis of shutdown admission invariant
  • ✅ Multi-platform broadcast via ChatAdapter (Discord + Slack)
  • ✅ Parallel broadcast with 10s timeout (JoinSet + deadline)
  • ✅ SIGTERM + SIGINT handling (unix + non-unix)
  • ✅ Bot-loop prevention during shutdown window
  • ✅ Pool admission 4-layer shutdown check — sound
  • ✅ Neutral wording (no "restarting" assumption)
  • ✅ No version regression

Well-designed feature. The pool-owned addressing approach is the right architecture. Pending @thepagent.
