fix: graceful shutdown coordination for watcher tasks #757

Open
tipogi wants to merge 7 commits into feat/dx-events-by-user from fix/task-panic
tipogi commented Mar 2, 2026

Problem

The original watcher ran its tasks with let _ = tokio::try_join!(...). Discarding the result threw away any JoinError, so a panic in one task could go unnoticed, and sibling tasks received no coordinated shutdown signal. A failure could therefore propagate to runtime shutdown while other tasks were still running, risking partial or inconsistent processing.

1) Coordinated watcher tasks with JoinSet

NexusWatcher::run_tasks now creates a JoinSet and spawns:

  • default homeserver loop,
  • external homeservers loop,
  • shutdown forwarder loop.

Behavior:

  • Wait for the first task to finish (join_next).
  • Immediately signal internal shutdown to remaining tasks.
  • Drain all remaining tasks.
  • Return an error if any task failed; otherwise report a graceful shutdown.

Result:

  • One failing task no longer leaves others unmanaged.
  • Shutdown semantics are explicit and centralized.
  • Panic/error visibility is improved because task failures are observed when draining the set.
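
The first-exit, signal, drain sequence described above can be illustrated with a dependency-free sketch. This is not the PR's code: it deliberately swaps tokio's JoinSet for std threads so it runs without any crates. A completion channel stands in for join_next, an AtomicBool stands in for the internal shutdown signal, and the Err returned by JoinHandle::join plays the role of JoinError. The names Task, DoneGuard, poll_until_shutdown, and this run_tasks are all hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// A task receives the shared shutdown flag and reports success or failure.
type Task = Box<dyn FnOnce(Arc<AtomicBool>) -> Result<(), String> + Send>;

/// Notifies the coordinator when dropped, so a task that panics still
/// reports that it has finished (the guard is dropped during unwinding).
struct DoneGuard(mpsc::Sender<()>);

impl Drop for DoneGuard {
    fn drop(&mut self) {
        let _ = self.0.send(());
    }
}

/// Hypothetical stand-in for a watcher loop: polls until shutdown is requested.
fn poll_until_shutdown(shutdown: Arc<AtomicBool>) -> Result<(), String> {
    while !shutdown.load(Ordering::SeqCst) {
        thread::sleep(Duration::from_millis(5));
    }
    Ok(())
}

/// Thread-based analog of the coordination described above (not tokio's JoinSet).
fn run_tasks(tasks: Vec<Task>) -> Result<(), String> {
    let shutdown = Arc::new(AtomicBool::new(false));
    let (done_tx, done_rx) = mpsc::channel();

    let handles: Vec<thread::JoinHandle<Result<(), String>>> = tasks
        .into_iter()
        .map(|task| {
            let guard = DoneGuard(done_tx.clone());
            let flag = Arc::clone(&shutdown);
            thread::spawn(move || {
                let _guard = guard; // dropped on return or on panic
                task(flag)
            })
        })
        .collect();
    drop(done_tx); // only the tasks hold senders now

    // 1. Wait for the first task to finish, for whatever reason.
    let _ = done_rx.recv();
    // 2. Signal shutdown to the remaining tasks.
    shutdown.store(true, Ordering::SeqCst);

    // 3. Drain every task; a panic surfaces as Err from join().
    let mut first_error = None;
    for handle in handles {
        let outcome = handle
            .join()
            .unwrap_or_else(|_| Err("task panicked".to_string()));
        if let Err(e) = outcome {
            if first_error.is_none() {
                first_error = Some(e);
            }
        }
    }

    // 4. Report the first failure, or a graceful shutdown.
    match first_error {
        Some(e) => Err(e),
        None => Ok(()),
    }
}
```

Passing two poll_until_shutdown tasks plus one task that fails or panics makes run_tasks return that failure only after both polling tasks have observed the flag and exited, which is the property the JoinSet version is meant to provide.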

2) Prevent burst catch-up with MissedTickBehavior::Skip

Both watcher loops now set:

interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip)

Behavior:

  • If one iteration runs longer than the tick period, missed ticks are dropped.
  • The loop resumes from the next scheduled tick instead of firing multiple immediate catch-up ticks.

Result:

  • Smoother load profile after slow rounds.
  • Avoids back-to-back burst processing that can spike CPU/DB work.
  • More predictable pacing for periodic indexing loops.
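
To make the difference concrete, here is a simplified model of the two scheduling policies. It is a sketch under stated assumptions, not tokio's implementation: time is in whole milliseconds, the first tick fires immediately, scheduling overhead is zero, and any lateness tolerance the real timer applies is ignored. The names Missed and simulate_ticks are hypothetical.

```rust
/// Simplified stand-in for the two policies discussed above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Missed {
    /// Default policy: fire late ticks back to back until caught up.
    Burst,
    /// Drop missed ticks and resume on the original schedule grid.
    Skip,
}

/// Returns the time (ms since start) at which each tick fires.
/// `work_ms[i]` is how long the loop body runs after tick `i`.
fn simulate_ticks(period_ms: u64, work_ms: &[u64], behavior: Missed) -> Vec<u64> {
    let mut fired = Vec::with_capacity(work_ms.len());
    let mut now = 0u64;
    let mut deadline = 0u64; // the first tick is due immediately

    for &work in work_ms {
        // A tick completes at its deadline, or at once if the deadline passed.
        now = now.max(deadline);
        fired.push(now);

        deadline = match behavior {
            // Next deadline is one period after the previous deadline,
            // even if that is already in the past.
            Missed::Burst => deadline + period_ms,
            // Next deadline is the next multiple of the period (counted
            // from the start) that lies after `now`.
            Missed::Skip => {
                let late = now - deadline;
                now + period_ms - (late % period_ms)
            }
        };

        now += work; // run the loop body
    }
    fired
}
```

With a 100 ms period and a single 500 ms round followed by instant rounds, the Burst model fires five ticks at t = 500 before returning to 100 ms spacing, while the Skip model fires once at t = 500 and continues at 600, 700, and so on. That matches the shape of the two timestamp logs in the validation section.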

Validation (test + results)

New test module: nexus-watcher/tests/service/missed_tick_skip.rs verifies that, after a slow processing round, ticks do not fire in a burst with near-zero spacing.

Without MissedTickBehavior::Skip, repeated identical millisecond timestamps indicate immediate catch-up ticks after the slow round:

run_default_homeserver: t0_unix_ms: 1772605415342
run_default_homeserver: t0_unix_ms: 1772605415845
run_default_homeserver: t0_unix_ms: 1772605415845
run_default_homeserver: t0_unix_ms: 1772605415845
run_default_homeserver: t0_unix_ms: 1772605415845
run_default_homeserver: t0_unix_ms: 1772605415845
run_default_homeserver: t0_unix_ms: 1772605415942
run_default_homeserver: t0_unix_ms: 1772605416043
run_default_homeserver: t0_unix_ms: 1772605416143
run_default_homeserver: t0_unix_ms: 1772605416243
run_default_homeserver: t0_unix_ms: 1772605416342
run_default_homeserver: t0_unix_ms: 1772605416442
run_default_homeserver: t0_unix_ms: 1772605416543
run_default_homeserver: t0_unix_ms: 1772605416643
run_default_homeserver: t0_unix_ms: 1772605416742
run_default_homeserver: t0_unix_ms: 1772605416842

With MissedTickBehavior::Skip, timestamps resume with roughly regular spacing and no burst cluster:

run_default_homeserver: t0_unix_ms: 1772605358102
run_default_homeserver: t0_unix_ms: 1772605358604
run_default_homeserver: t0_unix_ms: 1772605358702
run_default_homeserver: t0_unix_ms: 1772605358803
run_default_homeserver: t0_unix_ms: 1772605358903
run_default_homeserver: t0_unix_ms: 1772605359002
run_default_homeserver: t0_unix_ms: 1772605359102
run_default_homeserver: t0_unix_ms: 1772605359202
run_default_homeserver: t0_unix_ms: 1772605359303
run_default_homeserver: t0_unix_ms: 1772605359402
run_default_homeserver: t0_unix_ms: 1772605359503
run_default_homeserver: t0_unix_ms: 1772605359602
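
One way to turn these logs into a pass/fail check is to look at the smallest gap between consecutive timestamps. The sketch below is only an illustration of that idea; the actual assertions in missed_tick_skip.rs are not shown here. The helper names and the 50 ms threshold are hypothetical, chosen on the assumption that the tick period is roughly 100 ms, as the spacing in the logs suggests.

```rust
/// Timestamps (unix ms) copied from the run without MissedTickBehavior::Skip.
const WITHOUT_SKIP: &[u64] = &[
    1772605415342, 1772605415845, 1772605415845, 1772605415845,
    1772605415845, 1772605415845, 1772605415942, 1772605416043,
    1772605416143, 1772605416243, 1772605416342, 1772605416442,
    1772605416543, 1772605416643, 1772605416742, 1772605416842,
];

/// Timestamps (unix ms) copied from the run with MissedTickBehavior::Skip.
const WITH_SKIP: &[u64] = &[
    1772605358102, 1772605358604, 1772605358702, 1772605358803,
    1772605358903, 1772605359002, 1772605359102, 1772605359202,
    1772605359303, 1772605359402, 1772605359503, 1772605359602,
];

/// Smallest gap between consecutive timestamps, or None when there are
/// fewer than two samples.
fn min_gap_ms(timestamps: &[u64]) -> Option<u64> {
    timestamps
        .windows(2)
        .map(|pair| pair[1].saturating_sub(pair[0]))
        .min()
}

/// True when any two consecutive ticks are closer than `threshold_ms`.
fn has_burst(timestamps: &[u64], threshold_ms: u64) -> bool {
    min_gap_ms(timestamps).map_or(false, |gap| gap < threshold_ms)
}
```

On the data above, the run without Skip has a minimum gap of 0 ms (the repeated ...845 entries), while the run with Skip never drops below 98 ms, so a 50 ms threshold separates the two cleanly.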

Pre-submission Checklist

For the tests to work, you need running Neo4j and Redis instances loaded with the example dataset in docker/db-graph.

  • Testing: Implement and pass new tests for the new features/fixes, cargo nextest run.
  • Performance: Ensure new code has relevant performance benchmarks, cargo bench -p nexus-webapi

tipogi added this to the 2026-Q1 milestone Mar 2, 2026
tipogi self-assigned this Mar 2, 2026
tipogi added the 📈 enhancement (New feature or request) and 👀 watcher (Nexus indexer related operations) labels Mar 2, 2026
tipogi changed the title from "draft" to "fix: graceful shutdown coordination for watcher tasks" Mar 2, 2026
tipogi marked this pull request as ready for review March 4, 2026 08:57
ok300 self-requested a review March 4, 2026 09:26
tipogi requested a review from aintnostressin March 4, 2026 09:26

3 participants