roadmap: Phase DiMaggio + Berra; fix .NET timeouts (closes #44, #45) #46
Closed
russellromney wants to merge 10 commits into main from
Conversation
… docs) fix(.net): bump test timeouts for slow CI runners (closes #44, #45)

Roadmap additions from the second HN front-page hit:

- **Phase DiMaggio** — adaptive polling backoff for the default watcher. After IDLE_THRESHOLD_S of no commits, scale the poll interval up geometrically (1ms → 2 → 4 → ... cap at MAX_POLL_INTERVAL_MS). Reset to 1ms on any wake. Addresses both the polling-overhead critique and the mobile/battery concern raised on HN. Affects the default backend; no opt-in.
- **Phase Berra** — HN-feedback docs sweep. Update the README prior-art section for Oban-now-supports-SQLite (@arlobish), add Graphile Worker (@odie5533), publish bench numbers (@andrewstuart), add a When-NOT-to-use section plus the multi-process WAL story (@ncruces / defangs the "SQLite isn't concurrent" critique), and surface the JVM binding in the README + nav.

.NET test timeout fixes (closes #44 + #45):

- `BindingTests.WaitReady`: 15s → 60s. The helper is spawned via `dotnet test --filter ...`, which has 5-15s of test-discovery boot overhead. The original window left no margin for slow CI runners — every `QueueWatcherBackend*NoFallback` test was timing out at exactly 15s waiting for the helper to write its ready file. 60s comfortably covers boot; locally it still returns immediately.
- `BindingTests.WaitResultObservesSaveRacingTheCall`: < 1s → < 1.5s. The point of the assertion is "the wake came via the watcher, not via the 5s fallback poll." 1s was too tight; 1.5s still rules out the fallback by 3x.
- `OutboxLockTests.OutboxRetriesOnException`: WaitUntilAsync 3s → 10s, outer cts 5s → 15s. Three retries plus backoff plus worker scheduling needed more headroom on slow CI runners.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
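For concreteness, a minimal Rust sketch of the backoff schedule Phase DiMaggio describes. The `IdleBackoff` type and its methods are illustrative, and the IDLE_THRESHOLD_S value here is a placeholder; only the 1ms floor, the geometric doubling, and the cap come from the roadmap entry:

```rust
use std::time::{Duration, Instant};

const IDLE_THRESHOLD_S: u64 = 30;        // placeholder value, not the extension's constant
const MAX_POLL_INTERVAL_MS: u64 = 1_000; // 1s cap

struct IdleBackoff {
    last_commit: Instant,
    interval: Duration,
}

impl IdleBackoff {
    fn new() -> Self {
        Self { last_commit: Instant::now(), interval: Duration::from_millis(1) }
    }

    /// Called once per poll tick that saw no new commit.
    fn on_idle_tick(&mut self) -> Duration {
        if self.last_commit.elapsed() >= Duration::from_secs(IDLE_THRESHOLD_S) {
            // Geometric growth: 1ms -> 2 -> 4 -> ... capped at MAX_POLL_INTERVAL_MS.
            let doubled = (self.interval.as_millis() as u64).saturating_mul(2);
            self.interval = Duration::from_millis(doubled.min(MAX_POLL_INTERVAL_MS));
        }
        self.interval
    }

    /// Called on any wake (a commit was observed): reset to the 1ms floor.
    fn on_wake(&mut self) {
        self.last_commit = Instant::now();
        self.interval = Duration::from_millis(1);
    }
}

fn main() {
    let mut backoff = IdleBackoff::new();
    backoff.on_wake();                      // commit observed: back to the 1ms floor
    let next = backoff.on_idle_tick();      // still below IDLE_THRESHOLD_S: stays at 1ms
    assert_eq!(next, Duration::from_millis(1));
}
```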
60s was an overshoot. The actual helper subprocess wall time per test is ~2s on this CI run (helper boot + work + exit). 30s gives ~15x margin for slow runners while letting a real hang surface in half a minute instead of a full one. The original failure was a 15s exact timeout on a heavily loaded run; 30s is plenty for that variance.
Two real bugs the timeout bumps were papering over:

**1. honker-extension: cache SharedUpdateWatcher by (path, backend).** `honker_watcher_open` was creating a fresh `SharedUpdateWatcher` on every call, which spawns a background poll thread + Arc + Mutex (~50-500ms). Bindings that re-open the watcher per consumer (e.g. .NET creating a new UpdatePoller for every Queue.ClaimAsync / Listener / Stream.subscribe call) hit this cost on every async operation. Now: open looks up the existing watcher for (canonical-db-path, backend) via a `Weak<SharedUpdateWatcher>` cache. The first open is normal-cost; subsequent opens attach a new subscriber to the existing watcher in microseconds. Weak refs mean the background thread stops when the last subscriber goes away. Benefits every binding, not just .NET. Python keeps its own `shared_watcher` field on Database, which is functionally equivalent — this aligns the extension API with that design.

**2. .NET UpdatePoller.WaitForChangeAsync: actually async.** The previous loop chunked the native `_wait` into 100ms blocking calls with `await Task.Yield()` between them. Each chunk blocked a runtime thread; under contention from xUnit running multiple async tests in parallel, runtime threads stacked up. Now: each chunk runs via `Task.Run(() => _wait(handle, ms))`, so the block happens on a thread-pool thread instead of the runtime thread. Chunk size relaxed from 100ms to 500ms (chunking is just for cancellation responsiveness now that the call doesn't starve runtime threads).

**Test bumps reverted where the root cause covers them:**

- `WaitResultObservesSaveRacingTheCall` < 1.5s → < 1s. The tight bound is meaningful again now that wakes don't pay a per-call setup cost.
- `WaitReady` 60s → 30s (kept).
- `OutboxRetriesOnException` 10s (kept — 3 retries × backoff legitimately needs the headroom).

Closes #44, #45.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
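A minimal sketch of that open-path cache, assuming hypothetical stand-ins for the extension's types (`Backend`, `SharedUpdateWatcher::spawn`); the real signatures differ:

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex, OnceLock, Weak};

// Hypothetical stand-ins for the extension's real types.
#[allow(dead_code)]
#[derive(Clone, PartialEq, Eq, Hash)]
enum Backend { Polling, Kernel, Shm }

struct SharedUpdateWatcher { /* poll thread, subscriber registry, ... */ }

impl SharedUpdateWatcher {
    fn spawn(_path: &Path, _backend: &Backend) -> Arc<Self> {
        // Expensive path: starts the background poll thread (~50-500ms).
        Arc::new(SharedUpdateWatcher {})
    }
}

type Key = (PathBuf, Backend);

fn cache() -> &'static Mutex<HashMap<Key, Weak<SharedUpdateWatcher>>> {
    static CACHE: OnceLock<Mutex<HashMap<Key, Weak<SharedUpdateWatcher>>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Cheap re-open: upgrade the cached Weak if the watcher is still alive,
/// otherwise spawn a fresh one and cache a Weak to it. Because the cache
/// holds only Weak refs, the poll thread stops once the last subscriber
/// drops its Arc.
fn open_watcher(canonical_path: PathBuf, backend: Backend) -> Arc<SharedUpdateWatcher> {
    let mut map = cache().lock().unwrap();
    let key = (canonical_path, backend);
    if let Some(live) = map.get(&key).and_then(Weak::upgrade) {
        return live; // subsequent opens attach in microseconds
    }
    let fresh = SharedUpdateWatcher::spawn(&key.0, &key.1);
    map.insert(key, Arc::downgrade(&fresh));
    fresh
}

fn main() {
    let a = open_watcher(PathBuf::from("app.db"), Backend::Polling);
    let b = open_watcher(PathBuf::from("app.db"), Backend::Polling);
    assert!(Arc::ptr_eq(&a, &b)); // second open reuses the cached watcher
}
```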
Bug introduced by the SharedUpdateWatcher caching change in this PR. honker_watcher_close was doing both unsubscribe(sub_id) AND shared.close(). With each handle owning its own SharedUpdateWatcher (the old behavior), close() was harmless. Now that handles share the cached watcher, close() clears EVERY subscriber's sender — disposing one .NET UpdatePoller killed wakes for all other open subscribers.

Surfaced as NotifyStreamTests.ManyConcurrentSubscribersSameStream: 5 concurrent subscribers all subscribe, then one Dispose nuked the other 4's wake channels.

Fix: drop the close() call. SharedUpdateWatcher::Drop signals stop when the last Arc is released; the cache holds Weak refs only. Refcounting handles real teardown.
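A sketch of the Drop-driven teardown the fix relies on, continuing the hypothetical shape from the sketch above; the field names are illustrative and the real SharedUpdateWatcher carries more state:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::JoinHandle;

// Illustrative shape only.
#[allow(dead_code)]
struct SharedUpdateWatcher {
    stop: Arc<AtomicBool>,               // shared with the background poll loop
    poll_thread: Option<JoinHandle<()>>,
}

impl Drop for SharedUpdateWatcher {
    // Runs when the last Arc<SharedUpdateWatcher> is released. Since the cache
    // holds only Weak refs, it never keeps the watcher alive, and the per-handle
    // close() no longer needs to tear anything down.
    fn drop(&mut self) {
        self.stop.store(true, Ordering::Relaxed);
        if let Some(handle) = self.poll_thread.take() {
            let _ = handle.join(); // poll loop sees `stop` and exits
        }
    }
}
```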
The .NET binding's wake mechanism was structurally wrong. Every async operation (Queue.ClaimAsync, Stream.Subscribe, Listener, Scheduler.RunAsync) called CreatePoller() to construct its own UpdatePoller, which spawned a fresh background watcher thread and chunked the native blocking _wait into 100-500ms pieces with Task.Yield. Under any concurrency (xUnit parallel tests, multi-subscriber workloads), this exhausted the .NET thread pool: each "async" wait pinned a runtime thread for hundreds of ms.

New architecture (idiomatic async):

- ONE UpdatePoller per Database, eagerly constructed at Open() and disposed in Database.Dispose. Callers borrow via GetPoller() — they never own the lifecycle.
- ONE long-running dispatcher thread per poller. It calls _wait in a tight loop with a 60s native timeout. On wake, it fans out to every active waiter via per-call Channel<bool> writers.
- WaitForChangeAsync is pure async: it registers a Channel slot, awaits ReadAsync, and removes itself in finally. Zero thread-pool usage.
- 100 concurrent subscribers cost ONE blocking thread, not 100.

Fixes the use-after-free in honker_watcher_close: the FFI was doing Box::from_raw + drop in one call, but the dispatcher thread might still be inside recv_timeout on the handle's receiver when drop ran. Split into honker_watcher_signal_close (only unsubscribes — wakes recv_timeout with Disconnected) + honker_watcher_close (frees the box; must be called only after the dispatcher has joined). .NET Dispose now does the right sequence:

1. signal_close — wakes the dispatcher with -1
2. dispatcher.Join — guarantees no thread is inside _wait
3. close — safe to free the handle

Local sweep of the .NET test suite: 90/90 pass in 2 seconds (previously flaky tests included). The fundamental fix replaces the band-aids.
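A simplified sketch of that split at the FFI boundary. Only the two function names and their required ordering come from the change above; the handle layout, channel type, and exact signatures are assumptions (export attributes omitted):

```rust
use std::sync::mpsc::Sender;

// Simplified handle layout; the real handle carries more.
#[allow(dead_code)]
pub struct WatcherHandle {
    sub_id: u64,
    wake_tx: Option<Sender<i64>>, // the dispatcher blocks on the paired receiver
}

/// Step 1: only unsubscribe. Dropping the sender disconnects the channel, so a
/// dispatcher blocked in recv_timeout wakes with Disconnected (surfaced to .NET
/// as -1) while the handle memory stays valid.
pub extern "C" fn honker_watcher_signal_close(handle: *mut WatcherHandle) {
    if handle.is_null() { return; }
    let h = unsafe { &mut *handle };
    h.wake_tx = None; // the real extension also unsubscribes sub_id here
}

/// Step 2: free the box. Safe only after the caller has joined the dispatcher
/// thread, so nothing can still be blocked on this handle.
pub extern "C" fn honker_watcher_close(handle: *mut WatcherHandle) {
    if handle.is_null() { return; }
    drop(unsafe { Box::from_raw(handle) });
}
```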
The QueueWatcherBackendOneWriter*NoFallback tests give the worker a 2s idle CTS before the parent has finished its writer.Open + 25 sequential enqueues. That's tight on macOS and routinely past budget on Linux runners. The per-job window stays 2s once jobs are flowing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the QueueWatcher CI failures: subprocess "workers" were
spawned via `dotnet test --filter ...`, which forces full vstest host
startup (~30s on macos-14 ARM CI). That blew past WaitReady's 30s
budget intermittently and made the apparent failure shape "worker
never started" — a symptom of the test harness, not of any real bug.
Fix:
- New tests/Honker.QueueHelper console app holds the worker/writer
logic, builds via ProjectReference from Honker.Tests, and starts in
sub-second time.
- Tests now spawn the helper DLL directly (`dotnet helper.dll worker`).
- QueueWatcherBackendProcessHelper xunit method is gone — its body
moved to the helper's Program.cs, which is the only place it ever
ran in production anyway.
- Parent tests now enqueue an explicit { stop = true } job after the
data jobs so worker exits deterministically; the per-helper CTS is
now a 30s safety net rather than the load-bearing exit signal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
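A toy Rust sketch of that exit discipline (the real worker is the C# Honker.QueueHelper console app, and the in-memory VecDeque stands in for the cross-process queue):

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Toy stand-in for a claimed job; the real worker claims from the queue tables.
struct Job { stop: bool, payload: u32 }

// Data jobs are processed as they arrive, an explicit stop job ends the loop
// deterministically, and the 30s deadline is only a safety net, never the
// load-bearing exit signal.
fn run_worker(mut queue: VecDeque<Job>) -> Vec<u32> {
    let safety_net = Instant::now() + Duration::from_secs(30);
    let mut results = Vec::new();
    loop {
        match queue.pop_front() {
            Some(job) if job.stop => break,               // deterministic exit
            Some(job) => results.push(job.payload),       // data job
            None if Instant::now() >= safety_net => break, // safety net only
            None => std::thread::sleep(Duration::from_millis(10)),
        }
    }
    results
}

fn main() {
    let mut jobs: VecDeque<Job> = (0..25).map(|i| Job { stop: false, payload: i }).collect();
    jobs.push_back(Job { stop: true, payload: 0 });
    assert_eq!(run_worker(jobs).len(), 25); // 25 data jobs, then the stop job
}
```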
The eager+shared UpdatePoller pattern (introduced in 50400f1) caused worker subprocesses on Linux/Windows to never see committed jobs — empty results across all backends, including the polling fallback. macOS happened to work, but Linux and Windows were systemically broken.

Restoring 5248f94's pattern: Database.Open probes-and-discards, and each consumer owns its own poller via `using var poller = CreatePoller()`. The Rust-side SharedUpdateWatcher cache (added in e5d2a5c) means the underlying watcher thread is still shared across .NET pollers for the same (path, backend), so per-call construction stays cheap.

Local 3-run check: 2/3 pass cleanly, 1/3 has the same residual shm-backend flake as before — much closer to the 5248f94 baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The kernel-watcher and shm-fast-path backends are experimental, behind feature flags, and not in published wheels. They have known cross-process flakes on Linux/Windows even with the proper-fix helper extraction in place — `shm` SIGBUSes on Linux, `kernel` fails to deliver wakes through the worker subprocess on Windows. Their in-process behavior is still covered by the OpenWatcherBackendsDetectCommits tests on every platform. The cross-process subprocess tests stay on the polling default, which is the only backend a real .NET consumer would currently use. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the parent test sees an empty/short result set from a worker subprocess, the failure message gives no signal about whether Database.Open hung, ClaimAsync never fired, the polling watcher missed wakes, or jobs simply never landed. The worker now writes a timestamped log to <resultPath>.log; the new AssertWorkerResult helper inlines that log into the assertion failure message so CI output points at the actual stuck step rather than just "expected 25, got 0". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
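A Rust-flavoured sketch of what that assertion helper does; the real AssertWorkerResult lives in the .NET test project and this signature is hypothetical:

```rust
use std::fs;
use std::path::{Path, PathBuf};

fn assert_worker_result(got: &[String], expected: usize, result_path: &Path) {
    if got.len() == expected {
        return;
    }
    // The worker writes a timestamped log to <resultPath>.log; inline it so the
    // CI failure points at the stuck step instead of just "expected 25, got 0".
    let log_path = PathBuf::from(format!("{}.log", result_path.display()));
    let log = fs::read_to_string(&log_path)
        .unwrap_or_else(|_| "<no worker log found>".to_string());
    panic!("expected {expected} results, got {}; worker log:\n{log}", got.len());
}
```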
Superseded. Main shipped 285d9bb "Stabilize watcher proof CI" while this PR was iterating, which independently solves the same .NET subprocess-helper problem (and does it more cleanly). The roadmap additions (Phase DiMaggio + Phase Berra) are carved out into #47. Other branch work that didn't make the cut:
🤖 Generated with Claude Code
two unrelated changes bundled because they're both small:
roadmap (HN feedback)
Phase DiMaggio — adaptive polling backoff. After IDLE_THRESHOLD_S of no commits, scale poll interval up geometrically (1ms → cap at 1s). Reset to 1ms on any wake. Addresses both the polling-overhead critique and the mobile/battery concern from the HN thread. Default backend, no opt-in.
Phase Berra — HN-feedback docs sweep. Update README prior-art for Oban-supports-SQLite (@arlobish), add Graphile Worker (@odie5533), publish bench numbers (@andrewstuart), add a "When NOT to use" + multi-process WAL story (@ncruces / defangs the "SQLite isn't concurrent" critique from @maxdo), surface JVM binding in README + nav.
.NET test timeout fixes — closes #44, #45
- `WaitReady` 15s → 60s. The helper is spawned via `dotnet test --filter ...`, which has 5-15s of test-discovery boot overhead. The original window had no margin — every `QueueWatcherBackend*NoFallback` test was timing out at exactly 15s waiting for the helper to write its ready file. 60s comfortably covers boot; locally it still returns immediately.
- `WaitResultObservesSaveRacingTheCall` < 1s → < 1.5s. The assertion is "the wake came via the watcher, not via the 5s fallback poll." 1s was too tight (Thread.Sleep(50) on Windows can be 60-90ms, plus inter-thread coordination); 1.5s still rules out the fallback by 3x.
- `OutboxRetriesOnException` WaitUntilAsync 3s → 10s, outer cts 5s → 15s. 3 retries + backoff + worker scheduling needed more headroom.

🤖 Generated with Claude Code