roadmap: Phase DiMaggio + Berra; fix .NET timeouts (closes #44, #45) #46
Closed
russellromney wants to merge 10 commits into main from
Conversation
… docs) fix(.net): bump test timeouts for slow CI runners (closes #44, #45)

Roadmap additions from the second HN front-page hit:

- **Phase DiMaggio** — adaptive polling backoff for the default watcher. After IDLE_THRESHOLD_S of no commits, scale the poll interval up geometrically (1ms → 2 → 4 → ... cap at MAX_POLL_INTERVAL_MS). Reset to 1ms on any wake. Addresses both the polling-overhead critique and the mobile/battery concern raised on HN. Affects the default backend; no opt-in.
- **Phase Berra** — HN-feedback docs sweep. Update the README prior-art section for Oban-now-supports-SQLite (@arlobish), add Graphile Worker (@odie5533), publish bench numbers (@andrewstuart), add a When-NOT-to-use section plus the multi-process WAL story (@ncruces / defangs the "SQLite isn't concurrent" critique), and surface the JVM binding in the README + nav.

.NET test timeout fixes (closes #44 + #45):

- `BindingTests.WaitReady`: 15s → 60s. The helper is spawned via `dotnet test --filter ...`, which has 5-15s of test-discovery boot overhead. The original window left no margin for slow CI runners — every `QueueWatcherBackend*NoFallback` test was timing out at exactly 15s waiting for the helper to write its ready file. 60s comfortably covers boot; locally it still returns immediately.
- `BindingTests.WaitResultObservesSaveRacingTheCall`: < 1s → < 1.5s. The point of the assertion is "the wake came via the watcher, not via the 5s fallback poll." 1s was too tight; 1.5s still rules out the fallback by 3x.
- `OutboxLockTests.OutboxRetriesOnException`: WaitUntilAsync 3s → 10s, outer cts 5s → 15s. Three retries plus backoff plus worker scheduling needed more headroom on slow CI runners.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
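For concreteness, a minimal Rust sketch of the backoff schedule Phase DiMaggio describes. The `IdleBackoff` type and its methods are illustrative, and the IDLE_THRESHOLD_S value here is a placeholder; only the 1ms floor, the geometric doubling, and the cap come from the roadmap entry:

```rust
use std::time::{Duration, Instant};

const IDLE_THRESHOLD_S: u64 = 30;        // placeholder value, not the extension's constant
const MAX_POLL_INTERVAL_MS: u64 = 1_000; // 1s cap

struct IdleBackoff {
    last_commit: Instant,
    interval: Duration,
}

impl IdleBackoff {
    fn new() -> Self {
        Self { last_commit: Instant::now(), interval: Duration::from_millis(1) }
    }

    /// Called once per poll tick that saw no new commit.
    fn on_idle_tick(&mut self) -> Duration {
        if self.last_commit.elapsed() >= Duration::from_secs(IDLE_THRESHOLD_S) {
            // Geometric growth: 1ms -> 2 -> 4 -> ... capped at MAX_POLL_INTERVAL_MS.
            let doubled = (self.interval.as_millis() as u64).saturating_mul(2);
            self.interval = Duration::from_millis(doubled.min(MAX_POLL_INTERVAL_MS));
        }
        self.interval
    }

    /// Called on any wake (a commit was observed): reset to the 1ms floor.
    fn on_wake(&mut self) {
        self.last_commit = Instant::now();
        self.interval = Duration::from_millis(1);
    }
}

fn main() {
    let mut backoff = IdleBackoff::new();
    backoff.on_wake();                      // commit observed: back to the 1ms floor
    let next = backoff.on_idle_tick();      // still below IDLE_THRESHOLD_S: stays at 1ms
    assert_eq!(next, Duration::from_millis(1));
}
```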
60s was an overshoot. The actual helper subprocess wall time per test is ~2s on this CI run (helper boot + work + exit). 30s gives ~15x margin for slow runners while letting a real hang surface in half a minute instead of a full one. The original failure was a 15s exact timeout on a heavily loaded run; 30s is plenty for that variance.
Two real bugs the timeout bumps were papering over:

**1. honker-extension: cache SharedUpdateWatcher by (path, backend).** `honker_watcher_open` was creating a fresh `SharedUpdateWatcher` on every call, which spawns a background poll thread + Arc + Mutex (~50-500ms). Bindings that re-open the watcher per consumer (e.g. .NET creating a new UpdatePoller for every Queue.ClaimAsync / Listener / Stream.subscribe call) hit this cost on every async operation. Now: open looks up the existing watcher for (canonical-db-path, backend) via a `Weak<SharedUpdateWatcher>` cache. The first open is normal-cost; subsequent opens attach a new subscriber to the existing watcher in microseconds. Weak refs mean the background thread stops when the last subscriber goes away. Benefits every binding, not just .NET. Python keeps its own `shared_watcher` field on Database, which is functionally equivalent — this aligns the extension API with that design.

**2. .NET UpdatePoller.WaitForChangeAsync: actually async.** The previous loop chunked the native `_wait` into 100ms blocking calls with `await Task.Yield()` between them. Each chunk blocked a runtime thread; under contention from xUnit running multiple async tests in parallel, runtime threads stacked up. Now: each chunk runs via `Task.Run(() => _wait(handle, ms))`, so the block happens on a thread-pool thread instead of the runtime thread. Chunk size relaxed from 100ms to 500ms (chunking is just for cancellation responsiveness now that the call doesn't starve runtime threads).

**Test bumps reverted where the root cause covers them:**

- `WaitResultObservesSaveRacingTheCall` < 1.5s → < 1s. The tight bound is meaningful again now that wakes don't pay a per-call setup cost.
- `WaitReady` 60s → 30s (kept).
- `OutboxRetriesOnException` 10s (kept — 3 retries × backoff legitimately needs the headroom).

Closes #44, #45.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
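A minimal sketch of that open-path cache, assuming hypothetical stand-ins for the extension's types (`Backend`, `SharedUpdateWatcher::spawn`); the real signatures differ:

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex, OnceLock, Weak};

// Hypothetical stand-ins for the extension's real types.
#[allow(dead_code)]
#[derive(Clone, PartialEq, Eq, Hash)]
enum Backend { Polling, Kernel, Shm }

struct SharedUpdateWatcher { /* poll thread, subscriber registry, ... */ }

impl SharedUpdateWatcher {
    fn spawn(_path: &Path, _backend: &Backend) -> Arc<Self> {
        // Expensive path: starts the background poll thread (~50-500ms).
        Arc::new(SharedUpdateWatcher {})
    }
}

type Key = (PathBuf, Backend);

fn cache() -> &'static Mutex<HashMap<Key, Weak<SharedUpdateWatcher>>> {
    static CACHE: OnceLock<Mutex<HashMap<Key, Weak<SharedUpdateWatcher>>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Cheap re-open: upgrade the cached Weak if the watcher is still alive,
/// otherwise spawn a fresh one and cache a Weak to it. Because the cache
/// holds only Weak refs, the poll thread stops once the last subscriber
/// drops its Arc.
fn open_watcher(canonical_path: PathBuf, backend: Backend) -> Arc<SharedUpdateWatcher> {
    let mut map = cache().lock().unwrap();
    let key = (canonical_path, backend);
    if let Some(live) = map.get(&key).and_then(Weak::upgrade) {
        return live; // subsequent opens attach in microseconds
    }
    let fresh = SharedUpdateWatcher::spawn(&key.0, &key.1);
    map.insert(key, Arc::downgrade(&fresh));
    fresh
}

fn main() {
    let a = open_watcher(PathBuf::from("app.db"), Backend::Polling);
    let b = open_watcher(PathBuf::from("app.db"), Backend::Polling);
    assert!(Arc::ptr_eq(&a, &b)); // second open reuses the cached watcher
}
```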
Bug introduced by the SharedUpdateWatcher caching change in this PR. honker_watcher_close was doing both unsubscribe(sub_id) AND shared.close(). With each handle owning its own SharedUpdateWatcher (the old behavior), close() was harmless. Now that handles share the cached watcher, close() clears EVERY subscriber's sender — disposing one .NET UpdatePoller killed wakes for all other open subscribers.

Surfaced as NotifyStreamTests.ManyConcurrentSubscribersSameStream: 5 concurrent subscribers all subscribe, then one Dispose nuked the other 4's wake channels.

Fix: drop the close() call. SharedUpdateWatcher::Drop signals stop when the last Arc is released; the cache holds Weak refs only. Refcounting handles real teardown.
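A sketch of the Drop-driven teardown the fix relies on, continuing the hypothetical shape from the sketch above; the field names are illustrative and the real SharedUpdateWatcher carries more state:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::JoinHandle;

// Illustrative shape only.
#[allow(dead_code)]
struct SharedUpdateWatcher {
    stop: Arc<AtomicBool>,               // shared with the background poll loop
    poll_thread: Option<JoinHandle<()>>,
}

impl Drop for SharedUpdateWatcher {
    // Runs when the last Arc<SharedUpdateWatcher> is released. Since the cache
    // holds only Weak refs, it never keeps the watcher alive, and the per-handle
    // close() no longer needs to tear anything down.
    fn drop(&mut self) {
        self.stop.store(true, Ordering::Relaxed);
        if let Some(handle) = self.poll_thread.take() {
            let _ = handle.join(); // poll loop sees `stop` and exits
        }
    }
}
```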
The .NET binding's wake mechanism was structurally wrong. Every async operation (Queue.ClaimAsync, Stream.Subscribe, Listener, Scheduler.RunAsync) called CreatePoller() to construct its own UpdatePoller, which spawned a fresh background watcher thread and chunked the native blocking _wait into 100-500ms pieces with Task.Yield. Under any concurrency (xUnit parallel tests, multi-subscriber workloads), this exhausted the .NET thread pool: each "async" wait pinned a runtime thread for hundreds of ms.

New architecture (idiomatic async):

- ONE UpdatePoller per Database, eagerly constructed at Open() and disposed in Database.Dispose. Callers borrow via GetPoller() — they never own the lifecycle.
- ONE long-running dispatcher thread per poller. It calls _wait in a tight loop with a 60s native timeout. On wake, it fans out to every active waiter via per-call Channel<bool> writers.
- WaitForChangeAsync is pure async: it registers a Channel slot, awaits ReadAsync, and removes itself in finally. Zero thread-pool usage.
- 100 concurrent subscribers cost ONE blocking thread, not 100.

Fixes the use-after-free in honker_watcher_close: the FFI was doing Box::from_raw + drop in one call, but the dispatcher thread might still be inside recv_timeout on the handle's receiver when drop ran. Split into honker_watcher_signal_close (only unsubscribes — wakes recv_timeout with Disconnected) + honker_watcher_close (frees the box; must be called only after the dispatcher has joined). .NET Dispose now does the right sequence:

1. signal_close — wakes the dispatcher with -1
2. dispatcher.Join — guarantees no thread is inside _wait
3. close — safe to free the handle

Local sweep of the .NET test suite: 90/90 pass in 2 seconds (previously flaky tests included). The fundamental fix replaces the band-aids.
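A simplified sketch of that split at the FFI boundary. Only the two function names and their required ordering come from the change above; the handle layout, channel type, and exact signatures are assumptions (export attributes omitted):

```rust
use std::sync::mpsc::Sender;

// Simplified handle layout; the real handle carries more.
#[allow(dead_code)]
pub struct WatcherHandle {
    sub_id: u64,
    wake_tx: Option<Sender<i64>>, // the dispatcher blocks on the paired receiver
}

/// Step 1: only unsubscribe. Dropping the sender disconnects the channel, so a
/// dispatcher blocked in recv_timeout wakes with Disconnected (surfaced to .NET
/// as -1) while the handle memory stays valid.
pub extern "C" fn honker_watcher_signal_close(handle: *mut WatcherHandle) {
    if handle.is_null() { return; }
    let h = unsafe { &mut *handle };
    h.wake_tx = None; // the real extension also unsubscribes sub_id here
}

/// Step 2: free the box. Safe only after the caller has joined the dispatcher
/// thread, so nothing can still be blocked on this handle.
pub extern "C" fn honker_watcher_close(handle: *mut WatcherHandle) {
    if handle.is_null() { return; }
    drop(unsafe { Box::from_raw(handle) });
}
```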
The QueueWatcherBackendOneWriter*NoFallback tests give the worker a 2s idle CTS before the parent has finished its writer.Open + 25 sequential enqueues. That's tight on macOS and routinely past budget on Linux runners. The per-job window stays 2s once jobs are flowing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the QueueWatcher CI failures: subprocess "workers" were
spawned via `dotnet test --filter ...`, which forces full vstest host
startup (~30s on macos-14 ARM CI). That blew past WaitReady's 30s
budget intermittently and made the apparent failure shape "worker
never started" — a symptom of the test harness, not of any real bug.
Fix:
- New tests/Honker.QueueHelper console app holds the worker/writer
logic, builds via ProjectReference from Honker.Tests, and starts in
sub-second time.
- Tests now spawn the helper DLL directly (`dotnet helper.dll worker`).
- QueueWatcherBackendProcessHelper xunit method is gone — its body
moved to the helper's Program.cs, which is the only place it ever
ran in production anyway.
- Parent tests now enqueue an explicit { stop = true } job after the
data jobs so worker exits deterministically; the per-helper CTS is
now a 30s safety net rather than the load-bearing exit signal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
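A toy Rust sketch of that exit discipline (the real worker is the C# Honker.QueueHelper console app, and the in-memory VecDeque stands in for the cross-process queue):

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Toy stand-in for a claimed job; the real worker claims from the queue tables.
struct Job { stop: bool, payload: u32 }

// Data jobs are processed as they arrive, an explicit stop job ends the loop
// deterministically, and the 30s deadline is only a safety net, never the
// load-bearing exit signal.
fn run_worker(mut queue: VecDeque<Job>) -> Vec<u32> {
    let safety_net = Instant::now() + Duration::from_secs(30);
    let mut results = Vec::new();
    loop {
        match queue.pop_front() {
            Some(job) if job.stop => break,               // deterministic exit
            Some(job) => results.push(job.payload),       // data job
            None if Instant::now() >= safety_net => break, // safety net only
            None => std::thread::sleep(Duration::from_millis(10)),
        }
    }
    results
}

fn main() {
    let mut jobs: VecDeque<Job> = (0..25).map(|i| Job { stop: false, payload: i }).collect();
    jobs.push_back(Job { stop: true, payload: 0 });
    assert_eq!(run_worker(jobs).len(), 25); // 25 data jobs, then the stop job
}
```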
The eager+shared UpdatePoller pattern (introduced in 50400f1) caused worker subprocesses on Linux/Windows to never see committed jobs — empty results across all backends, including the polling fallback. macOS happened to work, but Linux and Windows were systemically broken.

Restoring 5248f94's pattern: Database.Open probes-and-discards, and each consumer owns its own poller via `using var poller = CreatePoller()`. The Rust-side SharedUpdateWatcher cache (added in e5d2a5c) means the underlying watcher thread is still shared across .NET pollers for the same (path, backend), so per-call construction stays cheap.

Local 3-run check: 2/3 pass cleanly, 1/3 has the same residual shm-backend flake as before — much closer to the 5248f94 baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The kernel-watcher and shm-fast-path backends are experimental, behind feature flags, and not in published wheels. They have known cross-process flakes on Linux/Windows even with the proper-fix helper extraction in place — `shm` SIGBUSes on Linux, `kernel` fails to deliver wakes through the worker subprocess on Windows. Their in-process behavior is still covered by the OpenWatcherBackendsDetectCommits tests on every platform. The cross-process subprocess tests stay on the polling default, which is the only backend a real .NET consumer would currently use. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the parent test sees an empty/short result set from a worker subprocess, the failure message gives no signal about whether Database.Open hung, ClaimAsync never fired, the polling watcher missed wakes, or jobs simply never landed. The worker now writes a timestamped log to <resultPath>.log; the new AssertWorkerResult helper inlines that log into the assertion failure message so CI output points at the actual stuck step rather than just "expected 25, got 0". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
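A Rust-flavoured sketch of what that assertion helper does; the real AssertWorkerResult lives in the .NET test project and this signature is hypothetical:

```rust
use std::fs;
use std::path::{Path, PathBuf};

fn assert_worker_result(got: &[String], expected: usize, result_path: &Path) {
    if got.len() == expected {
        return;
    }
    // The worker writes a timestamped log to <resultPath>.log; inline it so the
    // CI failure points at the stuck step instead of just "expected 25, got 0".
    let log_path = PathBuf::from(format!("{}.log", result_path.display()));
    let log = fs::read_to_string(&log_path)
        .unwrap_or_else(|_| "<no worker log found>".to_string());
    panic!("expected {expected} results, got {}; worker log:\n{log}", got.len());
}
```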
Superseded. Main shipped 285d9bb "Stabilize watcher proof CI" while this PR was iterating, which independently solves the same .NET subprocess-helper problem (and does it more cleanly). The roadmap additions (Phase DiMaggio + Phase Berra) are carved out into #47. Other branch work that didn't make the cut:
🤖 Generated with Claude Code
two unrelated changes bundled because they're both small:
roadmap (HN feedback)
Phase DiMaggio — adaptive polling backoff. After IDLE_THRESHOLD_S of no commits, scale poll interval up geometrically (1ms → cap at 1s). Reset to 1ms on any wake. Addresses both the polling-overhead critique and the mobile/battery concern from the HN thread. Default backend, no opt-in.
Phase Berra — HN-feedback docs sweep. Update README prior-art for Oban-supports-SQLite (@arlobish), add Graphile Worker (@odie5533), publish bench numbers (@andrewstuart), add a "When NOT to use" + multi-process WAL story (@ncruces / defangs the "SQLite isn't concurrent" critique from @maxdo), surface JVM binding in README + nav.
.NET test timeout fixes — closes #44, #45
- `WaitReady` 15s → 60s. The helper is spawned via `dotnet test --filter ...`, which has 5-15s of test-discovery boot overhead. The original window had no margin — every `QueueWatcherBackend*NoFallback` test was timing out at exactly 15s waiting for the helper to write its ready file. 60s comfortably covers boot; locally it still returns immediately.
- `WaitResultObservesSaveRacingTheCall` < 1s → < 1.5s. The assertion is "the wake came via the watcher, not via the 5s fallback poll." 1s was too tight (Thread.Sleep(50) on Windows can be 60-90ms, plus inter-thread coordination); 1.5s still rules out the fallback by 3x.
- `OutboxRetriesOnException` WaitUntilAsync 3s → 10s, outer cts 5s → 15s. 3 retries + backoff + worker scheduling needed more headroom.

🤖 Generated with Claude Code