fix(analytics): run AccountStore blocking calls on std::thread (#1476)#1506
fix(analytics): run AccountStore blocking calls on std::thread (#1476)#1506bingran-you wants to merge 1 commit intodevfrom
Conversation
Replaces the three remaining `tokio::task::spawn_blocking` call sites in `service::analytics` (the `/analytics/track` and `/analytics/dashboard` handlers) with a new `AccountStore::run_blocking` helper that offloads the sync `postgres` / r2d2 work onto a dedicated `std::thread` and awaits the result via `tokio::sync::oneshot`. This closes the latent abort path carried over from PR #1454 (see issue #1476 for the audit and #1451 for the original production crash): when r2d2 recycles a broken Supabase connection, the sync `postgres::Client::drop` impl calls `Runtime::block_on`, which panics with "Cannot start a runtime from within a runtime" on tokio blocking-pool threads (they still carry a runtime Handle) and aborts the whole worker process because the panic fires inside a destructor. - Add `AccountStore::run_blocking<F, T>` as the await-able counterpart to the existing `record_analytics_event_detached` helper. - Switch `track_event` (both `get_account_by_auth_user` and `record_analytics_event`) and `get_dashboard` to use it. - Drop the now-unused `tokio::task` import in `service/analytics.rs`. - Document the rule at each replaced call site so future edits keep the pattern. The follow-up `service/billing.rs` (5 sites) and `service/auth.rs` (40+ sites) listed in #1476 are intentionally deferred to keep this change focused.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
bingran-you
left a comment
There was a problem hiding this comment.
One concern before merge: the new helper in DoWhiz_service/scheduler_module/src/account_store.rs:1266 documents that synchronous AccountStore r2d2/postgres work should never run on tokio::task::spawn_blocking, because recycling a postgres::Client there can abort the worker. This PR migrates service/analytics.rs, but the same spawn_blocking(move || store.*) pattern is still present in other async HTTP paths such as DoWhiz_service/scheduler_module/src/service/auth.rs:707 and DoWhiz_service/scheduler_module/src/service/billing.rs:140.
If the root cause is general to AccountStore, the crash path is still reachable through those handlers, so the invariant introduced here is only partially enforced. If analytics is the only affected surface, I think the new helper/docs should be scoped down to say that explicitly; otherwise I’d prefer to migrate the remaining request-path call sites or centralize the safe wrapper before calling this fixed.
This reply was drafted by breeze, an autonomous agent running on behalf of the account owner.
bingran-you
left a comment
There was a problem hiding this comment.
One concern before merge: the new helper in DoWhiz_service/scheduler_module/src/account_store.rs documents that synchronous AccountStore r2d2/postgres work should never run on tokio::task::spawn_blocking, because recycling a postgres::Client there can abort the worker. This PR migrates service/analytics.rs, but the same spawn_blocking(move || store.*) pattern is still present in other async HTTP paths such as DoWhiz_service/scheduler_module/src/service/auth.rs and DoWhiz_service/scheduler_module/src/service/billing.rs.
If the root cause is general to AccountStore, the crash path is still reachable through those handlers, so the invariant introduced here is only partially enforced. If analytics is the only affected surface, I think the new helper/docs should be scoped down to say that explicitly; otherwise I’d prefer to migrate the remaining request-path call sites or centralize the safe wrapper before calling this fixed.
This reply was drafted by breeze, an autonomous agent running on behalf of the account owner.
Summary
Closes the latent abort path flagged in #1476 (follow-up to #1454 / #1451).
The three remaining
tokio::task::spawn_blocking(move || store.…)call sites inscheduler_module::service::analytics—/analytics/track(account resolution + event insert) and/analytics/dashboard— now run on a dedicatedstd::threadvia a newAccountStore::run_blockinghelper that awaits the result throughtokio::sync::oneshot.Without this, any Supabase hiccup that causes r2d2 to recycle a broken connection on these endpoints will:
postgres::Clienton a tokio blocking-pool thread (they carry a runtime Handle).DropcallsRuntime::block_on(close_rendezvous())→ panics with "Cannot start a runtime from within a runtime".abort()→ PM2 restart.Worker log evidence from the post-#1454 production / staging scan today (2026-04-21):
/home/azureuser/.pm2/logs/dw-worker-error.logstill contains this exact panic, with proddw_workershowing 118 PM2 restarts and stagingdw_worker151 restarts accumulated since the last hot-swap window.Changes
account_store.rs: addpub async fn run_blocking<F, T>— the await-able counterpart of the existing fire-and-forgetrecord_analytics_event_detachedhelper. Same thread model, same safety comment calling out thepostgres::Client::droptrap.service/analytics.rs:track_event→ bothget_account_by_auth_userandrecord_analytics_eventuseAccountStore::run_blocking.get_dashboard→ same treatment for the batch r2d2 query closure.use tokio::task;import.NOTE:comments at each site pointing back to [P0] dw_worker aborts: sync postgres Client::drop panics in AccountStore::record_analytics_event (prod) #1451 / [P1] analytics.rs /analytics/track still uses tokio::spawn_blocking for AccountStore — latent postgres Drop crash #1476 so future edits don't regress tospawn_blocking.The follow-up
service/billing.rs(5 sites) andservice/auth.rs(40+ sites) listed in #1476 are intentionally deferred to keep this PR focused and reviewable.Test plan
cargo check -p scheduler_module— clean (pre-existing warnings only).cargo test -p scheduler_module analytics(integration tests in CI)./home/azureuser/.pm2/logs/dw-worker-error.logstops accumulating newpostgres-0.19.12/src/connection.rs:66:22panics on prod / staging.Refs: #1451, #1454, #1476.