
fix(gateway): exponential backoff for Google Drive watch-channel rate limits#1529

Open
bingran-you wants to merge 1 commit into dev from bry/fix-gdrive-watch-backoff

Conversation

@bingran-you
Contributor

Summary

Fixes #1528. The inbound gateway's Google Workspace poller was hammering Google's `files.watch` API on every poll cycle (30s) and on every restart, repeatedly hitting `subscriptionRateLimitExceeded` and silently losing file-change notifications.

Changes

  • Replace `failed_watch_files: HashSet<String>` with `failed_watch_backoff: HashMap<String, (Instant, u32)>`, which tracks a retry-not-before instant and an exponential-backoff step per file.
  • Classify 403 `subscriptionRateLimitExceeded` as a rate-limited failure with a 5m base delay capped at 6h; other failures use a 2m base delay capped at 1h.
  • Cap new watch registrations at 5 per poll cycle to avoid burst behavior on a cold start.
  • On a rate-limit hit, stop further registrations within the same cycle to avoid accelerating the storm.
  • Clear a file's backoff entry on successful registration.
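
For reviewers, the backoff bookkeeping described above could be sketched roughly as follows. This is not the code in this PR: the `WatchFailure` enum, the helper names, the doubling factor, and starting the step at 0 on the first failure are all assumptions made for illustration. Only the map shape `HashMap<String, (Instant, u32)>` and the 5m/6h and 2m/1h schedules come from the description.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical failure classes; the PR only names the two delay schedules.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum WatchFailure {
    /// 403 subscriptionRateLimitExceeded
    RateLimited,
    /// Any other registration failure
    Other,
}

impl WatchFailure {
    /// (base delay, cap) from the PR description: 5m/6h and 2m/1h.
    fn schedule(self) -> (Duration, Duration) {
        match self {
            WatchFailure::RateLimited => {
                (Duration::from_secs(5 * 60), Duration::from_secs(6 * 60 * 60))
            }
            WatchFailure::Other => (Duration::from_secs(2 * 60), Duration::from_secs(60 * 60)),
        }
    }
}

/// Delay for a step: base * 2^step, clamped to the cap.
/// The doubling factor is an assumption; the PR only says "exponential".
fn backoff_delay(kind: WatchFailure, step: u32) -> Duration {
    let (base, cap) = kind.schedule();
    let factor = 2u32.checked_pow(step).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(cap).min(cap)
}

/// Same shape as `failed_watch_backoff`: file id -> (retry-not-before, step).
type BackoffMap = HashMap<String, (Instant, u32)>;

/// Record a failure, advance the file's step, and return the chosen delay.
fn record_failure(
    map: &mut BackoffMap,
    file_id: &str,
    kind: WatchFailure,
    now: Instant,
) -> Duration {
    // First failure uses step 0 (assumed); later failures increment the step.
    let step = map.get(file_id).map_or(0, |&(_, s)| s.saturating_add(1));
    let delay = backoff_delay(kind, step);
    map.insert(file_id.to_owned(), (now + delay, step));
    delay
}

/// A file may be retried if it has no entry or its not-before instant has passed.
fn may_retry(map: &BackoffMap, file_id: &str, now: Instant) -> bool {
    map.get(file_id).map_or(true, |&(not_before, _)| now >= not_before)
}

/// A successful registration clears the file's backoff entry.
fn record_success(map: &mut BackoffMap, file_id: &str) {
    map.remove(file_id);
}
```

Under these assumptions a rate-limited file waits 5m, 10m, 20m and so on until the 6h cap takes over, while other failures reach the 1h cap after a handful of steps. Passing `now` in explicitly keeps the helpers testable without sleeping.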

Not in this PR (intentionally)

Test plan

  • `cargo check -p scheduler_module --bin inbound_gateway` passes cleanly
  • Deploy to staging VM, confirm `pm2 logs dw_gateway` shows `backoff <N>s (step <K>)` instead of `- will not retry`
  • Monitor for 1 hour; confirm no more than 5 registration attempts per poll cycle and no `subscriptionRateLimitExceeded` floods
  • Tail logs after a manual `pm2 restart dw_gateway`; confirm cold-start does not immediately trigger 403s
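
The per-cycle behavior that the test plan checks (at most 5 attempts, and stopping early on a rate-limit hit) can also be exercised without a staging deploy. The sketch below is illustrative only: `Attempt`, `run_cycle`, `MAX_REGISTRATIONS_PER_CYCLE`, and the closure standing in for the real `files.watch` call are hypothetical names, and it assumes the candidate list has already been filtered to files whose backoff has expired.

```rust
/// Outcome of one registration attempt (hypothetical type, not from the PR).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Attempt {
    Registered,
    RateLimited,
    Failed,
}

/// Cap from the PR description: 5 new registrations per poll cycle.
const MAX_REGISTRATIONS_PER_CYCLE: usize = 5;

/// Run one poll cycle over candidate file ids and return what was attempted.
/// `try_register` stands in for the real watch-channel registration call.
fn run_cycle<F>(candidates: &[&str], mut try_register: F) -> Vec<(String, Attempt)>
where
    F: FnMut(&str) -> Attempt,
{
    let mut attempted = Vec::new();
    // `take` bounds the number of attempts, not just the number of successes.
    for &id in candidates.iter().take(MAX_REGISTRATIONS_PER_CYCLE) {
        let outcome = try_register(id);
        attempted.push((id.to_string(), outcome));
        // A rate-limit hit ends the cycle so later files are not attempted.
        if outcome == Attempt::RateLimited {
            break;
        }
    }
    attempted
}
```

With seven candidates and a stub that always succeeds, only the first five are attempted; if the stub reports a rate limit on the second file, the cycle ends after two attempts and the remaining files wait for a later cycle.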

Filed by the scheduled dowhiz-service-debug task.

… limits

The previous implementation tracked failed watch-channel registrations in
an in-memory HashSet that was cleared on every gateway restart. When
Google Drive rejected registrations with subscriptionRateLimitExceeded,
the gateway continued hammering the endpoint on every poll cycle (30s)
and every restart, silently losing file-change notifications for users'
shared docs/sheets.

- Replace failed_watch_files HashSet with a backoff map that records a
  retry-not-before instant and an exponential step per file.
- Classify 403 subscriptionRateLimitExceeded (and similar) as a rate-
  limited failure with a 5m base delay capped at 6h. Other failures use
  a 2m base capped at 1h.
- Cap new registrations at 5 per poll cycle to avoid cold-start bursts.
- On a rate-limit hit, stop further registrations within the same cycle.
- Clear a file's backoff entry on successful registration.

Refs #1528.
@vercel

vercel Bot commented Apr 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| dowhiz | Ready | Preview, Comment | Apr 22, 2026 0:15am |


Labels

breeze:done Breeze finished handling this item

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Google Drive watch-channel registration hits rate limit; no persistence/backoff

1 participant