Skip to content

fix(discord): prevent rate limit retry storm#1586

Open
simonkberg wants to merge 2 commits intomainfrom
fix/discord-rate-limit-retry-storm
Open

fix(discord): prevent rate limit retry storm#1586
simonkberg wants to merge 2 commits intomainfrom
fix/discord-rate-limit-retry-storm

Conversation

@simonkberg
Copy link
Owner

Summary

  • Add shared per-endpoint rate limit gate (rateLimitUntil Map) so concurrent requests wait on the same timestamp instead of independently retrying and consuming each other's rate limit budget
  • Cap retries at 5 (MAX_RETRIES) to prevent 30-second retry storms when multiple requests sustain the rate limit
  • Include retries count in log messages for observability

Test plan

  • All 397 existing tests pass
  • New test: max retry count exceeded (always 429) → throws after 5 retries
  • New test: shared rate limit gate prevents redundant fetches from concurrent requests
  • New test: retry count appears in log messages
  • pnpm lint passes (TypeScript, ESLint, Prettier)

🤖 Generated with Claude Code

Concurrent requests to the same Discord endpoint independently retried
on 429 responses, each consuming the rate limit budget and preventing
others from succeeding. This created a self-sustaining retry loop
lasting ~30 seconds per cycle.

Add a shared per-endpoint rate limit gate so all concurrent callers
wait on the same timestamp, and cap retries at 5 to prevent prolonged
retry storms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@railway-app railway-app bot temporarily deployed to simon.dev / simon.dev-pr-1586 February 10, 2026 19:29 Destroyed
@railway-app
Copy link

railway-app bot commented Feb 10, 2026

🚅 Deployed to the simon.dev-pr-1586 environment in simon.dev

Service Status Web Updated (UTC)
simon.dev ✅ Success (View Logs) Web Feb 10, 2026 at 7:51 pm

@claude
Copy link
Contributor

claude bot commented Feb 10, 2026

Claude finished @simonkberg's task in 1m 28s —— View job


Code Review Complete ✅

I've performed a comprehensive review of the rate limiting improvements across code quality, security, performance, testing, and documentation. This is an excellent implementation with strong engineering practices throughout.


Summary

The PR successfully solves the rate limit retry storm problem with a clean, well-tested solution. The implementation demonstrates:

  • ✅ Solid architectural design with shared per-endpoint gates
  • ✅ Comprehensive test coverage (85% patch, 97.48% overall)
  • ✅ Excellent test quality with edge case coverage
  • ✅ Clean, maintainable code
  • ✅ Good observability through structured logging

Detailed Review

1. Code Quality ⭐️⭐️⭐️⭐️⭐️

Strengths:

  • Clean implementation: The shared rateLimitUntil Map is elegant and the Math.max() pattern (api.ts:112) prevents gate overwrites—smart defensive programming
  • Proper separation of concerns: Test-only exports (_resetRateLimitState, _setRateLimitGate) are clearly marked with @internal JSDoc
  • Excellent error handling: Dual exit conditions (retries > MAX_RETRIES || elapsedMs + retryAfterMs > RATE_LIMIT_TIMEOUT_MS) cover both scenarios
  • Good defensive coding: Fallback to 1000ms when retry_after is missing (api.ts:91-93)

Minor observations:

  • The gate cleanup on success (api.ts:124) is correct—as you noted in your evaluation, the gate naturally expires via timestamp comparison
  • No issues with readability or maintainability

2. Security ✅

No security concerns identified:

  • ✅ No new attack surfaces introduced
  • ✅ Bot token remains properly protected in headers
  • ✅ No injection risks (rate limiting is internal state management)
  • ✅ No credential leaks in logs (structured logging excludes sensitive data)

3. Performance ✅

Strengths:

  • Efficient synchronization: The shared gate prevents redundant API calls—test shows 3 fetches instead of 4 (api.test.ts:922)
  • Low memory overhead: Single Map with endpoint keys, natural cleanup on success
  • No blocking: Uses setTimeout promises rather than busy-waiting

No performance concerns.

4. Testing ⭐️⭐️⭐️⭐️⭐️

Exceptional test quality:

  • Comprehensive coverage: 8 new rate limiting tests covering:
    • Retry-and-succeed flow (api.test.ts:720)
    • Timeout exceeded via elapsed time (api.test.ts:807)
    • Max retries exceeded (api.test.ts:859) ← key fix validation
    • Shared gate behavior (api.test.ts:891) ← key fix validation
    • Gate timeout path (api.test.ts:925)
    • Retry count in logs (api.test.ts:943)
  • Excellent use of fake timers: Proper async timer advancement (vi.advanceTimersByTimeAsync)
  • Good test isolation: _resetRateLimitState() in afterEach prevents cross-contamination
  • Clear test intent: Inline comments explain fetch count expectations (api.test.ts:920-922)

Re: Codecov 85% patch coverage (3 missing lines):

Based on the Codecov report, the missing lines are likely:

  1. The waitMs > 0 branch (api.ts:62) when gate is expired (≤ 0)
  2. Parts of the timeout check or success path

These are acceptable coverage gaps:

  • The expired gate path (waitMs <= 0) is a fast-path optimization—no wait occurs, execution continues immediately
  • Testing it would require precise timing manipulation without much value
  • The core logic (wait when needed, timeout on excessive wait) is thoroughly tested

Recommendation: The current 85% patch coverage is excellent for rate limiting code. The missing lines don't represent meaningful risk.

5. Documentation ✅

Adequate documentation:

  • ✅ Inline comment explains the Map purpose (api.ts:27)
  • ✅ JSDoc marks test exports as @internal (api.ts:30, 35)
  • ✅ PR description clearly explains the changes
  • ✅ Log messages provide good observability

No documentation concerns. The code is self-explanatory, and the PR body provides sufficient context.


Specific Observations

Retry Count Logic (api.ts:86-106)

The retry counter increments before the timeout check (api.ts:88). This means:

  • 1st attempt (retries=0) → 429 → retries=1 → check retries > MAX_RETRIES (false, continue)
  • ...
  • 6th attempt (retries=5) → 429 → retries=6 → check retries > MAX_RETRIES (true, throw)

Result: Allows 1 initial + 5 retries = 6 total attempts (verified by test at api.test.ts:883)

This is correct behavior. The constant name MAX_RETRIES accurately reflects 5 retries beyond the initial attempt.

Shared Gate Pattern (api.ts:109-112)

const newUntil = Date.now() + retryAfterMs;
const existing = rateLimitUntil.get(endpoint) ?? 0;
rateLimitUntil.set(endpoint, Math.max(existing, newUntil));

Excellent defensive programming. Using Math.max() prevents a race condition where concurrent requests receiving different retry_after values could overwrite a longer gate with a shorter one. This ensures the gate always represents the longest wait time.

Test Quality Highlights

  • Race condition test (api.test.ts:891): Brilliantly validates shared state by counting fetches—proves the fix works
  • Clear expectations (api.test.ts:920-922): Comments explain the "without vs with shared state" fetch counts
  • Precise timeout test (api.test.ts:925): Uses _setRateLimitGate() to test the exact timeout boundary (31s gate > 30s timeout)

Recommendations

None—this PR is ready to merge. 🎉

The implementation is clean, well-tested, and solves the stated problem effectively. The 85% patch coverage is strong, and the missing lines don't represent meaningful risk. Your evaluation of the previous suggestions was spot-on.


Final Verdict

Approve and merge. This is production-quality code that demonstrates:

  • Strong understanding of concurrency patterns
  • Excellent testing practices
  • Clean, maintainable implementation
  • Good observability through structured logging

Great work! 👏


@codecov
Copy link

codecov bot commented Feb 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.81%. Comparing base (5076f35) to head (4ffb6d1).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1586      +/-   ##
==========================================
+ Coverage   97.76%   97.81%   +0.04%     
==========================================
  Files          49       49              
  Lines         896      916      +20     
  Branches      156      161       +5     
==========================================
+ Hits          876      896      +20     
  Misses          8        8              
  Partials       12       12              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@simonkberg
Copy link
Owner Author

@claude Evaluated both suggestions:

1. Clear gate on non-429 errors — Not implementing. The gate represents Discord's stated rate limit window. A subsequent 500 doesn't mean the rate limit has ended — it's a separate error. Clearing the gate prematurely would let other requests hit the rate limit again. The gate expires naturally when waitUntil - Date.now() <= 0, so it never "sticks" beyond the original retry_after duration.

2. Extended inline comment — The existing comment (// Per-endpoint rate limit tracking: endpoint → timestamp until which requests should wait) is sufficient. The code is straightforward.

@claude
Copy link
Contributor

claude bot commented Feb 10, 2026

Claude finished @simonkberg's task in 19s —— View job


Response to Feedback

Thank you for the thoughtful evaluation. Your reasoning on both points makes sense:

1. Gate cleanup on non-429 errors: You're correct that the gate represents Discord's declared rate limit window, and a 500 error doesn't invalidate that window. The natural expiration via waitUntil - Date.now() <= 0 handles this properly without premature clearing.

2. Inline comment: The existing comment adequately documents the purpose. The implementation is clear enough without additional explanation.

Both decisions are sound. The PR is well-designed and ready to merge as-is. ✅


Add _setRateLimitGate test helper to directly set a rate limit gate,
since the gate timeout path is unreachable through natural 429 responses
alone (the gate is bounded by RATE_LIMIT_TIMEOUT_MS).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments