fix(discord): prevent rate limit retry storm by simonkberg · Pull Request #1586 · simonkberg/simon.dev

simonkberg · 2026-02-10T19:29:51Z

Summary

Add shared per-endpoint rate limit gate (rateLimitUntil Map) so concurrent requests wait on the same timestamp instead of independently retrying and consuming each other's rate limit budget
Cap retries at 5 (MAX_RETRIES) to prevent 30-second retry storms when multiple requests sustain the rate limit
Include retries count in log messages for observability

Test plan

All 397 existing tests pass
New test: max retry count exceeded (always 429) → throws after 5 retries
New test: shared rate limit gate prevents redundant fetches from concurrent requests
New test: retry count appears in log messages
pnpm lint passes (TypeScript, ESLint, Prettier)

🤖 Generated with Claude Code

Concurrent requests to the same Discord endpoint independently retried on 429 responses, each consuming the rate limit budget and preventing others from succeeding. This created a self-sustaining retry loop lasting ~30 seconds per cycle. Add a shared per-endpoint rate limit gate so all concurrent callers wait on the same timestamp, and cap retries at 5 to prevent prolonged retry storms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

railway-app · 2026-02-10T19:29:58Z

🚅 Deployed to the simon.dev-pr-1586 environment in simon.dev

Service	Status	Web	Updated (UTC)
simon.dev	✅ Success (View Logs)	Web	Feb 10, 2026 at 7:51 pm

claude · 2026-02-10T19:30:27Z

Claude finished @simonkberg's task in 1m 28s —— View job

Code Review Complete ✅

I've performed a comprehensive review of the rate limiting improvements across code quality, security, performance, testing, and documentation. This is an excellent implementation with strong engineering practices throughout.

Summary

The PR successfully solves the rate limit retry storm problem with a clean, well-tested solution. The implementation demonstrates:

✅ Solid architectural design with shared per-endpoint gates
✅ Comprehensive test coverage (85% patch, 97.48% overall)
✅ Excellent test quality with edge case coverage
✅ Clean, maintainable code
✅ Good observability through structured logging

Detailed Review

1. Code Quality ⭐️⭐️⭐️⭐️⭐️

Strengths:

Clean implementation: The shared rateLimitUntil Map is elegant and the Math.max() pattern (api.ts:112) prevents gate overwrites—smart defensive programming
Proper separation of concerns: Test-only exports (_resetRateLimitState, _setRateLimitGate) are clearly marked with @internal JSDoc
Excellent error handling: Dual exit conditions (retries > MAX_RETRIES || elapsedMs + retryAfterMs > RATE_LIMIT_TIMEOUT_MS) cover both scenarios
Good defensive coding: Fallback to 1000ms when retry_after is missing (api.ts:91-93)

Minor observations:

The gate cleanup on success (api.ts:124) is correct—as you noted in your evaluation, the gate naturally expires via timestamp comparison
No issues with readability or maintainability

2. Security ✅

No security concerns identified:

✅ No new attack surfaces introduced
✅ Bot token remains properly protected in headers
✅ No injection risks (rate limiting is internal state management)
✅ No credential leaks in logs (structured logging excludes sensitive data)

3. Performance ✅

Strengths:

Efficient synchronization: The shared gate prevents redundant API calls—test shows 3 fetches instead of 4 (api.test.ts:922)
Low memory overhead: Single Map with endpoint keys, natural cleanup on success
No blocking: Uses setTimeout promises rather than busy-waiting

No performance concerns.

4. Testing ⭐️⭐️⭐️⭐️⭐️

Exceptional test quality:

Comprehensive coverage: 8 new rate limiting tests covering:
- Retry-and-succeed flow (api.test.ts:720)
- Timeout exceeded via elapsed time (api.test.ts:807)
- Max retries exceeded (api.test.ts:859) ← key fix validation
- Shared gate behavior (api.test.ts:891) ← key fix validation
- Gate timeout path (api.test.ts:925)
- Retry count in logs (api.test.ts:943)
Excellent use of fake timers: Proper async timer advancement (vi.advanceTimersByTimeAsync)
Good test isolation: _resetRateLimitState() in afterEach prevents cross-contamination
Clear test intent: Inline comments explain fetch count expectations (api.test.ts:920-922)

Re: Codecov 85% patch coverage (3 missing lines):

Based on the Codecov report, the missing lines are likely:

The waitMs > 0 branch (api.ts:62) when gate is expired (≤ 0)
Parts of the timeout check or success path

These are acceptable coverage gaps:

The expired gate path (waitMs <= 0) is a fast-path optimization—no wait occurs, execution continues immediately
Testing it would require precise timing manipulation without much value
The core logic (wait when needed, timeout on excessive wait) is thoroughly tested

Recommendation: The current 85% patch coverage is excellent for rate limiting code. The missing lines don't represent meaningful risk.

5. Documentation ✅

Adequate documentation:

✅ Inline comment explains the Map purpose (api.ts:27)
✅ JSDoc marks test exports as @internal (api.ts:30, 35)
✅ PR description clearly explains the changes
✅ Log messages provide good observability

No documentation concerns. The code is self-explanatory, and the PR body provides sufficient context.

Specific Observations

Retry Count Logic (api.ts:86-106)

The retry counter increments before the timeout check (api.ts:88). This means:

1st attempt (retries=0) → 429 → retries=1 → check retries > MAX_RETRIES (false, continue)
...
6th attempt (retries=5) → 429 → retries=6 → check retries > MAX_RETRIES (true, throw)

Result: Allows 1 initial + 5 retries = 6 total attempts (verified by test at api.test.ts:883)

This is correct behavior. The constant name MAX_RETRIES accurately reflects 5 retries beyond the initial attempt.

Shared Gate Pattern (api.ts:109-112)

const newUntil = Date.now() + retryAfterMs;
const existing = rateLimitUntil.get(endpoint) ?? 0;
rateLimitUntil.set(endpoint, Math.max(existing, newUntil));

Excellent defensive programming. Using Math.max() prevents a race condition where concurrent requests receiving different retry_after values could overwrite a longer gate with a shorter one. This ensures the gate always represents the longest wait time.

Test Quality Highlights

Race condition test (api.test.ts:891): Brilliantly validates shared state by counting fetches—proves the fix works
Clear expectations (api.test.ts:920-922): Comments explain the "without vs with shared state" fetch counts
Precise timeout test (api.test.ts:925): Uses _setRateLimitGate() to test the exact timeout boundary (31s gate > 30s timeout)

Recommendations

None—this PR is ready to merge. 🎉

The implementation is clean, well-tested, and solves the stated problem effectively. The 85% patch coverage is strong, and the missing lines don't represent meaningful risk. Your evaluation of the previous suggestions was spot-on.

Final Verdict

Approve and merge. This is production-quality code that demonstrates:

Strong understanding of concurrency patterns
Excellent testing practices
Clean, maintainable implementation
Good observability through structured logging

Great work! 👏

codecov · 2026-02-10T19:40:22Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.81%. Comparing base (5076f35) to head (4ffb6d1).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1586      +/-   ##
==========================================
+ Coverage   97.76%   97.81%   +0.04%     
==========================================
  Files          49       49              
  Lines         896      916      +20     
  Branches      156      161       +5     
==========================================
+ Hits          876      896      +20     
  Misses          8        8              
  Partials       12       12

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

simonkberg · 2026-02-10T19:41:19Z

@claude Evaluated both suggestions:

1. Clear gate on non-429 errors — Not implementing. The gate represents Discord's stated rate limit window. A subsequent 500 doesn't mean the rate limit has ended — it's a separate error. Clearing the gate prematurely would let other requests hit the rate limit again. The gate expires naturally when waitUntil - Date.now() <= 0, so it never "sticks" beyond the original retry_after duration.

2. Extended inline comment — The existing comment (// Per-endpoint rate limit tracking: endpoint → timestamp until which requests should wait) is sufficient. The code is straightforward.

claude · 2026-02-10T19:41:43Z

Claude finished @simonkberg's task in 19s —— View job

Response to Feedback

Thank you for the thoughtful evaluation. Your reasoning on both points makes sense:

1. Gate cleanup on non-429 errors: You're correct that the gate represents Discord's declared rate limit window, and a 500 error doesn't invalidate that window. The natural expiration via waitUntil - Date.now() <= 0 handles this properly without premature clearing.

2. Inline comment: The existing comment adequately documents the purpose. The implementation is clear enough without additional explanation.

Both decisions are sound. The PR is well-designed and ready to merge as-is. ✅

Add _setRateLimitGate test helper to directly set a rate limit gate, since the gate timeout path is unreachable through natural 429 responses alone (the gate is bounded by RATE_LIMIT_TIMEOUT_MS). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

railway-app bot temporarily deployed to simon.dev / simon.dev-pr-1586 February 10, 2026 19:29 Destroyed

railway-app bot deployed to simon.dev / simon.dev-pr-1586 February 10, 2026 19:50 View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(discord): prevent rate limit retry storm#1586

fix(discord): prevent rate limit retry storm#1586
simonkberg wants to merge 2 commits intomainfrom
fix/discord-rate-limit-retry-storm

simonkberg commented Feb 10, 2026

Uh oh!

railway-app bot commented Feb 10, 2026 •

edited

Loading

Uh oh!

claude bot commented Feb 10, 2026 •

edited

Loading

Uh oh!

codecov bot commented Feb 10, 2026 •

edited

Loading

Uh oh!

simonkberg commented Feb 10, 2026

Uh oh!

claude bot commented Feb 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

simonkberg commented Feb 10, 2026

Summary

Test plan

Uh oh!

railway-app bot commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude bot commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Complete ✅

Summary

Detailed Review

1. Code Quality ⭐️⭐️⭐️⭐️⭐️

2. Security ✅

3. Performance ✅

4. Testing ⭐️⭐️⭐️⭐️⭐️

5. Documentation ✅

Specific Observations

Retry Count Logic (api.ts:86-106)

Shared Gate Pattern (api.ts:109-112)

Test Quality Highlights

Recommendations

Final Verdict

Uh oh!

codecov bot commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

simonkberg commented Feb 10, 2026

Uh oh!

claude bot commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Response to Feedback

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

railway-app bot commented Feb 10, 2026 •

edited

Loading

claude bot commented Feb 10, 2026 •

edited

Loading

codecov bot commented Feb 10, 2026 •

edited

Loading

claude bot commented Feb 10, 2026 •

edited

Loading