fix(engine): ensure first keepalive tick is not skipped to prevent timeout on ws hibernation #3507

NathanFlurry · 2025-11-21T00:11:53Z

No description provided.

vercel · 2025-11-21T00:11:57Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
rivetkit-serverless	Error			Nov 21, 2025 0:27am

3 Skipped Deployments

Project	Deployment	Preview	Updated (UTC)
rivet-cloud	Ignored		Nov 21, 2025 0:27am
rivet-inspector	Ignored	Preview	Nov 21, 2025 0:27am
rivet-site	Ignored	Preview	Nov 21, 2025 0:27am

NathanFlurry · 2025-11-21T00:12:24Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

_{An organization admin has enabled the Graphite Merge Queue in this repository.} _{Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.}

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2025-11-21T00:14:09Z

Code Review: WebSocket Hibernation Keepalive Fix

Summary

This PR addresses a critical bug where the first keepalive tick was being skipped in the WebSocket hibernation mechanism, which could cause premature timeouts. The fix ensures that keepalive updates happen immediately and consistently throughout the hibernation period.

✅ Positive Changes

Critical Bug Fix - Removing the initial ping_interval.tick().await call in keepalive_task.rs:33 is the correct fix. The previous code was discarding the first tick, which meant the first keepalive update wouldn't happen until the second interval (45 seconds by default), potentially causing hibernation timeout (90 seconds default).
Removed Redundant Code - The removal of the premature hibernating_request::upsert call from lib.rs:578-588 is correct. This operation is now handled by the keepalive task, which is a better separation of concerns.
Improved Observability - Added debug logging in both files:
- keepalive_task.rs:45-50 - Logs each keepalive update with actor, gateway, and request IDs
- shared_state.rs:516-520 - Logs elapsed time vs timeout threshold when checking hibernation state

⚠️ Issues & Concerns

Critical: Missing Space (Compilation Error)

Location: shared_state.rs:521

if hs_elapsed> hibernation_timeout {

There's a missing space between hs_elapsed and >. This should be:

if hs_elapsed > hibernation_timeout {

This will cause a compilation error and must be fixed before merge.

Code Style: Import Organization

The PR includes several import reordering changes (e.g., lib.rs:8-19, shared_state.rs:2-16) that don't follow the instructions in CLAUDE.md which state:

Do not run ./scripts/cargo/fix.sh. Do not format the code yourself.

While these changes are cosmetically fine (sorting imports alphabetically), they shouldn't have been made manually according to the project's guidelines. The pre-commit hooks should handle this.

📊 Analysis

Timing Analysis

Default hibernation timeout: 90,000ms (90 seconds)
Keepalive interval: 45,000ms (45 seconds) - half of threshold
Old behavior: First keepalive at ~90s (skipping first tick) → timeout risk
New behavior: First keepalive at ~45s → proper coverage

The fix ensures that the keepalive task will update the hibernating request entry at 45 seconds, well before the 90-second timeout, preventing false timeouts.

Race Condition Consideration

The removal of the early hibernating_request::upsert call in lib.rs means the first database entry is now created by the keepalive task rather than before spawning it. This appears safe because:

The keepalive task starts immediately (no initial tick skip)
The first tick fires immediately when the interval is created
The entry will be created within milliseconds of hibernation starting

However, there's a potential edge case: if the keepalive task experiences significant scheduling delay, the entry might not exist when the garbage collection checks for hibernating requests. This seems unlikely in practice but worth noting.

🧪 Test Coverage

Concern: No test files found for pegboard-gateway package. Given this is a timing-sensitive bug fix for a critical hibernation mechanism, integration or unit tests would be valuable to prevent regression.

Recommendation: Consider adding tests that verify:

Keepalive updates happen at expected intervals
The first keepalive occurs before timeout
Hibernation doesn't timeout during normal operation

🔒 Security & Performance

Security: No concerns identified
Performance:
- Added logging is debug-level, minimal impact
- Jitter remains in place to prevent thundering herd (good)
- No performance regressions expected

✏️ Minor Observations

Logging Style: The new debug logs follow the project's structured logging conventions correctly (using ? and % prefixes, lowercase messages). ✅
Error Handling: No changes to error handling, existing patterns maintained. ✅

📝 Action Items

Must Fix Before Merge:

Fix syntax error in shared_state.rs:521 - add space before >

Consider:

Revert import reordering changes to align with CLAUDE.md guidelines
Add test coverage for hibernation keepalive timing
Document the timing behavior in code comments (why interval is threshold/2)

Verdict

Requires Changes - The core fix is sound and addresses a real bug, but the compilation error must be fixed. Once corrected, this PR will properly prevent WebSocket hibernation timeouts.

…meout on ws hibernation

claude · 2025-11-21T00:28:31Z

Code Review

Summary

This PR fixes a WebSocket hibernation timeout issue by ensuring the first keepalive tick is not skipped and refactors the MessageId type from a binary format to a structured type. Overall, the changes are solid and address a real issue, but there are a few concerns to address.

Critical Issues

1. Missing Space in Comparison Operator (packages/pegboard-gateway/src/shared_state.rs:521)

if hs_elapsed> hibernation_timeout {

Missing space before > operator. Should be:

if hs_elapsed > hibernation_timeout {

2. Race Condition Risk (packages/pegboard-gateway/src/lib.rs:581-590)

The removal of the upfront hibernating_request::upsert call introduces a potential race condition:

Before: The hibernating request entry was created immediately, ensuring it existed even if the actor rewoke quickly.

After: If there are pending messages, the function returns early without creating the entry, and the keepalive task (which creates the entry) hasn't started yet.

Risk: If an actor hibernates, immediately rewakes due to pending messages, then hibernates again quickly, the second hibernation might not have a valid keepalive entry yet.

Recommendation: Consider keeping the upsert call before the pending message check, or add a comment explaining why this is safe.

Design Questions

3. MessageId Schema Change Implications

The change from type MessageId data[10] to a structured MessageId type is a breaking protocol change.

Questions:

Is this change backwards compatible with existing deployed actors/gateways?
Should this be documented in a migration guide?
Are there any version checks to handle mixed protocol versions?

4. Keepalive Timing Behavior

The PR removes the initial ping_interval.tick().await that was skipping the first immediate tick.

Analysis:

Before: First keepalive happened at ~15s (half of 30s timeout)
After: First keepalive happens immediately, then at ~15s intervals

Question: Is the immediate first tick intentional? The comment says "first keepalive tick is not skipped" but doesn't explain why we want it to fire immediately. Consider adding a comment explaining the reasoning (e.g., to establish the hibernating request entry ASAP).

Code Quality

5. Debug Logging Added (packages/pegboard-gateway/src/keepalive_task.rs:45-50)

✅ Good addition for debugging hibernation issues.

6. Removed Unused File (engine/sdks/typescript/runner/src/tunnel-id.ts)

✅ Clean removal of 104 lines of now-unused code.

7. Consistent ID Formatting

✅ Good refactoring to use idToStr() utility function consistently throughout the codebase instead of custom base64 encoding.

8. Import Organization

Minor inconsistency: Some files have imports reordered (e.g., Mutex, watch → watch, Mutex). While not wrong, these seem like incidental formatting changes that add noise to the diff.

Performance Considerations

9. String Allocation in Hot Path

In shared_state.rs:516-519, debug logging now formats elapsed time as floats on every GC check:

tracing::debug!(
    hs_elapsed=%hs_elapsed.as_secs_f64(),
    timeout=%hibernation_timeout.as_secs_f64(),
    "checking hibernating state elapsed time"
);

Impact: This runs in the GC loop which runs every 15 seconds. The impact is likely negligible, but consider using the ?hs_elapsed format instead if these logs become noisy.

Security Considerations

No security concerns identified. The changes are primarily internal refactoring and timing adjustments.

Test Coverage

Missing: No test changes or new tests added.

Recommendations:

Add a test that verifies the first keepalive tick fires immediately after hibernation
Add a test that verifies the hibernation timeout is calculated correctly with the new timing
Add an integration test for the pending messages + quick rewake scenario to verify no race condition

Style & Conventions

Following CLAUDE.md conventions:

✅ Uses structured logging with tracing::debug!
✅ Log messages are lowercase
✅ Uses anyhow::Result correctly
⚠️ Spacing issue on line 521 (mentioned above)

Recommendations

Must Fix:

Fix the spacing issue in shared_state.rs:521
Verify/document the race condition safety in lib.rs:581-590

Should Consider:
3. Add tests for the timing behavior changes
4. Document the protocol version compatibility
5. Add a comment explaining why the first tick should fire immediately

Nice to Have:
6. Consider using ?hs_elapsed instead of %hs_elapsed.as_secs_f64() to reduce allocations in debug logs

Overall: Approve with minor changes required. The core fix is sound, but please address the spacing issue and clarify the race condition safety before merging.

🤖 Generated with Claude Code

graphite-app · 2025-11-21T02:40:53Z

Merge activity

Nov 21, 2:40 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
Nov 21, 2:42 AM UTC: CI is running for this pull request on a draft pull request (#3515) due to your merge queue CI optimization settings.
Nov 21, 2:43 AM UTC: Merged by the Graphite merge queue via draft PR: #3515.

…meout on ws hibernation (#3507)

vercel bot temporarily deployed to Preview – rivetkit-serverless November 21, 2025 00:11 Inactive

NathanFlurry mentioned this pull request Nov 21, 2025

chore(examples): add counter-next-js example #3506

Closed

NathanFlurry force-pushed the 11-20-fix_engine_ensure_first_keepalive_tick_is_not_skipped_to_prevent_timeout_on_ws_hibernation branch from 8d11332 to 79e5409 Compare November 21, 2025 00:27

fix(engine): ensure first keepalive tick is not skipped to prevent ti…

7e275b0

…meout on ws hibernation

NathanFlurry force-pushed the 11-20-fix_engine_ensure_first_keepalive_tick_is_not_skipped_to_prevent_timeout_on_ws_hibernation branch from 79e5409 to 7e275b0 Compare November 21, 2025 00:27

vercel bot had a problem deploying to Preview – rivetkit-serverless November 21, 2025 00:27 Failure

MasterPtato mentioned this pull request Nov 21, 2025

chore: remove msg id parts from protocol, move id utils to runner protocol crate #3508

Closed

NathanFlurry requested a review from MasterPtato November 21, 2025 02:35

graphite-app bot pushed a commit that referenced this pull request Nov 21, 2025

fix(engine): ensure first keepalive tick is not skipped to prevent ti…

dbcce0a

…meout on ws hibernation (#3507)

graphite-app bot closed this Nov 21, 2025

graphite-app bot deleted the 11-20-fix_engine_ensure_first_keepalive_tick_is_not_skipped_to_prevent_timeout_on_ws_hibernation branch November 21, 2025 02:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(engine): ensure first keepalive tick is not skipped to prevent timeout on ws hibernation #3507

fix(engine): ensure first keepalive tick is not skipped to prevent timeout on ws hibernation #3507

Uh oh!

NathanFlurry commented Nov 21, 2025

Uh oh!

vercel bot commented Nov 21, 2025 •

edited

Loading

Uh oh!

NathanFlurry commented Nov 21, 2025 •

edited

Loading

Uh oh!

claude bot commented Nov 21, 2025

Uh oh!

claude bot commented Nov 21, 2025

Uh oh!

graphite-app bot commented Nov 21, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix(engine): ensure first keepalive tick is not skipped to prevent timeout on ws hibernation #3507

fix(engine): ensure first keepalive tick is not skipped to prevent timeout on ws hibernation #3507

Uh oh!

Conversation

NathanFlurry commented Nov 21, 2025

Uh oh!

vercel bot commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NathanFlurry commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How to use the Graphite Merge Queue

Uh oh!

claude bot commented Nov 21, 2025

Code Review: WebSocket Hibernation Keepalive Fix

Summary

✅ Positive Changes

⚠️ Issues & Concerns

Critical: Missing Space (Compilation Error)

Code Style: Import Organization

📊 Analysis

Timing Analysis

Race Condition Consideration

🧪 Test Coverage

🔒 Security & Performance

✏️ Minor Observations

📝 Action Items

Verdict

Uh oh!

claude bot commented Nov 21, 2025

Code Review

Summary

Critical Issues

1. Missing Space in Comparison Operator (packages/pegboard-gateway/src/shared_state.rs:521)

2. Race Condition Risk (packages/pegboard-gateway/src/lib.rs:581-590)

Design Questions

3. MessageId Schema Change Implications

4. Keepalive Timing Behavior

Code Quality

5. Debug Logging Added (packages/pegboard-gateway/src/keepalive_task.rs:45-50)

6. Removed Unused File (engine/sdks/typescript/runner/src/tunnel-id.ts)

7. Consistent ID Formatting

8. Import Organization

Performance Considerations

9. String Allocation in Hot Path

Security Considerations

Test Coverage

Style & Conventions

Recommendations

Uh oh!

graphite-app bot commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge activity

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vercel bot commented Nov 21, 2025 •

edited

Loading

NathanFlurry commented Nov 21, 2025 •

edited

Loading

graphite-app bot commented Nov 21, 2025 •

edited

Loading