Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Nov 21, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Error Error Nov 21, 2025 0:27am
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 21, 2025 0:27am
rivet-inspector Ignored Ignored Preview Nov 21, 2025 0:27am
rivet-site Ignored Ignored Preview Nov 21, 2025 0:27am

Copy link
Member Author

NathanFlurry commented Nov 21, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Nov 21, 2025

Code Review: WebSocket Hibernation Keepalive Fix

Summary

This PR addresses a critical bug where the first keepalive tick was being skipped in the WebSocket hibernation mechanism, which could cause premature timeouts. The fix ensures that keepalive updates happen immediately and consistently throughout the hibernation period.

✅ Positive Changes

  1. Critical Bug Fix - Removing the initial ping_interval.tick().await call in keepalive_task.rs:33 is the correct fix. The previous code was discarding the first tick, which meant the first keepalive update wouldn't happen until the second interval (45 seconds by default), potentially causing hibernation timeout (90 seconds default).

  2. Removed Redundant Code - The removal of the premature hibernating_request::upsert call from lib.rs:578-588 is correct. This operation is now handled by the keepalive task, which is a better separation of concerns.

  3. Improved Observability - Added debug logging in both files:

    • keepalive_task.rs:45-50 - Logs each keepalive update with actor, gateway, and request IDs
    • shared_state.rs:516-520 - Logs elapsed time vs timeout threshold when checking hibernation state

⚠️ Issues & Concerns

Critical: Missing Space (Compilation Error)

Location: shared_state.rs:521

if hs_elapsed> hibernation_timeout {

There's a missing space between hs_elapsed and >. This should be:

if hs_elapsed > hibernation_timeout {

This will cause a compilation error and must be fixed before merge.

Code Style: Import Organization

The PR includes several import reordering changes (e.g., lib.rs:8-19, shared_state.rs:2-16) that don't follow the instructions in CLAUDE.md which state:

Do not run ./scripts/cargo/fix.sh. Do not format the code yourself.

While these changes are cosmetically fine (sorting imports alphabetically), they shouldn't have been made manually according to the project's guidelines. The pre-commit hooks should handle this.

📊 Analysis

Timing Analysis

  • Default hibernation timeout: 90,000ms (90 seconds)
  • Keepalive interval: 45,000ms (45 seconds) - half of threshold
  • Old behavior: First keepalive at ~90s (skipping first tick) → timeout risk
  • New behavior: First keepalive at ~45s → proper coverage

The fix ensures that the keepalive task will update the hibernating request entry at 45 seconds, well before the 90-second timeout, preventing false timeouts.

Race Condition Consideration

The removal of the early hibernating_request::upsert call in lib.rs means the first database entry is now created by the keepalive task rather than before spawning it. This appears safe because:

  1. The keepalive task starts immediately (no initial tick skip)
  2. The first tick fires immediately when the interval is created
  3. The entry will be created within milliseconds of hibernation starting

However, there's a potential edge case: if the keepalive task experiences significant scheduling delay, the entry might not exist when the garbage collection checks for hibernating requests. This seems unlikely in practice but worth noting.

🧪 Test Coverage

Concern: No test files found for pegboard-gateway package. Given this is a timing-sensitive bug fix for a critical hibernation mechanism, integration or unit tests would be valuable to prevent regression.

Recommendation: Consider adding tests that verify:

  • Keepalive updates happen at expected intervals
  • The first keepalive occurs before timeout
  • Hibernation doesn't timeout during normal operation

🔒 Security & Performance

  • Security: No concerns identified
  • Performance:
    • Added logging is debug-level, minimal impact
    • Jitter remains in place to prevent thundering herd (good)
    • No performance regressions expected

✏️ Minor Observations

  1. Logging Style: The new debug logs follow the project's structured logging conventions correctly (using ? and % prefixes, lowercase messages). ✅

  2. Error Handling: No changes to error handling, existing patterns maintained. ✅

📝 Action Items

Must Fix Before Merge:

  • Fix syntax error in shared_state.rs:521 - add space before >

Consider:

  • Revert import reordering changes to align with CLAUDE.md guidelines
  • Add test coverage for hibernation keepalive timing
  • Document the timing behavior in code comments (why interval is threshold/2)

Verdict

Requires Changes - The core fix is sound and addresses a real bug, but the compilation error must be fixed. Once corrected, this PR will properly prevent WebSocket hibernation timeouts.

@NathanFlurry NathanFlurry force-pushed the 11-20-fix_engine_ensure_first_keepalive_tick_is_not_skipped_to_prevent_timeout_on_ws_hibernation branch from 8d11332 to 79e5409 Compare November 21, 2025 00:27
@NathanFlurry NathanFlurry force-pushed the 11-20-fix_engine_ensure_first_keepalive_tick_is_not_skipped_to_prevent_timeout_on_ws_hibernation branch from 79e5409 to 7e275b0 Compare November 21, 2025 00:27
@claude
Copy link

claude bot commented Nov 21, 2025

Code Review

Summary

This PR fixes a WebSocket hibernation timeout issue by ensuring the first keepalive tick is not skipped and refactors the MessageId type from a binary format to a structured type. Overall, the changes are solid and address a real issue, but there are a few concerns to address.


Critical Issues

1. Missing Space in Comparison Operator (packages/pegboard-gateway/src/shared_state.rs:521)

if hs_elapsed> hibernation_timeout {

Missing space before > operator. Should be:

if hs_elapsed > hibernation_timeout {

2. Race Condition Risk (packages/pegboard-gateway/src/lib.rs:581-590)

The removal of the upfront hibernating_request::upsert call introduces a potential race condition:

Before: The hibernating request entry was created immediately, ensuring it existed even if the actor rewoke quickly.

After: If there are pending messages, the function returns early without creating the entry, and the keepalive task (which creates the entry) hasn't started yet.

Risk: If an actor hibernates, immediately rewakes due to pending messages, then hibernates again quickly, the second hibernation might not have a valid keepalive entry yet.

Recommendation: Consider keeping the upsert call before the pending message check, or add a comment explaining why this is safe.


Design Questions

3. MessageId Schema Change Implications

The change from type MessageId data[10] to a structured MessageId type is a breaking protocol change.

Questions:

  • Is this change backwards compatible with existing deployed actors/gateways?
  • Should this be documented in a migration guide?
  • Are there any version checks to handle mixed protocol versions?

4. Keepalive Timing Behavior

The PR removes the initial ping_interval.tick().await that was skipping the first immediate tick.

Analysis:

  • Before: First keepalive happened at ~15s (half of 30s timeout)
  • After: First keepalive happens immediately, then at ~15s intervals

Question: Is the immediate first tick intentional? The comment says "first keepalive tick is not skipped" but doesn't explain why we want it to fire immediately. Consider adding a comment explaining the reasoning (e.g., to establish the hibernating request entry ASAP).


Code Quality

5. Debug Logging Added (packages/pegboard-gateway/src/keepalive_task.rs:45-50)

✅ Good addition for debugging hibernation issues.

6. Removed Unused File (engine/sdks/typescript/runner/src/tunnel-id.ts)

✅ Clean removal of 104 lines of now-unused code.

7. Consistent ID Formatting

✅ Good refactoring to use idToStr() utility function consistently throughout the codebase instead of custom base64 encoding.

8. Import Organization

Minor inconsistency: Some files have imports reordered (e.g., Mutex, watchwatch, Mutex). While not wrong, these seem like incidental formatting changes that add noise to the diff.


Performance Considerations

9. String Allocation in Hot Path

In shared_state.rs:516-519, debug logging now formats elapsed time as floats on every GC check:

tracing::debug!(
    hs_elapsed=%hs_elapsed.as_secs_f64(),
    timeout=%hibernation_timeout.as_secs_f64(),
    "checking hibernating state elapsed time"
);

Impact: This runs in the GC loop which runs every 15 seconds. The impact is likely negligible, but consider using the ?hs_elapsed format instead if these logs become noisy.


Security Considerations

No security concerns identified. The changes are primarily internal refactoring and timing adjustments.


Test Coverage

Missing: No test changes or new tests added.

Recommendations:

  1. Add a test that verifies the first keepalive tick fires immediately after hibernation
  2. Add a test that verifies the hibernation timeout is calculated correctly with the new timing
  3. Add an integration test for the pending messages + quick rewake scenario to verify no race condition

Style & Conventions

Following CLAUDE.md conventions:

  • ✅ Uses structured logging with tracing::debug!
  • ✅ Log messages are lowercase
  • ✅ Uses anyhow::Result correctly
  • ⚠️ Spacing issue on line 521 (mentioned above)

Recommendations

Must Fix:

  1. Fix the spacing issue in shared_state.rs:521
  2. Verify/document the race condition safety in lib.rs:581-590

Should Consider:
3. Add tests for the timing behavior changes
4. Document the protocol version compatibility
5. Add a comment explaining why the first tick should fire immediately

Nice to Have:
6. Consider using ?hs_elapsed instead of %hs_elapsed.as_secs_f64() to reduce allocations in debug logs


Overall: Approve with minor changes required. The core fix is sound, but please address the spacing issue and clarify the race condition safety before merging.

🤖 Generated with Claude Code

@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 21, 2025

Merge activity

  • Nov 21, 2:40 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 21, 2:42 AM UTC: CI is running for this pull request on a draft pull request (#3515) due to your merge queue CI optimization settings.
  • Nov 21, 2:43 AM UTC: Merged by the Graphite merge queue via draft PR: #3515.

graphite-app bot pushed a commit that referenced this pull request Nov 21, 2025
@graphite-app graphite-app bot closed this Nov 21, 2025
@graphite-app graphite-app bot deleted the 11-20-fix_engine_ensure_first_keepalive_tick_is_not_skipped_to_prevent_timeout_on_ws_hibernation branch November 21, 2025 02:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants