@NathanFlurry
Member

No description provided.

@vercel

vercel bot commented Nov 20, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Error Error Nov 20, 2025 8:52pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 20, 2025 8:52pm
rivet-inspector Ignored Ignored Preview Nov 20, 2025 8:52pm
rivet-site Ignored Ignored Preview Nov 20, 2025 8:52pm

Member Author

NathanFlurry commented Nov 20, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 20, 2025

Code Review: Hibernation Implementation Fix

This PR addresses critical issues in the hibernation implementation for WebSocket connections. Overall, the changes improve the architecture and fix race conditions, though there are some areas that need attention.


🟢 Strengths

1. Critical Bug Fix in Rust Gateway (pegboard-gateway/src/lib.rs:489)

-if lifecycle_res
+if !&lifecycle_res

Excellent catch! This was a logic inversion bug - the code was sending close frames when it should hibernate, and vice versa. This fix is critical for proper hibernation behavior.

2. Race Condition Prevention (runner/src/mod.ts:915-962)

The addition of actorStartPromise and the careful sequencing in #handleCommandStartActor is well-designed:

  • Setting up the actor synchronously before any async operations
  • Populating requestToActor map before onActorStart runs
  • Using promises to coordinate actor initialization

The comment at line 916 is particularly valuable:

// IMPORTANT: Make sure no async code runs before inserting #actors and
// calling addRequestToActor in order to prevent race conditions with
// subsequence commands
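
A minimal sketch of the ordering that comment describes, assuming simplified stand-in types. `MiniRunner`, `deferred`, and `route` are hypothetical names used only for illustration and are not the runner SDK's API. The actor and its request mappings are registered synchronously, and message routing awaits the start promise:

```typescript
// Hypothetical, simplified stand-ins; not the actual runner SDK types.
interface Deferred {
    promise: Promise<void>;
    resolve: () => void;
}

function deferred(): Deferred {
    let resolve: () => void = () => {};
    const promise = new Promise<void>((res) => {
        resolve = () => res();
    });
    return { promise, resolve };
}

interface ActorEntry {
    actorId: string;
    actorStartPromise: Deferred;
}

class MiniRunner {
    private actors = new Map<string, ActorEntry>();
    private requestToActor = new Map<string, string>();

    // Error handling for a failing onActorStart is omitted in this sketch.
    async startActor(
        actorId: string,
        hibernatingRequestIds: string[],
        onActorStart: () => Promise<void>,
    ): Promise<void> {
        // Synchronous section: no await until both maps are populated.
        const entry: ActorEntry = { actorId, actorStartPromise: deferred() };
        this.actors.set(actorId, entry);
        for (const requestId of hibernatingRequestIds) {
            this.requestToActor.set(requestId, actorId);
        }

        // Only now hand control to async user code.
        await onActorStart();
        entry.actorStartPromise.resolve();
    }

    // Returns the owning actor ID once that actor has finished starting.
    async route(requestId: string): Promise<string | undefined> {
        const actorId = this.requestToActor.get(requestId);
        if (actorId === undefined) return undefined;
        const entry = this.actors.get(actorId);
        if (entry === undefined) return undefined;
        await entry.actorStartPromise.promise;
        return entry.actorId;
    }
}
```

Because the mapping exists before the first `await`, a message that arrives mid-startup finds its actor and simply waits, rather than being dropped as unknown.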

3. Improved API Design (runner/src/mod.ts:109-146)

The refactored hibernatableWebSocket configuration is much clearer:

  • Removes the confusing loadAll method from the config
  • Makes the caller responsible for loading and passing metadata to restoreHibernatingRequests
  • Better separation of concerns with comprehensive documentation

4. Better State Management (runner/src/actor.ts:65-99)

Replacing setPendingRequest with separate createPendingRequest and createPendingRequestWithStreamController methods:

  • More explicit intent
  • Better type safety
  • Warns on duplicate requests instead of silently replacing

🟡 Issues & Concerns

1. Resource Leak on Actor Start Failure (runner/src/mod.ts:959)

// TODO: Where is this GC'd if something fails?
for (const hr of startCommand.hibernatingRequests) {
    this.#tunnel.addRequestToActor(hr.gatewayId, hr.requestId, actorId);
}

Critical: If onActorStart throws an exception (line 984), the requestToActor mappings added at line 961 are never cleaned up. The code calls forceStopActor (line 998), but this might not remove these mappings properly.

Recommendation:

try {
    await this.#config.onActorStart(actorId, generation, actorConfig);
    instance.actorStartPromise.resolve();
} catch (err) {
    this.log?.error({
        msg: "error starting runner actor",
        actorId,
        err,
    });
    
    instance.actorStartPromise.reject(err);
    
    // Clean up requestToActor mappings
    for (const hr of startCommand.hibernatingRequests) {
        this.#tunnel.removeRequestToActor(hr.gatewayId, hr.requestId);
    }
    
    await this.forceStopActor(actorId, generation);
}

2. Duplicate WebSocket Warning Without Action (runner/src/actor.ts:176-180)

const exists = this.getWebSocket(gatewayId, requestId) !== undefined;
if (exists) {
    logger()?.warn({ msg: "attempting to set websocket twice" });
    return; // Just returns without setting!
}

Issue: This silently fails when trying to add a duplicate WebSocket. Should this throw an error instead? Or delete the existing one first? The behavior differs from createPendingRequest which explicitly deletes and warns.

Recommendation: Be consistent with createPendingRequest pattern or document why this should fail silently.

3. Missing Error Handling in Async Message Handler (runner/src/mod.ts:852-869)

#handleCommands(commands: protocol.ToClientCommands) {
    for (const commandWrapper of commands) {
        if (commandWrapper.inner.tag === "CommandStartActor") {
            // Spawn background promise
            this.#handleCommandStartActor(commandWrapper);
        } else if (commandWrapper.inner.tag === "CommandStopActor") {
            // Spawn background promise
            this.#handleCommandStopActor(commandWrapper);
        }
    }
}

Issue: These background promises are fire-and-forget. If they throw unhandled exceptions, they could cause silent failures.

Recommendation:

this.#handleCommandStartActor(commandWrapper).catch(err => {
    this.log?.error({ msg: "unhandled error in CommandStartActor", err });
});

4. Potential State Inconsistency (runner/src/tunnel.ts:116-119)

if (actor.hibernationRestored) {
    throw new Error(
        `Actor ${actorId} already restored hibernating requests`,
    );
}

Good: This prevents double-restoration, but...

Question: What happens if restoreHibernatingRequests partially completes and then fails? The flag won't be set (line 282), so it could be retried, but some connections might already be partially restored. Consider adding more granular tracking or cleanup on failure.
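
One way to make restoration all-or-nothing is sketched below, under the assumption that each restored connection can be closed again individually. `restoreAll`, `restoreOne`, and `closeOne` are hypothetical names, not functions from tunnel.ts, and the real restoration is asynchronous while this sketch is synchronous for brevity:

```typescript
// Hypothetical sketch: undo partial work so a retry starts from a clean state.
interface RestorableActor {
    hibernationRestored: boolean;
    restoredIds: string[];
}

function restoreAll(
    actor: RestorableActor,
    requestIds: string[],
    restoreOne: (requestId: string) => void,
    closeOne: (requestId: string) => void,
): void {
    if (actor.hibernationRestored) {
        throw new Error("already restored hibernating requests");
    }

    const done: string[] = [];
    try {
        for (const id of requestIds) {
            restoreOne(id);
            done.push(id);
        }
    } catch (err) {
        // Roll back every connection restored before the failure.
        for (const id of done) {
            closeOne(id);
        }
        throw err;
    }

    // The flag is set only after every connection was restored.
    actor.restoredIds = done;
    actor.hibernationRestored = true;
}
```

With this shape, a failed attempt leaves the flag unset and no half-restored connections behind, so retrying is safe.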

5. Removed Message Index Reset (pegboard-gateway/src/shared_state.rs:127)

-entry.message_index = 0;

Question: Why was this line removed? Is the message index intentionally preserved across hibernation cycles now? This should be documented or explained in the commit message if it's an intentional behavior change.


🔵 Minor Issues

1. Logging Improvements Are Good

The added structured logging throughout (especially in actor.ts and tunnel.ts) will make debugging much easier. Good practice!

2. Removed Debug Log (pegboard-gateway/src/shared_state.rs:395)

-request_id=?tunnel_id::request_id_to_string(&request_id),

Was this field removed to reduce log verbosity? Consider if this information is still useful for debugging.

3. Test Changes (driver-test-suite/tests/actor-conn-hibernation.ts)

The test file has only minor changes (to the toBe(1) assertions on lines 57 and 84). Are additional tests needed to cover the new behavior?


🔒 Security Considerations

No obvious security issues detected. The changes maintain proper separation between actors and don't introduce new attack vectors.


📊 Performance Considerations

  1. Synchronous Setup Before Async: This pattern (lines 915-962) prevents race conditions without adding performance overhead.
  2. Background Promise Settling: Using Promise.allSettled (tunnel.ts:279) is correct but could delay hibernation if background ops hang. Consider adding timeouts.
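
A rough sketch of the timeout suggestion, assuming the background operations are plain promises. `withTimeout` and `settleWithTimeout` are hypothetical helpers, not part of tunnel.ts:

```typescript
// Hypothetical helper: reject if the wrapped promise does not settle in time.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
    return new Promise<T>((resolve, reject) => {
        const timer = setTimeout(() => {
            reject(new Error(`${label} timed out after ${ms}ms`));
        }, ms);
        promise.then(
            (value) => {
                clearTimeout(timer);
                resolve(value);
            },
            (err) => {
                clearTimeout(timer);
                reject(err);
            },
        );
    });
}

// Settles every operation, but never waits longer than `ms` for any one of them.
function settleWithTimeout<T>(
    ops: Promise<T>[],
    ms: number,
): Promise<PromiseSettledResult<T>[]> {
    return Promise.allSettled(
        ops.map((op, i) => withTimeout(op, ms, `background op ${i}`)),
    );
}
```

Note that a timed-out operation keeps running in the background; the timeout only bounds how long hibernation waits for it.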

🧪 Test Coverage

Existing tests in actor-conn-hibernation.ts cover:

  • ✅ Basic hibernation
  • ✅ State persistence through hibernation
  • ✅ Connection closure during hibernation

Missing coverage:

  • ❌ Actor start failure with hibernating requests
  • ❌ Partial restoration failure scenarios
  • ❌ Duplicate restoration attempts

📝 Documentation

The JSDoc comments added to restoreHibernatingRequests (runner/src/mod.ts:1582-1614) are excellent and clearly explain the API contract.


Recommendations Summary

Must Fix (Before Merge):

  1. ⚠️ Resource leak on actor start failure (mod.ts:959)
  2. ⚠️ Unhandled promise rejections in background commands (mod.ts:863-867)

Should Fix:

  1. Inconsistent duplicate WebSocket handling (actor.ts:176-180)
  2. Consider failure recovery in partial hibernation restoration

Nice to Have:

  1. Additional test coverage for error cases
  2. Document the message_index behavior change

Overall, this is a solid improvement that fixes critical bugs and improves the architecture. The main concerns are around error handling and resource cleanup in failure scenarios.

@claude

claude bot commented Nov 20, 2025

Code Review: Hibernation Implementation Fix

This PR fixes critical issues in the hibernation implementation for RivetKit.

Summary

Addresses race conditions and synchronization issues in the WebSocket hibernation flow, ensuring actors are fully initialized before processing hibernating requests.

Positive Observations

1. Critical Bug Fix in Rust (pegboard-gateway/src/lib.rs:489)
The logic inversion fix changing if lifecycle_res to if !&lifecycle_res is excellent. This would have caused close frames to be sent when hibernating and not sent when not hibernating.

2. Proper Race Condition Handling
The introduction of actorStartPromise and getAndWaitForActor prevents race conditions between actor initialization and incoming tunnel messages.

3. Improved API Design
The refactored restoreHibernatingRequests approach provides clearer separation of concerns with better documentation.

4. Enhanced Logging
Good additions for debugging: message index tracking, gateway IDs in logs, pending request counts.

Issues & Concerns

1. CRITICAL: Potential Resource Leak (mod.ts:959)
If onActorStart throws, the requestToActor mappings are never cleaned up. The TODO comment acknowledges this. This could cause memory leaks and incorrect request routing.

Recommendation: Add cleanup in the catch block to remove all hibernatingRequests from the tunnel's requestToActor map.

2. Warning Suppression (actor.ts:177-180)
setWebSocket silently ignores duplicate WebSocket registration with just a warning. Should this throw an error to fail fast instead?

3. Typo (tunnel.ts:762)
"catught" should be "caught"

4. Code Duplication (actor.ts:65-137)
createPendingRequest and createPendingRequestWithStreamController have ~80% overlapping code. Consider merging with an optional parameter.
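
A possible shape for the merged method is sketched here. The `PendingRequests` class, its fields, and the string ID types are assumptions made for illustration; the real `createPendingRequest` signatures in actor.ts may differ:

```typescript
// Hypothetical sketch: one creation path with an optional stream controller.
interface PendingRequest<S> {
    gatewayId: string;
    requestId: string;
    streamController?: S;
}

class PendingRequests<S> {
    private list: PendingRequest<S>[] = [];
    // Stands in for logger()?.warn(...) in this sketch.
    warnings: string[] = [];

    create(gatewayId: string, requestId: string, streamController?: S): PendingRequest<S> {
        const index = this.list.findIndex(
            (r) => r.gatewayId === gatewayId && r.requestId === requestId,
        );
        if (index !== -1) {
            this.warnings.push("attempting to set pending request twice, replacing existing");
            this.list.splice(index, 1);
        }

        const entry: PendingRequest<S> = { gatewayId, requestId };
        if (streamController !== undefined) {
            entry.streamController = streamController;
        }
        this.list.push(entry);
        return entry;
    }

    count(): number {
        return this.list.length;
    }
}
```

The duplicate check and replacement logic then live in exactly one place, so the two call styles cannot drift apart.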

5. Missing Validation
Why is replacing a pending request valid behavior? This could indicate request ID collision or race conditions.

Performance Considerations

Linear Search in Arrays
Multiple findIndex operations on pendingRequests, webSockets, and requestToActor arrays could become O(n) bottlenecks. Consider using Maps with composite keys for O(1) lookups.
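
A small sketch of the composite-key idea, assuming the gateway and request IDs can be represented as strings (if they are binary in the real code, they would need encoding first). `requestKey` and `PendingRequestIndex` are hypothetical names:

```typescript
// Hypothetical sketch: O(1) lookups keyed by (gatewayId, requestId).
function requestKey(gatewayId: string, requestId: string): string {
    // JSON encoding keeps keys unambiguous even if an ID contains a separator.
    return JSON.stringify([gatewayId, requestId]);
}

class PendingRequestIndex<T> {
    private entries = new Map<string, T>();

    // Returns true when an existing entry was replaced.
    set(gatewayId: string, requestId: string, value: T): boolean {
        const key = requestKey(gatewayId, requestId);
        const replaced = this.entries.has(key);
        this.entries.set(key, value);
        return replaced;
    }

    get(gatewayId: string, requestId: string): T | undefined {
        return this.entries.get(requestKey(gatewayId, requestId));
    }

    delete(gatewayId: string, requestId: string): boolean {
        return this.entries.delete(requestKey(gatewayId, requestId));
    }

    size(): number {
        return this.entries.size;
    }
}
```

Whether this is worth the change depends on how many concurrent requests an actor typically holds; for a handful of entries a linear scan is likely fine.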

Test Coverage

Missing tests for:

  • Actor start failure cleanup (resource leak scenario)
  • Duplicate WebSocket registration
  • Race conditions between actor start and incoming messages
  • The Rust logic inversion fix

Recommendations

Must Fix Before Merge:

  1. Address the resource leak in error path
  2. Fix typo

Should Fix:

  1. Add test coverage for critical bug fixes
  2. Refactor duplicate code
  3. Consider throwing errors for duplicate registrations in dev mode

Nice to Have:

  1. Optimize array-based lookups to Map-based
  2. Add JSDoc for complex flows
  3. Add metrics for hibernation success/failure

Overall Assessment

Rating: 7.5/10

This PR fixes critical bugs and improves architecture. Main concerns are the resource leak (must fix) and missing test coverage. With those addressed, this would be production-ready.

Great catch on the Rust logic bug!

@claude

claude bot commented Nov 20, 2025

Code Review: Hibernation Implementation Fix

Thanks for working on this hibernation fix! I've reviewed the changes across the Rust gateway code and TypeScript SDK. Here's my feedback:


🐛 Critical Bug Fix (pegboard-gateway/src/lib.rs:489)

Excellent catch! The negation operator fix is critical:

-if lifecycle_res
+if !&lifecycle_res

This was closing WebSocket connections when they should be hibernating. This is the core fix and correctly addresses the hibernation behavior.


🏗️ Architecture Improvements

1. Actor Start Synchronization (runner/src/mod.ts)

The new actorStartPromise pattern is solid:

  • Prevents race conditions between actor creation and message handling
  • Ensures hibernating requests wait for full actor initialization before processing
  • Good use of promiseWithResolvers pattern

However, there's a potential resource leak (line 959 TODO):

// TODO: Where is this GC'd if something fails?
for (const hr of startCommand.hibernatingRequests) {
    this.#tunnel.addRequestToActor(hr.gatewayId, hr.requestId, actorId);
}

Recommendation: Add cleanup in the catch block at line 987-999 to remove the request-to-actor mappings if initialization fails:

} catch (err) {
    // Clean up request mappings if actor start failed
    for (const hr of startCommand.hibernatingRequests) {
        this.#tunnel.removeRequestToActor(hr.gatewayId, hr.requestId);
    }
    // ... rest of error handling
}

2. Pending Request Management (runner/src/actor.ts)

Good refactoring with createPendingRequest and createPendingRequestWithStreamController:

  • Eliminates code duplication
  • Adds helpful debug logging
  • Includes duplicate detection with warnings

Minor concern (line 82): The duplicate request warning at line 66 might mask bugs rather than expose them. Consider if this should be an error instead of a warning, or at least add metrics to track how often this occurs in production.

3. Hibernation State Contract

The hibernationRestored flag (actor.ts:35) is a good defensive pattern to catch implementation errors. However, the check only validates it hasn't been called twice - not that it was called at all. Consider adding a check before actor shutdown to ensure restoreHibernatingRequests was called if hibernating requests existed.
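
The suggested shutdown check could look roughly like this. `HibernationState` and `checkRestoreContract` are hypothetical, and the field `hibernatingRequestCount` is an assumption about what the actor tracks:

```typescript
// Hypothetical sketch: detect a missed restoreHibernatingRequests call at shutdown.
interface HibernationState {
    hibernatingRequestCount: number;
    hibernationRestored: boolean;
}

// Returns a warning message when the contract was violated, otherwise undefined.
function checkRestoreContract(actorId: string, state: HibernationState): string | undefined {
    if (state.hibernatingRequestCount > 0 && !state.hibernationRestored) {
        return `actor ${actorId} stopped without restoring ${state.hibernatingRequestCount} hibernating request(s)`;
    }
    return undefined;
}
```

Returning a message instead of throwing lets the caller decide whether this is a log line or a hard failure.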


🔍 Code Quality

Positive:

  • ✅ Comprehensive logging throughout (though verify log levels in production)
  • ✅ Proper error handling with try-catch blocks
  • ✅ Good use of structured logging with context
  • ✅ Type safety maintained throughout

Concerns:

1. Excessive Logging (state-manager.ts)
The state manager file is untouched in this PR but shows info level logs for routine operations:

this.#actor.rLog.info({ msg: "savePersistInner called", ... });
this.#actor.rLog.info({ msg: "prepared entries for kvBatchPut", ... });
this.#actor.rLog.info({ msg: "calling kvBatchPut", ... });
this.#actor.rLog.info({ msg: "kvBatchPut completed successfully" });

Recommendation: Downgrade these to debug level to avoid log spam in production.

2. Import Ordering (pegboard-gateway/src/lib.rs:9-20)
The reformatting changes break from the standard import grouping pattern. Imports should be:

  1. External crates (rivet_*, pegboard)
  2. Internal modules (use crate::)

The current ordering mixes these. Consider reverting to alphabetical or logical grouping.

3. Removed Debug Context

-  request_id=?tunnel_id::request_id_to_string(&request_id),

This removes helpful context from the log at shared_state.rs:398. Was this intentional? The ack_index alone may not be sufficient for debugging.

4. Deleted Message Index Reset (shared_state.rs:127)

-entry.message_index = 0;

The removal of this line needs explanation. Is message index now preserved across reconnections? If so, add a comment explaining why.


⚡ Performance Considerations

  • The getAndWaitForActor method (mod.ts:348) could create a thundering herd if many requests arrive simultaneously during actor startup. Consider whether these can be batched, or add metrics.
  • Multiple awaits in the tunnel message handling path (tunnel.ts:638, 668, 779, 808) - ensure these don't introduce latency issues under load.

🧪 Testing

Missing Test Coverage:

  • No test changes included for this critical bug fix
  • Consider adding tests for:
    • WebSocket hibernation/restoration cycle
    • Actor start race conditions
    • Duplicate pending request handling
    • Error cleanup paths

Existing Tests:
The only test file change is a parameter order fix in actor-conn-hibernation.ts - this suggests the test may not have caught the original bug. Review test coverage.


🔒 Security

No immediate security concerns identified. The hibernation state management appears sound.


📝 Documentation

The restoreHibernatingRequests JSDoc (mod.ts:1582) is excellent and clearly explains:

  • When to call it
  • What it does
  • Important constraints

Great documentation!


🎯 Summary

Critical Issues:

  1. FIXED: Core hibernation bug in gateway close logic
  2. ⚠️ TODO: Resource leak cleanup in actor start error path (mod.ts:959)

Recommendations:

  1. Add cleanup for request mappings in error path
  2. Downgrade routine logs to debug level
  3. Add test coverage for hibernation scenarios
  4. Restore removed debug context or document why removed
  5. Document message_index behavior change

Overall Assessment: This is a solid fix for a critical bug. The architectural improvements around actor start synchronization are well-designed. The main concern is the unhandled TODO around resource cleanup in error cases. I'd recommend addressing that before merge.

Would you like me to help with any of these issues or have questions about the feedback?

@claude

claude bot commented Nov 20, 2025

PR Review: chore(rivetkit): fix hibernation implementation

This PR refactors the hibernation implementation for WebSocket connections. Overall, the changes improve clarity, but there are several critical areas that need attention.

Critical Issues

1. Inverted Boolean Logic in Rust (lib.rs:489)

The logic 'if !&lifecycle_res' appears incorrect. The negation '!' is being applied to a reference, which will always be true for any non-null reference. This will cause the gateway to send close frames at the wrong time.

2. Race Condition in Actor Start (mod.ts:933-945)

The PR adds hibernatingRequests to the requestToActor map BEFORE onActorStart completes. If incoming tunnel messages arrive during this time and the actor fails during onActorStart, there's a potential race condition.

3. Actor Removed After Stop (mod.ts:288-310)

closeActiveRequests is called BEFORE removing the actor. If another CommandStartActor arrives during this window, you could have duplicate actors briefly.

Potential Bugs

4. Missing Cleanup on Actor Start Failure

When onActorStart fails, requestToActor mappings were already added. Verify these are properly cleaned up.

5. Empty resolve/reject Functions (actor.ts:85-86)

The resolve and reject callbacks are empty no-ops. Document why this is intentional.

6. Duplicate WebSocket Detection (actor.ts:176-180)

setWebSocket returns early without replacing duplicates, while createPendingRequest replaces them. This inconsistency could cause bugs.

Code Quality

7. Typo in Log Message (tunnel.ts:762)

'catught' should be 'caught'

8. Inconsistent Logging

Standardize on including both gatewayId and requestId in WebSocket logs.

9. Removed message_index Reset (shared_state.rs:127)

Add comment explaining why removing this reset is safe.

Performance

10. Linear Search

Consider using Map instead of array for pendingRequests and webSockets if many concurrent connections become a bottleneck.

Test Coverage

11. Missing Tests

Add tests for: the critical Rust logic fix, race conditions during actor start, duplicate handling, and failure scenarios.

Documentation

12. Excellent Improvements

The JSDoc additions for hibernatableWebSocket, onActorStart, and restoreHibernatingRequests significantly improve API clarity.

Summary

Approve with Changes Required

The architectural changes are sound improvements. However, critical issues must be addressed:

  1. CRITICAL: Fix inverted boolean logic in lib.rs:489
  2. IMPORTANT: Address race condition in actor start
  3. IMPORTANT: Verify cleanup on actor start failure
  4. Add test coverage
  5. Fix typo and logging inconsistencies

@claude

claude bot commented Nov 20, 2025

PR Review: Hibernation Implementation Fix

Critical Issues

1. Logic Inversion Bug (pegboard-gateway/src/lib.rs:489)

The negation operator is applied to a reference rather than the boolean:

if !&lifecycle_res.as_ref().map_or_else(is_ws_hibernate, |_| false)

Should be:

if !lifecycle_res.as_ref().map_or_else(is_ws_hibernate, |_| false)

Major Concerns

2. Duplicate WebSocket Prevention (runner/src/actor.ts:176-180)

Silent failure when duplicate detected - caller may expect WebSocket was registered. Either replace existing, throw error, or document why silent failure is OK.

3. Actor Removal Timing (runner/src/mod.ts:308-311)

Moving removeActor after closeActiveRequests changes timing. Need detailed comments on race condition prevention.


Code Quality

  • message_index field reset removed (shared_state.rs:127) - verify if field should be removed entirely
  • Missing gatewayId in error logs (runner/src/tunnel.ts:142-158)
  • Logging level changed from error to warn in getActor

Performance

Promise.allSettled silently swallows errors. Consider logging which WebSockets failed to restore.
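
As a sketch of that suggestion: `Promise.allSettled` never rejects, so failures have to be read back from the result array. `restoreAndReport` and `RestoreFailure` are hypothetical names, and string request IDs are an assumption:

```typescript
// Hypothetical sketch: surface which restorations failed after allSettled.
interface RestoreFailure {
    requestId: string;
    reason: string;
}

async function restoreAndReport(
    requestIds: string[],
    restoreOne: (requestId: string) => Promise<void>,
): Promise<RestoreFailure[]> {
    // The async wrapper also converts synchronous throws into rejections.
    const results = await Promise.allSettled(
        requestIds.map(async (id) => restoreOne(id)),
    );

    const failures: RestoreFailure[] = [];
    results.forEach((result, i) => {
        if (result.status === "rejected") {
            const reason =
                result.reason instanceof Error ? result.reason.message : String(result.reason);
            failures.push({ requestId: requestIds[i] ?? "unknown", reason });
        }
    });
    return failures;
}
```

The returned list can be passed to the structured logger so each failed WebSocket is recorded with its request ID.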


Security

No input validation on hibernation metadata. Add validation to prevent crashes from corrupted data.
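
A minimal validation sketch follows. The metadata shape used here (`gatewayId`, `requestId`, `messageIndex`) is an assumption for illustration only, since the actual persisted fields are not shown in this thread, and `parseHibernationMeta` is a hypothetical helper:

```typescript
// Hypothetical metadata shape; the real persisted fields may differ.
interface HibernationMeta {
    gatewayId: string;
    requestId: string;
    messageIndex: number;
}

// Returns the validated metadata, or undefined if the input is malformed.
function parseHibernationMeta(raw: unknown): HibernationMeta | undefined {
    if (typeof raw !== "object" || raw === null) return undefined;
    const { gatewayId, requestId, messageIndex } = raw as Record<string, unknown>;

    if (typeof gatewayId !== "string" || gatewayId.length === 0) return undefined;
    if (typeof requestId !== "string" || requestId.length === 0) return undefined;
    if (
        typeof messageIndex !== "number" ||
        !Number.isInteger(messageIndex) ||
        messageIndex < 0
    ) {
        return undefined;
    }

    return { gatewayId, requestId, messageIndex };
}
```

Entries that fail validation can be skipped and logged instead of crashing the restore path.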


Testing Gaps

Need tests for edge cases like already-restored state, mismatched metadata, concurrent operations.


Summary

Must Fix:

  1. Rust logic inversion bug
  2. Clarify duplicate WebSocket behavior

Should Address:

  • Error handling for partial restoration
  • Input validation
  • Test coverage

Positive: Good architectural improvements with actorStartPromise pattern and better separation of concerns.

Overall: Solid refactoring. Main blocker is Rust bug. Otherwise ready after clarifications.

@MasterPtato MasterPtato force-pushed the 11-19-chore_rivetkit_fix_hibernation_implementation branch from 193f67a to 8145fc3 Compare November 20, 2025 22:38
@MasterPtato MasterPtato force-pushed the 11-19-fix_tunnel_implement_ping_pong branch from d287067 to 40a9f30 Compare November 20, 2025 22:38
@claude

claude bot commented Nov 20, 2025

Pull Request Review: Fix Hibernation Implementation

Summary

This PR fixes critical issues in the hibernation implementation for WebSocket connections. The changes primarily affect the TypeScript runner SDK and the Rust pegboard gateway, with a focus on proper lifecycle management and race condition prevention during actor startup and WebSocket restoration.


Code Quality & Best Practices

Strengths

  1. Race Condition Fixes: The PR introduces proper synchronization with actorStartPromise to prevent race conditions between actor startup and incoming requests. This is a critical improvement.

  2. Better Separation of Concerns: Moving hibernation restoration responsibility to the user (via restoreHibernatingRequests) with clear documentation is a good architectural decision.

  3. Improved Logging: Added structured debug/info logs throughout help with observability (e.g., tracking pending request counts, message indices).

  4. State Tracking: The hibernationRestored flag prevents double-restoration, which is good defensive programming.


Potential Bugs & Issues

🔴 Critical Issues

  1. Negation Bug Fix in pegboard-gateway/src/lib.rs:489

    // Before: if lifecycle_res
    // After:  if !&lifecycle_res

    This fixes inverted logic for hibernation detection. However, the & reference operator seems unnecessary here:

    if !lifecycle_res  // This should be sufficient

    The &lifecycle_res creates a temporary reference that is then negated, which works but is unconventional.

  2. Removed entry.message_index = 0 in shared_state.rs:127
    This line was removed during hibernation restoration. Need to verify this doesn't break message ordering - the removal suggests message_index is now managed elsewhere, but this should be explicitly validated.

⚠️ Medium Priority Issues

  1. Actor Replacement Warning in mod.ts:943

    if (existingActor) {
        this.log?.warn({
            msg: "replacing existing actor in actors map",
            // ...
        });
    }

    While logging is good, silently replacing actors could indicate a serious issue. Should this throw an error instead, or at minimum, clean up the old actor's resources first?

  2. Duplicate Prevention vs. Error Handling in actor.ts:176-180

    if (exists) {
        logger()?.warn({ msg: "attempting to set websocket twice" });
        return;  // Silent return
    }

    This silently returns on duplicate WebSocket registration. Should this throw an error to surface potential bugs in the calling code?

  3. Error Handling in mod.ts:988

    instance.actorStartPromise.reject(err);
    
    // TODO: Mark as crashed
    await this.forceStopActor(actorId, generation);

    The TODO comment suggests incomplete error handling. What happens to actors that fail to start? Are they properly cleaned up?


Performance Considerations

🟡 Minor Performance Notes

  1. Sequential Processing in tunnel.ts:134-276
    The hibernation restoration processes WebSockets sequentially in a for-loop. For actors with many hibernating connections, this could be slow. Consider:

    // Current: sequential processing
    for (const { gatewayId, requestId } of actor.hibernatingRequests) { ... }
    
    // Potential: parallel processing with Promise.allSettled
    await Promise.allSettled(
        actor.hibernatingRequests.map(async ({ gatewayId, requestId }) => { ... })
    );

    However, the current approach may be intentional for ordered processing.

  2. Linear Search in Pending Requests
    actor.ts:58-62 uses linear search through pendingRequests array. For actors with many concurrent requests, this could become a bottleneck. Consider using a Map<string, PendingRequest> keyed by gatewayId + requestId.


Security Concerns

🔒 Security Review

  1. No Security Issues Identified: The changes don't introduce obvious security vulnerabilities like XSS, injection attacks, or authentication bypasses.

  2. Resource Cleanup: The PR properly cleans up WebSocket connections that fail to restore, preventing resource leaks.

  3. Error Messages: Error messages don't leak sensitive information.


Test Coverage

📋 Testing Notes

  1. Limited Test Changes: Only minor changes to test files:

    • driver-test-suite/tests/actor-conn-hibernation.ts (2 line change)
    • driver-test-suite/mod.ts (2 line change)
  2. Missing Test Cases: Consider adding tests for:

    • Race conditions during actor startup with concurrent requests
    • Duplicate actor creation scenarios
    • Failed hibernation restoration
    • Actor start failure cleanup
    • Message index tracking across hibernation/restoration cycles

Specific Code Review Comments

engine/sdks/typescript/runner/src/mod.ts

Line 913-932: Actor creation and request mapping race condition fix

// IMPORTANT: Make sure no async code runs before inserting #actors and
// calling addRequestToActor in order to prevent race conditions with
// subsequence commands

Good: The comment clearly explains the critical ordering requirement. The code correctly populates requestToActor before any async operations.

Line 966-982: Actor startup error handling

} catch (err) {
    this.log?.error({
        msg: "error starting runner actor",
        actorId,
        err,
    });

    instance.actorStartPromise.reject(err);

    // TODO: Mark as crashed
    await this.forceStopActor(actorId, generation);
}

⚠️ Question: What happens to actors in the #actors map when startup fails? They're added at line 954 but may not be properly cleaned if forceStopActor doesn't handle this case.

engine/sdks/typescript/runner/src/actor.ts

Line 65-99: createPendingRequest method

if (exists) {
    logger()?.warn({
        msg: "attempting to set pending request twice, replacing existing",
        // ...
    });
    this.deletePendingRequest(gatewayId, requestId);
}

Good: The method handles duplicates gracefully by logging and cleaning up.

⚠️ Note: The duplicate object creation between createPendingRequest and createPendingRequestWithStreamController suggests these could be refactored to share common logic.

engine/packages/pegboard-gateway/src/lib.rs

Line 489: Fixed negation logic

if !&lifecycle_res

⚠️ Style: Consider simplifying to if !lifecycle_res (remove the & reference operator).


Documentation

📚 Documentation Quality

Excellent: The new restoreHibernatingRequests method in mod.ts:1587-1614 has comprehensive JSDoc comments explaining:

  • What the method does
  • When to call it
  • What it handles
  • Important ordering requirements

Recommendations

  1. Fix the reference operator in pegboard-gateway/src/lib.rs:489 - change !&lifecycle_res to !lifecycle_res

  2. Improve actor replacement handling - Either throw an error when replacing an existing actor, or ensure proper cleanup of the old actor's resources

  3. Complete the TODO at mod.ts:988 - Implement proper "crashed" state marking for failed actors

  4. Add test coverage for the race condition fixes and error scenarios

  5. Consider refactoring duplicate code in createPendingRequest methods

  6. Document the message_index removal in shared_state.rs with a comment explaining why it's no longer needed

  7. Verify message ordering still works correctly after removing message_index = 0 reset


Conclusion

This PR makes important fixes to the hibernation implementation, particularly around race conditions and lifecycle management. The code quality is generally good with excellent documentation. The main concerns are:

  1. A few minor code style issues (unnecessary & operator)
  2. Incomplete error handling (TODO comments)
  3. Potential resource cleanup issues when actors are replaced
  4. Limited test coverage for the new scenarios

Overall, the changes improve the system's reliability, but I recommend addressing the critical issues before merging.


Review generated by Claude Code

@claude

claude bot commented Nov 20, 2025

Code Review: Hibernation Implementation Fix

Overview

This PR fixes the hibernation implementation for RivetKit by addressing race conditions, improving WebSocket restoration flow, and enhancing error handling. The changes span both Rust (pegboard-gateway) and TypeScript (runner SDK) components.


Critical Bug Fix ✅

Location: engine/packages/pegboard-gateway/src/lib.rs:489

- if lifecycle_res
+ if !&lifecycle_res

Issue: This was a critical logic bug - the condition was inverted. The code was sending close frames when hibernating (should not) and not sending them when not hibernating (should).

Impact: This fix ensures WebSocket close frames are only sent to the runner when NOT hibernating, which is the correct behavior.


Architecture Changes 🏗️

1. Actor Lifecycle Management (mod.ts:915-990)

Improvements:

  • Added actorStartPromise to synchronize actor initialization
  • Pre-populate requestToActor map BEFORE async operations to prevent race conditions
  • Better handling of existing actors with warning logs

Concern:

// TODO: Where is this GC'd if something fails?
for (const hr of startCommand.hibernatingRequests) {
    this.#tunnel.addRequestToActor(hr.gatewayId, hr.requestId, actorId);
}

This TODO raises a valid concern - if actor initialization fails after populating the map, these entries may leak. Consider adding cleanup in the error handler.

Recommendation: Add cleanup in the catch block:

} catch (err) {
    // ... existing error handling ...
    
    // Cleanup requestToActor mappings on failure
    for (const hr of startCommand.hibernatingRequests) {
        this.#tunnel.removeRequestToActor(hr.gatewayId, hr.requestId);
    }
    
    instance.actorStartPromise.reject(err);
    await this.forceStopActor(actorId, generation);
}

2. Hibernation Restoration Flow (tunnel.ts:105-280)

Good:

  • Clear separation of concerns - caller loads metadata, tunnel restores connections
  • Added hibernationRestored flag to prevent double restoration
  • Proper error handling for missing actors or duplicate restoration attempts

Issue - API Design:
The restoreHibernatingRequests method has shifted responsibility:

  • Before: Tunnel loaded metadata internally
  • After: Caller must load metadata and pass it in

While this improves separation of concerns, the documentation in mod.ts:73-108 needs updating to reflect that onActorStart is now responsible for both loading AND calling restoreHibernatingRequests.

Current documentation says:

loadAll will be called from Tunnel::restoreHibernatingRequests

But now:

The user of this library is responsible for loading all persisted hibernatable WebSocket metadata

This is better design, but ensure all consumers are updated.


3. Race Condition Prevention (tunnel.ts:444-454)

Excellent addition:

async getAndWaitForRequestActor(
    gatewayId: GatewayId,
    requestId: RequestId,
): Promise<RunnerActor | undefined> {
    const actor = this.getRequestActor(gatewayId, requestId);
    if (!actor) return;
    await actor.actorStartPromise.promise;
    return actor;
}

This ensures tunnel messages wait for actor initialization before processing, preventing race conditions when messages arrive during startup.
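The gating behavior can be reproduced in isolation. This is a minimal sketch, assuming a hand-rolled `deferred` helper in place of the runner's promiseWithResolvers utility; `GatedActor` is a hypothetical class, not `RunnerActor`:

```typescript
// Minimal deferred helper, similar in spirit to a promiseWithResolvers utility.
interface Deferred<T> {
  promise: Promise<T>;
  resolve: (value: T) => void;
  reject: (reason: unknown) => void;
}

function deferred<T>(): Deferred<T> {
  let resolve!: (value: T) => void;
  let reject!: (reason: unknown) => void;
  const promise = new Promise<T>((res, rej) => {
    resolve = res;
    reject = rej;
  });
  return { promise, resolve, reject };
}

// Hypothetical actor record: a start gate plus a log of processed messages.
class GatedActor {
  readonly started = deferred<void>();
  readonly processed: string[] = [];
  private ready = false;

  // Messages wait on the start gate, so none is handled before initialization finishes.
  async handleMessage(msg: string): Promise<void> {
    await this.started.promise;
    if (!this.ready) throw new Error("message handled before actor start");
    this.processed.push(msg);
  }

  // Called once initialization is done; releases every waiting message.
  finishStart(): void {
    this.ready = true;
    this.started.resolve();
  }
}
```

A message sent before `finishStart()` stays pending and is processed only after the gate resolves.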


Code Quality Issues 🔍

1. Pending Request Duplication (actor.ts:65-133)

Issue:

const exists = this.getPendingRequest(gatewayId, requestId) !== undefined;
if (exists) {
    logger()?.warn({
        msg: "attempting to set pending request twice, replacing existing",
        // ...
    });
    this.deletePendingRequest(gatewayId, requestId);
}

Concern: Why would we attempt to set a pending request twice? This warning suggests a potential logic error in the caller. Consider:

  1. Is this expected behavior or a bug being masked?
  2. Should this throw an error instead of warning?
  3. Are there race conditions causing duplicate calls?

Recommendation: Investigate why duplicates occur. If it's truly expected, document why. If it's a bug, fix the caller instead of working around it.


2. WebSocket Duplication Handling (actor.ts:173-181)

const exists = this.getWebSocket(gatewayId, requestId) !== undefined;
if (exists) {
    logger()?.warn({ msg: "attempting to set websocket twice" });
    return;  // Silent failure
}

Issue: Silent failures can mask bugs. This should either:

  1. Throw an error if duplicates indicate a bug
  2. Document why duplicates are expected and safe to ignore
  3. Return a success/failure indicator
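Option 3 could look roughly like the sketch below. `WebSocketRegistry` and `SetWebSocketResult` are hypothetical names introduced for illustration, and IDs are modelled as strings rather than the runner's real ID types:

```typescript
type SetWebSocketResult = "added" | "duplicate";

// Hypothetical registry; the real actor tracks sockets differently.
class WebSocketRegistry<S> {
  private readonly sockets = new Map<string, S>();

  // Keeps the existing socket on a duplicate, but reports it to the caller
  // instead of returning silently.
  set(gatewayId: string, requestId: string, socket: S): SetWebSocketResult {
    const key = `${gatewayId}:${requestId}`;
    if (this.sockets.has(key)) {
      return "duplicate";
    }
    this.sockets.set(key, socket);
    return "added";
  }

  get(gatewayId: string, requestId: string): S | undefined {
    return this.sockets.get(`${gatewayId}:${requestId}`);
  }
}
```

The caller can then decide whether a `"duplicate"` result is expected (log and continue) or a bug (throw).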

3. Removed Field (shared_state.rs:127)

- entry.message_index = 0;

Question: Where is message_index used? Ensure removing this reset doesn't cause index tracking issues across reconnections.


Logging Improvements ✅

Good additions:

  • tunnel.ts:510: Added messageIndex to send logs
  • shared_state.rs:196: Added detailed pending message tracking
  • Consistent use of structured logging with context

Minor issue:

tracing::debug!(
-   request_id=?tunnel_id::request_id_to_string(&request_id),
    ack_index,
    removed_count = len_before - len_after,

Removing request_id from logs may make debugging harder. Consider keeping it.


Performance Considerations ⚡

  1. Background Operations (tunnel.ts:129): Using Promise.allSettled(backgroundOperations) is good for handling restoration concurrently, but be mindful of:

    • Potential memory pressure with many hibernating connections
    • Order dependencies between operations
  2. Linear Search (actor.ts:58-62): getPendingRequest uses linear search with find(). Consider using a Map keyed by gatewayId+requestId if pendingRequests arrays grow large.
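A Map-based lookup might be sketched as follows, assuming gateway and request IDs are raw byte arrays (an assumption; the diff does not show their exact types). `compositeKey` and `PendingRequestIndex` are hypothetical names:

```typescript
// Hex-encodes both IDs so that byte arrays with equal contents map to the same key.
function compositeKey(gatewayId: Uint8Array, requestId: Uint8Array): string {
  const hex = (bytes: Uint8Array): string =>
    Array.from(bytes, (b) => ("0" + b.toString(16)).slice(-2)).join("");
  return `${hex(gatewayId)}:${hex(requestId)}`;
}

// Hypothetical index offering O(1) average lookups instead of Array.find().
class PendingRequestIndex<T> {
  private readonly byKey = new Map<string, T>();

  set(gatewayId: Uint8Array, requestId: Uint8Array, value: T): void {
    this.byKey.set(compositeKey(gatewayId, requestId), value);
  }

  get(gatewayId: Uint8Array, requestId: Uint8Array): T | undefined {
    return this.byKey.get(compositeKey(gatewayId, requestId));
  }

  delete(gatewayId: Uint8Array, requestId: Uint8Array): boolean {
    return this.byKey.delete(compositeKey(gatewayId, requestId));
  }
}
```

The separator keeps the two IDs from running together, so `[1]` + `[2, 255, 0]` and `[1, 2]` + `[255, 0]` produce different keys.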


Security Concerns 🔒

No major security issues identified. Good practices:

  • Proper error handling without exposing internals
  • Validation of actor existence before operations
  • Close codes used appropriately (1000 for normal, error codes for abnormal)

Test Coverage 📊

Observation: No test file changes in this PR.

Recommendations:

  1. Add tests for the inverted condition fix (lib.rs:489)
  2. Test race condition scenarios with concurrent actor start + message arrival
  3. Test hibernation restoration with missing/extra metadata
  4. Test duplicate request/websocket detection paths

Example test scenarios:

  • Actor receives tunnel message during initialization (before actorStartPromise resolves)
  • Restoration with metadata that doesn't match gateway's hibernating requests
  • Actor initialization failure after requestToActor population

Documentation 📚

Good:

  • Comprehensive JSDoc updates in mod.ts
  • Clear comments about race condition prevention

Needs improvement:

  • Update hibernatableWebSocket.loadAll documentation (now removed from interface)
  • Document why duplicate pending requests might occur
  • Add inline comments explaining the actor lifecycle state machine

Summary

Strengths:
✅ Critical bug fix (inverted condition)
✅ Improved race condition handling with promise synchronization
✅ Better separation of concerns in hibernation flow
✅ Enhanced logging for debugging

Areas for improvement:
⚠️ Memory leak potential in requestToActor map on actor init failure
⚠️ Investigate root cause of duplicate pending request warnings
⚠️ Add test coverage for critical paths
⚠️ Consider error handling improvements (throw vs warn+return)

Overall verdict: This is a significant improvement to the hibernation implementation. The critical bug fix alone justifies merging, but please address the memory leak concern and consider adding tests for the new synchronization logic.


Priority fixes before merge:

  1. Add cleanup for requestToActor mappings in actor init error handler
  2. Add at least one test for the inverted condition bug fix

Follow-up tasks:

  1. Investigate duplicate pending request warnings
  2. Add comprehensive test coverage for hibernation restoration
  3. Profile performance with many concurrent hibernating connections

@NathanFlurry NathanFlurry force-pushed the 11-19-fix_tunnel_implement_ping_pong branch from 40a9f30 to d287067 Compare November 20, 2025 23:32
@NathanFlurry NathanFlurry force-pushed the 11-19-chore_rivetkit_fix_hibernation_implementation branch from 8145fc3 to 193f67a Compare November 20, 2025 23:32
@claude

claude bot commented Nov 20, 2025

Code Review - Hibernation Implementation Fix

This PR addresses critical issues in the WebSocket hibernation implementation. I've conducted a thorough review focusing on code quality, potential bugs, performance, and security.


🎯 Critical Issues

1. CRITICAL: Logic Error in Rust (pegboard-gateway/src/lib.rs:489)

-if lifecycle_res
+if !&lifecycle_res

Status: ✅ FIXED

This is the core fix! The original logic was inverted - it would send close frames when it should hibernate. However, the syntax !&lifecycle_res is concerning. You're negating a reference, not the value.

Recommendation: Change to !lifecycle_res.as_ref().map_or_else(is_ws_hibernate, |_| false) for clarity, or verify that the current implementation correctly evaluates to what you intend.

2. CRITICAL: Resource Leak on Actor Start Failure (mod.ts:959)

// TODO: Where is this GC'd if something fails?
for (const hr of startCommand.hibernatingRequests) {
    this.#tunnel.addRequestToActor(hr.gatewayId, hr.requestId, actorId);
}

If onActorStart throws (line 983), the requestToActor mappings added above are never cleaned up. The catch block calls forceStopActor, but this may not remove these specific mappings.

Impact: Memory leak, incorrect routing of future requests

Fix Required:

} catch (err) {
    this.log?.error({ msg: "error starting runner actor", actorId, err });
    instance.actorStartPromise.reject(err);
    
    // Clean up request-to-actor mappings
    for (const hr of startCommand.hibernatingRequests) {
        this.#tunnel.removeRequestToActor(hr.gatewayId, hr.requestId);
    }
    
    await this.forceStopActor(actorId, generation);
}

Note: You'll need to make removeRequestToActor public or add a cleanup method to Tunnel.

3. Race Condition: Actor Removal Timing (mod.ts:305-310)

this.#tunnel?.closeActiveRequests(actor);

// Remove actor after stopping
this.#removeActor(actorId, generation);

this.#sendActorStateUpdate(actorId, actor.generation, "stopped");

closeActiveRequests is called while the actor is still in #actors map. If a new CommandStartActor arrives during this time, you could briefly have duplicate actors. Consider moving #removeActor before closeActiveRequests.


⚠️ Important Issues

4. Inconsistent Duplicate Handling (actor.ts:176-180 vs 65-80)

setWebSocket silently returns without adding duplicates:

if (exists) {
    logger()?.warn({ msg: "attempting to set websocket twice" });
    return; // Silently fails
}

But createPendingRequest deletes the existing one and replaces:

if (exists) {
    logger()?.warn({ msg: "attempting to set pending request twice, replacing existing", ... });
    this.deletePendingRequest(gatewayId, requestId);
}

Impact: Inconsistent behavior could cause subtle bugs

Recommendation: Either make both consistent, or document why they differ.

5. Missing Message Index Reset Documentation (shared_state.rs:127)

The line entry.message_index = 0; was removed. This appears to be an intentional behavior change where message indices are now preserved across reconnections.

Recommendation: Add a comment explaining this behavior change and why it's correct for hibernation.

6. Unhandled Background Promises (mod.ts:863-867)

if (commandWrapper.inner.tag === "CommandStartActor") {
    this.#handleCommandStartActor(commandWrapper);
} else if (commandWrapper.inner.tag === "CommandStopActor") {
    this.#handleCommandStopActor(commandWrapper);
}

These async functions are fire-and-forget. Unhandled rejections could cause silent failures.

Fix:

this.#handleCommandStartActor(commandWrapper).catch(err => {
    this.log?.error({ msg: "unhandled error in CommandStartActor", err });
});

📝 Code Quality

7. Typo (tunnel.ts:762)

"catught" should be "caught"

8. Code Duplication (actor.ts:65-137)

createPendingRequest and createPendingRequestWithStreamController share ~80% of their code. Consider refactoring:

createPendingRequest(
    gatewayId: protocol.GatewayId,
    requestId: protocol.RequestId,
    clientMessageIndex: number,
    streamController?: ReadableStreamDefaultController<Uint8Array>,
) {
    // Unified implementation
}

9. Removed Debug Context (shared_state.rs:395)

-request_id=?tunnel_id::request_id_to_string(&request_id),

This removes helpful debugging context. Was this intentional? Consider keeping it.


✅ Positive Changes

  1. Excellent synchronization pattern with actorStartPromise - prevents race conditions
  2. Improved API clarity - restoreHibernatingRequests is now the caller's responsibility with clear documentation
  3. Better logging - structured logging with context throughout
  4. Defensive programming - hibernationRestored flag prevents double-restoration
  5. Good use of TypeScript - proper typing maintained throughout

🔍 Testing Recommendations

The test file shows minimal changes. Consider adding tests for:

  1. ✅ Actor start failure cleanup (addresses issue #2)
  2. ✅ Hibernation state restoration after actor restart
  3. ✅ Duplicate WebSocket/request handling
  4. ✅ Message index preservation across hibernation
  5. ✅ The Rust close frame logic (integration test)

🏎️ Performance Considerations

Linear searches in arrays (actor.ts, tunnel.ts):
Multiple findIndex operations on pendingRequests, webSockets, and requestToActor arrays could become O(n) bottlenecks under load.

Recommendation: For production at scale, consider using Maps with composite keys for O(1) lookups.


🔒 Security

No security vulnerabilities identified. Proper isolation between actors is maintained.


Summary

Must Fix Before Merge:

  1. Resource leak in actor start error path (#2)
  2. Rust boolean logic syntax verification (#1)
  3. Unhandled promise rejections (#6)

Should Fix:

  1. Race condition in actor removal (#3)
  2. Inconsistent duplicate handling (#4)
  3. Typo (#7)

Nice to Have:

  1. Document message_index behavior (#5)
  2. Refactor duplicate code (#8)
  3. Restore debug context (#9)
  4. Add comprehensive tests
  5. Performance optimizations for high-load scenarios

Overall Assessment: 7/10 - Solid architectural improvements and critical bug fixes, but needs cleanup in error handling paths before merge.

Great work on identifying and fixing the hibernation logic bug! 🎉

@claude

claude bot commented Nov 21, 2025

Pull Request Review: Hibernation Implementation Fix

I've reviewed this PR which fixes the hibernation implementation. Overall, the changes look good and address important race conditions and lifecycle management issues. Here are my findings:


✅ Positive Changes

  1. Critical Bug Fix in Rust Code (pegboard-gateway/src/lib.rs:489)

    • Fixed the hibernation check from if lifecycle_res to if !&lifecycle_res
    • This was causing WebSocket close frames to be sent incorrectly when hibernating
    • This is a critical fix that properly prevents closing hibernating connections
  2. Better Race Condition Handling (runner/src/mod.ts)

    • Added actorStartPromise to coordinate actor initialization with incoming tunnel messages
    • Pre-populates requestToActor map BEFORE any async code runs (lines 956-962)
    • Ensures incoming messages wait for actor initialization to complete
    • Good use of the promiseWithResolvers pattern for coordination
  3. Improved API Design (runner/src/mod.ts)

    • Made restoreHibernatingRequests a public method on Runner (line 1607)
    • Clear documentation explaining when and how to call this method
    • Better separation of concerns - caller is now responsible for loading metadata
  4. Better Request Management (runner/src/actor.ts)

    • Refactored setPendingRequest into createPendingRequest and createPendingRequestWithStreamController
    • Checks for duplicates and logs warnings appropriately
    • Cleaner separation of concerns
  5. Enhanced Logging Throughout

    • Added structured logging with relevant context (gateway IDs, request IDs, message indices)
    • Helps with debugging hibernation-related issues
    • Follows the structured logging conventions from CLAUDE.md

🔍 Potential Issues & Concerns

1. TODO Not Addressed (runner/src/mod.ts:959)

// TODO: Where is this GC'd if something fails?
for (const hr of startCommand.hibernatingRequests) {
    this.#tunnel.addRequestToActor(hr.gatewayId, hr.requestId, actorId);
}

Issue: If actor start fails after populating the requestToActor map, these entries may leak.

Recommendation: Consider wrapping the try-catch to clean up these mappings on failure:

try {
    // ... actor start logic
} catch (err) {
    // Clean up request mappings
    for (const hr of startCommand.hibernatingRequests) {
        this.#tunnel.removeRequestToActor(hr.gatewayId, hr.requestId);
    }
    // ... rest of error handling
}

2. Actor Replacement Warning (runner/src/mod.ts:943-952)

The code logs a warning when replacing an existing actor but doesn't clean up the old actor's resources:

if (existingActor) {
    this.log?.warn({
        msg: "replacing existing actor in actors map",
        // ...
    });
}

Recommendation: Consider calling cleanup on the existing actor before replacement, or validate that this shouldn't happen at all.

3. Inconsistent Error Severity (runner/src/mod.ts:327,337)

Changed from error to warn for missing actors and generation mismatches:

this.log?.warn({ msg: "actor not found" });

Question: Is this intentional? If these scenarios are expected during normal operation (e.g., race conditions), warn is appropriate. Otherwise, error might be better.

4. SetWebSocket Duplicate Handling (runner/src/actor.ts:176-180)

const exists = this.getWebSocket(gatewayId, requestId) !== undefined;
if (exists) {
    logger()?.warn({ msg: "attempting to set websocket twice" });
    return;  // Silently returns without throwing
}

Concern: Unlike createPendingRequest which replaces the existing entry, setWebSocket silently returns. This inconsistency could hide bugs.

Recommendation: Either throw an error or document why silent failure is the correct behavior here.

5. Message Index Tracking (pegboard-gateway/src/shared_state.rs:127)

The line entry.message_index = 0; was removed.

Question: Is this intentional? The comment at line 193 suggests message indices are being tracked. Need to verify this doesn't break message ordering.

6. Logging in Production (rivetkit/src/actor/instance/state-manager.ts:431-445)

Very verbose logging at info level for every connection persistence:

this.#actor.rLog.info({
    msg: "persisting connection",
    connId,
    gatewayId: tunnelId.gatewayIdToString(hibernatableDataRaw.requestId),
    // ... lots of details
});

Recommendation: Consider downgrading some of these to debug level to reduce log volume in production.


🔒 Security Considerations

No significant security issues identified. The changes properly validate state and handle error cases.


⚡ Performance Considerations

  1. Array Linear Search (runner/src/actor.ts:58-62, 164-168)

    • Using find() on arrays for pending requests and WebSockets
    • If these arrays grow large, consider using Map with a composite key
    • Current implementation is probably fine for typical use cases
  2. Message Index Logging (runner/src/tunnel.ts:510)

    • Debug logging on every message send could be expensive at high volume
    • Consider using sampling or conditional logging based on log level

🧪 Test Coverage

Observation: The PR modifies test files:

  • driver-test-suite/tests/actor-conn-hibernation.ts
  • driver-test-suite/mod.ts

The changes appear to be minor adjustments (2 lines changed).

Recommendation: Ensure test coverage for:

  1. Actor start failure cleanup (the TODO mentioned above)
  2. Duplicate WebSocket handling
  3. Race conditions between actor start and incoming messages
  4. Restoration of hibernating connections with stale metadata

📝 Code Quality

  1. Good adherence to CLAUDE.md conventions:

    • ✅ Uses structured logging correctly
    • ✅ Log messages are lowercase
    • ✅ Proper TypeScript patterns
  2. Documentation:

    • ✅ Good JSDoc comments on public methods
    • ✅ Clear explanation of the hibernation restoration flow
  3. Code organization:

    • ✅ Clean separation of concerns
    • ✅ Meaningful method names

🎯 Recommendations Summary

High Priority:

  1. Address the resource leak TODO (line 959)
  2. Clean up existing actor resources before replacement
  3. Clarify the behavior when setting duplicate WebSockets

Medium Priority:

  4. Verify message_index removal doesn't break ordering
  5. Review log level choices (info vs debug vs warn)
  6. Add test coverage for edge cases

Low Priority:

  7. Consider using Map instead of Array for connection tracking
  8. Consider log sampling for high-frequency debug logs


✅ Conclusion

This is a solid PR that fixes a critical hibernation bug and improves the overall robustness of the system. The race condition handling and API improvements are well thought out. The main concerns are around resource cleanup in failure scenarios and some minor inconsistencies in error handling.

Recommendation: Approve with minor changes requested (address the high-priority items above).

Great work on improving the hibernation implementation! 🚀

@claude

claude bot commented Nov 21, 2025

Code Review: Hibernation Implementation Fix

I've reviewed the changes in this PR. Overall, the fix addresses critical issues in the hibernation implementation across both Rust (pegboard-gateway) and TypeScript (runner SDK and RivetKit). Here are my findings:


Strengths

1. Critical Bug Fix in Rust

  • File: engine/packages/pegboard-gateway/src/lib.rs:489
  • Fix: Changed if lifecycle_res to if !&lifecycle_res (negation added)
  • Impact: This fixes the logic for determining when to send close frames to runners. Previously, close frames were sent when hibernating (incorrect), now they're properly sent when NOT hibernating.

2. Improved Synchronization in TypeScript Runner

  • Files: engine/sdks/typescript/runner/src/mod.ts, engine/sdks/typescript/runner/src/actor.ts
  • Enhancement: Introduction of actorStartPromise to ensure proper sequencing:
    • Hibernating requests are registered BEFORE async onActorStart runs
    • Incoming messages wait for actor initialization via await actor.actorStartPromise.promise
    • This prevents race conditions where messages arrive before the actor is ready

3. Better API Design for Hibernation Restoration

  • File: engine/sdks/typescript/runner/src/mod.ts:1607-1614
  • Change: Moved restoreHibernatingRequests to public API with excellent documentation
  • Benefit: Clear contract for users - they must call this at the end of onActorStart, preventing misuse

4. Enhanced Logging and Debugging

  • Added structured logging throughout with context (gatewayId, requestId, messageIndex)
  • Removed unnecessary request_id parameter from ack logging (line 398 in shared_state.rs)
  • Added hibernation restoration count tracking

⚠️ Issues & Concerns

1. Removed message_index Reset - Potential Issue

  • File: engine/packages/pegboard-gateway/src/shared_state.rs:127
  • Change: Removed entry.message_index = 0; when reopening a request
  • Concern: If a WebSocket reconnects after hibernation, should the message index continue from where it left off or reset? The removal suggests continuation, but this needs verification:
    • Does the gateway track message indices across hibernation cycles?
    • Could this cause message ID conflicts or ordering issues?
  • Recommendation: Add a comment explaining why this line was removed and the expected behavior for message indices across hibernation boundaries.

2. Missing GC for Failed Actor Start

  • File: engine/sdks/typescript/runner/src/mod.ts:959
  • TODO Comment: // TODO: Where is this GC'd if something fails?
  • Issue: When onActorStart fails, the requestToActor mappings are populated but may not be cleaned up properly
  • Current Mitigation: forceStopActor is called on failure (line 998), which should close active requests
  • Recommendation: Verify that closeActiveRequests properly cleans up the requestToActor map, or explicitly handle cleanup in the error path

3. Code Duplication in createPendingRequest Methods

  • File: engine/sdks/typescript/runner/src/actor.ts:65-138
  • Issue: createPendingRequest and createPendingRequestWithStreamController share significant duplicate code
  • Recommendation: Refactor to reduce duplication:
createPendingRequest(
  gatewayId: protocol.GatewayId,
  requestId: protocol.RequestId,
  clientMessageIndex: number,
  streamController?: ReadableStreamDefaultController<Uint8Array>,
) {
  // ... shared validation and deletion logic ...
  this.pendingRequests.push({
    gatewayId,
    requestId,
    request: {
      resolve: () => {},
      reject: () => {},
      actorId: this.actorId,
      gatewayId,
      requestId,
      clientMessageIndex,
      ...(streamController && { streamController }),
    },
  });
  // ... shared logging ...
}

4. Potential Race Condition in setWebSocket

  • File: engine/sdks/typescript/runner/src/actor.ts:176-180
  • Change: Early return instead of delete-then-add when WebSocket exists
  • Concern: If the same WebSocket is being set twice, returning early might hide a bug rather than fixing it. The warning log is good, but:
    • Why would this happen in correct operation?
    • Should this be an error/throw instead of a warning?
  • Recommendation: Consider if this indicates a bug in the caller that should be fixed rather than worked around

5. Hibernation State Validation Missing

  • File: engine/sdks/typescript/runner/src/tunnel.ts:110-116
  • Good: Added hibernationRestored flag and assertion
  • Missing: The flag is checked but there's no enforcement that restoreHibernatingRequests is called for actors with hibernating requests
  • Recommendation: Consider adding a check in actor stop/cleanup to warn if an actor had hibernatingRequests but never called restore
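One possible shape for that stop-time check, as a sketch only. `HibernationTracker` is a hypothetical helper and not part of the runner SDK; it assumes the actor knows how many hibernating requests it was started with:

```typescript
// Hypothetical helper tracking whether restoration ever happened for an actor.
class HibernationTracker {
  private restored = false;
  private readonly hibernatingCount: number;

  constructor(hibernatingCount: number) {
    this.hibernatingCount = hibernatingCount;
  }

  // Mirrors the existing double-restoration guard.
  markRestored(): void {
    if (this.restored) {
      throw new Error("hibernating requests already restored");
    }
    this.restored = true;
  }

  // Returns a warning message if the actor stops without ever restoring; otherwise undefined.
  checkOnStop(): string | undefined {
    if (this.hibernatingCount > 0 && !this.restored) {
      return `actor stopped with ${this.hibernatingCount} hibernating requests never restored`;
    }
    return undefined;
  }
}
```

The stop path would log the returned message at warn level when it is defined.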

🔍 Security Considerations

No Immediate Security Issues

  • WebSocket close frame handling is properly secured
  • No injection vulnerabilities introduced
  • Proper error handling prevents information leakage

🧪 Testing

Positive:

  • Existing test actor-conn-hibernation.ts was minimally modified (just imports)
  • Test coverage appears to validate the hibernation flow

Recommendations:

  • Add specific test for the negation fix (!&lifecycle_res) to ensure close frames aren't sent during hibernation
  • Add test for actor start failures to verify requestToActor cleanup
  • Add test for duplicate setWebSocket calls to document expected behavior

📊 Performance Considerations

Positive:

  • getAndWaitForActor properly awaits initialization, preventing wasted work
  • Structured logging uses lazy evaluation (good for production)

Considerations:

  • Multiple arraysEqual calls for searching pending requests could be optimized with a Map if this becomes a bottleneck
  • The actorStartPromise adds minimal overhead but provides critical correctness

📝 Documentation

Excellent:

  • The restoreHibernatingRequests JSDoc is comprehensive and clear
  • Comments about race condition prevention (line 916-918) are valuable

Could Improve:

  • Add inline comments explaining the message_index removal in shared_state.rs
  • Document the expected behavior when setWebSocket is called twice

Summary

This PR makes important fixes to the hibernation implementation, particularly the critical negation bug in the Rust code. The TypeScript changes significantly improve synchronization and API clarity.

Recommended Actions Before Merge:

  1. ✅ Verify and document the message_index reset removal
  2. ✅ Confirm GC behavior for failed actor starts
  3. ⚠️ Consider refactoring duplicate code in createPendingRequest methods
  4. ⚠️ Evaluate if duplicate setWebSocket should be an error

Risk Assessment: Low - Changes fix actual bugs and improve robustness, with minimal risk of regression given existing test coverage.

Great work on improving the hibernation reliability! 🚀

@NathanFlurry NathanFlurry force-pushed the 11-19-fix_tunnel_implement_ping_pong branch from 40a9f30 to d287067 Compare November 21, 2025 01:27
@NathanFlurry NathanFlurry force-pushed the 11-19-chore_rivetkit_fix_hibernation_implementation branch from 8145fc3 to 193f67a Compare November 21, 2025 01:27
@claude

claude bot commented Nov 21, 2025

Pull Request Review: Hibernation Implementation Fix

Overview

This PR fixes critical issues with the hibernation implementation across the pegboard gateway, TypeScript runner SDK, and RivetKit.

Critical Bug Fixes

1. Negation Logic Fix in pegboard-gateway/src/lib.rs:489

  • Previously sending close frames to hibernating connections
  • Now correctly skips close frames only when hibernating
  • This was likely causing hibernated WebSocket connections to be prematurely closed

2. Message Index Reset Removed in shared_state.rs:127

  • Previously reset message index on reconnection, breaking message ordering
  • Now preserves message index continuity across hibernation cycles
  • Critical for maintaining proper WebSocket message sequencing

Architecture Improvements

3. Runner Actor Lifecycle Management
The new actorStartPromise pattern prevents race conditions between actor start and incoming tunnel messages.

Question: What happens if onActorStart rejects? The promise is rejected but actor remains in the map until forceStopActor completes. Could this leave zombie actors?

4. Hibernation Restoration API Redesign
The move from automatic loadAll to explicit restoreHibernatingRequests is a significant improvement with better control flow and error handling.

Potential Issues

5. WebSocket Duplication Check (actor.ts:176-180)
Silently returns instead of throwing when attempting to set websocket twice. Could hide bugs.

6. Pending Request Duplication (actor.ts:70-80, 108-117)
Automatically replaces existing pending requests without calling old callbacks. Should throw an error or call the old callbacks with an error.
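A sketch of the second option, rejecting the old entry before replacing it. `PendingRequests` and its string key are hypothetical simplifications of the runner's pending request bookkeeping:

```typescript
interface PendingRequest {
  resolve: (value: string) => void;
  reject: (reason: Error) => void;
}

// Hypothetical store keyed by a precomputed "gatewayId:requestId" string.
class PendingRequests {
  private readonly pending = new Map<string, PendingRequest>();

  // Replaces an existing entry, but rejects it first so its waiter is not left hanging.
  // Returns true when an existing entry was replaced.
  create(key: string, request: PendingRequest): boolean {
    const existing = this.pending.get(key);
    if (existing !== undefined) {
      existing.reject(new Error(`pending request ${key} was replaced`));
    }
    this.pending.set(key, request);
    return existing !== undefined;
  }
}
```

The boolean return value lets the caller keep the existing warning log while the old callbacks still receive a definite outcome.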

Test Coverage Concerns

No new tests for:

  • Hibernation restoration flow
  • Race condition fixes
  • Message index preservation across hibernation
  • Duplicate WebSocket/pending request scenarios

Recommendations

High Priority:

  1. Consider throwing instead of silent returns in setWebSocket duplicate case
  2. Handle old callbacks when replacing pending requests
  3. Add comprehensive tests for the new hibernation flow

Medium Priority:

  4. Document error recovery for actor start failures
  5. Consider adding metrics for hibernation restoration success/failure rates

Overall Assessment

This is a solid PR that fixes critical bugs in the hibernation implementation.

Strengths:

  • Fixes critical negation bug that would break hibernation
  • Fixes message index reset bug
  • Improves API design significantly
  • Good race condition handling

Recommendation: Approve with suggestions. The core fixes are correct and necessary. Suggested improvements are mostly defensive programming and testing.

@graphite-app
Contributor

graphite-app bot commented Nov 21, 2025

Merge activity

  • Nov 21, 2:40 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 21, 2:42 AM UTC: CI is running for this pull request on a draft pull request (#3515) due to your merge queue CI optimization settings.
  • Nov 21, 2:43 AM UTC: Merged by the Graphite merge queue via draft PR: #3515.

@graphite-app graphite-app bot closed this Nov 21, 2025
@graphite-app graphite-app bot deleted the 11-19-chore_rivetkit_fix_hibernation_implementation branch November 21, 2025 02:43