chore(rivetkit): fix hibernation implementation #3494

NathanFlurry · 2025-11-20T04:52:13Z

No description provided.

vercel · 2025-11-20T04:52:19Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
rivetkit-serverless	Error			Nov 20, 2025 8:52pm

3 Skipped Deployments

Project	Deployment	Preview	Updated (UTC)
rivet-cloud	Ignored		Nov 20, 2025 8:52pm
rivet-inspector	Ignored	Preview	Nov 20, 2025 8:52pm
rivet-site	Ignored	Preview	Nov 20, 2025 8:52pm

NathanFlurry · 2025-11-20T04:52:28Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

fix(pegboard-gateway): ping hibernating requests both during open hws connections and during hibernation #3498 : 2 dependent PRs (#3499 , #3501 )
fix(rivetkit): fix graceful runner shutdown #3497
fix(pegboard-gateway): write hws immediately on socket hibernate #3496
chore(rivetkit): fix hibernation implementation #3494 👈 (View in Graphite)
fix(tunnel): implement ping pong #3491 : 1 other dependent PR (#3475 )
chore(pegboard-gateway): add new message id format & add deprecated tunnel ack #3492
chore(pegboard-gateway): remove TunnelAck #3488
chore(engine-runner): remove echo close on ToClientWebSocketClose #3487
chore(rivetkit): implement new hibernating ws protocol #3464 : 1 other dependent PR (#3452 )
fix(actor-kv): fix listing keys #3484
fix(guard): handle actor stopped during ws/req #3481
chore: rename retry -> hibernate in runner protocol #3473
fix(udb): update backoff algo #3469
chore: write/update docs #3468
chore: add debug scripts #3467
main


claude · 2025-11-20T04:55:39Z

Code Review: Hibernation Implementation Fix

This PR addresses critical issues in the hibernation implementation for WebSocket connections. Overall, the changes improve the architecture and fix race conditions, though there are some areas that need attention.

🟢 Strengths

1. Critical Bug Fix in Rust Gateway (pegboard-gateway/src/lib.rs:489)

-if lifecycle_res
+if !&lifecycle_res

Excellent catch! This was a logic inversion bug - the code was sending close frames when it should hibernate, and vice versa. This fix is critical for proper hibernation behavior.

2. Race Condition Prevention (runner/src/mod.ts:915-962)

The addition of actorStartPromise and the careful sequencing in #handleCommandStartActor is well-designed:

Setting up the actor synchronously before any async operations
Populating requestToActor map before onActorStart runs
Using promises to coordinate actor initialization

The comment at line 916 is particularly valuable:

// IMPORTANT: Make sure no async code runs before inserting #actors and
// calling addRequestToActor in order to prevent race conditions with
// subsequence commands
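The ordering requirement in that comment can be sketched as follows. This is a minimal illustration, not the runner's code: `ActorRegistry`, `createDeferred`, and the string request keys are hypothetical stand-ins. The point it shows is that everything before the first `await` in an `async` method runs synchronously, so the maps are populated before any other command can interleave.

```typescript
// Hypothetical, simplified stand-ins; not the runner's real types.
interface Deferred {
    promise: Promise<void>;
    resolve: () => void;
    reject: (reason: unknown) => void;
}

function createDeferred(): Deferred {
    let resolve: () => void = () => {};
    let reject: (reason: unknown) => void = () => {};
    // The executor runs synchronously, so both callbacks are set on return.
    const promise = new Promise<void>((res, rej) => {
        resolve = () => res();
        reject = rej;
    });
    return { promise, resolve, reject };
}

class ActorRegistry {
    readonly actors = new Map<string, Deferred>();
    readonly requestToActor = new Map<string, string>();

    async startActor(
        actorId: string,
        hibernatingRequestKeys: string[],
        onActorStart: () => Promise<void>,
    ): Promise<void> {
        // Synchronous section: runs before the first `await`, so no other
        // command handler can observe a half-registered actor.
        const started = createDeferred();
        // Mark a possible rejection as handled in case nothing awaits it.
        started.promise.catch(() => {});
        this.actors.set(actorId, started);
        for (const key of hibernatingRequestKeys) {
            this.requestToActor.set(key, actorId);
        }

        // Asynchronous section.
        try {
            await onActorStart();
            started.resolve();
        } catch (err) {
            started.reject(err);
            // Undo the synchronous registration so nothing leaks.
            for (const key of hibernatingRequestKeys) {
                this.requestToActor.delete(key);
            }
            this.actors.delete(actorId);
        }
    }
}
```

Calling `startActor` without awaiting it still leaves both maps populated by the time the call returns, which is the property the comment asks for.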

3. Improved API Design (runner/src/mod.ts:109-146)

The refactored hibernatableWebSocket configuration is much clearer:

Removes the confusing loadAll method from the config
Makes the caller responsible for loading and passing metadata to restoreHibernatingRequests
Better separation of concerns with comprehensive documentation

4. Better State Management (runner/src/actor.ts:65-99)

Replacing setPendingRequest with separate createPendingRequest and createPendingRequestWithStreamController methods:

More explicit intent
Better type safety
Warns on duplicate requests instead of silently replacing

🟡 Issues & Concerns

1. Resource Leak on Actor Start Failure (runner/src/mod.ts:959)

// TODO: Where is this GC'd if something fails?
for (const hr of startCommand.hibernatingRequests) {
    this.#tunnel.addRequestToActor(hr.gatewayId, hr.requestId, actorId);
}

Critical: If onActorStart throws an exception (line 984), the requestToActor mappings added at line 961 are never cleaned up. The code calls forceStopActor (line 998), but this might not remove these mappings properly.

Recommendation:

try {
    await this.#config.onActorStart(actorId, generation, actorConfig);
    instance.actorStartPromise.resolve();
} catch (err) {
    this.log?.error({
        msg: "error starting runner actor",
        actorId,
        err,
    });
    
    instance.actorStartPromise.reject(err);
    
    // Clean up requestToActor mappings
    for (const hr of startCommand.hibernatingRequests) {
        this.#tunnel.removeRequestToActor(hr.gatewayId, hr.requestId);
    }
    
    await this.forceStopActor(actorId, generation);
}

2. Duplicate WebSocket Warning Without Action (runner/src/actor.ts:176-180)

const exists = this.getWebSocket(gatewayId, requestId) !== undefined;
if (exists) {
    logger()?.warn({ msg: "attempting to set websocket twice" });
    return; // Just returns without setting!
}

Issue: This silently fails when trying to add a duplicate WebSocket. Should this throw an error instead? Or delete the existing one first? The behavior differs from createPendingRequest which explicitly deletes and warns.

Recommendation: Be consistent with createPendingRequest pattern or document why this should fail silently.
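One way to make the duplicate behavior explicit is to have the caller choose a policy rather than rely on a silent return. The sketch below is illustrative only: `WebSocketTable` and `DuplicatePolicy` are hypothetical names, and the IDs are assumed to be strings.

```typescript
// Hypothetical policy type: what to do when a key is registered twice.
type DuplicatePolicy = "replace" | "throw";

class WebSocketTable<T> {
    private readonly entries = new Map<string, T>();
    private readonly policy: DuplicatePolicy;

    constructor(policy: DuplicatePolicy) {
        this.policy = policy;
    }

    set(gatewayId: string, requestId: string, socket: T): void {
        // JSON-encode the pair so a separator inside an ID cannot collide.
        const key = JSON.stringify([gatewayId, requestId]);
        if (this.entries.has(key) && this.policy === "throw") {
            throw new Error(`websocket already registered for ${key}`);
        }
        // Under "replace", the new socket overwrites the old one.
        this.entries.set(key, socket);
    }

    get(gatewayId: string, requestId: string): T | undefined {
        return this.entries.get(JSON.stringify([gatewayId, requestId]));
    }
}
```

With a policy argument, the pending-request and WebSocket paths could share one rule, and the choice is visible at the call site.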

3. Missing Error Handling in Async Message Handler (runner/src/mod.ts:852-869)

#handleCommands(commands: protocol.ToClientCommands) {
    for (const commandWrapper of commands) {
        if (commandWrapper.inner.tag === "CommandStartActor") {
            // Spawn background promise
            this.#handleCommandStartActor(commandWrapper);
        } else if (commandWrapper.inner.tag === "CommandStopActor") {
            // Spawn background promise
            this.#handleCommandStopActor(commandWrapper);
        }
    }
}

Issue: These background promises are fire-and-forget. If they throw unhandled exceptions, they could cause silent failures.

Recommendation:

this.#handleCommandStartActor(commandWrapper).catch(err => {
    this.log?.error({ msg: "unhandled error in CommandStartActor", err });
});

4. Potential State Inconsistency (runner/src/tunnel.ts:116-119)

if (actor.hibernationRestored) {
    throw new Error(
        `Actor ${actorId} already restored hibernating requests`,
    );
}

Good: This prevents double-restoration, but...

Question: What happens if restoreHibernatingRequests partially completes and then fails? The flag won't be set (line 282), so it could be retried, but some connections might already be partially restored. Consider adding more granular tracking or cleanup on failure.

5. Removed Message Index Reset (pegboard-gateway/src/shared_state.rs:127)

-entry.message_index = 0;

Question: Why was this line removed? Is the message index intentionally preserved across hibernation cycles now? This should be documented or explained in the commit message if it's an intentional behavior change.

🔵 Minor Issues

1. Logging Improvements Are Good ✅

The added structured logging throughout (especially in actor.ts and tunnel.ts) will make debugging much easier. Good practice!

2. Removed Debug Log (pegboard-gateway/src/shared_state.rs:395)

-request_id=?tunnel_id::request_id_to_string(&request_id),

Was this field removed to reduce log verbosity? Consider if this information is still useful for debugging.

3. Test Changes (driver-test-suite/tests/actor-conn-hibernation.ts)

The test file only has minor changes (toBe(1) → toBe(1) on lines 57, 84). Are there additional tests needed to cover the new behavior?

🔒 Security Considerations

No obvious security issues detected. The changes maintain proper separation between actors and don't introduce new attack vectors.

📊 Performance Considerations

Synchronous Setup Before Async: Good pattern (lines 915-962); it prevents race conditions without performance overhead.
Background Promise Settling: Using Promise.allSettled (tunnel.ts:279) is correct but could delay hibernation if background ops hang. Consider adding timeouts.
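
A timeout of the kind suggested above could be sketched with `Promise.race`. `withTimeout` is a hypothetical helper, not something in the PR, and it only stops waiting; it does not cancel the underlying operation.

```typescript
// Hypothetical helper: reject if `promise` has not settled within `ms`.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_resolve, reject) => {
        timer = setTimeout(() => {
            reject(new Error(`${label} timed out after ${ms}ms`));
        }, ms);
    });
    // Whichever settles first wins; always clear the timer afterwards so it
    // does not keep the process alive.
    return Promise.race([promise, timeout]).then(
        (value) => {
            clearTimeout(timer);
            return value;
        },
        (err) => {
            clearTimeout(timer);
            throw err;
        },
    );
}
```

Each background operation would be wrapped before being handed to `Promise.allSettled`, so a hung operation shows up as a rejected entry instead of blocking the whole batch.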

🧪 Test Coverage

Existing tests in actor-conn-hibernation.ts cover:

✅ Basic hibernation
✅ State persistence through hibernation
✅ Connection closure during hibernation

Missing coverage:

❌ Actor start failure with hibernating requests
❌ Partial restoration failure scenarios
❌ Duplicate restoration attempts

📝 Documentation

The JSDoc comments added to restoreHibernatingRequests (runner/src/mod.ts:1582-1614) are excellent and clearly explain the API contract.

✅ Recommendations Summary

Must Fix (Before Merge):

⚠️ Resource leak on actor start failure (mod.ts:959)
⚠️ Unhandled promise rejections in background commands (mod.ts:863-867)

Should Fix:

Inconsistent duplicate WebSocket handling (actor.ts:176-180)
Consider failure recovery in partial hibernation restoration

Nice to Have:

Additional test coverage for error cases
Document the message_index behavior change

Overall, this is a solid improvement that fixes critical bugs and improves the architecture. The main concerns are around error handling and resource cleanup in failure scenarios.

claude · 2025-11-20T08:02:31Z

Code Review: Hibernation Implementation Fix

This PR fixes critical issues in the hibernation implementation for RivetKit.

Summary

Addresses race conditions and synchronization issues in the WebSocket hibernation flow, ensuring actors are fully initialized before processing hibernating requests.

Positive Observations

1. Critical Bug Fix in Rust (pegboard-gateway/src/lib.rs:489)
The logic inversion fix changing if lifecycle_res to if !&lifecycle_res is excellent. This would have caused close frames to be sent when hibernating and not sent when not hibernating.

2. Proper Race Condition Handling
The introduction of actorStartPromise and getAndWaitForActor prevents race conditions between actor initialization and incoming tunnel messages.

3. Improved API Design
The refactored restoreHibernatingRequests approach provides clearer separation of concerns with better documentation.

4. Enhanced Logging
Good additions for debugging: message index tracking, gateway IDs in logs, pending request counts.

Issues & Concerns

1. CRITICAL: Potential Resource Leak (mod.ts:959)
If onActorStart throws, the requestToActor mappings are never cleaned up. The TODO comment acknowledges this. This could cause memory leaks and incorrect request routing.

Recommendation: Add cleanup in the catch block to remove all hibernatingRequests from the tunnel's requestToActor map.

2. Warning Suppression (actor.ts:177-180)
setWebSocket silently ignores duplicate WebSocket registration with just a warning. Should this throw an error to fail fast instead?

3. Typo (tunnel.ts:762)
"catught" should be "caught"

4. Code Duplication (actor.ts:65-137)
createPendingRequest and createPendingRequestWithStreamController have ~80% overlapping code. Consider merging with an optional parameter.
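
A merged version with an optional parameter might look like the sketch below. The `PendingRequest` shape here is an assumption made for illustration (the real fields are not shown in this thread), and `PendingRequests` is a hypothetical container.

```typescript
// Assumed, simplified shape; the real PendingRequest has other fields.
interface PendingRequest<C> {
    gatewayId: string;
    requestId: string;
    streamController?: C;
}

class PendingRequests<C> {
    private readonly list: PendingRequest<C>[] = [];

    // One method covers both cases; pass a controller only for streaming.
    create(gatewayId: string, requestId: string, streamController?: C): PendingRequest<C> {
        const index = this.list.findIndex(
            (r) => r.gatewayId === gatewayId && r.requestId === requestId,
        );
        if (index !== -1) {
            // Mirror the replace-on-duplicate behavior described in the review.
            this.list.splice(index, 1);
        }
        const entry: PendingRequest<C> = { gatewayId, requestId };
        if (streamController !== undefined) {
            entry.streamController = streamController;
        }
        this.list.push(entry);
        return entry;
    }

    count(): number {
        return this.list.length;
    }
}
```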

5. Missing Validation
Why is replacing a pending request valid behavior? This could indicate request ID collision or race conditions.

Performance Considerations

Linear Search in Arrays
Multiple findIndex operations on the pendingRequests, webSockets, and requestToActor arrays are each O(n) and could become bottlenecks with many concurrent connections. Consider using Maps with composite keys for O(1) lookups.
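
A Map keyed by a composite of the two IDs could be sketched like this. `PendingRequestIndex` and `requestKey` are hypothetical names, and the IDs are assumed to be strings; binary IDs would need to be encoded first.

```typescript
interface IndexedRequest {
    gatewayId: string;
    requestId: string;
}

// JSON-encoding the pair avoids collisions that a plain separator allows,
// e.g. ("a:b", "c") versus ("a", "b:c").
function requestKey(gatewayId: string, requestId: string): string {
    return JSON.stringify([gatewayId, requestId]);
}

class PendingRequestIndex {
    private readonly byKey = new Map<string, IndexedRequest>();

    add(request: IndexedRequest): void {
        this.byKey.set(requestKey(request.gatewayId, request.requestId), request);
    }

    get(gatewayId: string, requestId: string): IndexedRequest | undefined {
        return this.byKey.get(requestKey(gatewayId, requestId));
    }

    // Returns true if an entry was removed.
    remove(gatewayId: string, requestId: string): boolean {
        return this.byKey.delete(requestKey(gatewayId, requestId));
    }

    get size(): number {
        return this.byKey.size;
    }
}
```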

Test Coverage

Missing tests for:

Actor start failure cleanup (resource leak scenario)
Duplicate WebSocket registration
Race conditions between actor start and incoming messages
The Rust logic inversion fix

Recommendations

Must Fix Before Merge:

Address the resource leak in error path
Fix typo

Should Fix:

Add test coverage for critical bug fixes
Refactor duplicate code
Consider throwing errors for duplicate registrations in dev mode

Nice to Have:

Optimize array-based lookups to Map-based
Add JSDoc for complex flows
Add metrics for hibernation success/failure

Overall Assessment

Rating: 7.5/10

This PR fixes critical bugs and improves architecture. Main concerns are the resource leak (must fix) and missing test coverage. With those addressed, this would be production-ready.

Great catch on the Rust logic bug!

claude · 2025-11-20T19:40:24Z

Code Review: Hibernation Implementation Fix

Thanks for working on this hibernation fix! I've reviewed the changes across the Rust gateway code and TypeScript SDK. Here's my feedback:

🐛 Critical Bug Fix (pegboard-gateway/src/lib.rs:489)

Excellent catch! The negation operator fix is critical:

-if lifecycle_res
+if !&lifecycle_res

This was closing WebSocket connections when they should be hibernating. This is the core fix and correctly addresses the hibernation behavior.

🏗️ Architecture Improvements

1. Actor Start Synchronization (runner/src/mod.ts)

The new actorStartPromise pattern is solid:

Prevents race conditions between actor creation and message handling
Ensures hibernating requests wait for full actor initialization before processing
Good use of promiseWithResolvers pattern

However, there's a potential resource leak (line 959 TODO):

// TODO: Where is this GC'd if something fails?
for (const hr of startCommand.hibernatingRequests) {
    this.#tunnel.addRequestToActor(hr.gatewayId, hr.requestId, actorId);
}

Recommendation: Add cleanup in the catch block at line 987-999 to remove the request-to-actor mappings if initialization fails:

} catch (err) {
    // Clean up request mappings if actor start failed
    for (const hr of startCommand.hibernatingRequests) {
        this.#tunnel.removeRequestToActor(hr.gatewayId, hr.requestId);
    }
    // ... rest of error handling
}

2. Pending Request Management (runner/src/actor.ts)

Good refactoring with createPendingRequest and createPendingRequestWithStreamController:

Eliminates code duplication
Adds helpful debug logging
Includes duplicate detection with warnings

Minor concern (line 82): The duplicate request warning at line 66 might mask bugs rather than expose them. Consider if this should be an error instead of a warning, or at least add metrics to track how often this occurs in production.

3. Hibernation State Contract

The hibernationRestored flag (actor.ts:35) is a good defensive pattern to catch implementation errors. However, the check only validates it hasn't been called twice - not that it was called at all. Consider adding a check before actor shutdown to ensure restoreHibernatingRequests was called if hibernating requests existed.
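
Such a shutdown check could be as small as the sketch below. `ActorRestoreState` and `checkRestoreContract` are hypothetical; only the `hibernationRestored` flag comes from the PR, and the count field is an assumption about what the actor would track.

```typescript
// Assumed, simplified view of the state relevant to the contract.
interface ActorRestoreState {
    actorId: string;
    hibernatingRequestCount: number;
    hibernationRestored: boolean;
}

// Returns a warning message if the actor received hibernating requests but
// restoreHibernatingRequests was never called, otherwise undefined.
function checkRestoreContract(actor: ActorRestoreState): string | undefined {
    if (actor.hibernatingRequestCount > 0 && !actor.hibernationRestored) {
        return (
            `actor ${actor.actorId} is stopping with ` +
            `${actor.hibernatingRequestCount} hibernating request(s) that were never restored`
        );
    }
    return undefined;
}
```

The caller can then decide whether to log the message or treat it as an error.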

🔍 Code Quality

Positive:

✅ Comprehensive logging throughout (though verify log levels in production)
✅ Proper error handling with try-catch blocks
✅ Good use of structured logging with context
✅ Type safety maintained throughout

Concerns:

1. Excessive Logging (state-manager.ts)
The state manager file is untouched in this PR but shows info level logs for routine operations:

this.#actor.rLog.info({ msg: "savePersistInner called", ... });
this.#actor.rLog.info({ msg: "prepared entries for kvBatchPut", ... });
this.#actor.rLog.info({ msg: "calling kvBatchPut", ... });
this.#actor.rLog.info({ msg: "kvBatchPut completed successfully" });

Recommendation: Downgrade these to debug level to avoid log spam in production.

2. Import Ordering (pegboard-gateway/src/lib.rs:9-20)
The reformatting changes break from the standard import grouping pattern. Imports should be grouped as:

External crates (rivet_*, pegboard)
Internal modules (use crate::)

The current ordering mixes these. Consider reverting to alphabetical or logical grouping.

3. Removed Debug Context

-  request_id=?tunnel_id::request_id_to_string(&request_id),

This removes helpful context from the log at shared_state.rs:398. Was this intentional? The ack_index alone may not be sufficient for debugging.

4. Deleted Message Index Reset (shared_state.rs:127)

-entry.message_index = 0;

The removal of this line needs explanation. Is message index now preserved across reconnections? If so, add a comment explaining why.

⚡ Performance Considerations

The getAndWaitForActor method (mod.ts:348) could create a thundering herd if many requests arrive simultaneously during actor startup. Consider whether these can be batched, or add metrics.
Multiple awaits in the tunnel message handling path (tunnel.ts:638, 668, 779, 808) - ensure these don't introduce latency issues under load.

🧪 Testing

Missing Test Coverage:

No test changes included for this critical bug fix
Consider adding tests for:
- WebSocket hibernation/restoration cycle
- Actor start race conditions
- Duplicate pending request handling
- Error cleanup paths

Existing Tests:
The only test file change is a parameter order fix in actor-conn-hibernation.ts - this suggests the test may not have caught the original bug. Review test coverage.

🔒 Security

No immediate security concerns identified. The hibernation state management appears sound.

📝 Documentation

The restoreHibernatingRequests JSDoc (mod.ts:1582) is excellent and clearly explains:

When to call it
What it does
Important constraints

Great documentation!

🎯 Summary

Critical Issues:

✅ FIXED: Core hibernation bug in gateway close logic
⚠️ TODO: Resource leak cleanup in actor start error path (mod.ts:959)

Recommendations:

Add cleanup for request mappings in error path
Downgrade routine logs to debug level
Add test coverage for hibernation scenarios
Restore removed debug context or document why removed
Document message_index behavior change

Overall Assessment: This is a solid fix for a critical bug. The architectural improvements around actor start synchronization are well-designed. The main concern is the unhandled TODO around resource cleanup in error cases. I'd recommend addressing that before merge.

Would you like me to help with any of these issues or have questions about the feedback?

claude · 2025-11-20T19:48:45Z

PR Review: chore(rivetkit): fix hibernation implementation

This PR refactors the hibernation implementation for WebSocket connections. Overall, the changes improve clarity, but there are several critical areas that need attention.

Critical Issues

1. Inverted Boolean Logic in Rust (lib.rs:489)

The logic 'if !&lifecycle_res' appears incorrect. The negation '!' is being applied to a reference, which will always be true for any non-null reference. This will cause the gateway to send close frames at the wrong time.

2. Race Condition in Actor Start (mod.ts:933-945)

The PR adds hibernatingRequests to the requestToActor map BEFORE onActorStart completes. If incoming tunnel messages arrive during this time and the actor fails during onActorStart, there's a potential race condition.

3. Actor Removed After Stop (mod.ts:288-310)

closeActiveRequests is called BEFORE removing the actor. If another CommandStartActor arrives during this window, you could have duplicate actors briefly.

Potential Bugs

4. Missing Cleanup on Actor Start Failure

When onActorStart fails, requestToActor mappings were already added. Verify these are properly cleaned up.

5. Empty resolve/reject Functions (actor.ts:85-86)

The resolve and reject callbacks are empty no-ops. Document why this is intentional.

6. Duplicate WebSocket Detection (actor.ts:176-180)

setWebSocket returns early without replacing duplicates, while createPendingRequest replaces them. This inconsistency could cause bugs.

Code Quality

7. Typo in Log Message (tunnel.ts:762)

'catught' should be 'caught'

8. Inconsistent Logging

Standardize on including both gatewayId and requestId in WebSocket logs.

9. Removed message_index Reset (shared_state.rs:127)

Add comment explaining why removing this reset is safe.

Performance

10. Linear Search

Consider using Map instead of array for pendingRequests and webSockets if many concurrent connections become a bottleneck.

Test Coverage

11. Missing Tests

Add tests for: the critical Rust logic fix, race conditions during actor start, duplicate handling, and failure scenarios.

Documentation

12. Excellent Improvements

The JSDoc additions for hibernatableWebSocket, onActorStart, and restoreHibernatingRequests significantly improve API clarity.

Summary

Approve with Changes Required

The architectural changes are sound improvements. However, critical issues must be addressed:

CRITICAL: Fix inverted boolean logic in lib.rs:489
IMPORTANT: Address race condition in actor start
IMPORTANT: Verify cleanup on actor start failure
Add test coverage
Fix typo and logging inconsistencies

claude · 2025-11-20T22:33:43Z

PR Review: Hibernation Implementation Fix

Critical Issues

1. Logic Inversion Bug (pegboard-gateway/src/lib.rs:489)

The negation operator is applied to a reference rather than the boolean:

if !&lifecycle_res.as_ref().map_or_else(is_ws_hibernate, |_| false)

Should be:

if !lifecycle_res.as_ref().map_or_else(is_ws_hibernate, |_| false)

Major Concerns

2. Duplicate WebSocket Prevention (runner/src/actor.ts:176-180)

Silent failure when duplicate detected - caller may expect WebSocket was registered. Either replace existing, throw error, or document why silent failure is OK.

3. Actor Removal Timing (runner/src/mod.ts:308-311)

Moving removeActor after closeActiveRequests changes timing. Need detailed comments on race condition prevention.

Code Quality

message_index field reset removed (shared_state.rs:127) - verify if field should be removed entirely
Missing gatewayId in error logs (runner/src/tunnel.ts:142-158)
Logging level changed error to warn in getActor

Performance

Promise.allSettled silently swallows errors. Consider logging which WebSockets failed to restore.
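
Because `Promise.allSettled` returns its results in input order, the failed entries can be paired back with their request IDs and logged. The sketch below assumes string request IDs; `Settled`, `RestoreFailure`, and `collectRestoreFailures` are hypothetical names.

```typescript
// Structurally matches the entries produced by Promise.allSettled.
type Settled<T> =
    | { status: "fulfilled"; value: T }
    | { status: "rejected"; reason: unknown };

interface RestoreFailure {
    requestId: string;
    message: string;
}

// Pair each rejected result with the request ID at the same position.
function collectRestoreFailures<T>(
    requestIds: string[],
    results: Settled<T>[],
): RestoreFailure[] {
    const failures: RestoreFailure[] = [];
    results.forEach((result, i) => {
        if (result.status === "rejected") {
            const reason = result.reason;
            failures.push({
                requestId: requestIds[i] ?? "<unknown>",
                message: reason instanceof Error ? reason.message : String(reason),
            });
        }
    });
    return failures;
}
```

The array returned by `await Promise.allSettled(...)` can be passed directly as `results`, and each failure can then be written to the structured log.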

Security

No input validation on hibernation metadata. Add validation to prevent crashes from corrupted data.
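
Validation of this kind could be sketched as a parser that returns `undefined` for anything malformed. The field names below (`gatewayId`, `requestId`, `messageIndex`) are assumptions for illustration; the real persisted metadata format is not shown in this thread.

```typescript
// Assumed metadata shape, for illustration only.
interface HibernationMeta {
    gatewayId: string;
    requestId: string;
    messageIndex: number;
}

function isRecord(value: unknown): value is Record<string, unknown> {
    return typeof value === "object" && value !== null;
}

// Returns the validated metadata, or undefined if anything is malformed.
function parseHibernationMeta(raw: unknown): HibernationMeta | undefined {
    if (!isRecord(raw)) return undefined;
    const gatewayId = raw["gatewayId"];
    const requestId = raw["requestId"];
    const messageIndex = raw["messageIndex"];
    if (typeof gatewayId !== "string" || gatewayId.length === 0) return undefined;
    if (typeof requestId !== "string" || requestId.length === 0) return undefined;
    if (
        typeof messageIndex !== "number" ||
        !Number.isInteger(messageIndex) ||
        messageIndex < 0
    ) {
        return undefined;
    }
    return { gatewayId, requestId, messageIndex };
}
```

Entries that fail to parse can be skipped and logged rather than allowed to crash the restore loop.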

Testing Gaps

Need tests for edge cases like already-restored state, mismatched metadata, concurrent operations.

Summary

Must Fix:

Rust logic inversion bug
Clarify duplicate WebSocket behavior

Should Address:

Error handling for partial restoration
Input validation
Test coverage

Positive: Good architectural improvements with actorStartPromise pattern and better separation of concerns.

Overall: Solid refactoring. Main blocker is Rust bug. Otherwise ready after clarifications.

claude · 2025-11-20T22:43:50Z

Pull Request Review: Fix Hibernation Implementation

Summary

This PR fixes critical issues in the hibernation implementation for WebSocket connections. The changes primarily affect the TypeScript runner SDK and the Rust pegboard gateway, with a focus on proper lifecycle management and race condition prevention during actor startup and WebSocket restoration.

Code Quality & Best Practices

✅ Strengths

Race Condition Fixes: The PR introduces proper synchronization with actorStartPromise to prevent race conditions between actor startup and incoming requests. This is a critical improvement.
Better Separation of Concerns: Moving hibernation restoration responsibility to the user (via restoreHibernatingRequests) with clear documentation is a good architectural decision.
Improved Logging: Added structured debug/info logs throughout help with observability (e.g., tracking pending request counts, message indices).
State Tracking: The hibernationRestored flag prevents double-restoration, which is good defensive programming.

Potential Bugs & Issues

🔴 Critical Issues

Negation Bug Fix in pegboard-gateway/src/lib.rs:489
```
// Before: if lifecycle_res
// After:  if !&lifecycle_res
```
This fixes inverted logic for hibernation detection. However, the & reference operator seems unnecessary here:
```
if !lifecycle_res  // This should be sufficient
```
The &lifecycle_res creates a temporary reference that is then negated, which works but is unconventional.
Removed entry.message_index = 0 in shared_state.rs:127
This line was removed during hibernation restoration. Need to verify this doesn't break message ordering - the removal suggests message_index is now managed elsewhere, but this should be explicitly validated.

⚠️ Medium Priority Issues

Actor Replacement Warning in mod.ts:943
```
if (existingActor) {
    this.log?.warn({
        msg: "replacing existing actor in actors map",
        // ...
    });
}
```
While logging is good, silently replacing actors could indicate a serious issue. Should this throw an error instead, or at minimum, clean up the old actor's resources first?
Duplicate Prevention vs. Error Handling in actor.ts:176-180
```
if (exists) {
    logger()?.warn({ msg: "attempting to set websocket twice" });
    return;  // Silent return
}
```
This silently returns on duplicate WebSocket registration. Should this throw an error to surface potential bugs in the calling code?
Error Handling in mod.ts:988
```
instance.actorStartPromise.reject(err);

// TODO: Mark as crashed
await this.forceStopActor(actorId, generation);
```
The TODO comment suggests incomplete error handling. What happens to actors that fail to start? Are they properly cleaned up?

Performance Considerations

🟡 Minor Performance Notes

Sequential Processing in tunnel.ts:134-276
The hibernation restoration processes WebSockets sequentially in a for-loop. For actors with many hibernating connections, this could be slow. Consider:

// Current: sequential processing
for (const { gatewayId, requestId } of actor.hibernatingRequests) { ... }

// Potential: parallel processing with Promise.allSettled
await Promise.allSettled(
    actor.hibernatingRequests.map(async ({ gatewayId, requestId }) => { ... })
);

However, the current approach may be intentional for ordered processing.

Linear Search in Pending Requests
actor.ts:58-62 uses linear search through pendingRequests array. For actors with many concurrent requests, this could become a bottleneck. Consider using a Map<string, PendingRequest> keyed by gatewayId + requestId.

Security Concerns

🔒 Security Review

No Security Issues Identified: The changes don't introduce obvious security vulnerabilities like XSS, injection attacks, or authentication bypasses.
Resource Cleanup: The PR properly cleans up WebSocket connections that fail to restore, preventing resource leaks.
Error Messages: Error messages don't leak sensitive information.

Test Coverage

📋 Testing Notes

Limited Test Changes: Only minor changes to test files:
- driver-test-suite/tests/actor-conn-hibernation.ts (2 line change)
- driver-test-suite/mod.ts (2 line change)
Missing Test Cases: Consider adding tests for:
- Race conditions during actor startup with concurrent requests
- Duplicate actor creation scenarios
- Failed hibernation restoration
- Actor start failure cleanup
- Message index tracking across hibernation/restoration cycles

Specific Code Review Comments

`engine/sdks/typescript/runner/src/mod.ts`

Line 913-932: Actor creation and request mapping race condition fix

// IMPORTANT: Make sure no async code runs before inserting #actors and
// calling addRequestToActor in order to prevent race conditions with
// subsequence commands

✅ Good: The comment clearly explains the critical ordering requirement. The code correctly populates requestToActor before any async operations.

Line 966-982: Actor startup error handling

} catch (err) {
    this.log?.error({
        msg: "error starting runner actor",
        actorId,
        err,
    });

    instance.actorStartPromise.reject(err);

    // TODO: Mark as crashed
    await this.forceStopActor(actorId, generation);
}

⚠️ Question: What happens to actors in the #actors map when startup fails? They're added at line 954 but may not be properly cleaned if forceStopActor doesn't handle this case.

`engine/sdks/typescript/runner/src/actor.ts`

Line 65-99: createPendingRequest method

if (exists) {
    logger()?.warn({
        msg: "attempting to set pending request twice, replacing existing",
        // ...
    });
    this.deletePendingRequest(gatewayId, requestId);
}

✅ Good: The method handles duplicates gracefully by logging and cleaning up.

⚠️ Note: The duplicate object creation between createPendingRequest and createPendingRequestWithStreamController suggests these could be refactored to share common logic.

`engine/packages/pegboard-gateway/src/lib.rs`

Line 489: Fixed negation logic

if !&lifecycle_res

⚠️ Style: Consider simplifying to if !lifecycle_res (remove the & reference operator).

Documentation

📚 Documentation Quality

✅ Excellent: The new restoreHibernatingRequests method in mod.ts:1587-1614 has comprehensive JSDoc comments explaining:

What the method does
When to call it
What it handles
Important ordering requirements

Recommendations

Fix the reference operator in pegboard-gateway/src/lib.rs:489 - change !&lifecycle_res to !lifecycle_res
Improve actor replacement handling - Either throw an error when replacing an existing actor, or ensure proper cleanup of the old actor's resources
Complete the TODO at mod.ts:988 - Implement proper "crashed" state marking for failed actors
Add test coverage for the race condition fixes and error scenarios
Consider refactoring duplicate code in createPendingRequest methods
Document the message_index removal in shared_state.rs with a comment explaining why it's no longer needed
Verify message ordering still works correctly after removing message_index = 0 reset

Conclusion

This PR makes important fixes to the hibernation implementation, particularly around race conditions and lifecycle management. The code quality is generally good with excellent documentation. The main concerns are:

A few minor code style issues (unnecessary & operator)
Incomplete error handling (TODO comments)
Potential resource cleanup issues when actors are replaced
Limited test coverage for the new scenarios

Overall, the changes improve the system's reliability, but I recommend addressing the critical issues before merging.

Review generated by Claude Code

claude · 2025-11-20T22:48:07Z

Code Review: Hibernation Implementation Fix

Overview

This PR fixes the hibernation implementation for RivetKit by addressing race conditions, improving WebSocket restoration flow, and enhancing error handling. The changes span both Rust (pegboard-gateway) and TypeScript (runner SDK) components.

Critical Bug Fix ✅

Location: engine/packages/pegboard-gateway/src/lib.rs:489

- if lifecycle_res
+ if !&lifecycle_res

Issue: This was a critical logic bug - the condition was inverted. The code was sending close frames when hibernating (should not) and not sending them when not hibernating (should).

Impact: This fix ensures WebSocket close frames are only sent to the runner when NOT hibernating, which is the correct behavior.

Architecture Changes 🏗️

1. Actor Lifecycle Management (`mod.ts:915-990`)

Improvements:

Added actorStartPromise to synchronize actor initialization
Pre-populate requestToActor map BEFORE async operations to prevent race conditions
Better handling of existing actors with warning logs

Concern:

// TODO: Where is this GC'd if something fails?
for (const hr of startCommand.hibernatingRequests) {
    this.#tunnel.addRequestToActor(hr.gatewayId, hr.requestId, actorId);
}

This TODO raises a valid concern - if actor initialization fails after populating the map, these entries may leak. Consider adding cleanup in the error handler.

Recommendation: Add cleanup in the catch block:

} catch (err) {
    // ... existing error handling ...
    
    // Cleanup requestToActor mappings on failure
    for (const hr of startCommand.hibernatingRequests) {
        this.#tunnel.removeRequestToActor(hr.gatewayId, hr.requestId);
    }
    
    instance.actorStartPromise.reject(err);
    await this.forceStopActor(actorId, generation);
}

2. Hibernation Restoration Flow (`tunnel.ts:105-280`)

Good:

Clear separation of concerns - caller loads metadata, tunnel restores connections
Added hibernationRestored flag to prevent double restoration
Proper error handling for missing actors or duplicate restoration attempts

Issue - API Design:
The restoreHibernatingRequests method has shifted responsibility:

Before: Tunnel loaded metadata internally
After: Caller must load metadata and pass it in

While this improves separation of concerns, the documentation in mod.ts:73-108 needs updating to reflect that onActorStart is now responsible for both loading AND calling restoreHibernatingRequests.

Current documentation says:

loadAll will be called from Tunnel::restoreHibernatingRequests

But now:

The user of this library is responsible for loading all persisted hibernatable WebSocket metadata

This is better design, but ensure all consumers are updated.

3. Race Condition Prevention (`tunnel.ts:444-454`)

Excellent addition:

async getAndWaitForRequestActor(
    gatewayId: GatewayId,
    requestId: RequestId,
): Promise<RunnerActor | undefined> {
    const actor = this.getRequestActor(gatewayId, requestId);
    if (!actor) return;
    await actor.actorStartPromise.promise;
    return actor;
}

This ensures tunnel messages wait for actor initialization before processing, preventing race conditions when messages arrive during startup.
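A minimal sketch of this gating pattern, for readers who want to see it in isolation. Everything here is a hypothetical stand-in: `deferred`, `SketchActor`, and `SketchTunnel` are illustrative names, and the gateway/request ID pair is collapsed into a single string key. It is not the runner SDK's actual implementation.

```typescript
// Hypothetical helper: a promise plus its resolve/reject handles.
interface Deferred<T> {
  promise: Promise<T>;
  resolve: (value: T) => void;
  reject: (reason: unknown) => void;
}

function deferred<T>(): Deferred<T> {
  let resolve: (value: T) => void = () => {};
  let reject: (reason: unknown) => void = () => {};
  const promise = new Promise<T>((res, rej) => {
    resolve = res;
    reject = rej;
  });
  return { promise, resolve, reject };
}

// Assumed actor shape: only the fields the gating pattern needs.
interface SketchActor {
  actorId: string;
  actorStartPromise: Deferred<void>;
}

class SketchTunnel {
  private readonly requestToActor = new Map<string, SketchActor>();

  // Called synchronously, before any async start logic runs.
  addRequestToActor(requestKey: string, actor: SketchActor): void {
    this.requestToActor.set(requestKey, actor);
  }

  // Resolves only after the actor has finished starting.
  // Returns undefined immediately when no actor owns the request.
  async getAndWaitForRequestActor(
    requestKey: string,
  ): Promise<SketchActor | undefined> {
    const actor = this.requestToActor.get(requestKey);
    if (!actor) return undefined;
    await actor.actorStartPromise.promise;
    return actor;
  }
}
```

In this sketch, a message handler that awaits `getAndWaitForRequestActor` is held until the start path calls `actorStartPromise.resolve()`. If the start path calls `reject(err)` instead, the awaiting handler sees that rejection, so callers would need their own error handling.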

Code Quality Issues 🔍

1. Pending Request Duplication (`actor.ts:65-133`)

Issue:

const exists = this.getPendingRequest(gatewayId, requestId) !== undefined;
if (exists) {
    logger()?.warn({
        msg: "attempting to set pending request twice, replacing existing",
        // ...
    });
    this.deletePendingRequest(gatewayId, requestId);
}

Concern: Why would we attempt to set a pending request twice? This warning suggests a potential logic error in the caller. Consider:

Is this expected behavior or a bug being masked?
Should this throw an error instead of warning?
Are there race conditions causing duplicate calls?

Recommendation: Investigate why duplicates occur. If it's truly expected, document why. If it's a bug, fix the caller instead of working around it.

2. WebSocket Duplication Handling (`actor.ts:173-181`)

const exists = this.getWebSocket(gatewayId, requestId) !== undefined;
if (exists) {
    logger()?.warn({ msg: "attempting to set websocket twice" });
    return;  // Silent failure
}

Issue: Silent failures can mask bugs. This should either:

Throw an error if duplicates indicate a bug
Document why duplicates are expected and safe to ignore
Return a success/failure indicator
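The third option could look roughly like the sketch below. `SketchWebSocketRegistry` and `SetResult` are hypothetical names, and the gateway/request ID pair is simplified to a string key. The point is only that the caller gets a value it can act on instead of a silent return.

```typescript
// Hypothetical result type: the caller can tell a duplicate from a successful insert.
type SetResult = { ok: true } | { ok: false; reason: "duplicate" };

class SketchWebSocketRegistry<W> {
  private readonly sockets = new Map<string, W>();

  // Keeps the existing socket on a duplicate, but reports it to the caller.
  set(key: string, ws: W): SetResult {
    if (this.sockets.has(key)) {
      return { ok: false, reason: "duplicate" };
    }
    this.sockets.set(key, ws);
    return { ok: true };
  }

  get(key: string): W | undefined {
    return this.sockets.get(key);
  }
}
```

A caller that treats duplicates as a bug can throw on `ok: false`. A caller that expects them can log and continue.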

3. Removed Field (`shared_state.rs:127`)

- entry.message_index = 0;

Question: Where is message_index used? Ensure removing this reset doesn't cause index tracking issues across reconnections.

Logging Improvements ✅

Good additions:

tunnel.ts:510: Added messageIndex to send logs
shared_state.rs:196: Added detailed pending message tracking
Consistent use of structured logging with context

Minor issue:

tracing::debug!(
-   request_id=?tunnel_id::request_id_to_string(&request_id),
    ack_index,
    removed_count = len_before - len_after,

Removing request_id from logs may make debugging harder. Consider keeping it.

Performance Considerations ⚡

Background Operations (tunnel.ts:129): Using Promise.allSettled(backgroundOperations) is good for handling restoration concurrently, but be mindful of:
- Potential memory pressure with many hibernating connections
- Order dependencies between operations
Linear Search (actor.ts:58-62): getPendingRequest uses linear search with find(). Consider using a Map keyed by gatewayId+requestId if pendingRequests arrays grow large.
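For the linear-search point, a Map with a composite key might look like the sketch below. It assumes gateway and request IDs are byte arrays (`Uint8Array`), which this review has not verified. `compositeKey` and `PendingRequestIndex` are hypothetical names.

```typescript
// Assumption: IDs are raw bytes. Hex-encode them so they can be used as a Map key.
function toHex(bytes: Uint8Array): string {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
}

// The ":" separator cannot appear in hex output, so two different ID pairs never collide.
function compositeKey(gatewayId: Uint8Array, requestId: Uint8Array): string {
  return `${toHex(gatewayId)}:${toHex(requestId)}`;
}

class PendingRequestIndex<T> {
  private readonly entries = new Map<string, T>();

  set(gatewayId: Uint8Array, requestId: Uint8Array, value: T): void {
    this.entries.set(compositeKey(gatewayId, requestId), value);
  }

  get(gatewayId: Uint8Array, requestId: Uint8Array): T | undefined {
    return this.entries.get(compositeKey(gatewayId, requestId));
  }

  delete(gatewayId: Uint8Array, requestId: Uint8Array): boolean {
    return this.entries.delete(compositeKey(gatewayId, requestId));
  }
}
```

Lookups cost one key encoding plus a hash lookup rather than a scan. For the small arrays typical of a single actor the difference is probably negligible, so this is worth doing only if profiling shows the scans matter.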

Security Concerns 🔒

No major security issues identified. Good practices:

Proper error handling without exposing internals
Validation of actor existence before operations
Close codes used appropriately (1000 for normal, error codes for abnormal)

Test Coverage 📊

Observation: No test file changes in this PR.

Recommendations:

Add tests for the inverted condition fix (lib.rs:489)
Test race condition scenarios with concurrent actor start + message arrival
Test hibernation restoration with missing/extra metadata
Test duplicate request/websocket detection paths

Example test scenarios:

Actor receives tunnel message during initialization (before actorStartPromise resolves)
Restoration with metadata that doesn't match gateway's hibernating requests
Actor initialization failure after requestToActor population

Documentation 📚

Good:

Comprehensive JSDoc updates in mod.ts
Clear comments about race condition prevention

Needs improvement:

Update hibernatableWebSocket.loadAll documentation (now removed from interface)
Document why duplicate pending requests might occur
Add inline comments explaining the actor lifecycle state machine

Summary

Strengths:
✅ Critical bug fix (inverted condition)
✅ Improved race condition handling with promise synchronization
✅ Better separation of concerns in hibernation flow
✅ Enhanced logging for debugging

Areas for improvement:
⚠️ Memory leak potential in requestToActor map on actor init failure
⚠️ Investigate root cause of duplicate pending request warnings
⚠️ Add test coverage for critical paths
⚠️ Consider error handling improvements (throw vs warn+return)

Overall verdict: This is a significant improvement to the hibernation implementation. The critical bug fix alone justifies merging, but please address the memory leak concern and consider adding tests for the new synchronization logic.

Priority fixes before merge:

Add cleanup for requestToActor mappings in actor init error handler
Add at least one test for the inverted condition bug fix

Follow-up tasks:

Investigate duplicate pending request warnings
Add comprehensive test coverage for hibernation restoration
Profile performance with many concurrent hibernating connections

claude · 2025-11-20T23:52:05Z

Code Review - Hibernation Implementation Fix

This PR addresses critical issues in the WebSocket hibernation implementation. I've conducted a thorough review focusing on code quality, potential bugs, performance, and security.

🎯 Critical Issues

1. CRITICAL: Logic Error in Rust (pegboard-gateway/src/lib.rs:489)

-if lifecycle_res
+if !&lifecycle_res

Status: ✅ FIXED

This is the core fix! The original logic was inverted - it would send close frames when it should hibernate. However, the syntax !&lifecycle_res is concerning. You're negating a reference, not the value.

Recommendation: Change to !lifecycle_res.as_ref().map_or_else(is_ws_hibernate, |_| false) for clarity, or verify that the current implementation correctly evaluates to what you intend.

2. CRITICAL: Resource Leak on Actor Start Failure (mod.ts:959)

// TODO: Where is this GC'd if something fails?
for (const hr of startCommand.hibernatingRequests) {
    this.#tunnel.addRequestToActor(hr.gatewayId, hr.requestId, actorId);
}

If onActorStart throws (line 983), the requestToActor mappings added above are never cleaned up. The catch block calls forceStopActor, but this may not remove these specific mappings.

Impact: Memory leak, incorrect routing of future requests

Fix Required:

} catch (err) {
    this.log?.error({ msg: "error starting runner actor", actorId, err });
    instance.actorStartPromise.reject(err);
    
    // Clean up request-to-actor mappings
    for (const hr of startCommand.hibernatingRequests) {
        this.#tunnel.removeRequestToActor(hr.gatewayId, hr.requestId);
    }
    
    await this.forceStopActor(actorId, generation);
}

Note: You'll need to make removeRequestToActor public or add a cleanup method to Tunnel.

3. Race Condition: Actor Removal Timing (mod.ts:305-310)

this.#tunnel?.closeActiveRequests(actor);

// Remove actor after stopping
this.#removeActor(actorId, generation);

this.#sendActorStateUpdate(actorId, actor.generation, "stopped");

closeActiveRequests is called while the actor is still in #actors map. If a new CommandStartActor arrives during this time, you could briefly have duplicate actors. Consider moving #removeActor before closeActiveRequests.

⚠️ Important Issues

4. Inconsistent Duplicate Handling (actor.ts:176-180 vs 65-80)

setWebSocket silently returns without adding duplicates:

if (exists) {
    logger()?.warn({ msg: "attempting to set websocket twice" });
    return; // Silently fails
}

But createPendingRequest deletes the existing one and replaces:

if (exists) {
    logger()?.warn({ msg: "attempting to set pending request twice, replacing existing", ... });
    this.deletePendingRequest(gatewayId, requestId);
}

Impact: Inconsistent behavior could cause subtle bugs

Recommendation: Either make both consistent, or document why they differ.

5. Missing Message Index Reset Documentation (shared_state.rs:127)

The line entry.message_index = 0; was removed. This appears to be an intentional behavior change where message indices are now preserved across reconnections.

Recommendation: Add a comment explaining this behavior change and why it's correct for hibernation.

6. Unhandled Background Promises (mod.ts:863-867)

if (commandWrapper.inner.tag === "CommandStartActor") {
    this.#handleCommandStartActor(commandWrapper);
} else if (commandWrapper.inner.tag === "CommandStopActor") {
    this.#handleCommandStopActor(commandWrapper);
}

These async functions are fire-and-forget. Unhandled rejections could cause silent failures.

Fix:

this.#handleCommandStartActor(commandWrapper).catch(err => {
    this.log?.error({ msg: "unhandled error in CommandStartActor", err });
});

📝 Code Quality

7. Typo (tunnel.ts:762)

"catught" should be "caught"

8. Code Duplication (actor.ts:65-137)

createPendingRequest and createPendingRequestWithStreamController share ~80% of their code. Consider refactoring:

createPendingRequest(
    gatewayId: protocol.GatewayId,
    requestId: protocol.RequestId,
    clientMessageIndex: number,
    streamController?: ReadableStreamDefaultController<Uint8Array>,
) {
    // Unified implementation
}

9. Removed Debug Context (shared_state.rs:395)

-request_id=?tunnel_id::request_id_to_string(&request_id),

This removes helpful debugging context. Was this intentional? Consider keeping it.

✅ Positive Changes

Excellent synchronization pattern with actorStartPromise - prevents race conditions
Improved API clarity - restoreHibernatingRequests is now the caller's responsibility with clear documentation
Better logging - structured logging with context throughout
Defensive programming - hibernationRestored flag prevents double-restoration
Good use of TypeScript - proper typing maintained throughout

🔍 Testing Recommendations

The test file shows minimal changes. Consider adding tests for:

✅ Actor start failure cleanup (addresses issue #2)
✅ Hibernation state restoration after actor restart
✅ Duplicate WebSocket/request handling
✅ Message index preservation across hibernation
✅ The Rust close frame logic (integration test)

🏎️ Performance Considerations

Linear searches in arrays (actor.ts, tunnel.ts):
Multiple findIndex operations on pendingRequests, webSockets, and requestToActor arrays could become O(n) bottlenecks under load.

Recommendation: For production at scale, consider using Maps with composite keys for O(1) lookups.

🔒 Security

No security vulnerabilities identified. Proper isolation between actors is maintained.

Summary

Must Fix Before Merge:

Resource leak in actor start error path (#2)
Rust boolean logic syntax verification (#1)
Unhandled promise rejections (#6)

Should Fix:

Race condition in actor removal (#3)
Inconsistent duplicate handling (#4)
Typo (#7)

Nice to Have:

Document message_index behavior (#5)
Refactor duplicate code (#8)
Restore debug context (#9)
Add comprehensive tests
Performance optimizations for high-load scenarios

Overall Assessment: 7/10 - Solid architectural improvements and critical bug fixes, but needs cleanup in error handling paths before merge.

Great work on identifying and fixing the hibernation logic bug! 🎉

claude · 2025-11-21T00:52:12Z

Pull Request Review: Hibernation Implementation Fix

I've reviewed this PR which fixes the hibernation implementation. Overall, the changes look good and address important race conditions and lifecycle management issues. Here are my findings:

✅ Positive Changes

Critical Bug Fix in Rust Code (pegboard-gateway/src/lib.rs:489)
- Fixed the hibernation check from if lifecycle_res to if !&lifecycle_res
- This was causing WebSocket close frames to be sent incorrectly when hibernating
- This is a critical fix that properly prevents closing hibernating connections
Better Race Condition Handling (runner/src/mod.ts)
- Added actorStartPromise to coordinate actor initialization with incoming tunnel messages
- Pre-populates requestToActor map BEFORE any async code runs (lines 956-962)
- Ensures incoming messages wait for actor initialization to complete
- Good use of the promiseWithResolvers pattern for coordination
Improved API Design (runner/src/mod.ts)
- Made restoreHibernatingRequests a public method on Runner (line 1607)
- Clear documentation explaining when and how to call this method
- Better separation of concerns - caller is now responsible for loading metadata
Better Request Management (runner/src/actor.ts)
- Refactored setPendingRequest into createPendingRequest and createPendingRequestWithStreamController
- Checks for duplicates and logs warnings appropriately
- Cleaner separation of concerns
Enhanced Logging Throughout
- Added structured logging with relevant context (gateway IDs, request IDs, message indices)
- Helps with debugging hibernation-related issues
- Follows the structured logging conventions from CLAUDE.md

🔍 Potential Issues & Concerns

1. TODO Not Addressed (`runner/src/mod.ts:959`)

// TODO: Where is this GC'd if something fails?
for (const hr of startCommand.hibernatingRequests) {
    this.#tunnel.addRequestToActor(hr.gatewayId, hr.requestId, actorId);
}

Issue: If actor start fails after populating the requestToActor map, these entries may leak.

Recommendation: Consider wrapping the try-catch to clean up these mappings on failure:

try {
    // ... actor start logic
} catch (err) {
    // Clean up request mappings
    for (const hr of startCommand.hibernatingRequests) {
        this.#tunnel.removeRequestToActor(hr.gatewayId, hr.requestId);
    }
    // ... rest of error handling
}

2. Actor Replacement Warning (`runner/src/mod.ts:943-952`)

The code logs a warning when replacing an existing actor but doesn't clean up the old actor's resources:

if (existingActor) {
    this.log?.warn({
        msg: "replacing existing actor in actors map",
        // ...
    });
}

Recommendation: Consider calling cleanup on the existing actor before replacement, or validate that this shouldn't happen at all.

3. Inconsistent Error Severity (`runner/src/mod.ts:327,337`)

Changed from error to warn for missing actors and generation mismatches:

this.log?.warn({ msg: "actor not found" });

Question: Is this intentional? If these scenarios are expected during normal operation (e.g., race conditions), warn is appropriate. Otherwise, error might be better.

4. SetWebSocket Duplicate Handling (`runner/src/actor.ts:176-180`)

const exists = this.getWebSocket(gatewayId, requestId) !== undefined;
if (exists) {
    logger()?.warn({ msg: "attempting to set websocket twice" });
    return;  // Silently returns without throwing
}

Concern: Unlike createPendingRequest which replaces the existing entry, setWebSocket silently returns. This inconsistency could hide bugs.

Recommendation: Either throw an error or document why silent failure is the correct behavior here.

5. Message Index Tracking (`pegboard-gateway/src/shared_state.rs:127`)

The line entry.message_index = 0; was removed.

Question: Is this intentional? The comment at line 193 suggests message indices are being tracked. Need to verify this doesn't break message ordering.

6. Logging in Production (`rivetkit/src/actor/instance/state-manager.ts:431-445`)

Very verbose logging at info level for every connection persistence:

this.#actor.rLog.info({
    msg: "persisting connection",
    connId,
    gatewayId: tunnelId.gatewayIdToString(hibernatableDataRaw.requestId),
    // ... lots of details
});

Recommendation: Consider downgrading some of these to debug level to reduce log volume in production.

🔒 Security Considerations

No significant security issues identified. The changes properly validate state and handle error cases.

⚡ Performance Considerations

Array Linear Search (runner/src/actor.ts:58-62, 164-168)
- Using find() on arrays for pending requests and WebSockets
- If these arrays grow large, consider using Map with a composite key
- Current implementation is probably fine for typical use cases
Message Index Logging (runner/src/tunnel.ts:510)
- Debug logging on every message send could be expensive at high volume
- Consider using sampling or conditional logging based on log level
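One way to sample high-frequency logs is sketched below. `createSampledLogger` is a hypothetical helper, not something in the runner SDK, and the `every` parameter and `sampledEvery` field are assumptions made for illustration.

```typescript
type LogFn = (fields: Record<string, unknown>) => void;

// Wraps `log` so that only every `every`-th call is forwarded, starting with the first.
function createSampledLogger(log: LogFn, every: number): LogFn {
  if (!Number.isInteger(every) || every < 1) {
    throw new RangeError("every must be a positive integer");
  }
  let calls = 0;
  return (fields) => {
    if (calls % every === 0) {
      // Record the sampling rate so readers of the log know entries were dropped.
      log({ ...fields, sampledEvery: every });
    }
    calls += 1;
  };
}
```

Checking the logger's level before building the log fields is the simpler alternative and is likely enough if debug logging is disabled in production.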

🧪 Test Coverage

Observation: The PR modifies test files:

driver-test-suite/tests/actor-conn-hibernation.ts
driver-test-suite/mod.ts

The changes appear to be minor adjustments (line 2 changes).

Recommendation: Ensure test coverage for:

Actor start failure cleanup (the TODO mentioned above)
Duplicate WebSocket handling
Race conditions between actor start and incoming messages
Restoration of hibernating connections with stale metadata

📝 Code Quality

Good adherence to CLAUDE.md conventions:
- ✅ Uses structured logging correctly
- ✅ Log messages are lowercase
- ✅ Proper TypeScript patterns
Documentation:
- ✅ Good JSDoc comments on public methods
- ✅ Clear explanation of the hibernation restoration flow
Code organization:
- ✅ Clean separation of concerns
- ✅ Meaningful method names

🎯 Recommendations Summary

High Priority:

1. Address the resource leak TODO (line 959)
2. Clean up existing actor resources before replacement
3. Clarify the behavior when setting duplicate WebSockets

Medium Priority:
4. Verify message_index removal doesn't break ordering
5. Review log level choices (info vs debug vs warn)
6. Add test coverage for edge cases

Low Priority:
7. Consider using Map instead of Array for connection tracking
8. Consider log sampling for high-frequency debug logs

✅ Conclusion

This is a solid PR that fixes a critical hibernation bug and improves the overall robustness of the system. The race condition handling and API improvements are well thought out. The main concerns are around resource cleanup in failure scenarios and some minor inconsistencies in error handling.

Recommendation: Approve with minor changes requested (address the high-priority items above).

Great work on improving the hibernation implementation! 🚀

claude · 2025-11-21T00:57:38Z

Code Review: Hibernation Implementation Fix

I've reviewed the changes in this PR. Overall, the fix addresses critical issues in the hibernation implementation across both Rust (pegboard-gateway) and TypeScript (runner SDK and RivetKit). Here are my findings:

✅ Strengths

1. Critical Bug Fix in Rust

File: engine/packages/pegboard-gateway/src/lib.rs:489
Fix: Changed if lifecycle_res to if !&lifecycle_res (negation added)
Impact: This fixes the logic for determining when to send close frames to runners. Previously, close frames were sent when hibernating (incorrect), now they're properly sent when NOT hibernating.

2. Improved Synchronization in TypeScript Runner

Files: engine/sdks/typescript/runner/src/mod.ts, engine/sdks/typescript/runner/src/actor.ts
Enhancement: Introduction of actorStartPromise to ensure proper sequencing:
- Hibernating requests are registered BEFORE async onActorStart runs
- Incoming messages wait for actor initialization via await actor.actorStartPromise.promise
- This prevents race conditions where messages arrive before the actor is ready

3. Better API Design for Hibernation Restoration

File: engine/sdks/typescript/runner/src/mod.ts:1607-1614
Change: Moved restoreHibernatingRequests to public API with excellent documentation
Benefit: Clear contract for users - they must call this at the end of onActorStart, preventing misuse

4. Enhanced Logging and Debugging

Added structured logging throughout with context (gatewayId, requestId, messageIndex)
Removed unnecessary request_id parameter from ack logging (line 398 in shared_state.rs)
Added hibernation restoration count tracking

⚠️ Issues & Concerns

1. Removed `message_index` Reset - Potential Issue

File: engine/packages/pegboard-gateway/src/shared_state.rs:127
Change: Removed entry.message_index = 0; when reopening a request
Concern: If a WebSocket reconnects after hibernation, should the message index continue from where it left off or reset? The removal suggests continuation, but this needs verification:
- Does the gateway track message indices across hibernation cycles?
- Could this cause message ID conflicts or ordering issues?
Recommendation: Add a comment explaining why this line was removed and the expected behavior for message indices across hibernation boundaries.

2. Missing GC for Failed Actor Start

File: engine/sdks/typescript/runner/src/mod.ts:959
TODO Comment: // TODO: Where is this GC'd if something fails?
Issue: When onActorStart fails, the requestToActor mappings are populated but may not be cleaned up properly
Current Mitigation: forceStopActor is called on failure (line 998), which should close active requests
Recommendation: Verify that closeActiveRequests properly cleans up the requestToActor map, or explicitly handle cleanup in the error path

3. Code Duplication in `createPendingRequest` Methods

File: engine/sdks/typescript/runner/src/actor.ts:65-138
Issue: createPendingRequest and createPendingRequestWithStreamController share significant duplicate code
Recommendation: Refactor to reduce duplication:

createPendingRequest(
  gatewayId: protocol.GatewayId,
  requestId: protocol.RequestId,
  clientMessageIndex: number,
  streamController?: ReadableStreamDefaultController<Uint8Array>,
) {
  // ... shared validation and deletion logic ...
  this.pendingRequests.push({
    gatewayId,
    requestId,
    request: {
      resolve: () => {},
      reject: () => {},
      actorId: this.actorId,
      gatewayId,
      requestId,
      clientMessageIndex,
      ...(streamController && { streamController }),
    },
  });
  // ... shared logging ...
}

4. Potential Race Condition in `setWebSocket`

File: engine/sdks/typescript/runner/src/actor.ts:176-180
Change: Early return instead of delete-then-add when WebSocket exists
Concern: If the same WebSocket is being set twice, returning early might hide a bug rather than fixing it. The warning log is good, but:
- Why would this happen in correct operation?
- Should this be an error/throw instead of a warning?
Recommendation: Consider if this indicates a bug in the caller that should be fixed rather than worked around

5. Hibernation State Validation Missing

File: engine/sdks/typescript/runner/src/tunnel.ts:110-116
Good: Added hibernationRestored flag and assertion
Missing: The flag is checked but there's no enforcement that restoreHibernatingRequests is called for actors with hibernating requests
Recommendation: Consider adding a check in actor stop/cleanup to warn if an actor had hibernatingRequests but never called restore
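Such a check could be as small as the sketch below. `SketchActorState` and its field names are assumptions; the real actor type may track this differently.

```typescript
// Hypothetical shape: only the fields the check needs.
interface SketchActorState {
  actorId: string;
  hibernatingRequestCount: number; // hibernating requests carried by the start command
  hibernationRestored: boolean; // set once restoreHibernatingRequests has run
}

type WarnFn = (fields: Record<string, unknown>) => void;

// Returns true when the actor is consistent, false when a warning was emitted.
function checkHibernationRestoredOnStop(
  actor: SketchActorState,
  warn: WarnFn,
): boolean {
  if (actor.hibernatingRequestCount > 0 && !actor.hibernationRestored) {
    warn({
      msg: "actor stopped without restoring hibernating requests",
      actorId: actor.actorId,
      hibernatingRequestCount: actor.hibernatingRequestCount,
    });
    return false;
  }
  return true;
}
```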

🔍 Security Considerations

No Immediate Security Issues

WebSocket close frame handling is properly secured
No injection vulnerabilities introduced
Proper error handling prevents information leakage

🧪 Testing

Positive:

Existing test actor-conn-hibernation.ts was minimally modified (just imports)
Test coverage appears to validate the hibernation flow

Recommendations:

Add specific test for the negation fix (!&lifecycle_res) to ensure close frames aren't sent during hibernation
Add test for actor start failures to verify requestToActor cleanup
Add test for duplicate setWebSocket calls to document expected behavior

📊 Performance Considerations

Positive:

getAndWaitForActor properly awaits initialization, preventing wasted work
Structured logging uses lazy evaluation (good for production)

Considerations:

Multiple arraysEqual calls for searching pending requests could be optimized with a Map if this becomes a bottleneck
The actorStartPromise adds minimal overhead but provides critical correctness

📝 Documentation

Excellent:

The restoreHibernatingRequests JSDoc is comprehensive and clear
Comments about race condition prevention (line 916-918) are valuable

Could Improve:

Add inline comments explaining the message_index removal in shared_state.rs
Document the expected behavior when setWebSocket is called twice

Summary

This PR makes important fixes to the hibernation implementation, particularly the critical negation bug in the Rust code. The TypeScript changes significantly improve synchronization and API clarity.

Recommended Actions Before Merge:

✅ Verify and document the message_index reset removal
✅ Confirm GC behavior for failed actor starts
⚠️ Consider refactoring duplicate code in createPendingRequest methods
⚠️ Evaluate if duplicate setWebSocket should be an error

Risk Assessment: Low - Changes fix actual bugs and improve robustness, with minimal risk of regression given existing test coverage.

Great work on improving the hibernation reliability! 🚀

claude · 2025-11-21T01:30:58Z

Pull Request Review: Hibernation Implementation Fix

Overview

This PR fixes critical issues with the hibernation implementation across the pegboard gateway, TypeScript runner SDK, and RivetKit.

Critical Bug Fixes

1. Negation Logic Fix in pegboard-gateway/src/lib.rs:489

Previously sending close frames to hibernating connections
Now correctly skips close frames only when hibernating
This was likely causing hibernated WebSocket connections to be prematurely closed

2. Message Index Reset Removed in shared_state.rs:127

Previously reset message index on reconnection, breaking message ordering
Now preserves message index continuity across hibernation cycles
Critical for maintaining proper WebSocket message sequencing

Architecture Improvements

3. Runner Actor Lifecycle Management
The new actorStartPromise pattern prevents race conditions between actor start and incoming tunnel messages.

Question: What happens if onActorStart rejects? The promise is rejected but actor remains in the map until forceStopActor completes. Could this leave zombie actors?

4. Hibernation Restoration API Redesign
The move from automatic loadAll to explicit restoreHibernatingRequests is a significant improvement with better control flow and error handling.

Potential Issues

5. WebSocket Duplication Check (actor.ts:176-180)
Silently returns instead of throwing when attempting to set websocket twice. Could hide bugs.

6. Pending Request Duplication (actor.ts:70-80, 108-117)
Automatically replaces existing pending requests without calling old callbacks. Should throw an error or call the old callbacks with an error.
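A sketch of the "call the old callbacks with an error" option, under simplifying assumptions: `SketchPendingRequest` and `SketchPendingRequests` are hypothetical, and the key is a plain string instead of the gateway/request ID pair.

```typescript
// Assumed shape: a pending request carries the callbacks of whoever is waiting on it.
interface SketchPendingRequest {
  key: string;
  resolve: (value: unknown) => void;
  reject: (reason: Error) => void;
}

class SketchPendingRequests {
  private readonly byKey = new Map<string, SketchPendingRequest>();

  // Replaces any existing entry, but rejects it first so its waiter is not left hanging.
  create(request: SketchPendingRequest): void {
    const existing = this.byKey.get(request.key);
    if (existing) {
      existing.reject(new Error(`pending request ${request.key} was replaced`));
    }
    this.byKey.set(request.key, request);
  }

  get size(): number {
    return this.byKey.size;
  }
}
```

Whether replacing or throwing is the right policy depends on whether duplicates are expected in normal operation, which this review cannot determine from the diff alone.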

Test Coverage Concerns

No new tests for:

Hibernation restoration flow
Race condition fixes
Message index preservation across hibernation
Duplicate WebSocket/pending request scenarios

Recommendations

High Priority:

1. Consider throwing instead of silent returns in setWebSocket duplicate case
2. Handle old callbacks when replacing pending requests
3. Add comprehensive tests for the new hibernation flow

Medium Priority:
4. Document error recovery for actor start failures
5. Consider adding metrics for hibernation restoration success/failure rates

Overall Assessment

This is a solid PR that fixes critical bugs in the hibernation implementation.

Strengths:

Fixes critical negation bug that would break hibernation
Fixes message index reset bug
Improves API design significantly
Good race condition handling

Recommendation: Approve with suggestions. The core fixes are correct and necessary. Suggested improvements are mostly defensive programming and testing.

graphite-app · 2025-11-21T02:40:42Z

Merge activity

Nov 21, 2:40 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
Nov 21, 2:42 AM UTC: CI is running for this pull request on a draft pull request (#3515) due to your merge queue CI optimization settings.
Nov 21, 2:43 AM UTC: Merged by the Graphite merge queue via draft PR: #3515.

MasterPtato mentioned this pull request Nov 20, 2025

fix(tunnel): implement ping pong #3491

Closed

This was referenced Nov 20, 2025

chore(engine-runner): remove echo close on ToClientWebSocketClose #3487

Closed

chore(pegboard-gateway): remove TunnelAck #3488

Closed

chore(pegboard-gateway): add new message id format & add deprecated tunnel ack #3492

Closed

vercel bot deployed to Preview – rivetkit-serverless November 20, 2025 04:53 View deployment

NathanFlurry force-pushed the 11-19-chore_rivetkit_fix_hibernation_implementation branch from e1191b4 to b960df1 Compare November 20, 2025 07:58

vercel bot deployed to Preview – rivetkit-serverless November 20, 2025 08:00 View deployment

NathanFlurry mentioned this pull request Nov 20, 2025

chore(rivetkit): make action cancelled error more descriptive #3500

Closed

MasterPtato force-pushed the 11-19-chore_rivetkit_fix_hibernation_implementation branch from b960df1 to 8145fc3 Compare November 20, 2025 19:38

MasterPtato force-pushed the 11-19-fix_tunnel_implement_ping_pong branch from f0f270e to 40a9f30 Compare November 20, 2025 19:38

vercel bot had a problem deploying to Preview – rivetkit-serverless November 20, 2025 19:38 Failure

MasterPtato mentioned this pull request Nov 20, 2025

fix(pb): stop actors from destroying on goingaway #3501

Closed

NathanFlurry mentioned this pull request Nov 20, 2025

feat(cloudflare-workers): add createInlineClient #3466

Closed

MasterPtato force-pushed the 11-19-chore_rivetkit_fix_hibernation_implementation branch from 193f67a to 8145fc3 Compare November 20, 2025 22:38

MasterPtato force-pushed the 11-19-fix_tunnel_implement_ping_pong branch from d287067 to 40a9f30 Compare November 20, 2025 22:38

NathanFlurry force-pushed the 11-19-fix_tunnel_implement_ping_pong branch from 40a9f30 to d287067 Compare November 20, 2025 23:32

NathanFlurry force-pushed the 11-19-chore_rivetkit_fix_hibernation_implementation branch from 8145fc3 to 193f67a Compare November 20, 2025 23:32

NathanFlurry mentioned this pull request Nov 20, 2025

chore(examples): add counter-next-js example #3506

Closed

NathanFlurry mentioned this pull request Nov 21, 2025

fix(engine): ensure first keepalive tick is not skipped to prevent timeout on ws hibernation #3507

Closed

MasterPtato force-pushed the 11-19-fix_tunnel_implement_ping_pong branch from d287067 to 40a9f30 Compare November 21, 2025 00:48

MasterPtato force-pushed the 11-19-chore_rivetkit_fix_hibernation_implementation branch from 193f67a to 8145fc3 Compare November 21, 2025 00:48

MasterPtato mentioned this pull request Nov 21, 2025

chore: remove msg id parts from protocol, move id utils to runner protocol crate #3508

Closed

NathanFlurry force-pushed the 11-19-fix_tunnel_implement_ping_pong branch from 40a9f30 to d287067 Compare November 21, 2025 01:27

NathanFlurry force-pushed the 11-19-chore_rivetkit_fix_hibernation_implementation branch from 8145fc3 to 193f67a Compare November 21, 2025 01:27

This was referenced Nov 21, 2025

chore(rivetkit): buffer outbound tunnel messages #3509

Closed

docs: runner lifecycle #3510

Closed

chore(engine): upgrade vbare 0.0.4 #3511

Closed

This was referenced Nov 21, 2025

fix(runner-protocol): add noop converters for v1 -> v3 #3512

Closed

fix(runner-protocol): fix decoding gateway & req id from req id instead of message id #3513

Closed

chore(pegboard-runner): log runner messages #3514

Closed

NathanFlurry requested a review from MasterPtato November 21, 2025 02:33

graphite-app bot pushed a commit that referenced this pull request Nov 21, 2025

chore(rivetkit): fix hibernation implementation (#3494)

b3313f9

graphite-app bot closed this Nov 21, 2025

graphite-app bot deleted the 11-19-chore_rivetkit_fix_hibernation_implementation branch November 21, 2025 02:43

NathanFlurry commented Nov 20, 2025

vercel bot commented Nov 20, 2025 • edited

NathanFlurry commented Nov 20, 2025 • edited by MasterPtato

claude bot commented Nov 20, 2025

Code Review: Hibernation Implementation Fix

🟢 Strengths

1. Critical Bug Fix in Rust Gateway (pegboard-gateway/src/lib.rs:489)

2. Race Condition Prevention (runner/src/mod.ts:915-962)

3. Improved API Design (runner/src/mod.ts:109-146)

4. Better State Management (runner/src/actor.ts:65-99)

🟡 Issues & Concerns

1. Resource Leak on Actor Start Failure (runner/src/mod.ts:959)

2. Duplicate WebSocket Warning Without Action (runner/src/actor.ts:176-180)

3. Missing Error Handling in Async Message Handler (runner/src/mod.ts:852-869)

4. Potential State Inconsistency (runner/src/tunnel.ts:116-119)

5. Removed Message Index Reset (pegboard-gateway/src/shared_state.rs:127)

🔵 Minor Issues

1. Logging Improvements Are Good ✅

2. Removed Debug Log (pegboard-gateway/src/shared_state.rs:395)

3. Test Changes (driver-test-suite/tests/actor-conn-hibernation.ts)

🔒 Security Considerations

📊 Performance Considerations

🧪 Test Coverage

📝 Documentation

✅ Recommendations Summary

Must Fix (Before Merge):

Should Fix:

Nice to Have:

claude bot commented Nov 20, 2025

Code Review: Hibernation Implementation Fix

Summary

Positive Observations

Issues & Concerns

Performance Considerations

Test Coverage

Recommendations

Overall Assessment

claude bot commented Nov 20, 2025

Code Review: Hibernation Implementation Fix

🐛 Critical Bug Fix (pegboard-gateway/src/lib.rs:489)

🏗️ Architecture Improvements

1. Actor Start Synchronization (runner/src/mod.ts)

2. Pending Request Management (runner/src/actor.ts)

3. Hibernation State Contract

🔍 Code Quality

Positive:

Concerns:

⚡ Performance Considerations

🧪 Testing

🔒 Security

📝 Documentation

🎯 Summary

claude bot commented Nov 20, 2025

PR Review: chore(rivetkit): fix hibernation implementation

Critical Issues

1. Inverted Boolean Logic in Rust (lib.rs:489)

2. Race Condition in Actor Start (mod.ts:933-945)

3. Actor Removed After Stop (mod.ts:288-310)

Potential Bugs

4. Missing Cleanup on Actor Start Failure

5. Empty resolve/reject Functions (actor.ts:85-86)

6. Duplicate WebSocket Detection (actor.ts:176-180)

Code Quality

7. Typo in Log Message (tunnel.ts:762)

8. Inconsistent Logging

9. Removed message_index Reset (shared_state.rs:127)

Performance

`engine/sdks/typescript/runner/src/mod.ts`

`engine/sdks/typescript/runner/src/actor.ts`

`engine/packages/pegboard-gateway/src/lib.rs`

1. Actor Lifecycle Management (`mod.ts:915-990`)

2. Hibernation Restoration Flow (`tunnel.ts:105-280`)

3. Race Condition Prevention (`tunnel.ts:444-454`)

1. Pending Request Duplication (`actor.ts:65-133`)

2. WebSocket Duplication Handling (`actor.ts:173-181`)

3. Removed Field (`shared_state.rs:127`)