-
Notifications
You must be signed in to change notification settings - Fork 134
chore(rivetkit): fix hibernation implementation #3494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(rivetkit): fix hibernation implementation #3494
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
3 Skipped Deployments
|
Code Review: Hibernation Implementation FixThis PR addresses critical issues in the hibernation implementation for WebSocket connections. Overall, the changes improve the architecture and fix race conditions, though there are some areas that need attention. 🟢 Strengths1. Critical Bug Fix in Rust Gateway (pegboard-gateway/src/lib.rs:489)-if lifecycle_res
+if !&lifecycle_resExcellent catch! This was a logic inversion bug - the code was sending close frames when it should hibernate, and vice versa. This fix is critical for proper hibernation behavior. 2. Race Condition Prevention (runner/src/mod.ts:915-962)The addition of
The comment at line 916 is particularly valuable: // IMPORTANT: Make sure no async code runs before inserting #actors and
// calling addRequestToActor in order to prevent race conditions with
// subsequence commands3. Improved API Design (runner/src/mod.ts:109-146)The refactored
4. Better State Management (runner/src/actor.ts:65-99)Replacing
🟡 Issues & Concerns1. Resource Leak on Actor Start Failure (runner/src/mod.ts:959)// TODO: Where is this GC'd if something fails?
for (const hr of startCommand.hibernatingRequests) {
this.#tunnel.addRequestToActor(hr.gatewayId, hr.requestId, actorId);
}Critical: If Recommendation: try {
await this.#config.onActorStart(actorId, generation, actorConfig);
instance.actorStartPromise.resolve();
} catch (err) {
this.log?.error({
msg: "error starting runner actor",
actorId,
err,
});
instance.actorStartPromise.reject(err);
// Clean up requestToActor mappings
for (const hr of startCommand.hibernatingRequests) {
this.#tunnel.removeRequestToActor(hr.gatewayId, hr.requestId);
}
await this.forceStopActor(actorId, generation);
}2. Duplicate WebSocket Warning Without Action (runner/src/actor.ts:176-180)const exists = this.getWebSocket(gatewayId, requestId) !== undefined;
if (exists) {
logger()?.warn({ msg: "attempting to set websocket twice" });
return; // Just returns without setting!
}Issue: This silently fails when trying to add a duplicate WebSocket. Should this throw an error instead? Or delete the existing one first? The behavior differs from Recommendation: Be consistent with 3. Missing Error Handling in Async Message Handler (runner/src/mod.ts:852-869)#handleCommands(commands: protocol.ToClientCommands) {
for (const commandWrapper of commands) {
if (commandWrapper.inner.tag === "CommandStartActor") {
// Spawn background promise
this.#handleCommandStartActor(commandWrapper);
} else if (commandWrapper.inner.tag === "CommandStopActor") {
// Spawn background promise
this.#handleCommandStopActor(commandWrapper);
}
}
}Issue: These background promises are fire-and-forget. If they throw unhandled exceptions, they could cause silent failures. Recommendation: this.#handleCommandStartActor(commandWrapper).catch(err => {
this.log?.error({ msg: "unhandled error in CommandStartActor", err });
});4. Potential State Inconsistency (runner/src/tunnel.ts:116-119)if (actor.hibernationRestored) {
throw new Error(
`Actor ${actorId} already restored hibernating requests`,
);
}Good: This prevents double-restoration, but... Question: What happens if 5. Removed Message Index Reset (pegboard-gateway/src/shared_state.rs:127)-entry.message_index = 0;Question: Why was this line removed? Is the message index intentionally preserved across hibernation cycles now? This should be documented or explained in the commit message if it's an intentional behavior change. 🔵 Minor Issues1. Logging Improvements Are Good ✅The added structured logging throughout (especially in actor.ts and tunnel.ts) will make debugging much easier. Good practice! 2. Removed Debug Log (pegboard-gateway/src/shared_state.rs:395)-request_id=?tunnel_id::request_id_to_string(&request_id),Was this field removed to reduce log verbosity? Consider if this information is still useful for debugging. 3. Test Changes (driver-test-suite/tests/actor-conn-hibernation.ts)The test file only has minor changes ( 🔒 Security ConsiderationsNo obvious security issues detected. The changes maintain proper separation between actors and don't introduce new attack vectors. 📊 Performance Considerations
🧪 Test CoverageExisting tests in
Missing coverage:
📝 DocumentationThe JSDoc comments added to ✅ Recommendations SummaryMust Fix (Before Merge):
Should Fix:
Nice to Have:
Overall, this is a solid improvement that fixes critical bugs and improves the architecture. The main concerns are around error handling and resource cleanup in failure scenarios. |
e1191b4 to
b960df1
Compare
Code Review: Hibernation Implementation FixThis PR fixes critical issues in the hibernation implementation for RivetKit. SummaryAddresses race conditions and synchronization issues in the WebSocket hibernation flow, ensuring actors are fully initialized before processing hibernating requests. Positive Observations1. Critical Bug Fix in Rust (pegboard-gateway/src/lib.rs:489) 2. Proper Race Condition Handling 3. Improved API Design 4. Enhanced Logging Issues & Concerns1. CRITICAL: Potential Resource Leak (mod.ts:959) Recommendation: Add cleanup in the catch block to remove all hibernatingRequests from the tunnel's requestToActor map. 2. Warning Suppression (actor.ts:177-180) 3. Typo (tunnel.ts:762) 4. Code Duplication (actor.ts:65-137) 5. Missing Validation Performance ConsiderationsLinear Search in Arrays Test CoverageMissing tests for:
RecommendationsMust Fix Before Merge:
Should Fix:
Nice to Have:
Overall AssessmentRating: 7.5/10 This PR fixes critical bugs and improves architecture. Main concerns are the resource leak (must fix) and missing test coverage. With those addressed, this would be production-ready. Great catch on the Rust logic bug! |
b960df1 to
8145fc3
Compare
f0f270e to
40a9f30
Compare
Code Review: Hibernation Implementation FixThanks for working on this hibernation fix! I've reviewed the changes across the Rust gateway code and TypeScript SDK. Here's my feedback: 🐛 Critical Bug Fix (pegboard-gateway/src/lib.rs:489)Excellent catch! The negation operator fix is critical: -if lifecycle_res
+if !&lifecycle_resThis was closing WebSocket connections when they should be hibernating. This is the core fix and correctly addresses the hibernation behavior. 🏗️ Architecture Improvements1. Actor Start Synchronization (runner/src/mod.ts)The new
However, there's a potential resource leak (line 959 TODO): // TODO: Where is this GC'd if something fails?
for (const hr of startCommand.hibernatingRequests) {
this.#tunnel.addRequestToActor(hr.gatewayId, hr.requestId, actorId);
}Recommendation: Add cleanup in the catch block at line 987-999 to remove the request-to-actor mappings if initialization fails: } catch (err) {
// Clean up request mappings if actor start failed
for (const hr of startCommand.hibernatingRequests) {
this.#tunnel.removeRequestToActor(hr.gatewayId, hr.requestId);
}
// ... rest of error handling
}2. Pending Request Management (runner/src/actor.ts)Good refactoring with
Minor concern (line 82): The duplicate request warning at line 66 might mask bugs rather than expose them. Consider if this should be an error instead of a warning, or at least add metrics to track how often this occurs in production. 3. Hibernation State ContractThe 🔍 Code QualityPositive:
Concerns:1. Excessive Logging (state-manager.ts) this.#actor.rLog.info({ msg: "savePersistInner called", ... });
this.#actor.rLog.info({ msg: "prepared entries for kvBatchPut", ... });
this.#actor.rLog.info({ msg: "calling kvBatchPut", ... });
this.#actor.rLog.info({ msg: "kvBatchPut completed successfully" });Recommendation: Downgrade these to 2. Import Ordering (pegboard-gateway/src/lib.rs:9-20)
3. Removed Debug Context - request_id=?tunnel_id::request_id_to_string(&request_id),This removes helpful context from the log at shared_state.rs:398. Was this intentional? The 4. Deleted Message Index Reset (shared_state.rs:127) -entry.message_index = 0;The removal of this line needs explanation. Is message index now preserved across reconnections? If so, add a comment explaining why. ⚡ Performance Considerations
🧪 TestingMissing Test Coverage:
Existing Tests: 🔒 SecurityNo immediate security concerns identified. The hibernation state management appears sound. 📝 DocumentationThe
Great documentation! 🎯 SummaryCritical Issues:
Recommendations:
Overall Assessment: This is a solid fix for a critical bug. The architectural improvements around actor start synchronization are well-designed. The main concern is the unhandled TODO around resource cleanup in error cases. I'd recommend addressing that before merge. Would you like me to help with any of these issues or have questions about the feedback? |
PR Review: chore(rivetkit): fix hibernation implementationThis PR refactors the hibernation implementation for WebSocket connections. Overall, the changes improve clarity, but there are several critical areas that need attention. Critical Issues1. Inverted Boolean Logic in Rust (lib.rs:489)The logic 'if !&lifecycle_res' appears incorrect. The negation '!' is being applied to a reference, which will always be true for any non-null reference. This will cause the gateway to send close frames at the wrong time. 2. Race Condition in Actor Start (mod.ts:933-945)The PR adds hibernatingRequests to the requestToActor map BEFORE onActorStart completes. If incoming tunnel messages arrive during this time and the actor fails during onActorStart, there's a potential race condition. 3. Actor Removed After Stop (mod.ts:288-310)closeActiveRequests is called BEFORE removing the actor. If another CommandStartActor arrives during this window, you could have duplicate actors briefly. Potential Bugs4. Missing Cleanup on Actor Start FailureWhen onActorStart fails, requestToActor mappings were already added. Verify these are properly cleaned up. 5. Empty resolve/reject Functions (actor.ts:85-86)The resolve and reject callbacks are empty no-ops. Document why this is intentional. 6. Duplicate WebSocket Detection (actor.ts:176-180)setWebSocket returns early without replacing duplicates, while createPendingRequest replaces them. This inconsistency could cause bugs. Code Quality7. Typo in Log Message (tunnel.ts:762)'catught' should be 'caught' 8. Inconsistent LoggingStandardize on including both gatewayId and requestId in WebSocket logs. 9. Removed message_index Reset (shared_state.rs:127)Add comment explaining why removing this reset is safe. Performance10. Linear SearchConsider using Map instead of array for pendingRequests and webSockets if many concurrent connections become a bottleneck. Test Coverage11. Missing TestsAdd tests for: the critical Rust logic fix, race conditions during actor start, duplicate handling, and failure scenarios. Documentation12. Excellent ImprovementsThe JSDoc additions for hibernatableWebSocket, onActorStart, and restoreHibernatingRequests significantly improve API clarity. SummaryApprove with Changes Required The architectural changes are sound improvements. However, critical issues must be addressed:
|
PR Review: Hibernation Implementation FixCritical Issues1. Logic Inversion Bug (pegboard-gateway/src/lib.rs:489) The negation operator is applied to a reference rather than the boolean: if !&lifecycle_res.as_ref().map_or_else(is_ws_hibernate, |_| false)Should be: if !lifecycle_res.as_ref().map_or_else(is_ws_hibernate, |_| false)Major Concerns2. Duplicate WebSocket Prevention (runner/src/actor.ts:176-180) Silent failure when duplicate detected - caller may expect WebSocket was registered. Either replace existing, throw error, or document why silent failure is OK. 3. Actor Removal Timing (runner/src/mod.ts:308-311) Moving removeActor after closeActiveRequests changes timing. Need detailed comments on race condition prevention. Code Quality
PerformancePromise.allSettled silently swallows errors. Consider logging which WebSockets failed to restore. SecurityNo input validation on hibernation metadata. Add validation to prevent crashes from corrupted data. Testing GapsNeed tests for edge cases like already-restored state, mismatched metadata, concurrent operations. SummaryMust Fix:
Should Address:
Positive: Good architectural improvements with actorStartPromise pattern and better separation of concerns. Overall: Solid refactoring. Main blocker is Rust bug. Otherwise ready after clarifications. |
193f67a to
8145fc3
Compare
d287067 to
40a9f30
Compare
Pull Request Review: Fix Hibernation ImplementationSummaryThis PR fixes critical issues in the hibernation implementation for WebSocket connections. The changes primarily affect the TypeScript runner SDK and the Rust pegboard gateway, with a focus on proper lifecycle management and race condition prevention during actor startup and WebSocket restoration. Code Quality & Best Practices✅ Strengths
Potential Bugs & Issues🔴 Critical Issues
|
Code Review: Hibernation Implementation FixOverviewThis PR fixes the hibernation implementation for RivetKit by addressing race conditions, improving WebSocket restoration flow, and enhancing error handling. The changes span both Rust (pegboard-gateway) and TypeScript (runner SDK) components. Critical Bug Fix ✅Location: - if lifecycle_res
+ if \!&lifecycle_resIssue: This was a critical logic bug - the condition was inverted. The code was sending close frames when hibernating (should not) and not sending them when not hibernating (should). Impact: This fix ensures WebSocket close frames are only sent to the runner when NOT hibernating, which is the correct behavior. Architecture Changes 🏗️1. Actor Lifecycle Management (
|
40a9f30 to
d287067
Compare
8145fc3 to
193f67a
Compare
Code Review - Hibernation Implementation FixThis PR addresses critical issues in the WebSocket hibernation implementation. I've conducted a thorough review focusing on code quality, potential bugs, performance, and security. 🎯 Critical Issues1. CRITICAL: Logic Error in Rust (pegboard-gateway/src/lib.rs:489)-if lifecycle_res
+if \!&lifecycle_resStatus: ✅ FIXED This is the core fix! The original logic was inverted - it would send close frames when it should hibernate. However, the syntax Recommendation: Change to 2. CRITICAL: Resource Leak on Actor Start Failure (mod.ts:959)// TODO: Where is this GC'd if something fails?
for (const hr of startCommand.hibernatingRequests) {
this.#tunnel.addRequestToActor(hr.gatewayId, hr.requestId, actorId);
}If Impact: Memory leak, incorrect routing of future requests Fix Required: } catch (err) {
this.log?.error({ msg: "error starting runner actor", actorId, err });
instance.actorStartPromise.reject(err);
// Clean up request-to-actor mappings
for (const hr of startCommand.hibernatingRequests) {
this.#tunnel.removeRequestToActor(hr.gatewayId, hr.requestId);
}
await this.forceStopActor(actorId, generation);
}Note: You'll need to make 3. Race Condition: Actor Removal Timing (mod.ts:305-310)this.#tunnel?.closeActiveRequests(actor);
// Remove actor after stopping
this.#removeActor(actorId, generation);
this.#sendActorStateUpdate(actorId, actor.generation, "stopped");
|
d287067 to
40a9f30
Compare
193f67a to
8145fc3
Compare
Pull Request Review: Hibernation Implementation FixI've reviewed this PR which fixes the hibernation implementation. Overall, the changes look good and address important race conditions and lifecycle management issues. Here are my findings: ✅ Positive Changes
🔍 Potential Issues & Concerns1. TODO Not Addressed (
|
Code Review: Hibernation Implementation FixI've reviewed the changes in this PR. Overall, the fix addresses critical issues in the hibernation implementation across both Rust (pegboard-gateway) and TypeScript (runner SDK and RivetKit). Here are my findings: ✅ Strengths1. Critical Bug Fix in Rust
2. Improved Synchronization in TypeScript Runner
3. Better API Design for Hibernation Restoration
4. Enhanced Logging and Debugging
|
40a9f30 to
d287067
Compare
8145fc3 to
193f67a
Compare
Pull Request Review: Hibernation Implementation FixOverviewThis PR fixes critical issues with the hibernation implementation across the pegboard gateway, TypeScript runner SDK, and RivetKit. Critical Bug Fixes1. Negation Logic Fix in pegboard-gateway/src/lib.rs:489
2. Message Index Reset Removed in shared_state.rs:127
Architecture Improvements3. Runner Actor Lifecycle Management Question: What happens if onActorStart rejects? The promise is rejected but actor remains in the map until forceStopActor completes. Could this leave zombie actors? 4. Hibernation Restoration API Redesign Potential Issues5. WebSocket Duplication Check (actor.ts:176-180) 6. Pending Request Duplication (actor.ts:70-80, 108-117) Test Coverage ConcernsNo new tests for:
RecommendationsHigh Priority:
Medium Priority: Overall AssessmentThis is a solid PR that fixes critical bugs in the hibernation implementation. Strengths:
Recommendation: Approve with suggestions. The core fixes are correct and necessary. Suggested improvements are mostly defensive programming and testing. |
Merge activity
|

No description provided.