-
Notifications
You must be signed in to change notification settings - Fork 135
fix(rivetkit): fix graceful runner shutdown #3497
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(rivetkit): fix graceful runner shutdown #3497
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
3 Skipped Deployments
|
Code Review: Graceful Runner ShutdownI've reviewed PR #3497 which fixes graceful runner shutdown. Here are my findings: ✅ Strengths
|
34ee3a4 to
e1152dc
Compare
4f192f3 to
a11d203
Compare
PR Review: Fix Graceful Runner ShutdownI've reviewed the changes in this PR. Overall, this is a solid improvement to the shutdown flow that addresses several important issues. Here are my findings: ✅ Strengths
|
Code ReviewOverviewThis PR improves graceful shutdown handling in the TypeScript runner by implementing proper actor lifecycle management during shutdown. The changes address race conditions and ensure actors are fully stopped before closing connections. Positive Changes ✅
Issues & Concerns 🔴1. Critical: Commented-out process.exit() calls// TODO: Add back
// process.exit(0);Location: Issue: Signal handlers no longer exit the process, which could leave the process hanging indefinitely. This breaks expected SIGTERM/SIGINT behavior. Recommendation: Either:
2. Race condition: Shutdown flag timingthis.#shuttingDown = true;
// ...
this.#shutdown = true;Location: Issue: There's a window between setting Recommendation: Consider consolidating to a single shutdown state or document the distinction between the two flags clearly. 3. Removed shutdown guards may allow invalid operationsMultiple methods had
Issue: These operations can now be attempted during shutdown, potentially causing:
Recommendation: Either:
4. Potential infinite wait in #waitForActorsToStopconst interval = setInterval(() => {
if (checkActors()) {
clearInterval(interval);
resolve();
}
}, shutdownCheckInterval);Location: Issue: If an actor gets stuck or the server never sends StopActor commands, this will wait for the full 120s timeout. Recommendation:
5. Logging improvements need verbosity considerationThe new debug logging in
Recommendation: Consider using Performance Considerations ⚡
Security Considerations 🔒No security vulnerabilities identified. The changes are primarily about lifecycle management. Test Coverage 📊Missing:
Recommendation: Add integration tests covering: - Graceful shutdown with active actors
- Shutdown timeout behavior
- Concurrent shutdown requests
- Signal handling (SIGTERM/SIGINT)
- Actor stop acknowledgment flowCode Style ✨Generally follows conventions well:
Recommendations SummaryHigh Priority:
Medium Priority: Low Priority: VerdictThis PR makes important improvements to shutdown reliability, but has critical issues that should be addressed before merging:
Please address the high-priority items before merging. |
e1152dc to
968a92b
Compare
a11d203 to
0ed702c
Compare
Code Review: Graceful Runner ShutdownI've reviewed this PR and have the following feedback: ✅ Strengths1. Shutdown Idempotency 2. Proper Actor Cleanup
3. Enhanced Debugging 🚨 Critical Issues1. Process Exit Commented Out (BLOCKING) // TODO: Add back
// process.exit(0);Issue: The signal handlers now await shutdown but never exit the process. This means:
Fix: Remove the TODO and uncomment 2. Removed Shutdown Guards May Cause Race Conditions
Issue: Without these guards, operations can be queued/sent after shutdown begins, potentially causing:
Fix: Consider retaining selective guards, especially for:
Or provide explicit justification for why operations should continue during shutdown. 3. Signal Handler Memory Leak SIGNAL_HANDLERS.push(async () => {
const weak = new WeakRef(this);
await weak.deref()?.shutdown(false, false);
});Issue: Every
Fix: Track which runner created which handler and clean up in
|
| Severity | Count | Status |
|---|---|---|
| Critical | 3 | 🔴 Must fix |
| Moderate | 4 | 🟡 Should fix |
| Minor | 3 | 🟢 Optional |
Recommendation: Request changes to address critical issues before merge, particularly:
- Uncomment
process.exit(0)or document why it's removed - Review removed shutdown guards for race conditions
- Fix signal handler memory leak
The core approach to graceful shutdown is sound, but the execution has some gaps that could cause production issues.
Pull Request Review: Fix Graceful Runner ShutdownSummaryThis PR improves the graceful shutdown mechanism for the TypeScript runner by implementing proper waiting for actors to stop before closing the WebSocket connection. The changes address a critical issue where the runner would close the WebSocket before actors finished stopping, potentially causing state inconsistencies. Positive Changes1. Proper Actor Shutdown Coordination ✅The new
This is a significant improvement over the previous implementation which didn't wait for actors. 2. Concurrency Guard ✅The addition of if (this.#shuttingDown) {
this.log?.debug({ msg: "shutdown already in progress, ignoring" });
return;
}3. Enhanced Debugging ✅The additional logging in
4. Async Signal Handlers ✅Converting signal handlers to async (lines 417-446) allows proper awaiting of shutdown, which is necessary for graceful cleanup. Issues & Concerns1. Critical: Commented Out Process.exit()
|
968a92b to
e1152dc
Compare
0ed702c to
a11d203
Compare
Pull Request Review: Graceful Runner ShutdownThank you for this PR! I've reviewed the changes and have some feedback organized by category: Code Quality & Best Practices ✅Positive Changes:
Concerns:
Potential Bugs 🐛
Performance Considerations ⚡
Security Concerns 🔒No major security concerns identified. The shutdown logic properly rejects in-flight KV requests and closes connections. Test Coverage 📝Critical Issue:
Recommendations:
Additional Observations
SummaryThis PR makes important improvements to shutdown handling, but has some issues that should be addressed: Must Fix Before Merge:
Should Consider:
Nice to Have:
Overall, the core approach is sound, but please address the critical issues before merging. Happy to re-review once these are addressed! |
e1152dc to
968a92b
Compare
Pull Request Review: Fix Graceful Runner ShutdownOverviewThis PR fixes issues with graceful runner shutdown by implementing proper actor lifecycle management and improving debugging capabilities. The changes span three files with a focus on ensuring all actors are properly stopped before closing WebSocket connections. Code Quality & Best Practices✅ Strengths
|
968a92b to
e1152dc
Compare
0ed702c to
a11d203
Compare
PR Review: Graceful Runner ShutdownThis PR improves the graceful shutdown flow for the RivetKit runner. Overall, the changes look solid and address important race conditions. Here's my detailed feedback: ✅ Strengths
|
PR Review: Fix graceful runner shutdownSummaryThis PR improves graceful shutdown handling for the TypeScript runner by properly waiting for all actors to stop before closing the WebSocket connection. The changes also add enhanced debugging logs for tracking action lifecycle in the client connection. Code Quality & Best Practices ✅Positive:
Areas for Improvement: 1. TODO Comments Need ResolutionLocation: // TODO: Add back
// process.exit(0);Issue: The
Recommendation: Either:
2. Race Condition Risk with #shuttingDown FlagLocation: if (this.#shuttingDown) {
this.log?.debug({
msg: "shutdown already in progress, ignoring",
});
return;
}
this.#shuttingDown = true;Issue: While this prevents concurrent shutdowns, it doesn't reset the flag if shutdown fails partway through. Recommendation: Add error handling to reset 3. Removed Safety Guards May Cause IssuesLocation: Multiple locations in The PR removes several
Issue: These guards prevented operations during shutdown. Removing them could allow:
Recommendation:
4. Magic Numbers Should Be ConstantsLocation: const shutdownTimeout = 120_000; // 120 seconds
const shutdownCheckInterval = 100; // Check every 100ms
const progressLogInterval = 5_000; // Log progress every 5 secondsRecommendation: Move these to class-level or module-level constants for easier configuration and testing. 5. WebSocket ReadyState Magic NumbersLocation: const wsIsClosed = ws.readyState === 2 || ws.readyState === 3;Recommendation: Use named constants from WebSocket standard: const wsIsClosed = ws.readyState === ws.CLOSING || ws.readyState === ws.CLOSED;Potential Bugs 🐛1. Potential Memory Leak in #waitForActorsToStopLocation: Issue: If the promise resolves via the immediate check (line 645-650), the interval is never created. However, if it enters the interval creation path, there's a theoretical edge case where the interval might not be cleared if something throws. Recommendation: Wrap in try-finally or use a cleanup pattern to ensure the interval is always cleared. 2. Action ID Tracking MismatchLocation: The enhanced error logging is good, but the error itself ( Recommendation: Consider if there are legitimate race conditions that could cause this, and handle them more gracefully rather than throwing an internal error. Performance Considerations ⚡1. Polling Interval EfficiencyLocation: The 100ms polling interval is reasonable, but could be improved with an event-driven approach: Recommendation: Consider using a Promise that resolves when the last actor is removed, rather than polling. This would be more efficient and responsive. 2. Logging OverheadLocation: The PR adds several debug/trace level logs that include potentially expensive operations (Array.from, map operations): inFlightActions: Array.from(this.#actionsInFlight.entries()).map(([id, action]) => ({
id,
name: action.name,
})),Recommendation: Ensure these are only evaluated when the log level is active, or wrap in conditional checks. Security Concerns 🔒1. Shutdown Timeout Could Be ExploitedLocation: The 120-second timeout for actor shutdown is generous. If actors are not properly implementing shutdown, this could:
Recommendation:
Test Coverage 📋Observations:
Recommendations:
Suggested ChangesHigh Priority:
Medium Priority:
Low Priority:
Overall AssessmentThis PR makes meaningful improvements to graceful shutdown handling, which is critical for production systems. The core logic is sound, but there are several areas that need attention:
Recommendation: Request changes to address the TODOs and add test coverage before merging. |
e1152dc to
968a92b
Compare
a11d203 to
0ed702c
Compare
Pull Request Review: fix(rivetkit): fix graceful runner shutdownSummaryThis PR fixes graceful shutdown of the TypeScript runner by implementing proper actor lifecycle management during shutdown. The changes ensure all actors are stopped before closing the WebSocket connection and prevent race conditions during shutdown. Code Quality ✅Strengths:
Concerns:
Potential Bugs 🐛
Performance Considerations ⚡Good:
Potential Improvements:
Security Concerns 🔒Minor:
Test Coverage 📊Current State:
Critical Gap:
Testing the fix: Specific File Reviews
|
Code ReviewThank you for this PR improving graceful runner shutdown! The implementation looks solid overall. Here's my detailed feedback: ✅ Strengths
|
0ed702c to
a11d203
Compare
968a92b to
e1152dc
Compare
Pull Request Review: fix(rivetkit): fix graceful runner shutdownSummaryThis PR improves the graceful shutdown mechanism for the TypeScript runner, addressing issues with actor lifecycle management during shutdown and adding better tracking for in-flight actions in the client connection. Code Quality & Best Practices ✅Strengths:
Areas for Improvement:
Potential Bugs 🐛
Performance Considerations ⚡
Security Concerns 🔒No significant security concerns identified. The changes are primarily focused on lifecycle management and don't introduce new attack vectors. Test Coverage 🧪Concerns:
Additional ObservationsActorConnRaw Changes:
This is excellent for debugging, but consider:
Logging Convention Adherence:
RecommendationsHigh Priority:
Medium Priority:
Low Priority:
VerdictThis PR makes meaningful improvements to graceful shutdown handling and action tracking. The code quality is generally good with proper logging and error handling. The main concerns are:
Recommendation: Request clarification on the removed shutdown checks and TODOs before merging. Consider adding tests for the shutdown scenarios. |
e1152dc to
968a92b
Compare
a11d203 to
0ed702c
Compare
Code ReviewThis PR addresses graceful runner shutdown. Overall, the changes improve shutdown handling, but there are several concerns that should be addressed: Critical Issues1. TODOs Left in Production Code (engine/sdks/typescript/runner/src/mod.ts:424-425, 434-435)// TODO: Add back
// process.exit(0);The Recommendation: Either restore the 2. Race Condition in Shutdown Guards (engine/sdks/typescript/runner/src/mod.ts:264-273, 1112-1119, 1153-1160)Multiple
Recommendation: Consider using the 3. Missing Error Handling in Signal Handlers (engine/sdks/typescript/runner/src/mod.ts:417-436)Signal handlers now use async/await but don't catch errors: for (const handler of SIGNAL_HANDLERS) {
await handler(); // No try/catch
}If a handler throws, subsequent handlers won't run and the error will be unhandled. Recommendation: for (const handler of SIGNAL_HANDLERS) {
try {
await handler();
} catch (error) {
console.error('Error in signal handler:', error);
}
}Medium Priority Issues4. Polling-Based Shutdown is Inefficient (engine/sdks/typescript/runner/src/mod.ts:597-672)The async #waitForActorsToStop(ws: WebSocket): Promise<void> {
// Use Promise.race with a timeout instead of polling
const timeout = new Promise((resolve) =>
setTimeout(() => resolve('timeout'), 120_000)
);
const actorsStopped = new Promise((resolve) => {
const checkInterval = setInterval(() => {
if (this.#actors.size === 0) {
clearInterval(checkInterval);
resolve('complete');
}
}, 100);
});
const result = await Promise.race([actorsStopped, timeout]);
// ... handle result
}Even better would be tracking actor lifecycle with events/promises. 5. WebSocket State Check is Fragile (engine/sdks/typescript/runner/src/mod.ts:608)const wsIsClosed = ws.readyState === 2 || ws.readyState === 3;Use WebSocket constants for readability: const wsIsClosed = ws.readyState === ws.CLOSING || ws.readyState === ws.CLOSED;6. Debug Logging Changes Lack Context (rivetkit-typescript/packages/rivetkit/src/client/actor-conn.ts:200-223, 469-476, 572-593)The added debug/error logging for action in-flight tracking is helpful for debugging, but:
Recommendation: Add a comment explaining why this debugging was needed (e.g., "Track down issue with actions not being removed from in-flight map"). Minor Issues7. Import Order Changed (engine/sdks/typescript/runner/src/mod.ts:4-13)Import order was changed without explanation. While not critical, this violates the code style guideline: "Always add imports at the top of the file inside of inline within the function." Also check if there's an import sorting convention in the project. 8. Example Script Modified (examples/counter/scripts/connect.ts)The removal of the 10-second delay and Positive Aspects✅ Good: Added Security ConsiderationsNo major security issues identified. The shutdown handling changes don't introduce obvious vulnerabilities. Performance Considerations
Test Coverage❌ Missing: No new tests for the graceful shutdown behavior Recommendation: Add tests in
SummaryThe PR makes important improvements to shutdown handling, but should not be merged until:
Please address the critical issues before merging. Happy to review again once updated! |
Merge activity
|

No description provided.