
Conversation

@NathanFlurry
Member

No description provided.

@vercel

vercel bot commented Nov 20, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Preview | Comments | Updated (UTC)
rivetkit-serverless | Error | Error | | Nov 20, 2025 8:52pm

3 Skipped Deployments

Project | Deployment | Preview | Comments | Updated (UTC)
rivet-cloud | Ignored | Ignored | | Nov 20, 2025 8:52pm
rivet-inspector | Ignored | Ignored | Preview | Nov 20, 2025 8:52pm
rivet-site | Ignored | Ignored | Preview | Nov 20, 2025 8:52pm

Member Author

NathanFlurry commented Nov 20, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 20, 2025

Code Review: Graceful Runner Shutdown

I've reviewed PR #3497 which fixes graceful runner shutdown. Here are my findings:

✅ Strengths

  1. Proper Race Condition Prevention: The new #shuttingDown flag (line 453-458 in mod.ts) correctly prevents concurrent shutdown calls, which is a solid improvement.

  2. Graceful Actor Cleanup: The new #waitForActorsToStop() method (lines 587-672) implements a thoughtful polling mechanism with:

    • 100ms check interval (reasonable for responsiveness)
    • 120s timeout (prevents indefinite hangs)
    • Progress logging every 5s (good observability)
    • Early exit conditions (actors stopped, WebSocket closed, timeout)
  3. Better Async Signal Handling: Converting signal handlers to async (lines 414-447) allows proper awaiting of shutdown, fixing potential race conditions where shutdown might not complete before process exit.

  4. Enhanced Debugging: Added structured logging in actor-conn.ts to track action lifecycle (lines 200-205, 469-474, 587-592) which will help debug action tracking issues.

⚠️ Concerns & Issues

1. CRITICAL: Process Exit Commented Out (Lines 424, 434)

// TODO: Add back
// process.exit(0);

Issue: Signal handlers no longer exit the process. This means SIGTERM/SIGINT won't actually terminate the application after shutdown completes.

Impact: The process will hang after receiving shutdown signals, requiring SIGKILL to terminate.

Recommendation: Either:

  • Re-enable process.exit(0) after awaiting all handlers
  • Add a comment explaining why this was removed and what the new termination mechanism is
  • Ensure the shutdown flow naturally terminates the process

2. Removed Shutdown Guards May Allow Invalid Operations

Multiple shutdown guards were removed from:

  • sleepActor() (lines 263-269)
  • #sendActorIntent() (lines 1109+)
  • #sendActorState() (lines 1150+)
  • #sendCommandAcknowledgment() (lines 1192+)
  • setAlarm() (lines 1497+)
  • #sendKvRequest() (lines 1530+)
  • __sendToServer() (lines 1606+)

Concern: These methods can now be called during shutdown, potentially sending messages to a closed/closing WebSocket.

Questions:

  • Is this intentional to allow final cleanup messages?
  • Should some guards remain to prevent new operations (vs. cleanup operations)?
  • Could this cause "WebSocket is not open" errors or message loss?

Recommendation: Document why these guards were removed or add selective guards that distinguish between:

  • Cleanup operations (allowed during shutdown)
  • New operations (blocked during shutdown)
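
For illustration, one possible shape of such a selective guard, as a minimal TypeScript sketch with illustrative names (not the PR's actual methods):

// Sketch: block only new work once shutdown has started, but keep
// cleanup/acknowledgment paths open so actor stop can still be coordinated.
class RunnerSketch {
	#shuttingDown = false;

	#allowNewOperation(op: string): boolean {
		if (this.#shuttingDown) {
			console.warn(`rejecting new operation during shutdown: ${op}`);
			return false;
		}
		return true;
	}

	setAlarm(actorId: string, timestamp: number): void {
		// New operation: refused once shutdown begins.
		if (!this.#allowNewOperation("setAlarm")) return;
		// ...would send the alarm message here...
	}

	sendCommandAcknowledgment(actorId: string, checkpoint: number): void {
		// Cleanup operation: still allowed during shutdown.
		// ...would send the acknowledgment here...
	}
}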

3. Potential Memory Leak in #waitForActorsToStop()

The interval created at line 658 could leak if the WebSocket closes immediately after the interval is created but before the next tick:

const interval = setInterval(() => {
    if (checkActors()) {
        clearInterval(interval);
        resolve();
    }
}, shutdownCheckInterval);

Issue: If checkActors() never returns true due to an edge case, the interval runs forever.

Recommendation: Clear the interval on WebSocket close event or add it to cleanup:

ws.addEventListener('close', () => {
    clearInterval(interval);
    resolve();
});

4. Missing Test Coverage

No tests found for shutdown behavior. Given the complexity of this change:

Recommendation: Add tests for:

  • Concurrent shutdown calls (should be idempotent)
  • Shutdown with active actors (should wait)
  • Shutdown timeout scenario
  • WebSocket closes during shutdown
  • Signal handler behavior

5. Minor: Inconsistent Logging Levels

In actor-conn.ts, action tracking uses debug level (line 200), but the error case uses error level (line 572). The success case at line 469 also uses debug.

Recommendation: Consider using trace for the frequent success cases to reduce noise in debug logs.

🔍 Questions for Author

  1. What is the plan for process termination now that process.exit(0) is commented out?
  2. Were the removed shutdown guards causing issues during graceful shutdown?
  3. Have you tested this with a large number of actors to ensure the 120s timeout is sufficient?
  4. Should #shutdown and #shuttingDown be consolidated into a single state enum (NOT_STARTED, RUNNING, SHUTTING_DOWN, SHUTDOWN)?
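
If the flags were consolidated as question 4 suggests, a minimal sketch of that state machine could look like the following (illustrative names only, not the PR's actual code):

enum RunnerState {
	NotStarted = "not_started",
	Running = "running",
	ShuttingDown = "shutting_down",
	Shutdown = "shutdown",
}

class LifecycleSketch {
	#state: RunnerState = RunnerState.NotStarted;

	async shutdown(): Promise<void> {
		// Idempotent: a second call while shutting down is a no-op.
		if (this.#state === RunnerState.ShuttingDown || this.#state === RunnerState.Shutdown) {
			return;
		}
		this.#state = RunnerState.ShuttingDown;
		try {
			// ...wait for actors and close the websocket here...
		} finally {
			this.#state = RunnerState.Shutdown;
		}
	}
}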

📝 Minor Style Notes

  1. Line 4 in mod.ts: Import order changed (not a real issue, just noting)
  2. The example script change (examples/counter/scripts/connect.ts) removes manual disposal - is this now handled automatically?

Summary

This is a meaningful improvement to shutdown handling, but the commented-out process.exit(0) calls need to be addressed before merging. The removal of shutdown guards should also be documented or reconsidered to prevent potential issues with operations during shutdown.

Overall assessment: Needs revision primarily due to the process exit issue.

@claude

claude bot commented Nov 20, 2025

PR Review: Fix Graceful Runner Shutdown

I've reviewed the changes in this PR. Overall, this is a solid improvement to the shutdown flow that addresses several important issues. Here are my findings:

Strengths

  1. Proper Shutdown Coordination: The new #waitForActorsToStop() method properly coordinates waiting for actors to stop before closing the WebSocket connection, which prevents race conditions during shutdown.

  2. Good Progress Logging: The shutdown wait loop includes periodic progress logging (every 5 seconds) to help with observability during shutdowns.

  3. Multiple Exit Conditions: The wait loop properly handles multiple exit scenarios:

    • All actors stopped (success case)
    • WebSocket closed prematurely
    • Timeout reached (120s)
  4. Concurrency Protection: The #shuttingDown flag prevents concurrent shutdown attempts, which is important for signal handlers.

  5. Enhanced Debugging: Added comprehensive logging in actor-conn.ts for action in-flight tracking, which will help diagnose issues.

⚠️ Issues & Concerns

1. Critical: TODO Comments Left in Production Code

engine/sdks/typescript/runner/src/mod.ts:429-431

// TODO: Add back
// process.exit(0);

Issue: The process.exit(0) calls are commented out in the signal handlers. This means the process will never actually exit after handling SIGTERM/SIGINT signals, which could lead to zombie processes or deployment issues.

Recommendation: Either:

  • Remove the TODO and keep them commented if intentional (with explanation)
  • Implement proper shutdown completion and re-enable process.exit()
  • Add a timeout mechanism that forces exit if shutdown hangs

2. Potential Memory Leak: WeakRef in Signal Handlers

engine/sdks/typescript/runner/src/mod.ts:443-446

SIGNAL_HANDLERS.push(async () => {
    const weak = new WeakRef(this);
    await weak.deref()?.shutdown(false, false);
});

Issue: While using WeakRef is clever to avoid strong references, there are concerns:

  • The signal handler itself is stored in the global SIGNAL_HANDLERS array and never removed
  • If multiple Runner instances are created and destroyed, handlers accumulate
  • The WeakRef may already be garbage collected when the signal fires, resulting in no-op shutdown

Recommendation: Consider one of these approaches:

  • Store a cleanup function that removes the handler when the Runner is destroyed
  • Use a singleton pattern for the Runner if that fits your architecture
  • Document the expected lifecycle (one Runner per process)
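
A sketch of the first approach: keep a reference to the registered handler so it can be removed when the runner is destroyed. SIGNAL_HANDLERS mirrors the module-level array quoted elsewhere in this thread; the field and method names are illustrative.

type SignalHandler = () => void | Promise<void>;
const SIGNAL_HANDLERS: SignalHandler[] = [];

class RunnerSketch {
	#signalHandler?: SignalHandler;

	start(): void {
		// Keep the handler reference so it can be unregistered later.
		this.#signalHandler = async () => {
			await this.shutdown();
		};
		SIGNAL_HANDLERS.push(this.#signalHandler);
	}

	async shutdown(): Promise<void> {
		// Unregister so destroyed runners do not leave handlers behind.
		if (this.#signalHandler) {
			const index = SIGNAL_HANDLERS.indexOf(this.#signalHandler);
			if (index !== -1) SIGNAL_HANDLERS.splice(index, 1);
			this.#signalHandler = undefined;
		}
		// ...rest of shutdown...
	}
}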

3. Race Condition: Shutdown Flag Timing

engine/sdks/typescript/runner/src/mod.ts:451-466

if (this.#shuttingDown) {
    // ...
    return;
}
this.#shuttingDown = true;
// ...
this.#shutdown = true;

Issue: There's a brief window between checking #shuttingDown and setting both flags where another caller could slip through. While unlikely in practice with async/await, this could theoretically happen with concurrent signal handlers.

Recommendation: Set #shuttingDown = true immediately in a single operation:

if (this.#shuttingDown) return;
this.#shuttingDown = this.#shutdown = true;

4. Inconsistent Shutdown Check Removal

The PR removes multiple if (this.#shutdown) checks from methods like:

  • sleepActor()
  • #sendActorIntent()
  • #sendActorStateUpdate()
  • #sendCommandAcknowledgment()
  • __sendToServer()

Issue: While this allows messages to flow during shutdown (which may be intentional for graceful shutdown), it could lead to unexpected behavior if methods are called after resources are cleaned up.

Recommendation:

  • Add clear documentation explaining why shutdown checks were removed
  • Consider adding assertions that critical resources (like WebSocket) still exist
  • Ensure all callers handle potential errors during shutdown state

5. WebSocket State Check is Too Simple

engine/sdks/typescript/runner/src/mod.ts:609

const wsIsClosed = ws.readyState === 2 || ws.readyState === 3;

Issue: Magic numbers instead of WebSocket constants make the code less readable.

Recommendation: Use constants or named values:

const CLOSING = 2;
const CLOSED = 3;
const wsIsClosed = ws.readyState === CLOSING || ws.readyState === CLOSED;

6. Missing Error Handling in Wait Loop

engine/sdks/typescript/runner/src/mod.ts:587-671

Issue: The #waitForActorsToStop() method doesn't catch errors that might occur during the wait loop. If logging throws or there's an unexpected error, the promise might never resolve.

Recommendation: Wrap the interval logic in try-catch to ensure the promise always resolves.
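
A minimal self-contained sketch of that wrapping; checkActors and shutdownCheckInterval follow the snippets quoted above, and rejecting rather than resolving on error is one possible policy:

function waitForActorsToStopSketch(
	checkActors: () => boolean,
	shutdownCheckInterval: number,
): Promise<void> {
	return new Promise((resolve, reject) => {
		const interval = setInterval(() => {
			try {
				if (checkActors()) {
					clearInterval(interval);
					resolve();
				}
			} catch (error) {
				// Never leave the shutdown promise hanging if the check throws.
				clearInterval(interval);
				reject(error);
			}
		}, shutdownCheckInterval);
	});
}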

📝 Minor Issues

  1. Inconsistent Log Levels: Some shutdown progress logs use info while debug details use debug. Consider if the progress logs should be debug level to reduce noise in production.

  2. Magic Numbers:

    • shutdownTimeout = 120_000 - Should this be configurable?
    • progressLogInterval = 5_000 - Should this be configurable?
  3. Example File Change: The change to examples/counter/scripts/connect.ts removes the dispose call. Ensure this is intentional and doesn't break the example's purpose.

🧪 Test Coverage

Concern: I don't see test files in this PR. Given the complexity of shutdown logic and the critical nature of signal handling, this code should have tests covering:

  • Graceful shutdown with active actors
  • Shutdown timeout scenarios
  • Signal handler registration and execution
  • Concurrent shutdown attempts
  • Actor cleanup during shutdown

Recommendation: Add integration tests for shutdown scenarios.

🔒 Security Considerations

No security issues identified. The shutdown logic properly cleans up resources.

Performance Considerations

  1. The 100ms polling interval in #waitForActorsToStop() is reasonable for shutdown scenarios.
  2. Consider if shutdown could be event-driven instead of polling (using promises that resolve when actors stop).
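
A sketch of what an event-driven wait could look like, assuming the runner can hook the point where an actor is removed from its map (names are illustrative). Shutdown would then await waitForEmpty() and resolve on the last removal instead of polling.

class ActorRegistrySketch {
	#actors = new Map<string, unknown>();
	#onEmpty: Array<() => void> = [];

	add(actorId: string, actor: unknown): void {
		this.#actors.set(actorId, actor);
	}

	remove(actorId: string): void {
		this.#actors.delete(actorId);
		// Resolve pending shutdown waits as soon as the last actor is gone.
		if (this.#actors.size === 0) {
			for (const resolve of this.#onEmpty.splice(0)) resolve();
		}
	}

	waitForEmpty(timeoutMs: number): Promise<void> {
		if (this.#actors.size === 0) return Promise.resolve();
		return new Promise((resolve) => {
			this.#onEmpty.push(() => resolve());
			// Keep a timeout as a safety net, mirroring the 120s cap above.
			setTimeout(() => resolve(), timeoutMs);
		});
	}
}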

📋 Summary

This PR makes important improvements to shutdown handling, but has some issues that should be addressed:

Must Fix:

  • Resolve TODO comments about process.exit() - either implement properly or document why it's disabled
  • Address WeakRef signal handler memory leak potential

Should Fix:

  • Document why shutdown checks were removed from various methods
  • Add test coverage for shutdown scenarios
  • Consider making shutdown event-driven instead of polling

Nice to Have:

  • Use WebSocket state constants instead of magic numbers
  • Make timeouts configurable
  • Add error handling in wait loop

The core logic is sound and this PR addresses real issues with graceful shutdown. With the critical items addressed, this will be a solid improvement to the codebase.

@claude

claude bot commented Nov 20, 2025

Code Review

Overview

This PR improves graceful shutdown handling in the TypeScript runner by implementing proper actor lifecycle management during shutdown. The changes address race conditions and ensure actors are fully stopped before closing connections.

Positive Changes ✅

  1. Concurrent shutdown protection - Adding #shuttingDown flag prevents multiple concurrent shutdown attempts (lines 453-459)
  2. Proper async signal handling - Signal handlers now properly await async shutdown operations (lines 417-446)
  3. Actor lifecycle tracking - New #waitForActorsToStop() method properly waits for actors to terminate with timeout and progress logging (lines 587-671)
  4. Cleaner message sending - Removed redundant shutdown checks in message sending methods, simplifying control flow
  5. Better debugging - Enhanced logging in actor-conn.ts helps diagnose action lifecycle issues

Issues & Concerns 🔴

1. Critical: Commented-out process.exit() calls

// TODO: Add back
// process.exit(0);

Location: engine/sdks/typescript/runner/src/mod.ts:425, 435

Issue: Signal handlers no longer exit the process, which could leave the process hanging indefinitely. This breaks expected SIGTERM/SIGINT behavior.

Recommendation: Either:

  • Add a timeout after shutdown completes to force exit
  • Document why this is intentional
  • Track actor stop completion and exit once done

2. Race condition: Shutdown flag timing

this.#shuttingDown = true;
// ...
this.#shutdown = true;

Location: engine/sdks/typescript/runner/src/mod.ts:459, 466

Issue: There's a window between setting #shuttingDown and #shutdown where behavior might be inconsistent. Some methods check #shutdown, others were checking it before being removed.

Recommendation: Consider consolidating to a single shutdown state or document the distinction between the two flags clearly.

3. Removed shutdown guards may allow invalid operations

Multiple methods had if (this.#shutdown) guards removed:

  • sleepActor() (lines 264-270 removed)
  • #sendActorIntent() (lines 1112-1119 removed)
  • #sendActorStateUpdate() (lines 1153-1159 removed)
  • #sendCommandAcknowledgment() (lines 1193-1200 removed)
  • setAlarm() (lines 1501-1504 removed)
  • kv() (lines 1533-1536 removed)

Issue: These operations can now be attempted during shutdown, potentially causing:

  • Messages sent after websocket closure
  • Race conditions with cleanup
  • Resource leaks

Recommendation: Either:

  • Add guards back in strategic places
  • Ensure websocket state is checked before sending
  • Add comprehensive tests for shutdown edge cases

4. Potential infinite wait in #waitForActorsToStop

const interval = setInterval(() => {
    if (checkActors()) {
        clearInterval(interval);
        resolve();
    }
}, shutdownCheckInterval);

Location: engine/sdks/typescript/runner/src/mod.ts:658-667

Issue: If an actor gets stuck or the server never sends StopActor commands, this will wait for the full 120s timeout.

Recommendation:

  • Add a force-stop mechanism after a shorter grace period
  • Consider sending explicit stop commands to actors if server doesn't respond
  • Log which specific actors are blocking shutdown
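
For the last point, a small helper could surface which actors are still pending. A sketch, assuming the actor map is keyed by actor ID as the surrounding snippets suggest:

function describeRemainingActors(actors: Map<string, unknown>, limit = 20) {
	const ids = Array.from(actors.keys());
	return {
		msg: "still waiting for actors to stop",
		remainingCount: ids.length,
		// Cap the list so a large fleet does not flood the logs.
		remainingActorIds: ids.slice(0, limit),
	};
}

// Usage inside the 5s progress log tick (log shape follows the snippets above):
// this.log?.warn(describeRemainingActors(this.#actors));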

5. Logging improvements need verbosity consideration

The new debug logging in actor-conn.ts is helpful for debugging but may be too verbose for production:

  • Every action add/remove logs (lines 200-205, 587-591)
  • In-flight IDs logged on every response (line 472)

Recommendation: Consider using trace level instead of debug for high-frequency operations per CLAUDE.md conventions.

Performance Considerations ⚡

  1. Polling overhead - The 100ms polling interval is reasonable, but could be optimized using event-driven approach (listen for actor stop events)
  2. Memory during shutdown - The #actionsInFlight map detailed logging could be memory-intensive with many concurrent actions

Security Considerations 🔒

No security vulnerabilities identified. The changes are primarily about lifecycle management.

Test Coverage 📊

Missing:

  • No tests for shutdown edge cases
  • No tests for concurrent shutdown attempts
  • No tests for actor stop timeout scenarios
  • No tests for signal handler behavior

Recommendation: Add integration tests covering:

- Graceful shutdown with active actors
- Shutdown timeout behavior
- Concurrent shutdown requests
- Signal handling (SIGTERM/SIGINT)
- Actor stop acknowledgment flow

Code Style ✨

Generally follows conventions well:

  • ✅ Uses structured logging correctly
  • ✅ Lowercase log messages
  • ✅ Proper async/await usage
  • ⚠️ TODO comments should have issue tracking

Recommendations Summary

High Priority:

  1. Fix or document the commented-out process.exit() calls
  2. Add shutdown guards back or validate websocket state in message sending
  3. Add tests for shutdown scenarios

Medium Priority:
4. Consolidate shutdown state flags or document their distinctions
5. Improve actor stop timeout handling with force-stop mechanism
6. Adjust logging verbosity levels

Low Priority:
7. Consider event-driven approach instead of polling
8. Link TODO comments to tracking issues

Verdict

This PR makes important improvements to shutdown reliability, but has critical issues that should be addressed before merging:

  • The removed process.exit() calls are a breaking change
  • Removed shutdown guards may cause runtime errors
  • Insufficient test coverage for new shutdown logic

Please address the high-priority items before merging.

@NathanFlurry force-pushed the 11-19-fix_pegboard-gateway_write_hws_immediately_on_socket_hibernate branch from e1152dc to 968a92b on November 20, 2025 20:51
@NathanFlurry force-pushed the 11-19-fix_rivetkit_fix_graceful_runner_shutdown branch from a11d203 to 0ed702c on November 20, 2025 20:51
@claude

claude bot commented Nov 20, 2025

Code Review: Graceful Runner Shutdown

I've reviewed this PR and have the following feedback:

✅ Strengths

1. Shutdown Idempotency
The addition of #shuttingDown guard in shutdown() (engine/sdks/typescript/runner/src/mod.ts:453-459) properly prevents concurrent shutdown attempts. This is a good defensive pattern.

2. Proper Actor Cleanup
The new #waitForActorsToStop() method (engine/sdks/typescript/runner/src/mod.ts:597-671) implements a robust waiting mechanism with:

  • Progress logging every 5 seconds
  • Early exit conditions (all actors stopped, WebSocket closed, timeout)
  • Reasonable timeout of 120 seconds

3. Enhanced Debugging
The additional logging in actor-conn.ts will help diagnose action lifecycle issues. The detailed error logging in #takeActionInFlight() (rivetkit-typescript/packages/rivetkit/src/client/actor-conn.ts:572-583) is particularly valuable.


🚨 Critical Issues

1. Process Exit Commented Out (BLOCKING)
Lines 424-425 and 434-435 in engine/sdks/typescript/runner/src/mod.ts:

// TODO: Add back
// process.exit(0);

Issue: The signal handlers now await shutdown but never exit the process. This means:

  • SIGTERM/SIGINT will hang indefinitely after shutdown completes
  • The process becomes a zombie that must be forcefully killed
  • Orchestration systems expecting clean exits will fail

Fix: Remove the TODO and uncomment process.exit(0), or provide a clear explanation if intentional.

2. Removed Shutdown Guards May Cause Race Conditions
The PR removes several if (this.#shutdown) checks from:

  • sleepActor() (line 264-271 removed)
  • #sendActorIntent() (line 1112-1119 removed)
  • #sendActorStateUpdate() (line 1153-1159 removed)
  • #sendCommandAcknowledgment() (line 1193-1199 removed)
  • #setAlarm() (line 1501-1504 removed)
  • #sendKvRequest() (line 1533-1536 removed)
  • __sendToServer() signature changed (line 1609)

Issue: Without these guards, operations can be queued/sent after shutdown begins, potentially causing:

  • WebSocket send attempts after closure
  • Race conditions between shutdown and incoming server commands
  • Unpredictable behavior during the shutdown window

Fix: Consider retaining selective guards, especially for:

  • Operations that send to WebSocket
  • New operation requests (vs. cleanup/acknowledgment messages)

Or provide explicit justification for why operations should continue during shutdown.

3. Signal Handler Memory Leak
Lines 443-446:

SIGNAL_HANDLERS.push(async () => {
    const weak = new WeakRef(this);
    await weak.deref()?.shutdown(false, false);
});

Issue: Every Runner instance adds a handler to the global SIGNAL_HANDLERS array, but handlers are never removed when a runner is destroyed. This causes:

  • Memory leak in long-running processes that create/destroy runners
  • Multiple shutdown attempts if multiple runners exist

Fix: Track which runner created which handler and clean up in shutdown(), or use a different pattern (e.g., single global handler that iterates weak refs to all runners).
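
A sketch of the single-global-handler variant: one process-level registration that walks weak references to live runners. It assumes a Node runtime; the names and the process.exit placement are illustrative.

const LIVE_RUNNERS = new Set<WeakRef<{ shutdown: () => Promise<void> }>>();
let signalHandlerInstalled = false;

function registerRunnerForSignals(runner: { shutdown: () => Promise<void> }): void {
	LIVE_RUNNERS.add(new WeakRef(runner));
	if (signalHandlerInstalled) return;
	signalHandlerInstalled = true;

	const handler = async () => {
		for (const ref of LIVE_RUNNERS) {
			// Skip runners that have already been garbage collected.
			await ref.deref()?.shutdown();
		}
		process.exit(0);
	};
	process.on("SIGTERM", handler);
	process.on("SIGINT", handler);
}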


⚠️ Moderate Issues

4. Actor Map Modification During Iteration
While #waitForActorsToStop() polls this.#actors.size, actors are being removed from the map concurrently by other code paths (likely in response to server StopActor commands).

Potential Issue: If the map is being modified during iteration elsewhere in the code, this could cause issues. Review if there are any forEach/for-of loops over this.#actors that could conflict.

5. Missing Error Handling in Signal Handlers
Lines 417-436: The async signal handlers don't have try-catch blocks.

Risk: If any handler throws, subsequent handlers won't run and the process won't exit gracefully.

Recommendation:

process.on("SIGTERM", async () => {
    this.log?.debug("received SIGTERM");
    try {
        for (const handler of SIGNAL_HANDLERS) {
            await handler();
        }
    } catch (error) {
        this.log?.error({ msg: "error in signal handler", error });
    }
    process.exit(0);
});

6. WebSocket State Check Fragility
Line 609 in #waitForActorsToStop():

const wsIsClosed = ws.readyState === 2 || ws.readyState === 3;

Recommendation: Use WebSocket constants for readability:

const wsIsClosed = ws.readyState === ws.CLOSING || ws.readyState === ws.CLOSED;

7. Logging Level Inconsistency
The new logs use various levels (debug, info, warn, error) but some choices seem inconsistent:

  • Line 467: "received action response" changed from trace to debug - good
  • Line 200: "added action to in-flight map" is debug but happens for every action

Minor: Consider if action-level logging should remain at trace to reduce noise in production.


📋 Minor/Style Issues

8. Import Ordering
Lines 3-13: The import of ActorConfig, RunnerActor was moved but is now out of the typical ordering convention (types/interfaces before implementations).

Minor: Not critical but maintaining consistent import ordering aids readability.

9. Removed console.trace
Line 1112 removed console.trace("send actor intent", actorId, intentType);

Good: Removing console statements in favor of structured logging is correct.

10. Example Script Change
examples/counter/scripts/connect.ts removed manual cleanup:

await new Promise((resolve) => setTimeout(resolve, 10000));
await counter.dispose();

Question: Is this intentional? If this is demonstrating automatic cleanup, add a comment explaining the behavior.


🧪 Test Coverage Recommendations

This PR significantly changes shutdown behavior. Recommend adding tests for:

  1. Graceful shutdown completes when all actors stop normally
  2. Shutdown timeout triggers after 120 seconds
  3. Shutdown completes early if WebSocket closes
  4. Multiple concurrent shutdown calls are idempotent
  5. Signal handlers properly await async shutdown
  6. Operations during shutdown behave correctly (given the removed guards)

📊 Summary

Severity | Count | Status
Critical | 3 | 🔴 Must fix
Moderate | 4 | 🟡 Should fix
Minor | 3 | 🟢 Optional

Recommendation: Request changes to address critical issues before merge, particularly:

  1. Uncomment process.exit(0) or document why it's removed
  2. Review removed shutdown guards for race conditions
  3. Fix signal handler memory leak

The core approach to graceful shutdown is sound, but the execution has some gaps that could cause production issues.

@claude

claude bot commented Nov 20, 2025

Pull Request Review: Fix Graceful Runner Shutdown

Summary

This PR improves the graceful shutdown mechanism for the TypeScript runner by implementing proper waiting for actors to stop before closing the WebSocket connection. The changes address a critical issue where the runner would close the WebSocket before actors finished stopping, potentially causing state inconsistencies.

Positive Changes

1. Proper Actor Shutdown Coordination

The new #waitForActorsToStop() method (lines 587-672 in mod.ts) implements a robust polling mechanism that:

  • Waits for all actors to stop before closing the WebSocket
  • Has multiple exit conditions (all actors stopped, WebSocket closed, timeout)
  • Provides progress logging every 5 seconds
  • Uses a 120-second timeout as a safety mechanism

This is a significant improvement over the previous implementation which didn't wait for actors.

2. Concurrency Guard

The addition of #shuttingDown flag (line 205) prevents concurrent shutdown calls, which is good defensive programming:

if (this.#shuttingDown) {
    this.log?.debug({ msg: "shutdown already in progress, ignoring" });
    return;
}

3. Enhanced Debugging

The additional logging in actor-conn.ts helps track action lifecycle:

  • Logs when actions are added/removed from in-flight map
  • Includes detailed context when actions are not found
  • Uses debug level appropriately

4. Async Signal Handlers

Converting signal handlers to async (lines 417-446) allows proper awaiting of shutdown, which is necessary for graceful cleanup.

Issues & Concerns

1. Critical: Commented Out Process.exit() ⚠️ HIGH PRIORITY

Location: lines 424-425, 434-435 in mod.ts

// TODO: Add back
// process.exit(0);

Issue: The process will never exit after receiving SIGTERM/SIGINT. This breaks deployment lifecycle and could cause:

  • Containers hanging during shutdown
  • Graceful shutdown timeouts in orchestration systems (K8s, etc.)
  • Resource leaks in long-running processes

Recommendation:

  • If shutdown is already called in the signal handler, the exit parameter should be set to true
  • Change to: await weak.deref()?.shutdown(false, true);
  • Remove the TODOs and process.exit(0) lines since shutdown will handle it

2. Race Condition: Removed Shutdown Checks ⚠️ MEDIUM PRIORITY

Location: Multiple locations (lines 263-269, 1107-1146, 1148-1189, 1192-1205, 1497-1521, 1528-1554)

Issue: The PR removes if (this.#shutdown) checks from several methods:

  • sleepActor()
  • #sendActorIntent()
  • #sendActorStateUpdate()
  • #sendCommandAcknowledgment()
  • setAlarm()
  • #sendKvRequest()

Concern: During the shutdown window (after #shutdown = true but before WebSocket closes), these methods could queue messages or send operations that will never complete. The removed checks were preventing this.

Recommendation:

  • Consider using #shuttingDown flag instead of removing checks entirely
  • At minimum, restore checks in methods that initiate new operations (KV requests, alarms)
  • For state update methods, document why it's safe to proceed during shutdown

3. Memory Leak: Global Signal Handlers Array ⚠️ MEDIUM PRIORITY

Location: line 25, usage in lines 416-446

const SIGNAL_HANDLERS: (() => void | Promise<void>)[] = [];

Issue: Signal handlers are pushed to a module-level array but never removed. If multiple Runner instances are created and destroyed, handlers accumulate.

Recommendation:

// In shutdown method, after line 582:
const handlerIndex = SIGNAL_HANDLERS.findIndex(h => {
    // Compare with the handler we added
});
if (handlerIndex !== -1) {
    SIGNAL_HANDLERS.splice(handlerIndex, 1);
}

Or better yet, use WeakMap to associate handlers with runner instances.

4. Potential Resource Leak: Interval Not Cleared ℹ️ LOW PRIORITY

Location: line 658 in #waitForActorsToStop()

Issue: The interval created in #waitForActorsToStop() will continue running if the promise is resolved via the immediate check, though it will clear on the next tick.

Recommendation: Wrap in try-finally or ensure interval is always cleared:

let interval: NodeJS.Timeout | undefined;
try {
    // ... existing code ...
    interval = setInterval(() => { ... }, shutdownCheckInterval);
} finally {
    if (interval) clearInterval(interval);
}

Actually, looking closer at the code, this is handled correctly by the resolution logic. False alarm - the interval IS cleared in all paths. ✅

5. Import Order Convention ℹ️ STYLE

Location: lines 1-4 in mod.ts

Per CLAUDE.md guidelines: "Always add imports at the top of the file instead of inline within the function."

The import order was changed:

// Before
import { importWebSocket } from "./websocket.js";
import { RunnerActor, type ActorConfig } from "./actor";

// After  
import { type ActorConfig, RunnerActor } from "./actor";
// ... other imports ...
import { importWebSocket } from "./websocket.js";

Recommendation: Maintain a consistent import ordering (third-party, then local, alphabetically within each group).

6. Missing Documentation ℹ️ LOW PRIORITY

Location: lines 451-585

The new shutdown() method has significant behavioral changes but lacks updated documentation explaining:

  • The difference between immediate and graceful shutdown
  • What happens to in-flight operations
  • Expected timeline for completion

Recommendation: Add JSDoc explaining the shutdown flow and expected behavior.

Code Quality & Best Practices

Positive Observations:

  • ✅ Structured logging with appropriate levels (debug, info, warn)
  • ✅ Promise-based implementation with proper resolve/reject handling
  • ✅ Good use of elapsed time tracking for monitoring
  • ✅ WebSocket readyState checks (2 = CLOSING, 3 = CLOSED)

Alignment with CLAUDE.md:

  • ✅ Uses structured logging (tracing): tracing::info!(?x, "foo") pattern (TypeScript equivalent)
  • ✅ Lowercase log messages
  • ⚠️ Import ordering could be more consistent

Security Considerations

No significant security concerns identified. The changes are primarily related to lifecycle management and don't introduce new attack surfaces.

Performance Considerations

  • ✅ The 100ms polling interval in #waitForActorsToStop() is reasonable
  • ✅ 5-second progress logging won't spam logs
  • ✅ 120-second timeout prevents indefinite hanging

Test Coverage Recommendations

This PR lacks test coverage for critical shutdown logic. Recommended tests:

  1. Test graceful shutdown with multiple actors

    • Verify all actors stop before WebSocket closes
    • Verify messages sent to actors during shutdown
  2. Test shutdown timeout scenario

    • Verify behavior when actors don't stop within 120s
  3. Test concurrent shutdown calls

    • Verify #shuttingDown guard works correctly
  4. Test signal handler behavior

    • Verify SIGTERM/SIGINT trigger proper shutdown (once process.exit is fixed)
  5. Test WebSocket close during shutdown

    • Verify early exit when WebSocket closes unexpectedly

Actionable Items

Must Fix Before Merge:

  1. Fix process.exit() issue - Remove TODOs and ensure process exits after signal handlers complete
  2. Review removed shutdown checks - Ensure it's safe to call methods during shutdown, or restore checks

Should Fix Before Merge:

  1. Fix signal handler memory leak - Remove handlers on Runner disposal
  2. Add test coverage - At least basic shutdown flow tests

Nice to Have:

  1. Add JSDoc to shutdown method - Document behavioral changes
  2. Standardize import ordering - Follow consistent conventions

Conclusion

This PR makes significant improvements to the graceful shutdown mechanism, particularly the #waitForActorsToStop() implementation. However, there are critical issues with process lifecycle (commented out process.exit) that must be addressed before merge. The removal of shutdown checks in multiple methods also needs careful review to ensure no race conditions are introduced.

Overall Assessment: ⚠️ Needs Changes - The core logic is solid, but the process.exit issue and removed shutdown checks need to be addressed for production safety.


Review generated with assistance from Claude Code

@claude

claude bot commented Nov 20, 2025

Pull Request Review: Graceful Runner Shutdown

Thank you for this PR! I've reviewed the changes and have some feedback organized by category:

Code Quality & Best Practices ✅

Positive Changes:

  • Good separation of concerns with #shuttingDown flag to prevent concurrent shutdowns
  • Excellent documentation in the #waitForActorsToStop JSDoc
  • Proper use of structured logging throughout (following CLAUDE.md conventions)
  • Clean refactoring of signal handlers to be async

Concerns:

  1. TODOs Left in Production Code (mod.ts:424-425, 434-435)

    // TODO: Add back
    // process.exit(0);

    These commented-out process.exit(0) calls with TODO comments should be resolved before merging. This is critical functionality - what's the plan here?

  2. Guard Clauses Removed Without Clear Justification
    Multiple guard clauses checking this.#shutdown were removed from methods like:

    • sleepActor() (line 263)
    • #sendActorIntent() (line 1109)
    • #sendActorStateUpdate() (line 1150)
    • __sendToServer() (line 1606)

    While I understand you want to allow cleanup messages during shutdown, removing ALL guards might allow unwanted operations. Consider if some operations should still be blocked during shutdown.

  3. Import Reordering (mod.ts:3-4)
    The import of ActorConfig, RunnerActor was moved above the logger imports. While minor, this should follow the project's import ordering conventions consistently.

Potential Bugs 🐛

  1. Race Condition in #waitForActorsToStop
    The method checks this.#actors.size === 0 but actors are removed from the map asynchronously. There's a potential race where:

    • Actor A starts stopping
    • #actors.size becomes 0
    • Method resolves
    • Actor A's cleanup is still running

    Consider waiting for a signal from actor cleanup completion rather than just map size.

  2. Interval Cleanup Missing
    In #waitForActorsToStop, if the WebSocket closes or timeout occurs before clearInterval is called, the interval continues running. The current code does clear it, but only in the success path inside checkActors(). Consider using a finally block or ensure cleanup happens in all exit paths.

  3. WeakRef Potential Issue (mod.ts:444-445)

    const weak = new WeakRef(this);
    await weak.deref()?.shutdown(false, false);

    If the Runner instance is garbage collected between creating the WeakRef and calling deref(), the shutdown won't execute. Given this is in a signal handler for graceful shutdown, this seems risky. Why use WeakRef here?

Performance Considerations ⚡

  1. Polling Interval
    The 100ms polling interval in #waitForActorsToStop is reasonable, but for environments with many actors, you might want to consider an event-driven approach instead of polling.

  2. Logging Volume
    With debug logging on every interval tick (mod.ts:659-662), shutdowns with many actors could generate significant log volume. The 5-second progress log throttling is good, but consider if the debug-level tick logs are necessary.

Security Concerns 🔒

No major security concerns identified. The shutdown logic properly rejects in-flight KV requests and closes connections.

Test Coverage 📝

Critical Issue:

  • The existing test file engine/sdks/typescript/runner/tests/lifecycle.test.ts is completely commented out
  • This PR makes significant changes to shutdown logic but there are no active tests
  • The PR modifies examples/counter/scripts/connect.ts to remove disposal logic, which suggests manual testing was done but automated tests are lacking

Recommendations:

  1. Uncomment and update the lifecycle tests
  2. Add specific tests for:
    • Graceful shutdown with actors running
    • Concurrent shutdown attempts (testing #shuttingDown flag)
    • Timeout scenarios in #waitForActorsToStop
    • Signal handler behavior (SIGTERM/SIGINT)
  3. Test the scenario where actors don't stop within the 120s timeout

Additional Observations

  1. Enhanced Debugging in actor-conn.ts: The added debug logging for action tracking is excellent for debugging the "action not found" error, but consider if this should be temporary or permanent. The detailed logging might impact performance in production.

  2. Magic Numbers: Consider extracting hardcoded values to constants:

    • 120_000 → SHUTDOWN_TIMEOUT_MS
    • 100 → SHUTDOWN_CHECK_INTERVAL_MS
    • 5_000 → SHUTDOWN_PROGRESS_LOG_INTERVAL_MS

Summary

This PR makes important improvements to shutdown handling, but has some issues that should be addressed:

Must Fix Before Merge:

  • ❌ Resolve the TODO comments about process.exit(0)
  • ❌ Add or re-enable test coverage for shutdown logic
  • ❌ Explain/fix the WeakRef usage in signal handlers

Should Consider:

  • ⚠️ Review which guard clauses should remain to prevent unintended operations during shutdown
  • ⚠️ Add proper interval cleanup in all exit paths of #waitForActorsToStop
  • ⚠️ Consider event-driven approach instead of polling for actor shutdown tracking

Nice to Have:

  • 💡 Extract magic numbers to named constants
  • 💡 Review logging volume for production environments
  • 💡 Import ordering consistency

Overall, the core approach is sound, but please address the critical issues before merging. Happy to re-review once these are addressed!

@NathanFlurry force-pushed the 11-19-fix_pegboard-gateway_write_hws_immediately_on_socket_hibernate branch from e1152dc to 968a92b on November 20, 2025 22:32
@claude

claude bot commented Nov 20, 2025

Pull Request Review: Fix Graceful Runner Shutdown

Overview

This PR fixes issues with graceful runner shutdown by implementing proper actor lifecycle management and improving debugging capabilities. The changes span three files with a focus on ensuring all actors are properly stopped before closing WebSocket connections.

Code Quality & Best Practices

✅ Strengths

  1. Proper Concurrency Control: The addition of #shuttingDown flag prevents race conditions during concurrent shutdown attempts (mod.ts:453-459).

  2. Comprehensive Shutdown Logic: The new #waitForActorsToStop() method implements proper polling with multiple exit conditions (timeout, WebSocket closure, all actors stopped).

  3. Good Logging Practices: Follows CLAUDE.md conventions with structured logging using tracing-style parameters:

    logger().debug({
        msg: "added action to in-flight map",
        actionId,
        actionName: opts.name,
        inFlightCount: this.#actionsInFlight.size,
    });
  4. Async Signal Handlers: Signal handlers are now async, allowing for proper awaiting of shutdown operations (mod.ts:417-436).

⚠️ Potential Issues

1. TODOs Left in Critical Code Path (HIGH PRIORITY)

// TODO: Add back
// process.exit(0);

Location: mod.ts:424-425, 434-435

Issue: The process.exit(0) calls are commented out in SIGTERM/SIGINT handlers. This means the process will NOT exit after handling signals, which could leave zombie processes running.

Impact: Production deployments may have processes that don't terminate properly, leading to resource leaks.

Recommendation: Either:

  • Re-enable process.exit(0) after confirming shutdown completes
  • Add a comment explaining why manual exit is not needed (e.g., if the event loop naturally exits)
  • Implement a timeout-based force exit as a safety net

2. Removed Safety Checks Could Cause Unexpected Behavior (MEDIUM PRIORITY)

Multiple safety checks for this.#shutdown were removed from critical methods:

  • sleepActor() (removed lines 263-269)
  • #sendActorIntent() (removed lines 1112-1119)
  • #sendActorStateUpdate() (removed lines 1153-1159)
  • #sendCommandAcknowledgment() (removed lines 1193-1199)
  • setAlarm() (removed lines 1499-1503)
  • #sendKvRequest() (removed lines 1533-1537)

Issue: These methods can now be called during/after shutdown, potentially causing:

  1. WebSocket writes to closed connections (causing errors in logs)
  2. Race conditions where actors try to send messages after the WebSocket is closed
  3. Confusing error states for debugging

Example scenario:

// During shutdown, an actor might still try to set an alarm
runner.shutdown(false, false);
// Meanwhile, some async code calls:
runner.setAlarm(actorId, Date.now() + 1000); // This will now attempt to send despite shutdown

Recommendation:

  • Restore at least some checks, particularly in methods that send WebSocket messages
  • Or add a single check in __sendToServer() to gracefully handle shutdown state
  • Consider if operations during shutdown should be no-ops vs errors
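
A sketch of the single-check-in-__sendToServer() option, with the WebSocket readiness check folded into the one place messages leave the runner. The real serialization and protocol types differ; this only illustrates the guard.

function sendToServerSketch(ws: WebSocket | undefined, data: string | Uint8Array): void {
	// One centralized guard instead of per-method #shutdown checks:
	// silently drop outgoing messages once the socket is no longer open.
	if (!ws || ws.readyState !== WebSocket.OPEN) {
		console.debug("dropping outgoing message, websocket not open");
		return;
	}
	ws.send(data);
}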

3. allowShutdown Parameter Removed (MEDIUM PRIORITY)

The __sendToServer() method signature changed from:

__sendToServer(message: protocol.ToServer, allowShutdown: boolean = false)

to:

__sendToServer(message: protocol.ToServer)

Issue: This was specifically used for sending the "ToServerStopping" message during shutdown (mod.ts:527). While it still works, this reduces flexibility and removes explicit documentation that certain messages should be allowed during shutdown.

Recommendation: Consider keeping the parameter for clarity, even if the implementation is currently a no-op.

4. Polling-Based Actor Wait (LOW-MEDIUM PRIORITY)

The #waitForActorsToStop() uses polling every 100ms rather than event-based notification.

Issue:

  • Less efficient than event-driven approach
  • 100ms polling interval could delay shutdown by up to 100ms
  • Creates unnecessary debug logs every 100ms (mod.ts:659-662)

Recommendation: Consider using an event emitter or promise-based approach when actors are removed from the map. Example:

// In forceStopActor after removing actor
this.#removeActor(actorId, generation);
this.#actorStoppedEmitter.emit('actor-stopped');

Performance Considerations

Positive

  • Shutdown timeout of 120 seconds is reasonable for graceful actor cleanup
  • Progress logging every 5 seconds prevents log spam while providing visibility

Concerns

  • Polling every 100ms during shutdown adds unnecessary CPU usage
  • The 120-second timeout might be too long for serverless environments with stricter timeouts

Security Concerns

No direct security vulnerabilities identified. The changes maintain existing security boundaries.

Test Coverage

Missing Test Scenarios

The PR modifies critical shutdown logic but doesn't include test files. Recommended tests:

  1. Test graceful shutdown with active actors

    test('waits for all actors to stop before closing websocket', async () => {
      // Create runner with multiple actors
      // Trigger shutdown
      // Verify actors receive stop commands
      // Verify WebSocket closes only after all actors stopped
    });
  2. Test shutdown timeout

    test('forces shutdown after 120 second timeout', async () => {
      // Create actor that never stops
      // Trigger shutdown
      // Verify timeout is hit and WebSocket closes anyway
    });
  3. Test concurrent shutdown attempts

    test('prevents concurrent shutdowns', async () => {
      // Trigger shutdown twice simultaneously
      // Verify only one shutdown proceeds
    });
  4. Test WebSocket closure during shutdown

    test('handles websocket closing before actors stop', async () => {
      // Trigger shutdown
      // Force WebSocket to close
      // Verify shutdown completes gracefully
    });

Debug Logging Improvements (actor-conn.ts)

The added debug logs in actor-conn.ts are helpful for troubleshooting:

  • Tracks action lifecycle through in-flight map
  • Provides visibility into action ID mismatches
  • Good use of structured logging

Minor suggestion: Consider changing log level to trace for some of the more verbose logs to reduce noise in production.

Additional Recommendations

  1. Document shutdown behavior: Add JSDoc comments explaining the shutdown flow and exit conditions
  2. Consider metrics: Add metrics/tracing for shutdown duration and actor stop times
  3. Graceful degradation: Consider what happens if KV operations or other async operations are in-flight during shutdown

Summary

This PR makes important improvements to runner shutdown reliability, but has some critical issues that should be addressed before merging:

Must Fix (Blocking):

  • ❌ Re-enable or document why process.exit(0) is commented out

Should Fix (Recommended):

  • ⚠️ Restore at least basic #shutdown checks to prevent errors during shutdown
  • ⚠️ Add test coverage for shutdown scenarios

Nice to Have:

  • 💡 Consider event-based approach instead of polling
  • 💡 Add JSDoc documentation for shutdown flow

Overall, the approach is sound, but the devil is in the details of handling edge cases during the shutdown sequence.

@MasterPtato force-pushed the 11-19-fix_pegboard-gateway_write_hws_immediately_on_socket_hibernate branch from 968a92b to e1152dc on November 20, 2025 22:38
@MasterPtato force-pushed the 11-19-fix_rivetkit_fix_graceful_runner_shutdown branch from 0ed702c to a11d203 on November 20, 2025 22:38
@claude

claude bot commented Nov 20, 2025

PR Review: Graceful Runner Shutdown

This PR improves the graceful shutdown flow for the RivetKit runner. Overall, the changes look solid and address important race conditions. Here's my detailed feedback:


Strengths

  1. Concurrency Protection: Adding #shuttingDown flag prevents concurrent shutdown attempts, which is a critical race condition fix.

  2. Proper Async Signal Handlers: Converting signal handlers to async and awaiting shutdown is correct for graceful shutdown.

  3. Actor Wait Logic: The new #waitForActorsToStop() method provides proper shutdown orchestration with:

    • Configurable timeout (120s)
    • Progress logging every 5s
    • Early exit on WebSocket close
    • Polling-based actor tracking
  4. Removed Shutdown Checks: Removing #shutdown checks from sleepActor, #sendActorIntent, #sendActorStateUpdate, etc. makes sense - during graceful shutdown, we need to send these messages to properly coordinate actor cleanup.

  5. Debug Logging: Added comprehensive debug logging in actor-conn.ts for tracking in-flight actions, which will help diagnose issues.


⚠️ Issues & Concerns

1. Process.exit() Commented Out (CRITICAL)

// TODO: Add back
// process.exit(0);

Issue: Signal handlers no longer exit the process. This means:

  • On SIGTERM/SIGINT, the runner shuts down but the process hangs
  • In containerized environments (Kubernetes, Docker), this prevents proper termination
  • The orchestrator will force-kill after timeout

Recommendation: Either:

  • Re-enable process.exit(0) after awaiting all handlers
  • Add a comment explaining why it's intentionally removed
  • Ensure there's an alternative exit mechanism

2. Race Condition in #waitForActorsToStop()

const checkActors = () => {
    if (this.#actors.size === 0) {
        // ...
        return true;
    }
    // ...
};

// Check immediately first
if (checkActors()) {
    resolve();
    return;
}

const interval = setInterval(() => {
    if (checkActors()) {
        clearInterval(interval);
        resolve();
    }
}, shutdownCheckInterval);

Issue: If actors are removed from this.#actors between the initial check and starting the interval, updates during that 100ms window might be missed.

Recommendation: While unlikely to cause issues in practice, consider using a single code path:

const interval = setInterval(() => {
    if (checkActors()) {
        clearInterval(interval);
        resolve();
    }
}, 0); // Start immediately with 0 delay

3. Potential Memory Leak in Signal Handlers

SIGNAL_HANDLERS.push(async () => {
    const weak = new WeakRef(this);
    await weak.deref()?.shutdown(false, false);
});

Issue: SIGNAL_HANDLERS is a module-level array. Each Runner instance adds a handler, but there's no cleanup. If many Runner instances are created/destroyed (e.g., in tests), handlers accumulate.

Recommendation:

  • Add cleanup in shutdown: SIGNAL_HANDLERS.splice(SIGNAL_HANDLERS.indexOf(handler), 1)
  • Or document that Runner should be singleton per process

4. Error Handling in #takeActionInFlight()

logger().error({
    msg: "action not found in in-flight map",
    lookupId: id,
    inFlightIds: Array.from(this.#actionsInFlight.keys()),
});
throw new errors.InternalError(`No in flight response for ${id}`);

Issue: While the enhanced logging is great, this suggests a timing issue where responses arrive for unknown action IDs. This could indicate:

  • Duplicate responses
  • Race condition on reconnection
  • Messages arriving after connection reset

Recommendation: Consider if this should:

  • Log a warning and ignore (graceful degradation)
  • Only throw in development/debug mode
  • Add telemetry to track frequency

5. Shutdown State Confusion

You have both #shutdown and #shuttingDown:

#shutdown: boolean = false;
#shuttingDown: boolean = false;

At line 466: this.#shutdown = true; is set after #shuttingDown check.

Issue: The distinction between these flags is unclear:

  • #shuttingDown: Prevents concurrent shutdown calls
  • #shutdown: Seems unused after this PR (most checks removed)

Recommendation:

  • Remove #shutdown entirely if it's no longer needed
  • Or document the semantic difference between the two

🔍 Minor Issues

6. Logging Consistency

Mixed log levels in actor-conn.ts:

  • logger().trace → logger().debug

Suggestion: Ensure log levels match their semantic importance across the codebase.


7. Test File Change

-	await new Promise((resolve) => setTimeout(resolve, 10000));
-	await counter.dispose();

Question: Was this manual disposal causing issues with graceful shutdown? Consider adding a comment explaining why it was removed.


🚀 Performance Considerations

  • Polling Interval: 100ms polling is reasonable for shutdown (low overhead)
  • Timeout: 120s timeout is generous, appropriate for graceful termination
  • Logging Frequency: 5s progress logs won't spam but provide good visibility

🔐 Security Considerations

No security concerns identified. The changes are internal runtime behavior.


🧪 Test Coverage

Missing Tests:

  1. Concurrent shutdown attempts (verify #shuttingDown guard works)
  2. Shutdown timeout behavior (what happens after 120s)
  3. Shutdown with actors that won't stop
  4. Signal handler execution order

Recommendation: Add integration tests for graceful shutdown scenarios, especially timeout and force-close paths.


📋 Summary

Must Fix:

  1. ❌ Re-enable process.exit(0) or explain removal
  2. ⚠️ Clean up signal handlers to prevent memory leaks

Should Fix:
3. Consider removing unused #shutdown flag
4. Document why test disposal was removed

Nice to Have:
5. Simplify #waitForActorsToStop() initial check
6. Add test coverage for shutdown edge cases

The core shutdown logic is well-designed. Addressing the process exit issue is critical before merging.


Overall: Strong improvement to shutdown reliability. Main blocker is the commented-out process.exit().

@claude

claude bot commented Nov 20, 2025

PR Review: Fix graceful runner shutdown

Summary

This PR improves graceful shutdown handling for the TypeScript runner by properly waiting for all actors to stop before closing the WebSocket connection. The changes also add enhanced debugging logs for tracking action lifecycle in the client connection.

Code Quality & Best Practices ✅

Positive:

  • Well-structured shutdown logic with proper timeout handling
  • Good use of async/await patterns
  • Clear separation of concerns with the new #waitForActorsToStop method
  • Proper use of WeakRef for signal handlers to avoid memory leaks
  • Comprehensive logging at appropriate levels

Areas for Improvement:

1. TODO Comments Need Resolution

Location: engine/sdks/typescript/runner/src/mod.ts:424-425, 434-435

// TODO: Add back
// process.exit(0);

Issue: The process.exit(0) calls are commented out in the signal handlers. This is concerning because:

  • The shutdown may never complete if there are lingering resources
  • The process won't exit cleanly after SIGTERM/SIGINT
  • This appears to be intentional based on the TODO, but there's no explanation

Recommendation: Either:

  1. Add back the process.exit(0) with proper cleanup
  2. Document WHY this is intentionally removed
  3. Ensure there's a mechanism to exit the process after shutdown completes

2. Race Condition Risk with #shuttingDown Flag

Location: engine/sdks/typescript/runner/src/mod.ts:452-459

if (this.#shuttingDown) {
    this.log?.debug({
        msg: "shutdown already in progress, ignoring",
    });
    return;
}
this.#shuttingDown = true;

Issue: While this prevents concurrent shutdowns, it doesn't reset the flag if shutdown fails partway through.

Recommendation: Add error handling to reset #shuttingDown on failure, or use a more robust state machine.
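
A sketch of resetting the flag when shutdown fails partway through; only the guard structure is shown, the body is elided, and the log field type is illustrative:

class RunnerSketch {
	#shuttingDown = false;
	log?: { debug: (fields: object) => void };

	async shutdown(): Promise<void> {
		if (this.#shuttingDown) {
			this.log?.debug({ msg: "shutdown already in progress, ignoring" });
			return;
		}
		this.#shuttingDown = true;
		try {
			// ...wait for actors and close the websocket here...
		} catch (error) {
			// Allow a later retry if this attempt failed partway through.
			this.#shuttingDown = false;
			throw error;
		}
	}
}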

3. Removed Safety Guards May Cause Issues

Location: Multiple locations in mod.ts

The PR removes several if (this.#shutdown) checks from:

  • sleepActor() (lines 263-269 removed)
  • #sendActorIntent() (lines 1107-1111 removed)
  • #sendActorStateUpdate() (lines 1148-1152 removed)
  • #sendCommandAcknowledgment() (lines 1192-1196 removed)
  • setAlarm() (lines 1497-1500 removed)
  • #sendKvRequest() (lines 1528-1531 removed)
  • __sendToServer() (allowShutdown parameter removed)

Issue: These guards prevented operations during shutdown. Removing them could allow:

  • Race conditions where messages are sent after WebSocket is closed
  • Inconsistent state during shutdown
  • Potential errors from trying to send messages on a closed connection

Recommendation:

  • Document WHY these checks were removed
  • Ensure the shutdown sequence handles these cases properly
  • Consider if some guards should be reinstated for safety

4. Magic Numbers Should Be Constants

Location: engine/sdks/typescript/runner/src/mod.ts:598-601

const shutdownTimeout = 120_000; // 120 seconds
const shutdownCheckInterval = 100; // Check every 100ms
const progressLogInterval = 5_000; // Log progress every 5 seconds

Recommendation: Move these to class-level or module-level constants for easier configuration and testing.
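
For example, lifted to module scope with the values from the snippet above (names are illustrative and could later be fed from configuration):

// Easier to reference from tests and to make configurable later.
const SHUTDOWN_TIMEOUT_MS = 120_000;
const SHUTDOWN_CHECK_INTERVAL_MS = 100;
const SHUTDOWN_PROGRESS_LOG_INTERVAL_MS = 5_000;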

5. WebSocket ReadyState Magic Numbers

Location: engine/sdks/typescript/runner/src/mod.ts:608

const wsIsClosed = ws.readyState === 2 || ws.readyState === 3;

Recommendation: Use named constants from WebSocket standard:

const wsIsClosed = ws.readyState === ws.CLOSING || ws.readyState === ws.CLOSED;

Potential Bugs 🐛

1. Potential Memory Leak in #waitForActorsToStop

Location: engine/sdks/typescript/runner/src/mod.ts:658-670

Issue: If the promise resolves via the immediate check (line 645-650), the interval is never created. However, if it enters the interval creation path, there's a theoretical edge case where the interval might not be cleared if something throws.

Recommendation: Wrap in try-finally or use a cleanup pattern to ensure the interval is always cleared.

2. Action ID Tracking Mismatch

Location: rivetkit-typescript/packages/rivetkit/src/client/actor-conn.ts:569-594

The enhanced error logging is good, but the error itself (InternalError) suggests this should never happen. The additional logging will help debug when it does happen, but:

Recommendation: Consider if there are legitimate race conditions that could cause this, and handle them more gracefully rather than throwing an internal error.

Performance Considerations ⚡

1. Polling Interval Efficiency

Location: engine/sdks/typescript/runner/src/mod.ts:658-670

The 100ms polling interval is reasonable, but could be improved with an event-driven approach:

Recommendation: Consider using a Promise that resolves when the last actor is removed, rather than polling. This would be more efficient and responsive.

2. Logging Overhead

Location: rivetkit-typescript/packages/rivetkit/src/client/actor-conn.ts

The PR adds several debug/trace level logs that include potentially expensive operations (Array.from, map operations):

inFlightActions: Array.from(this.#actionsInFlight.entries()).map(([id, action]) => ({
    id,
    name: action.name,
})),

Recommendation: Ensure these are only evaluated when the log level is active, or wrap in conditional checks.
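One way to avoid paying for the serialization when the level is off, assuming the logger exposes a pino-style isLevelEnabled (that method name is an assumption about the logging library in use):

// Only build the diagnostic payload when debug logging is actually enabled
if (logger().isLevelEnabled("debug")) {
    logger().debug({
        msg: "action added to in-flight map",
        inFlightActions: Array.from(this.#actionsInFlight.entries()).map(
            ([id, action]) => ({ id, name: action.name }),
        ),
    });
}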

Security Concerns 🔒

1. Shutdown Timeout Could Be Exploited

Location: engine/sdks/typescript/runner/src/mod.ts:598

The 120-second timeout for actor shutdown is generous. If actors are not properly implementing shutdown, this could:

  • Delay service restarts
  • Be exploited to keep resources allocated
  • Cause cascading failures during emergency shutdowns

Recommendation:

  • Make this configurable (see the sketch after this list)
  • Add metrics/alerting for slow shutdowns
  • Consider a forced kill mechanism after timeout
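A sketch of the configurable variant, assuming a RunnerConfig-style options object; the actorShutdownTimeoutMs field is invented for illustration:

interface RunnerConfig {
    // ...existing options...
    /** Max time (ms) to wait for actors to stop before closing the WebSocket. */
    actorShutdownTimeoutMs?: number;
}

const DEFAULT_ACTOR_SHUTDOWN_TIMEOUT_MS = 120_000;

function resolveActorShutdownTimeout(config: RunnerConfig): number {
    return config.actorShutdownTimeoutMs ?? DEFAULT_ACTOR_SHUTDOWN_TIMEOUT_MS;
}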

Test Coverage 📋

Observations:

  • No test file changes in this PR
  • The shutdown logic is complex and should have comprehensive tests

Recommendations:

  1. Add unit tests for #waitForActorsToStop (a sketch follows this list):
    • Actors stop immediately
    • Timeout is reached
    • WebSocket closes before actors stop
    • All three early-exit conditions
  2. Add integration tests for graceful shutdown:
    • With multiple actors
    • With slow-stopping actors
    • With signal handling (SIGTERM/SIGINT)
  3. Add tests for the removed safety checks to ensure they're not needed
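A minimal vitest-style sketch of the happy path, using fake timers to drive the 100ms poll; createTestRunner and the simulateActor* helpers are hypothetical stand-ins for whatever test harness the runner ends up exposing:

import { describe, expect, it, vi } from "vitest";

describe("graceful shutdown", () => {
    it("resolves once the last actor stops", async () => {
        vi.useFakeTimers();
        const runner = createTestRunner(); // hypothetical test harness
        runner.simulateActorStart("actor-a"); // hypothetical helper

        const done = runner.shutdown(false, false);
        runner.simulateActorStop("actor-a");

        // Advance past at least one 100ms shutdown poll tick
        await vi.advanceTimersByTimeAsync(200);
        await expect(done).resolves.toBeUndefined();

        vi.useRealTimers();
    });
});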

Suggested Changes

High Priority:

  1. Resolve TODO comments - Decide on process.exit() behavior
  2. Document removed safety checks - Explain why they're safe to remove
  3. Add tests - Especially for shutdown logic

Medium Priority:

  1. Extract magic numbers to constants
  2. Improve #waitForActorsToStop - Consider event-driven approach
  3. Reset #shuttingDown on error

Low Priority:

  1. Use WebSocket constants instead of magic numbers
  2. Optimize logging - Conditional evaluation for expensive operations

Overall Assessment

This PR makes meaningful improvements to graceful shutdown handling, which is critical for production systems. The core logic is sound, but there are several areas that need attention:

  • ✅ Proper async shutdown flow
  • ✅ Good logging for debugging
  • ⚠️ Removed safety checks need documentation
  • ⚠️ TODO comments need resolution
  • ❌ Missing test coverage

Recommendation: Request changes to address the TODOs and add test coverage before merging.

@NathanFlurry NathanFlurry force-pushed the 11-19-fix_pegboard-gateway_write_hws_immediately_on_socket_hibernate branch from e1152dc to 968a92b Compare November 20, 2025 23:32
@NathanFlurry NathanFlurry force-pushed the 11-19-fix_rivetkit_fix_graceful_runner_shutdown branch from a11d203 to 0ed702c Compare November 20, 2025 23:32
@claude
Copy link

claude bot commented Nov 20, 2025

Pull Request Review: fix(rivetkit): fix graceful runner shutdown

Summary

This PR fixes graceful shutdown of the TypeScript runner by implementing proper actor lifecycle management during shutdown. The changes ensure all actors are stopped before closing the WebSocket connection and prevent race conditions during shutdown.


Code Quality ✅

Strengths:

  • Well-structured implementation with clear separation of concerns
  • Good use of TypeScript async/await patterns
  • Proper state management with guard flags to prevent concurrent shutdowns
  • Comprehensive logging for debugging shutdown flow

Concerns:

  1. Commented TODO without timeline (lines 424-425, 434-435 in mod.ts):

    // TODO: Add back
    // process.exit(0);

    These TODOs disable the process exit after signal handlers run. This is a critical issue - the process will not exit after SIGTERM/SIGINT, which could lead to hanging processes. This should either be:

    • Restored immediately with proper shutdown flow
    • Documented with an explanation of why it's disabled
    • Tracked in a separate issue
  2. Signal handler async compatibility: The signal handlers were changed from synchronous to async, but signal handlers in Node.js are typically synchronous. The current implementation works, but it's unconventional and could have edge cases.


Potential Bugs 🐛

  1. Process won't exit on signals (CRITICAL):

    • The commented process.exit(0) means SIGTERM/SIGINT won't terminate the process
    • This breaks standard Unix signal handling expectations
    • Impact: High - processes may hang in production deployments
  2. WebSocket state race condition (line 608):

    const wsIsClosed = ws.readyState === 2 || ws.readyState === 3;

    The WebSocket could transition states between the check and subsequent operations. Consider using named constants (WebSocket.CLOSING, WebSocket.CLOSED) for clarity.

  3. Shutdown guard not reset on error:
    If shutdown() throws an exception after setting #shuttingDown = true, subsequent shutdown calls will be ignored. Consider a try-finally to reset the flag or handle errors more gracefully.

  4. Action tracking during shutdown (actor-conn.ts):
    The enhanced logging reveals potential action ID mismatches. Line 584 throws InternalError if an action isn't found, but during shutdown with concurrent requests, this could cause issues. Consider more graceful degradation.


Performance Considerations ⚡

Good:

  • Polling interval of 100ms for shutdown is reasonable
  • Timeout of 120 seconds provides adequate time for graceful shutdown
  • Progress logging every 5 seconds prevents log spam

Potential Improvements:

  • The shutdown polling could use a more event-driven approach (promises resolving when actors stop) instead of busy-polling
  • Consider using setImmediate or process.nextTick for the initial check before starting the interval

Security Concerns 🔒

Minor:

  1. Shutdown timeout handling: The 120-second timeout is hardcoded. In production, this could be too long or too short depending on actor cleanup requirements. Consider making it configurable via RunnerConfig.

  2. No protection against shutdown during critical operations: While the #shuttingDown flag prevents concurrent shutdowns, there's no prevention of shutdown during critical actor operations.


Test Coverage 📊

Current State:

  • The test file lifecycle.test.ts is entirely commented out
  • No active tests for the shutdown flow
  • The removed code from examples/counter/scripts/connect.ts suggests manual testing was used

Critical Gap:
This PR modifies critical shutdown logic without test coverage. Recommend:

  1. Uncomment and update lifecycle.test.ts
  2. Add specific tests for:
    • Graceful shutdown with multiple actors
    • Shutdown timeout behavior
    • Concurrent shutdown attempts
    • Signal handler behavior
  3. Test the scenario where actors don't stop within timeout

Testing the fix:
Since the tests are commented out, it's unclear if this fix has been validated beyond manual testing. The CLAUDE.md guidelines specify tests should be piped to /tmp/ and grepped, but there's no evidence of test execution.


Specific File Reviews

engine/sdks/typescript/runner/src/mod.ts

Good Changes:

  • ✅ Added #shuttingDown flag to prevent concurrent shutdowns (line 205)
  • ✅ Removed premature shutdown checks in sleepActor, #sendActorIntent, etc. - these were preventing proper cleanup
  • ✅ New #waitForActorsToStop method with proper timeout and progress logging (lines 587-672)
  • ✅ Signal handlers properly await async shutdown

Issues:

  • CRITICAL: process.exit(0) commented out in signal handlers (lines 424-425, 434-435)
  • ⚠️ No reset of #shuttingDown on error
  • ⚠️ Multiple shutdown checks removed without explanation (were they all wrong?)

rivetkit-typescript/packages/rivetkit/src/client/actor-conn.ts

Good Changes:

  • ✅ Enhanced debug logging for action lifecycle tracking
  • ✅ Better visibility into the in-flight actions map
  • ✅ Detailed error logging when actions aren't found (lines 572-583)

Issues:

  • ⚠️ Logging changed from trace to debug level (lines 469, 587) - this will increase log volume in production
  • ⚠️ Error logging on every action not found could be noisy during shutdown

examples/counter/scripts/connect.ts

  • Removed manual test code - likely cleanup, but reduces ad-hoc testing ability

Recommendations

Before Merging:

  1. 🚨 CRITICAL: Fix or document the commented process.exit(0) calls
  2. 🚨 HIGH: Add test coverage for graceful shutdown scenarios
  3. MEDIUM: Add error handling in shutdown() to reset #shuttingDown on failure
  4. MEDIUM: Use WebSocket state constants instead of magic numbers

Future Improvements:

  1. Make shutdown timeout configurable
  2. Consider event-driven approach instead of polling for actor shutdown
  3. Add metrics/telemetry for shutdown duration and success rate
  4. Document the shutdown flow in code comments or architecture docs

Architecture Alignment

Follows conventions:

  • Uses structured logging as per CLAUDE.md
  • Proper async/await patterns
  • Good separation of concerns

Violates conventions:

  • Process won't exit on signals (Unix conventions)
  • No test coverage (CLAUDE.md specifies test requirements)

Overall Assessment

Score: 6/10 - Needs work before merge

This PR addresses an important issue with graceful shutdown, and the core implementation of #waitForActorsToStop is solid. However, the commented-out process.exit(0) calls are a critical blocker. The lack of test coverage for such critical shutdown logic is also concerning.

The implementation shows good engineering practices (logging, timeout handling, progress reporting) but needs to be completed and tested before merging.

Recommendation: Request Changes - Fix the process exit issue and add basic test coverage.

@claude
Copy link

claude bot commented Nov 20, 2025

Code Review

Thank you for this PR improving graceful runner shutdown! The implementation looks solid overall. Here's my detailed feedback:

✅ Strengths

  1. Graceful shutdown logic: The #waitForActorsToStop method properly waits for actors to shut down before closing the WebSocket connection, preventing abrupt terminations.

  2. Concurrent shutdown protection: Good addition of the #shuttingDown flag to prevent concurrent shutdown attempts (line 453-458 in mod.ts).

  3. Comprehensive logging: Excellent structured logging throughout the shutdown flow with progress updates every 5 seconds and detailed debugging information.

  4. Multiple exit conditions: The shutdown wait logic properly handles multiple scenarios (all actors stopped, WebSocket closed, timeout reached).

  5. Enhanced debugging: The additional logging in actor-conn.ts for action in-flight tracking will help diagnose race conditions.

⚠️ Issues & Concerns

1. CRITICAL: Signal handlers don't exit process

// TODO: Add back
// process.exit(0);

Lines 424-425 and 434-435 in mod.ts show that process.exit(0) is commented out after handling SIGTERM/SIGINT. This means:

  • The process will not terminate after graceful shutdown completes
  • Multiple signals could trigger concurrent shutdown attempts (though the #shuttingDown guard helps)
  • Action required: Either re-enable process.exit(0) or document why it's intentionally disabled

2. Potential memory leak in interval

The #waitForActorsToStop method creates a setInterval that could leak if the promise resolves via the immediate check before the interval fires:

// Check immediately first
if (checkActors()) {
    // ...
    resolve();
    return;  // Interval is never created, no leak here
}

const interval = setInterval(() => {
    // If this never fires before resolution, could leak
}, shutdownCheckInterval);

Actually, upon closer review, this is not a leak - the interval is only created if the immediate check fails. However, consider using setTimeout recursively for more explicit control.
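A sketch of the recursive-setTimeout shape, which re-arms only after each check completes and never leaves a dangling interval (deadline handling is simplified; this would live inside #waitForActorsToStop):

const waitForActors = (deadline: number): Promise<void> =>
    new Promise((resolve) => {
        const tick = () => {
            if (this.#actors.size === 0 || Date.now() >= deadline) {
                resolve();
                return;
            }
            // Schedule the next check only after the previous one finished
            setTimeout(tick, 100);
        };
        tick();
    });

await waitForActors(Date.now() + 120_000);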

3. WebSocket state constants should use named enums

const wsIsClosed = ws.readyState === 2 || ws.readyState === 3;

Line 608: Use WebSocket constants instead of magic numbers:

const wsIsClosed = ws.readyState === ws.CLOSING || ws.readyState === ws.CLOSED;

4. Removed shutdown guards may cause issues

The PR removes multiple if (this.#shutdown) guards from methods like:

  • sleepActor (lines 264-269 removed)
  • #sendActorIntent (lines 1111-1118 removed)
  • #sendActorStateUpdate (lines 1153-1159 removed)
  • #sendCommandAcknowledgment (lines 1193-1199 removed)

Concern: Without these guards, these methods can now execute during shutdown. While this may be intentional (to allow final state updates), it could cause:

  • Messages sent after WebSocket closure
  • Race conditions during shutdown
  • Unexpected behavior if the server expects no more messages

Recommendation: Document why these guards were removed and ensure the shutdown flow properly handles any messages sent during this window.

5. Missing cleanup for signal handlers

SIGNAL_HANDLERS.push(async () => {
    const weak = new WeakRef(this);
    await weak.deref()?.shutdown(false, false);
});

Lines 443-446: Signal handlers are never removed from the global array. If multiple Runner instances are created and destroyed, this could:

  • Cause memory leaks (weak references help but array still grows)
  • Trigger shutdowns on already-destroyed runners
  • Execute handlers in unexpected order

Recommendation: Remove handlers from SIGNAL_HANDLERS during cleanup or document the singleton pattern if only one Runner is expected.
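A sketch of a register/unregister pair so a destroyed Runner stops receiving signals; the #unregisterSignalHandler field is hypothetical:

function registerSignalHandler(handler: () => void | Promise<void>): () => void {
    SIGNAL_HANDLERS.push(handler);
    return () => {
        const idx = SIGNAL_HANDLERS.indexOf(handler);
        if (idx !== -1) SIGNAL_HANDLERS.splice(idx, 1);
    };
}

// In the Runner constructor:
// this.#unregisterSignalHandler = registerSignalHandler(() => this.shutdown(false, false));
// ...and during cleanup:
// this.#unregisterSignalHandler?.();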

🔍 Minor Issues

  1. Log level choice: Lines 469 and 587 both use logger().debug for similar "removed from in-flight map" messages; confirm debug (rather than trace) is the intended level for both, and standardize with the rest of the connection logging.

  2. Timeout could be configurable: The 120-second shutdown timeout is hardcoded. Consider making it configurable via RunnerConfig for different deployment scenarios.

  3. Example code change unclear: The removal of the timeout and dispose call in examples/counter/scripts/connect.ts (lines 17-19 removed) isn't explained. Is this demonstrating the graceful shutdown working automatically?

🧪 Testing Recommendations

  1. Test concurrent shutdown: Verify that rapid SIGTERM signals don't cause issues
  2. Test timeout path: Simulate actors that never stop to verify the 120s timeout works
  3. Test WebSocket closure during wait: Ensure clean handling when connection drops during shutdown
  4. Test empty actor case: Verify immediate resolution when no actors exist
  5. Memory leak test: Create/destroy multiple Runner instances to verify no handler accumulation

📝 Documentation Needs

  1. Document why process.exit(0) is commented out
  2. Document the shutdown flow in code comments
  3. Explain the relationship between #shutdown and #shuttingDown flags
  4. Add JSDoc to #waitForActorsToStop explaining the timeout values

Security & Performance

  • ✅ No obvious security concerns
  • ✅ 100ms polling interval is reasonable for shutdown scenarios
  • ✅ 120s timeout prevents indefinite hangs
  • ✅ Structured logging follows best practices from CLAUDE.md

Verdict

This is a solid improvement to shutdown reliability. Please address the critical issue with process.exit(0) and consider the recommendations around removed shutdown guards and signal handler cleanup before merging.

Recommendation: Request changes for the process.exit issue, others are minor improvements.

@claude
Copy link

claude bot commented Nov 21, 2025

Pull Request Review: fix(rivetkit): fix graceful runner shutdown

Summary

This PR improves the graceful shutdown mechanism for the TypeScript runner, addressing issues with actor lifecycle management during shutdown and adding better tracking for in-flight actions in the client connection.


Code Quality & Best Practices ✅

Strengths:

  • Good use of structured logging with detailed context throughout
  • Proper concurrency control with #shuttingDown flag to prevent concurrent shutdowns
  • Well-documented #waitForActorsToStop method with clear explanation of timeout behavior
  • Consistent error handling patterns

Areas for Improvement:

  1. TODO Comments Left in Production Code ⚠️

    // TODO: Add back
    // process.exit(0);

    Lines 424-425 and 434-435 have commented-out process.exit(0) calls with TODO comments. This appears intentional for testing, but should either:

    • Have an explanation of why this is intentional
    • Include a tracking issue number
    • Or be restored if the testing phase is complete
  2. Signal Handler Type Change

    const SIGNAL_HANDLERS: (() => void | Promise<void>)[] = [];

    The signal handlers now support async functions, but the signal event listeners (SIGTERM, SIGINT) use async callbacks. This is good, but note that Node.js signal handlers are synchronous by design. The async handlers will start executing but won't block the signal handler from completing. This might be the desired behavior, but it's worth documenting.


Potential Bugs 🐛

  1. Race Condition in #waitForActorsToStop
    The method checks this.#actors.size === 0 every 100ms, but there's a potential race condition: if the WebSocket closes exactly as actors are being stopped, the method might log a warning about remaining actors when they were actually in the process of stopping. This is logged as a warning (line 617-621) which is appropriate, but consider if this is the desired behavior.

  2. Removed Shutdown Checks Without Clear Justification
    The PR removes multiple if (this.#shutdown) checks from methods like:

    • sleepActor (lines 264-269 removed)
    • #sendActorIntent (lines 1109-1117 removed)
    • #sendActorStateUpdate (lines 1152-1159 removed)
    • #sendCommandAcknowledgment (lines 1194-1200 removed)
    • setAlarm (lines 1499-1503 removed)
    • #sendKvRequest (lines 1532-1536 removed)

    Concern: These checks prevented operations during shutdown. Removing them means these methods can now be called during shutdown, which allows the graceful shutdown process to send messages (like StopActor commands). However, this could also allow unintended operations.

    Recommendation: The code comment at line 523-526 explains this is intentional for sending the ToServerStopping message, but consider:

    • Adding a more nuanced check (e.g., allow only specific message types during shutdown; see the sketch after this list)
    • Or document why these checks were removed in the PR description
  3. __sendToServer Signature Change
    The allowShutdown parameter was removed from __sendToServer. All callers now can send messages during shutdown. This is intentional based on the code, but it's a significant behavioral change that could have unintended side effects.
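A sketch of such a nuanced guard; apart from ToServerStopping, the message tags and the { tag } shape are illustrative rather than the real protocol types:

// Message kinds still permitted once #shuttingDown is set
const SHUTDOWN_ALLOWED_MESSAGES = new Set([
    "ToServerStopping",
    "ToServerActorStateUpdate",
    "ToServerCommandAcknowledgment",
]);

__sendToServer(message: { tag: string }) {
    if (this.#shuttingDown && !SHUTDOWN_ALLOWED_MESSAGES.has(message.tag)) {
        this.log?.debug({ msg: "dropping message during shutdown", tag: message.tag });
        return;
    }
    // ... existing serialization and WebSocket send ...
}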


Performance Considerations ⚡

  1. Polling Interval

    const shutdownCheckInterval = 100; // Check every 100ms

    Checking every 100ms is reasonable, but consider if this could be event-driven instead. However, for a shutdown scenario where we want a tight feedback loop, 100ms is acceptable.

  2. Logging in Tight Loop
    The #waitForActorsToStop method logs progress every 5 seconds (lines 631-638), which is good. The debug logs on every interval tick (lines 659-662) might be excessive in production. Consider using trace level instead of debug for the per-tick logs.


Security Concerns 🔒

No significant security concerns identified. The changes are primarily focused on lifecycle management and don't introduce new attack vectors.


Test Coverage 🧪

Concerns:

  1. Reduced Test Coverage - The example file examples/counter/scripts/connect.ts had test code removed:

    - await new Promise((resolve) => setTimeout(resolve, 10000));
    - await counter.dispose();

    This suggests manual testing code was removed. Consider:

    • Adding automated tests for graceful shutdown scenarios
    • Testing the new #waitForActorsToStop logic with various edge cases:
      • All actors stop normally
      • Timeout is reached
      • WebSocket closes prematurely
      • Mix of fast and slow actor shutdowns
  2. Recommended Test Scenarios:

    • Test concurrent shutdown attempts (verify #shuttingDown flag works)
    • Test action completion during shutdown
    • Test the new action tracking in ActorConnRaw
    • Verify signal handlers work correctly with async shutdown

Additional Observations

ActorConnRaw Changes:
The changes to rivetkit-typescript/packages/rivetkit/src/client/actor-conn.ts add extensive debug logging for action tracking:

  • Lines 200-205: Added logging when actions are added
  • Lines 469-474: Enhanced logging for action responses
  • Lines 572-583: Added error logging with full state dump
  • Lines 587-592: Added logging when actions are removed

This is excellent for debugging, but consider:

  1. These debug logs should be at trace level in production to avoid log spam
  2. The error log at line 572 dumps the entire in-flight map, which could be large - consider limiting or sampling in production
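For example, capping the dump to a bounded sample (the cap of 20 is arbitrary):

const MAX_LOGGED_ACTIONS = 20;
const entries = Array.from(this.#actionsInFlight.entries());
logger().error({
    msg: "action not found in in-flight map",
    id,
    inFlightCount: entries.length,
    // Log only a sample so the line stays bounded under heavy load
    inFlightSample: entries.slice(0, MAX_LOGGED_ACTIONS).map(([actionId, action]) => ({
        id: actionId,
        name: action.name,
    })),
});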

Logging Convention Adherence:
The code properly follows the CLAUDE.md logging conventions:

  • Using structured logging with tracing
  • Parameters are not formatted into messages
  • Messages are lowercase ✅

Recommendations

High Priority:

  1. Resolve or document the TODO comments about process.exit(0)
  2. Add automated tests for graceful shutdown scenarios
  3. Consider downgrading some debug logs to trace level

Medium Priority:

  1. Document the rationale for removing shutdown checks in the PR description
  2. Consider making the shutdown check more nuanced (allow only specific operations)
  3. Add a comment explaining the async signal handler behavior

Low Priority:

  1. Consider event-driven actor stopping instead of polling
  2. Add sampling/limiting for large in-flight action map dumps

Verdict

This PR makes meaningful improvements to graceful shutdown handling and action tracking. The code quality is generally good with proper logging and error handling. The main concerns are:

  1. The removed shutdown checks need clear documentation of intent
  2. Missing automated test coverage for the new shutdown logic
  3. TODOs left in the code

Recommendation: Request clarification on the removed shutdown checks and TODOs before merging. Consider adding tests for the shutdown scenarios.

@NathanFlurry NathanFlurry force-pushed the 11-19-fix_pegboard-gateway_write_hws_immediately_on_socket_hibernate branch from e1152dc to 968a92b Compare November 21, 2025 01:27
@NathanFlurry NathanFlurry force-pushed the 11-19-fix_rivetkit_fix_graceful_runner_shutdown branch from a11d203 to 0ed702c Compare November 21, 2025 01:27
@claude
Copy link

claude bot commented Nov 21, 2025

Code Review

This PR addresses graceful runner shutdown. Overall, the changes improve shutdown handling, but there are several concerns that should be addressed:

Critical Issues

1. TODOs Left in Production Code (engine/sdks/typescript/runner/src/mod.ts:424-425, 434-435)

// TODO: Add back
// process.exit(0);

The process.exit(0) calls are commented out in both SIGTERM and SIGINT handlers. This means the process will never exit after signal handlers run, which could lead to zombie processes or hung deployments. This needs to be resolved before merging.

Recommendation: Either restore the process.exit(0) calls with proper timing (after shutdown completes), or document why they were removed and add a TODO with a tracking issue.

2. Race Condition in Shutdown Guards (engine/sdks/typescript/runner/src/mod.ts:264-273, 1112-1119, 1153-1160)

Multiple #shutdown checks were removed from methods like sleepActor(), #sendActorIntent(), #sendActorState(), etc. While this allows actors to complete their lifecycle during shutdown, it also means these methods can now be called after the runner has begun shutting down, potentially causing:

  • Messages sent over closed WebSockets
  • State updates to cleaned-up data structures
  • Undefined behavior if #pegboardWebSocket is null

Recommendation: Consider using the #shuttingDown flag more granularly. For example, allow cleanup messages but prevent new actor creation.

3. Missing Error Handling in Signal Handlers (engine/sdks/typescript/runner/src/mod.ts:417-436)

Signal handlers now use async/await but don't catch errors:

for (const handler of SIGNAL_HANDLERS) {
    await handler(); // No try/catch
}

If a handler throws, subsequent handlers won't run and the error will be unhandled.

Recommendation:

for (const handler of SIGNAL_HANDLERS) {
    try {
        await handler();
    } catch (error) {
        console.error('Error in signal handler:', error);
    }
}

Medium Priority Issues

4. Polling-Based Shutdown is Inefficient (engine/sdks/typescript/runner/src/mod.ts:597-672)

The #waitForActorsToStop() method polls every 100ms for up to 120 seconds. This could be improved with event-based waiting:

async #waitForActorsToStop(ws: WebSocket): Promise<void> {
    // Use Promise.race with a timeout instead of open-ended polling
    let checkInterval: ReturnType<typeof setInterval> | undefined;

    const timeout = new Promise<string>((resolve) =>
        setTimeout(() => resolve('timeout'), 120_000)
    );

    const actorsStopped = new Promise<string>((resolve) => {
        checkInterval = setInterval(() => {
            if (this.#actors.size === 0) {
                resolve('complete');
            }
        }, 100);
    });

    const result = await Promise.race([actorsStopped, timeout]);
    // Clear the poller even when the timeout wins the race
    clearInterval(checkInterval);
    // ... handle result
}

Even better would be tracking actor lifecycle with events/promises.

5. WebSocket State Check is Fragile (engine/sdks/typescript/runner/src/mod.ts:608)

const wsIsClosed = ws.readyState === 2 || ws.readyState === 3;

Use WebSocket constants for readability:

const wsIsClosed = ws.readyState === ws.CLOSING || ws.readyState === ws.CLOSED;

6. Debug Logging Changes Lack Context (rivetkit-typescript/packages/rivetkit/src/client/actor-conn.ts:200-223, 469-476, 572-593)

The added debug/error logging for action in-flight tracking is helpful for debugging, but:

  • The trace → debug change may be too noisy in production
  • The detailed error logging in #takeActionInFlight suggests this is addressing a specific bug but there's no explanation

Recommendation: Add a comment explaining why this debugging was needed (e.g., "Track down issue with actions not being removed from in-flight map").

Minor Issues

7. Import Order Changed (engine/sdks/typescript/runner/src/mod.ts:4-13)

Import order was changed without explanation. While not critical, check it against the code style guideline: "Always add imports at the top of the file instead of inline within the function." Also check if there's an import sorting convention in the project.

8. Example Script Modified (examples/counter/scripts/connect.ts)

The removal of the 10-second delay and dispose() call changes the example behavior. Ensure this aligns with the intended example usage and that other examples are updated similarly if needed.

Positive Aspects

Good: Added #shuttingDown guard to prevent concurrent shutdowns
Good: Comprehensive logging in #waitForActorsToStop() with progress updates
Good: Multiple exit conditions (all actors stopped, WS closed, timeout reached)
Good: Signal handlers now properly await async shutdown

Security Considerations

No major security issues identified. The shutdown handling changes don't introduce obvious vulnerabilities.

Performance Considerations

  • The 100ms polling interval during shutdown is reasonable but could be event-driven
  • 120-second timeout is generous but may delay container restarts in orchestrated environments

Test Coverage

Missing: No new tests for the graceful shutdown behavior
Missing: No tests for the actor waiting logic
Missing: No tests for signal handler behavior

Recommendation: Add tests in engine/sdks/typescript/runner/tests/lifecycle.test.ts covering:

  1. Graceful shutdown waits for actors to stop
  2. Shutdown timeout is respected
  3. Shutdown with closed WebSocket doesn't hang
  4. Concurrent shutdown attempts are handled

Summary

The PR makes important improvements to shutdown handling, but should not be merged until:

  1. ✅ TODOs are resolved (restore process.exit() or document why removed)
  2. ✅ Error handling added to signal handlers
  3. ✅ Tests added for new shutdown behavior
  4. ⚠️ (Nice to have) Consider event-based actor tracking instead of polling

Please address the critical issues before merging. Happy to review again once updated!

@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 21, 2025

Merge activity

  • Nov 21, 2:40 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 21, 2:42 AM UTC: CI is running for this pull request on a draft pull request (#3515) due to your merge queue CI optimization settings.
  • Nov 21, 2:43 AM UTC: Merged by the Graphite merge queue via draft PR: #3515.

@graphite-app graphite-app bot closed this Nov 21, 2025
@graphite-app graphite-app bot deleted the 11-19-fix_rivetkit_fix_graceful_runner_shutdown branch November 21, 2025 02:43