fix(pegboard): fix actors getting destroyed on runner shutdown #3499

NathanFlurry · 2025-11-20T07:59:05Z

No description provided.

vercel · 2025-11-20T07:59:09Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
rivetkit-serverless	Error			Nov 20, 2025 8:52pm

3 Skipped Deployments

Project	Deployment	Preview	Updated (UTC)
rivet-cloud	Ignored		Nov 20, 2025 8:52pm
rivet-inspector	Ignored	Preview	Nov 20, 2025 8:52pm
rivet-site	Ignored	Preview	Nov 20, 2025 8:52pm

NathanFlurry · 2025-11-20T07:59:22Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

fix(pegboard): fix actors getting destroyed on runner shutdown #3499 👈 (View in Graphite)
fix(pegboard-gateway): ping hibernating requests both during open hws connections and during hibernation #3498 : 1 other dependent PR (#3501 )
fix(rivetkit): fix graceful runner shutdown #3497
fix(pegboard-gateway): write hws immediately on socket hibernate #3496
chore(rivetkit): fix hibernation implementation #3494
fix(tunnel): implement ping pong #3491 : 1 other dependent PR (#3475 )
chore(pegboard-gateway): add new message id format & add deprecated tunnel ack #3492
chore(pegboard-gateway): remove TunnelAck #3488
chore(engine-runner): remove echo close on ToClientWebSocketClose #3487
chore(rivetkit): implement new hibernating ws protocol #3464 : 1 other dependent PR (#3452 )
fix(actor-kv): fix listing keys #3484
fix(guard): handle actor stopped during ws/req #3481
chore: rename retry -> hibernate in runner protocol #3473
fix(udb): update backoff algo #3469
chore: write/update docs #3468
chore: add debug scripts #3467
main

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2025-11-20T08:00:46Z

Code Review

Summary

This PR fixes a critical bug where actors with CrashPolicy::Sleep were being destroyed instead of put to sleep on clean shutdowns (when a runner shuts down gracefully). The fix changes the match logic so that both crash and clean shutdown scenarios put the actor to sleep.

Analysis

Previous Behavior (Bug):

The code matched on (input.crash_policy, failed) tuple
For CrashPolicy::Sleep, only the (CrashPolicy::Sleep, true) arm was handled
When failed = false (clean shutdown, e.g., runner shutdown), the code fell through to the wildcard _ pattern at line 695
This wildcard arm calls SetCompleteInput and returns StoppedResult::Destroy, which destroys the actor instead of putting it to sleep

New Behavior (Fix):

Now matches only on input.crash_policy
The CrashPolicy::Sleep arm handles both failed = true (crash) and failed = false (clean shutdown)
Both scenarios now correctly put the actor to sleep
Different log messages distinguish between crash and clean shutdown scenarios
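
The before/after difference described above can be sketched in isolation. This is a minimal illustration, not the pegboard source: `Outcome`, `decide_before`, and `decide_after` are hypothetical names, the `Restart` arm is assumed from the review text, and `println!` stands in for the `tracing::debug!` calls so the example needs only the standard library.

```rust
// Hypothetical stand-ins for the real pegboard types; only the variant
// names CrashPolicy::{Restart, Sleep, Destroy} come from the review text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CrashPolicy {
    Restart,
    Sleep,
    Destroy,
}

// Hypothetical summary of what happens to the actor after it stops.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Reschedule,
    Sleep,
    Destroy,
}

/// Old shape: matching on the (policy, failed) tuple. Only (Sleep, true)
/// is handled, so (Sleep, false) falls through to the wildcard arm.
pub fn decide_before(policy: CrashPolicy, failed: bool) -> Outcome {
    match (policy, failed) {
        (CrashPolicy::Restart, _) => Outcome::Reschedule,
        (CrashPolicy::Sleep, true) => Outcome::Sleep,
        _ => Outcome::Destroy,
    }
}

/// New shape: matching on the policy alone. `failed` only selects the
/// log message and no longer affects control flow.
pub fn decide_after(policy: CrashPolicy, failed: bool) -> Outcome {
    match policy {
        CrashPolicy::Restart => Outcome::Reschedule,
        CrashPolicy::Sleep => {
            let reason = if failed { "crash" } else { "clean shutdown" };
            println!("actor sleeping due to {reason}");
            Outcome::Sleep
        }
        _ => Outcome::Destroy,
    }
}
```

Calling both functions with `(CrashPolicy::Sleep, false)` shows the bug directly: the old shape yields `Outcome::Destroy`, the new one yields `Outcome::Sleep`.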

Code Quality: ✅ Excellent

Strengths:

Correct logic fix: Aligns behavior with the intended semantics of CrashPolicy::Sleep - actors should sleep regardless of whether they crashed or stopped cleanly
Improved observability: Added helpful logging to distinguish between crash-induced sleep vs clean shutdown sleep
Clean implementation: Simplified the match statement by removing the unnecessary tuple pattern
Follows conventions: Uses structured logging with ?actor_id as per CLAUDE.md guidelines

Potential Issues: ⚠️ Minor Considerations

Test Coverage: There are no test files in the pegboard package. This critical actor lifecycle logic would benefit from integration tests covering:
- Actor with CrashPolicy::Sleep stopping due to crash (StopCode != Ok)
- Actor with CrashPolicy::Sleep stopping cleanly (StopCode::Ok)
- Actor with CrashPolicy::Sleep stopping due to runner shutdown
- Verification that state.sleeping = true is set in all cases
- Verification that actors are NOT destroyed
Documentation: Consider adding a comment explaining the semantic difference between:
- CrashPolicy::Restart: Reschedule immediately (regardless of clean/crash)
- CrashPolicy::Sleep: Put to sleep (regardless of clean/crash)
- CrashPolicy::Destroy: Destroy (handled by wildcard)
This would help future maintainers understand the intent.
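
One possible form for that documentation is doc comments on the enum itself. The sketch below is an assumption about how it could look, not the actual definition in the pegboard package; the `describe` helper is hypothetical and exists only to make the semantics checkable.

```rust
/// What to do with an actor after it stops.
/// Hypothetical mirror of the real enum, written only to show the comments.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CrashPolicy {
    /// Reschedule immediately, whether the actor crashed or stopped cleanly.
    Restart,
    /// Put the actor to sleep, whether the actor crashed or stopped cleanly.
    Sleep,
    /// Destroy the actor once it stops.
    Destroy,
}

impl CrashPolicy {
    /// Hypothetical helper: a one-line summary of the policy, e.g. for logs.
    pub fn describe(self) -> &'static str {
        match self {
            CrashPolicy::Restart => "reschedule regardless of clean/crash",
            CrashPolicy::Sleep => "sleep regardless of clean/crash",
            CrashPolicy::Destroy => "destroy",
        }
    }
}
```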

Performance: ✅ No Impact

The change simplifies the match logic and has no performance implications.

Security: ✅ No Concerns

No security implications from this change.

Recommendation: ✅ Approve with Suggestion

The fix is correct and addresses a critical bug. The code quality is excellent. I recommend:

Short-term: Merge this fix as-is since it resolves the immediate issue
Medium-term: Add integration tests to prevent regression
Optional: Add explanatory comments about crash policy semantics

Great catch on this bug! The PR title accurately describes the fix, and the implementation is clean.

claude · 2025-11-20T20:53:41Z

Code Review

Summary

This PR fixes a bug where actors with CrashPolicy::Sleep were being destroyed on clean shutdown (e.g., runner shutdown) instead of sleeping. The fix changes the logic to enter sleep mode for any stopped actor with CrashPolicy::Sleep, not just failed ones.

Analysis

The Bug:
The previous code only put actors to sleep if they had CrashPolicy::Sleep AND failed:

match (input.crash_policy, failed) {
    (CrashPolicy::Sleep, true) => { /* sleep */ }
    // ...
}

This meant that actors with CrashPolicy::Sleep that stopped cleanly (e.g., due to runner shutdown with StopCode::Ok) would fall through to the default case and be destroyed, which contradicts the intended behavior of the sleep policy.

The Fix:
The new code correctly handles sleep policy regardless of failure status:

match input.crash_policy {
    CrashPolicy::Sleep => {
        if failed {
            tracing::debug!("actor sleeping due to crash");
        } else {
            tracing::debug!("actor sleeping due to clean shutdown");
        }
        state.sleeping = true;
        // ...
    }
}

✅ Strengths

Correct behavior: The fix aligns with the semantic meaning of CrashPolicy::Sleep - actors should sleep when stopped, regardless of the reason
Improved logging: The added log message for clean shutdowns improves observability
Consistent pattern: The CrashPolicy::Restart case already worked this way (always restarting, not checking failed)
Minimal change: The fix is surgical and doesn't introduce unnecessary changes

🔍 Observations

No test coverage: There don't appear to be unit tests for the pegboard actor lifecycle. Consider adding tests for:
- Actor with CrashPolicy::Sleep stopping cleanly
- Actor with CrashPolicy::Sleep crashing
- Runner shutdown scenarios
Error handling pattern: The code follows the project's error handling conventions using anyhow properly
Logging follows conventions: Uses structured logging with ?actor_id parameter and lowercase messages as per CLAUDE.md
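
A table-driven check for the scenarios listed above might look like the following. It is a sketch under stated assumptions: `ActorState`, `StopCode::Error`, `StoppedResult::Continue`, `on_stopped_with_sleep_policy`, and `failing_scenarios` are hypothetical, and the handler is a simplified model of the fixed CrashPolicy::Sleep arm rather than the real workflow code.

```rust
// Hypothetical model types. Only StopCode::Ok, StoppedResult::Destroy and
// the `sleeping` flag are named in the review; the rest is assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StopCode {
    Ok,
    Error,
}

#[derive(Debug, Default)]
pub struct ActorState {
    pub sleeping: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum StoppedResult {
    Continue,
    Destroy,
}

/// Simplified model of the fixed CrashPolicy::Sleep arm: the actor sleeps
/// whether or not the stop counts as a failure.
pub fn on_stopped_with_sleep_policy(state: &mut ActorState, code: StopCode) -> StoppedResult {
    let failed = code != StopCode::Ok;
    if failed {
        println!("actor sleeping due to crash");
    } else {
        println!("actor sleeping due to clean shutdown");
    }
    state.sleeping = true;
    StoppedResult::Continue
}

/// Runs each scenario and returns the names of the ones that left the
/// actor awake or destroyed it.
pub fn failing_scenarios() -> Vec<&'static str> {
    let scenarios = [
        ("crash", StopCode::Error),
        ("clean stop", StopCode::Ok),
        ("runner shutdown", StopCode::Ok),
    ];
    let mut failures = Vec::new();
    for (name, code) in scenarios {
        let mut state = ActorState::default();
        let result = on_stopped_with_sleep_policy(&mut state, code);
        if !state.sleeping || result == StoppedResult::Destroy {
            failures.push(name);
        }
    }
    failures
}
```

A real test would drive the actual workflow instead of this model, but the scenario table carries over unchanged.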

Minor Suggestions

Consider enhancing the log messages to include the stop code for better debugging:

if failed {
    tracing::debug!(actor_id=?input.actor_id, ?variant, "actor sleeping due to crash");
} else {
    tracing::debug!(actor_id=?input.actor_id, ?variant, "actor sleeping due to clean shutdown");
}

The CrashPolicy::Destroy case (the _ catch-all at lines 695-699) could be made explicit for clarity:

CrashPolicy::Destroy => {
    ctx.activity(runtime::SetCompleteInput {}).await?;
    return Ok(StoppedResult::Destroy);
}

Security & Performance

✅ No security concerns identified
✅ No performance implications - same code paths, just corrected control flow
✅ No new dependencies or unsafe code

Verdict

LGTM 🚀

This is a good bug fix that corrects the actor lifecycle behavior. The change is minimal, focused, and improves both correctness and observability. The suggestions above are optional enhancements.

The title accurately describes the fix: actors with sleep policy will no longer be incorrectly destroyed on runner shutdown.

NathanFlurry · 2025-11-20T20:54:50Z

See #3501

claude · 2025-11-20T20:55:37Z

Code Review - PR #3499

Summary

This PR fixes a bug where actors with CrashPolicy::Sleep were being destroyed instead of sleeping when a runner shuts down cleanly. The fix correctly handles both crash and clean shutdown scenarios.

✅ Code Quality: Excellent

What Changed:

Before: Matched on (input.crash_policy, failed) tuple, only handling (CrashPolicy::Sleep, true)
After: Matches on input.crash_policy alone, handling both failed and non-failed cases within CrashPolicy::Sleep

Why This Matters:
Previously, when an actor stopped cleanly (StopCode::Ok) with CrashPolicy::Sleep, it fell through to the wildcard pattern (_) at line 695, which calls SetCompleteInput and returns StoppedResult::Destroy. This destroyed the actor instead of putting it to sleep, contradicting the intended semantics of the sleep policy.

🎯 Correctness

The fix is semantically correct:

CrashPolicy::Restart → Always restart (already worked correctly)
CrashPolicy::Sleep → Always sleep (now fixed to work correctly)
CrashPolicy::Destroy → Always destroy (still handled by wildcard)

The failed flag is now correctly used only for observability (choosing the right log message) rather than for control flow.

📊 Improvements Made

Better observability: Distinguishes between crash-induced sleep and clean shutdown sleep in logs
Follows conventions: Uses structured logging (?actor_id) per CLAUDE.md guidelines
Simplified logic: Removing the tuple match makes the code easier to understand

🔍 Observations & Suggestions

Test Coverage Gap: The pegboard package lacks test files. Consider adding integration tests for:
- Actor lifecycle with different crash policies
- Runner shutdown scenarios
- Verification that state.sleeping is set correctly
Related to fix(pb): stop actors from destroying on goingaway #3501: I see this is part of a stack addressing runner shutdown issues. Good approach to fix related bugs together.
Minor Enhancement (optional): The wildcard pattern at line 695 could be made explicit for clarity:
```
CrashPolicy::Destroy => {
    ctx.activity(runtime::SetCompleteInput {}).await?;
    return Ok(StoppedResult::Destroy);
}
```
This makes it clear that Destroy is the only remaining case, not an oversight.
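
A further benefit of the explicit arm is compiler-checked exhaustiveness. The standalone sketch below uses hypothetical stand-in types (`Outcome` and `on_stopped` are not from the PR). With no wildcard arm, adding a fourth variant to the enum would make the match fail to compile with a non-exhaustive patterns error instead of silently routing the new variant to Destroy.

```rust
// Hypothetical stand-ins; only the CrashPolicy variant names come from the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CrashPolicy {
    Restart,
    Sleep,
    Destroy,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Reschedule,
    Sleep,
    Destroy,
}

/// Every policy is named explicitly. There is no `_` arm, so the compiler
/// rejects this match if a new CrashPolicy variant is added and not handled.
pub fn on_stopped(policy: CrashPolicy) -> Outcome {
    match policy {
        CrashPolicy::Restart => Outcome::Reschedule,
        CrashPolicy::Sleep => Outcome::Sleep,
        CrashPolicy::Destroy => Outcome::Destroy,
    }
}
```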

🔒 Security & Performance

✅ No security concerns
✅ No performance impact (same execution paths, just corrected routing)
✅ No new dependencies

✨ Verdict: LGTM

This is a clean, focused bug fix that addresses a critical issue in actor lifecycle management. The change is minimal, well-reasoned, and improves both correctness and observability.

Recommendation: Approve and merge as part of the stack.

NathanFlurry mentioned this pull request Nov 20, 2025

chore(rivetkit): fix hibernation implementation #3494

Closed

This was referenced Nov 20, 2025

fix(pegboard-gateway): write hws immediately on socket hibernate #3496

Closed

fix(rivetkit): fix graceful runner shutdown #3497

Closed

fix(pegboard-gateway): ping hibernating requests both during open hws connections and during hibernation #3498

Closed

NathanFlurry marked this pull request as ready for review November 20, 2025 18:24

NathanFlurry mentioned this pull request Nov 20, 2025

chore(rivetkit): make action cancelled error more descriptive #3500

Closed

MasterPtato force-pushed the 11-19-fix_pegboard-gateway_ping_hibernating_requests_both_during_open_hws_connections_and_during_hibernation branch from 5552b74 to 815c918 Compare November 20, 2025 19:38

This was referenced Nov 20, 2025

fix(pb): stop actors from destroying on goingaway #3501

Closed

fix(pb): add ability to timeout force resceduling pending state #3502

Closed

fix(pegboard): fix actors getting destroyed on runner shutdown

3b027cc

NathanFlurry force-pushed the 11-19-fix_pegboard-gateway_ping_hibernating_requests_both_during_open_hws_connections_and_during_hibernation branch from 815c918 to 986289f Compare November 20, 2025 20:51

NathanFlurry force-pushed the 11-19-fix_pegboard_fix_actors_getting_destroyed_on_runner_shutdown branch from 0f48573 to 3b027cc Compare November 20, 2025 20:51

vercel bot had a problem deploying to Preview – rivetkit-serverless November 20, 2025 20:52 Failure

NathanFlurry closed this Nov 20, 2025

This was referenced Nov 20, 2025

chore(rivetkit): fix lockfile #3503

Closed

fix(rivetkit): uri encode actor ids to fix cloudflare workers actor id with generation #3504

Closed

MasterPtato mentioned this pull request Nov 20, 2025

fix(gateway): prevent gc from removing hibernating in flight req, check actor started after sub when hibernating #3505

Closed

This was referenced Nov 20, 2025

feat(cloudflare-workers): add createInlineClient #3466

Closed

chore(examples): add counter-next-js example #3506

Closed

fix(engine): ensure first keepalive tick is not skipped to prevent timeout on ws hibernation #3507

Closed

MasterPtato mentioned this pull request Nov 21, 2025

chore: remove msg id parts from protocol, move id utils to runner protocol crate #3508

Closed

This was referenced Nov 21, 2025

chore(rivetkit): buffer outbound tunnel messages #3509

Closed

docs: runner lifecycle #3510

Closed

chore(engine): upgrade vbare 0.0.4 #3511

Closed

This was referenced Nov 21, 2025

fix(runner-protocol): add noop converters for v1 -> v3 #3512

Closed

fix(runner-protocol): fix decoding gateway & req id from req id instead of message id #3513

Closed

chore(pegboard-runner): log runner messages #3514

Closed
