Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Nov 20, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Error Error Nov 20, 2025 8:52pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 20, 2025 8:52pm
rivet-inspector Ignored Ignored Preview Nov 20, 2025 8:52pm
rivet-site Ignored Ignored Preview Nov 20, 2025 8:52pm

Copy link
Member Author

NathanFlurry commented Nov 20, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Nov 20, 2025

Code Review

Summary

This PR fixes a critical bug where actors with CrashPolicy::Sleep were being incorrectly put to sleep even on clean shutdowns (when a runner shuts down gracefully). The fix changes the match logic to handle both crash and clean shutdown scenarios appropriately.

Analysis

Previous Behavior (Bug):

  • The code matched on (input.crash_policy, failed) tuple
  • For CrashPolicy::Sleep, only the (CrashPolicy::Sleep, true) arm was handled
  • When failed = false (clean shutdown, e.g., runner shutdown), the code fell through to the wildcard _ pattern at line 695
  • This wildcard arm calls SetCompleteInput and returns StoppedResult::Destroy, which destroys the actor instead of putting it to sleep

New Behavior (Fix):

  • Now matches only on input.crash_policy
  • The CrashPolicy::Sleep arm handles both failed = true (crash) and failed = false (clean shutdown)
  • Both scenarios now correctly put the actor to sleep
  • Different log messages distinguish between crash and clean shutdown scenarios

Code Quality: ✅ Excellent

Strengths:

  1. Correct logic fix: Aligns behavior with the intended semantics of CrashPolicy::Sleep - actors should sleep regardless of whether they crashed or stopped cleanly
  2. Improved observability: Added helpful logging to distinguish between crash-induced sleep vs clean shutdown sleep
  3. Clean implementation: Simplified the match statement by removing the unnecessary tuple pattern
  4. Follows conventions: Uses structured logging with ?actor_id as per CLAUDE.md guidelines

Potential Issues: ⚠️ Minor Considerations

  1. Test Coverage: There are no test files in the pegboard package. This critical actor lifecycle logic would benefit from integration tests covering:

    • Actor with CrashPolicy::Sleep stopping due to crash (StopCode != Ok)
    • Actor with CrashPolicy::Sleep stopping cleanly (StopCode::Ok)
    • Actor with CrashPolicy::Sleep stopping due to runner shutdown
    • Verification that state.sleeping = true is set in all cases
    • Verification that actors are NOT destroyed
  2. Documentation: Consider adding a comment explaining the semantic difference between:

    • CrashPolicy::Restart: Reschedule immediately (regardless of clean/crash)
    • CrashPolicy::Sleep: Put to sleep (regardless of clean/crash)
    • CrashPolicy::Destroy: Destroy (handled by wildcard)

    This would help future maintainers understand the intent.

Performance: ✅ No Impact

The change simplifies the match logic and has no performance implications.

Security: ✅ No Concerns

No security implications from this change.

Recommendation: ✅ Approve with Suggestion

The fix is correct and addresses a critical bug. The code quality is excellent. I recommend:

  1. Short-term: Merge this fix as-is since it resolves the immediate issue
  2. Medium-term: Add integration tests to prevent regression
  3. Optional: Add explanatory comments about crash policy semantics

Great catch on this bug! The PR title accurately describes the fix, and the implementation is clean.

@NathanFlurry NathanFlurry marked this pull request as ready for review November 20, 2025 18:24
@MasterPtato MasterPtato force-pushed the 11-19-fix_pegboard-gateway_ping_hibernating_requests_both_during_open_hws_connections_and_during_hibernation branch from 5552b74 to 815c918 Compare November 20, 2025 19:38
@NathanFlurry NathanFlurry force-pushed the 11-19-fix_pegboard-gateway_ping_hibernating_requests_both_during_open_hws_connections_and_during_hibernation branch from 815c918 to 986289f Compare November 20, 2025 20:51
@NathanFlurry NathanFlurry force-pushed the 11-19-fix_pegboard_fix_actors_getting_destroyed_on_runner_shutdown branch from 0f48573 to 3b027cc Compare November 20, 2025 20:51
@claude
Copy link

claude bot commented Nov 20, 2025

Code Review

Summary

This PR fixes a bug where actors with CrashPolicy::Sleep were being destroyed on clean shutdown (e.g., runner shutdown) instead of sleeping. The fix changes the logic to enter sleep mode for any stopped actor with CrashPolicy::Sleep, not just failed ones.

Analysis

The Bug:
The previous code only put actors to sleep if they had CrashPolicy::Sleep AND failed:

match (input.crash_policy, failed) {
    (CrashPolicy::Sleep, true) => { /* sleep */ }
    // ...
}

This meant that actors with CrashPolicy::Sleep that stopped cleanly (e.g., due to runner shutdown with StopCode::Ok) would fall through to the default case and potentially be destroyed, which contradicts the intended behavior of the sleep policy.

The Fix:
The new code correctly handles sleep policy regardless of failure status:

match input.crash_policy {
    CrashPolicy::Sleep => {
        if failed {
            tracing::debug!("actor sleeping due to crash");
        } else {
            tracing::debug!("actor sleeping due to clean shutdown");
        }
        state.sleeping = true;
        // ...
    }
}

✅ Strengths

  1. Correct behavior: The fix aligns with the semantic meaning of CrashPolicy::Sleep - actors should sleep when stopped, regardless of the reason
  2. Improved logging: The added log message for clean shutdowns improves observability
  3. Consistent pattern: The CrashPolicy::Restart case already worked this way (always restarting, not checking failed)
  4. Minimal change: The fix is surgical and doesn't introduce unnecessary changes

🔍 Observations

  1. No test coverage: There don't appear to be unit tests for the pegboard actor lifecycle. Consider adding tests for:

    • Actor with CrashPolicy::Sleep stopping cleanly
    • Actor with CrashPolicy::Sleep crashing
    • Runner shutdown scenarios
  2. Error handling pattern: The code follows the project's error handling conventions using anyhow properly

  3. Logging follows conventions: Uses structured logging with ?actor_id parameter and lowercase messages as per CLAUDE.md

Minor Suggestions

  1. Consider enhancing the log messages to include the stop code for better debugging:

    if failed {
        tracing::debug!(actor_id=?input.actor_id, ?variant, "actor sleeping due to crash");
    } else {
        tracing::debug!(actor_id=?input.actor_id, ?variant, "actor sleeping due to clean shutdown");
    }
  2. The CrashPolicy::Destroy case (the _ catch-all at line 695-699) could be made explicit for clarity:

    CrashPolicy::Destroy => {
        ctx.activity(runtime::SetCompleteInput {}).await?;
        return Ok(StoppedResult::Destroy);
    }

Security & Performance

  • ✅ No security concerns identified
  • ✅ No performance implications - same code paths, just corrected control flow
  • ✅ No new dependencies or unsafe code

Verdict

LGTM 🚀

This is a good bug fix that corrects the actor lifecycle behavior. The change is minimal, focused, and improves both correctness and observability. The suggestions above are optional enhancements.

The title accurately describes the fix: actors with sleep policy will no longer be incorrectly destroyed on runner shutdown.

Copy link
Member Author

See #3501

@claude
Copy link

claude bot commented Nov 20, 2025

Code Review - PR #3499

Summary

This PR fixes a bug where actors with CrashPolicy::Sleep were being destroyed instead of sleeping when a runner shuts down cleanly. The fix correctly handles both crash and clean shutdown scenarios.


Code Quality: Excellent

What Changed:

  • Before: Matched on (input.crash_policy, failed) tuple, only handling (CrashPolicy::Sleep, true)
  • After: Matches on input.crash_policy alone, handling both failed and non-failed cases within CrashPolicy::Sleep

Why This Matters:
Previously, when an actor stopped cleanly (StopCode::Ok) with CrashPolicy::Sleep, it fell through to the wildcard pattern (_) at line 695, which calls SetCompleteInput and returns StoppedResult::Destroy. This destroyed the actor instead of putting it to sleep, contradicting the intended semantics of the sleep policy.


🎯 Correctness

The fix is semantically correct:

  1. CrashPolicy::Restart → Always restart (already worked correctly)
  2. CrashPolicy::Sleep → Always sleep (now fixed to work correctly)
  3. CrashPolicy::Destroy → Always destroy (still handled by wildcard)

The failed flag is now correctly used only for observability (choosing the right log message) rather than for control flow.


📊 Improvements Made

  1. Better observability: Distinguishes between crash-induced sleep and clean shutdown sleep in logs
  2. Follows conventions: Uses structured logging (?actor_id) per CLAUDE.md guidelines
  3. Simplified logic: Removing the tuple match makes the code easier to understand

🔍 Observations & Suggestions

  1. Test Coverage Gap: The pegboard package lacks test files. Consider adding integration tests for:

    • Actor lifecycle with different crash policies
    • Runner shutdown scenarios
    • Verification that state.sleeping is set correctly
  2. Related to fix(pb): stop actors from destroying on goingaway #3501: I see this is part of a stack addressing runner shutdown issues. Good approach to fix related bugs together.

  3. Minor Enhancement (optional): The wildcard pattern at line 695 could be made explicit for clarity:

    CrashPolicy::Destroy => {
        ctx.activity(runtime::SetCompleteInput {}).await?;
        return Ok(StoppedResult::Destroy);
    }

    This makes it clear that Destroy is the only remaining case, not an oversight.


🔒 Security & Performance

  • ✅ No security concerns
  • ✅ No performance impact (same execution paths, just corrected routing)
  • ✅ No new dependencies

Verdict: LGTM

This is a clean, focused bug fix that addresses a critical issue in actor lifecycle management. The change is minimal, well-reasoned, and improves both correctness and observability.

Recommendation: Approve and merge as part of the stack.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants