@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 20, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Error Error Nov 20, 2025 8:56pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 20, 2025 8:56pm
rivet-inspector Ignored Ignored Preview Nov 20, 2025 8:56pm
rivet-site Ignored Ignored Preview Nov 20, 2025 8:56pm

Contributor Author

MasterPtato commented Nov 20, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 20, 2025

Code Review

Overview

This PR fixes an issue where actors were incorrectly being destroyed when receiving a GoingAway signal. The key change is moving the state.going_away = false reset to the end of the handle_stopped function and incorporating the going_away flag into the restart/sleep decision logic.

Changes Analysis

The Fix:
The PR makes two important changes:

  1. Moved going_away reset (line 712): Previously cleared at line 586, now cleared at line 712 after all reschedule logic
  2. Updated graceful exit logic (lines 660-665): Now considers an exit graceful only if !state.going_away && StopCode::Ok

Logic Flow:

  • Previously: failed = !StopCode::Ok → restart/sleep on failed == true
  • Now: graceful_exit = !going_away && StopCode::Ok → restart/sleep on graceful_exit == false

This is semantically equivalent BUT with the critical addition that going_away now prevents the exit from being considered graceful.
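The relationship between the two predicates can be sketched as follows. This is only an illustration: `StopCode::Error` and `StoppedVariant::Lost` are assumed stand-ins for whatever other variants the real enums carry, and `failed_old` / `graceful_exit_new` are hypothetical helper names. Only the boolean expressions follow the snippets quoted in this thread.

```rust
// Simplified stand-ins for the real protocol types. Assumption: the actual
// enums have more variants and fields than the ones shown here.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StopCode {
    Ok,
    Error, // hypothetical non-Ok code
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StoppedVariant {
    Normal { code: StopCode },
    Lost, // hypothetical non-normal stop
}

/// Old predicate: anything other than a normal `Ok` stop counts as failed.
pub fn failed_old(variant: StoppedVariant) -> bool {
    !matches!(variant, StoppedVariant::Normal { code: StopCode::Ok })
}

/// New predicate: a stop is graceful only if the actor was not marked
/// going_away and it stopped normally with `Ok`.
pub fn graceful_exit_new(going_away: bool, variant: StoppedVariant) -> bool {
    !going_away && matches!(variant, StoppedVariant::Normal { code: StopCode::Ok })
}
```

With `going_away == false`, `!graceful_exit_new(..)` agrees with `failed_old(..)` for every variant. The two differ only for a normal `Ok` stop while `going_away == true`, which the old predicate treated as a success and the new one treats as non-graceful.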

Findings

Logic Correctness

The inverted boolean logic is correct:

  • Old: (CrashPolicy::Restart, failed=true) → reschedule
  • New: (CrashPolicy::Restart, graceful_exit=false) → reschedule

The key improvement is that when going_away == true, even with StopCode::Ok, the exit is treated as non-graceful, triggering the crash policy (restart/sleep) instead of destroying the actor.

🤔 Semantic Clarity Concern

The variable naming could be clearer:

  • graceful_exit suggests a normal shutdown, but the logic uses it to determine whether to apply crash policies
  • When graceful_exit == false, we restart/sleep (crash handling)
  • When graceful_exit == true, we destroy the actor

This is somewhat counterintuitive. A clearer name might be should_handle_as_crash or needs_rescheduling, with the boolean inverted:

let needs_rescheduling = state.going_away || !matches!(
    variant,
    StoppedVariant::Normal { code: protocol::StopCode::Ok }
);

match (input.crash_policy, needs_rescheduling) {
    (CrashPolicy::Restart, true) => { /* ... */ }
    (CrashPolicy::Sleep, true) => { /* ... */ }
    // ...
}

However, this is a minor style preference and doesn't affect correctness.

State Management

Moving state.going_away = false to the end (line 712) is correct because:

  • It ensures the flag is available throughout the reschedule decision logic
  • It's cleared alongside other transient flags (wake_for_alarm, will_wake)
  • Placement makes the lifecycle state cleanup more obvious

🟡 Comment Placement

The comment "Reset retry count on successful exit" (line 575) was moved from above the match arm to inside it. This is fine, though having it above the arm was slightly more visible. Not a significant issue.

Edge Cases

The logic correctly handles:

  • StopCode::Ok with going_away == true → treated as crash (reschedule/sleep)
  • StopCode::Ok with going_away == false → graceful exit (destroy)
  • Non-OK stop codes → always treated as crash regardless of going_away
  • force_reschedule path bypasses this logic entirely (correct)

Questions for Consideration

  1. Test Coverage: Are there tests that verify actors restart/sleep on GoingAway instead of being destroyed? This would be valuable to prevent regression.

  2. Documentation: Should there be a comment explaining why going_away makes an otherwise successful exit non-graceful? The intent might not be obvious to future maintainers.

  3. StopCode variants: Are there other StopCode variants besides Ok? Understanding the full enum would help verify the match logic is complete.

Verdict

Approve - The logic is correct and fixes the described issue. The only suggestions are minor improvements to clarity:

  • Consider renaming graceful_exit → needs_rescheduling (inverted) for better semantic clarity
  • Consider adding a comment explaining the going_away behavior
  • Ensure test coverage for this scenario

The fix properly ensures actors don't get destroyed when their runner is going away, allowing them to be rescheduled according to the crash policy instead.

@claude

claude bot commented Nov 20, 2025

Code Review

Summary

This PR fixes a critical bug where actors were being incorrectly destroyed when receiving a GoingAway signal. The fix inverts the logic to properly distinguish between graceful exits and crashes/forced stops.


✅ Positive Aspects

  1. Critical Bug Fix: The core fix correctly addresses the issue where going_away state was preventing proper rescheduling
  2. Logical Improvement: The new graceful_exit variable is much clearer than the previous failed variable - it better expresses intent
  3. Correct State Management: Moving state.going_away = false to the end of the function (line 713) ensures the flag is cleared after all rescheduling logic completes

🔍 Code Analysis

Main Logic Change (lines 660-668, 668-695)

Before:

let failed = !matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });
match (input.crash_policy, failed) {
    (CrashPolicy::Restart, true) => { /* reschedule */ }
    (CrashPolicy::Sleep, true) => { /* sleep */ }
    _ => { /* destroy */ }
}

After:

let graceful_exit = !state.going_away
    && matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });
match (input.crash_policy, graceful_exit) {
    (CrashPolicy::Restart, false) => { /* reschedule */ }
    (CrashPolicy::Sleep, false) => { /* sleep */ }
    _ => { /* destroy */ }
}

Analysis:

  • The old logic would trigger rescheduling/sleeping on ANY non-OK stop code
  • The new logic properly distinguishes: a graceful exit (StopCode::Ok AND not going_away) should destroy the actor
  • When going_away=true, the actor is being intentionally stopped (e.g., runner shutdown), but should still respect crash policy for rescheduling
  • This fixes the reported bug where actors were destroyed prematurely during runner maintenance

State Flag Ordering (lines 584-586 → 711-713)

Before:

state.gc_timeout_ts = None;
state.going_away = false;  // ❌ Cleared too early
state.stopping = false;

After:

state.gc_timeout_ts = None;
state.stopping = false;
// ... rescheduling logic ...
state.going_away = false;  // ✅ Cleared after all logic

Why this matters: The going_away flag is used in the rescheduling logic (line 660), so it must remain set until after that logic executes.
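A minimal sketch of why the reset position matters once the decision reads the flag. The `LifecycleState` struct here is reduced to two fields, `Outcome` and both `handle_stopped_*` functions are hypothetical, and a restart-style policy is assumed throughout. It is not the real `handle_stopped`.

```rust
// Reduced stand-in for the workflow state (assumption: the real struct has
// many more fields).
#[derive(Debug, Default)]
pub struct LifecycleState {
    pub going_away: bool,
    pub stopping: bool,
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Reschedule,
    Destroy,
}

// Hypothetical decision step, assuming a restart-style crash policy.
fn decide(state: &LifecycleState, stopped_ok: bool) -> Outcome {
    let graceful_exit = !state.going_away && stopped_ok;
    if graceful_exit {
        Outcome::Destroy
    } else {
        Outcome::Reschedule
    }
}

/// Clears `going_away` before the decision, so the decision always sees `false`.
pub fn handle_stopped_reset_early(state: &mut LifecycleState, stopped_ok: bool) -> Outcome {
    state.stopping = false;
    state.going_away = false;
    decide(state, stopped_ok)
}

/// Clears `going_away` only after the decision has read it.
pub fn handle_stopped_reset_late(state: &mut LifecycleState, stopped_ok: bool) -> Outcome {
    state.stopping = false;
    let outcome = decide(state, stopped_ok);
    state.going_away = false;
    outcome
}
```

In this sketch, an actor with `going_away == true` that stops with `Ok` is destroyed by the early-reset version and rescheduled by the late-reset version, while both leave the flag cleared afterwards.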


🎯 Semantic Correctness

The fix aligns with the intended actor lifecycle:

  1. GoingAway signal → Sets going_away = true + sends stop command (lines 472-508)
  2. Actor stops → Enters handle_stopped with going_away = true
  3. Rescheduling logic → Should treat this as a non-graceful exit (crash policy applies)
  4. Cleanup → Reset going_away = false for next lifecycle

Crash Policy Semantics:

  • CrashPolicy::Restart: Always restart on non-graceful exits
  • CrashPolicy::Sleep: Sleep on non-graceful exits, destroy on graceful
  • CrashPolicy::Destroy: Always destroy (default)

The new code correctly implements these semantics.
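The policy semantics listed above can be written out as a small decision function. This is a sketch under assumptions: the `Outcome` enum and the `decide` function are hypothetical, and the real match arms contain rescheduling and sleep logic rather than returning a plain value. The `CrashPolicy` variant names are taken from this thread.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CrashPolicy {
    Restart,
    Sleep,
    Destroy,
}

// Hypothetical summary of what each match arm ends up doing.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    Reschedule,
    Sleep,
    Destroy,
}

/// Mirrors the shape of the match quoted in the review: only non-graceful
/// exits trigger the restart or sleep behavior; everything else destroys.
pub fn decide(crash_policy: CrashPolicy, graceful_exit: bool) -> Outcome {
    match (crash_policy, graceful_exit) {
        (CrashPolicy::Restart, false) => Outcome::Reschedule,
        (CrashPolicy::Sleep, false) => Outcome::Sleep,
        _ => Outcome::Destroy,
    }
}
```

The wildcard arm covers both graceful exits under any policy and every exit under `CrashPolicy::Destroy`.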


🧹 Code Quality Notes

  1. Formatting changes: Most other changes are formatting (tabs/spaces, import ordering) - these are style-only
  2. Import ordering: Changes in custom_serve.rs and runner.rs follow Rust conventions (std → external → internal)
  3. Comment placement: Moving the "Reset retry count" comment (line 575) improves readability
  4. Structured logging: Changes in shared_state.rs (lines 197-198) follow the project's tracing conventions from CLAUDE.md

⚠️ Considerations

  1. Test Coverage: I don't see test changes in this PR. Consider adding a test case that:

    • Spawns an actor with CrashPolicy::Restart
    • Sends a GoingAway signal
    • Verifies the actor is rescheduled (not destroyed)
  2. Edge Case: What happens if an actor receives multiple GoingAway signals? The guard at line 482 (if !state.going_away) prevents duplicate processing, which is correct.

  3. Documentation: Consider adding a comment explaining the graceful_exit logic, e.g.:

    // A graceful exit means the actor stopped cleanly (StopCode::Ok) without being
    // forced to stop (going_away=false). In this case, respect the crash policy's
    // destroy semantics. Otherwise, treat as a crash and reschedule accordingly.
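The duplicate-signal guard mentioned in the Edge Case item can be sketched like this. The `on_going_away` function and its boolean return value are hypothetical; the only parts taken from the thread are the `if !state.going_away` check and the fact that the first signal sets the flag and sends a stop command.

```rust
// Reduced stand-in for the workflow state (assumption: one field only).
#[derive(Debug, Default)]
pub struct LifecycleState {
    pub going_away: bool,
}

/// Handles a GoingAway signal idempotently.
/// Returns `true` if this call is the one that should send the stop command.
pub fn on_going_away(state: &mut LifecycleState) -> bool {
    if !state.going_away {
        // First signal: remember it and ask the caller to send a stop command.
        state.going_away = true;
        true
    } else {
        // Repeated signal: already being handled, nothing more to do.
        false
    }
}
```

Because the flag now stays set until the end of the stop handling, repeated signals arriving in that window are ignored by the same check.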

📊 Impact Assessment

Risk Level: Low

  • Changes are localized to actor lifecycle management
  • Logic inversion is straightforward and well-reasoned
  • No changes to external APIs or data structures

Performance: Neutral (no performance impact)

Security: No security concerns identified


✨ Recommendation

APPROVE with suggestion to add test coverage for the going_away scenario.

This is a well-reasoned fix that correctly addresses the reported issue. The logic is clearer and more maintainable than before.

@NathanFlurry changed the base branch from 11-19-fix_pegboard-gateway_ping_hibernating_requests_both_during_open_hws_connections_and_during_hibernation to graphite-base/3501 on November 20, 2025 20:51
@NathanFlurry force-pushed the 11-20-fix_pb_stop_actors_from_destroying_on_goingaway branch from a78fb2c to c20b8cd on November 20, 2025 20:55
@NathanFlurry changed the base branch from graphite-base/3501 to 11-19-fix_pegboard-gateway_ping_hibernating_requests_both_during_open_hws_connections_and_during_hibernation on November 20, 2025 20:55
@claude

claude bot commented Nov 20, 2025

Code Review

Summary

This PR fixes a critical bug where actors were being destroyed instead of rescheduled when they stopped during a runner's graceful shutdown ("going away" state). The fix correctly distinguishes between graceful exits that should trigger destruction versus non-graceful exits that should trigger rescheduling.

Changes Overview

The main logic change is in handle_stopped function (engine/packages/pegboard/src/workflows/actor/mod.rs:660-713):

Before:

  • The code treated only non-StopCode::Ok stops as failures that trigger restart/sleep, so a StopCode::Ok stop during going_away fell through to destroy
  • going_away flag was cleared too early (line 586), before the rescheduling logic

After:

  • Introduces graceful_exit condition that checks both !state.going_away AND StopCode::Ok
  • Only truly graceful exits (not during shutdown) with StopCode::Ok avoid rescheduling
  • going_away flag is now cleared at the end (line 713), after all rescheduling decisions

Logic Analysis

The inverted crash policy matching is correct:

match (input.crash_policy, graceful_exit) {
    (CrashPolicy::Restart, false) => { /* reschedule */ }
    (CrashPolicy::Sleep, false) => { /* sleep */ }
    _ => { /* destroy */ }
}

This means:

  • Restart policy + non-graceful exit → reschedule the actor ✅
  • Sleep policy + non-graceful exit → put actor to sleep ✅
  • Any policy + graceful exit → destroy (normal termination) ✅
  • Destroy policy + any exit → destroy ✅

Positive Aspects

  1. Correct State Management: Moving going_away = false to line 713 ensures the flag is available for the rescheduling decision
  2. Clear Logic: The graceful_exit variable makes the intent explicit
  3. Consistent Formatting: The PR includes formatting fixes that improve readability
  4. Proper Import Ordering: Import statements are now properly ordered

Potential Concerns

1. State Transition Safety ⚠️

The going_away flag is now reset after rescheduling decisions. This is correct for the current flow, but consider:

  • What happens if an actor receives another GoingAway signal while it's being rescheduled?
  • Is there a race condition window between rescheduling and flag reset?

Recommendation: Add a comment explaining why going_away is reset at the end of the function.

2. Missing Test Coverage

This is a critical state machine bug fix, but I don't see corresponding test changes. Key scenarios to test:

  • Actor with CrashPolicy::Restart stops with StopCode::Ok during going_away → should reschedule
  • Actor with CrashPolicy::Restart stops with StopCode::Ok NOT during going_away → should destroy
  • Actor with CrashPolicy::Destroy stops during going_away → should destroy

Recommendation: Add integration tests covering the going_away + crash policy combinations.
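The three scenarios above could be captured in a table-driven check along these lines. This is a sketch only: `stop_outcome`, `Outcome`, and `scenarios` are hypothetical names, the actor's stop is reduced to a `stopped_ok` boolean, and a real integration test would drive the workflow instead of calling a pure function.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CrashPolicy {
    Restart,
    Sleep,
    Destroy,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    Reschedule,
    Sleep,
    Destroy,
}

/// Hypothetical pure version of the decision, combining the flag and stop code.
pub fn stop_outcome(policy: CrashPolicy, going_away: bool, stopped_ok: bool) -> Outcome {
    let graceful_exit = !going_away && stopped_ok;
    match (policy, graceful_exit) {
        (CrashPolicy::Restart, false) => Outcome::Reschedule,
        (CrashPolicy::Sleep, false) => Outcome::Sleep,
        _ => Outcome::Destroy,
    }
}

/// (policy, going_away, stopped_ok, expected outcome) for each listed scenario.
pub fn scenarios() -> Vec<(CrashPolicy, bool, bool, Outcome)> {
    vec![
        // Restart policy, Ok stop during going_away: reschedule.
        (CrashPolicy::Restart, true, true, Outcome::Reschedule),
        // Restart policy, Ok stop without going_away: destroy.
        (CrashPolicy::Restart, false, true, Outcome::Destroy),
        // Destroy policy, stop during going_away: destroy.
        (CrashPolicy::Destroy, true, true, Outcome::Destroy),
    ]
}
```

Each row can then be asserted in a loop, which keeps new policy and flag combinations cheap to add.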

3. Documentation Reference

The fix aligns with docs/engine/RUNNER_SHUTDOWN.md, which documents that GoingAway is sent during runner shutdown. However, the documentation doesn't explicitly state the expected actor behavior.

Recommendation: Update RUNNER_SHUTDOWN.md to clarify that actors should be rescheduled to other runners during shutdown, not destroyed.

Minor Issues

  1. Import Ordering Changes: While the import reordering (e.g., guard-core/src/custom_serve.rs) follows conventions, these seem unrelated to the bug fix. Consider separating formatting changes from logic changes in future PRs for easier review.

  2. Spacing Changes: Changes like the if-let reformatting in proxy_service.rs:612-627 are style improvements but make the diff larger.

Security & Performance

  • ✅ No security concerns identified
  • ✅ No performance regressions expected
  • ✅ The fix prevents incorrect actor destruction, improving system reliability

Verdict

Approved with recommendations

The core logic fix is correct and addresses a real bug in the actor lifecycle management. The main concerns are:

  1. Missing test coverage for this critical state machine fix
  2. Documentation could be clearer about expected behavior

The code is safe to merge as-is, but would benefit from follow-up work on testing and documentation.


Review generated by Claude Code

@claude

claude bot commented Nov 20, 2025

PR Review: fix(pb): stop actors from destroying on goingaway

Summary

This PR fixes a critical bug where actors were being destroyed when they received a GoingAway signal, even when they had a CrashPolicy::Restart or CrashPolicy::Sleep set. The fix properly distinguishes between graceful exits and exits triggered by the going_away state.


Code Quality & Logic Changes ✅

Main Logic Fix (actor/mod.rs:660-713)
The core fix is solid:

  • Previously, the code checked failed based on whether the stop code was Ok
  • Now it checks for graceful_exit which requires BOTH !state.going_away AND StopCode::Ok
  • The match pattern was inverted from (CrashPolicy, failed) to (CrashPolicy, graceful_exit), flipping the boolean logic correctly

State Management Timing (actor/mod.rs:586,713)

  • Critical observation: state.going_away = false was moved from line 586 to line 713
  • This is the key fix: going_away needs to persist through the entire handle_stopped function to properly inform the crash policy decision
  • Resetting it at line 713 (after all rescheduling logic) is correct

Correctness Analysis ✅

The boolean logic inversion is handled correctly:

Before:

let failed = !matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });
match (input.crash_policy, failed) {
    (CrashPolicy::Restart, true) => { /* restart */ }
    (CrashPolicy::Sleep, true) => { /* sleep */ }
    _ => { /* destroy */ }
}

After:

let graceful_exit = !state.going_away && matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });
match (input.crash_policy, graceful_exit) {
    (CrashPolicy::Restart, false) => { /* restart */ }
    (CrashPolicy::Sleep, false) => { /* sleep */ }
    _ => { /* destroy */ }
}

The match arms correctly flip true → false because !graceful_exit == failed.


Potential Issues & Concerns

1. State Reset Ordering ⚠️

The state.going_away = false reset at line 713 happens AFTER several potential early returns:

  • Line 656: return Ok(StoppedResult::Destroy) (force_reschedule path)
  • Line 676: return Ok(StoppedResult::Destroy) (Restart policy path)
  • Line 693: return Ok(StoppedResult::Destroy) (default path)
  • Line 707: return Ok(StoppedResult::Destroy) (will_wake path)

Question: Should going_away be reset before these early returns? If an actor destroys early, the state persists, but this might be intentional since the workflow is ending anyway.
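The concern can be illustrated with a reduced sketch. Both functions, the `destroy` parameter, and the `StoppedResult::Continue` variant are hypothetical. Whether the real workflow needs the second form is exactly the open question, since the state may never be read again after a destroy.

```rust
// Reduced stand-in for the workflow state (assumption: one field only).
#[derive(Debug, Default)]
pub struct LifecycleState {
    pub going_away: bool,
}

#[derive(Debug, PartialEq)]
pub enum StoppedResult {
    Continue, // hypothetical non-destroy result
    Destroy,
}

/// The reset sits at the end, so an early return skips it.
pub fn handle_stopped_trailing_reset(state: &mut LifecycleState, destroy: bool) -> StoppedResult {
    if destroy {
        return StoppedResult::Destroy; // going_away keeps its previous value
    }
    state.going_away = false;
    StoppedResult::Continue
}

/// Alternative: compute the result first, then reset on every path.
pub fn handle_stopped_reset_always(state: &mut LifecycleState, destroy: bool) -> StoppedResult {
    let result = if destroy {
        StoppedResult::Destroy
    } else {
        StoppedResult::Continue
    };
    state.going_away = false;
    result
}
```

The second form trades the early returns for a single exit point, which makes the cleanup unconditional at the cost of restructuring the function.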

2. Test Coverage ⚠️

No test files were found in the actor workflow directory. For a critical bug fix affecting actor lifecycle management, it would be valuable to have tests covering:

  • Actor with CrashPolicy::Restart receiving GoingAway → should restart, not destroy
  • Actor with CrashPolicy::Sleep receiving GoingAway → should sleep, not destroy
  • Actor with CrashPolicy::Destroy receiving GoingAway → should destroy (existing behavior)

3. Comment Placement (actor/mod.rs:575-576)

The comment // Reset retry count on successful exit was moved inside the match arm, which is fine but makes it less visible. Consider if this affects code clarity.


Code Style & Formatting ✅

The formatting changes (whitespace, indentation, import ordering) align with project conventions:

  • Hard tabs maintained
  • Import ordering fixed in custom_serve.rs:6
  • Structured logging format corrected in shared_state.rs:197-198 (index = instead of index=)
  • Consistent indentation in proxy_service.rs

These follow the CLAUDE.md guidelines.


Performance Considerations ✅

No performance concerns. The logic changes are minimal and don't introduce additional allocations or complexity.


Security Considerations ✅

No security concerns identified. The fix addresses correct actor lifecycle management without introducing vulnerabilities.


Recommendations

  1. Add integration tests for the going_away + crash policy interaction
  2. Verify the early return behavior - confirm that going_away state doesn't need cleanup when actors destroy early
  3. Consider adding a trace log at line 660 to help debug similar issues:
    tracing::debug!(
        ?variant,
        going_away = state.going_away,
        graceful_exit,
        "evaluating rescheduling behavior"
    );

Verdict: ✅ Approve with minor suggestions

The core fix is correct and addresses the reported issue. The logic inversion is sound, and the state management timing is appropriate. The main improvement would be adding test coverage for this critical path.

Great job identifying and fixing this subtle state management bug!

@NathanFlurry marked this pull request as ready for review on November 20, 2025 21:30
@MasterPtato force-pushed the 11-20-fix_pb_stop_actors_from_destroying_on_goingaway branch from c20b8cd to a78fb2c on November 20, 2025 22:26
@MasterPtato force-pushed the 11-19-fix_pegboard-gateway_ping_hibernating_requests_both_during_open_hws_connections_and_during_hibernation branch from 986289f to 815c918 on November 20, 2025 22:26
@claude

claude bot commented Nov 20, 2025

Code Review

Summary

This PR fixes a critical bug where actors were being incorrectly destroyed when receiving a GoingAway signal. The fix properly handles the going_away flag lifecycle and adjusts the crash policy logic to consider graceful exits vs. failures.


Key Changes Analysis

1. Core Logic Fix in handle_stopped (actor/mod.rs:660-713)

Before:

let failed = !matches!(
    variant,
    StoppedVariant::Normal { code: protocol::StopCode::Ok }
);
match (input.crash_policy, failed) {
    (CrashPolicy::Restart, true) => { /* reschedule */ }
    (CrashPolicy::Sleep, true) => { /* sleep */ }
    _ => { /* destroy */ }
}
state.going_away = false; // Reset at line 586

After:

let graceful_exit = !state.going_away
    && matches!(
        variant,
        StoppedVariant::Normal { code: protocol::StopCode::Ok }
    );
match (input.crash_policy, graceful_exit) {
    (CrashPolicy::Restart, false) => { /* reschedule */ }
    (CrashPolicy::Sleep, false) => { /* sleep */ }
    _ => { /* destroy */ }
}
state.going_away = false; // Reset at line 713

Analysis:

  • Correct logic inversion: The new graceful_exit variable properly captures the intent. An actor exits gracefully only if:
    1. It's NOT going away (!state.going_away)
    2. AND it stopped with StopCode::Ok
  • Flag reset timing: Moving state.going_away = false from line 586 to line 713 ensures the flag is available throughout the stop handling logic
  • Proper crash policy handling:
    • CrashPolicy::Restart + non-graceful exit → reschedule (correct)
    • CrashPolicy::Sleep + non-graceful exit → sleep (correct)
    • Graceful exit → destroy and mark complete (correct)

Potential Issue:
The variable naming could be clearer. graceful_exit might be confused with "the actor gracefully exited" when it actually means "this is a normal completion that should result in actor destruction." Consider renaming to normal_completion or should_complete for clarity.


2. Formatting Changes

The PR includes several formatting changes (indentation, spacing):

  • guard-core/src/custom_serve.rs: Import reordering
  • guard-core/src/proxy_service.rs: Indentation fixes, space→tab conversion
  • pegboard-gateway/src/shared_state.rs: Structured logging formatting
  • pegboard/src/workflows/runner.rs: Import reordering

Note: According to CLAUDE.md, cargo fmt should not be run automatically. These formatting changes appear to be incidental. Per the project guidelines:

"DO NOT RUN CARGO FMT AUTOMATICALLY (note for humans: we need to run cargo fmt when everything is merged together and make sure lefthook is working)"


Testing Considerations

⚠️ No tests found: This is a critical bug fix but there are no tests in the pegboard package to verify the behavior.

Recommended test scenarios:

  1. Actor with CrashPolicy::Restart or CrashPolicy::Sleep receives GoingAway signal and stops with StopCode::Ok → should be rescheduled or put to sleep, NOT destroyed
  2. Actor with CrashPolicy::Restart crashes (non-Ok exit) → should reschedule
  3. Actor with CrashPolicy::Sleep crashes → should sleep
  4. Actor completes normally (not going away, StopCode::Ok) → should destroy/complete
  5. going_away flag state transitions through the lifecycle

Security & Performance

No security concerns identified
No performance impact - logic changes only
Correct state management - flag is properly reset after use


Recommendations

  1. High Priority: Add integration tests covering the GoingAway signal behavior and crash policy interactions
  2. Medium Priority: Consider renaming graceful_exit to normal_completion or similar for better code clarity
  3. Low Priority: Document the going_away flag's purpose and lifecycle in the LifecycleState struct comments
  4. Note: Ensure lefthook/pre-commit hooks are working as intended for formatting in future PRs

Verdict

Approved - The core logic fix is correct and addresses the stated issue. The change properly prevents actors from being incorrectly destroyed when receiving a GoingAway signal.

The inversion of the crash policy matching logic (false now triggers restart/sleep instead of true) makes sense because:

  • When graceful_exit = true: Actor completed normally → destroy it
  • When graceful_exit = false: Actor failed or is going away → apply crash policy (restart/sleep)

This PR should be merged after addressing the test coverage concern.

@claude

claude bot commented Nov 20, 2025

PR Review: fix(pb): stop actors from destroying on goingaway

I've reviewed this PR and have some observations and feedback:

Summary

This PR fixes a critical bug where actors were being incorrectly destroyed when receiving a GoingAway signal. The key change is moving the state.going_away = false reset to after the crash policy logic, which prevents actors from being destroyed prematurely.


Code Quality & Logic

Core Fix is Sound

The main logic change at engine/packages/pegboard/src/workflows/actor/mod.rs:660-713 is correct:

Before: The going_away flag was cleared too early (line 586), causing the crash policy logic to treat going_away exits as graceful, leading to actor destruction.

After: The flag is cleared at the end (line 713), allowing the crash policy logic to correctly identify non-graceful exits and trigger rescheduling.

The key improvement is in the graceful_exit calculation:

let graceful_exit = !state.going_away
    && matches!(
        variant,
        StoppedVariant::Normal {
            code: protocol::StopCode::Ok
        }
    );

This correctly treats going_away + Ok as a non-graceful exit (requiring restart/sleep based on crash policy), rather than a normal termination.

Crash Policy Logic Inversion

The inversion of the crash policy matching (graceful_exit instead of failed) is correct:

  • (CrashPolicy::Restart, false) → restart on non-graceful exits
  • (CrashPolicy::Sleep, false) → sleep on non-graceful exits
  • _ → destroy on graceful exits

This aligns with the expected behavior where actors should be rescheduled on crashes/forced stops, not destroyed.


Potential Issues & Considerations

⚠️ State Reset Ordering

The going_away flag is now reset at line 713, after all the rescheduling logic. This is correct for the fix, but consider:

  1. Early returns: If any of the match arms return early (e.g., StoppedResult::Destroy at lines 676, 693, 707), the going_away flag will remain true.

    • Impact: If the actor is somehow resurrected or the state is inspected later, it may appear to still be "going away"
    • Severity: Low - actors that are destroyed shouldn't be inspected again, but worth documenting
  2. State consistency: Consider whether going_away should be reset alongside other state flags like wake_for_alarm and will_wake for consistency, or if it truly belongs at the end.

Recommendation: Add a comment explaining why going_away is reset after the crash policy logic, to prevent future refactors from moving it back up.

🤔 Missing Test Coverage

There don't appear to be explicit tests in this PR for:

  • Actor receiving GoingAway signal with CrashPolicy::Restart → should restart, not destroy
  • Actor receiving GoingAway signal with CrashPolicy::Sleep → should sleep, not destroy
  • Ensuring going_away flag is properly cleared after rescheduling

Recommendation: Add integration tests to prevent regression.

📝 Documentation

The GoingAway signal is used for graceful shutdown scenarios (e.g., runner scaling down). Consider adding:

  • Comment on the going_away field in LifecycleState explaining its purpose
  • Comment in handle_stopped explaining why going_away affects crash policy decisions
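A sketch of what such comments might look like. The field names come from this thread, but their types, the doc wording, and the `is_graceful_exit` helper are assumptions made for illustration, not the actual `LifecycleState` definition.

```rust
/// Reduced sketch of the lifecycle flags discussed in this review.
/// Assumption: field types are guesses and the real struct has more fields.
#[derive(Debug, Default)]
pub struct LifecycleState {
    /// Set when a GoingAway signal is received (for example during runner
    /// shutdown). Must stay set until the stop handling has made its
    /// reschedule decision, then be cleared for the next lifecycle.
    pub going_away: bool,
    /// Set while a stop is in progress.
    pub stopping: bool,
    /// Transient flags cleared at the end of stop handling.
    pub wake_for_alarm: bool,
    pub will_wake: bool,
    /// Assumed to be a timestamp; the unit is not stated in this thread.
    pub gc_timeout_ts: Option<i64>,
}

impl LifecycleState {
    /// An exit is graceful only if the actor stopped with an Ok code and was
    /// not told to go away. A clean stop during going_away was requested by
    /// the platform rather than chosen by the actor, so the crash policy
    /// (restart or sleep) applies instead of destruction.
    pub fn is_graceful_exit(&self, stopped_ok: bool) -> bool {
        !self.going_away && stopped_ok
    }
}
```

Putting the rationale next to the predicate keeps it visible if the reset is ever moved during a refactor.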

Minor Issues

Formatting Changes

The PR includes several formatting changes (tabs/spaces, line breaks). These are consistent with the project's style guide but make the diff slightly harder to review.

Import Ordering

Import ordering changes (e.g., engine/packages/pegboard/src/workflows/runner.rs:5) are minor and follow Rust conventions.


Security & Performance

  • Security: No security concerns identified
  • Performance: No performance impact - this is a logic fix with no algorithmic changes

Recommendations

  1. Approve the core fix - the logic is sound and addresses the bug
  2. 📝 Add comments explaining the going_away reset timing
  3. 🧪 Add test coverage for GoingAway + CrashPolicy scenarios
  4. 📖 Document the going_away field and its interaction with crash policies

Verdict

LGTM with suggestions - The core fix is correct and addresses the bug effectively. The recommendations are for improving maintainability and preventing future regressions, but shouldn't block merging if time-sensitive.

@NathanFlurry force-pushed the 11-20-fix_pb_stop_actors_from_destroying_on_goingaway branch from a78fb2c to c20b8cd on November 20, 2025 22:32
@claude

claude bot commented Nov 20, 2025

Code Review

Summary

This PR fixes a critical bug where actors were being destroyed when they received a GoingAway signal and then exited with StopCode::Ok, regardless of their crash policy. The fix treats an exit as graceful only when the going_away flag is not set.


Positive Changes

1. Correct Logic Fix (Critical)
The core change correctly addresses the bug:

// Before: Considered any non-Ok exit as a failure
let failed = !matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });
match (input.crash_policy, failed) { /* ... */ }

// After: Only treat as non-graceful if going_away flag is set OR exit code is not Ok
let graceful_exit = !state.going_away && matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });
match (input.crash_policy, graceful_exit) { /* ... */ }

This ensures that actors receiving a GoingAway signal (graceful shutdown request) followed by a successful exit are treated as non-graceful exits, so the restart/sleep policies apply instead of destruction.

2. Proper State Management
Moving state.going_away = false to line 713 (after all rescheduling logic) is correct. This ensures the flag persists through the entire stop handling flow and is only cleared once the actor is fully processed.

3. Code Formatting
The formatting changes (indentation, whitespace) improve consistency and follow project standards.


🔍 Potential Issues & Questions

1. Inverted Logic in Match Arms
The boolean was inverted, and the match arms were flipped from true to false to match:

match (input.crash_policy, graceful_exit) {
    (CrashPolicy::Restart, false) => { /* restart logic */ }  // Was: (Restart, true)
    (CrashPolicy::Sleep, false) => { /* sleep logic */ }      // Was: (Sleep, true)
    _ => { /* destroy */ }
}

Question: Is this intentional? The logic now says:

  • CrashPolicy::Restart + non-graceful exit → restart the actor
  • CrashPolicy::Sleep + non-graceful exit → put actor to sleep
  • Graceful exit → destroy/complete the actor

This seems correct for the intended behavior (an exit counts as graceful only when going_away is not set, so actors stopped during runner shutdown are restarted or put to sleep rather than destroyed), but it's a significant semantic change that should be verified.

2. Crash Policy Semantics
The crash policies are named CrashPolicy::Restart and CrashPolicy::Sleep, but they now also apply to clean exits that happen while going_away is set, which are not crashes in the usual sense. Only a StopCode::Ok exit without going_away results in destruction under these policies.

Recommendation: Consider if this behavior should be documented or if crash policies should be renamed to reflect that they also cover stops forced by runner shutdown, not only actual crashes.

3. Missing Test Coverage
There are no unit tests in engine/packages/pegboard/src/workflows/actor/ to cover this critical logic change.

Recommendation: Add tests covering these scenarios:

  • Actor with CrashPolicy::Restart or CrashPolicy::Sleep receives GoingAway, exits with StopCode::Ok → should restart/sleep, not destroy
  • Actor crashes with CrashPolicy::Restart → should restart
  • Actor exits with StopCode::Ok without going_away flag → should destroy
  • going_away flag is properly reset after handling

4. Comment Location
The comment // Reset retry count on successful exit was moved inside the match arm at line 575. While this is fine, it's worth noting that this behavior only applies to normal successful exits, not lost actors.


🎯 Verification Needed

  1. Integration Testing: Has this been tested with actual actor lifecycles to confirm:

    • Actors receiving GoingAway signals are rescheduled or put to sleep per their crash policy instead of being destroyed?
    • Crash policies still work correctly for actual crashes?
    • No edge cases where going_away flag persists incorrectly?
  2. Edge Case: What happens if an actor receives multiple GoingAway signals? The handler at line 482 checks if !state.going_away to prevent duplicate processing, which is good.

  3. Generation Mismatch: The GoingAway handler correctly checks generation at line 474. Verify this works with the new logic.
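The two guards mentioned in items 2 and 3 can be sketched as follows. This is a minimal model, not the actual handler: `ActorState`, `on_going_away`, and the `u32` generation type are hypothetical names chosen for illustration, and the real workflow state carries more fields.

```rust
/// Hypothetical, reduced actor state (assumption: the real state has more fields).
struct ActorState {
    generation: u32,
    going_away: bool,
}

/// Handles a GoingAway signal in this simplified model.
/// Returns true only when the signal actually changed the state.
fn on_going_away(state: &mut ActorState, signal_generation: u32) -> bool {
    // Ignore signals addressed to a different (for example, stale) generation.
    if signal_generation != state.generation {
        return false;
    }
    // Process the signal once; duplicate signals are no-ops.
    if !state.going_away {
        state.going_away = true;
        return true;
    }
    false
}
```

In this model a second GoingAway for the same generation leaves the state unchanged, which is the property the edge-case question above asks to verify against the real handler.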


📝 Minor Issues

Import Ordering
The import reordering in runner.rs and custom_serve.rs is fine, though it's unclear why this was needed. The project uses rustfmt which should handle this automatically (though CLAUDE.md notes not to run it manually).


🔐 Security & Performance

  • ✅ No security concerns identified
  • ✅ No performance regressions expected
  • ✅ State management appears sound
  • ✅ No resource leaks introduced

📚 Recommendations

  1. High Priority: Add integration or unit tests for the new logic
  2. Medium Priority: Document the crash policy behavior with going_away flag
  3. Low Priority: Consider adding a log statement when going_away affects the decision:
    if state.going_away {
        tracing::debug!(actor_id=?input.actor_id, "actor marked going_away, treating as non-graceful exit");
    }

Overall Assessment

This is a good fix for a critical bug. The core logic change is sound and properly addresses the issue where actors were being destroyed incorrectly. The main concerns are around test coverage and verification that the inverted logic matches the intended semantics. Once confirmed through testing, this should be safe to merge.

Approval Status: ✅ Approve with recommendation to add tests in a follow-up PR

@MasterPtato MasterPtato force-pushed the 11-20-fix_pb_stop_actors_from_destroying_on_goingaway branch from c20b8cd to a78fb2c Compare November 20, 2025 22:38
@MasterPtato MasterPtato force-pushed the 11-19-fix_pegboard-gateway_ping_hibernating_requests_both_during_open_hws_connections_and_during_hibernation branch from 986289f to 815c918 Compare November 20, 2025 22:38
@claude

claude bot commented Nov 20, 2025

Code Review

Summary

This PR fixes actor lifecycle management by preventing actors from being destroyed when they're in a "going away" state. The key change inverts the crash policy logic to treat graceful exits differently from error cases.


🔍 Critical Logic Change

Location: engine/packages/pegboard/src/workflows/actor/mod.rs:660-669

The core change inverts how the crash policy is evaluated:

Before:

let failed = !matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });
match (input.crash_policy, failed) {
    (CrashPolicy::Restart, true) => { /* restart on failure */ }
    (CrashPolicy::Sleep, true) => { /* sleep on failure */ }
    _ => { /* destroy */ }
}

After:

let graceful_exit = !state.going_away && matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });
match (input.crash_policy, graceful_exit) {
    (CrashPolicy::Restart, false) => { /* restart on non-graceful exit */ }
    (CrashPolicy::Sleep, false) => { /* sleep on non-graceful exit */ }
    _ => { /* destroy on graceful exit */ }
}

Analysis:

  1. ✅ The new logic correctly considers going_away state - if an actor is being phased out, even a clean stop (StopCode::Ok) should NOT be treated as a graceful exit
  2. ✅ This prevents actors from being destroyed when they're supposed to be rescheduled
  3. ⚠️ Semantic inversion: The match now triggers on false instead of true, which could be confusing

Potential Issue: The logic inverts the boolean (matching on false instead of true), which is less intuitive. Consider refactoring for clarity:

let should_reschedule = state.going_away || !matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });
match (input.crash_policy, should_reschedule) {
    (CrashPolicy::Restart, true) => { /* restart */ }
    (CrashPolicy::Sleep, true) => { /* sleep */ }
    _ => { /* destroy */ }
}
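That the suggested `should_reschedule` is exactly the negation of `graceful_exit` follows from De Morgan's law, and it can be checked exhaustively over the four input combinations. The sketch below reduces the stop variant to a single boolean `stopped_ok` (an assumption made for brevity); the function names are illustrative only.

```rust
/// graceful_exit as written in the PR, with the variant check reduced to a bool.
fn graceful_exit(going_away: bool, stopped_ok: bool) -> bool {
    !going_away && stopped_ok
}

/// The refactor proposed in this review.
fn should_reschedule(going_away: bool, stopped_ok: bool) -> bool {
    going_away || !stopped_ok
}

/// Checks all four input combinations: should_reschedule == !graceful_exit.
fn equivalent_for_all_inputs() -> bool {
    [false, true].iter().all(|&going_away| {
        [false, true].iter().all(|&stopped_ok| {
            should_reschedule(going_away, stopped_ok) == !graceful_exit(going_away, stopped_ok)
        })
    })
}
```

Because the two forms agree on every input, choosing between them is purely a readability decision.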

🐛 State Management Issue

Location: engine/packages/pegboard/src/workflows/actor/mod.rs:584-586, 713

The PR moves the going_away = false reset from line 586 to line 713.

Before: going_away was cleared early (line 586), before crash policy evaluation
After: going_away is cleared late (line 713), after all rescheduling logic

Analysis:

  • ✅ This is the correct fix - going_away needs to be checked during rescheduling logic, so it can't be cleared until after that logic completes
  • ✅ The flag is properly reset at the end alongside other cleanup flags (wake_for_alarm, will_wake)
  • ⚠️ Edge case: If the function returns early (lines 655, 676, 693, 707) with StoppedResult::Destroy, the going_away flag might not be cleared. However, this is acceptable since the actor is being destroyed anyway.

📝 Formatting Changes

Locations:

  • engine/packages/guard-core/src/custom_serve.rs:6 - Import reordering
  • engine/packages/guard-core/src/proxy_service.rs:615-627 - Whitespace reformatting
  • engine/packages/pegboard-gateway/src/shared_state.rs:197-198 - Space around = in tracing
  • engine/packages/pegboard/src/workflows/runner.rs:5 - Import reordering

Issues:
⚠️ According to CLAUDE.md, developers should NOT run cargo fmt manually. These formatting changes suggest either:

  1. An automated formatter ran (possibly a pre-commit hook)
  2. Manual formatting was applied contrary to project guidelines

Recommendation: Verify these changes align with project conventions. If they were introduced by an IDE or manual edit, consider reverting the pure whitespace changes to keep the PR focused on the logic fix.


✅ Positive Aspects

  1. Correct Bug Fix: The core issue is properly addressed - actors won't be incorrectly destroyed when in going_away state
  2. Proper State Sequencing: Moving going_away = false to after the rescheduling logic is correct
  3. Consistent State Cleanup: All temporary state flags are now cleared together at the end
  4. Tracing Consistency: The structured logging changes (index = current_message_index) follow the project's tracing guidelines from CLAUDE.md

🔒 Security Considerations

✅ No security concerns identified


⚡ Performance Considerations

✅ No performance impact - this is purely logical flow control


🧪 Test Coverage

Concern: No test changes are included in this PR.

Recommendations:

  1. Add a test case for actors stopping with going_away = true and StopCode::Ok to ensure they're rescheduled, not destroyed
  2. Add a test for actors stopping with going_away = false and StopCode::Ok to ensure they're properly destroyed
  3. Consider edge cases around the early returns in the rescheduling logic

📋 Recommendations

  1. High Priority: Add test coverage for the going_away + graceful exit scenario
  2. Medium Priority: Consider refactoring the inverted boolean logic for clarity (see suggestion above)
  3. Low Priority: Review and potentially revert pure formatting changes not related to the bug fix
  4. Documentation: Consider adding a comment explaining the going_away flag's role in determining graceful exits

✅ Approval Status

Overall: The core logic fix is correct and addresses a real bug. However, test coverage would strengthen confidence in this change.

Suggested next steps:

  • Add test coverage
  • Consider the refactoring suggestion for improved code clarity

@claude

claude bot commented Nov 20, 2025

Code Review

Summary

This PR fixes an important issue where actors were being incorrectly destroyed when receiving a GoingAway signal during runner shutdown. The fix ensures that actors properly reschedule to new runners instead of being destroyed.

Key Changes Analysis

1. Critical Fix: Delayed going_away flag reset

Location: engine/packages/pegboard/src/workflows/actor/mod.rs:713

The most important change - moving state.going_away = false from early in the function (previously line 586) to the end (now line 713). This ensures the going_away state is preserved during the rescheduling logic.

Why this matters: When a runner is shutting down, it sends GoingAway signals to all actors. The actor should reschedule to a new runner, not destroy itself. The previous code was clearing the flag before the rescheduling decision, causing actors to be incorrectly treated as graceful exits.

2. Improved Logic: graceful_exit check

Location: engine/packages/pegboard/src/workflows/actor/mod.rs:660-666

let graceful_exit = !state.going_away
    && matches!(
        variant,
        StoppedVariant::Normal {
            code: protocol::StopCode::Ok
        }
    );

This is a semantic improvement over the old failed variable. The logic is now:

  • Graceful exit: Actor stopped with OK status AND was not going away
  • Non-graceful: Actor crashed OR was stopped due to runner shutdown

3. Inverted match logic

Location: engine/packages/pegboard/src/workflows/actor/mod.rs:668-694

The pattern matching is now inverted to match on graceful_exit instead of failed:

  • (CrashPolicy::Restart, false) - Restart on non-graceful exit ✅
  • (CrashPolicy::Sleep, false) - Sleep on non-graceful exit ✅
  • Default case (graceful_exit = true) - Destroy and complete ✅

This correctly implements the expected behavior.

Code Quality

Positive ✅

  • Correct fix: Addresses the root cause by preserving going_away state
  • Better semantics: graceful_exit is more readable than !failed
  • Consistent with documentation: Aligns with docs/engine/RUNNER_SHUTDOWN.md
  • Proper formatting: All formatting changes are mechanical (rustfmt)

Minor Observations

Import ordering (lines changed in multiple files):
The PR includes some import reordering. While this follows Rust conventions (std/external/internal), these changes are cosmetic and likely from automated formatting.

Potential Concerns

1. State Management Edge Cases

The going_away flag is now cleared at the very end of handle_stopped. Consider:

  • What happens if handle_stopped returns early via StoppedResult::Destroy?
  • The flag would remain set, but since the actor is destroyed, this should be fine.
  • ✅ No issue - the flag is only relevant for running actors.

2. Missing Test Coverage ⚠️

I don't see test changes in this PR. Consider adding tests for:

// Test case: Actor with CrashPolicy::Restart receives GoingAway, 
// stops normally - should reschedule, not destroy

3. Crash Policy Documentation

The behavior change is subtle:

  • Old: StopCode::Ok = success, anything else = failure → reschedule/sleep on failure
  • New: StopCode::Ok AND !going_away = graceful → destroy on graceful, reschedule otherwise

This is correct but worth documenting in code comments at engine/packages/pegboard/src/workflows/actor/mod.rs:660.

Security & Performance

  • No security concerns: This is internal state management
  • No performance impact: Logic change only, same complexity
  • No resource leaks: Proper cleanup maintained

Recommendations

  1. Add inline comment at line 660:
// An exit is only graceful if the actor stopped with OK status and was NOT
// responding to a runner shutdown (going_away). Actors stopped during runner
// shutdown should be rescheduled to new runners, not destroyed.
let graceful_exit = !state.going_away
    && matches!(
        variant,
        StoppedVariant::Normal {
            code: protocol::StopCode::Ok
        }
    );
  2. Consider adding an integration test to verify actors reschedule correctly when runners shut down gracefully.

  3. Verify CI passes - ensure all existing tests still pass with the inverted logic.

Verdict

Approve with suggestions

This is a well-reasoned fix that correctly addresses the issue described in the PR title. The logic is sound and the implementation is clean. The suggestions above are minor improvements for maintainability.


Review generated with assistance from Claude Code

@NathanFlurry NathanFlurry force-pushed the 11-20-fix_pb_stop_actors_from_destroying_on_goingaway branch from a78fb2c to c20b8cd Compare November 20, 2025 23:32
@NathanFlurry NathanFlurry force-pushed the 11-19-fix_pegboard-gateway_ping_hibernating_requests_both_during_open_hws_connections_and_during_hibernation branch from 815c918 to 986289f Compare November 20, 2025 23:32
@claude

claude bot commented Nov 20, 2025

Code Review

Thanks for this PR! I've reviewed the changes and have the following feedback:

Summary

This PR fixes an issue where actors were being destroyed when receiving a GOINGAWAY signal. The core fix moves the state.going_away = false reset to after the rescheduling logic, ensuring that the graceful_exit check correctly identifies whether an actor stopped due to a GOINGAWAY signal vs. a normal exit.


✅ Code Quality & Best Practices

Good:

  • The logic change is minimal and surgical, addressing the specific issue without over-engineering
  • The comment move at line 575 (// Reset retry count on successful exit) improves code readability by placing it closer to the relevant code
  • Most changes are just formatting fixes (import ordering, whitespace) following the project's rustfmt conventions

Observations:

  • The core behavioral change is at lines 660-666 and 713 in engine/packages/pegboard/src/workflows/actor/mod.rs

🐛 Potential Issues

Critical Logic Concern:
The behavior change at lines 668-694 inverts the crash policy logic:

// OLD: match (input.crash_policy, failed)
// NEW: match (input.crash_policy, graceful_exit)
match (input.crash_policy, graceful_exit) {
    (CrashPolicy::Restart, false) => { /* reschedule */ }
    (CrashPolicy::Sleep, false) => { /* sleep */ }
    _ => { /* destroy */ }
}

This is a significant semantic change:

  • Before: failed = true would trigger rescheduling/sleeping
  • After: graceful_exit = false triggers rescheduling/sleeping

The new graceful_exit is true when:

  1. !state.going_away (actor was NOT sent a GOINGAWAY signal), AND
  2. code == StopCode::Ok (actor exited successfully)

Question: Is this the intended behavior? Specifically:

  • Should actors that exit with StopCode::Ok but were sent a GOINGAWAY signal (going_away = true) be destroyed instead of restarted?
  • The old logic would restart/sleep ANY failure. The new logic only restarts/sleeps if it's NOT a graceful exit (i.e., either going_away=true OR code!=Ok)

This seems correct for the GOINGAWAY case, but I want to confirm this doesn't inadvertently change behavior for actors that exit gracefully with Ok status under normal circumstances.


🔒 Security Considerations

No security concerns identified. The changes are internal state management logic.


⚡ Performance Considerations

No performance impact. The changes are pure logic fixes with no algorithmic changes.


🧪 Test Coverage

Concern: No test coverage added for this bug fix.

Recommendation: Consider adding a test case that:

  1. Starts an actor
  2. Sends it a GoingAway signal
  3. Actor stops with StopCode::Ok
  4. Verifies the actor follows its crash policy (rescheduled or put to sleep, not destroyed)

This would prevent regression and document the expected behavior.


📝 Minor Observations

  1. Import ordering changes: The formatting changes (like reordering imports in custom_serve.rs and runner.rs) are good for consistency
  2. Whitespace fixes: The indentation fixes in proxy_service.rs and shared_state.rs follow project conventions (hard tabs)
  3. Comment positioning: Moving the comment to line 575 is a good improvement

❓ Questions for Author

  1. Can you confirm the inverted logic for crash policy is intentional and correct?
  2. Have you manually tested this with actors that:
    • Receive GOINGAWAY and exit with Ok (should restart/sleep based on policy)
    • Crash without GOINGAWAY (should restart/sleep based on policy)
    • Exit normally with Ok without GOINGAWAY (should destroy based on new logic?)
  3. Would you consider adding a test case to prevent regression?

Overall Assessment

The fix appears to address the stated issue (actors destroying on GOINGAWAY), but the semantic inversion of the crash policy logic needs verification to ensure it doesn't introduce unintended behavior changes for other actor lifecycle scenarios.

Recommendation: ✅ Approve with verification - Please confirm the crash policy logic change is fully intentional and tested.

@claude

claude bot commented Nov 20, 2025

Code Review - PR #3501: fix(pb): stop actors from destroying on goingaway

Summary

This PR fixes a bug where actors were being destroyed when they received a GoingAway signal with StopCode::Ok, even though they should have been rescheduled according to their crash policy. The fix inverts the logic to properly distinguish between graceful exits and crashes.


✅ Positive Aspects

  1. Critical Bug Fix: The core logic change correctly addresses the issue. The previous logic treated going_away + StopCode::Ok as a successful exit (failed = false), causing actors to be destroyed instead of restarted.

  2. Improved Variable Naming: Renaming the boolean from failed to graceful_exit significantly improves code readability and makes the intent clearer.

  3. Correct State Management: Moving state.going_away = false to after all rescheduling logic (line 713) ensures the flag is properly reset only after determining whether to reschedule.

  4. Code Formatting: The PR includes several formatting improvements (whitespace, import ordering) that align with the codebase standards.


🔍 Logic Analysis

Before (buggy):

let failed = !matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });
match (input.crash_policy, failed) {
    (CrashPolicy::Restart, true) => { /* reschedule */ }
    (CrashPolicy::Sleep, true) => { /* sleep */ }
    _ => { /* destroy */ }
}
  • A going_away stop with StopCode::Ok would set failed = false
  • This would fall through to the _ branch and destroy the actor
  • Bug: Actors with CrashPolicy::Restart should restart, not destroy

After (fixed):

let graceful_exit = !state.going_away && matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });
match (input.crash_policy, graceful_exit) {
    (CrashPolicy::Restart, false) => { /* reschedule */ }
    (CrashPolicy::Sleep, false) => { /* sleep */ }
    _ => { /* destroy */ }
}
  • A going_away stop with StopCode::Ok sets graceful_exit = false (because going_away == true)
  • This matches (CrashPolicy::Restart, false) and reschedules the actor
  • Correct: Actors restart according to their crash policy

🤔 Questions & Considerations

1. Edge Case: going_away with Non-Ok Exit Codes

What happens when state.going_away == true AND the exit code is not StopCode::Ok (e.g., crash during shutdown)?

graceful_exit = !state.going_away && matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok })
// If going_away==true and code!=Ok: graceful_exit = false
// This would trigger rescheduling for CrashPolicy::Restart

Question: Is this the intended behavior? Should a crash during going_away also trigger rescheduling, or should it be handled differently?

2. State Reset Timing

The going_away flag is now reset at line 713, after all the rescheduling logic. This is correct for the fix, but I want to confirm:

  • When an actor is rescheduled (lines 669-678), it gets a new runner_id and runner_workflow_id
  • The going_away flag persists across the rescheduling
  • Only after rescheduling completes is the flag cleared

Confirmation: This seems correct for maintaining state during the transition, but should going_away ever be carried over to a newly rescheduled actor instance?

3. Comment Placement

The comment "Reset retry count on successful exit" was moved from line 572 (before the logic) to line 575 (inside the match arm). While this works, the original placement was more visible. Consider if the comment should stay at the top of the match expression for better visibility.


🎯 Test Coverage

Recommendation: This PR should include tests for the following scenarios:

  1. Actor with CrashPolicy::Restart receives GoingAway signal and stops with StopCode::Ok

    • ✅ Expected: Actor should be rescheduled, not destroyed
    • ❌ Previous behavior: Actor was destroyed
  2. Actor with CrashPolicy::Restart crashes (non-Ok exit code) while going_away == true

    • What should happen in this scenario?
  3. Actor with CrashPolicy::Sleep receives GoingAway signal and stops with StopCode::Ok

    • Expected behavior based on the fix?
  4. Actor with CrashPolicy::Destroy in various going_away states

    • Should always destroy regardless of going_away state

Suggested test location: engine/packages/pegboard/src/workflows/actor/ (following existing test patterns)


🔒 Security Considerations

No security concerns identified. This is an internal state management fix.


⚡ Performance Considerations

No performance impact. The change is purely logical with the same computational complexity.


📝 Minor Formatting Notes

The formatting changes are good overall:

  • Import ordering fixes (engine/packages/guard-core/src/custom_serve.rs:5-6)
  • Consistent indentation (engine/packages/guard-core/src/proxy_service.rs:615-627)
  • Structured logging spacing (engine/packages/pegboard-gateway/src/shared_state.rs:197-198)

These align with the codebase's Rust formatting standards.


🎬 Recommendations

  1. Add test coverage for the scenarios mentioned above
  2. Clarify the intended behavior for going_away + crash (non-Ok exit code)
  3. Consider documenting the going_away flag's lifecycle in comments, as it has subtle behavior that affects rescheduling
  4. Verify in integration testing that actors properly restart after receiving GoingAway signals

✨ Conclusion

Approval Status: ✅ Approve with minor recommendations

The core fix is sound and addresses a real bug in the actor lifecycle management. The logic inversion correctly distinguishes between graceful exits and crashes, ensuring actors are rescheduled according to their crash policy rather than being incorrectly destroyed. The main recommendations are around test coverage and documentation to prevent regression and clarify edge case behavior.

Great catch on this bug! The going_away state interaction with crash policies is subtle, and this fix properly handles it.

@MasterPtato MasterPtato force-pushed the 11-20-fix_pb_stop_actors_from_destroying_on_goingaway branch from c20b8cd to a78fb2c Compare November 21, 2025 00:48
@MasterPtato MasterPtato force-pushed the 11-19-fix_pegboard-gateway_ping_hibernating_requests_both_during_open_hws_connections_and_during_hibernation branch from 986289f to 815c918 Compare November 21, 2025 00:48
@claude

claude bot commented Nov 21, 2025

Code Review

Overview

This PR fixes a critical bug where actors were incorrectly destroyed when receiving a GoingAway signal, even when configured with CrashPolicy::Restart or CrashPolicy::Sleep. The fix correctly distinguishes between graceful exits and forced shutdowns during runner maintenance.


✅ Main Logic Fix (actor/mod.rs:660-713)

The Problem:
Previously, the code would destroy actors that stopped with StopCode::Ok during a runner shutdown (going_away=true), ignoring the crash policy.

The Solution:
The fix introduces graceful_exit which requires BOTH:

  1. !state.going_away (not during runner shutdown)
  2. StopCode::Ok (clean exit)
// Before
let failed = !matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });
match (input.crash_policy, failed) {
    (CrashPolicy::Restart, true) => { /* reschedule */ }
    (CrashPolicy::Sleep, true) => { /* sleep */ }
    _ => { /* destroy */ }
}

// After
let graceful_exit = !state.going_away 
    && matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });
match (input.crash_policy, graceful_exit) {
    (CrashPolicy::Restart, false) => { /* reschedule */ }
    (CrashPolicy::Sleep, false) => { /* sleep */ }
    _ => { /* destroy */ }
}

Analysis:

  • Boolean inversion is correctly handled (failed=true → graceful_exit=false)
  • When going_away=true, the exit is treated as non-graceful, triggering crash policy
  • This ensures actors are rescheduled to other runners during maintenance instead of being destroyed

✅ Critical State Management Fix (actor/mod.rs:713)

Before: state.going_away = false was reset at line 586 (too early)
After: Reset moved to line 713 (after all rescheduling logic)

Why this matters: The going_away flag must remain set through the entire handle_stopped function so the rescheduling logic (line 660) can properly determine if this is a graceful exit or a forced shutdown.


🎯 Correctness Verification

The fix correctly implements these scenarios:

| Scenario | going_away | StopCode | Crash Policy | Expected Behavior |
| --- | --- | --- | --- | --- |
| Runner shutdown | true | Ok | Restart | Reschedule to new runner |
| Runner shutdown | true | Ok | Sleep | Put to sleep |
| Normal exit | false | Ok | Restart | Destroy (graceful) |
| Normal exit | false | Ok | Sleep | Destroy (graceful) |
| Crash | false | Error | Restart | Reschedule |
| Crash | false | Error | Sleep | Put to sleep |
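The scenario table can be encoded as data and checked against a small model of the decision. This is a sketch under stated assumptions: `CrashPolicy`, `StopCode`, and `Outcome` here are simplified stand-ins (the real code matches on `StoppedVariant` and returns workflow results), and `decide`, `SCENARIOS`, and `mismatches` are hypothetical names.

```rust
// Simplified stand-ins for the types named in the review (assumptions).
#[derive(Clone, Copy, Debug, PartialEq)]
enum CrashPolicy { Restart, Sleep }

#[derive(Clone, Copy, Debug, PartialEq)]
enum StopCode { Ok, Error }

#[derive(Clone, Copy, Debug, PartialEq)]
enum Outcome { Reschedule, Sleep, Destroy }

/// Model of the post-fix decision: only a clean stop without going_away is graceful.
fn decide(policy: CrashPolicy, going_away: bool, code: StopCode) -> Outcome {
    let graceful_exit = !going_away && code == StopCode::Ok;
    match (policy, graceful_exit) {
        (CrashPolicy::Restart, false) => Outcome::Reschedule,
        (CrashPolicy::Sleep, false) => Outcome::Sleep,
        _ => Outcome::Destroy,
    }
}

/// The six rows of the table: (going_away, stop code, policy, expected outcome).
const SCENARIOS: [(bool, StopCode, CrashPolicy, Outcome); 6] = [
    (true, StopCode::Ok, CrashPolicy::Restart, Outcome::Reschedule),
    (true, StopCode::Ok, CrashPolicy::Sleep, Outcome::Sleep),
    (false, StopCode::Ok, CrashPolicy::Restart, Outcome::Destroy),
    (false, StopCode::Ok, CrashPolicy::Sleep, Outcome::Destroy),
    (false, StopCode::Error, CrashPolicy::Restart, Outcome::Reschedule),
    (false, StopCode::Error, CrashPolicy::Sleep, Outcome::Sleep),
];

/// Returns every row whose expected outcome disagrees with the model.
fn mismatches() -> Vec<(bool, StopCode, CrashPolicy, Outcome)> {
    SCENARIOS
        .iter()
        .copied()
        .filter(|&(going_away, code, policy, expected)| decide(policy, going_away, code) != expected)
        .collect()
}
```

An empty result from `mismatches()` means the model and the table agree; the same rows could seed the integration tests recommended below.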

📝 Code Style & Formatting

All formatting changes align with CLAUDE.md conventions:

  • Import ordering (custom_serve.rs:6, runner.rs:5): Now follows std → external → internal ordering
  • Structured logging (shared_state.rs:197-198): Properly uses key = value format instead of key=value
  • Indentation (proxy_service.rs:615-627): Formatting improved for better readability
  • Hard tabs: Maintained throughout (per rustfmt.toml)

⚠️ Recommendations

1. Test Coverage

This is a critical state machine bug fix but no test changes are included. Consider adding integration tests for:

  • Actor with CrashPolicy::Restart receiving GoingAway → verifies rescheduling
  • Actor with CrashPolicy::Sleep receiving GoingAway → verifies sleeping
  • Actor with normal StopCode::Ok (no going_away) → verifies destruction

2. Documentation

Consider adding a comment at line 660 explaining the graceful_exit logic:

// A graceful exit requires both a clean stop (StopCode::Ok) AND not being 
// forced to stop during runner shutdown (going_away=false). Otherwise, 
// treat as a crash and apply the crash policy (restart/sleep).
let graceful_exit = !state.going_away && matches!(...);

3. Edge Case Consideration

The going_away guard at line 482 prevents duplicate GoingAway signal processing, which is correct. However, verify behavior if a second GoingAway arrives during rescheduling.


🔒 Security & Performance

  • ✅ No security concerns
  • ✅ No performance impact
  • ✅ Improves system reliability by preventing incorrect actor destruction

Verdict

APPROVE

This is a well-reasoned fix that correctly addresses the reported issue. The logic is sound and more maintainable than before. Main suggestions are adding test coverage and documentation for future maintainers.

Files Changed:

  • engine/packages/pegboard/src/workflows/actor/mod.rs: Core logic fix (660-713)
  • engine/packages/guard-core/src/proxy_service.rs: Formatting improvements
  • engine/packages/guard-core/src/custom_serve.rs: Import ordering
  • engine/packages/pegboard-gateway/src/shared_state.rs: Logging format fix
  • engine/packages/pegboard/src/workflows/runner.rs: Import ordering

Review generated with Claude Code

@NathanFlurry NathanFlurry force-pushed the 11-20-fix_pb_stop_actors_from_destroying_on_goingaway branch from a78fb2c to c20b8cd Compare November 21, 2025 01:27
@NathanFlurry NathanFlurry force-pushed the 11-19-fix_pegboard-gateway_ping_hibernating_requests_both_during_open_hws_connections_and_during_hibernation branch from 815c918 to 986289f Compare November 21, 2025 01:27
@claude

claude bot commented Nov 21, 2025

Code Review

Summary

This PR fixes an issue where actors were being destroyed when they received a GoingAway signal, when they should instead be rescheduled. The key change inverts the logic to check for a graceful exit rather than a failure.

Code Quality & Logic ✅

Primary Fix (actor/mod.rs:660-713)

The core logic change is sound:

Before:

let failed = !matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });
match (input.crash_policy, failed) {
    (CrashPolicy::Restart, true) => { /* reschedule */ }
    (CrashPolicy::Sleep, true) => { /* sleep */ }
    _ => { /* destroy */ }
}

After:

let graceful_exit = !state.going_away 
    && matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });
match (input.crash_policy, graceful_exit) {
    (CrashPolicy::Restart, false) => { /* reschedule */ }
    (CrashPolicy::Sleep, false) => { /* sleep */ }
    _ => { /* destroy */ }
}

This correctly identifies that:

  • An actor stopping with going_away=true should not be considered a graceful exit
  • The crash policy should trigger when graceful_exit=false, which includes both failures AND going_away scenarios

State Management (actor/mod.rs:713)

Moving state.going_away = false to line 713 (after all rescheduling logic) is correct. This ensures:

  1. The flag is checked during rescheduling decisions (line 660)
  2. It's reset only after the actor has been handled
  3. It's cleared alongside other transient state flags (wake_for_alarm, will_wake)

The original placement at line 586 was premature - it cleared the flag before using it in the rescheduling logic.

Formatting Changes ✅

The formatting changes in proxy_service.rs and other files appear to be automatic formatter corrections:

  • Fixing indentation (tabs vs spaces per rustfmt.toml)
  • Aligning struct field assignments
  • Import ordering

These follow the codebase conventions and improve consistency.

Potential Concerns

1. State Transition Safety
The going_away flag is now cleared after rescheduling. Consider edge cases:

  • What happens if rescheduling fails or returns Destroy early?
    • ✅ This is fine - the workflow either continues with the flag reset, or the actor is destroyed (flag doesn't matter)

2. Comment Relocation (actor/mod.rs:575)
The comment "Reset retry count on successful exit" was moved inside the match arm. This is better placement as it's more specific about when the reset occurs.

3. Generation Checking
The GoingAway signal handler (line 474) properly checks generation to ignore stale signals. The going_away flag is generation-specific and gets reset properly.

Testing Recommendations 🧪

Consider adding test coverage for:

  1. GoingAway + Restart Policy: Verify actor is rescheduled (not destroyed) when:

    • Actor receives GoingAway signal
    • Stops with StopCode::Ok
    • Has CrashPolicy::Restart
  2. GoingAway + Sleep Policy: Verify actor sleeps (not destroyed) when:

    • Actor receives GoingAway signal
    • Stops with StopCode::Ok
    • Has CrashPolicy::Sleep
  3. Normal Graceful Exit: Verify actor is destroyed when:

    • No going_away flag
    • Stops with StopCode::Ok
    • Has either crash policy
  4. State Cleanup: Verify going_away flag is properly reset after rescheduling

Example test structure:

#[tokio::test] // an async test fn needs an async test attribute; tokio is assumed here
async fn test_going_away_respects_restart_policy() {
    // Setup actor with CrashPolicy::Restart
    // Send GoingAway signal
    // Stop actor with StopCode::Ok
    // Assert: actor was rescheduled, not destroyed
    // Assert: going_away flag is cleared
}

Performance Considerations ⚡

No performance concerns. The change:

  • Adds one boolean check (!state.going_away)
  • Moves flag reset to a slightly later position (negligible)

Security Considerations 🔒

No security concerns. The change:

  • Prevents premature actor destruction (availability improvement)
  • Properly respects crash policies
  • Maintains generation checking to prevent race conditions

Documentation 📝

Consider adding a code comment explaining the graceful_exit logic:

// Only treat as graceful exit if:
// 1. Actor stopped with StopCode::Ok, AND
// 2. Not marked as going_away (runner shutdown/eviction)
let graceful_exit = !state.going_away 
    && matches!(variant, StoppedVariant::Normal { code: protocol::StopCode::Ok });

Overall Assessment ✨

Verdict: LGTM with minor suggestions

The fix correctly addresses the issue where actors were being destroyed instead of rescheduled during GoingAway scenarios. The logic is sound, state management is correct, and the formatting changes improve code consistency.

Strengths:

  • Correct inversion of logic from "failed" to "graceful_exit"
  • Proper state flag management
  • Clean, readable code

Suggestions:

  • Add test coverage for the various going_away scenarios
  • Consider adding an inline comment explaining graceful_exit logic
  • Verify the fix works as expected in staging/integration tests

Generated with Claude Code

claude bot commented Nov 21, 2025

Code Review

Summary

This PR fixes a critical bug where actors were being incorrectly destroyed when receiving a GoingAway signal. The fix moves the going_away flag reset to after the rescheduling logic, so a stop that follows GoingAway is no longer misclassified as a graceful exit.

Core Logic Change (packages/pegboard/src/workflows/actor/mod.rs)

The Bug:
Previously, state.going_away = false was reset early in handle_stopped() (line 586), before the rescheduling logic. This caused the following issue:

  1. Actor receives GoingAway signal → state.going_away = true
  2. Actor stops gracefully with StopCode::Ok
  3. handle_stopped() immediately resets state.going_away = false (old line 586)
  4. Rescheduling logic checks graceful_exit = !state.going_away && StopCode::Ok → evaluates to true
  5. Actor is incorrectly treated as a graceful exit and destroyed instead of being rescheduled

The Fix:
The state.going_away = false reset is now moved to line 713, after all rescheduling logic. This ensures:

  • The going_away flag is preserved during rescheduling decisions
  • Actors that stopped due to GoingAway are correctly identified as non-graceful exits
  • With CrashPolicy::Restart, they are rescheduled instead of destroyed
  • The flag is still properly reset for the next lifecycle
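
The effect of the reset ordering described above can be illustrated with a small model. This is a sketch under assumptions, not the body of handle_stopped(): the two function names and the reduced `LifecycleState` are hypothetical, and each function returns only whether the stop would be classified as graceful.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum StopCode {
    Ok,
    Error, // assumed non-Ok variant, for illustration only
}

// Hypothetical, reduced lifecycle state holding only the flag under discussion.
#[derive(Debug, Default)]
struct LifecycleState {
    going_away: bool,
}

fn is_graceful(state: &LifecycleState, code: StopCode) -> bool {
    !state.going_away && code == StopCode::Ok
}

/// Ordering described as the bug: the flag is cleared before the decision,
/// so the decision can no longer see that the actor was going away.
fn stopped_reset_first(state: &mut LifecycleState, code: StopCode) -> bool {
    state.going_away = false;
    is_graceful(state, code)
}

/// Ordering described as the fix: decide first, then clear the flag
/// so the next lifecycle starts from a clean state.
fn stopped_reset_last(state: &mut LifecycleState, code: StopCode) -> bool {
    let graceful = is_graceful(state, code);
    state.going_away = false;
    graceful
}
```

Starting from `going_away = true` and `StopCode::Ok`, the first function reports a graceful exit and the second does not, while both leave the flag cleared afterwards.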

Correctness:
✅ The logic inversion at lines 668-680 is correct:

  • graceful_exit = !state.going_away && StopCode::Ok
  • match (crash_policy, graceful_exit) now properly handles:
    • (Restart, false) → reschedule (was crashing/going away)
    • (Sleep, false) → put to sleep
    • (_, true) → destroy (truly graceful exit)

This matches the comment on line 575: "Reset retry count on successful exit" - only truly successful exits should reset the counter.

Code Quality

Strengths:

  • Minimal, surgical fix addressing the root cause
  • Preserves existing behavior for all other code paths
  • Comment placement improved (line 575 moved closer to the action)

Formatting Changes:
The PR includes several formatting-only changes that should ideally be in a separate commit:

  • guard-core/src/custom_serve.rs: Import reordering (line 6)
  • guard-core/src/proxy_service.rs: Whitespace/indentation (lines 615-628, 642)
  • pegboard-gateway/src/shared_state.rs: Logging format spacing (lines 197-198)
  • pegboard/src/workflows/runner.rs: Import reordering (line 5)

Note: Per CLAUDE.md, formatting should not be done manually - these changes appear to be from cargo fmt. Consider separating formatting changes from logic changes in future PRs.

Potential Issues

1. Test Coverage (Critical) ⚠️
This is a critical bugfix for actor lifecycle management, but I don't see:

  • Unit tests verifying the going_away flag behavior
  • Integration tests for the GoingAway → Stop → Reschedule sequence
  • Regression tests to prevent this bug from reoccurring

Recommendation: Add test coverage for:

// Test that actors with crash_policy=Restart reschedule after GoingAway+Stop
// Test that actors with crash_policy=Sleep sleep after GoingAway+Stop  
// Test that truly graceful exits (no GoingAway) are destroyed

2. Edge Case: Concurrent Signals 🤔
What happens in the following sequence?

  • Actor receives a GoingAway signal
  • Actor receives a Destroy signal (before stopping)
  • Actor stops with StopCode::Ok

The going_away flag would be set, but should the actor be destroyed (per Destroy signal) or rescheduled (per going_away logic)? Current code would reschedule, but this may not be intended. Consider if there's a priority order needed.
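
If an explicit destroy is meant to win, one possible priority order is sketched below. Everything here is hypothetical: the `destroy_requested` field, the `decide_with_destroy_priority` function, and the assumption of a Restart crash policy are illustrative only, and the workflow may already handle this case differently.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    Reschedule,
    Destroy,
}

// Hypothetical, reduced lifecycle state.
#[derive(Debug, Default)]
struct LifecycleState {
    going_away: bool,
    // Assumed flag recording that a Destroy signal arrived before the stop.
    destroy_requested: bool,
}

/// Decision for an actor with a Restart crash policy (assumed),
/// where an explicit destroy request takes priority over going_away.
fn decide_with_destroy_priority(state: &LifecycleState, stopped_ok: bool) -> Outcome {
    if state.destroy_requested {
        // Explicit user intent overrides rescheduling.
        return Outcome::Destroy;
    }
    let graceful_exit = !state.going_away && stopped_ok;
    if graceful_exit {
        Outcome::Destroy
    } else {
        Outcome::Reschedule
    }
}
```

Checking the destroy request first keeps the going_away logic unchanged for every other path, at the cost of one more piece of state to reset.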

3. Documentation 📝
The going_away field in LifecycleState (runtime.rs:34) lacks documentation explaining:

  • When it's set (GoingAway signal received)
  • What it means (runner is shutting down, actor needs rescheduling)
  • Why it affects rescheduling decisions
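
A doc comment along these lines would cover the three points. The struct below is a hypothetical excerpt with every other field omitted, and the wording is a suggestion based only on the behavior described in this review.

```rust
/// Hypothetical excerpt of the lifecycle state; other fields omitted.
#[derive(Debug, Default)]
pub struct LifecycleState {
    /// Set to `true` when a GoingAway signal is received for the current
    /// generation, meaning the runner is shutting down and the actor has
    /// to move elsewhere.
    ///
    /// While set, a stop with `StopCode::Ok` is NOT treated as a graceful
    /// exit, so the crash policy (restart or sleep) applies instead of
    /// destroying the actor. Reset to `false` once the stop is handled.
    pub going_away: bool,
}
```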

Security & Performance

✅ No security concerns identified
✅ No performance impact - same code paths, just different flag timing

Verdict

Approve with recommendations:

  • ✅ Core fix is correct and addresses the stated bug
  • ⚠️ Add test coverage for this critical path
  • 📝 Consider separating formatting changes
  • 📝 Document the going_away flag behavior

Great catch on a subtle state management bug! The fix is minimal and correct.

graphite-app bot commented Nov 21, 2025

Merge activity

  • Nov 21, 2:40 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 21, 2:42 AM UTC: CI is running for this pull request on a draft pull request (#3515) due to your merge queue CI optimization settings.
  • Nov 21, 2:43 AM UTC: Merged by the Graphite merge queue via draft PR: #3515.

@graphite-app graphite-app bot closed this Nov 21, 2025
@graphite-app graphite-app bot deleted the 11-20-fix_pb_stop_actors_from_destroying_on_goingaway branch November 21, 2025 02:43