fix(pegboard-gateway): ping hibernating requests both during open hws connections and during hibernation #3498
Conversation
## Code Review

I've reviewed this PR and have the following feedback:

### Overview

This PR extracts the keepalive task into a separate module and ensures it runs both during active WebSocket connections and during hibernation. The goal is to prevent hibernating requests from being garbage collected.

### Positive Changes

### Issues and Concerns

**1. Documentation typo (Minor)**

Line 13 in keepalive_task.rs - Missing space in comment. Should be "/// Only ran for hibernating requests." instead of "///Only ran for hibernating requests."

**2. Potential unnecessary work during active connections (Performance)**

Lines 480-507 in lib.rs - The keepalive task now runs during active WebSocket connections when can_hibernate is true. Previously, it only ran during actual hibernation (handle_websocket_hibernation).

Question: Is it necessary to ping the hibernating_request UDB entry while the connection is still active? This adds database writes during normal operation. The original implementation only ran the keepalive task during actual hibernation (lines 603-616), which seems more efficient.

Recommendation: Consider whether the keepalive task should only run during actual hibernation, not during active connections. If it must run during active connections, please document why this change is necessary in the PR description or code comments.

**3. Task spawning pattern (Code Quality)**

Lines 485-491 in lib.rs - The keepalive task is spawned inside a tokio::join! future, which is different from how the other three tasks are spawned (lines 398-418). For consistency, consider spawning it outside the join block conditionally and then awaiting it inside, similar to how tunnel_to_ws, ws_to_tunnel, and ping tasks are handled (a sketch of this pattern follows below).

**4. Result matching could be more robust (Correctness)**

Lines 511-523 in lib.rs - The result matching logic prefers the first non-aborted result, but doesn't explicitly handle cases where multiple tasks return non-aborted results. Consider being more explicit about the precedence for different LifecycleResult variants (ServerClose vs ClientClose vs Aborted).

**5. Clone variables optimization (Minor Performance)**

Lines 424-428 in lib.rs - Several clones are performed that may not be necessary since most of these types are cheap to clone (Ids are Copy types, ctx is an Arc), but consider capturing them directly in the async block if possible.
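Following up on issue 3, here is a minimal, self-contained sketch of conditionally spawning a keepalive-style task up front and awaiting it on shutdown. The `keepalive_stub` function, the simplified `LifecycleResult`, and the hard-coded `can_hibernate` flag are illustrative stand-ins, not the gateway's actual code.

```rust
use std::time::Duration;

use tokio::{sync::watch, task::JoinHandle};

// Stand-in for the gateway's lifecycle result type.
#[derive(Debug)]
enum LifecycleResult {
    Aborted,
}

// Stand-in for keepalive_task::task: a real loop would tick an interval and
// upsert the hibernating_request row; this stub just waits for the abort signal.
async fn keepalive_stub(mut abort_rx: watch::Receiver<()>) -> Result<LifecycleResult, String> {
    let _ = abort_rx.changed().await;
    Ok(LifecycleResult::Aborted)
}

#[tokio::main]
async fn main() {
    let can_hibernate = true;
    let (abort_tx, abort_rx) = watch::channel(());

    // Spawn conditionally, outside the join block, mirroring how the other
    // lifecycle tasks are spawned up front.
    let keepalive_handle: Option<JoinHandle<Result<LifecycleResult, String>>> =
        can_hibernate.then(|| tokio::spawn(keepalive_stub(abort_rx)));

    // ... tunnel_to_ws, ws_to_tunnel, and ping would be joined here ...
    tokio::time::sleep(Duration::from_millis(50)).await;

    // Signal shutdown and await the task only if it was started.
    let _ = abort_tx.send(());
    if let Some(handle) = keepalive_handle {
        match handle.await {
            Ok(res) => println!("keepalive finished: {res:?}"),
            Err(err) => eprintln!("keepalive task panicked: {err:?}"),
        }
    }
}
```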
### Testing Recommendations

### Security/Safety

No security concerns identified. The changes maintain proper error handling and resource cleanup.

### Verdict

The core fix looks sound, but there are some architectural questions (particularly issue 2 about running during active connections) that should be addressed before merging. The code quality could be improved with more consistent patterns and better documentation of the design decisions.
## Code Review - PR #3498

### Summary

This PR refactors the keepalive task for hibernating requests by extracting it into a separate module and ensuring it runs both during active WebSocket connections and during hibernation. Overall, the changes look good and follow the repository's patterns well.

### Positive Observations

- ✅ Good refactoring: Extracting the keepalive logic into `keepalive_task.rs` improves separation of concerns
- ✅ Consistent patterns: The new task follows the same pattern as the other lifecycle tasks
- ✅ Proper lifecycle management: The keepalive task is correctly spawned and cleaned up in both `handle_websocket` and `handle_websocket_hibernation`
- ✅ Jitter implementation: Good use of jitter (0-128ms) to prevent thundering herd problems
- ✅ Import organization: Follows the repository convention of keeping imports at the top of the file

### Issues & Suggestions

**1. Typo in documentation (Minor)**

Location: `keepalive_task.rs`

```rust
/// Periodically pings writes keepalive in UDB. This is used to restore hibernating request IDs on
```

Should be:

```rust
/// Periodically writes keepalive pings in UDB. This is used to restore hibernating request IDs on
```

**2. Formatting issue in documentation (Minor)**

Location: `keepalive_task.rs`

There's a missing space:

```rust
///Only ran for hibernating requests.
```

Should be:

```rust
/// Only ran for hibernating requests.
```

**3. Inconsistent lifecycle result handling (Moderate)**

Location: `lib.rs`

The lifecycle result matching logic shows asymmetry. When multiple tasks complete successfully but with different results, the code only checks the first two positions:

```rust
(Ok(res), Ok(LifecycleResult::Aborted), _, _) => Ok(res),
(Ok(LifecycleResult::Aborted), Ok(res), _, _) => Ok(res),
// Unlikely case
(res, _, _, _) => res,
```

Potential issue: If the ping or keepalive task returns a non-aborted result while the first two tasks are aborted, that result is silently dropped.

Suggestion: Consider a more comprehensive pattern that checks all positions for non-Aborted results:

```rust
match (tunnel_to_ws_res, ws_to_tunnel_res, ping_res, keepalive_res) {
    // Prefer error
    (Err(err), _, _, _) => Err(err),
    (_, Err(err), _, _) => Err(err),
    (_, _, Err(err), _) => Err(err),
    (_, _, _, Err(err)) => Err(err),
    // Find first non-aborted result
    (Ok(res), _, _, _) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
    (_, Ok(res), _, _) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
    (_, _, Ok(res), _) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
    (_, _, _, Ok(res)) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
    // All aborted (unlikely)
    (res, _, _, _) => res,
}
```

**4. Error handling observation (Low)**

Location: `keepalive_task.rs`

The error from `try_into()` could carry more context, e.g.:

```rust
let mut ping_interval = tokio::time::interval(Duration::from_millis(
    (ctx.config()
        .pegboard()
        .hibernating_request_eligible_threshold()
        / 2)
        .try_into()
        .context("hibernating_request_eligible_threshold too large for interval")?,
));
```

However, this is optional since the threshold would need to be absurdly large to cause issues.

**5. Missing context on abort handling (Low)**

Location: `keepalive_task.rs`

The abort handling via the watch channel:

```rust
_ = keepalive_abort_rx.changed() => {
    tracing::debug!("keepalive task aborted");
    return Ok(LifecycleResult::Aborted);
}
```

### Performance Considerations

- ✅ The keepalive interval calculation (threshold / 2) is reasonable
- ✅ Using `tokio::time::interval` for the periodic writes is appropriate
- ✅ Jitter range (0-128ms) is reasonable for preventing synchronization

### Security Considerations

✅ No security concerns identified

### Test Coverage

### Verdict

Recommendation: Approve with minor fixes

The changes are well-structured and solve the stated problem effectively. The only critical item is the documentation typo, which is trivial to fix. The lifecycle result handling observation is worth considering but not blocking since the current logic works for the common case.

Great work on maintaining consistency with existing patterns! 🎯
## PR Review: Keepalive Task Refactoring

### Summary

This PR extracts keepalive logic into a dedicated task module and ensures hibernating requests are pinged both during active WebSocket connections and during hibernation. The refactoring improves code organization and fixes a potential bug where keepalive pings weren't being sent during active connections.

### Code Quality & Best Practices ✅

Strengths:

Minor Issues:

### Potential Bugs 🐛

Medium Priority:

### Performance Considerations ⚡

Good:

Potential Improvement:

### Security Concerns 🔒

No major security issues identified. Minor observations:

### Test Coverage 🧪

Critical Gap:

Recommendations:

### Architecture & Design 🏗️

Excellent:

### Adherence to CLAUDE.md Standards ✅

### Summary & Recommendations

Approve with minor changes recommended.

The core logic is sound and this is a good refactoring. The issues identified are relatively minor but should be addressed to improve robustness.

Priority order:
## Code Review

### Overview

This PR refactors the keepalive logic for hibernating WebSocket requests by extracting it into a separate module and running it alongside other lifecycle tasks. The change improves architecture and ensures hibernating requests are properly maintained both during active connections and during hibernation.

### Positive Aspects ✅

### Issues and Suggestions 🔍

**1. Typo in Comment (Minor)**

Location: `keepalive_task.rs`

```rust
///Only ran for hibernating requests.
```

Issue: Missing space after `///`.

Fix: Should be `/// Only ran for hibernating requests.`

**2. Import Organization (Style)**

Location: `lib.rs`

The diff shows imports were reordered:

```rust
use rivet_guard_core::{
    WebSocketHandle,
    custom_serve::{CustomServeTrait, HibernationResult},
    errors::{ServiceUnavailable, WebSocketServiceUnavailable},
    proxy_service::{ResponseBody, is_ws_hibernate},
    request_context::RequestContext,
    websocket_handle::WebSocketReceiver,
};
```

Issue: While this appears to be an auto-formatting change, the CLAUDE.md explicitly states the repository's import conventions.

Recommendation: Verify if these import reorderings were intentional or accidental. If accidental, they should be reverted to avoid conflicts with the team's formatting workflow.

**3. Potential Integer Overflow (Low Risk)**

Location: `keepalive_task.rs`

```rust
let mut ping_interval = tokio::time::interval(Duration::from_millis(
    (ctx.config()
        .pegboard()
        .hibernating_request_eligible_threshold()
        / 2)
        .try_into()?,
));
```

Issue: The code divides `hibernating_request_eligible_threshold()` by 2 and then converts the result with `try_into()`, which can fail for out-of-range values.

Consideration:

Recommendation: Current implementation is acceptable, but consider adding a debug assertion or comment explaining the assumption.

**4. Task Lifecycle Abort Pattern Inconsistency (Minor)**

Location: In `handle_websocket_hibernation`:

```rust
let _ = keepalive_abort_tx.send(());
let _ = keepalive_handle.await;
```

Compare to the old implementation in the same function:

```rust
keepalive_handle.abort();
```

Observation: The new pattern is cleaner and more consistent with the other tasks.

Recommendation: Consider whether you want to log if the keepalive task panicked:

```rust
if let Err(e) = keepalive_handle.await {
    if e.is_panic() {
        tracing::error!(error = ?e, "keepalive task panicked");
    }
}
```

**5. Keepalive Task Only Runs for `can_hibernate` Requests**
## Code Review

I've reviewed this PR and overall the changes look good! The refactoring properly extracts the keepalive logic into a dedicated module and correctly ensures it runs both during active WebSocket connections and hibernation.

### Strengths

### Issues Found

1. Typo in documentation (keepalive_task.rs:10)
2. Missing space in documentation (keepalive_task.rs:13)
3. Comment reference outdated (keepalive_task.rs:30-31)

### Recommendations

### Performance & Security

Overall Assessment: Solid refactoring with minor documentation fixes needed. Core logic is correct and properly integrated. Great work! 🎉
## Code Review

This PR extracts keepalive logic into a separate task and ensures it runs both during active WebSocket connections and during hibernation. Overall, the change is well-structured and improves code organization.

### ✅ Strengths

### 🔍 Code Quality Observations

1. Minor typo in documentation (keepalive_task.rs:10-13). Issues: "pings writes keepalive" should be "writes keepalive pings", missing space before "Only", grammar: "Only ran" should be "Only runs".
2. Code duplication. Both `handle_websocket` and `handle_websocket_hibernation` spawn the keepalive task with nearly identical code (lib.rs:424-428 vs 604-611). Consider extracting to a helper method.
3. Result handling pattern (lib.rs:511-523). The "unlikely case" comment suggests the logic might not handle all edge cases explicitly. Consider documenting what scenarios this covers.
4. Comment mismatch (keepalive_task.rs:30-32). The comment says "we've already called this above" but there's no call above in the extracted function. Consider simplifying to: "Discard the first tick since it fires immediately."
5. Import ordering (lib.rs:10-17). The imports have inconsistent ordering within the braces. Consider grouping alphabetically for consistency.

### 🛡️ Security & Performance

No security concerns identified. The code properly validates inputs, uses safe interval-based timing with jitter to prevent thundering herd, and handles errors appropriately.

### 🧪 Testing

No tests found for the pegboard-gateway package. Consider adding tests for: keepalive task abort behavior, keepalive task continuing to write during both connection phases, and edge cases around hibernation transitions (a sketch of an abort-behavior test follows below).
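For the abort-behavior test suggested above, here is a hedged sketch. It exercises a stand-in loop with the same interval-plus-watch-channel shape rather than the real `keepalive_task::task` (whose signature takes the gateway context and IDs), so it only demonstrates the test structure.

```rust
use std::time::Duration;

use tokio::sync::watch;

// Simplified stand-in for the gateway's lifecycle result.
#[derive(Debug, PartialEq)]
enum LifecycleResult {
    Aborted,
}

// Stand-in keepalive loop: tick an interval until the abort channel fires.
async fn keepalive_stub(mut abort_rx: watch::Receiver<()>) -> Result<LifecycleResult, String> {
    let mut interval = tokio::time::interval(Duration::from_millis(10));
    loop {
        tokio::select! {
            _ = interval.tick() => { /* a real task would upsert the keepalive row here */ }
            _ = abort_rx.changed() => return Ok(LifecycleResult::Aborted),
        }
    }
}

#[tokio::test]
async fn keepalive_stops_promptly_on_abort() {
    let (abort_tx, abort_rx) = watch::channel(());
    let handle = tokio::spawn(keepalive_stub(abort_rx));

    abort_tx.send(()).unwrap();

    // The task should observe the abort signal and return Aborted quickly.
    let res = tokio::time::timeout(Duration::from_secs(1), handle)
        .await
        .expect("keepalive task did not stop after abort")
        .expect("keepalive task panicked");
    assert_eq!(res, Ok(LifecycleResult::Aborted));
}
```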
### 📋 Minor Suggestions

### Summary

This is a solid refactoring that improves code organization and fixes the keepalive behavior for hibernating requests. The main issues are minor documentation and style improvements. The logic appears sound and properly handles the concurrent task lifecycle.

Recommendation: Approve with minor suggestions ✅
## Pull Request Review

### Summary

This PR refactors the keepalive ping mechanism for hibernating WebSocket requests by extracting it into a dedicated task module and ensuring it runs both during active WebSocket connections (when hibernation is possible) and during actual hibernation periods.

### Code Quality ✅

Strengths:

Minor Issues:

### Potential Bugs 🔍

### Performance Considerations ⚡

### Security Concerns 🔒

No significant security issues identified. The code properly:

### Test Coverage
## Pull Request Review

### Summary

This PR refactors the keepalive logic for hibernating WebSocket requests by extracting it into a dedicated task module. The keepalive task now runs during both active WebSocket connections and hibernation periods, ensuring requests maintain their "alive" status consistently.

### ✅ Positive Observations

1. Good Code Organization
2. Consistent Lifecycle Management
3. Proper Hibernation Handling

### 🔍 Issues & Suggestions

**1. Documentation Typo (Minor)**

Location: `keepalive_task.rs`

```rust
/// Periodically pings writes keepalive in UDB. This is used to restore hibernating request IDs on
/// next actor start.
///
///Only ran for hibernating requests.
```

Issues:

Suggestion:

```rust
/// Periodically writes keepalive pings to UDB. This is used to restore hibernating request IDs on
/// next actor start.
///
/// Only ran for hibernating requests.
```

**2. Potential JoinHandle Panic Unwrapping (Medium)**

Location: `lib.rs`

```rust
let res = keepalive.await?;
```

Issue: Current behavior:

Suggestion:

```rust
let res = keepalive.await.unwrap_or_else(|e| {
    tracing::error!(error = ?e, "keepalive task panicked");
    Err(anyhow::anyhow!("keepalive task panicked: {:?}", e))
});
```

Or add a comment:

```rust
// JoinError from panic will be treated as task error
let res = keepalive.await?;
```

**3. Incomplete Result Matching Logic (Medium)**

Location: `lib.rs`

The result matching logic prioritizes errors but doesn't handle all non-aborted result combinations:

```rust
// Prefer non aborted result if all succeed
(Ok(res), Ok(LifecycleResult::Aborted), _, _) => Ok(res),
(Ok(LifecycleResult::Aborted), Ok(res), _, _) => Ok(res),
// Unlikely case
(res, _, _, _) => res,
```

Issues:

Example problematic scenario: the ping or keepalive task completes with a non-aborted result while the first two tasks are aborted. Result: Returns the first task's `Aborted` result instead.

Suggestion:

```rust
let mut lifecycle_res = match (tunnel_to_ws_res, ws_to_tunnel_res, ping_res, keepalive_res) {
    // Prefer error from any task
    (Err(err), _, _, _) | (_, Err(err), _, _) | (_, _, Err(err), _) | (_, _, _, Err(err)) => Err(err),
    // Find first non-aborted result
    (Ok(res @ (LifecycleResult::ServerClose(_) | LifecycleResult::ClientClose(_))), _, _, _) => Ok(res),
    (_, Ok(res @ (LifecycleResult::ServerClose(_) | LifecycleResult::ClientClose(_))), _, _) => Ok(res),
    (_, _, Ok(res @ (LifecycleResult::ServerClose(_) | LifecycleResult::ClientClose(_))), _) => Ok(res),
    (_, _, _, Ok(res @ (LifecycleResult::ServerClose(_) | LifecycleResult::ClientClose(_)))) => Ok(res),
    // All aborted
    _ => Ok(LifecycleResult::Aborted),
};
```

Or use an iterator approach:

```rust
let results = [tunnel_to_ws_res, ws_to_tunnel_res, ping_res, keepalive_res];
let mut lifecycle_res = results
    .into_iter()
    .find_map(|r| match r {
        Err(e) => Some(Err(e)),
        Ok(LifecycleResult::Aborted) => None,
        Ok(res) => Some(Ok(res)),
    })
    .unwrap_or(Ok(LifecycleResult::Aborted));
```

**4. Variable Naming Inconsistency (Minor)**

Location: `lib.rs`

```rust
let ctx_clone = self.ctx.clone();
let actor_id_clone = self.actor_id;
let gateway_id_clone = self.shared_state.gateway_id();
let request_id_clone = request_id;
```

Issue: These variables are only used once when spawning the keepalive task.

Suggestion:

```rust
let keepalive = tokio::spawn(keepalive_task::task(
    self.ctx.clone(),
    self.actor_id,
    self.shared_state.gateway_id(),
    request_id,
    keepalive_abort_rx,
));
```

Or use more descriptive names if kept:

```rust
let keepalive_ctx = self.ctx.clone();
let keepalive_actor_id = self.actor_id;
// etc.
```

**5. Missing Error Context (Minor)**

Location: `keepalive_task.rs`

```rust
ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
    actor_id,
    gateway_id,
    request_id,
})
.await?;
```

Suggestion:

```rust
use anyhow::Context;

ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
    actor_id,
    gateway_id,
    request_id,
})
.await
.context("failed to upsert hibernating request keepalive")?;
```

**6. Race Condition Comment Mismatch (Minor)**

Location: `keepalive_task.rs`

```rust
// Discard the first tick since it fires immediately and we've already called this
// above
ping_interval.tick().await;
```

Issue: In the caller, the initial upsert is performed before the task is spawned:

```rust
self.ctx
    .op(pegboard::ops::actor::hibernating_request::upsert::Input {
        actor_id: self.actor_id,
        gateway_id: self.shared_state.gateway_id(),
        request_id,
    })
    .await?;
```

Suggestion:

```rust
// Discard the first tick since it fires immediately and the initial upsert
// has already been called before starting this task
ping_interval.tick().await;
```

### 🔒 Security Considerations

No security issues identified. The changes maintain the existing security posture.

### 🎯 Performance Considerations

Positive:

Note:

### 🧪 Test Coverage

Missing:

Example test structure:

```rust
#[tokio::test]
async fn test_keepalive_task_aborts_cleanly() {
    let (abort_tx, abort_rx) = watch::channel(());
    let handle = tokio::spawn(keepalive_task::task(
        /* ... */
        abort_rx,
    ));
    abort_tx.send(()).unwrap();
    let result = handle.await.unwrap();
    assert!(matches!(result, Ok(LifecycleResult::Aborted)));
}
```

### 📝 Summary

Recommendation: Address the result matching logic issue (#3) before merging. Other issues are minor and can be addressed at discretion.

### 🔧 Action Items

Must Fix:

Should Fix:

Nice to Have:

Great work on improving the architecture! The refactoring makes the code more maintainable and the keepalive logic is now properly integrated into both connection modes. 🚀
## Code Review

Thanks for this PR! Overall, the changes look good and the refactoring improves code organization. Here's my detailed feedback:

### ✅ Strengths

### 🔍 Code Quality Observations

### 🐛 Potential Issues

### 🔒 Security Considerations

No security concerns identified. The code properly handles abort signals and doesn't introduce any new attack surfaces.

### ⚡ Performance Considerations

### 📝 Test Coverage

I don't see any tests modified or added for this functionality. Consider adding tests for:

### 🎯 Recommendations

High Priority:

Medium Priority:

Low Priority:

### 📊 Overall Assessment

This is a solid refactoring that improves code organization and consistency. The main concern is ensuring the result handling logic is complete and well-tested. Once the comment typos are fixed and the result matching is verified, this should be good to merge!

Recommendation: Approve with minor changes
## Code Review

### Summary

This PR refactors the keepalive task for hibernating requests by extracting it into a separate module and running it both during active WebSocket connections and during hibernation. This ensures that hibernating request IDs are properly maintained in the database.

### Positive Changes

- ✅ Good Separation of Concerns: Extracting the keepalive logic into `keepalive_task.rs` keeps the lifecycle code focused
- ✅ Correct Conditional Spawn: The keepalive task is now conditionally spawned based on `can_hibernate`
- ✅ Proper Task Coordination: The abort signal pattern is consistently implemented across all four concurrent tasks (tunnel_to_ws, ws_to_tunnel, ping, keepalive)
- ✅ Graceful Cleanup: The code properly awaits the keepalive task completion (line 616) instead of aborting it

### Issues & Recommendations

**1. Typo in Documentation Comment (keepalive_task.rs:10-13)**

```rust
/// Periodically pings writes keepalive in UDB. This is used to restore hibernating request IDs on
/// next actor start.
///
///Only ran for hibernating requests.
```

Issues:

Suggested fix:

```rust
/// Periodically writes keepalive in UDB. This is used to restore hibernating request IDs on
/// next actor start.
///
/// Only runs for hibernating requests.
```

**2. Result Preference Logic May Drop Non-Aborted Results (lib.rs:511-523)**

The result preference logic only checks the first two results for non-aborted values:

```rust
(Ok(res), Ok(LifecycleResult::Aborted), _, _) => Ok(res),
(Ok(LifecycleResult::Aborted), Ok(res), _, _) => Ok(res),
// Unlikely case
(res, _, _, _) => res,
```

Issue: If the ping or keepalive task returns a non-aborted result while the first two are aborted, that result is dropped.

Suggested fix:

```rust
// Prefer error
(Err(err), _, _, _) => Err(err),
(_, Err(err), _, _) => Err(err),
(_, _, Err(err), _) => Err(err),
(_, _, _, Err(err)) => Err(err),
// Prefer non-aborted result if all succeed
(Ok(res @ LifecycleResult::ServerClose(_)), _, _, _)
| (_, Ok(res @ LifecycleResult::ServerClose(_)), _, _)
| (_, _, Ok(res @ LifecycleResult::ServerClose(_)), _)
| (_, _, _, Ok(res @ LifecycleResult::ServerClose(_))) => Ok(res),
(Ok(res @ LifecycleResult::ClientClose(_)), _, _, _)
| (_, Ok(res @ LifecycleResult::ClientClose(_)), _, _)
| (_, _, Ok(res @ LifecycleResult::ClientClose(_)), _)
| (_, _, _, Ok(res @ LifecycleResult::ClientClose(_))) => Ok(res),
// All aborted (or unlikely mix)
(res, _, _, _) => res,
```

Alternatively, iterate through all results to find the first non-aborted one.

**3. Minor: Import Ordering (lib.rs:10-17)**

The imports are slightly inconsistent with alphabetical/logical ordering:

```rust
use rivet_guard_core::{
    WebSocketHandle,
    custom_serve::{CustomServeTrait, HibernationResult},
    errors::{ServiceUnavailable, WebSocketServiceUnavailable},
    proxy_service::{ResponseBody, is_ws_hibernate},
    request_context::RequestContext,
    websocket_handle::WebSocketReceiver,
};
```

**4. Missing Error Context (keepalive_task.rs:21-26)**

The `try_into()` error is propagated without context:

```rust
let mut ping_interval = tokio::time::interval(Duration::from_millis(
    (ctx.config()
        .pegboard()
        .hibernating_request_eligible_threshold()
        / 2)
        .try_into()?,
));
```

Recommendation: Add context to the error:

```rust
.try_into()
.context("hibernating_request_eligible_threshold overflow when converting to u64")?,
```

### Performance & Security

✅ Performance: The jitter mechanism (line 43) is good for preventing thundering herd problems.

✅ Security: No security concerns identified. The keepalive task only updates timestamps in the database.

### Test Coverage

### Conclusion

Overall, this is a solid refactoring that improves code organization. The main issues are:

Recommendation: Request changes for items 1 and 2, consider adding item 3, and encourage adding tests in a follow-up PR if not in this one.
## PR Review: Ping Hibernating Requests During Open Connections

This PR refactors the keepalive task for hibernating WebSocket requests. Previously, the keepalive task only ran during actual hibernation. Now it runs continuously during open WebSocket connections (when hibernation is enabled).

### Code Quality - Strengths

### Issues Found

Minor: Documentation Typos

Logic: Result Matching (lib.rs:511-522)

The match expression only handles cases where the first two tasks produce results. If ping_res or keepalive_res contain non-aborted results while both tunnel tasks are aborted, those results would be ignored. Likely not a practical issue given abort signaling. (A standalone sketch of an order-independent precedence check follows after this review.)

### Performance

### Security

No issues identified - no new external inputs, no auth changes, uses existing DB operations.

### Test Coverage

Concern: No tests added. Recommend adding tests for:

### Verdict

Overall: Solid refactoring with good code organization

Approve with minor suggestions

Action items (non-blocking):
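Following up on the result-matching observation above, here is a small, self-contained sketch (not the gateway's actual code) of an order-independent way to pick the final result: scan the four task results in order and return the first error or non-aborted outcome, falling back to `Aborted` only if every task aborted. The simplified `LifecycleResult` enum and the `pick_lifecycle_result` helper are illustrative stand-ins.

```rust
// Stand-in for the gateway's lifecycle outcome; the real variants carry close frames.
#[derive(Debug, PartialEq)]
enum LifecycleResult {
    Aborted,
    ServerClose,
    ClientClose,
}

type TaskResult = Result<LifecycleResult, String>;

// Scans the tasks in order and returns the first error or non-aborted result;
// only if every task aborted does it report Aborted.
fn pick_lifecycle_result(results: [TaskResult; 4]) -> TaskResult {
    results
        .into_iter()
        .find_map(|r| match r {
            Err(e) => Some(Err(e)),
            Ok(LifecycleResult::Aborted) => None,
            Ok(other) => Some(Ok(other)),
        })
        .unwrap_or(Ok(LifecycleResult::Aborted))
}

fn main() {
    // A non-aborted result in the third slot is no longer lost.
    let res = pick_lifecycle_result([
        Ok(LifecycleResult::Aborted),
        Ok(LifecycleResult::Aborted),
        Ok(LifecycleResult::ServerClose),
        Ok(LifecycleResult::Aborted),
    ]);
    assert_eq!(res, Ok(LifecycleResult::ServerClose));

    // An error in an earlier slot still takes precedence over later results.
    let res = pick_lifecycle_result([
        Ok(LifecycleResult::Aborted),
        Err("tunnel closed unexpectedly".to_string()),
        Ok(LifecycleResult::ClientClose),
        Ok(LifecycleResult::Aborted),
    ]);
    assert!(res.is_err());
}
```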
## Code Review

### Overview

This PR refactors the keepalive task for hibernating WebSocket requests by extracting it into a separate module and running it during both active WebSocket connections and hibernation periods. The change ensures that hibernating requests are consistently pinged to maintain their eligibility status.

### ✅ Strengths

### 🔍 Issues & Concerns

**1. Documentation Typo (Minor)**

```rust
/// Periodically pings writes keepalive in UDB. This is used to restore hibernating request IDs on
/// next actor start.
///
///Only ran for hibernating requests.
```

Suggestion:

```rust
/// Periodically writes keepalive data to UDB. This is used to restore hibernating request IDs on
/// next actor start.
///
/// Only run for hibernating requests.
```

**2. Result Precedence Logic May Be Incomplete (Medium)**

```rust
// Prefer non aborted result if all succeed
(Ok(res), Ok(LifecycleResult::Aborted), _, _) => Ok(res),
(Ok(LifecycleResult::Aborted), Ok(res), _, _) => Ok(res),
// Unlikely case
(res, _, _, _) => res,
```

This logic only handles cases where one of the first two tasks returns a non-aborted result. It doesn't handle:

Impact: If the ping or keepalive task completes first with a non-aborted result, it would fall through to the "unlikely case" and return the first task's aborted result, potentially losing important lifecycle information.

Suggestion: Add comprehensive pattern matching:

```rust
// Prefer non aborted result if all succeed
(Ok(res @ LifecycleResult::ServerClose(_)), _, _, _)
| (_, Ok(res @ LifecycleResult::ServerClose(_)), _, _)
| (_, _, Ok(res @ LifecycleResult::ServerClose(_)), _)
| (_, _, _, Ok(res @ LifecycleResult::ServerClose(_))) => Ok(res),
(Ok(res @ LifecycleResult::ClientClose(_)), _, _, _)
| (_, Ok(res @ LifecycleResult::ClientClose(_)), _, _)
| (_, _, Ok(res @ LifecycleResult::ClientClose(_)), _)
| (_, _, _, Ok(res @ LifecycleResult::ClientClose(_))) => Ok(res),
// All aborted
_ => Ok(LifecycleResult::Aborted),
```

**3. Potential Race Condition (Low)**

In `keepalive_task.rs`, the first interval tick is discarded on the assumption that an initial keepalive was already written. While this is unlikely to cause issues (the initial upsert at line 580-587 happens before spawning the task), it's worth verifying that the timing is correct.

**4. Error Handling in TryInto (Low)**

```rust
(ctx.config()
    .pegboard()
    .hibernating_request_eligible_threshold()
    / 2)
    .try_into()?,
```

The error from `try_into()` propagates without context.

Suggestion: Add context:

```rust
.try_into()
.context("hibernating_request_eligible_threshold / 2 overflow")?
```

**5. Import Organization (Style)**

The imports in `lib.rs` were reordered in this diff.

### 🧪 Test Coverage

Concern: No tests were added or modified in this PR. Given the complexity of the lifecycle management and the new keepalive task integration, consider adding tests for:

### 📊 Performance Considerations

The keepalive task runs continuously during active WebSocket connections when `can_hibernate` is true.

### 🔒 Security

No security concerns identified. The keepalive mechanism appears to be a legitimate internal operation.

### 📝 Recommendations

Overall, this is a solid refactoring that improves code organization. The main concern is the result precedence logic which should be addressed before merging.
## PR Review: Hibernating Request Keepalive Enhancement

### Summary

This PR improves the keepalive mechanism for hibernating WebSocket requests by ensuring they are pinged both during active WebSocket connections and during hibernation periods. The implementation extracts the keepalive logic into a dedicated module and runs it as a parallel task.

### Positive Aspects

- ✅ Good architectural separation - Extracting keepalive logic into `keepalive_task.rs` keeps the lifecycle code focused.
- ✅ Proper task lifecycle management - The keepalive task is correctly integrated with abort channels and proper cleanup on task completion.
- ✅ Consistent error handling - Uses the same error handling approach as the other lifecycle tasks.
- ✅ Jitter implementation - The random jitter (0-128ms) prevents thundering herd problems when multiple requests update simultaneously.

### Issues & Recommendations

**🔴 Critical: Missing Space in Comment (keepalive_task.rs:13)**

```rust
///Only ran for hibernating requests.
```

Should be:

```rust
/// Only ran for hibernating requests.
```

Missing space after `///`.

**🟡 Medium: Potential Integer Overflow (keepalive_task.rs:21-26)**

```rust
let mut ping_interval = tokio::time::interval(Duration::from_millis(
    (ctx.config()
        .pegboard()
        .hibernating_request_eligible_threshold()
        / 2)
        .try_into()?,
));
```

Issue: The division by 2 happens on the raw configured threshold, so a very small or non-positive value would produce an unusable interval.

Recommendation: Add validation or use:

```rust
let threshold_ms = ctx.config()
    .pegboard()
    .hibernating_request_eligible_threshold()
    .max(1000) // Ensure minimum reasonable threshold
    / 2;
let ping_interval_ms: u64 = threshold_ms.try_into()
    .context("hibernating_request_eligible_threshold must be positive")?;
```

**🟡 Medium: Task Result Priority Logic (lib.rs:511-523)**

```rust
let mut lifecycle_res = match (tunnel_to_ws_res, ws_to_tunnel_res, ping_res, keepalive_res) {
    // Prefer error
    (Err(err), _, _, _) => Err(err),
    (_, Err(err), _, _) => Err(err),
    (_, _, Err(err), _) => Err(err),
    (_, _, _, Err(err)) => Err(err),
    // Prefer non aborted result if all succeed
    (Ok(res), Ok(LifecycleResult::Aborted), _, _) => Ok(res),
    (Ok(LifecycleResult::Aborted), Ok(res), _, _) => Ok(res),
    // Unlikely case
    (res, _, _, _) => res,
};
```

Issue: The priority logic only checks the first two tasks for non-aborted results. If the ping or keepalive task produces a non-aborted result while the first two are aborted, it is lost.

Recommendation: Make the priority logic exhaustive:

```rust
let mut lifecycle_res = match (tunnel_to_ws_res, ws_to_tunnel_res, ping_res, keepalive_res) {
    // Prefer error
    (Err(err), _, _, _) => Err(err),
    (_, Err(err), _, _) => Err(err),
    (_, _, Err(err), _) => Err(err),
    (_, _, _, Err(err)) => Err(err),
    // Prefer non-aborted result if all succeed
    (Ok(res @ LifecycleResult::ServerClose(_)), _, _, _)
    | (_, Ok(res @ LifecycleResult::ServerClose(_)), _, _)
    | (_, _, Ok(res @ LifecycleResult::ServerClose(_)), _)
    | (_, _, _, Ok(res @ LifecycleResult::ServerClose(_))) => Ok(res),
    (Ok(res @ LifecycleResult::ClientClose(_)), _, _, _)
    | (_, Ok(res @ LifecycleResult::ClientClose(_)), _, _)
    | (_, _, Ok(res @ LifecycleResult::ClientClose(_)), _)
    | (_, _, _, Ok(res @ LifecycleResult::ClientClose(_))) => Ok(res),
    // All aborted
    (Ok(LifecycleResult::Aborted), Ok(LifecycleResult::Aborted), _, _) => Ok(LifecycleResult::Aborted),
    (res, _, _, _) => res,
};
```

Or use a helper function for clearer logic.

**🟢 Minor: Redundant Variable Cloning (lib.rs:424-428)**

```rust
let ctx_clone = self.ctx.clone();
let actor_id_clone = self.actor_id;
let gateway_id_clone = self.shared_state.gateway_id();
let request_id_clone = request_id;
```

Issue: These clones are only used once in the `tokio::spawn` call.

Recommendation: Either inline these into the spawn call or use more descriptive names if they serve a documentation purpose. This is purely stylistic.

**🟢 Minor: Inconsistent Abort Order (lib.rs:439-441, 455-457, 471-473, 499-501)**

The order of abort signals varies across the different task completion handlers. While this doesn't affect correctness (all are independent), consistency would improve readability.

Recommendation: Use consistent ordering, e.g., always send the abort signals in the same sequence.

**🔵 Question: Performance - Unnecessary Task Spawn in Non-Hibernating Case**

In `handle_websocket`, the keepalive future inside the join starts with:

```rust
async {
    if !can_hibernate {
        return Ok(LifecycleResult::Aborted);
    }
    // ... spawn keepalive task
}
```

Question: Could this be optimized to conditionally spawn the task outside the `tokio::join!` block?

```rust
let keepalive_res = if can_hibernate {
    let keepalive = tokio::spawn(keepalive_task::task(...));
    // Handle in join
} else {
    // Skip entirely or create a completed future
};
```

However, this would require restructuring the join logic.

### Test Coverage

### Security & Performance

- ✅ No security concerns - The changes don't introduce new attack vectors.
- ✅ Performance impact minimal - The keepalive task runs at 45-second intervals (default), which is very low overhead.

### Conclusion

Overall Assessment: Approve with minor fixes recommended

The PR successfully addresses the issue of keeping hibernating requests alive during both active connections and hibernation. The code follows existing patterns well and integrates cleanly with the existing task lifecycle management.

Required fixes:

Recommended improvements:

Style improvements (optional):

Great work on improving the hibernation reliability! 🚀
## Code Review - PR #3498

### Summary

This PR refactors the keepalive task for hibernating WebSocket requests by extracting it into a separate module and ensuring it runs both during active WebSocket connections (when `can_hibernate` is true) and during hibernation.

### Positive Aspects

### Issues & Concerns

**1. Typo in Documentation (Minor)**

Location: `keepalive_task.rs`

```rust
///Only ran for hibernating requests.
```

Should be:

```rust
/// Only ran for hibernating requests.
```

Missing space after `///`.

**2. Potential Integer Overflow (Low severity)**

Location: `keepalive_task.rs`

```rust
let mut ping_interval = tokio::time::interval(Duration::from_millis(
    (ctx.config()
        .pegboard()
        .hibernating_request_eligible_threshold()
        / 2)
        .try_into()?,
));
```

While the `try_into()?` already guards the conversion, a clearer alternative:

```rust
let interval_ms = ctx.config()
    .pegboard()
    .hibernating_request_eligible_threshold()
    .checked_div(2)
    .context("invalid hibernating_request_eligible_threshold")?;
let interval_ms: u64 = interval_ms.try_into()?;
let mut ping_interval = tokio::time::interval(Duration::from_millis(interval_ms));
```

**3. Task Ordering Logic Could Be Clearer (Minor)**

Location: The pattern matching logic for determining the final `lifecycle_res`:

```rust
let mut lifecycle_res = match (tunnel_to_ws_res, ws_to_tunnel_res, ping_res, keepalive_res) {
    (Err(err), _, _, _) => Err(err),
    (_, Err(err), _, _) => Err(err),
    (_, _, Err(err), _) => Err(err),
    (_, _, _, Err(err)) => Err(err),
    (Ok(res), Ok(LifecycleResult::Aborted), _, _) => Ok(res),
    (Ok(LifecycleResult::Aborted), Ok(res), _, _) => Ok(res),
    (res, _, _, _) => res,
};
```

This prioritizes errors from tasks in order (tunnel_to_ws > ws_to_tunnel > ping > keepalive), but the last two patterns only handle the first two tasks' results when deciding between Aborted and non-Aborted. The ping and keepalive results are not considered in the "prefer non-aborted" logic. This might be intentional, but it's worth verifying if this is the desired behavior.

**4. Missing Error Context (Minor)**

Location: When the database operation fails, it propagates up without context:

```rust
ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
    actor_id,
    gateway_id,
    request_id,
})
.await?;
```

Consider adding context for debugging:

```rust
ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
    actor_id,
    gateway_id,
    request_id,
})
.await
.context("failed to upsert hibernating request keepalive")?;
```

### Performance Considerations

### Test Coverage

No tests were added or modified in this PR. Consider adding tests for:

### Security Concerns

No security issues identified. The changes maintain the existing security boundaries.

### Recommendations

### Conclusion

The PR successfully addresses the goal of ensuring hibernating requests are pinged during both active connections and hibernation. The code quality is good overall with just minor improvements suggested. The refactoring improves maintainability by extracting the keepalive logic into its own module.

Status: Approve with minor suggestions for improvement.
## Code Review

### Overview

This PR extracts the keepalive logic for hibernating requests into a dedicated task module and ensures it runs both during active WebSocket connections and during hibernation. This is a good refactoring that improves code organization and fixes a potential issue with keepalive timing.

### Positive Aspects

- ✅ Good Refactoring: Extracting the keepalive logic into `keepalive_task.rs` improves separation of concerns.
- ✅ Proper Cancellation: The task correctly uses a watch channel to observe the abort signal.
- ✅ Consistent Pattern: The implementation follows the same abort/cleanup pattern as the other lifecycle tasks (tunnel_to_ws, ws_to_tunnel, ping).
- ✅ Anti-Stampede: The jitter mechanism (0-128ms) prevents synchronized database writes (a sketch of this interval-plus-jitter pattern follows after this review).

### Issues Found

**🐛 Critical: Typo in Documentation (Line 13)**

```rust
///Only ran for hibernating requests.
```

Missing space after `///`. Should be:

```rust
/// Only ran for hibernating requests.
```

**🐛 Critical: Comment Inaccuracy (Line 30-31)**

```rust
// Discard the first tick since it fires immediately and we've already called this
// above
```

This comment refers to "already called this above", but that's incorrect in this context: in the hibernation path there is no such call directly above in the extracted task.

Recommendation: Update the comment to be more accurate:

```rust
// Discard the first tick since it fires immediately
```
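To make the interval-and-jitter behavior discussed in the reviews concrete, here is a small, self-contained sketch (not the gateway's code). The threshold value, the loop count, and the `rand::random` jitter are illustrative assumptions; the real task reads `hibernating_request_eligible_threshold` from config and writes a keepalive row instead of printing.

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Illustrative threshold; the real value comes from the pegboard config's
    // hibernating_request_eligible_threshold (milliseconds).
    let eligible_threshold_ms: u64 = 10_000;

    // Tick at half the eligibility threshold so every entry is refreshed well
    // before it could be considered expired.
    let mut ping_interval =
        tokio::time::interval(Duration::from_millis(eligible_threshold_ms / 2));

    // The first tick fires immediately; skip it when an initial keepalive has
    // already been written before this loop starts.
    ping_interval.tick().await;

    for _ in 0..3 {
        ping_interval.tick().await;
        // Small random jitter (0-128 ms) so many connections don't hit the
        // database at exactly the same instant.
        let jitter_ms = (rand::random::<u8>() % 128) as u64;
        tokio::time::sleep(Duration::from_millis(jitter_ms)).await;
        println!("write keepalive row here");
    }
}
```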