fix(pegboard-gateway): ping hibernating requests both during open hws connections and during hibernation #3498
Conversation
## Code Review

I've reviewed this PR and have the following feedback:

### Overview

This PR extracts the keepalive task into a separate module and ensures it runs both during active WebSocket connections and during hibernation. The goal is to prevent hibernating requests from being garbage collected.

### Positive Changes

### Issues and Concerns

**1. Documentation typo (Minor)**

Line 13 in keepalive_task.rs - Missing space in comment. Should be "/// Only ran for hibernating requests." instead of "///Only ran for hibernating requests."

**2. Potential unnecessary work during active connections (Performance)**

Lines 480-507 in lib.rs - The keepalive task now runs during active WebSocket connections when can_hibernate is true. Previously, it only ran during actual hibernation (handle_websocket_hibernation).

Question: Is it necessary to ping the hibernating_request UDB entry while the connection is still active? This adds database writes during normal operation. The original implementation only ran the keepalive task during actual hibernation (lines 603-616), which seems more efficient.

Recommendation: Consider whether the keepalive task should only run during actual hibernation, not during active connections. If it must run during active connections, please document why this change is necessary in the PR description or code comments.

**3. Task spawning pattern (Code Quality)**

Lines 485-491 in lib.rs - The keepalive task is spawned inside a tokio::join! future, which is different from how the other three tasks are spawned (lines 398-418). For consistency, consider spawning it outside the join block conditionally and then awaiting it inside, similar to how tunnel_to_ws, ws_to_tunnel, and ping tasks are handled (a sketch of this pattern follows below).

**4. Result matching could be more robust (Correctness)**

Lines 511-523 in lib.rs - The result matching logic prefers the first non-aborted result, but doesn't explicitly handle cases where multiple tasks return non-aborted results. Consider being more explicit about the precedence for different LifecycleResult variants (ServerClose vs ClientClose vs Aborted).

**5. Clone variables optimization (Minor Performance)**

Lines 424-428 in lib.rs - Several clones are performed that may not be necessary since most of these types are cheap to clone (Ids are Copy types, ctx is an Arc), but consider capturing them directly in the async block if possible.
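Following up on issue 3, here is a minimal, self-contained sketch of conditionally spawning a keepalive-style task up front and awaiting it on shutdown. The `keepalive_stub` function, the simplified `LifecycleResult`, and the hard-coded `can_hibernate` flag are illustrative stand-ins, not the gateway's actual code.

```rust
use std::time::Duration;

use tokio::{sync::watch, task::JoinHandle};

// Stand-in for the gateway's lifecycle result type.
#[derive(Debug)]
enum LifecycleResult {
    Aborted,
}

// Stand-in for keepalive_task::task: a real loop would tick an interval and
// upsert the hibernating_request row; this stub just waits for the abort signal.
async fn keepalive_stub(mut abort_rx: watch::Receiver<()>) -> Result<LifecycleResult, String> {
    let _ = abort_rx.changed().await;
    Ok(LifecycleResult::Aborted)
}

#[tokio::main]
async fn main() {
    let can_hibernate = true;
    let (abort_tx, abort_rx) = watch::channel(());

    // Spawn conditionally, outside the join block, mirroring how the other
    // lifecycle tasks are spawned up front.
    let keepalive_handle: Option<JoinHandle<Result<LifecycleResult, String>>> =
        can_hibernate.then(|| tokio::spawn(keepalive_stub(abort_rx)));

    // ... tunnel_to_ws, ws_to_tunnel, and ping would be joined here ...
    tokio::time::sleep(Duration::from_millis(50)).await;

    // Signal shutdown and await the task only if it was started.
    let _ = abort_tx.send(());
    if let Some(handle) = keepalive_handle {
        match handle.await {
            Ok(res) => println!("keepalive finished: {res:?}"),
            Err(err) => eprintln!("keepalive task panicked: {err:?}"),
        }
    }
}
```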
### Testing Recommendations

### Security/Safety

No security concerns identified. The changes maintain proper error handling and resource cleanup.

### Verdict

The core fix looks sound, but there are some architectural questions (particularly issue 2 about running during active connections) that should be addressed before merging. The code quality could be improved with more consistent patterns and better documentation of the design decisions.
## Code Review - PR #3498

### Summary

This PR refactors the keepalive task for hibernating requests by extracting it into a separate module and ensuring it runs both during active WebSocket connections and during hibernation. Overall, the changes look good and follow the repository's patterns well.

### Positive Observations

- ✅ Good refactoring: Extracting the keepalive logic into `keepalive_task.rs` improves separation of concerns
- ✅ Consistent patterns: The new task follows the same pattern as the other lifecycle tasks
- ✅ Proper lifecycle management: The keepalive task is correctly spawned and cleaned up in both `handle_websocket` and `handle_websocket_hibernation`
- ✅ Jitter implementation: Good use of jitter (0-128ms) to prevent thundering herd problems
- ✅ Import organization: Follows the repository convention of keeping imports at the top of the file

### Issues & Suggestions

**1. Typo in documentation (Minor)**

Location: `keepalive_task.rs`

```rust
/// Periodically pings writes keepalive in UDB. This is used to restore hibernating request IDs on
```

Should be:

```rust
/// Periodically writes keepalive pings in UDB. This is used to restore hibernating request IDs on
```

**2. Formatting issue in documentation (Minor)**

Location: `keepalive_task.rs`

There's a missing space:

```rust
///Only ran for hibernating requests.
```

Should be:

```rust
/// Only ran for hibernating requests.
```

**3. Inconsistent lifecycle result handling (Moderate)**

Location: `lib.rs`

The lifecycle result matching logic shows asymmetry. When multiple tasks complete successfully but with different results, the code only checks the first two positions:

```rust
(Ok(res), Ok(LifecycleResult::Aborted), _, _) => Ok(res),
(Ok(LifecycleResult::Aborted), Ok(res), _, _) => Ok(res),
// Unlikely case
(res, _, _, _) => res,
```

Potential issue: If the ping or keepalive task returns a non-aborted result while the first two tasks are aborted, that result is silently dropped.

Suggestion: Consider a more comprehensive pattern that checks all positions for non-Aborted results:

```rust
match (tunnel_to_ws_res, ws_to_tunnel_res, ping_res, keepalive_res) {
    // Prefer error
    (Err(err), _, _, _) => Err(err),
    (_, Err(err), _, _) => Err(err),
    (_, _, Err(err), _) => Err(err),
    (_, _, _, Err(err)) => Err(err),
    // Find first non-aborted result
    (Ok(res), _, _, _) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
    (_, Ok(res), _, _) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
    (_, _, Ok(res), _) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
    (_, _, _, Ok(res)) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
    // All aborted (unlikely)
    (res, _, _, _) => res,
}
```

**4. Error handling observation (Low)**

Location: `keepalive_task.rs`

The error from `try_into()` could carry more context, e.g.:

```rust
let mut ping_interval = tokio::time::interval(Duration::from_millis(
    (ctx.config()
        .pegboard()
        .hibernating_request_eligible_threshold()
        / 2)
        .try_into()
        .context("hibernating_request_eligible_threshold too large for interval")?,
));
```

However, this is optional since the threshold would need to be absurdly large to cause issues.

**5. Missing context on abort handling (Low)**

Location: `keepalive_task.rs`

The abort handling via the watch channel:

```rust
_ = keepalive_abort_rx.changed() => {
    tracing::debug!("keepalive task aborted");
    return Ok(LifecycleResult::Aborted);
}
```

### Performance Considerations

- ✅ The keepalive interval calculation (threshold / 2) is reasonable
- ✅ Using `tokio::time::interval` for the periodic writes is appropriate
- ✅ Jitter range (0-128ms) is reasonable for preventing synchronization

### Security Considerations

✅ No security concerns identified

### Test Coverage

### Verdict

Recommendation: Approve with minor fixes

The changes are well-structured and solve the stated problem effectively. The only critical item is the documentation typo, which is trivial to fix. The lifecycle result handling observation is worth considering but not blocking since the current logic works for the common case.

Great work on maintaining consistency with existing patterns! 🎯
## PR Review: Keepalive Task Refactoring

### Summary

This PR extracts keepalive logic into a dedicated task module and ensures hibernating requests are pinged both during active WebSocket connections and during hibernation. The refactoring improves code organization and fixes a potential bug where keepalive pings weren't being sent during active connections.

### Code Quality & Best Practices ✅

Strengths:

Minor Issues:

### Potential Bugs 🐛

Medium Priority:

### Performance Considerations ⚡

Good:

Potential Improvement:

### Security Concerns 🔒

No major security issues identified. Minor observations:

### Test Coverage 🧪

Critical Gap:

Recommendations:

### Architecture & Design 🏗️

Excellent:

### Adherence to CLAUDE.md Standards ✅

### Summary & Recommendations

Approve with minor changes recommended.

The core logic is sound and this is a good refactoring. The issues identified are relatively minor but should be addressed to improve robustness.

Priority order:
## Code Review

### Overview

This PR refactors the keepalive logic for hibernating WebSocket requests by extracting it into a separate module and running it alongside other lifecycle tasks. The change improves architecture and ensures hibernating requests are properly maintained both during active connections and during hibernation.

### Positive Aspects ✅

### Issues and Suggestions 🔍

**1. Typo in Comment (Minor)**

Location: `keepalive_task.rs`

```rust
///Only ran for hibernating requests.
```

Issue: Missing space after `///`.

Fix: Should be `/// Only ran for hibernating requests.`

**2. Import Organization (Style)**

Location: `lib.rs`

The diff shows imports were reordered:

```rust
use rivet_guard_core::{
    WebSocketHandle,
    custom_serve::{CustomServeTrait, HibernationResult},
    errors::{ServiceUnavailable, WebSocketServiceUnavailable},
    proxy_service::{ResponseBody, is_ws_hibernate},
    request_context::RequestContext,
    websocket_handle::WebSocketReceiver,
};
```

Issue: While this appears to be an auto-formatting change, the CLAUDE.md explicitly states the repository's import conventions.

Recommendation: Verify if these import reorderings were intentional or accidental. If accidental, they should be reverted to avoid conflicts with the team's formatting workflow.

**3. Potential Integer Overflow (Low Risk)**

Location: `keepalive_task.rs`

```rust
let mut ping_interval = tokio::time::interval(Duration::from_millis(
    (ctx.config()
        .pegboard()
        .hibernating_request_eligible_threshold()
        / 2)
        .try_into()?,
));
```

Issue: The code divides `hibernating_request_eligible_threshold()` by 2 and then converts the result with `try_into()`, which can fail for out-of-range values.

Consideration:

Recommendation: Current implementation is acceptable, but consider adding a debug assertion or comment explaining the assumption.

**4. Task Lifecycle Abort Pattern Inconsistency (Minor)**

Location: In `handle_websocket_hibernation`:

```rust
let _ = keepalive_abort_tx.send(());
let _ = keepalive_handle.await;
```

Compare to the old implementation in the same function:

```rust
keepalive_handle.abort();
```

Observation: The new pattern is cleaner and more consistent with the other tasks.

Recommendation: Consider whether you want to log if the keepalive task panicked:

```rust
if let Err(e) = keepalive_handle.await {
    if e.is_panic() {
        tracing::error!(error = ?e, "keepalive task panicked");
    }
}
```

**5. Keepalive Task Only Runs for `can_hibernate` Requests**
## Code Review

I've reviewed this PR and overall the changes look good! The refactoring properly extracts the keepalive logic into a dedicated module and correctly ensures it runs both during active WebSocket connections and hibernation.

### Strengths

### Issues Found

1. Typo in documentation (keepalive_task.rs:10)
2. Missing space in documentation (keepalive_task.rs:13)
3. Comment reference outdated (keepalive_task.rs:30-31)

### Recommendations

### Performance & Security

Overall Assessment: Solid refactoring with minor documentation fixes needed. Core logic is correct and properly integrated. Great work! 🎉
## Code Review

This PR extracts keepalive logic into a separate task and ensures it runs both during active WebSocket connections and during hibernation. Overall, the change is well-structured and improves code organization.

### ✅ Strengths

### 🔍 Code Quality Observations

1. Minor typo in documentation (keepalive_task.rs:10-13). Issues: "pings writes keepalive" should be "writes keepalive pings", missing space before "Only", grammar: "Only ran" should be "Only runs".
2. Code duplication. Both `handle_websocket` and `handle_websocket_hibernation` spawn the keepalive task with nearly identical code (lib.rs:424-428 vs 604-611). Consider extracting to a helper method.
3. Result handling pattern (lib.rs:511-523). The "unlikely case" comment suggests the logic might not handle all edge cases explicitly. Consider documenting what scenarios this covers.
4. Comment mismatch (keepalive_task.rs:30-32). The comment says "we've already called this above" but there's no call above in the extracted function. Consider simplifying to: "Discard the first tick since it fires immediately."
5. Import ordering (lib.rs:10-17). The imports have inconsistent ordering within the braces. Consider grouping alphabetically for consistency.

### 🛡️ Security & Performance

No security concerns identified. The code properly validates inputs, uses safe interval-based timing with jitter to prevent thundering herd, and handles errors appropriately.

### 🧪 Testing

No tests found for the pegboard-gateway package. Consider adding tests for: keepalive task abort behavior, keepalive task continuing to write during both connection phases, and edge cases around hibernation transitions (a sketch of an abort-behavior test follows below).
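For the abort-behavior test suggested above, here is a hedged sketch. It exercises a stand-in loop with the same interval-plus-watch-channel shape rather than the real `keepalive_task::task` (whose signature takes the gateway context and IDs), so it only demonstrates the test structure.

```rust
use std::time::Duration;

use tokio::sync::watch;

// Simplified stand-in for the gateway's lifecycle result.
#[derive(Debug, PartialEq)]
enum LifecycleResult {
    Aborted,
}

// Stand-in keepalive loop: tick an interval until the abort channel fires.
async fn keepalive_stub(mut abort_rx: watch::Receiver<()>) -> Result<LifecycleResult, String> {
    let mut interval = tokio::time::interval(Duration::from_millis(10));
    loop {
        tokio::select! {
            _ = interval.tick() => { /* a real task would upsert the keepalive row here */ }
            _ = abort_rx.changed() => return Ok(LifecycleResult::Aborted),
        }
    }
}

#[tokio::test]
async fn keepalive_stops_promptly_on_abort() {
    let (abort_tx, abort_rx) = watch::channel(());
    let handle = tokio::spawn(keepalive_stub(abort_rx));

    abort_tx.send(()).unwrap();

    // The task should observe the abort signal and return Aborted quickly.
    let res = tokio::time::timeout(Duration::from_secs(1), handle)
        .await
        .expect("keepalive task did not stop after abort")
        .expect("keepalive task panicked");
    assert_eq!(res, Ok(LifecycleResult::Aborted));
}
```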
### 📋 Minor Suggestions

### Summary

This is a solid refactoring that improves code organization and fixes the keepalive behavior for hibernating requests. The main issues are minor documentation and style improvements. The logic appears sound and properly handles the concurrent task lifecycle.

Recommendation: Approve with minor suggestions ✅
## Pull Request Review

### Summary

This PR refactors the keepalive ping mechanism for hibernating WebSocket requests by extracting it into a dedicated task module and ensuring it runs both during active WebSocket connections (when hibernation is possible) and during actual hibernation periods.

### Code Quality ✅

Strengths:

Minor Issues:

### Potential Bugs 🔍

### Performance Considerations ⚡

### Security Concerns 🔒

No significant security issues identified. The code properly:

### Test Coverage
## Pull Request Review

### Summary

This PR refactors the keepalive logic for hibernating WebSocket requests by extracting it into a dedicated task module. The keepalive task now runs during both active WebSocket connections and hibernation periods, ensuring requests maintain their "alive" status consistently.

### ✅ Positive Observations

1. Good Code Organization
2. Consistent Lifecycle Management
3. Proper Hibernation Handling

### 🔍 Issues & Suggestions

**1. Documentation Typo (Minor)**

Location: `keepalive_task.rs`

```rust
/// Periodically pings writes keepalive in UDB. This is used to restore hibernating request IDs on
/// next actor start.
///
///Only ran for hibernating requests.
```

Issues:

Suggestion:

```rust
/// Periodically writes keepalive pings to UDB. This is used to restore hibernating request IDs on
/// next actor start.
///
/// Only ran for hibernating requests.
```

**2. Potential JoinHandle Panic Unwrapping (Medium)**

Location: `lib.rs`

```rust
let res = keepalive.await?;
```

Issue: Current behavior:

Suggestion:

```rust
let res = keepalive.await.unwrap_or_else(|e| {
    tracing::error!(error = ?e, "keepalive task panicked");
    Err(anyhow::anyhow!("keepalive task panicked: {:?}", e))
});
```

Or add a comment:

```rust
// JoinError from panic will be treated as task error
let res = keepalive.await?;
```

**3. Incomplete Result Matching Logic (Medium)**

Location: `lib.rs`

The result matching logic prioritizes errors but doesn't handle all non-aborted result combinations:

```rust
// Prefer non aborted result if all succeed
(Ok(res), Ok(LifecycleResult::Aborted), _, _) => Ok(res),
(Ok(LifecycleResult::Aborted), Ok(res), _, _) => Ok(res),
// Unlikely case
(res, _, _, _) => res,
```

Issues:

Example problematic scenario: the ping or keepalive task completes with a non-aborted result while the first two tasks are aborted. Result: Returns the first task's `Aborted` result instead.

Suggestion:

```rust
let mut lifecycle_res = match (tunnel_to_ws_res, ws_to_tunnel_res, ping_res, keepalive_res) {
    // Prefer error from any task
    (Err(err), _, _, _) | (_, Err(err), _, _) | (_, _, Err(err), _) | (_, _, _, Err(err)) => Err(err),
    // Find first non-aborted result
    (Ok(res @ (LifecycleResult::ServerClose(_) | LifecycleResult::ClientClose(_))), _, _, _) => Ok(res),
    (_, Ok(res @ (LifecycleResult::ServerClose(_) | LifecycleResult::ClientClose(_))), _, _) => Ok(res),
    (_, _, Ok(res @ (LifecycleResult::ServerClose(_) | LifecycleResult::ClientClose(_))), _) => Ok(res),
    (_, _, _, Ok(res @ (LifecycleResult::ServerClose(_) | LifecycleResult::ClientClose(_)))) => Ok(res),
    // All aborted
    _ => Ok(LifecycleResult::Aborted),
};
```

Or use an iterator approach:

```rust
let results = [tunnel_to_ws_res, ws_to_tunnel_res, ping_res, keepalive_res];
let mut lifecycle_res = results
    .into_iter()
    .find_map(|r| match r {
        Err(e) => Some(Err(e)),
        Ok(LifecycleResult::Aborted) => None,
        Ok(res) => Some(Ok(res)),
    })
    .unwrap_or(Ok(LifecycleResult::Aborted));
```

**4. Variable Naming Inconsistency (Minor)**

Location: `lib.rs`

```rust
let ctx_clone = self.ctx.clone();
let actor_id_clone = self.actor_id;
let gateway_id_clone = self.shared_state.gateway_id();
let request_id_clone = request_id;
```

Issue: These variables are only used once when spawning the keepalive task.

Suggestion:

```rust
let keepalive = tokio::spawn(keepalive_task::task(
    self.ctx.clone(),
    self.actor_id,
    self.shared_state.gateway_id(),
    request_id,
    keepalive_abort_rx,
));
```

Or use more descriptive names if kept:

```rust
let keepalive_ctx = self.ctx.clone();
let keepalive_actor_id = self.actor_id;
// etc.
```

**5. Missing Error Context (Minor)**

Location: `keepalive_task.rs`

```rust
ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
    actor_id,
    gateway_id,
    request_id,
})
.await?;
```

Suggestion:

```rust
use anyhow::Context;

ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
    actor_id,
    gateway_id,
    request_id,
})
.await
.context("failed to upsert hibernating request keepalive")?;
```

**6. Race Condition Comment Mismatch (Minor)**

Location: `keepalive_task.rs`

```rust
// Discard the first tick since it fires immediately and we've already called this
// above
ping_interval.tick().await;
```

Issue: In the caller, the initial upsert is performed before the task is spawned:

```rust
self.ctx
    .op(pegboard::ops::actor::hibernating_request::upsert::Input {
        actor_id: self.actor_id,
        gateway_id: self.shared_state.gateway_id(),
        request_id,
    })
    .await?;
```

Suggestion:

```rust
// Discard the first tick since it fires immediately and the initial upsert
// has already been called before starting this task
ping_interval.tick().await;
```

### 🔒 Security Considerations

No security issues identified. The changes maintain the existing security posture.

### 🎯 Performance Considerations

Positive:

Note:

### 🧪 Test Coverage

Missing:

Example test structure:

```rust
#[tokio::test]
async fn test_keepalive_task_aborts_cleanly() {
    let (abort_tx, abort_rx) = watch::channel(());
    let handle = tokio::spawn(keepalive_task::task(
        /* ... */
        abort_rx,
    ));
    abort_tx.send(()).unwrap();
    let result = handle.await.unwrap();
    assert!(matches!(result, Ok(LifecycleResult::Aborted)));
}
```

### 📝 Summary

Recommendation: Address the result matching logic issue (#3) before merging. Other issues are minor and can be addressed at discretion.

### 🔧 Action Items

Must Fix:

Should Fix:

Nice to Have:

Great work on improving the architecture! The refactoring makes the code more maintainable and the keepalive logic is now properly integrated into both connection modes. 🚀
## Code Review

Thanks for this PR! Overall, the changes look good and the refactoring improves code organization. Here's my detailed feedback:

### ✅ Strengths

### 🔍 Code Quality Observations

### 🐛 Potential Issues

### 🔒 Security Considerations

No security concerns identified. The code properly handles abort signals and doesn't introduce any new attack surfaces.

### ⚡ Performance Considerations

### 📝 Test Coverage

I don't see any tests modified or added for this functionality. Consider adding tests for:

### 🎯 Recommendations

High Priority:

Medium Priority:

Low Priority:

### 📊 Overall Assessment

This is a solid refactoring that improves code organization and consistency. The main concern is ensuring the result handling logic is complete and well-tested. Once the comment typos are fixed and the result matching is verified, this should be good to merge!

Recommendation: Approve with minor changes
## Code Review

### Summary

This PR refactors the keepalive task for hibernating requests by extracting it into a separate module and running it both during active WebSocket connections and during hibernation. This ensures that hibernating request IDs are properly maintained in the database.

### Positive Changes

- ✅ Good Separation of Concerns: Extracting the keepalive logic into `keepalive_task.rs` keeps the lifecycle code focused
- ✅ Correct Conditional Spawn: The keepalive task is now conditionally spawned based on `can_hibernate`
- ✅ Proper Task Coordination: The abort signal pattern is consistently implemented across all four concurrent tasks (tunnel_to_ws, ws_to_tunnel, ping, keepalive)
- ✅ Graceful Cleanup: The code properly awaits the keepalive task completion (line 616) instead of aborting it

### Issues & Recommendations

**1. Typo in Documentation Comment (keepalive_task.rs:10-13)**

```rust
/// Periodically pings writes keepalive in UDB. This is used to restore hibernating request IDs on
/// next actor start.
///
///Only ran for hibernating requests.
```

Issues:

Suggested fix:

```rust
/// Periodically writes keepalive in UDB. This is used to restore hibernating request IDs on
/// next actor start.
///
/// Only runs for hibernating requests.
```

**2. Result Preference Logic May Drop Non-Aborted Results (lib.rs:511-523)**

The result preference logic only checks the first two results for non-aborted values:

```rust
(Ok(res), Ok(LifecycleResult::Aborted), _, _) => Ok(res),
(Ok(LifecycleResult::Aborted), Ok(res), _, _) => Ok(res),
// Unlikely case
(res, _, _, _) => res,
```

Issue: If the ping or keepalive task returns a non-aborted result while the first two are aborted, that result is dropped.

Suggested fix:

```rust
// Prefer error
(Err(err), _, _, _) => Err(err),
(_, Err(err), _, _) => Err(err),
(_, _, Err(err), _) => Err(err),
(_, _, _, Err(err)) => Err(err),
// Prefer non-aborted result if all succeed
(Ok(res @ LifecycleResult::ServerClose(_)), _, _, _)
| (_, Ok(res @ LifecycleResult::ServerClose(_)), _, _)
| (_, _, Ok(res @ LifecycleResult::ServerClose(_)), _)
| (_, _, _, Ok(res @ LifecycleResult::ServerClose(_))) => Ok(res),
(Ok(res @ LifecycleResult::ClientClose(_)), _, _, _)
| (_, Ok(res @ LifecycleResult::ClientClose(_)), _, _)
| (_, _, Ok(res @ LifecycleResult::ClientClose(_)), _)
| (_, _, _, Ok(res @ LifecycleResult::ClientClose(_))) => Ok(res),
// All aborted (or unlikely mix)
(res, _, _, _) => res,
```

Alternatively, iterate through all results to find the first non-aborted one.

**3. Minor: Import Ordering (lib.rs:10-17)**

The imports are slightly inconsistent with alphabetical/logical ordering:

```rust
use rivet_guard_core::{
    WebSocketHandle,
    custom_serve::{CustomServeTrait, HibernationResult},
    errors::{ServiceUnavailable, WebSocketServiceUnavailable},
    proxy_service::{ResponseBody, is_ws_hibernate},
    request_context::RequestContext,
    websocket_handle::WebSocketReceiver,
};
```

**4. Missing Error Context (keepalive_task.rs:21-26)**

The `try_into()` error is propagated without context:

```rust
let mut ping_interval = tokio::time::interval(Duration::from_millis(
    (ctx.config()
        .pegboard()
        .hibernating_request_eligible_threshold()
        / 2)
        .try_into()?,
));
```

Recommendation: Add context to the error:

```rust
.try_into()
.context("hibernating_request_eligible_threshold overflow when converting to u64")?,
```

### Performance & Security

✅ Performance: The jitter mechanism (line 43) is good for preventing thundering herd problems.

✅ Security: No security concerns identified. The keepalive task only updates timestamps in the database.

### Test Coverage

### Conclusion

Overall, this is a solid refactoring that improves code organization. The main issues are:

Recommendation: Request changes for items 1 and 2, consider adding item 3, and encourage adding tests in a follow-up PR if not in this one.
## PR Review: Ping Hibernating Requests During Open Connections

This PR refactors the keepalive task for hibernating WebSocket requests. Previously, the keepalive task only ran during actual hibernation. Now it runs continuously during open WebSocket connections (when hibernation is enabled).

### Code Quality - Strengths

### Issues Found

Minor: Documentation Typos

Logic: Result Matching (lib.rs:511-522)

The match expression only handles cases where the first two tasks produce results. If ping_res or keepalive_res contain non-aborted results while both tunnel tasks are aborted, those results would be ignored. Likely not a practical issue given abort signaling. (A standalone sketch of an order-independent precedence check follows after this review.)

### Performance

### Security

No issues identified - no new external inputs, no auth changes, uses existing DB operations.

### Test Coverage

Concern: No tests added. Recommend adding tests for:

### Verdict

Overall: Solid refactoring with good code organization

Approve with minor suggestions

Action items (non-blocking):
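Following up on the result-matching observation above, here is a small, self-contained sketch (not the gateway's actual code) of an order-independent way to pick the final result: scan the four task results in order and return the first error or non-aborted outcome, falling back to `Aborted` only if every task aborted. The simplified `LifecycleResult` enum and the `pick_lifecycle_result` helper are illustrative stand-ins.

```rust
// Stand-in for the gateway's lifecycle outcome; the real variants carry close frames.
#[derive(Debug, PartialEq)]
enum LifecycleResult {
    Aborted,
    ServerClose,
    ClientClose,
}

type TaskResult = Result<LifecycleResult, String>;

// Scans the tasks in order and returns the first error or non-aborted result;
// only if every task aborted does it report Aborted.
fn pick_lifecycle_result(results: [TaskResult; 4]) -> TaskResult {
    results
        .into_iter()
        .find_map(|r| match r {
            Err(e) => Some(Err(e)),
            Ok(LifecycleResult::Aborted) => None,
            Ok(other) => Some(Ok(other)),
        })
        .unwrap_or(Ok(LifecycleResult::Aborted))
}

fn main() {
    // A non-aborted result in the third slot is no longer lost.
    let res = pick_lifecycle_result([
        Ok(LifecycleResult::Aborted),
        Ok(LifecycleResult::Aborted),
        Ok(LifecycleResult::ServerClose),
        Ok(LifecycleResult::Aborted),
    ]);
    assert_eq!(res, Ok(LifecycleResult::ServerClose));

    // An error in an earlier slot still takes precedence over later results.
    let res = pick_lifecycle_result([
        Ok(LifecycleResult::Aborted),
        Err("tunnel closed unexpectedly".to_string()),
        Ok(LifecycleResult::ClientClose),
        Ok(LifecycleResult::Aborted),
    ]);
    assert!(res.is_err());
}
```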
## Code Review

### Overview

This PR refactors the keepalive task for hibernating WebSocket requests by extracting it into a separate module and running it during both active WebSocket connections and hibernation periods. The change ensures that hibernating requests are consistently pinged to maintain their eligibility status.

### ✅ Strengths

### 🔍 Issues & Concerns

**1. Documentation Typo (Minor)**

```rust
/// Periodically pings writes keepalive in UDB. This is used to restore hibernating request IDs on
/// next actor start.
///
///Only ran for hibernating requests.
```

Suggestion:

```rust
/// Periodically writes keepalive data to UDB. This is used to restore hibernating request IDs on
/// next actor start.
///
/// Only run for hibernating requests.
```

**2. Result Precedence Logic May Be Incomplete (Medium)**

```rust
// Prefer non aborted result if all succeed
(Ok(res), Ok(LifecycleResult::Aborted), _, _) => Ok(res),
(Ok(LifecycleResult::Aborted), Ok(res), _, _) => Ok(res),
// Unlikely case
(res, _, _, _) => res,
```

This logic only handles cases where one of the first two tasks returns a non-aborted result. It doesn't handle:

Impact: If the ping or keepalive task completes first with a non-aborted result, it would fall through to the "unlikely case" and return the first task's aborted result, potentially losing important lifecycle information.

Suggestion: Add comprehensive pattern matching:

```rust
// Prefer non aborted result if all succeed
(Ok(res @ LifecycleResult::ServerClose(_)), _, _, _)
| (_, Ok(res @ LifecycleResult::ServerClose(_)), _, _)
| (_, _, Ok(res @ LifecycleResult::ServerClose(_)), _)
| (_, _, _, Ok(res @ LifecycleResult::ServerClose(_))) => Ok(res),
(Ok(res @ LifecycleResult::ClientClose(_)), _, _, _)
| (_, Ok(res @ LifecycleResult::ClientClose(_)), _, _)
| (_, _, Ok(res @ LifecycleResult::ClientClose(_)), _)
| (_, _, _, Ok(res @ LifecycleResult::ClientClose(_))) => Ok(res),
// All aborted
_ => Ok(LifecycleResult::Aborted),
```

**3. Potential Race Condition (Low)**

In `keepalive_task.rs`, the first interval tick is discarded on the assumption that an initial keepalive was already written. While this is unlikely to cause issues (the initial upsert at line 580-587 happens before spawning the task), it's worth verifying that the timing is correct.

**4. Error Handling in TryInto (Low)**

```rust
(ctx.config()
    .pegboard()
    .hibernating_request_eligible_threshold()
    / 2)
    .try_into()?,
```

The error from `try_into()` propagates without context.

Suggestion: Add context:

```rust
.try_into()
.context("hibernating_request_eligible_threshold / 2 overflow")?
```

**5. Import Organization (Style)**

The imports in `lib.rs` were reordered in this diff.

### 🧪 Test Coverage

Concern: No tests were added or modified in this PR. Given the complexity of the lifecycle management and the new keepalive task integration, consider adding tests for:

### 📊 Performance Considerations

The keepalive task runs continuously during active WebSocket connections when `can_hibernate` is true.

### 🔒 Security

No security concerns identified. The keepalive mechanism appears to be a legitimate internal operation.

### 📝 Recommendations

Overall, this is a solid refactoring that improves code organization. The main concern is the result precedence logic which should be addressed before merging.
## PR Review: Hibernating Request Keepalive Enhancement

### Summary

This PR improves the keepalive mechanism for hibernating WebSocket requests by ensuring they are pinged both during active WebSocket connections and during hibernation periods. The implementation extracts the keepalive logic into a dedicated module and runs it as a parallel task.

### Positive Aspects

- ✅ Good architectural separation - Extracting keepalive logic into `keepalive_task.rs` keeps the lifecycle code focused.
- ✅ Proper task lifecycle management - The keepalive task is correctly integrated with abort channels and proper cleanup on task completion.
- ✅ Consistent error handling - Uses the same error handling approach as the other lifecycle tasks.
- ✅ Jitter implementation - The random jitter (0-128ms) prevents thundering herd problems when multiple requests update simultaneously.

### Issues & Recommendations

**🔴 Critical: Missing Space in Comment (keepalive_task.rs:13)**

```rust
///Only ran for hibernating requests.
```

Should be:

```rust
/// Only ran for hibernating requests.
```

Missing space after `///`.

**🟡 Medium: Potential Integer Overflow (keepalive_task.rs:21-26)**

```rust
let mut ping_interval = tokio::time::interval(Duration::from_millis(
    (ctx.config()
        .pegboard()
        .hibernating_request_eligible_threshold()
        / 2)
        .try_into()?,
));
```

Issue: The division by 2 happens on the raw configured threshold, so a very small or non-positive value would produce an unusable interval.

Recommendation: Add validation or use:

```rust
let threshold_ms = ctx.config()
    .pegboard()
    .hibernating_request_eligible_threshold()
    .max(1000) // Ensure minimum reasonable threshold
    / 2;
let ping_interval_ms: u64 = threshold_ms.try_into()
    .context("hibernating_request_eligible_threshold must be positive")?;
```

**🟡 Medium: Task Result Priority Logic (lib.rs:511-523)**

```rust
let mut lifecycle_res = match (tunnel_to_ws_res, ws_to_tunnel_res, ping_res, keepalive_res) {
    // Prefer error
    (Err(err), _, _, _) => Err(err),
    (_, Err(err), _, _) => Err(err),
    (_, _, Err(err), _) => Err(err),
    (_, _, _, Err(err)) => Err(err),
    // Prefer non aborted result if all succeed
    (Ok(res), Ok(LifecycleResult::Aborted), _, _) => Ok(res),
    (Ok(LifecycleResult::Aborted), Ok(res), _, _) => Ok(res),
    // Unlikely case
    (res, _, _, _) => res,
};
```

Issue: The priority logic only checks the first two tasks for non-aborted results. If the ping or keepalive task produces a non-aborted result while the first two are aborted, it is lost.

Recommendation: Make the priority logic exhaustive:

```rust
let mut lifecycle_res = match (tunnel_to_ws_res, ws_to_tunnel_res, ping_res, keepalive_res) {
    // Prefer error
    (Err(err), _, _, _) => Err(err),
    (_, Err(err), _, _) => Err(err),
    (_, _, Err(err), _) => Err(err),
    (_, _, _, Err(err)) => Err(err),
    // Prefer non-aborted result if all succeed
    (Ok(res @ LifecycleResult::ServerClose(_)), _, _, _)
    | (_, Ok(res @ LifecycleResult::ServerClose(_)), _, _)
    | (_, _, Ok(res @ LifecycleResult::ServerClose(_)), _)
    | (_, _, _, Ok(res @ LifecycleResult::ServerClose(_))) => Ok(res),
    (Ok(res @ LifecycleResult::ClientClose(_)), _, _, _)
    | (_, Ok(res @ LifecycleResult::ClientClose(_)), _, _)
    | (_, _, Ok(res @ LifecycleResult::ClientClose(_)), _)
    | (_, _, _, Ok(res @ LifecycleResult::ClientClose(_))) => Ok(res),
    // All aborted
    (Ok(LifecycleResult::Aborted), Ok(LifecycleResult::Aborted), _, _) => Ok(LifecycleResult::Aborted),
    (res, _, _, _) => res,
};
```

Or use a helper function for clearer logic.

**🟢 Minor: Redundant Variable Cloning (lib.rs:424-428)**

```rust
let ctx_clone = self.ctx.clone();
let actor_id_clone = self.actor_id;
let gateway_id_clone = self.shared_state.gateway_id();
let request_id_clone = request_id;
```

Issue: These clones are only used once in the `tokio::spawn` call.

Recommendation: Either inline these into the spawn call or use more descriptive names if they serve a documentation purpose. This is purely stylistic.

**🟢 Minor: Inconsistent Abort Order (lib.rs:439-441, 455-457, 471-473, 499-501)**

The order of abort signals varies across the different task completion handlers. While this doesn't affect correctness (all are independent), consistency would improve readability.

Recommendation: Use consistent ordering, e.g., always send the abort signals in the same sequence.

**🔵 Question: Performance - Unnecessary Task Spawn in Non-Hibernating Case**

In `handle_websocket`, the keepalive future inside the join starts with:

```rust
async {
    if !can_hibernate {
        return Ok(LifecycleResult::Aborted);
    }
    // ... spawn keepalive task
}
```

Question: Could this be optimized to conditionally spawn the task outside the `tokio::join!` block?

```rust
let keepalive_res = if can_hibernate {
    let keepalive = tokio::spawn(keepalive_task::task(...));
    // Handle in join
} else {
    // Skip entirely or create a completed future
};
```

However, this would require restructuring the join logic.

### Test Coverage

### Security & Performance

- ✅ No security concerns - The changes don't introduce new attack vectors.
- ✅ Performance impact minimal - The keepalive task runs at 45-second intervals (default), which is very low overhead.

### Conclusion

Overall Assessment: Approve with minor fixes recommended

The PR successfully addresses the issue of keeping hibernating requests alive during both active connections and hibernation. The code follows existing patterns well and integrates cleanly with the existing task lifecycle management.

Required fixes:

Recommended improvements:

Style improvements (optional):

Great work on improving the hibernation reliability! 🚀
## Code Review - PR #3498

### Summary

This PR refactors the keepalive task for hibernating WebSocket requests by extracting it into a separate module and ensuring it runs both during active WebSocket connections (when `can_hibernate` is true) and during hibernation.

### Positive Aspects

### Issues & Concerns

**1. Typo in Documentation (Minor)**

Location: `keepalive_task.rs`

```rust
///Only ran for hibernating requests.
```

Should be:

```rust
/// Only ran for hibernating requests.
```

Missing space after `///`.

**2. Potential Integer Overflow (Low severity)**

Location: `keepalive_task.rs`

```rust
let mut ping_interval = tokio::time::interval(Duration::from_millis(
    (ctx.config()
        .pegboard()
        .hibernating_request_eligible_threshold()
        / 2)
        .try_into()?,
));
```

While the `try_into()?` already guards the conversion, a clearer alternative:

```rust
let interval_ms = ctx.config()
    .pegboard()
    .hibernating_request_eligible_threshold()
    .checked_div(2)
    .context("invalid hibernating_request_eligible_threshold")?;
let interval_ms: u64 = interval_ms.try_into()?;
let mut ping_interval = tokio::time::interval(Duration::from_millis(interval_ms));
```

**3. Task Ordering Logic Could Be Clearer (Minor)**

Location: The pattern matching logic for determining the final `lifecycle_res`:

```rust
let mut lifecycle_res = match (tunnel_to_ws_res, ws_to_tunnel_res, ping_res, keepalive_res) {
    (Err(err), _, _, _) => Err(err),
    (_, Err(err), _, _) => Err(err),
    (_, _, Err(err), _) => Err(err),
    (_, _, _, Err(err)) => Err(err),
    (Ok(res), Ok(LifecycleResult::Aborted), _, _) => Ok(res),
    (Ok(LifecycleResult::Aborted), Ok(res), _, _) => Ok(res),
    (res, _, _, _) => res,
};
```

This prioritizes errors from tasks in order (tunnel_to_ws > ws_to_tunnel > ping > keepalive), but the last two patterns only handle the first two tasks' results when deciding between Aborted and non-Aborted. The ping and keepalive results are not considered in the "prefer non-aborted" logic. This might be intentional, but it's worth verifying if this is the desired behavior.

**4. Missing Error Context (Minor)**

Location: When the database operation fails, it propagates up without context:

```rust
ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
    actor_id,
    gateway_id,
    request_id,
})
.await?;
```

Consider adding context for debugging:

```rust
ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
    actor_id,
    gateway_id,
    request_id,
})
.await
.context("failed to upsert hibernating request keepalive")?;
```

### Performance Considerations

### Test Coverage

No tests were added or modified in this PR. Consider adding tests for:

### Security Concerns

No security issues identified. The changes maintain the existing security boundaries.

### Recommendations

### Conclusion

The PR successfully addresses the goal of ensuring hibernating requests are pinged during both active connections and hibernation. The code quality is good overall with just minor improvements suggested. The refactoring improves maintainability by extracting the keepalive logic into its own module.

Status: Approve with minor suggestions for improvement.
## Code Review

### Overview

This PR extracts the keepalive logic for hibernating requests into a dedicated task module and ensures it runs both during active WebSocket connections and during hibernation. This is a good refactoring that improves code organization and fixes a potential issue with keepalive timing.

### Positive Aspects

- ✅ Good Refactoring: Extracting the keepalive logic into `keepalive_task.rs` improves separation of concerns.
- ✅ Proper Cancellation: The task correctly uses a watch channel to observe the abort signal.
- ✅ Consistent Pattern: The implementation follows the same abort/cleanup pattern as the other lifecycle tasks (tunnel_to_ws, ws_to_tunnel, ping).
- ✅ Anti-Stampede: The jitter mechanism (0-128ms) prevents synchronized database writes (a sketch of this interval-plus-jitter pattern follows after this review).

### Issues Found

**🐛 Critical: Typo in Documentation (Line 13)**

```rust
///Only ran for hibernating requests.
```

Missing space after `///`. Should be:

```rust
/// Only ran for hibernating requests.
```

**🐛 Critical: Comment Inaccuracy (Line 30-31)**

```rust
// Discard the first tick since it fires immediately and we've already called this
// above
```

This comment refers to "already called this above", but that's incorrect in this context: in the hibernation path there is no such call directly above in the extracted task.

Recommendation: Update the comment to be more accurate:

```rust
// Discard the first tick since it fires immediately
```
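To make the interval-and-jitter behavior discussed in the reviews concrete, here is a small, self-contained sketch (not the gateway's code). The threshold value, the loop count, and the `rand::random` jitter are illustrative assumptions; the real task reads `hibernating_request_eligible_threshold` from config and writes a keepalive row instead of printing.

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Illustrative threshold; the real value comes from the pegboard config's
    // hibernating_request_eligible_threshold (milliseconds).
    let eligible_threshold_ms: u64 = 10_000;

    // Tick at half the eligibility threshold so every entry is refreshed well
    // before it could be considered expired.
    let mut ping_interval =
        tokio::time::interval(Duration::from_millis(eligible_threshold_ms / 2));

    // The first tick fires immediately; skip it when an initial keepalive has
    // already been written before this loop starts.
    ping_interval.tick().await;

    for _ in 0..3 {
        ping_interval.tick().await;
        // Small random jitter (0-128 ms) so many connections don't hit the
        // database at exactly the same instant.
        let jitter_ms = (rand::random::<u8>() % 128) as u64;
        tokio::time::sleep(Duration::from_millis(jitter_ms)).await;
        println!("write keepalive row here");
    }
}
```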