-
Notifications
You must be signed in to change notification settings - Fork 135
fix(pegboard-gateway): write hws immediately on socket hibernate #3496
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
3 Skipped Deployments
|
PR Review: fix(pegboard-gateway): write hws immediately on socket hibernateSummaryThis PR fixes a race condition in WebSocket hibernation by ensuring the hibernating request entry is written to the database before checking for pending messages. This prevents a scenario where an actor could be immediately rewoken without having a hibernating request entry in the database. Code Quality ✅Strengths:
Code structure follows Rivet conventions:
Logic Analysis ✅The fix addresses this sequence: Before (buggy):
After (fixed):
This ensures the entry exists for the cleanup logic at lib.rs:609-615, which deletes entries when returning Potential Issues 🔍Minor: Unnecessary database write in early-return pathWhen Question: Does the entry get cleaned up elsewhere when returning Suggestion: Consider whether the early-return path at lib.rs:563 should delete the entry it just created, or document why it's safe to leave it: if self.shared_state.has_pending_websocket_messages(request_id).await? {
tracing::debug!(?unique_request_id, "detected pending requests on websocket hibernation, rewaking actor");
// Delete the entry since we're not actually hibernating
self.ctx.op(pegboard::ops::actor::hibernating_request::delete::Input {
actor_id: self.actor_id,
gateway_id: self.shared_state.gateway_id(),
request_id,
}).await?;
return Ok(HibernationResult::Continue);
}Performance Considerations ⚡Database write overhead: The fix adds an unconditional database write at the start of hibernation. This is the correct tradeoff for correctness, but worth noting:
Mitigation already in place: The keepalive loop has jitter (lib.rs:589-590) to prevent stampeding herds, which is good. Security Considerations 🔒No security concerns identified. The change maintains existing access patterns and doesn't introduce new attack surfaces. Test Coverage 📋Observation: No test files were modified in this PR. Recommendation: Consider adding integration tests for:
Example test scenarios:
Questions for Author
Verdict: ✅ Approve with Minor SuggestionThe core fix is correct and addresses a real race condition. My only concern is potential cleanup of the entry in the early-return path, which needs clarification from the author. Otherwise, the implementation is solid and follows project conventions well. |
34ee3a4 to
e1152dc
Compare
b960df1 to
8145fc3
Compare
PR Review: fix(pegboard-gateway): write hws immediately on socket hibernateSummaryThis PR fixes a race condition in WebSocket hibernation by ensuring the hibernating request entry is written to the database before checking for pending messages. This prevents a scenario where an actor could be immediately rewoken without having a hibernating request entry in the database. Code Quality ✅Strengths:
Code structure follows Rivet conventions:
Logic Analysis ✅The fix addresses this sequence: Before (buggy):
After (fixed):
This ensures the entry exists for the cleanup logic at lib.rs:609-615, which deletes entries when returning Potential Issues 🔍Minor: Unnecessary database write in early-return pathWhen Question: Does the entry get cleaned up elsewhere when returning Suggestion: Consider whether the early-return path at lib.rs:563 should delete the entry it just created, or document why it's safe to leave it: if self.shared_state.has_pending_websocket_messages(request_id).await? {
tracing::debug!(?unique_request_id, "detected pending requests on websocket hibernation, rewaking actor");
// Delete the entry since we're not actually hibernating
self.ctx.op(pegboard::ops::actor::hibernating_request::delete::Input {
actor_id: self.actor_id,
gateway_id: self.shared_state.gateway_id(),
request_id,
}).await?;
return Ok(HibernationResult::Continue);
}Performance Considerations ⚡Database write overhead: The fix adds an unconditional database write at the start of hibernation. This is the correct tradeoff for correctness, but worth noting:
Mitigation already in place: The keepalive loop has jitter (lib.rs:589-590) to prevent stampeding herds, which is good. Security Considerations 🔒No security concerns identified. The change maintains existing access patterns and doesn't introduce new attack surfaces. Test Coverage 📋Observation: No test files were modified in this PR. Recommendation: Consider adding integration tests for:
Example test scenarios:
Questions for Author
Verdict: ✅ Approve with Minor SuggestionThe core fix is correct and addresses a real race condition. My only concern is potential cleanup of the entry in the early-return path, which needs clarification from the author. Otherwise, the implementation is solid and follows project conventions well. |
Code ReviewSummaryThis PR fixes a race condition in websocket hibernation by ensuring the hibernating request entry is written to the database before checking for pending messages. This is a critical fix that prevents a scenario where an actor could be rewoken immediately but the hibernating request entry was never created. Positive Aspects✅ Race Condition Fix: The primary fix correctly addresses a race condition where:
This ensures the entry exists in the database even if the actor is immediately rewoken, which is crucial for proper lifecycle management. ✅ Interval Tick Fix: Discarding the first tick of ✅ Code Documentation: Good inline comments explaining the reasoning behind both changes. Code QualityStrong Points:
Potential Considerations
Minor Observations
Performance & Security
RecommendationLGTM with minor suggestion ✅ The fix is correct and addresses a real race condition. The implementation is clean and well-commented. Optional Enhancement: Consider adding a metric or trace to track how often immediate rewakes occur (after the upsert but before hibernation), which would help validate this fix in production and monitor the frequency of this edge case. Conventional Commit Format: ✅ Correctly follows the pattern: |
8145fc3 to
193f67a
Compare
e1152dc to
968a92b
Compare
Code ReviewThis PR fixes a race condition in the websocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages. Overall, the changes look good and address an important timing issue. Here are my findings: ✅ Positive Observations
🔍 Considerations & Questions
🎯 Suggested Improvements (Optional)// Consider adding cleanup on early return:
if self.shared_state.has_pending_websocket_messages(request_id).await? {
tracing::debug!(
?unique_request_id,
"detected pending requests on websocket hibernation, rewaking actor"
);
// Clean up the entry since we're not actually hibernating
self.ctx
.op(pegboard::ops::actor::hibernating_request::delete::Input {
actor_id: self.actor_id,
gateway_id: self.shared_state.gateway_id(),
request_id,
})
.await?;
return Ok(HibernationResult::Continue);
}📋 SummaryApproval Status: ✅ Approve with minor suggestions The PR correctly fixes the race condition and improves the hibernation flow. The main question is whether the early-return path (line 563) should clean up the database entry. If that's intended behavior, this is ready to merge as-is. If not, consider the suggested cleanup addition. Security: ✅ No security concerns |
PR Review: WebSocket Hibernation Race Condition FixSummaryThis PR fixes a race condition in the WebSocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages that might immediately wake the actor. Code Quality ✅Positive aspects:
Code organization:
Potential Issues & Considerations1. Error Handling Gap 🔍The initial
Verdict: Current behavior appears correct, but consider adding a debug/info log before the operation for observability: tracing::debug!(?request_id, "writing initial hibernating request entry");
self.ctx.op(/* ... */).await?;2. Duplicate Write Pattern 📝The same
Consideration: This repetition is acceptable given the different contexts, but ensure all three locations use identical parameters. The code does this correctly. 3. Race Condition with
|
968a92b to
e1152dc
Compare
193f67a to
8145fc3
Compare
Code ReviewSummaryThis PR fixes a race condition in WebSocket hibernation by ensuring the hibernating request entry is written to the database before checking for pending messages. This is a critical fix that prevents data loss in edge cases. Positive Aspects✅ Race Condition Fix: The main fix correctly addresses a timing issue where:
✅ Proper Interval Handling: The addition of ✅ Good Code Comments: The inline comments clearly explain the "why" behind both changes, making the intent obvious to future maintainers. Code QualityFollows Repository Standards:
Potential Concerns & Questions1. Error Handling After First Upsert (Minor)Lines 544-550: If the initial upsert fails, the function returns an error. However, there's no corresponding delete operation since we haven't entered the hibernation state yet. This seems correct, but worth confirming:
2. Test Coverage (Important)I don't see test changes in this PR. Given that this fixes a race condition, consider adding a test case that:
Example test pattern: #[tokio::test]
async fn test_hibernation_with_pending_messages() {
// Setup: Create actor with pending message
// Trigger: Send hibernate command
// Assert: Verify hibernating_request entry was created
// Assert: Verify immediate rewake (HibernationResult::Continue)
}3. Duplicate Work Consideration (Performance - Minor)The upsert operation is now called twice in quick succession when there are no pending messages:
While this ensures correctness, it means we're writing the same timestamp twice within a short period. This is likely acceptable given the importance of correctness over this minor performance cost, but worth documenting. Security & Performance🔒 Security: No security concerns identified. ⚡ Performance:
Recommendations
Overall AssessmentLGTM with suggestions - The core fix is sound and addresses a real race condition. The main recommendation is to add test coverage to prevent regression. Great catch on this race condition! The fix is clean and well-documented. 🤖 Generated with Claude Code |
Code ReviewSummaryThis PR fixes a race condition in the WebSocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages. This is a critical fix that prevents data inconsistency when actors are immediately rewoken. AnalysisWhat the code does: The bug this fixes:
The fix:
Positive Observations
Considerations & Questions1. Cleanup on immediate rewakeWhen HibernationResult::Continue is returned at line 563, the cleanup logic at lines 606-616 doesn't execute, leaving the hibernating request entry in the database. Question: Is this intentional? If the actor immediately rewakes and continues processing, should we delete the hibernating request entry since the actor never truly hibernated? Recommendation: Consider whether this case needs explicit cleanup: if self
.shared_state
.has_pending_websocket_messages(request_id)
.await?
{
tracing::debug!(
?unique_request_id,
"detected pending requests on websocket hibernation, rewaking actor"
);
// Delete the entry we just created since we're not actually hibernating
self.ctx
.op(pegboard::ops::actor::hibernating_request::delete::Input {
actor_id: self.actor_id,
gateway_id: self.shared_state.gateway_id(),
request_id,
})
.await?;
return Ok(HibernationResult::Continue);
}2. Error handling gapIf the initial upsert fails (line 544-550), the function returns early with an error. However, there's no entry to clean up in this case, which is fine. But if the has_pending_websocket_messages check fails after the upsert succeeds, we'd have an orphaned entry. Question: Should there be error handling that cleans up the entry if subsequent operations fail? 3. Test coverageThere are no tests for this hibernation logic, which makes regression testing difficult for this critical path. Recommendation: Consider adding integration tests for:
4. Database transaction considerationsThe upsert operation uses a database transaction (udb), but the subsequent has_pending_websocket_messages check happens outside this transaction. Question: Could there be a TOCTOU (time-of-check-time-of-use) issue where messages arrive between the upsert and the pending check? Code QualityStrengths:
Minor style notes:
Security & PerformanceSecurity: No security concerns identified. The operations use proper authorization through the operation context. Performance:
Verdict✅ Approve with suggestions This is a solid fix for a race condition. The core logic is correct and the implementation is clean. The main question is whether immediate rewake scenarios need explicit cleanup of the hibernating request entry. I'd recommend:
The PR is safe to merge as-is if the orphaned entry on immediate rewake is acceptable behavior (perhaps it's cleaned up elsewhere or times out naturally). |
8145fc3 to
193f67a
Compare
PR Review: fix(pegboard-gateway): write hws immediately on socket hibernateSummaryThis PR fixes a race condition in the websocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages. Additionally, it discards the first tick of the ping interval to avoid an immediate duplicate write. Code Quality ✅Positive aspects:
Observations:
Bug Fixes & Race Conditions ✅Primary fix - Race condition:
This ensures that if the actor is immediately rewoken (due to pending messages), the hibernating request entry already exists in the database. This prevents a window where the system might try to rewake an actor whose hibernation state hasn't been persisted yet. Secondary fix - Duplicate ping: Performance Considerations
|
968a92b to
e1152dc
Compare
193f67a to
8145fc3
Compare
Code ReviewI've reviewed the changes in this PR. Overall, the fix looks good and addresses a race condition in the websocket hibernation flow. Here's my detailed feedback: ✅ Positives
🔍 Minor Observations
🎯 SuggestionsOptional Enhancement: Consider adding a trace-level log after the initial upsert to aid in debugging: self.ctx
.op(pegboard::ops::actor::hibernating_request::upsert::Input {
actor_id: self.actor_id,
gateway_id: self.shared_state.gateway_id(),
request_id,
})
.await?;
tracing::trace!(?request_id, "inserted initial hibernating request entry");However, this is truly optional and may be unnecessary noise. 📋 SummaryThe changes correctly fix the race condition by ensuring the hibernating request entry is written immediately upon hibernation, before any early returns. The interval tick discard is a nice touch that prevents redundant writes. The code follows the project's conventions and error handling patterns. Recommendation: ✅ Approve - This is a solid fix that addresses the root cause of the issue. |
Code ReviewSummaryThis PR fixes a race condition in the websocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages. This is a critical fix that prevents a scenario where the actor could be rewoken before the hibernation state is properly recorded. Changes AnalysisThe PR makes two key changes to
Positive Aspects ✅
Potential Concerns & Questions 🤔
Performance Considerations
Security & Best Practices
Test Coverage RecommendationsConsider adding test cases for:
VerdictLGTM with minor questions ✅ This is a solid fix for a subtle race condition. The code quality is good with clear comments. The only questions are around the intended behavior of leaving the hibernation entry in the DB when returning The fix is minimal, focused, and addresses a real correctness issue without introducing unnecessary complexity. Review generated by Claude Code |
Code ReviewOverviewThis PR fixes a race condition in the websocket hibernation logic by ensuring the hibernating request entry is written to the database before checking for pending messages. This is a critical fix for the hibernation workflow. Positive Aspects ✅
Concerns & Suggestions1. Error Handling on Early Return
|
e1152dc to
968a92b
Compare
Code ReviewSummaryThis PR fixes a race condition in websocket hibernation by ensuring the hibernating request entry is written to the database before checking for pending messages. This is a critical fix that prevents data loss when an actor immediately rewakes. ✅ Strengths
🔍 Observations & Questions1. Cleanup on Early Return (Minor)When Question: Is this intentional? I see the cleanup logic at lines 607-616 only runs if we get past the early return. This means:
Suggestion: Consider whether we need to delete the entry when immediately rewaking: if self
.shared_state
.has_pending_websocket_messages(request_id)
.await?
{
tracing::debug!(
?unique_request_id,
"detected pending requests on websocket hibernation, rewaking actor"
);
// Clean up the entry we just created since we're not hibernating
self.ctx
.op(pegboard::ops::actor::hibernating_request::delete::Input {
actor_id: self.actor_id,
gateway_id: self.shared_state.gateway_id(),
request_id,
})
.await?;
return Ok(HibernationResult::Continue);
}Or document if the stale entry is handled elsewhere (e.g., TTL-based cleanup). 2. Test Coverage (Minor)I do not see tests covering this specific race condition scenario. The existing test at Suggestion: Consider adding a test that:
📊 Code Quality
🔒 Security & Performance
🎯 RecommendationApprove with minor suggestions. The core fix is solid and addresses a real race condition. The only question is about cleanup of the database entry when immediately rewaking - this may already be handled by other mechanisms (TTL, periodic cleanup, etc.), but it would be good to confirm or add a cleanup call. Great catch on this subtle race condition! 🎉 |
968a92b to
e1152dc
Compare
193f67a to
8145fc3
Compare
PR Review: fix(pegboard-gateway): write hws immediately on socket hibernateSummaryThis PR fixes a race condition in the websocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages. This prevents a scenario where an actor could be woken immediately but the hibernating request entry doesn't exist yet. Code Quality & Best Practices ✅Strengths:
Code style compliance:
Race Condition Fix 🔧The Problem:
This created a race window where:
The Solution:
This ensures the entry exists before any potential wake signals. Potential Issues & Questions 🤔1. Interval First Tick Handling// Discard the first tick since it fires immediately and we've already called this
// above
ping_interval.tick().await;✅ Good catch! This prevents a duplicate immediate upsert since:
Minor suggestion: Consider if this could cause timing issues if the function returns early (before the keepalive task starts). Currently, if 2. Database Transaction TimingThe
Question: What happens if this transaction takes longer than expected? Could there be a race between:
Recommendation: The current implementation seems safe because the upsert completes fully before the pending check, but consider adding a comment about the transaction boundaries if this becomes relevant in the future. 3. Cleanup on Early ReturnWhen returning early due to pending messages (line 563), the hibernating request entry is not cleaned up. This means:
Is this intentional? Looking at the code:
This could potentially leave stale entries if the actor is immediately woken. However, looking at line 605-607, the Recommendation: Add a comment explaining that the entry is left in place for Performance Considerations ⚡
Security Concerns 🔒No security issues identified. The change:
Test Coverage 🧪Existing tests: The
Recommendation: Consider adding a test that specifically exercises the race condition: #[tokio::test]
async fn test_hibernation_with_immediate_wake() {
// Test scenario:
// 1. Trigger hibernation
// 2. Immediately send a message (while hibernation is starting)
// 3. Verify the hibernating request entry exists
// 4. Verify the actor wakes correctly
}This would validate that the fix actually prevents the race condition. However, this may be difficult to test reliably due to timing sensitivity. Recommendations Summary
Final Verdict: ✅ LGTM with minor suggestionsThis is a well-thought-out fix for a subtle race condition. The code is clean, follows project conventions, and the solution is elegant. The minor suggestions above are for future hardening but shouldn't block merging. Changes are:
Great work catching this race condition! 🎉 |
PR Review: fix(pegboard-gateway): write hws immediately on socket hibernateSummaryThis PR fixes a race condition in the websocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages, rather than relying solely on the keepalive task. Code Quality ✅Positive aspects:
Minor observations:
Correctness ✅The fix is correct and addresses a real race condition: Problem being solved: Why this matters: The solution:
Performance Considerations ✅Positive:
Note:
Security Considerations ✅No security concerns identified:
Potential Issues & Suggestions1. Error Handling in Early Return Path Currently at lines 542-550, if the upsert operation fails, the function will return an error before checking for pending messages. This could prevent an immediate rewake even when one is needed. Suggestion: Consider whether it's acceptable to fail-fast here, or if you should: // Insert hibernating request entry before checking for pending messages
// This ensures the entry exists even if we immediately rewake the actor
if let Err(e) = self.ctx
.op(pegboard::ops::actor::hibernating_request::upsert::Input {
actor_id: self.actor_id,
gateway_id: self.shared_state.gateway_id(),
request_id,
})
.await
{
tracing::warn!(?e, "failed to upsert hibernating request, continuing anyway");
}Though the current strict error handling may be intentional for correctness guarantees. 2. Cleanup on Early Return When returning Is this intentional? If the actor immediately continues without actually hibernating, should we delete the entry we just created? Currently:
Suggestion: Consider adding cleanup before the early return: if self
.shared_state
.has_pending_websocket_messages(request_id)
.await?
{
tracing::debug!(
?unique_request_id,
"detected pending requests on websocket hibernation, rewaking actor"
);
// Clean up the entry we just created since we're not actually hibernating
self.ctx
.op(pegboard::ops::actor::hibernating_request::delete::Input {
actor_id: self.actor_id,
gateway_id: self.shared_state.gateway_id(),
request_id,
})
.await?;
return Ok(HibernationResult::Continue);
}3. Test Coverage Missing: No tests found for the pegboard-gateway package. This race condition fix would benefit from:
Architecture & Design ✅The fix aligns well with the existing architecture:
Verdict: APPROVE with suggestionsThis is a solid fix for a real race condition. The core logic is correct and addresses the issue effectively. The suggestions above are minor enhancements around error handling and cleanup that could make the code more robust, but they don't block approval. Recommended next steps:
Great work identifying and fixing this race condition! 🎯 |
e1152dc to
968a92b
Compare
Code ReviewSummaryThis PR fixes a race condition in the websocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages. This prevents a bug where the entry would never be created if the actor immediately rewakes. What Changed
Code Quality ✅Strengths:
Observations:
Logic & Correctness ✅Analysis of the fix: Before this PR:
After this PR:
The fix ensures the database entry exists regardless of which code path is taken. Interval behavior:
Potential Issues 🟡1. Cleanup in early-return case When
Question: Is this intentional? When we immediately rewake due to pending messages, should we:
If Option B is desired, consider adding: if self.shared_state.has_pending_websocket_messages(request_id).await? {
tracing::debug!(?unique_request_id, "detected pending requests on websocket hibernation, rewaking actor");
// Clean up since we're not actually hibernating
self.ctx.op(pegboard::ops::actor::hibernating_request::delete::Input {
actor_id: self.actor_id,
gateway_id: self.shared_state.gateway_id(),
request_id,
}).await?;
return Ok(HibernationResult::Continue);
}However, this might be correct as-is if the entry serves as a "this request is eligible for hibernation" marker that persists across rewake cycles. Performance ✅
Security ✅
Testing Recommendations 📋Consider adding tests for:
Since there are no test files found for Verdict: ✅ Approve with Minor QuestionThe fix is correct and well-implemented. The only question is about the cleanup behavior in the early-return case - this might be intentional, but worth confirming the expected behavior. Great work on catching and fixing this race condition! The comments explaining the fix are particularly helpful for future maintainers. Recommendation: Approve and merge once the cleanup behavior question is addressed (either confirming it's intentional or adding cleanup if needed). |
Merge activity
|

No description provided.