
Conversation

@NathanFlurry
Member

No description provided.

@vercel

vercel bot commented Nov 6, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project              Deployment  Preview  Comments  Updated (UTC)
rivetkit-serverless  Ready       Preview  Comment   Nov 13, 2025 2:48am

3 Skipped Deployments

Project          Deployment  Preview  Updated (UTC)
rivet-cloud      Ignored              Nov 13, 2025 2:48am
rivet-inspector  Ignored     Preview  Nov 13, 2025 2:48am
rivet-site       Ignored     Preview  Nov 13, 2025 2:48am

Member Author

NathanFlurry commented Nov 6, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@NathanFlurry force-pushed the 11-06-fix_rivetkit_properly_handle_msgindex_for_hibernatable_websocket_reconnection branch from cf4dd44 to 8f43e55 on November 6, 2025 at 20:40
@claude

claude bot commented Nov 6, 2025

Code Review for PR #3401

Summary

This PR fixes hibernatable WebSocket reconnection by properly handling msgIndex tracking in the engine driver. The main changes involve:

  1. Renaming: hibernatableWebSocket → hibernatableWebSockets (plural, more idiomatic)
  2. Centralized msgIndex tracking: Moved msgIndex updates from ActorInstance.handleRawWebSocket() to EngineActorDriver
  3. Simplified actor instance code: Removed ~100 lines of WebSocket event handling logic

Positive Aspects

1. Improved Separation of Concerns
The msgIndex tracking now lives in the engine driver where it belongs. The driver is responsible for managing hibernation state, so tracking message indices there makes architectural sense.

2. Proper msgIndex Initialization

// src/drivers/engine/actor-driver.ts:191-195
const lastMsgIndex =
    existingWs.msgIndex >= 0n
        ? Number(existingWs.msgIndex)
        : undefined;

Good handling of the sentinel value -1n to represent "no messages processed yet", converting it to undefined for the hibernation config.

3. Message Sequence Gap Detection

// src/drivers/engine/actor-driver.ts:652-668
if (event.rivetMessageIndex > currentEntry.messageIndex + 1) {
    logger().warn({
        msg: "detected gap in websocket message sequence",
        // ...
    });
}

Excellent defensive programming - this will help detect message loss or ordering issues early.

4. Comprehensive Logging
Good use of structured logging throughout, following the project's conventions:

  • Lowercase messages
  • Using ?field syntax for structured data
  • Including relevant context (actorId, requestId, msgIndex)

Issues & Concerns

1. CRITICAL: Potential Race Condition with msgIndex Updates

The msgIndex is now updated directly in the message event handler (actor-driver.ts:633) without any synchronization:

hibernatableWs.msgIndex = BigInt(event.rivetMessageIndex);

However, the actor's persist data is accessed from multiple places:

  • Message handlers (updating msgIndex)
  • Reconnection logic (reading msgIndex)
  • Persistence serialization (reading all state)

Concern: If a persistence operation reads the hibernatableWebSockets array while a message handler is updating msgIndex, there could be race conditions. TypeScript/JavaScript is single-threaded, but async operations could interleave.

Recommendation: Consider if the msgIndex updates should be queued and applied during the next persistence cycle, or if explicit locking/coordination is needed.
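One way to realize the "queued and applied during the next persistence cycle" option is a small buffer that records the highest index per socket and applies all updates in one pass just before serialization, so a snapshot can never observe a half-applied state. This is a minimal sketch with illustrative names (MsgIndexBuffer, applyTo are not the actual driver API):

```typescript
// Hypothetical sketch: buffer msgIndex updates and apply them atomically
// at persistence time, instead of mutating persisted state per message.

interface HibernatableWs {
	requestId: string;
	msgIndex: bigint;
}

class MsgIndexBuffer {
	#pending = new Map<string, bigint>();

	// Called from the message handler; cheap, no persisted state touched.
	record(requestId: string, msgIndex: bigint): void {
		const prev = this.#pending.get(requestId);
		// Keep only the highest index seen; earlier updates are subsumed.
		if (prev === undefined || msgIndex > prev) {
			this.#pending.set(requestId, msgIndex);
		}
	}

	// Called once per persistence cycle, just before serialization.
	applyTo(sockets: HibernatableWs[]): void {
		for (const ws of sockets) {
			const pending = this.#pending.get(ws.requestId);
			if (pending !== undefined && pending > ws.msgIndex) {
				ws.msgIndex = pending;
			}
		}
		this.#pending.clear();
	}
}
```

Because applyTo runs synchronously, no await point can interleave between reading and writing the array.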


2. Missing Event Listener Cleanup in handleRawWebSocket()

The simplified handleRawWebSocket() method at instance.ts:1675-1684 adds event listeners but has incomplete cleanup:

const onSocketClosed = (_event: any) => {
    this.#activeRawWebSockets.delete(websocket);
    this.#resetSleepTimer();
};

try {
    websocket.addEventListener("close", onSocketClosed);
    websocket.addEventListener("error", onSocketClosed);
} catch {}

Issue: The event listeners are never explicitly removed. While they'll be garbage collected when the websocket is closed, the old code explicitly removed them.

Recommendation: Add cleanup in the finally block or ensure the listeners are removed on close:

const onSocketClosed = (_event: any) => {
    try {
        websocket.removeEventListener("close", onSocketClosed);
        websocket.removeEventListener("error", onSocketClosed);
    } catch {}
    this.#activeRawWebSockets.delete(websocket);
    this.#resetSleepTimer();
};

3. Inconsistent Type Handling

In fromBare() at instance.ts:2228, there's an any type annotation:

hibernatableWebSockets: bareData.hibernatableWebSockets.map(
    (ws: any) => ({
        requestId: ws.requestId,
        // ...
    }),
),

Issue: Using any defeats TypeScript's type checking. The BARE schema should have proper types.

Recommendation: Import and use the proper type from the BARE schema definitions.


4. Potential Memory Leak in Ack Queue

The #hibernatableWebSocketAckQueue map (actor-driver.ts:82-86) accumulates entries but only clears them on flush. If a WebSocket disconnects before a flush, the entry remains.

Current behavior:

  • Entries added on message (line 679)
  • Cleared on periodic flush (line 327)
  • Flushed on close (line 699)

Issue: If a websocket closes between flush intervals, and the close event doesn't trigger for some reason, the entry could leak.

Recommendation: Explicitly remove from ack queue in the WebSocket close handler, before flushing.
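A sketch of that recommendation, with assumed shapes (AckEntry and the flushAck callback are illustrative, not the actual driver types): the close handler flushes this socket's pending ack and then deletes the entry, so it cannot linger until the next periodic flush or leak entirely.

```typescript
// Illustrative sketch of draining a per-socket ack entry in the close
// handler, before the periodic flush interval would reach it.

type AckEntry = { actorId: string; msgIndex: bigint };

class AckQueue {
	#entries = new Map<string, AckEntry>();

	enqueue(requestId: string, entry: AckEntry): void {
		this.#entries.set(requestId, entry);
	}

	// On close: flush this socket's pending ack, then drop the entry.
	// Returns false when there was nothing queued for this socket.
	drain(requestId: string, flushAck: (e: AckEntry) => void): boolean {
		const entry = this.#entries.get(requestId);
		if (!entry) return false;
		flushAck(entry);
		this.#entries.delete(requestId);
		return true;
	}

	get size(): number {
		return this.#entries.size;
	}
}
```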


5. Missing msgIndex Update on WebSocket Close

When a hibernatable WebSocket closes, the code removes it from the tracking array but doesn't verify the msgIndex is up-to-date.

Consider: Should the close handler flush pending acks before removing from hibernatableWebSockets array? The current order (line 699) flushes acks, but by that point, if the actor has been destroyed, the msgIndex might not be saved.


6. Lack of Test Coverage

No test files were modified/added. Given the critical nature of msgIndex tracking for message ordering and the complexity of hibernation:

Recommendation: Add tests covering:

  • WebSocket reconnection with msgIndex restoration
  • Message sequence gap detection
  • Multiple reconnections
  • Edge case: msgIndex = -1n (no messages)
  • Race conditions between message handling and persistence

Minor Issues

7. Method Rename May Break External Code

The rename from handleWebSocket() to handleRawWebSocket() (instance.ts:1657) is a breaking API change.

Impact: Any external code calling actor.handleWebSocket() will break.
Mitigation: If this is internal-only API, it's fine. Otherwise, consider deprecation path.


8. Unclear Comment About "-1" in Logging

// actor-driver.ts:199
lastMsgIndex: lastMsgIndex ?? -1,

This logs -1 when lastMsgIndex is undefined, but -1n is also the persisted sentinel value for msgIndex, so the two cases become indistinguishable in logs.

Recommendation: Log "undefined" or "none" for clarity.


Performance Considerations

Positive:

  • Debounced ack flushing (1s interval) reduces network overhead
  • Simplified event handling reduces memory footprint

Concern:

  • Every message triggers a linear search through hibernatableWebSockets array (actor-driver.ts:626). For actors with many hibernatable WebSockets, this could be O(n) per message.
  • Recommendation: Consider using a Map keyed by requestId for O(1) lookups.
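A minimal sketch of that Map-based lookup, assuming requestId is an ArrayBuffer as in the driver code; the hex-key helper and WsIndex class are illustrative names, not the actual API:

```typescript
// Sketch: O(1) lookup keyed by a hex encoding of the requestId buffer,
// replacing the linear find() performed per message.

function requestIdKey(buf: ArrayBuffer): string {
	return Array.from(new Uint8Array(buf))
		.map((b) => b.toString(16).padStart(2, "0"))
		.join("");
}

interface TrackedWs {
	requestId: ArrayBuffer;
	msgIndex: bigint;
}

class WsIndex {
	#byId = new Map<string, TrackedWs>();

	add(ws: TrackedWs): void {
		this.#byId.set(requestIdKey(ws.requestId), ws);
	}

	get(requestId: ArrayBuffer): TrackedWs | undefined {
		return this.#byId.get(requestIdKey(requestId));
	}

	remove(requestId: ArrayBuffer): void {
		this.#byId.delete(requestIdKey(requestId));
	}
}
```

Keying by a string also sidesteps repeated arrayBuffersEqual comparisons, since two distinct buffers with equal contents produce the same key.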

Security Considerations

No obvious security issues. The changes are mostly refactoring and don't introduce new attack surfaces. Proper input validation (invariant checks) remains in place.


Summary of Recommendations

High Priority:

  1. ✅ Review and document thread-safety guarantees for msgIndex updates
  2. ✅ Add proper event listener cleanup in handleRawWebSocket
  3. ✅ Add test coverage for hibernatable WebSocket reconnection scenarios

Medium Priority:
4. ✅ Fix memory leak potential in ack queue on abnormal WebSocket close
5. ✅ Replace any type with proper BARE schema type
6. ✅ Optimize hibernatable WebSocket lookup with Map data structure

Low Priority:
7. ✅ Clarify logging for undefined msgIndex values
8. ✅ Verify API compatibility for handleWebSocket → handleRawWebSocket rename


Overall, this is a solid refactoring that improves code organization and fixes the msgIndex tracking issue. The main concerns are around potential race conditions and missing cleanup logic. With the recommended fixes, this should be good to merge.

@pkg-pr-new

pkg-pr-new bot commented Nov 6, 2025


@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3401

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3401

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3401

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3401

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3401

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3401

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3401

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3401

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3401

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3401

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3401

commit: 9f1d38e

@claude

claude bot commented Nov 6, 2025

Code Review: Fix hibernatable WebSocket msgIndex handling

Summary

This PR fixes a critical issue with message index tracking for hibernatable WebSocket reconnections. The changes move msgIndex management from the actor instance layer to the engine driver layer, ensuring proper message sequencing during WebSocket hibernation/restoration.


✅ Positive Aspects

  1. Correct Architecture: Moving msgIndex tracking to the engine driver (actor-driver.ts:619-695) is the right approach since the driver has direct access to message events and can update state atomically.

  2. Proper Naming: Renaming hibernatableWebSocket → hibernatableWebSockets (plural) is more semantically correct for an array.

  3. Method Rename: Renaming handleWebSocket → handleRawWebSocket (instance.ts:1657) improves clarity about the method's purpose.

  4. Good Logging: Added comprehensive debug/warn logging for tracking message indices, gaps, and state transitions.

  5. Edge Case Handling: The code properly handles the -1n sentinel value for new WebSockets (lines actor-driver.ts:191-195).


🐛 Potential Issues

1. Race Condition in Message Handler (Medium Priority)

Location: actor-driver.ts:622-628

The message event handler looks up the actor and hibernatable WebSocket on every message:

const actorHandler = this.#actors.get(actorId);
if (actorHandler?.actor) {
    const hibernatableWs = actorHandler.actor[PERSIST_SYMBOL]
        .hibernatableWebSockets.find((ws: any) => ...);

Issue: If the actor is unloaded/hibernated between messages, this lookup could fail even for valid hibernatable WebSockets. The warning at line 687-694 suggests this scenario is considered, but it might be expected behavior rather than an error condition.

Suggestion: Consider documenting when this warning is expected vs. when it indicates a real problem, or adjust the log level.

2. Missing WebSocket Cleanup Logic (High Priority)

Location: instance.ts:1675

The old code (removed in this PR) had logic to remove hibernatable WebSockets from the persist array when they closed:

// Old code (removed):
const wsIndex = this.#persist.hibernatableWebSocket.findIndex(...);
const removed = this.#persist.hibernatableWebSocket.splice(wsIndex, 1);

Issue: The new code at instance.ts:1675-1678 only removes from #activeRawWebSockets but doesn't clean up the persisted hibernatable WebSocket entry. This could lead to:

  • Memory leaks with stale WebSocket entries accumulating
  • Incorrect hibernation state on reconnection
  • Confusion about which WebSockets are actually active

Suggestion: Add cleanup logic in the close handler or implement a TTL-based cleanup mechanism. The lastSeenTimestamp field suggests TTL cleanup was intended.
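A sketch of the TTL variant, using the lastSeenTimestamp field mentioned above; the 24-hour TTL and the field shapes are assumptions for illustration, not values taken from the PR:

```typescript
// Sketch: periodic sweep of stale hibernatable websocket entries based
// on lastSeenTimestamp, bounding growth of the persisted array.

interface PersistedWs {
	requestId: string;
	lastSeenTimestamp: bigint; // ms since epoch
}

// Assumed TTL; real value would likely be configurable.
const STALE_WS_TTL_MS = 24n * 60n * 60n * 1000n;

// Returns the surviving entries; run periodically or on actor wake.
function sweepStaleWebSockets(
	sockets: PersistedWs[],
	nowMs: bigint,
	ttlMs: bigint = STALE_WS_TTL_MS,
): PersistedWs[] {
	return sockets.filter((ws) => nowMs - ws.lastSeenTimestamp <= ttlMs);
}
```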

3. Type Safety Issue (Low Priority)

Location: actor-driver.ts:626

.hibernatableWebSockets.find((ws: any) => ...);

Using any type defeats TypeScript's type checking. Should use proper type from PersistedHibernatableWebSocket.


⚡ Performance Considerations

  1. Array Linear Search: Lines actor-driver.ts:625-628 perform a linear search through hibernatableWebSockets array on every message. For actors with many hibernatable WebSockets, this could become a bottleneck. Consider using a Map<string, PersistedHibernatableWebSocket> keyed by requestId.

  2. Ack Queue Batching: The 1-second flush interval (actor-driver.ts:314-317) is reasonable but consider making it configurable or adaptive based on message volume.

  3. BigInt Conversions: Frequent BigInt() conversions (e.g., actor-driver.ts:633-634) could be optimized by storing timestamps as numbers if millisecond precision suffices.


🔒 Security Concerns

No significant security issues identified. The code properly:

  • Uses invariants to validate message metadata (actor-driver.ts:616-617)
  • Maintains proper buffer comparisons for request IDs
  • Doesn't expose sensitive data in logs

Minor suggestion: Consider sanitizing or truncating requestId in logs to avoid potential information leakage.


🧪 Test Coverage

CRITICAL ISSUE: No test coverage found for hibernatable WebSocket functionality.

Required Tests:

  1. Test msgIndex persistence across hibernation/reconnection cycles
  2. Test message sequence gap detection (the new logic at lines 651-668)
  3. Test WebSocket cleanup on close
  4. Test concurrent message handling with multiple hibernatable WebSockets
  5. Test edge case with msgIndex = -1 (no messages yet)
  6. Test the ack queue flush mechanism

Recommendation: Add tests in rivetkit-typescript/packages/rivetkit/tests/ following the pattern of existing driver tests. Consider using the rivetkit-driver-test-writer agent mentioned in the codebase.


📝 Additional Observations

  1. Schema Migration: The BARE schema change (v2.bare) correctly updates the field name. Verify the migration at versioned.ts:28 handles existing persisted data correctly.

  2. Unused Event Listeners: The old code had onSocketOpened and onSocketMessage handlers that tracked state. Verify that moving this logic to the driver doesn't break any dependent functionality.

  3. Documentation: The persisted.ts comment at line 22 still references the old field name in the comment: "See PersistedActor.hibernatableWebSocket" - should be updated to hibernatableWebSockets.


📋 Recommendations

Before Merge:

  1. ✅ Fix the WebSocket cleanup issue (high priority)
  2. ✅ Add test coverage for hibernatable WebSocket scenarios
  3. ✅ Update the outdated comment in persisted.ts:22

Nice to Have:
4. Replace any type with proper type annotation
5. Consider optimizing the linear search to a Map-based lookup
6. Document the expected vs. error scenarios for the warning at line 687


Verdict

The core fix is solid and addresses the msgIndex tracking issue correctly. However, the missing cleanup logic is a significant concern that should be addressed before merge. The lack of test coverage for this critical functionality is also concerning given the complexity of hibernatable WebSocket state management.

Recommendation: Request changes to add cleanup logic and basic test coverage before merging.

@NathanFlurry force-pushed the 11-06-fix_rivetkit_properly_handle_msgindex_for_hibernatable_websocket_reconnection branch from 8f43e55 to c7a52ec on November 6, 2025 at 20:58
@claude

claude bot commented Nov 6, 2025

PR Review: Fix hibernatable WebSocket msgIndex handling

Summary

This PR fixes a critical bug in the handling of msgIndex for hibernatable WebSocket reconnection. The changes properly track message indices in the engine driver and fix a naming inconsistency (hibernatableWebSocket → hibernatableWebSockets).


Positive Observations ✅

1. Correct Architecture Changes

  • Moving msgIndex tracking from ActorInstance to EngineActorDriver is the right architectural decision. The driver is the proper place to handle Rivet-specific metadata like rivetRequestId and rivetMessageIndex.
  • The engine driver now has direct access to these fields and can update the persisted state immediately.

2. Naming Consistency

  • Fixed inconsistent naming: hibernatableWebSocket → hibernatableWebSockets (plural)
  • Applied consistently across all files including schema, persisted data, and TypeScript code

3. Good Diagnostics

The new logging and sequence checking is excellent:

if (event.rivetMessageIndex !== previousIndex + 1) {
    logger().warn({
        msg: "websocket message index out of sequence",
        sequenceType: event.rivetMessageIndex < previousIndex ? "regressed" 
                     : event.rivetMessageIndex === previousIndex ? "duplicate" 
                     : "gap/skipped",
        gap: event.rivetMessageIndex > previousIndex 
             ? event.rivetMessageIndex - previousIndex - 1 
             : 0,
    });
}

4. Proper Initial State Handling

The conversion of -1n to undefined for lastMsgIndex is correct:

const lastMsgIndex = existingWs.msgIndex >= 0n 
    ? Number(existingWs.msgIndex) 
    : undefined;

Issues & Concerns 🔴

1. Critical: Race Condition Risk

Location: actor-driver.ts:630-643

The msgIndex update happens in the message event handler, but there's no guarantee it will be persisted before actor hibernation:

hibernatableWs.msgIndex = BigInt(event.rivetMessageIndex);
hibernatableWs.lastSeenTimestamp = BigInt(Date.now());

Problem: These updates directly modify the persisted state, but ActorInstance uses throttled persistence (#savePersistThrottled()). If the actor hibernates before the throttled save occurs, the msgIndex won't be persisted.

Scenario:

  1. Message arrives with index 5
  2. Engine driver updates hibernatableWs.msgIndex = 5n
  3. Actor hibernates before throttled save completes
  4. On reconnection, actor restores with old msgIndex (e.g., 3)
  5. Messages 4-5 get replayed, causing duplicates

Recommendation:

  • Either call actor.saveState({ immediate: true }) after updating msgIndex
  • Or implement a msgIndex-specific persistence mechanism that bypasses throttling
  • Or document that this is acceptable and the application layer should handle duplicate messages
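The first option can be sketched as a small gate that bypasses the throttle on the hibernate path; the dirty flag and the stand-in for an immediate save are illustrative, not the real saveState API:

```typescript
// Sketch: flush-before-hibernate barrier. Message handling marks state
// dirty (normally persisted later by a throttled save); the hibernate
// path forces the pending msgIndex out before the snapshot is taken.

class PersistGate {
	#dirty = false;
	#saved: bigint | null = null;
	#msgIndex = -1n;

	onMessage(index: bigint): void {
		this.#msgIndex = index;
		this.#dirty = true; // throttled save would pick this up eventually
	}

	// Called on the hibernate path; bypasses the throttle when dirty.
	beforeHibernate(): void {
		if (this.#dirty) {
			this.#saved = this.#msgIndex; // stands in for an immediate save
			this.#dirty = false;
		}
	}

	get savedMsgIndex(): bigint | null {
		return this.#saved;
	}
}
```

With this barrier, the replay scenario above (hibernating between index 5 arriving and the throttled save firing) cannot leave a stale msgIndex behind.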

2. Medium: WebSocket Close Cleanup Missing

Location: actor-driver.ts:699-703 and instance.ts:1675-1684

The old code in ActorInstance.handleWebSocket (now handleRawWebSocket) removed hibernatable WebSockets from the array on close:

// OLD CODE (removed):
const wsIndex = this.#persist.hibernatableWebSocket.findIndex((ws) =>
    arrayBuffersEqual(ws.requestId, rivetRequestIdLocal)
);
const removed = this.#persist.hibernatableWebSocket.splice(wsIndex, 1);

The new code doesn't remove closed hibernatable WebSockets anywhere.

Problem: The hibernatableWebSockets array will grow indefinitely with stale entries. Even though they update lastSeenTimestamp, there's no cleanup mechanism.

Recommendation:

  • Add cleanup logic in the engine driver's close handler
  • Or implement a periodic cleanup of stale hibernatable WebSockets based on lastSeenTimestamp
  • Or document if this is intentional for reconnection scenarios

3. Medium: Missing Test Coverage

The existing WebSocket tests (raw-websocket.ts) don't cover:

  • Hibernation scenarios
  • Message index tracking and recovery
  • Reconnection with proper msgIndex restoration
  • Out-of-sequence message detection

Recommendation: Add tests for hibernatable WebSocket behavior, especially:

test("should properly restore msgIndex on reconnection", async (c) => {
    // 1. Connect WebSocket
    // 2. Send messages 0-4
    // 3. Force hibernation
    // 4. Reconnect 
    // 5. Verify messages 5+ arrive without duplicates of 0-4
});

4. Low: Inconsistent Error Handling

Location: actor-driver.ts:686-696

The warning when actor is not found should probably prevent the ack from being queued, but it still allows execution to continue:

} else {
    logger().warn({
        msg: "received websocket message but actor not found for hibernatable tracking",
        // ...
    });
}
// Code continues, but no ack is queued - is this intentional?

Recommendation: Either:

  • Return early after the warning
  • Or queue the ack anyway and document why
  • Or throw an error if this represents an invalid state

Minor Suggestions 💡

1. Type Safety

Line 626: (ws: any) could be properly typed:

const hibernatableWs = actorHandler.actor[PERSIST_SYMBOL]
    .hibernatableWebSockets.find((ws: PersistedHibernatableWebSocket) =>
        arrayBuffersEqual(ws.requestId, requestIdBuf)
    );

2. Magic Number

Line 291: -1n as initial msgIndex should be documented or use a constant:

const INITIAL_MSG_INDEX = -1n; // Indicates no messages processed yet

3. Logging Consistency

Following CLAUDE.md conventions, log messages should be lowercase unless mentioning code symbols. Most are correct, but verify all instances.


Performance Considerations

1. Ack Queue Flushing

The 1-second flush interval is reasonable given the 30-second gateway timeout. Consider making this configurable if needed for different deployment scenarios.

2. Linear Search

The .find() operations on hibernatableWebSockets arrays are O(n). For actors with many hibernatable connections, consider using a Map indexed by requestId.


Security Considerations

No security issues identified. The changes maintain proper isolation and don't introduce new attack vectors.


Verdict

Approve with reservations. The core architecture change is sound, but the race condition risk (#1) and missing cleanup logic (#2) need to be addressed or explicitly documented as acceptable trade-offs.

Required Changes:

  1. ⚠️ Address the persistence race condition - either force immediate saves or document that applications must handle duplicate messages
  2. ⚠️ Add cleanup logic for closed hibernatable WebSockets - prevent memory leaks from stale entries

Recommended Changes:

  1. Add test coverage for hibernation scenarios
  2. Fix the inconsistent error handling in the message handler
  3. Improve type safety and add constants for magic numbers

Great work on identifying and fixing this issue! The diagnostic logging additions will be invaluable for debugging production issues. Just need to close the loop on persistence timing and cleanup.

@NathanFlurry force-pushed the 11-06-fix_rivetkit_properly_handle_msgindex_for_hibernatable_websocket_reconnection branch from c7a52ec to 7255d47 on November 6, 2025 at 21:06
@claude

claude bot commented Nov 6, 2025

PR Review: Hibernatable WebSocket msgIndex Handling

Summary

This PR fixes critical issues with message index (msgIndex) handling for hibernatable WebSocket reconnections. The changes move message tracking logic from the actor instance level to the engine driver level, ensuring proper state persistence and restoration.

Code Quality ✅

Strengths:

  • Clear separation of concerns: hibernatable WebSocket tracking is now properly handled at the driver level
  • Consistent naming: hibernatableWebSocket → hibernatableWebSockets (plural) across the codebase
  • Method rename: handleWebSocket → handleRawWebSocket improves clarity
  • Comprehensive logging with structured fields (following project conventions)
  • Proper cleanup in both close and error event handlers

Naming Convention:

  • Good adherence to project conventions with lowercase log messages (e.g., "updated hibernatable websocket msgIndex")
  • Structured logging with proper field names

Architecture Improvements 🎯

Better Design:
The refactoring significantly improves the architecture:

  1. Removed ~100 lines of complex event handling from ActorInstance.handleWebSocket (rivetkit-typescript/packages/rivetkit/src/actor/instance.ts:1654-1773)
  2. Centralized hibernatable WebSocket tracking in the engine driver where it logically belongs
  3. Simplified actor instance code by removing hibernation-specific concerns

Potential Issues ⚠️

1. Race Condition in Message Handler (actor-driver.ts:622-696)

The code accesses actorHandler.actor[PERSIST_SYMBOL].hibernatableWebSockets without synchronization. If the actor is being destroyed or modified concurrently, this could cause issues.

Recommendation: Consider adding a check or lock mechanism before accessing the actor's persist data.

2. Duplicate Cleanup Logic (actor-driver.ts:699-776)

The cleanup code for hibernatable WebSockets is duplicated in both close and error handlers. This is ~70 lines of nearly identical code.

Recommendation: Extract into a helper method like:

#cleanupHibernatableWebSocket(actorId: string, requestIdBuf: ArrayBuffer, requestId: string, event?: any)

This would reduce duplication and make the code more maintainable.
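A sketch of what that shared helper might look like; the tracking-map shapes, log object, and naming are assumptions for illustration rather than the driver's actual internals:

```typescript
// Sketch: shared cleanup delegated to by both close and error handlers,
// so the two paths cannot drift apart.

interface CleanupLog {
	msg: string;
	requestId: string;
	error?: string;
}

function cleanupHibernatableWebSocket(
	sockets: Map<string, { msgIndex: bigint }>,
	ackQueue: Map<string, bigint>,
	requestId: string,
	error?: Error,
): CleanupLog {
	// Drop any pending ack first, then the tracking entry itself.
	ackQueue.delete(requestId);
	sockets.delete(requestId);
	return {
		msg: error
			? "cleaned up hibernatable websocket after error"
			: "cleaned up hibernatable websocket after close",
		requestId,
		...(error ? { error: error.message } : {}),
	};
}
```

The close handler would call this without an error argument and the error handler with one, keeping the log context distinct while sharing the cleanup steps.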

3. Missing Validation (actor-driver.ts:191-195)

The code converts msgIndex from BigInt to number without range checking. JavaScript numbers are 53-bit integers, while BigInt can be arbitrarily large.

Recommendation: Add bounds checking to ensure msgIndex doesn't exceed Number.MAX_SAFE_INTEGER
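A minimal checked conversion along those lines; the helper name and the choice to throw on overflow are assumptions:

```typescript
// Sketch: checked BigInt → number conversion for msgIndex, guarding the
// 53-bit safe-integer boundary.

function msgIndexToNumber(index: bigint): number | undefined {
	// -1n is the sentinel for "no messages yet"; map it to undefined.
	if (index < 0n) return undefined;
	if (index > BigInt(Number.MAX_SAFE_INTEGER)) {
		throw new RangeError(`msgIndex ${index} exceeds Number.MAX_SAFE_INTEGER`);
	}
	return Number(index);
}
```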

4. Type Safety (actor-driver.ts:626, 708, 745)

Multiple uses of any type: find((ws: any) => weakens type safety.

Recommendation: Define proper types or use existing types from the persist schema.

Performance Considerations 💡

Positive:

  • Ack queue debouncing (#flushHibernatableWebSocketAcks) with 1-second interval is good for reducing network overhead
  • Using Map for ack queue provides O(1) lookups

Potential Issue:
The message sequence validation (actor-driver.ts:650-673) could generate excessive warnings under normal network conditions (packet reordering, retransmissions).

Recommendation: Consider making the sequence validation less strict or rate-limiting the warnings.
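The rate-limiting option could be a simple token bucket in front of the warn call, so a burst of reordered packets produces a bounded number of log lines; the parameters here are illustrative:

```typescript
// Sketch: token-bucket limiter gating sequence-gap warnings.

class WarnLimiter {
	#tokens: number;
	#lastRefill: number;

	constructor(
		private maxTokens = 5, // burst allowance
		private refillPerSec = 1, // steady-state warnings per second
		now = Date.now(),
	) {
		this.#tokens = maxTokens;
		this.#lastRefill = now;
	}

	// Returns true if a warning may be emitted now.
	tryWarn(now = Date.now()): boolean {
		const elapsedSec = (now - this.#lastRefill) / 1000;
		this.#tokens = Math.min(
			this.maxTokens,
			this.#tokens + elapsedSec * this.refillPerSec,
		);
		this.#lastRefill = now;
		if (this.#tokens >= 1) {
			this.#tokens -= 1;
			return true;
		}
		return false;
	}
}
```

The driver would wrap the existing logger().warn in `if (limiter.tryWarn()) { ... }`, optionally logging a suppressed-count summary when the bucket refills.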

Security Concerns 🔒

No major security issues found, but consider:

  1. Memory leak potential: If WebSocket cleanup fails, entries could accumulate in hibernatableWebSockets array

    • Current code has warnings for missing actors, which is good
    • Consider adding periodic cleanup or size limits
  2. msgIndex manipulation: The code trusts event.rivetRequestId and event.rivetMessageIndex from events

    • Ensure these are validated/sanitized at a lower layer

Test Coverage ❌

Critical Gap: No tests found for hibernatable WebSocket functionality or msgIndex handling.

Recommendations:

  1. Add integration tests for hibernatable WebSocket reconnection scenarios
  2. Test msgIndex persistence across actor sleep/wake cycles
  3. Test edge cases:
    • Reconnection with out-of-sequence messages
    • Multiple concurrent hibernatable WebSockets
    • Actor destruction during active WebSocket
    • msgIndex at boundary values (-1, 0, MAX_SAFE_INTEGER)

Schema Changes ✅

The BARE schema change (v2.bare) is correct:

  • hibernatableWebSocket → hibernatableWebSockets (plural)
  • Migration logic properly handles this in versioned.ts

Summary

Overall Assessment: This is a solid refactoring that improves architecture and fixes real bugs. The main concerns are:

  • Code duplication in cleanup handlers (should be refactored)
  • Missing test coverage (critical for reliability)
  • Minor type safety and validation issues

Recommendation: Approve with suggestions to address the cleanup duplication and add test coverage in follow-up work.


Generated with Claude Code for PR #3401

@NathanFlurry force-pushed the 11-06-fix_rivetkit_properly_handle_msgindex_for_hibernatable_websocket_reconnection branch from 7255d47 to 22a431f on November 6, 2025 at 21:26
@claude

claude bot commented Nov 6, 2025

Code Review

Overview

This PR fixes the handling of msgIndex for hibernatable WebSocket reconnections. The changes centralize WebSocket state management in the engine driver and fix a critical bug where message indices weren't being properly tracked for restoration.


Key Changes

✅ Positive Changes

  1. Field Name Consistency - Renamed hibernatableWebSocket to hibernatableWebSockets (plural) throughout the codebase for better naming convention

    • Updated in schema, TypeScript interfaces, and all references
  2. Centralized State Management - Moved hibernatable WebSocket state tracking from ActorInstance to EngineActorDriver

    • Removes ~100 lines of duplicated/problematic logic from instance.ts:1654-1790
    • Consolidates tracking in the driver where it belongs (lines 619-696 in actor-driver.ts)
  3. Proper msgIndex Handling - Fixed critical bug where msgIndex wasn't being updated

    • Now correctly updates hibernatableWs.msgIndex on every message (line 633)
    • Handles -1n as sentinel value for "no messages processed yet" (lines 189-197)
    • Converts BigInt to number properly when passing to hibernation config
  4. Enhanced Observability - Added comprehensive debug/warn logging

    • Logs msgIndex updates with old/new values (lines 636-642)
    • Warns on out-of-sequence messages with detailed diagnostics (lines 650-673)
    • Tracks gaps, regressions, and duplicates in message sequences
  5. Code Deduplication - New #cleanupHibernatableWebSocket helper (lines 728-787)

    • Eliminates duplication between close and error handlers
    • Consistent cleanup logic with proper error context

Issues & Concerns

🔴 Critical Issues

  1. Type Safety Violation (line 626)
].hibernatableWebSockets.find((ws: any) =>

Using any defeats TypeScript's type checking. This should use the proper type PersistedHibernatableWebSocket.

Recommendation:

].hibernatableWebSockets.find((ws: PersistedHibernatableWebSocket) =>
  2. Potential Race Condition - msgIndex updates without synchronization (line 633)
    The hibernatableWs.msgIndex is updated directly in the message handler. If messages arrive rapidly or out of order, this could lead to race conditions.

Current:

hibernatableWs.msgIndex = BigInt(event.rivetMessageIndex);

Concern: No locking mechanism ensures atomic updates. Consider if concurrent message handling could cause issues.

⚠️ Medium Priority Issues

  1. Missing Null Checks - Lines 687-696 warn but don't handle missing actor gracefully
} else {
    // Warn if we receive a message for a hibernatable websocket but can't find the actor
    logger().warn({...});
}

This logs a warning but continues execution. Should this early-return or throw an error instead?

  2. Inconsistent Error Handling - #cleanupHibernatableWebSocket method
  • Called for both close and error events
  • Error events include error context, but close events don't include close code/reason
  • Consider logging close code/reason for debugging
  3. Method Rename Without Clear Migration - handleWebSocket → handleRawWebSocket
    While the rename is good for clarity, there's no deprecation notice or runtime warning for any external callers that might exist.

🟡 Minor Issues / Suggestions

  1. Logging Consistency
  • Mix of logger().debug() and logger().warn() calls
  • Some messages are lowercase (good per CLAUDE.md), but inconsistently applied
  • Example line 636: "updated hibernatable websocket msgIndex in engine driver" is verbose

Suggestion: Shorten to "updated hibernatable ws msgIndex" for consistency

  2. Magic Number - Line 287: msgIndex: -1n
msgIndex: -1n,

Consider extracting to a named constant:

const HIBERNATABLE_WS_NO_MESSAGES = -1n;
  3. BigInt String Conversion - Line 639: oldMsgIndex.toString()
    While correct, consider whether logging numeric values might be better for analysis tools. Alternatively, document why string conversion is necessary.

Performance Considerations

  1. Linear Search on Every Message (line 626)
hibernatableWebSockets.find((ws: any) =>
    arrayBuffersEqual(ws.requestId, requestIdBuf)
)

For actors with many hibernatable WebSockets, this O(n) search on every message could become a bottleneck.

Recommendation: Use a Map keyed by requestId (as string) for O(1) lookup:

#hibernatableWebSocketsByRequestId: Map<string, PersistedHibernatableWebSocket>
  2. ArrayBuffer Comparison Overhead - arrayBuffersEqual called repeatedly
    Consider caching string representations of requestIds if they're compared frequently.
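Both performance points can be addressed at once: derive a hex-string key from the requestId a single time, then use it for O(1) Map lookups instead of per-message ArrayBuffer comparisons. This is a sketch; requestIdToKey and the helpers around it are illustrative names, not codebase APIs:

```typescript
// Sketch: O(1) lookup keyed by a hex encoding of the requestId.
// requestIdToKey is a hypothetical helper, not part of the codebase.
function requestIdToKey(requestId: ArrayBuffer): string {
    return Array.from(new Uint8Array(requestId))
        .map((b) => b.toString(16).padStart(2, "0"))
        .join("");
}

const wsByRequestId = new Map<string, { msgIndex: bigint }>();

function trackWs(requestId: ArrayBuffer, ws: { msgIndex: bigint }): void {
    wsByRequestId.set(requestIdToKey(requestId), ws);
}

function findWs(requestId: ArrayBuffer): { msgIndex: bigint } | undefined {
    return wsByRequestId.get(requestIdToKey(requestId));
}
```

The key derivation is O(length of requestId) but runs once per event, versus one arrayBuffersEqual call per tracked socket per event with the linear scan.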

Security Concerns

  1. No Validation on rivetMessageIndex - Lines 616-617
invariant(event.rivetRequestId, "missing rivetRequestId");
invariant(event.rivetMessageIndex, "missing rivetMessageIndex");

The code validates presence but not validity:

  • Could rivetMessageIndex be negative?
  • Could it be excessively large (DoS via memory)?
  • Should there be a max gap threshold that triggers connection termination?

Recommendation: Add range validation:

invariant(
    typeof event.rivetMessageIndex === 'number' && 
    event.rivetMessageIndex >= 0 && 
    event.rivetMessageIndex < Number.MAX_SAFE_INTEGER,
    'invalid rivetMessageIndex'
);

Test Coverage

  1. No New Tests - This PR modifies critical hibernation logic but adds no tests

Missing Test Cases:

  • ✗ Hibernatable WebSocket with msgIndex restoration
  • ✗ Message sequence gap detection
  • ✗ Out-of-order message handling
  • ✗ Reconnection with proper lastMsgIndex
  • ✗ Cleanup on close vs error events
  • ✗ Multiple hibernatable WebSockets for same actor

Recommendation: Add tests in tests/driver-engine.test.ts covering:

  1. Basic hibernation flow with message index tracking
  2. Edge cases: -1n initial state, gaps, duplicates
  3. Cleanup scenarios: normal close, error, actor not found

Code Quality

Positive

  • ✅ Follows CLAUDE.md structured logging conventions
  • ✅ Consistent naming (kebab-case in logs)
  • ✅ Good error context in warnings
  • ✅ Proper use of BigInt for large integers

Needs Improvement

  • ❌ Type safety (uses any)
  • ❌ Missing JSDoc on new helper method
  • ❌ No comments explaining the -1n sentinel value

Recommendations Summary

Must Fix (Before Merge):

  1. Remove any type, use proper PersistedHibernatableWebSocket type
  2. Add input validation for rivetMessageIndex
  3. Add test coverage for hibernatable WebSocket msgIndex tracking

Should Fix (High Priority):
4. Consider race condition implications for concurrent msgIndex updates
5. Extract magic number -1n to named constant
6. Add JSDoc to #cleanupHibernatableWebSocket method

Nice to Have (Lower Priority):
7. Optimize linear search to Map-based lookup if needed
8. Add deprecation notice for handleWebSocket rename
9. Include close code/reason in cleanup logging


Conclusion

This is a solid fix that addresses the core issue and improves code organization. The centralization of WebSocket state management is the right architectural decision. However, the lack of test coverage for such critical functionality is concerning.

Overall Assessment: Approve with recommendations - The core fix is correct, but address the type safety and validation issues before merging, and strongly consider adding test coverage.


Review generated by Claude Code following CLAUDE.md conventions

@claude

claude bot commented Nov 6, 2025

Code Review: Hibernatable WebSocket msgIndex Handling

Summary

This PR fixes the handling of msgIndex for hibernatable WebSocket reconnections. The changes include renaming hibernatableWebSocket to hibernatableWebSockets (plural) and moving msgIndex tracking logic from ActorInstance to EngineActorDriver.


Positive Aspects

  1. Centralized Logic: Moving msgIndex tracking to the driver (actor-driver.ts:631-683) centralizes hibernatable WebSocket state management in one place, reducing duplication and complexity in ActorInstance.

  2. Improved Error Handling: The new code includes robust sequence validation (actor-driver.ts:648-666) that warns about:

    • Regressed indices
    • Duplicate messages
    • Gaps in message sequences
  3. Better Code Organization: The #cleanupHibernatableWebSocket helper method (actor-driver.ts:730-778) eliminates duplication between close and error handlers.

  4. Consistent Naming: The rename from hibernatableWebSocket (singular) to hibernatableWebSockets (plural) is more accurate for an array.

  5. Proper Type Handling: Converting msgIndex from bigint to number when needed (actor-driver.ts:191-196) with explicit handling of the -1 sentinel value.
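The sequence validation described in point 2 (regressed, duplicate, gap) can be sketched as a small pure classifier; the function name and shape are illustrative, not the driver's actual code:

```typescript
type SequenceKind = "ok" | "regressed" | "duplicate" | "gap";

// Classifies an incoming message index against the last recorded one.
// A lastIndex of -1n means no messages have been processed yet.
function classifySequence(lastIndex: bigint, incoming: bigint): SequenceKind {
    if (incoming < lastIndex) return "regressed";
    if (incoming === lastIndex && lastIndex >= 0n) return "duplicate";
    if (incoming > lastIndex + 1n) return "gap";
    return "ok";
}
```

Factoring the checks into a pure function like this would also make the edge cases directly unit-testable.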


Issues & Concerns

1. Critical: Potential msgIndex Race Condition

Location: actor-driver.ts:191-201

const lastMsgIndex = existingWs.msgIndex >= 0n 
    ? Number(existingWs.msgIndex) 
    : undefined;
hibernationConfig = {
    enabled: true,
    lastMsgIndex,
};

Issue: The msgIndex is read during WebSocket open, but could be updated by concurrent message handlers before the WebSocket is fully established. This could lead to messages being replayed or skipped.

Recommendation: Consider adding a lock or using atomic operations to ensure msgIndex is consistent during WebSocket restoration.
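If serialization turns out to be necessary, a generic promise-chain mutex is one option. This is a sketch of the pattern, not the driver's actual concurrency primitive:

```typescript
// Sketch: serializes async critical sections by chaining promises.
class Mutex {
    private tail: Promise<void> = Promise.resolve();

    // Runs fn after all previously queued sections have finished.
    runExclusive<T>(fn: () => Promise<T> | T): Promise<T> {
        const result = this.tail.then(fn);
        // Keep the chain alive even if fn rejects.
        this.tail = result.then(
            () => undefined,
            () => undefined,
        );
        return result;
    }
}
```

Wrapping both the WebSocket-open read of msgIndex and the message-handler write in runExclusive would guarantee they never interleave, at the cost of one promise hop per event.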


2. Bug: Duplicate lastSeenTimestamp Updates

Location: actor-driver.ts:276-280 and actor-driver.ts:634

The code updates lastSeenTimestamp in two places:

  • During WebSocket open (line 277)
  • During message handling (line 634)

Issue: For an existing WebSocket that reconnects, both code paths execute, causing redundant updates.

Recommendation: The open handler should only update timestamp for new WebSockets, not existing ones that already have their timestamp updated by message handlers.


3. Potential Memory Leak: Incomplete Cleanup

Location: actor-driver.ts:743-777

Issue: If an actor is destroyed or crashes while WebSockets are active, the cleanup logic only runs on socket close/error events. If the actor terminates abnormally, hibernatable WebSocket entries may remain in memory and the ack queue.

Recommendation: Add cleanup in actor shutdown/destroy path to ensure all hibernatable WebSocket state is cleared.


4. Missing Test Coverage

Issue: No test files were modified, and grep shows no existing tests for hibernatable WebSocket functionality.

Recommendation: Add tests covering:

  • WebSocket reconnection with msgIndex restoration
  • Message sequence validation (gaps, duplicates, regression)
  • Cleanup on close/error
  • Concurrent message handling during reconnection
  • Edge case: msgIndex = -1 (no messages processed yet)

5. Inconsistent Error Handling

Location: actor-driver.ts:688-696 vs actor-driver.ts:764-773

When the actor is not found:

  • Message handler logs a warning (line 688)
  • Cleanup handler logs a warning (line 764)

Issue: These warnings provide debugging info but don't handle the actual problem. The ack queue could accumulate entries for non-existent actors.

Recommendation: Consider removing stale ack queue entries when actor not found, not just logging.


6. Type Safety: Unsafe toString() on msgIndex

Location: actor-driver.ts:755

removedMsgIndex: removed[0]?.msgIndex?.toString() ?? "unknown",

Issue: Using optional chaining with toString() on bigint is correct, but the fallback "unknown" is a string in a logging context that might be used for metrics/debugging. Consider using -1 or null for consistency.

Minor: This is a minor issue but affects debugging clarity.


7. Schema Version Migration

Location: versioned.ts:31

The migration from v1 to v2 sets hibernatableWebSockets: [], but:

Question: What happens to existing v1 data that might have had hibernatable connections tracked elsewhere? Is there a migration path or is this a breaking change?

Recommendation: Add a comment explaining the migration strategy or confirm this is intentional.


Performance Considerations

  1. Debouncing Strategy: The 1-second flush interval (actor-driver.ts:316) is reasonable, but consider:

    • What's the expected message rate? High-throughput WebSockets might benefit from a shorter interval
    • Consider adaptive flushing based on queue size
  2. Linear Search: Finding hibernatable WebSockets uses findIndex (actor-driver.ts:184). For actors with many hibernatable WebSockets, this could be O(n).

    • Recommendation: Consider using a Map keyed by requestId for O(1) lookups

Security Concerns

No immediate security issues identified. The code properly validates requestId presence and uses ArrayBuffer equality checks.


Style & Convention Issues

  1. Logging Format (per CLAUDE.md):

    • Line 637: "updated hibernatable websocket msgIndex in engine driver" should be lowercase: "updated hibernatable websocket msgindex in engine driver"
    • Multiple log messages properly use structured logging with ?x format ✓
  2. Method Naming:

    • handleWebSockethandleRawWebSocket is more descriptive ✓
    • #flushWsAcks#flushHibernatableWebSocketAcks is more explicit ✓

Recommendations Summary

Must Fix:

  1. Add test coverage for hibernatable WebSocket reconnection and msgIndex handling
  2. Address potential race condition in msgIndex reading during WebSocket open
  3. Fix duplicate lastSeenTimestamp updates
  4. Add cleanup for abnormal actor termination

Should Fix:

  1. Improve error handling when actor not found (clean up ack queue)
  2. Consider performance optimization for WebSocket lookup (use Map)

Nice to Have:

  1. Document schema migration strategy
  2. Consider adaptive ack flush interval for high-throughput scenarios
  3. Improve type consistency in logging (msgIndex fallback value)

Verdict

The PR makes good architectural improvements by centralizing msgIndex tracking, but needs test coverage and has some edge case handling issues that should be addressed before merging.

Rating: Needs work ⚠️

Generated by Claude Code

@claude

claude bot commented Nov 8, 2025

Code Review

Summary

This PR fixes the handling of msgIndex for hibernatable WebSocket reconnection. The changes focus on:

  1. Renaming hibernatableWebSocket to hibernatableWebSockets (pluralization)
  2. Moving hibernatable WebSocket tracking logic from ActorInstance to EngineActorDriver
  3. Improving message index tracking and acknowledgment handling

Positive Observations

1. Proper Separation of Concerns

  • Moving hibernatable WebSocket lifecycle management from ActorInstance.handleRawWebSocket (instance.ts:1657-1699) to EngineActorDriver is architecturally sound
  • The driver is the appropriate layer for managing engine-specific WebSocket state

2. Better Error Handling

  • Added comprehensive logging with proper context (actorId, requestId, message indices)
  • The new #cleanupHibernatableWebSocket helper (actor-driver.ts:731-778) eliminates code duplication between close and error handlers
  • Out-of-sequence message detection with detailed warnings (actor-driver.ts:648-672)

3. Correct msgIndex Initialization

  • Properly handling -1n as "no messages processed yet" when converting to lastMsgIndex (actor-driver.ts:192-194)
  • Converting to undefined rather than passing -1 to the engine is the right approach

4. Consistent Naming

  • Pluralization of hibernatableWebSocket → hibernatableWebSockets across all files improves consistency
  • Renaming handleWebSocket → handleRawWebSocket clarifies intent

Issues & Recommendations

🔴 Critical: Message Index Update Timing Issue

In actor-driver.ts:631-633, the msgIndex is updated on the message event:

hibernatableWs.msgIndex = BigInt(event.rivetMessageIndex);

However, this happens before the actor processes the message. If the actor crashes or hibernates immediately after receiving the message but before processing it, the saved msgIndex will point to an unprocessed message. On reconnection, that message would be skipped.

Recommendation: Only update msgIndex after successful message processing, or use an acknowledgment mechanism to ensure messages are processed before updating the index.

🟡 Moderate: Race Condition in Actor Lookup

In actor-driver.ts:622-624:

const actorHandler = this.#actors.get(actorId);
if (actorHandler?.actor) {

The actor could be removed from #actors between checking and using it. While the current code handles this with warnings (lines 686-695), consider:

  • Using a lock/mutex for actor lifecycle operations
  • Or documenting the expected behavior when an actor is being destroyed

🟡 Moderate: Potential Memory Leak in Ack Queue

The #hibernatableWebSocketAckQueue is only cleared in:

  1. #flushHibernatableWebSocketAcks() - every 1 second
  2. #cleanupHibernatableWebSocket() - on close/error
  3. shutdown() - on driver shutdown

Issue: If a WebSocket is tracked as hibernatable but never receives messages, it won't be added to the ack queue. However, if it does receive messages but then the cleanup fails (e.g., actor already destroyed), the entry could persist in the ack queue.

Recommendation: Add defensive cleanup to ensure ack queue entries are removed when hibernatable WebSocket entries are removed from hibernatableWebSockets array.
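That defensive pass might look like the following sketch, assuming both the ack queue and the set of live connections can be keyed by a string form of the requestId (all names here are hypothetical):

```typescript
// Sketch: drop any ack-queue entries whose requestId no longer has a
// matching hibernatable websocket entry. All names are illustrative.
interface AckEntry {
    requestId: string;
    msgIndex: bigint;
}

function pruneStaleAcks(
    ackQueue: Map<string, AckEntry>,
    liveRequestIds: Set<string>,
): number {
    let removed = 0;
    for (const key of ackQueue.keys()) {
        if (!liveRequestIds.has(key)) {
            ackQueue.delete(key);
            removed++;
        }
    }
    return removed;
}
```

Running this inside the periodic flush would bound how long an orphaned entry can survive.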

🟡 Moderate: Inconsistent Error Handling

In actor-driver.ts:686-695 and actor-driver.ts:764-773, the code warns when an actor isn't found but continues execution. Consider:

  • Should these be errors that prevent further processing?
  • What's the expected behavior when tracking state for a non-existent actor?

🟢 Minor: Code Clarity

Line 192-194 in actor-driver.ts:

const lastMsgIndex =
    existingWs.msgIndex >= 0n
        ? Number(existingWs.msgIndex)
        : undefined;

The comment says "treating -1 as undefined" but the condition checks >= 0n. This means -1n → undefined (correct), but it's clearer to explicitly check for -1n:

const lastMsgIndex = existingWs.msgIndex === -1n 
    ? undefined 
    : Number(existingWs.msgIndex);

🟢 Minor: Log Message Clarity

Line 749 in actor-driver.ts could be more descriptive. Consider separate messages for close vs error events for better log filtering.

🟢 Minor: Removed Code Verification

The removed code in instance.ts (lines 1675-1766) tracked socket lifecycle events. Verify that all functionality was preserved:

  • onSocketOpened handler updated lastSeenTimestamp - confirmed this still happens in the driver at line 633 ✅
  • All cleanup logic appears to be preserved ✅

Testing Concerns

❗ No test coverage visible - This is a complex change to hibernation logic. Consider adding tests for:

  1. Message index tracking across hibernation/wake cycles
  2. Out-of-sequence message handling
  3. Cleanup on various error conditions
  4. Race conditions with actor lifecycle

Performance Considerations

Ack Queue Batching - The 1-second flush interval (line 316) is well-reasoned given the 30s gateway timeout
No Blocking Operations - All operations are async where appropriate

Security Considerations

✅ No security issues identified - proper validation of rivetRequestId and rivetMessageIndex with invariant

Summary

This is a solid refactoring that improves code organization and observability. The main concern is the message index update timing which could cause message loss during failures.

Recommendations:

  1. Before merging: Address the critical msgIndex timing issue - ensure messages are processed before updating the index
  2. Nice to have: Add test coverage for hibernation scenarios
  3. Nice to have: Review and document the race condition behavior with actor lifecycle

Overall Assessment: Approve with changes requested - the msgIndex timing issue should be addressed before merging.

@NathanFlurry NathanFlurry force-pushed the 11-06-fix_rivetkit_properly_handle_msgindex_for_hibernatable_websocket_reconnection branch from ce9544b to 9f1d38e Compare November 13, 2025 02:47
@NathanFlurry NathanFlurry force-pushed the 11-08-chore_rivetkit_rename_onstart_onstop_-_onwake_onsleep branch from 1386f17 to 8a57285 Compare November 13, 2025 02:47
@claude

claude bot commented Nov 13, 2025

PR Review: Fix hibernatable WebSocket msgIndex handling

Summary

This PR fixes the message index (msgIndex) handling for hibernatable WebSocket reconnections. The changes properly track and restore message indices when WebSockets hibernate and reconnect, which is critical for ensuring message delivery guarantees in the actor system.

Positive Aspects

1. Improved Architecture

  • The new createRawWebSocketSocket driver properly separates concerns for raw WebSocket connections
  • Good use of dependency injection with the closePromise parameter for graceful shutdown coordination
  • Clear documentation explaining why raw WebSockets don't need sendMessage

2. Better Message Index Tracking

  • Lines 633-687 in actor-driver.ts: The message handler now properly updates msgIndex on each message
  • Lines 189-202: Correctly converts msgIndex from bigint to number, treating -1 as "no messages yet"
  • The fix ensures that when a WebSocket hibernates and reconnects, it can resume from the correct message index

3. Enhanced Debugging

  • Excellent out-of-sequence detection (lines 654-676) with detailed logging for:
    • Regressed indices
    • Duplicate indices
    • Gaps/skipped indices
  • Comprehensive logging throughout helps diagnose hibernation issues

4. Code Quality Improvements

  • Lines 736-780: The #cleanupHibernatableWebSocket helper eliminates duplication between close and error handlers
  • Consistent error handling with proper cleanup in all paths
  • Fixed schema naming from hibernatableWebSocket to hibernatableWebSockets (plural)

Issues & Concerns

1. Critical: Race Condition in Connection Creation 🔴

Location: router-endpoints.ts:435-464

const conn = await actor.createConn(
    createRawWebSocketSocket(...),
    {},
    newRequest,
);
createdConn = conn;
actor.handleWebSocket(adapter, { request: newRequest });

Problem: The WebSocket onMessage and onClose handlers can fire before createdConn is set, especially under high load or if createConn is slow. This could lead to:

  • Messages arriving before the connection is tracked
  • Close events firing before cleanup is registered

Recommendation: Restructure to ensure atomic setup:

createdConn = await actor.createConn(...);
// Now that createdConn is set, it's safe to call handleWebSocket
actor.handleWebSocket(adapter, { request: newRequest });

2. Potential Memory Leak 🟡

Location: router-endpoints.ts:390-391

let createdConn: AnyConn | undefined;

Problem: If an exception occurs after createConn succeeds but before the WebSocket handlers are set up, createdConn is never cleaned up. The connection remains tracked but orphaned.

Recommendation: Add cleanup in the catch block:

} catch (error) {
    if (createdConn) {
        actor.connDisconnected(createdConn, false);
    }
    actor.rLog.error({...});
    ws.close(1011, "Failed to create connection");
}

3. Type Safety: Unsafe Type Casting 🟡

Location: raw-websocket.ts:42

terminate: () => {
    (websocket as any).terminate?.();
},

Problem: Using any bypasses type safety. The terminate method may not exist on all WebSocket implementations.

Recommendation: Use proper type guard:

terminate: () => {
    if ('terminate' in websocket && typeof websocket.terminate === 'function') {
        websocket.terminate();
    }
},

4. Incomplete Cleanup in Error Path 🟡

Location: actor-driver.ts:718-729

The error handler calls #cleanupHibernatableWebSocket but doesn't flush acks like the close handler does. This could leave stale acks in the queue.

Recommendation: Add ack flushing to error handler:

websocket.addEventListener("error", (event) => {
    this.#flushHibernatableWebSocketAcks(); // Add this
    this.#cleanupHibernatableWebSocket(...);
    wsHandlerPromise.then((x) => x.onError?.(event, wsContext));
});

5. Logging: Missing BigInt Conversion 🟢

Location: actor-driver.ts:642

oldMsgIndex: oldMsgIndex.toString(),

If oldMsgIndex is a bigint, this will work, but the type is inconsistent. The code later converts to number (line 636) so logging before conversion could be confusing.

Recommendation: Be consistent with types throughout the flow.

Performance Considerations

1. Ack Queue Flushing Strategy ℹ️

The 1-second interval (line 313) is reasonable for debouncing, but under very high message rates, the queue could grow large between flushes. Consider:

  • Implementing a max queue size with early flush
  • Making the interval configurable for different deployment scenarios
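The first bullet can be sketched as a size-triggered early flush layered on top of the existing timer. The threshold value and callback shape below are assumptions, not values from the codebase:

```typescript
// Sketch: flush early once the pending-ack count crosses a threshold,
// in addition to the periodic timer. MAX_PENDING_ACKS is an assumption.
const MAX_PENDING_ACKS = 1024;

function enqueueAck(
    queue: Map<string, bigint>,
    requestId: string,
    msgIndex: bigint,
    flush: (q: Map<string, bigint>) => void,
): void {
    queue.set(requestId, msgIndex);
    if (queue.size >= MAX_PENDING_ACKS) {
        flush(queue);
        queue.clear();
    }
}
```

This keeps the queue bounded under burst load while leaving the common low-rate path to the 1-second timer.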

2. Array Search Performance ℹ️

Lines 182-184, 627-631: Using find() on hibernatableConns array is O(n). For actors with many hibernatable connections, consider using a Map indexed by requestId.

Security Considerations

No major security issues found

  • Proper close codes used (1000 for normal, 1011 for errors)
  • No sensitive data logged
  • Good input validation with invariant checks

Test Coverage

⚠️ No test changes included

This is a critical bug fix for message ordering in hibernatable WebSockets. Please add tests covering:

  1. Message index restoration after hibernation
  2. Out-of-sequence message detection
  3. Race conditions in connection setup
  4. Cleanup on error paths
  5. Multiple hibernatable WebSockets per actor

Additional Recommendations

Code Style

  1. Naming: createRawWebSocketSocket has "Socket" twice - consider createRawWebSocketDriver for clarity
  2. Error Messages: Line 463 could include the error type: Failed to create connection: ${error.constructor.name}

Documentation

Consider adding inline comments explaining:

  • Why -1 is used as sentinel value for "no messages"
  • The relationship between msgIndex tracking in multiple locations
  • The WebSocket lifecycle state machine

Verdict

Conditional Approval - The PR fixes an important bug and improves the codebase, but the race condition in router-endpoints.ts should be addressed before merging.

Priority Fixes (Blocking):

  1. Fix race condition in onOpen handler (Issue #1)
  2. Add connection cleanup in error catch block (Issue #2)

Recommended Fixes (Non-blocking):

  1. Add type-safe terminate() implementation
  2. Add ack flushing to error handler
  3. Add test coverage

🤖 Generated with Claude Code

@claude

claude bot commented Nov 13, 2025

Code Review: fix(rivetkit): properly handle msgIndex for hibernatable websocket reconnection

This PR properly handles msgIndex for hibernatable WebSocket reconnections. The changes look solid overall with good improvements to state management and logging.


Strengths

1. Improved Architecture

  • New createRawWebSocketSocket driver provides proper separation of concerns
  • Connection lifecycle managed through createConn flow enables deduplication for hibernatable connections
  • Cleanup code well-factored into cleanupHibernatableWebSocket

2. Better State Tracking

  • msgIndex properly updated in driver message handler (actor-driver.ts:636)
  • Added lastSeenTimestamp updates alongside msgIndex
  • Proper handling of msgIndex = -1 for no messages processed yet

3. Enhanced Observability

  • Excellent logging with structured fields
  • Out-of-sequence message detection (lines 654-676) with detailed warnings
  • Debug logs track msgIndex state transitions

4. Schema Consistency

  • Fixed typo: hibernatableWebSocket → hibernatableWebSockets (plural)

Issues & Concerns

1. Race Condition in onOpen (router-endpoints.ts:395-464)
actor.handleWebSocket is called without await, so async errors won't be caught by try-catch.
Recommendation: Either await the call or add separate error handling.

2. Type Safety Issues

  • Line 417: (ws as any).__adapter bypasses type safety
  • Line 42: (websocket as any).terminate?.()
    Recommendation: Use WeakMap for adapter association or define proper interface.

3. Incomplete Error Cleanup (router-endpoints.ts:458-464)
Close promise not resolved in catch block could cause deadlocks.
Recommendation: Add closePromiseResolvers.resolve() in catch block.
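A minimal sketch of the idea, using an explicit guard object rather than the actual closePromiseResolvers (which this review only implies has a Promise.withResolvers-like shape):

```typescript
// Sketch: always settle the close promise when setup fails, so awaiters
// cannot deadlock. All names are illustrative.
function makeCloseGuard() {
    let resolveClosed!: () => void;
    let settled = false;
    const closed = new Promise<void>((r) => {
        resolveClosed = () => {
            settled = true;
            r();
        };
    });
    return {
        closed,
        resolve: () => resolveClosed(),
        isSettled: () => settled,
    };
}

// Run setup; on failure, settle the close promise before rethrowing.
function runSetup(guard: ReturnType<typeof makeCloseGuard>, setup: () => void): void {
    try {
        setup();
    } catch (err) {
        guard.resolve();
        throw err;
    }
}
```

On the success path the promise stays pending and is settled later by the normal close handler, which is the behavior the review expects.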

4. BigInt to Number Conversion (actor-driver.ts:189-191)
Schema uses BigInt but runtime uses Number - potential precision loss.
Recommendation: Use BigInt consistently or document safe range.

5. Missing Boundary Check
No upper boundary check on message indices.
Recommendation: Add sanity check or high-water mark warning.
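One possible shape for that check; the bounds below are assumptions, not values from the codebase:

```typescript
// Sketch: reject indices outside a sane range and flag suspiciously
// large jumps. MAX_INDEX_GAP is an assumed high-water mark.
const MAX_INDEX_GAP = 10_000n;

function checkIndexBounds(
    lastIndex: bigint,
    incoming: bigint,
): "ok" | "invalid" | "large-gap" {
    if (incoming < 0n || incoming > BigInt(Number.MAX_SAFE_INTEGER)) {
        return "invalid";
    }
    if (lastIndex >= 0n && incoming - lastIndex > MAX_INDEX_GAP) {
        return "large-gap";
    }
    return "ok";
}
```

"invalid" would warrant closing the connection; "large-gap" could start as a warning metric until real-world gap sizes are known.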

6. Schema Migration
Question: Verify migration path for hibernatableWebSocket → hibernatableWebSockets field rename won't break existing actors.

7. Removed Sleep Prevention (instance/mod.ts:671-686)
Question: Is removing activeRawWebSockets sleep prevention intentional? Should be documented if so.

8. Error Handler Cleanup
Question: Should connDisconnected be called in onError handler similar to onClose?


Recommendations

  1. Add await or error handling for actor.handleWebSocket
  2. Resolve close promise in error path
  3. Use WeakMap instead of any casts
  4. Add test coverage for msgIndex restoration, out-of-sequence messages, error cleanup
  5. Document sleep behavior change
  6. Add JSDoc to createRawWebSocketSocket
  7. Verify schema migration path

Minor Nits

  • Line 372: typo Remvoe → Remove (pre-existing)
  • Extract isHibernatable check to helper
  • Simplify sequenceType ternary with helper function

Security

No security concerns identified.

Performance

Improvements from debouncing WebSocket acks and proper hibernatable connection tracking.

Overall

Well-thought-out fix for hibernatable WebSocket state management. Main concerns are error handling and type safety.

Recommendation: Request changes for error handling, then approve after fixes.

@graphite-app
Contributor

graphite-app bot commented Nov 13, 2025

Merge activity

  • Nov 13, 8:47 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 13, 8:48 PM UTC: CI is running for this pull request on a draft pull request (#3463) due to your merge queue CI optimization settings.
  • Nov 13, 8:50 PM UTC: Merged by the Graphite merge queue via draft PR: #3463.

graphite-app bot pushed a commit that referenced this pull request Nov 13, 2025
@graphite-app graphite-app bot closed this Nov 13, 2025
@graphite-app graphite-app bot deleted the 11-06-fix_rivetkit_properly_handle_msgindex_for_hibernatable_websocket_reconnection branch November 13, 2025 20:50