chore(pegboard-gateway): add new message id format & add deprecated tunnel ack #3492

NathanFlurry · 2025-11-19T21:22:25Z

No description provided.

vercel · 2025-11-19T21:22:30Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
rivetkit-serverless	Error			Nov 20, 2025 7:38pm

3 Skipped Deployments

Project	Deployment	Preview	Updated (UTC)
rivet-cloud	Ignored		Nov 20, 2025 7:38pm
rivet-inspector	Ignored	Preview	Nov 20, 2025 7:38pm
rivet-site	Ignored	Preview	Nov 20, 2025 7:38pm

NathanFlurry · 2025-11-19T21:22:43Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

fix(tunnel): implement ping pong #3491 : 2 dependent PRs (#3475 , #3494 )
chore(pegboard-gateway): add new message id format & add deprecated tunnel ack #3492 👈 (View in Graphite)
chore(pegboard-gateway): remove TunnelAck #3488
chore(engine-runner): remove echo close on ToClientWebSocketClose #3487
chore(rivetkit): implement new hibernating ws protocol #3464 : 1 other dependent PR (#3452 )
fix(actor-kv): fix listing keys #3484
fix(guard): handle actor stopped during ws/req #3481
chore: rename retry -> hibernate in runner protocol #3473
fix(udb): update backoff algo #3469
chore: write/update docs #3468
chore: add debug scripts #3467
main

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2025-11-19T21:25:27Z

Code Review

This PR introduces a significant architectural improvement to the tunnel messaging system by implementing a more efficient message ID format and removing the need for explicit tunnel acknowledgments. Overall, the changes are well-structured and maintain backward compatibility. Here are my findings:

✅ Strengths

Smart ID Design: The new message ID format (gateway_id + request_id + message_index) is efficient at 12 bytes total and provides all necessary routing information in a single identifier.
Backward Compatibility: The DeprecatedTunnelAck approach with version_needs_tunnel_ack() is a clean way to maintain compatibility with older protocol versions while removing unnecessary overhead.
Improved Efficiency: Removing the explicit tunnel acknowledgment and GC cycles for non-WebSocket messages reduces message overhead and simplifies the state tracking logic.
Type Safety: Using distinct type aliases (GatewayId, RequestId, MessageIndex, MessageId) provides better type safety across Rust and TypeScript implementations.
Consistent Implementation: The Rust (tunnel/id.rs) and TypeScript (tunnel-id.ts) implementations are well-aligned with proper validation.

⚠️ Issues & Concerns

1. Resource Leak: Request ID Not Released ⚠️ CRITICAL

Location: engine/packages/guard-core/src/proxy_service.rs:1889

The request ID generated in the WebSocket custom serve handler is never released. The generate_unique_request_id() adds the ID to the inflight_requests HashSet, but there's no corresponding release_request_id() call after the WebSocket loop completes.

Impact: This will cause a slow memory leak where request IDs accumulate in the HashSet indefinitely. Over time with many connections, this could lead to:

Increased collision probability when generating new IDs
Memory growth (though minimal at 4 bytes per ID + HashSet overhead)
Eventually exhausting the retry limit (100 attempts) when the HashSet becomes too full

Fix: Add cleanup after line 2093:

anyhow::Ok(())
}
.instrument(tracing::info_span!("handle_ws_task_custom_serve"))
// Add RAII guard or explicit cleanup
.then(|result| async move {
    state.release_request_id(request_id).await;
    result
})

Or better, implement a RAII guard that releases on drop.
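A minimal sketch of what such a guard could look like. `RequestIdGuard` and `InflightSet` are hypothetical names, and the sketch assumes a synchronous `std::sync::Mutex` because `Drop` cannot `.await`; the async `release_request_id()` described above would need a synchronous set (or a spawned cleanup task) for this pattern to apply directly.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// 4-byte request ID, matching the size cited in this review.
type RequestId = [u8; 4];

/// Hypothetical stand-in for the proxy state's `inflight_requests` set.
type InflightSet = Arc<Mutex<HashSet<RequestId>>>;

/// Hypothetical guard: owns a request ID and removes it from the set on drop.
struct RequestIdGuard {
    inflight: InflightSet,
    request_id: RequestId,
}

impl RequestIdGuard {
    /// Registers `request_id`; returns `None` if it is already in flight.
    fn acquire(inflight: &InflightSet, request_id: RequestId) -> Option<Self> {
        let inserted = inflight.lock().unwrap().insert(request_id);
        inserted.then(|| RequestIdGuard {
            inflight: Arc::clone(inflight),
            request_id,
        })
    }
}

impl Drop for RequestIdGuard {
    fn drop(&mut self) {
        // Runs on normal return, early `?` return, and unwinding panic alike.
        // A poisoned lock is recovered so cleanup still happens.
        let mut set = self.inflight.lock().unwrap_or_else(|e| e.into_inner());
        set.remove(&self.request_id);
    }
}
```

Holding the guard for the lifetime of the WebSocket task would make the release implicit on every exit path, which is the property the explicit `.then(...)` cleanup cannot guarantee if the task is cancelled.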

2. Missing Cleanup for HTTP Custom Serve

Location: engine/packages/guard-core/src/proxy_service.rs:1086

Similar to the WebSocket issue, the HTTP custom serve path generates a request ID but doesn't appear to clean it up after the request completes. The scope ends at line ~1130, but there's no explicit release_request_id() call.

Fix: Ensure cleanup after the request completes, possibly with a scope guard or defer-like pattern.

3. Potential Panic in TupleUnpack

Location: engine/packages/pegboard/src/keys/hibernating_request.rs:56-64

let gateway_id: GatewayId = gateway_id_bytes
    .as_slice()
    .try_into()
    .expect("invalid gateway_id length");  // ⚠️ Panic

Using .expect() in deserialization code can panic if database corruption or version mismatches occur. This is inside TupleUnpack, so corruption could cause runtime panics.

Recommendation: Convert to proper error handling:

let gateway_id: GatewayId = gateway_id_bytes
    .as_slice()
    .try_into()
    .map_err(|_| PackError::InvalidFormat)?;
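As a self-contained illustration of the same idea, the sketch below converts a byte slice into a 4-byte ID and reports the actual length on failure instead of panicking. `InvalidLength` and `gateway_id_from_bytes` are hypothetical names; the real code would return whatever error type `TupleUnpack` expects.

```rust
use std::convert::TryFrom;
use std::fmt;

type GatewayId = [u8; 4];

/// Hypothetical error type carrying expected vs. actual length.
#[derive(Debug, PartialEq)]
struct InvalidLength {
    field: &'static str,
    expected: usize,
    actual: usize,
}

impl fmt::Display for InvalidLength {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "invalid {} length: expected {} bytes, got {}",
            self.field, self.expected, self.actual
        )
    }
}

impl std::error::Error for InvalidLength {}

/// Fallible conversion: corrupted or mis-versioned data becomes an `Err`, not a panic.
fn gateway_id_from_bytes(bytes: &[u8]) -> Result<GatewayId, InvalidLength> {
    <GatewayId>::try_from(bytes).map_err(|_| InvalidLength {
        field: "gateway_id",
        expected: 4,
        actual: bytes.len(),
    })
}
```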

4. Serialization Size Verification

Location: engine/packages/pegboard/src/tunnel/id.rs:34-39

The build_message_id() function verifies that exactly 12 bytes were written, which is good. However, this check happens at runtime. Consider:

Is the 12-byte size guaranteed by the BARE schema?
The struct's fields do not add up to it: GatewayId (4 bytes) + RequestId (4 bytes) + MessageIndex (u16 = 2 bytes) = 10 bytes, not 12

Potential Issue: The serialization may be adding 2 bytes of padding or metadata. Verify this is intentional and documented.

5. Wrapping Arithmetic Documentation

Location: engine/packages/pegboard-gateway/src/shared_state.rs:332

hs.pending_ws_msgs.retain(|msg| {
    wrapping_gt(msg.message_index, ack_index)
});

The use of wrapping_gt for u16 indices is correct for handling wraparound, but there's no comment explaining this design choice. Add a comment about why wrapping comparison is needed (message indices can wrap around after 65535).
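To make the wraparound behavior concrete, here is one common way to write a wrap-aware comparison (serial-number style). This is a sketch under the assumption that indices never drift more than half the u16 range apart; the PR's actual `wrapping_gt` is not shown in this review and may differ. `retain_unacked` is a hypothetical helper mirroring the `retain` call above.

```rust
/// Wrap-aware "greater than" for u16 indices: `a` counts as newer than `b`
/// when it is 1..=32767 steps ahead of `b` modulo 2^16.
fn wrapping_gt(a: u16, b: u16) -> bool {
    let distance = a.wrapping_sub(b);
    distance != 0 && distance < 0x8000
}

/// Keeps only the pending indices that are newer than the acknowledged one.
fn retain_unacked(pending: &mut Vec<u16>, ack_index: u16) {
    pending.retain(|&index| wrapping_gt(index, ack_index));
}
```

With an ack at 65535, indices 0 and 1 are treated as newer and stay pending, whereas a plain `>` comparison would drop them.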

6. Missing Test Coverage

I don't see test files in the diff for:

Request ID collision handling (the 100-attempt retry logic)
Message ID serialization/deserialization edge cases
WebSocket hibernation with message index wraparound
Backward compatibility with older protocol versions

Recommendation: Add unit tests, especially for the collision handling logic in generate_unique_request_id().

📝 Minor Observations

Logging: Good use of structured logging with ?request_id and proper formatting functions like tunnel_id::request_id_to_string().
Protocol Schema: The v3.bare schema is well-documented with comments explaining deprecated types.
Error Messages: Error messages are descriptive and include expected vs. actual values (e.g., byte length mismatches).
Code Consistency: The removal of index field from ToClientWebSocketMessage and moving it to the message ID level is architecturally cleaner.

🎯 Recommendations

Before Merging:

✅ Fix the request ID resource leak (critical)
✅ Add error handling instead of .expect() in TupleUnpack
✅ Verify the 12-byte message ID size (10 bytes expected from types)
✅ Add test coverage for edge cases

Future Improvements:

Consider implementing a RAII guard for request ID management
Add integration tests for protocol version compatibility
Document the message index wrapping behavior

Summary

This is a solid architectural improvement that reduces message overhead and simplifies the tunnel messaging protocol. The main blocker is the request ID leak which must be fixed before merge. The other issues are important for robustness but less critical.

Great work on maintaining backward compatibility and providing clear migration paths! 🚀

claude · 2025-11-19T21:26:07Z

Code Review

I've reviewed this PR which adds a new message ID format and deprecates the old tunnel ack mechanism. Here's my feedback:

Positive Aspects

✅ Well-structured message ID system: The new MessageId format combining gateway_id (4 bytes) + request_id (4 bytes) + message_index (2 bytes) is compact and efficient at 12 bytes total. The use of serde_bare for serialization is appropriate.

✅ Backwards compatibility: The PR thoughtfully handles protocol versioning with version_needs_tunnel_ack() (engine/sdks/rust/runner-protocol/src/compat.rs:5-7) to support older clients. The deprecated DeprecatedTunnelAck is properly ignored by newer code.

✅ Clean architecture: The new pegboard::tunnel::id module (engine/packages/pegboard/src/tunnel/id.rs) provides well-organized utility functions with proper error handling. The TypeScript implementation mirrors the Rust implementation well.

✅ Improved state management: Removing the pending_msgs tracking and MESSAGE_ACK_TIMEOUT logic simplifies the codebase significantly. The new approach using message_index for WebSocket message acknowledgment is cleaner.

✅ Request ID collision handling: Good defensive programming in generate_unique_request_id() (engine/packages/guard-core/src/proxy_service.rs:657-683) with collision detection, though collisions should be extremely rare with 4 billion possibilities.

Issues & Concerns

🔴 Critical: Request ID Not Released After Completion

Location: engine/packages/guard-core/src/proxy_service.rs:1889

The generate_unique_request_id() adds request IDs to inflight_requests, but I don't see a corresponding release_request_id() call in the WebSocket handling task. This will cause a memory leak where request IDs are never removed from the HashSet.

// At line 1889, request_id is generated
let request_id = state.generate_unique_request_id().await?;

// ... but I don't see state.release_request_id(request_id) anywhere in the task

Recommendation: Add a release_request_id() call in the cleanup section of the WebSocket task, similar to how it should be done after HTTP request completion.

🟡 Medium: Missing Request ID Release in HTTP Path

Location: engine/packages/guard-core/src/proxy_service.rs:1086

Similar to above, the HTTP request path generates a unique request ID but doesn't appear to release it when the request completes. The request ID should be released after the response is sent.

Recommendation: Wrap the request handling in a scope guard or defer pattern to ensure release_request_id() is called on all code paths (success, error, panic).

🟡 Medium: Serialization Size Not Verified at Compile Time

Location: engine/packages/pegboard/src/tunnel/id.rs:34-39

The code uses a runtime assertion to verify the serialized size is exactly 12 bytes:

ensure!(
    written == 12,
    "message id serialization produced wrong size: expected 12 bytes, got {}",
    written
);

While this is good defensive programming, if the BARE schema changes, this will only be caught at runtime.

Recommendation: Consider adding a compile-time test or const assertion to verify the size. At minimum, add a unit test that exercises this code path.
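One possible shape for a const assertion, assuming the type aliases have the sizes quoted in this thread. Note the limitation: it checks in-memory sizes only, so it cannot prove what serde_bare writes; a unit test on the serialized length would still be needed.

```rust
use std::mem::size_of;

// Assumed aliases, using the sizes cited in the review.
type GatewayId = [u8; 4];
type RequestId = [u8; 4];
type MessageIndex = u16;
type MessageId = [u8; 12];

/// Bytes needed for the three components when each is written at fixed width.
const PARTS_LEN: usize =
    size_of::<GatewayId>() + size_of::<RequestId>() + size_of::<MessageIndex>();

// Evaluated at compile time: the build fails if the parts outgrow the buffer.
const _: () = assert!(PARTS_LEN <= size_of::<MessageId>());
```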

🟡 Medium: Wrapping Arithmetic Without Overflow Documentation

Location: engine/packages/pegboard-gateway/src/shared_state.rs:136

The message index uses wrapping arithmetic:

req.message_index = req.message_index.wrapping_add(1);

This is correct for a u16 counter, but there's no documentation about what happens when the index wraps from 65535 to 0, especially for WebSocket message acknowledgments.

Recommendation: Add a comment explaining the wrapping behavior and confirming that the wrapping_gt comparison (line 327) handles this correctly across the wrap boundary.

🟢 Minor: Inconsistent Error Message Formatting

Location: engine/packages/pegboard/src/tunnel/id.rs:80-81 vs TypeScript implementation

The Rust error messages use lowercase ("invalid request id length") while some TypeScript errors might use different casing. While minor, consistency helps with debugging.

Recommendation: Standardize error message formats across Rust and TypeScript implementations.

🟢 Minor: Missing Documentation on MessageId Format

Location: engine/sdks/schemas/runner-protocol/v3.bare:217

The BARE schema defines MessageId as data[12] but doesn't document the internal structure. Future maintainers would benefit from a comment explaining the layout.

Recommendation: Add a comment in the BARE schema:

# MessageId is a 12-byte array containing:
# - 4 bytes: gateway_id (identifies the gateway instance)
# - 4 bytes: request_id (unique to this gateway)  
# - 2 bytes: message_index (u16, wraps at 65535)
# - 2 bytes: not accounted for by the fields above (BARE itself adds no padding; confirm what these bytes hold before documenting)

Performance Considerations

✅ The new format is actually more efficient - the old approach with UUID-based IDs used 16 bytes per message ID, while the new format uses 12 bytes. Over millions of messages, this saves bandwidth.

✅ Removing the MESSAGE_ACK_TIMEOUT GC check eliminates unnecessary processing overhead.

⚠️ The inflight_requests HashSet lock is held briefly during insert/remove operations, which is acceptable. However, with the memory leak mentioned above, this HashSet will grow unbounded.

Security Considerations

✅ Using random u32 for both gateway_id and request_id provides sufficient entropy for non-cryptographic uniqueness.

⚠️ The collision retry mechanism uses a hardcoded MAX_TRIES = 100. A single attempt collides with probability N / 2^32 when N request IDs are in flight, so exhausting 100 attempts is astronomically unlikely at realistic loads. In theory an attacker could try to DoS by triggering many requests simultaneously, but this is extremely unlikely to be exploitable in practice.

Test Coverage

❌ Missing tests: I found no test files for the new tunnel ID functionality. Given this is a protocol-level change, unit tests are strongly recommended.

Recommended tests:

Test build_message_id and parse_message_id round-trip
Test message index wrapping from 65535 to 0
Test collision handling in generate_unique_request_id
Test wrapping_gt comparison across wrap boundary
Integration test verifying old protocol clients still work with DeprecatedTunnelAck

Summary

This is a solid refactoring that simplifies the protocol and improves efficiency. The main concern is the request ID memory leak which must be fixed before merging. The other issues are lower priority but should still be addressed.

Required before merge:

Fix request ID memory leak (add release_request_id() calls)

Recommended before merge:

Add unit tests for tunnel ID module
Document message index wrapping behavior
Add schema documentation

Nice to have:

Standardize error messages
Add compile-time size verification test

Great work on the backwards compatibility handling! The deprecated tunnel ack approach is clean and will make rolling this out much safer.

Review generated with Claude Code

MasterPtato · 2025-11-19T21:38:49Z

engine/packages/guard-core/src/proxy_service.rs

 	route_cache: RouteCache,
 	rate_limiters: Cache<(Id, std::net::IpAddr), Arc<Mutex<RateLimiter>>>,
 	in_flight_counters: Cache<(Id, std::net::IpAddr), Arc<Mutex<InFlightCounter>>>,
+	inflight_requests: Arc<Mutex<HashSet<RequestId>>>,


use scc HashSet

MasterPtato · 2025-11-19T21:41:52Z

engine/packages/pegboard/src/tunnel/id.rs

@@ -0,0 +1,86 @@
+use anyhow::{Context, Result, ensure};


Should be in runner protocol crate

MasterPtato · 2025-11-19T21:41:56Z

engine/packages/pegboard/src/tunnel/mod.rs

@@ -0,0 +1 @@
+pub mod id;


Should be in runner protocol crate

claude · 2025-11-19T21:57:31Z

PR Review: New Message ID Format & Deprecated Tunnel Ack

I have completed a thorough review of this PR. Overall, this is a solid refactoring that improves the message ID system and maintains backwards compatibility. Here are my findings:

Strengths

Well-designed Message ID Format: The new message ID structure (gateway_id + request_id + message_index) is well thought out and uses efficient 4-byte IDs with proper serialization via serde_bare.
Backwards Compatibility: Excellent handling of protocol version compatibility with version_needs_tunnel_ack() and DeprecatedTunnelAck messages for older clients.
Improved WebSocket Hibernation: The message tracking logic is now simpler and more maintainable - storing message_index directly in PendingWebsocketMessage is much cleaner than the previous last_ws_msg_index approach.
Request ID Collision Detection: The generate_unique_request_id() implementation properly prevents collisions with retry logic and warning logs.
Database Schema Updates: The hibernating request keys properly incorporate both gateway_id and request_id for better data modeling.

Critical Issues

1. Memory Leak: Request IDs Never Released

The release_request_id() method is defined in proxy_service.rs:687-689 but never called. This means the inflight_requests HashSet will grow indefinitely, memory usage will increase over time, and eventually could lead to performance degradation.

Fix needed: Add cleanup calls after custom serve request completes (both success and error paths) and after WebSocket task completion.

2. Unsafe .expect() Calls in Database Code

In engine/packages/pegboard/src/keys/actor.rs there are .expect() calls that will panic if the database contains invalid data. Per CLAUDE.md error handling guidelines, these should return proper Result<> with context instead of panicking in production code paths.

Moderate Issues

3. Missing Test Coverage - No tests added for the new message ID format, collision detection logic, or backwards compatibility handling.

4. Message Index Overflow - proxy_service.rs uses wrapping_add(1) for message_index (u16). After 65,535 messages it wraps to 0. Consider adding a comment or warning log.

5. Import Organization - Some files have imports that should be moved to the top of the file per CLAUDE.md style.

Performance & Security

Positive: Smaller message IDs, removed unnecessary timeout tracking
Security: Random ID generation is appropriate for non-cryptographic use
No security vulnerabilities introduced

Summary

This is a strong refactoring with good architectural decisions, but has two critical issues that must be fixed before merge:

MUST FIX: Implement release_request_id() calls to prevent memory leak
SHOULD FIX: Replace .expect() with proper error handling in database code

Once these are addressed, this PR will be ready to merge. The new message ID format is well-designed and the backwards compatibility approach is solid.

claude · 2025-11-19T21:58:53Z

PR Review: New Message ID Format & Deprecated Tunnel Ack

This is a well-structured refactoring that modernizes the tunnel message ID system. The changes introduce a more efficient message identification scheme while maintaining backward compatibility. Here's my detailed review:

✅ Strengths

1. Improved Message ID Design

The new compact 12-byte message ID format (gateway_id: 4 bytes + request_id: 4 bytes + message_index: 2 bytes) is much more efficient than the previous UUID-based approach
Message IDs are now deterministically constructed from components rather than randomly generated, which improves debuggability
The serialization using serde_bare ensures consistent encoding across Rust and TypeScript implementations

2. Strong Backward Compatibility

Protocol version converters (v2_to_v3, v3_to_v2) properly handle the migration between message formats
The DeprecatedTunnelAck mechanism ensures older runners continue to work
The version_needs_tunnel_ack() check in engine/sdks/rust/runner-protocol/src/compat.rs:5 cleanly segregates legacy behavior

3. Good Type Safety

Type aliases (GatewayId, RequestId, MessageIndex) provide semantic clarity
Fixed-size arrays prevent size-related bugs at compile time
The TypeScript implementation mirrors the Rust types well

4. Hibernation Request Improvements

The updated HibernatingRequest structure now properly tracks both gateway_id and request_id (engine/sdks/schemas/runner-protocol/v3.bare:175-178)
Database operations correctly use composite keys for hibernating request tracking

⚠️ Issues & Suggestions

1. Request ID Collision Handling ⚠️ Medium Priority

In engine/packages/guard-core/src/proxy_service.rs:657-684, the collision detection loop only tries 100 times before giving up. While collisions are extremely unlikely with 4 billion possibilities:

const MAX_TRIES: u32 = 100;

Issue: If the system has sustained high load with many concurrent requests, there's a non-zero chance of exhausting attempts, which would cause request failures.

Suggestions:

Consider logging metrics when collisions occur to monitor if this becomes an issue in production
Alternatively, increase MAX_TRIES to a higher value (e.g., 1000) for better safety margin
The collision warning at line 674 is good but could include the current inflight count for debugging

2. Message Index Wrapping Logic ℹ️ Low Priority

The message index uses u16::wrapping_add for incrementing (engine/packages/pegboard-gateway/src/shared_state.rs:150):

req.message_index = req.message_index.wrapping_add(1);

While wrapping is intentional, there's no explicit comment explaining this design choice. After ~65k messages, the index wraps to 0.

Suggestion: Add a comment explaining the wrapping behavior is intentional and acceptable for the use case.

3. WebSocket Message Index Extraction 🐛 Potential Bug

In engine/sdks/rust/runner-protocol/src/versioned.rs:1295-1296, when converting v3 to v2, the code extracts the message index using:

let index = u16::from_be_bytes([message_id[8], message_id[9]]);

Issue: This hardcodes big-endian extraction, but the MessageId is serialized using serde_bare, and the BARE specification encodes fixed-width integers such as u16 in little-endian order. If the index bytes are written by serde_bare, this extraction will produce byte-swapped values; if the encoding is variable-length, the offsets may be wrong as well.

Recommendation:

Verify the byte order used by serde_bare for u16 serialization
Consider using the parse_message_id function instead of manual byte extraction for consistency
Add a test case to ensure round-trip conversion preserves the message index correctly
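The effect of a byte-order mismatch can be shown in isolation. The sketch below assumes the index sits at bytes 8 and 9, as in the converter snippet above, and that it was written little-endian (the order BARE uses for fixed-width integers); the helper names are hypothetical.

```rust
type MessageId = [u8; 12];

/// Offset of the index inside the ID, taken from the converter snippet;
/// treat it as an assumption about the real layout.
const INDEX_OFFSET: usize = 8;

/// Reads the index as little-endian.
fn index_le(id: &MessageId) -> u16 {
    u16::from_le_bytes([id[INDEX_OFFSET], id[INDEX_OFFSET + 1]])
}

/// Reads the same two bytes as big-endian, like the v3 -> v2 converter does.
fn index_be(id: &MessageId) -> u16 {
    u16::from_be_bytes([id[INDEX_OFFSET], id[INDEX_OFFSET + 1]])
}

/// Builds an ID whose index was written little-endian.
fn id_with_le_index(index: u16) -> MessageId {
    let mut id = [0u8; 12];
    id[INDEX_OFFSET..INDEX_OFFSET + 2].copy_from_slice(&index.to_le_bytes());
    id
}
```

Under these assumptions an index of 1 written little-endian reads back as 256 through the big-endian path, so a round-trip test with a non-palindromic value such as 0x1234 would expose the mismatch immediately.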

4. Cleanup in GC Logic ✅ Good Fix

The removal of MESSAGE_ACK_TIMEOUT and associated pending_msgs tracking in shared_state.rs:377-467 is correct since v3 no longer uses tunnel acks. However, I noticed a potential typo at line 410:

if now.duration_since(earliest_pending_ws_msg.send_instant)
    > HWS_MESSAGE_ACK_TIMEOUT  // Changed from <= to >
{

The comparison is correct: a pending message is garbage collected once the timeout is exceeded. The inline comment should state that the change was intentional.

5. Missing Error Context ℹ️ Low Priority

In engine/packages/pegboard-gateway/src/shared_state.rs:212, when parsing fails:

Err(err) => {
    tracing::error!(?err, "failed to parse message id");
    continue;
}

Suggestion: Include the raw message_id bytes in the error log for debugging:

tracing::error!(?err, message_id=?msg.message_id, "failed to parse message id");

6. Request ID Cleanup ⚠️ Medium Priority

The release_request_id function is added but I don't see it being called in the diff. The inflight_requests HashSet will grow unbounded if IDs aren't released.

Action Required: Verify that release_request_id is called when:

A request completes successfully
A request fails/times out
A WebSocket connection closes

If this is missing, it's a memory leak.

📝 Minor Code Quality Notes

Logging consistency (engine/packages/pegboard-gateway/src/lib.rs:216, 352, 418): Using tunnel_id::request_id_to_string for debug formatting is good for human readability
Empty PR body: The PR has no description explaining the motivation and changes. Consider adding context for future reference.
Hard tabs: Code follows the rustfmt.toml convention correctly ✅

🧪 Testing Recommendations

Add a test verifying buildMessageId → parseMessageId round-trip in both Rust and TypeScript
Test protocol version conversion with actual WebSocket messages containing indices
Verify request ID collision handling under high concurrent load
Test hibernation request tracking with the new composite keys

Summary

This is a solid refactoring with thoughtful backward compatibility. The main concerns are:

Critical: Verify release_request_id() is being called (potential memory leak)
Important: Verify the u16 byte order assumption in the v3→v2 converter
Nice-to-have: Improve collision handling metrics and documentation

Overall: ✅ Approve with requested verification of the critical issues above

claude · 2025-11-19T23:30:02Z

PR Review: Add New Message ID Format & Deprecated Tunnel Ack

Overview

This PR introduces a significant refactor to the message ID system in the Pegboard gateway, moving from UUID-based IDs to a more compact and structured format. The changes touch critical paths in the proxy service and gateway communication.

Code Quality & Best Practices ✅

Strengths:

Well-structured message ID system with clear component separation (GatewayId, RequestId, MessageIndex)
Good use of type aliases for clarity (type GatewayId = [u8; 4], etc.)
Proper error handling with context using anyhow
Clean separation of concerns between ID generation and parsing

Areas for Improvement:

1. Inconsistent Whitespace (engine/packages/guard-core/src/proxy_service.rs:641)

// Generate unique request ID
        let request_id = Some(self.generate_unique_request_id().await?);

Mixed tabs and spaces. The leading whitespace appears to be spaces instead of tabs. Per CLAUDE.md, this project uses hard tabs for Rust formatting.

2. Missing Import Consolidation (engine/packages/guard-core/src/proxy_service.rs:16)

The import for pegboard::tunnel::id is added separately:

use pegboard::tunnel::id::{RequestId, generate_request_id};

Consider grouping this with other pegboard imports if any exist nearby.

Potential Bugs & Issues ⚠️

1. Critical: Resource Leak in Error Paths (engine/packages/guard-core/src/proxy_service.rs:1094-1173)

In the CustomServe HTTP request handler, there's a potential resource leak:

let request_id = match self
    .state
    .acquire_in_flight(client_ip, &actor_id, &req_headers)
    .await?
{
    Some(id) => id,
    None => {
        return Err(errors::RateLimit { /* ... */ }.build());
    }
};

// ... request processing loop ...
while attempts < max_attempts {
    let res = handler.handle_request(req_collected.clone(), request_context, request_id).await;
    if should_retry_request(&res) {
        // ... retry logic ...
        continue; // ⚠️ Resource still held during retries!
    }
    
    // Release only happens on non-retry path
    self.state.release_in_flight(client_ip, &actor_id, request_id).await;
    return res;
}

// Release after all retries
self.state.release_in_flight(client_ip, &actor_id, request_id).await;

Issue: The acquire_in_flight call increments both the in-flight counter AND inserts the request_id into inflight_requests HashSet. However, during the retry loop, these resources remain held. If the retry loop iterates many times (especially with backoff delays), this could:

Artificially inflate in-flight counts
Prevent request ID reuse unnecessarily
Cause downstream systems to see inflated connection counts

Recommendation: Consider whether the in-flight tracking should be per-attempt or per-request. If per-request is correct (current design), this is fine. If per-attempt is desired, you'll need to release/reacquire between attempts.

2. Duplicate acquire_in_flight Call (engine/packages/guard-core/src/proxy_service.rs:809 & 1100)

The acquire_in_flight is called at line 809 for all requests:

let request_id = match self.state.acquire_in_flight(client_ip, &actor_id, req.headers()).await? {
    Some(id) => id,
    None => { return Err(errors::RateLimit { /* ... */ }.build()); }
};

Then for CustomServe routes, it's called AGAIN at line 1100:

let request_id = match self.state.acquire_in_flight(client_ip, &actor_id, &req_headers).await? {
    // ...
}

Issue: This will:

Increment the in-flight counter twice
Generate two different request IDs (the first one is overwritten)
Insert two request IDs into inflight_requests but only release one

Impact: This is a resource leak that will eventually exhaust the request ID space and cause the generate_unique_request_id function to fail after MAX_TRIES attempts.

Recommendation: Remove the first acquire_in_flight call (line 809) for CustomServe routes, or ensure it's only called once per request lifecycle.

3. Collision Handling May Cause Performance Issues (engine/packages/pegboard/src/tunnel/id.rs:67-69 & engine/packages/guard-core/src/proxy_service.rs:667-694)

pub fn generate_request_id() -> RequestId {
    rand::random::<u32>().to_le_bytes()
}

async fn generate_unique_request_id(&self) -> anyhow::Result<RequestId> {
    const MAX_TRIES: u32 = 100;
    let mut requests = self.inflight_requests.lock().await;
    
    for attempt in 0..MAX_TRIES {
        let request_id = generate_request_id();
        if !requests.contains(&request_id) {
            requests.insert(request_id);
            return Ok(request_id);
        }
        // Collision logging...
    }
    bail!("failed to generate unique request id after {} attempts", MAX_TRIES);
}

Issue: With 4-byte request IDs (2^32 ≈ 4.3 billion possibilities), the birthday bound puts the chance that some pair among ~65,536 random IDs collides at roughly 39%. The mutex lock is also held during the entire retry loop, which could become a bottleneck under high concurrency.

Recommendations:

Monitor collision rates in production (good that you're logging them)
Consider 8-byte request IDs if collision rates become problematic
Consider lock-free alternatives or reducing lock contention (acquire lock per attempt instead of holding across all attempts)
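The two probabilities involved are easy to conflate, so a small calculation may help. This sketch assumes IDs are drawn uniformly at random from the 2^32 space; the function names are hypothetical. The birthday figure describes any duplicate among a batch of random draws, while the retry loop only fails when a fresh draw hits an ID that is currently in flight.

```rust
/// Number of distinct 4-byte request IDs (2^32).
const ID_SPACE: f64 = 4_294_967_296.0;

/// Chance that one freshly drawn random ID matches one of `in_flight` live IDs.
fn per_attempt_collision(in_flight: u32) -> f64 {
    f64::from(in_flight) / ID_SPACE
}

/// Birthday approximation: chance that `n` independent draws contain a duplicate pair.
fn any_pair_collision(n: u32) -> f64 {
    let n = f64::from(n);
    1.0 - (-n * (n - 1.0) / (2.0 * ID_SPACE)).exp()
}

/// Chance that `tries` consecutive draws all collide, i.e. the retry loop gives up.
fn retries_exhausted(in_flight: u32, tries: i32) -> f64 {
    per_attempt_collision(in_flight).powi(tries)
}
```

At 65,536 in-flight IDs a single attempt collides with probability 2^-16, so occasional collision warnings are plausible under heavy load, but 100 consecutive collisions are not a realistic failure mode unless the set is leaking entries.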

4. WebSocket Resource Leak in Error Path (engine/packages/guard-core/src/proxy_service.rs:1920-2145)

Similar to issue #1, the WebSocket handling in CustomServe acquires resources but may not release them properly in all error paths. Specifically, the spawned task acquires request_id but the release only happens at the very end (line 2142). If the task panics or is cancelled before reaching line 2142, the resources leak.

Recommendation: Use a Drop guard or ensure release happens in a finally-equivalent pattern (e.g., using scopeguard crate or manual defer pattern).

Performance Considerations 🔍

1. Serialization on Hot Path (engine/packages/pegboard/src/tunnel/id.rs:28-31)

let mut message_id = [0u8; 12];
let mut cursor = std::io::Cursor::new(&mut message_id[..]);
serde_bare::to_writer(&mut cursor, &parts).context("failed to serialize message id parts")?;

Observation: Message IDs are built on every request/response message, making this a hot path. Using serde_bare serialization adds overhead compared to manual byte packing.

Measurement Needed: Profile whether this is a bottleneck. If message throughput is high, consider:

pub fn build_message_id(gateway_id: GatewayId, request_id: RequestId, message_index: MessageIndex) -> MessageId {
    let mut id = [0u8; 12];
    id[0..4].copy_from_slice(&gateway_id);
    id[4..8].copy_from_slice(&request_id);
    id[8..10].copy_from_slice(&message_index.to_le_bytes()); // u16 is 2 bytes; bytes 10..12 stay zero
    id
}

This would likely be faster (the gain should be confirmed with a benchmark) but loses the benefit of having serde maintain the format contract.

2. Lock Contention on inflight_requests (engine/packages/guard-core/src/proxy_service.rs:662-663, 669-678)

The inflight_requests HashSet is protected by a single Mutex, which is acquired for every request start and end. Under high request rates, this could become a bottleneck.

Recommendations:

Consider sharding the HashSet (e.g., 16 shards based on request_id hash)
Use dashmap::DashSet for sharded, lower-contention concurrent access (it is internally sharded with locks rather than lock-free)
Profile to confirm this is actually a bottleneck before optimizing
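A minimal sketch of the sharding idea using only the standard library. `ShardedInflight` is a hypothetical type, and it uses `std::sync::Mutex` rather than the async mutex in the PR; the shard is picked from the ID's numeric value on the assumption that IDs are already uniformly random.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

type RequestId = [u8; 4];

/// Hypothetical sharded set: each shard has its own lock, so requests that
/// land in different shards never contend with each other.
struct ShardedInflight {
    shards: Vec<Mutex<HashSet<RequestId>>>,
}

impl ShardedInflight {
    fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count)
            .map(|_| Mutex::new(HashSet::new()))
            .collect();
        ShardedInflight { shards }
    }

    /// Picks the shard from the ID's numeric value.
    fn shard(&self, id: &RequestId) -> &Mutex<HashSet<RequestId>> {
        let index = u32::from_le_bytes(*id) as usize % self.shards.len();
        &self.shards[index]
    }

    /// Returns `false` when the ID is already in flight (a collision).
    fn insert(&self, id: RequestId) -> bool {
        let mut set = self.shard(&id).lock().unwrap();
        set.insert(id)
    }

    /// Returns `true` when the ID was present and has been released.
    fn remove(&self, id: &RequestId) -> bool {
        let mut set = self.shard(id).lock().unwrap();
        set.remove(id)
    }
}
```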

Security Concerns 🔒

1. Request ID Predictability

Using rand::random::<u32>() without specifying a cryptographically secure RNG could make request IDs predictable if the RNG state is compromised.

Impact: Low to Medium - An attacker who can predict request IDs could potentially:

Inject messages into other users' request streams
Cause targeted collisions to DoS specific requests

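Recommendation: Note that rand::random::<u32>() already draws from rand's thread-local generator, so switching to rand::thread_rng().gen::<u32>() would not change behavior. If security is a concern for request routing, read directly from the OS source (for example rand::rngs::OsRng or the getrandom crate).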

2. No Validation of Message Index

The MessageIndex (u16) can wrap around after 65,536 messages. While wrapping_gt is used for comparison, there's no validation that prevents reuse of the same (gateway_id, request_id, message_index) tuple if a request lives long enough.

Impact: Low - Most requests won't send >65k messages, but long-lived WebSocket connections could.

Recommendation: Document the wrap-around behavior and ensure it's tested.

Test Coverage 📊

Observations:

The PR modifies 50 files with significant logic changes
Test file engine/packages/guard-core/tests/proxy.rs was not modified
No new tests added for:
- Request ID generation and collision handling
- Message ID serialization/deserialization
- Resource cleanup in error paths
- WebSocket hibernation with new message format

Recommendations:

Add unit tests for pegboard/src/tunnel/id.rs:
- Test build_message_id and parse_message_id round-trip
- Test edge cases (max values for each component)
- Test error handling for malformed message IDs
Add integration tests for resource management:
- Verify acquire_in_flight / release_in_flight pairing
- Test request ID uniqueness under concurrent load
- Verify no leaks when requests fail/timeout
Add WebSocket hibernation tests:
- Test message replay after hibernation with new message format
- Test message acknowledgment with wrapping indices
- Test cleanup when hibernation fails

Summary

Overall Assessment: This is a well-architected refactor that improves message ID structure and reduces overhead. However, there are critical resource leaks that need to be addressed before merging.

Must Fix Before Merge:

⚠️ Issue #2: Duplicate acquire_in_flight calls causing resource leaks
⚠️ Issue #1/#4: Ensure proper resource cleanup in all error paths

Recommended Improvements:

Add test coverage for new message ID format
Consider performance optimization for message ID serialization if profiling shows it's a bottleneck
Review cryptographic security requirements for request ID generation
Fix whitespace inconsistencies per CLAUDE.md formatting rules

Nice to Have:

Monitor collision rates in production
Add metrics for request ID generation attempts
Document wrap-around behavior for MessageIndex

Great work on the refactor! The new message ID format is much more efficient than UUIDs. Just needs the resource leak issues addressed.

🤖 Generated with Claude Code

claude · 2025-11-19T23:30:56Z

Pull Request Review

Summary

This PR introduces a new message ID format for the tunnel protocol and adds backward compatibility for deprecated tunnel acknowledgments. The changes migrate from a 16-byte UUID-based message ID to a more efficient 12-byte structured format.

Code Quality - Strengths

Well-structured protocol versioning with clean v2 to v3 migration
Clear separation of concerns with new tunnel/id.rs module
Backward compatibility maintained with DeprecatedTunnelAck
Consistent error handling using anyhow::Result

CRITICAL BUG - Double In-Flight Acquisition

Location: engine/packages/guard-core/src/proxy_service.rs:1097-1114

The CustomServe path acquires in-flight limits TWICE:

First at line 812-828 in handle_request()
Again at line 1097-1114 inside handle_http_request()

Impact: CustomServe HTTP requests count twice against the in-flight limit, reducing effective capacity by 50%.

Fix: Remove the second acquire_in_flight call in the CustomServe branch.

Other Issues

1. Request ID Collision Detection (proxy_service.rs:465-495)

Uses 4-byte random space with 100 max retries
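Birthday bound: with 2^32 possible IDs, a collision somewhere among the in-flight IDs becomes likely (roughly 39%) at around 65K concurrent requests; a single new ID collides with probability of only about n / 2^32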
Recommendation: Add metrics to track collision rates
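
To put numbers on the birthday-paradox point, here is a small sketch of the two probabilities involved, assuming uniformly random 32-bit IDs. The function names are illustrative and not from the PR.

```rust
/// Size of a 4-byte request ID space.
const ID_SPACE: f64 = 4_294_967_296.0; // 2^32

/// Approximate probability that at least two of `n` uniformly random IDs
/// are equal (birthday bound): 1 - exp(-n(n-1) / (2N)).
fn any_collision_probability(n: u64) -> f64 {
    let n = n as f64;
    1.0 - (-(n * (n - 1.0)) / (2.0 * ID_SPACE)).exp()
}

/// Probability that one freshly generated ID hits one of `n` IDs already in flight.
fn per_attempt_collision_probability(n: u64) -> f64 {
    n as f64 / ID_SPACE
}
```

With 65,536 requests in flight, the first function gives about 0.39, while the second gives 1/65,536 (about 0.0015%) per attempt. The retry loop only cares about the second number, so under these assumptions exhausting 100 retries would point to a cleanup bug rather than bad luck.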

2. Message Index Wrap-Around (pegboard-gateway/src/shared_state.rs:136-151)

u16 wraps after 65,535 messages
Uses wrapping_gt for comparison (correct)
Recommendation: Add comment and test case for wrap-around behavior

3. Lock Contention (proxy_service.rs:381, 465)

Global mutex on inflight_requests HashSet
Potential bottleneck at high request rates
Recommendation: Consider DashSet or RwLock

4. Test Coverage
Missing edge case tests for:

Collision handling
Message index wrap-around
Concurrent request ID generation
In-flight limit edge cases

Recommendations

High Priority:

1. Fix double in-flight acquisition (BLOCKING)
2. Evaluate lock contention under load
3. Add test coverage for edge cases

Medium Priority:

4. Document wrap-around behavior
5. Add collision rate metrics

Low Priority:

6. Clean up unused imports
7. Document RNG security properties

Conclusion

Solid refactoring with good protocol versioning. However, the double in-flight acquisition bug should be fixed before merging. The code follows Rivet conventions well. Great work on backward compatibility!

claude · 2025-11-20T03:09:35Z

PR Review: Tunnel Message ID Format & Deprecated Tunnel Ack

I've reviewed this PR which introduces a new message ID format and adds deprecated tunnel acknowledgments. Overall, the implementation is solid with good architectural decisions. Here are my findings:

✅ Strengths

Well-designed ID system - The new MessageId format (gateway_id + request_id + message_index) provides good uniqueness guarantees:
- Gateway IDs (4 bytes) ensure global uniqueness across gateways
- Request IDs (4 bytes) ensure uniqueness per gateway
- Message indices (u16) track order per request
Collision handling - The generate_unique_request_id() function properly handles potential collisions with retry logic and warning logs (proxy_service.rs:667-694)
Resource management - In-flight request tracking has been improved:
- Proper cleanup via release_in_flight()
- Request IDs are tracked in inflight_requests HashSet and released when done
- Two-phase garbage collection prevents premature cleanup
Protocol backward compatibility - Added DeprecatedTunnelAck to maintain compatibility with older protocols while transitioning to the new format
Structured logging - Consistent use of tracing with proper structured fields instead of string interpolation

⚠️ Issues & Concerns

1. Whitespace formatting issue (`proxy_service.rs:641`)

// Generate unique request ID
let request_id = Some(self.generate_unique_request_id().await?);

This line has mixed tabs/spaces (leading spaces visible). Should use hard tabs per rustfmt.toml convention.

2. Duplicate in-flight acquisition in CustomServe path (`proxy_service.rs:1099-1115`)

The CustomServe route acquires in-flight limit twice:

Once in handle_request() at line ~809
Again in handle_http_request() at line ~1100

This double-acquisition will:

Consume 2x the in-flight slots per request
Potentially reject valid requests prematurely
Leak request IDs if the second acquisition fails

Recommendation: Remove the duplicate acquisition in the CustomServe branch since the outer handle_request already acquired it.

3. WebSocket double-acquisition pattern (`proxy_service.rs:1922-1937`)

Similar issue for WebSocket upgrades in CustomServe path - acquiring in-flight twice, once in the outer handler and once in the spawned task.

Recommendation: Pass the already-acquired request_id from the outer scope instead of re-acquiring.

4. Request ID not released on early errors (`proxy_service.rs:1094+`)

In the CustomServe HTTP path, if body collection fails or other early errors occur after acquiring the request ID, the ID may not be released. Consider wrapping the entire block in a guard or using RAII pattern.

5. Missing test coverage

No new tests found for:

Request ID collision handling
Message ID serialization/deserialization edge cases
In-flight request tracking with new format
Wrapping behavior of MessageIndex (u16 overflow)

The wrapping_gt function (shared_state.rs:468-470) handles u16 wraparound, but lacks tests verifying correctness near boundaries (e.g., comparing 65535 with 0).
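
For reference, a wrap-aware comparison of this kind is often written as below. This is a sketch of the general technique, not the PR's actual wrapping_gt, whose body is not shown in this review.

```rust
/// Hypothetical wrap-aware "greater than" for u16 indices: `a` is considered
/// ahead of `b` when the forward distance from `b` to `a` is non-zero and
/// less than half the index space.
fn wrapping_gt(a: u16, b: u16) -> bool {
    a != b && a.wrapping_sub(b) < 0x8000
}
```

Under this definition, index 0 counts as greater than 65535 (it is one step ahead after the wrap), and neither of two indices exactly 32,768 apart is greater than the other. Boundary tests would need to pin down what the real implementation does in that ambiguous case.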

6. Hard-coded retry limit (`proxy_service.rs:668`)

const MAX_TRIES: u32 = 100;

With 4-byte request IDs (4 billion possibilities), 100 retries should be adequate, but consider:

Making this configurable for testing
Adding metrics to track collision frequency
The collision warning's ?request_id field is Debug-formatted, which cannot fail for a byte array but prints raw bytes; tracing::warn!(attempt, "request id collision") or a string-encoded ID would be easier to read

7. Serialization size verification (`tunnel/id.rs:34-39`)

Good defensive check! The error path uses anyhow's ensure!, which returns an Err instead of panicking:

anyhow::ensure!(written == 12, "...");

This is correct usage - just noting for awareness.

🔒 Security Considerations

Request ID uniqueness - Using random u32s is acceptable for preventing collisions, but not cryptographically secure. This is fine for your use case (preventing accidental collisions, not malicious attacks).
Base64 encoding - Using standard base64 for IDs in logs/debugging is good. No security issues there.

🎯 Performance Considerations

Lock contention - generate_unique_request_id() holds inflight_requests lock for entire retry loop. Under high collision rates (unlikely but possible), this could create contention. Consider:
- Generate ID outside lock
- Lock only to check + insert
- Retry loop outside lock
HashSet overhead - Tracking all in-flight request IDs in a global HashSet adds memory overhead. With 4-byte IDs, this is ~minimal impact.
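
The lock-scoping suggestion above could look roughly like this. It is a sketch with a standard-library mutex and a caller-supplied ID source (next_id stands in for the random generator); the signature is hypothetical and the PR's real function is async.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

type RequestId = [u8; 4];

/// Tries up to `max_tries` candidate IDs. `next_id` stands in for the random
/// generator and runs while the lock is NOT held.
fn generate_unique_request_id(
    inflight: &Mutex<HashSet<RequestId>>,
    mut next_id: impl FnMut() -> RequestId,
    max_tries: u32,
) -> Option<RequestId> {
    for _attempt in 0..max_tries {
        let candidate = next_id(); // generated outside the lock
        // `HashSet::insert` returns false if the value was already present,
        // so check and insert are one operation under one short lock.
        let inserted = inflight.lock().unwrap().insert(candidate);
        if inserted {
            return Some(candidate);
        }
    }
    None
}
```

The guard is a temporary that is dropped at the end of the `let inserted` statement, so the lock is released before each retry instead of being held across the whole loop.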

📝 Code Quality

Good separation of concerns - ID generation isolated to tunnel/id.rs module
Proper error handling - Using anyhow::Context for error propagation
Following conventions - Lowercase log messages, structured logging per CLAUDE.md
Type safety - Type aliases for GatewayId, RequestId, etc. improve readability

✅ Recommendations

Critical:

✅ Fix duplicate in-flight acquisition in CustomServe HTTP path
✅ Fix duplicate in-flight acquisition in CustomServe WebSocket path
✅ Ensure request IDs are released on all error paths
✅ Fix whitespace/indentation issue at proxy_service.rs:641

Important:
5. Add tests for message ID serialization edge cases
6. Add tests for wrapping arithmetic in wrapping_gt
7. Add metrics for request ID collision frequency
8. Consider optimizing lock contention in generate_unique_request_id()

Nice to have:
9. Document the message ID format in code comments
10. Add rustdoc examples for ID parsing/building functions

📊 Summary

Code Quality: 8/10
Test Coverage: 6/10 (needs tests for new ID system)
Security: 9/10
Performance: 8/10
Overall: ✅ Approve with requested changes

The core architecture is sound, but the duplicate in-flight acquisitions need to be fixed before merge. Once addressed, this will be a solid improvement to the tunnel protocol.

claude · 2025-11-20T03:15:12Z

PR Review: Message ID Format & Deprecated Tunnel Ack

Summary

This PR introduces a new compact message ID format and deprecates the TunnelAck mechanism. The changes are well-structured with good backward compatibility support through protocol versioning.

✅ Strengths

Clean Architecture: The new message ID format (gateway_id + request_id + message_index = 12 bytes) is well-designed and compact
Backward Compatibility: Excellent version conversion logic in versioned.rs maintains compatibility between v1/v2/v3
Type Safety: Strong typing with [u8; 4] and [u8; 12] arrays prevents accidental misuse
Consistent Implementation: Both Rust and TypeScript implementations follow the same patterns

🔍 Issues Found

1. Whitespace Formatting Issue (guard-core/src/proxy_service.rs:641)

// Generate unique request ID
        let request_id = Some(self.generate_unique_request_id().await?);

Issue: Mixed tabs and spaces - line uses spaces instead of hard tabs
Fix: Replace leading spaces with a hard tab to match project rustfmt.toml

2. Request ID Lifecycle Bug (guard-core/src/proxy_service.rs:1094-1173)

let request_id = match self.state.acquire_in_flight(client_ip, &actor_id, &req_headers).await? {
    Some(id) => id,
    None => { return Err(errors::RateLimit { /* ... */ }.build()); }
};
// ... later in the loop
self.state.release_in_flight(client_ip, &actor_id, request_id).await;

Issue: acquire_in_flight is called twice for the same request:

First at line 809-827 (before WebSocket/HTTP routing)
Second at line 1107-1126 (inside CustomServe handler)

This causes:

Request ID to be acquired twice
In-flight counter incremented twice
Potential request ID leak if the first acquisition succeeds but second fails

Fix: Remove the duplicate acquire_in_flight call at lines 1107-1126. The request ID is already acquired at line 809-827 and should be passed down to handle_request instead of being re-acquired.

3. Collision Handling Concerns (guard-core/src/proxy_service.rs:667-694)

const MAX_TRIES: u32 = 100;
for attempt in 0..MAX_TRIES {
    let request_id = generate_request_id();
    if !requests.contains(&request_id) {
        requests.insert(request_id);
        return Ok(request_id);
    }
    tracing::warn!(?request_id, attempt, "request id collision");
}

Issue: With 4 bytes (4 billion possibilities), collisions should be extremely rare. If hitting 100 attempts, there is likely a deeper issue (not cleaning up IDs, memory leak, etc.)

Suggestion:

Lower MAX_TRIES to 10-20 (if legitimate collisions, this is enough; if not, fail faster)
Add metrics to track collision rate
Log warning at first collision, error after multiple attempts

4. Message Index Type Mismatch (pegboard-gateway/src/shared_state.rs:39)

message_index: tunnel_id::MessageIndex,  // u16

But in v3.bare:

type MessageIndex u16

And in shared_state.rs:316:

wrapping_gt(msg.message_index, ack_index)

Issue: The types match (u16 in both places) and the wrapping comparison is correct for u16 indices, but the wrap-around behavior is not documented where the indices are declared or compared.

Suggestion: Add a comment explaining the wrapping behavior for message indices (e.g., "Indices wrap at u16::MAX")

5. Missing Error Context (pegboard/src/tunnel/id.rs:31)

serde_bare::to_writer(&mut cursor, &parts).context("failed to serialize message id parts")?;

Suggestion: Add the actual values to the error context for debugging:

.with_context(|| format!("failed to serialize message id parts: gateway_id={:?}, request_id={:?}, message_index={}", 
    gateway_id, request_id, message_index))?;

⚠️ Security Considerations

Request ID Predictability: Using rand::random::<u32>() is fine for collision avoidance but should not be relied on for cryptographic purposes (which is not the case here - good!)
DoS via ID Exhaustion: The inflight_requests HashSet could theoretically be filled by an attacker, but this is mitigated by:
- In-flight limits per actor
- Automatic cleanup when requests complete
- Still, consider adding a global limit on total inflight_requests size

🧪 Test Coverage

Missing: No unit tests found for:

generate_request_id() collision handling
build_message_id() / parse_message_id() round-trip
Request ID lifecycle (acquire → release)
Message index wrapping behavior

Recommendation: Add tests in:

engine/packages/pegboard/src/tunnel/id.rs (add #[cfg(test)] mod tests)
engine/packages/guard-core/src/proxy_service.rs (test request ID lifecycle)

📝 Code Quality

Logging: Good use of structured logging with tracing ✓
Error Handling: Proper use of custom errors and anyhow ✓
Documentation: Missing doc comments on public functions (e.g., generate_request_id, build_message_id)
Naming: Consistent and clear naming conventions ✓

🔄 Protocol Migration

The version migration logic is excellent:

v2 → v3: Extracts 12-byte message_id from v2 16-byte message_id
v3 → v2: Pads with zeros for backward compatibility
Deprecated TunnelAck properly handled
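
The padding and extraction described above can be sketched as follows. Which 12 of the 16 bytes v2 → v3 keeps is not stated in this review, so the leading-bytes choice here is an assumption, and the type and function names are illustrative.

```rust
type MessageIdV3 = [u8; 12];
type MessageIdV2 = [u8; 16];

/// v3 -> v2: copy the 12 bytes and leave the remaining 4 bytes zero.
fn v3_to_v2(id: MessageIdV3) -> MessageIdV2 {
    let mut out = [0u8; 16];
    out[..12].copy_from_slice(&id);
    out
}

/// v2 -> v3: keep the leading 12 bytes (assumed placement; the review does
/// not say which 12 of the 16 bytes are extracted).
fn v2_to_v3(id: MessageIdV2) -> MessageIdV3 {
    let mut out = [0u8; 12];
    out.copy_from_slice(&id[..12]);
    out
}
```

With this layout a v3 ID survives a v3 → v2 → v3 round trip unchanged. A v2 ID whose last 4 bytes are non-zero would not survive v2 → v3 → v2, which is worth keeping in mind for mixed-version rollouts.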

Note: Ensure deployment strategy accounts for mixed-version runners during rollout.

Recommendations

Critical (must fix before merge):

Fix duplicate acquire_in_flight call (Issue #2)
Fix whitespace formatting (Issue #1)

High Priority:
3. Add unit tests for request ID generation and lifecycle
4. Lower MAX_TRIES and add collision metrics

Nice to Have:
5. Add doc comments on public API functions
6. Add error context to serialization failures
7. Document message index wrapping behavior

Great work on the protocol evolution! The core design is solid, just needs a few fixes to prevent the request ID lifecycle bug. 🎉

claude · 2025-11-20T04:55:57Z

PR Review: Add new message ID format & deprecated tunnel ack

I've reviewed this PR and have the following feedback:

Code Quality & Architecture ✅

The refactoring to use a structured message ID format is well-designed:

Strengths:

Strong type safety: Using type aliases (GatewayId, RequestId, MessageIndex, MessageId) makes the code self-documenting and prevents mixing up byte arrays
Centralized ID generation: The pegboard/src/tunnel/id.rs module provides a clean API for ID management
Backwards compatibility: Protocol version conversion handlers properly map between v2 and v3 formats with appropriate padding

Potential Issues ⚠️

1. Whitespace formatting issue (guard-core/src/proxy_service.rs:641)

// Generate unique request ID
        let request_id = Some(self.generate_unique_request_id().await?);

This line has inconsistent indentation (leading spaces instead of a tab). Per CLAUDE.md, the codebase uses hard tabs. This should be:

		// Generate unique request ID
		let request_id = Some(self.generate_unique_request_id().await?);

2. Duplicate in-flight tracking logic (guard-core/src/proxy_service.rs)

In handle_http_request around line 1094, there's duplicate logic for acquiring in-flight limits and generating request IDs:

First acquisition happens in the outer handle_request function (line 812)
Second acquisition happens within the CustomServe branch (line 1100)

This could lead to:

Double-counting in-flight requests for custom serve routes
Request ID being generated twice
Potential race conditions

Recommendation: Remove the duplicate acquisition in the CustomServe branch since handle_request already handles this for all request types.

3. Message ID serialization size verification

In tunnel/id.rs:34-39, the code verifies exactly 12 bytes were written. While this is good defensive programming, consider:

The serialization uses serde_bare which is deterministic, but the check relies on implementation details
If the BARE schema changes, this could silently fail

Recommendation: Add a compile-time assertion or unit test to verify the serialized size matches expectations.

4. Collision handling strategy

In generate_unique_request_id (proxy_service.rs:667-694):

Uses 4-byte (u32) request IDs → 4 billion possibilities
With the birthday paradox, collisions become likely after ~65k concurrent requests
MAX_TRIES of 100 seems arbitrary

Questions:

What's the expected concurrent request load?
Should this use a larger ID space (8 bytes) or a different generation strategy?
Consider using a monotonic counter + random component for better distribution

5. Error handling in message ID parsing

In shared_state.rs:216-221, failed message ID parsing logs an error but continues processing:

let parts = match tunnel_id::parse_message_id(msg.message_id) {
    Ok(p) => p,
    Err(err) => {
        tracing::error!(?err, message_id=?msg.message_id, "failed to parse message id");
        continue;
    }
};

Recommendation: Consider metrics/alerts for parsing failures, as this could indicate version mismatch or corruption issues.

Security Considerations 🔒

1. Request ID predictability

Request and gateway IDs come from rand::random::<u32>():

rand::random() draws from the thread-local ThreadRng, which is a CSPRNG seeded from the OS, so switching to rand::thread_rng() would not change the randomness source
A 32-bit ID is still small enough to guess by brute force, so it should not be treated as a secret in security-sensitive paths

2. In-flight request tracking

The inflight_requests HashSet properly prevents ID collisions, but:

IDs are removed in release_in_flight which is spawned as a separate task
There's a small window where an ID could be reused before cleanup completes
Consider using epoch-based IDs or ensuring synchronous cleanup

Performance Considerations ⚡

1. Lock contention

Multiple mutex locks on inflight_requests (proxy_service.rs:669, 662):

Generation holds lock during random generation loop
Release holds lock for removal
Consider using a lock-free data structure or finer-grained locking

2. Message index wrapping

Using u16 for message index (65k messages):

Wraps using wrapping_add(1) (shared_state.rs:150)
The wrapping_gt comparison logic (shared_state.rs:468) handles wraparound
Ensure tests cover wraparound scenarios

Test Coverage 🧪

Missing test scenarios:

Message index wraparound behavior
Request ID collision handling (though rare)
Protocol version conversion with edge cases (e.g., max message indices)
Concurrent request ID generation
Race conditions in in-flight tracking

Recommendation: Add integration tests for hibernation/reconnection with message index continuity.

Documentation 📝

Missing documentation:

The structured message ID format (gateway_id + request_id + message_index) should be documented at the module level
The BARE serialization format dependency should be noted
Migration guide for any clients using old UUID-based request IDs

Code Style 📐

Following CLAUDE.md conventions well overall, but note:

Logging follows lowercase convention: ✅
Structured logging used correctly: ✅
Imports at file top: ✅
Hard tabs: ⚠️ (one violation noted above)

Summary

This is a solid refactoring that improves type safety and reduces message size. The main concerns are:

Fix duplicate in-flight tracking logic
Fix whitespace/indentation issue
Consider request ID collision strategy for high-scale scenarios
Add tests for wraparound and edge cases

The protocol versioning and backwards compatibility handling is well done. The new message ID format is more efficient and maintainable than the previous UUID-based approach.

Recommendation: Approve after addressing the duplicate acquisition logic and indentation issues. Consider the collision handling and test coverage improvements as follow-up work.

claude · 2025-11-20T04:57:45Z

Code Review - PR 3492

Summary

This PR introduces a new message ID format for the pegboard-gateway tunnel protocol and adds support for deprecated tunnel acknowledgments. The changes involve significant refactoring of the message ID system from UUID-based identifiers to a structured format with gateway ID, request ID, and message index components.

Strengths

Well-structured message ID system - The new MessageIdParts struct provides better traceability
Proper error handling - Good use of anyhow::Context and ensure! macros
Backward compatibility - The inclusion of DeprecatedTunnelAck shows thoughtful deprecation
Collision detection - The generate_unique_request_id() includes retry logic with warnings
Consistent logging - Proper use of structured logging with tracing

Critical Issues

1. Potential Race Condition in Request ID Management (engine/packages/guard-core/src/proxy_service.rs:656-692)

The code acquires an in-flight slot and generates a request ID, but releases it in a spawned task. If the request fails before the spawned task runs, the request ID leaks from the inflight_requests HashSet.

Recommendation: Use RAII pattern or ensure release happens in all code paths. Consider using a guard struct that releases on drop.
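
A guard along these lines might look like the following sketch. InFlightGuard and its acquire method are hypothetical names; the PR's release_in_flight is async, which a Drop impl cannot await, so this version assumes a synchronous mutex around the ID set.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

type RequestId = [u8; 4];

/// Hypothetical guard: holds a request ID and removes it from the in-flight
/// set when dropped, including on early `return`, `?`, or panic unwinding.
struct InFlightGuard {
    inflight: Arc<Mutex<HashSet<RequestId>>>,
    request_id: RequestId,
}

impl InFlightGuard {
    /// Returns None if the ID is already in flight.
    fn acquire(inflight: Arc<Mutex<HashSet<RequestId>>>, request_id: RequestId) -> Option<Self> {
        let inserted = inflight.lock().unwrap().insert(request_id);
        inserted.then(|| Self { inflight, request_id })
    }
}

impl Drop for InFlightGuard {
    fn drop(&mut self) {
        // Tolerate a poisoned lock so cleanup still runs during unwinding.
        let mut set = self.inflight.lock().unwrap_or_else(|e| e.into_inner());
        set.remove(&self.request_id);
    }
}
```

Because release is tied to the guard's lifetime, there is no separate release call to forget on an error path. The guard would need to be moved into whatever task owns the request so that it lives exactly as long as the request does.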

2. Missing Tests

No tests found for the new tunnel::id module. Required coverage:

build_message_id() and parse_message_id() round-trip
ID serialization produces exactly 12 bytes
String conversion functions
Collision detection retry logic
wrapping_gt() function behavior with wraparound

3. Commented-out Code (shared_state.rs:472-474)

Malformed commented code should be removed or properly formatted.

4. Duplicate In-Flight Acquisition (proxy_service.rs:809, 1097-1113)

In handle_http_request, acquire_in_flight is called twice - at top level and within CustomServe handler. This could cause double-counting or double-release. Needs verification.

Other Concerns

5. Memory Leak Potential

If the release task panics, request ID remains in HashSet forever, leading to memory exhaustion.

6. Message Index Overflow (shared_state.rs:150)

No documentation about wraparound behavior when message index wraps from u16::MAX to 0.

7. Hard-coded Magic Numbers

Consider documenting or making configurable: HWS_MESSAGE_ACK_TIMEOUT (30s), HWS_MAX_PENDING_MSGS_SIZE_PER_REQ (1MB), MAX_TRIES (100).

Security Considerations

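ID Collision Resistance: 4-byte IDs have ~4.3B values. The birthday bound suggests collisions become likely once ~65k requests are in flight at the same time. Monitor collision rates.
rand::random() already uses rand::thread_rng(), a CSPRNG seeded from the OS, so switching calls gains nothing. If unpredictability is needed, note that a 32-bit ID can still be guessed by brute force.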

Performance Considerations

Arc<Mutex<HashSet>> locked on every request could bottleneck under high concurrency. Consider DashMap or sharding.
MAX_TRIES=100 retry loop could add latency. Add metrics for collision rates.

Style & Best Practices

Import organization follows CLAUDE.md correctly
Proper structured logging with lowercase messages
Correct use of anyhow errors and Result types

Recommendations

Must Fix:

Add comprehensive tests for tunnel::id module
Fix potential race condition in request ID release
Remove malformed commented code
Review duplicate in-flight acquisition logic

Should Fix:

Add RAII guard for request ID cleanup
Document message index wraparound behavior
Add metrics for ID collision rates
Consider more scalable data structure than Mutex

Overall: Solid refactoring improving message ID structure, but resource cleanup and test coverage issues need addressing before merge. Request changes.

claude · 2025-11-20T19:44:17Z

Code Review: PR #3492 - Add new message ID format & deprecated tunnel ack

Overview

This PR refactors the tunnel message ID system from UUID-based identifiers to a compact, structured format. The new format embeds gateway ID, request ID, and message index into a 12-byte message ID, improving efficiency and eliminating redundant fields.

Strengths

1. Architecture & Design

Clean abstraction: New tunnel::id module provides well-defined type aliases and utilities
Backwards compatibility: Excellent handling via DeprecatedTunnelAck and compat::version_needs_tunnel_ack()
Protocol versioning: Proper v2 to v3 conversion with clear mapping logic

2. Code Quality

Type safety: Uses [u8; N] arrays instead of Vec where appropriate
Clear separation: Message ID construction/parsing isolated in dedicated module
Consistent error handling: Good use of anyhow::Context for error messages

3. Performance Improvements

Reduced message size by eliminating redundant gateway_id and request_id fields from ToClientTunnelMessage
More efficient collision detection with HashSet
Removed unnecessary pending_msgs tracking (replaced with message index counter)

claude · 2025-11-20T19:44:25Z

Issues & Recommendations

CRITICAL: Double In-Flight Acquisition Bug

Location: engine/packages/guard-core/src/proxy_service.rs:809 and :1100

There is a double acquire bug for CustomServe routes. The code acquires in-flight slot twice:

First at line 809 in handle_request
Second at line 1100 inside handle_http_request for CustomServe routes

Impact:

In-flight limits incorrectly enforced (consuming 2 slots per request)
Request ID collision risk increases
Memory leak: First request_id is never released

Fix: The acquisition at line 809 should be removed, OR the second acquisition for CustomServe should be skipped. Acquisition should happen at the routing destination only.

MEDIUM: Request ID Collision Handling

Location: engine/packages/guard-core/src/proxy_service.rs:665-692

The collision retry logic holds a Mutex lock for up to 100 iterations. Issues:

Lock held during entire retry loop could cause contention
bail! after 100 attempts could be DoS vector under extreme load
No metrics to track collision frequency

Recommendations:

Consider DashMap for better concurrency
Add metric to track collision frequency
Consider warning after 10 attempts instead of 100

claude · 2025-11-20T19:44:40Z

Security Considerations

Request ID Generation: Uses rand::random, which draws from the thread-local ThreadRng, a CSPRNG seeded from the OS via getrandom. Good.
DoS via Collisions: 100-retry limit prevents infinite loops, but held locks could cause thread starvation. Consider rate limiting.
Hibernation State: gateway_id properly isolates state between gateways. Good.

Test Coverage

Issue: No new tests detected in the PR.

Recommendations:

Add unit tests for tunnel::id module: build_message_id + parse_message_id round-trip, collision handling, base64 encoding edge cases
Add integration tests: Protocol v2 to v3 conversion, WebSocket message index during hibernation, in-flight request lifecycle
Add property tests: Message ID components round-trip, wrapping behavior of message_index (u16)

Performance Analysis

Positive:

Reduced wire protocol overhead (approximately 24 bytes per message)
Simplified hibernation state tracking
Removed timestamp-based GC for tunnel acks

Concerns:

Mutex contention on inflight_requests could become bottleneck under high concurrency
Double acquisition issue compounds this

claude · 2025-11-20T19:44:47Z

Summary Verdict

Overall Assessment: Good refactoring with excellent backwards compatibility, but contains a CRITICAL double-acquisition bug that must be fixed before merge.

Priority Fixes:

CRITICAL: Fix double acquire_in_flight for CustomServe routes (engine/packages/guard-core/src/proxy_service.rs:809 and :1100)
MEDIUM: Review lock contention for inflight_requests HashSet
LOW: Add test coverage for new message ID format
LOW: Fix whitespace formatting at line 638

Code Style: Mostly follows conventions - uses hard tabs, proper structured logging with tracing, good error handling with anyhow::Context.

Recommendation: Request changes to fix the double-acquisition bug, then approve after verification.

Great work on the protocol migration and backwards compatibility! The structured message ID format is a solid improvement.

claude · 2025-11-20T19:45:33Z

Code Review: Message ID Format & Deprecated Tunnel Ack

I've completed a thorough review of this PR. Overall, the implementation is solid with good backward compatibility handling. Here are my findings:

✅ Strengths

Protocol Evolution - Excellent backward compatibility with v2 → v3 conversion logic that properly handles the migration from 16-byte to 12-byte message IDs
Message ID Design - The new compact format (4-byte gateway ID + 4-byte request ID + message index, 12 bytes total) is more efficient than the previous 16-byte format
Collision Handling - Good safeguard with retry logic for request ID generation (max 100 attempts)
Deprecation Strategy - Proper use of DeprecatedTunnelAck maintains compatibility while signaling the obsolescence
Error Handling - Following the project's custom error patterns correctly with proper anyhow::Result usage

🔍 Issues & Concerns

High Priority

1. Resource Leak in In-Flight Request Management (proxy_service.rs:809-894)

The acquire_in_flight method is called twice for the same request in the CustomServe path:

First at line 809 when entering handle_request
Second at line 1099 inside the CustomServe handler

This causes:

Double acquisition of in-flight slots (potential deadlock/resource exhaustion)
Request ID inserted twice into the HashSet (second insert returns false)

Fix: Remove the duplicate acquire_in_flight call inside the CustomServe handler since it's already acquired before route resolution.

2. Missing gateway_id Field in Hibernating Request Operations

In pegboard/src/ops/actor/hibernating_request/delete.rs, the operation still needs to be updated to include gateway_id in the input (similar to the upsert operation). The key structure uses both gateway_id and request_id, so the delete operation should match.

3. Potential Integer Overflow in Message Index (shared_state.rs:130)

req.message_index = req.message_index.wrapping_add(1);

While wrapping_add is used correctly, there's no documentation or handling for when the u16 wraps around. With high-throughput WebSocket connections sending >65k messages, this could cause:

Acknowledgment confusion
Message ordering issues

Recommendation: If the wrap-around is intentional, add a comment explaining why it is safe, for example because the acknowledgment window is far smaller than 65,536 messages.
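
As an illustration of why a bounded window can make wrap-around safe, the sketch below compares indices by their forward distance modulo 65,536. The function names and the window parameter are hypothetical; the PR may handle ordering differently.

```rust
/// Advances a per-request message index the same way the reviewed line does.
pub fn next_index(current: u16) -> u16 {
    current.wrapping_add(1)
}

/// Distance from `from` forward to `to` in modulo-65536 index space.
pub fn forward_distance(from: u16, to: u16) -> u16 {
    to.wrapping_sub(from)
}

/// Hypothetical check: treats `candidate` as newer than `last_acked` only if
/// it lies at most `window` steps ahead. This stays correct across the wrap
/// as long as the number of unacknowledged messages never exceeds `window`.
pub fn is_within_window(last_acked: u16, candidate: u16, window: u16) -> bool {
    let distance = forward_distance(last_acked, candidate);
    distance != 0 && distance <= window
}
```

Under this scheme index 1 counts as three steps ahead of index 65534, while index 65534 is not treated as ahead of index 1 for any small window.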

Medium Priority

4. Mutex Contention on inflight_requests (proxy_service.rs:669)

let mut requests = self.inflight_requests.lock().await;

The collision detection holds a global mutex for the entire retry loop. Under high concurrency, this could become a bottleneck. Consider using a lock-free data structure or reducing the critical section.

5. Hardcoded Retry Limit (proxy_service.rs:668)

const MAX_TRIES: u32 = 100;

This should be configurable, or at least carry a comment explaining why 100 is the right number. With 2^32 (about 4.3 billion) possible IDs, the birthday bound puts the chance of at least one collision among concurrent requests near 40% at sqrt(2^32) = 65,536 requests and at 50% around 77,000, a level of concurrency that is achievable in production.
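
The birthday estimate can be checked numerically. The sketch below uses the standard approximation p ≈ 1 - exp(-n(n-1)/(2N)) and assumes IDs are drawn uniformly at random from a 2^32 space; the function names are illustrative only.

```rust
/// Approximate probability that at least two of `n` uniformly random IDs
/// drawn from a space of `space` values collide (birthday approximation).
pub fn collision_probability(n: f64, space: f64) -> f64 {
    1.0 - (-(n * (n - 1.0)) / (2.0 * space)).exp()
}

/// Probability that one freshly drawn ID matches one of `in_flight`
/// distinct IDs already in use.
pub fn single_draw_collision(in_flight: f64, space: f64) -> f64 {
    in_flight / space
}
```

Under the same uniformity assumption, a single fresh draw against 65,536 in-flight IDs collides with probability 2^-16, so the retry loop should almost always succeed on its first or second attempt. That figure is one input for deciding whether 100 is a sensible limit.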

6. Missing Validation in TypeScript Implementation

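tunnel-id.ts:27-31 validates the message index range, but the Rust implementation (tunnel/id.rs) has no equivalent runtime validation. If message_index is typed as u16 in Rust, the type already enforces the range and a check against u16::MAX would always pass. A runtime check is only useful where the index is converted from a wider integer (raw_index below is a hypothetical u32), for example:

ensure!(raw_index <= u16::MAX as u32, "message index out of range");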
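
A checked conversion expresses the same validation without a hand-written comparison. This is a sketch only: to_message_index and the u32 input are assumptions, and it returns a String error to stay free of external crates, where the project would presumably use anyhow.

```rust
use std::convert::TryFrom;

/// Converts a wider index (for example one decoded from another protocol
/// version) into the u16 used in the message ID, rejecting values that
/// do not fit.
pub fn to_message_index(raw: u32) -> Result<u16, String> {
    u16::try_from(raw).map_err(|_| format!("message index out of range: {raw}"))
}
```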

Low Priority

7. Inconsistent Logging

Some places use ?request_id (debug format) and others use request_id=?tunnel_id::request_id_to_string(&request_id). Standardize on one approach for consistency.

8. Test Coverage

I don't see test files for the new tunnel ID collision handling logic. Consider adding:

Unit tests for generate_unique_request_id with mock collision scenarios
Integration tests for message ID parsing/building round-trips
Tests for u16 message index wrap-around behavior
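
Collision scenarios are easiest to test if the candidate source can be injected. The sketch below is a hypothetical, synchronous variant of generate_unique_request_id that takes the generator as a closure; the real function in proxy_service.rs presumably draws from an RNG and locks an async mutex, so the signature here is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical 4-byte request ID, matching the width mentioned in the review.
pub type RequestId = [u8; 4];

/// Same limit as the constant quoted in the review.
pub const MAX_TRIES: u32 = 100;

/// Draws candidates from `next_candidate` until one is not already in
/// `in_flight`, inserts it, and returns it. Gives up after MAX_TRIES attempts.
pub fn generate_unique_request_id(
    in_flight: &mut HashSet<RequestId>,
    mut next_candidate: impl FnMut() -> RequestId,
) -> Option<RequestId> {
    for _ in 0..MAX_TRIES {
        let candidate = next_candidate();
        // HashSet::insert returns false when the value is already present.
        if in_flight.insert(candidate) {
            return Some(candidate);
        }
    }
    None
}
```

A test can then feed a fixed sequence that collides twice before yielding a fresh ID, or a generator that always collides, to confirm the loop stops after MAX_TRIES attempts.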

📊 Performance Considerations

Positive: Reducing the message ID from 16 to 12 bytes trims 25% off the ID itself; the saving on total bandwidth depends on payload size
Positive: Moving message index into the message ID eliminates per-message metadata
Concern: The HashSet lookup on every request could impact latency at high scale

🔐 Security Review

No security vulnerabilities identified. The random ID generation uses a cryptographically secure RNG, which is appropriate.

📝 Code Quality

✅ Follows Rivet error handling conventions
✅ Proper use of workspace dependencies
✅ Structured logging with tracing
✅ Good separation of concerns (tunnel ID logic in dedicated module)
⚠️ Missing inline documentation for complex protocol conversion logic

Recommendations

Fix the double acquisition bug in proxy_service.rs
Add the missing gateway_id to the delete operation
Add test coverage for collision handling
Consider making MAX_TRIES configurable
Document the message index wrap-around behavior
Add inline comments to the protocol version conversion functions explaining the byte layout transformations

Verdict

✅ Approve with changes requested - The core implementation is solid, but the double acquisition bug needs to be fixed before merging as it could cause resource leaks in production.

NathanFlurry mentioned this pull request Nov 19, 2025

chore(pegboard-gateway): remove TunnelAck #3488

Open

NathanFlurry mentioned this pull request Nov 19, 2025

chore(engine-runner): remove echo close on ToClientWebSocketClose #3487

Open

NathanFlurry mentioned this pull request Nov 19, 2025

fix(tunnel): implement ping pong #3491

Open

vercel bot had a problem deploying to Preview – rivetkit-serverless November 19, 2025 21:23 Failure

NathanFlurry force-pushed the 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack branch from c020fe1 to e47241f on November 19, 2025 21:23

vercel bot had a problem deploying to Preview – rivetkit-serverless November 19, 2025 21:24 Failure

MasterPtato requested changes Nov 19, 2025

MasterPtato force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from 4adef6f to 759df3e on November 19, 2025 21:54

MasterPtato force-pushed the 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack branch from e47241f to 6521b98 on November 19, 2025 21:54

vercel bot had a problem deploying to Preview – rivetkit-serverless November 19, 2025 21:54 Failure

NathanFlurry force-pushed the 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack branch from 6521b98 to d6a2e79 on November 19, 2025 23:25

NathanFlurry force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from 759df3e to 4adef6f on November 19, 2025 23:25

vercel bot deployed to Preview – rivetkit-serverless November 19, 2025 23:28

MasterPtato force-pushed the 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack branch from d6a2e79 to adeb05d on November 20, 2025 03:07

MasterPtato force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from 4adef6f to 759df3e on November 20, 2025 03:07

vercel bot deployed to Preview – rivetkit-serverless November 20, 2025 03:08

NathanFlurry force-pushed the 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack branch from adeb05d to d6a2e79 on November 20, 2025 04:51

NathanFlurry force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from 759df3e to 4adef6f on November 20, 2025 04:51

NathanFlurry mentioned this pull request Nov 20, 2025

chore(rivetkit): fix hibernation implementation #3494

Open

chore(pegboard-gateway): add new message id format & add deprecated tunnel ack (9d2a169)

MasterPtato force-pushed the 11-19-chore_pegboard-gateway_add_new_message_id_format_add_deprecated_tunnel_ack branch from d6a2e79 to 9d2a169 on November 20, 2025 19:38

MasterPtato force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from 4adef6f to bdf8c0e on November 20, 2025 19:38

vercel bot had a problem deploying to Preview – rivetkit-serverless November 20, 2025 19:38 Failure

MasterPtato mentioned this pull request Nov 20, 2025

fix(pb): stop actors from destroying on goingaway #3501

Draft

NathanFlurry commented Nov 19, 2025

vercel bot commented Nov 19, 2025 • edited

NathanFlurry commented Nov 19, 2025 • edited

How to use the Graphite Merge Queue

claude bot commented Nov 19, 2025

Code Review

✅ Strengths

⚠️ Issues & Concerns

1. Resource Leak: Request ID Not Released ⚠️ CRITICAL

2. Missing Cleanup for HTTP Custom Serve

3. Potential Panic in TupleUnpack

4. Serialization Size Verification

5. Wrapping Arithmetic Documentation

6. Missing Test Coverage

📝 Minor Observations

🎯 Recommendations

Summary

claude bot commented Nov 19, 2025

Code Review

Positive Aspects

Issues & Concerns

🔴 Critical: Request ID Not Released After Completion

🟡 Medium: Missing Request ID Release in HTTP Path

🟡 Medium: Serialization Size Not Verified at Compile Time

🟡 Medium: Wrapping Arithmetic Without Overflow Documentation

🟢 Minor: Inconsistent Error Message Formatting

🟢 Minor: Missing Documentation on MessageId Format

Performance Considerations

Security Considerations

Test Coverage

Summary

MasterPtato Nov 19, 2025

MasterPtato Nov 19, 2025

MasterPtato Nov 19, 2025

claude bot commented Nov 19, 2025

PR Review: New Message ID Format & Deprecated Tunnel Ack

Strengths

Critical Issues

Moderate Issues

Performance & Security

Summary

claude bot commented Nov 19, 2025

PR Review: New Message ID Format & Deprecated Tunnel Ack

✅ Strengths

⚠️ Issues & Suggestions

📝 Minor Code Quality Notes

🧪 Testing Recommendations

Summary

claude bot commented Nov 19, 2025

PR Review: Add New Message ID Format & Deprecated Tunnel Ack

Overview

Code Quality & Best Practices ✅

1. Inconsistent Whitespace (engine/packages/guard-core/src/proxy_service.rs:641)

2. Missing Import Consolidation (engine/packages/guard-core/src/proxy_service.rs:16)

Potential Bugs & Issues ⚠️

1. Critical: Resource Leak in Error Paths (engine/packages/guard-core/src/proxy_service.rs:1094-1173)

2. Duplicate acquire_in_flight Call (engine/packages/guard-core/src/proxy_service.rs:809 & 1100)

3. Collision Handling May Cause Performance Issues (engine/packages/pegboard/src/tunnel/id.rs:67-69 & engine/packages/guard-core/src/proxy_service.rs:667-694)

4. WebSocket Resource Leak in Error Path (engine/packages/guard-core/src/proxy_service.rs:1920-2145)

Performance Considerations 🔍

1. Serialization on Hot Path (engine/packages/pegboard/src/tunnel/id.rs:28-31)

1. Whitespace formatting issue (`proxy_service.rs:641`)

2. Duplicate in-flight acquisition in CustomServe path (`proxy_service.rs:1099-1115`)

3. WebSocket double-acquisition pattern (`proxy_service.rs:1922-1937`)

4. Request ID not released on early errors (`proxy_service.rs:1094+`)

6. Hard-coded retry limit (`proxy_service.rs:668`)

7. Serialization size verification (`tunnel/id.rs:34-39`)